Scalable Infrastructure & Distributed Systems: Core Concepts for System Design Interviews

As systems scale to serve millions or even billions of users, a single-server architecture no longer suffices. That’s where scalable infrastructure and distributed systems come into play, and it’s why top tech companies routinely test these topics in senior-level system design interviews.

This guide walks through the key distributed systems patterns and how they apply to real-world design problems.


🧱 Core Concepts of Scalable Infrastructure

🔹 1. Sharding and Partitioning

Sharding = splitting large datasets across multiple nodes to scale reads and writes.

Types of sharding:

  • Hash-based (e.g., hash(user_id) % N)
  • Range-based (e.g., user_id 0–1000 → shard A)
  • Geo-based (e.g., users in US vs. EU)
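
The three strategies above can be sketched as small routing functions. This is a minimal illustration, not a production router: the shard count, range boundaries, shard names, and region map are all made-up values, and a stable hash (SHA-256) stands in for whatever hash function a real system would use.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # assumed shard count, for illustration only

def hash_shard(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Hash-based: a stable hash of the key, modulo the shard count."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Range-based: each entry is the inclusive upper bound of one shard's key range.
RANGE_UPPER_BOUNDS = [1000, 2000, 3000]            # hypothetical boundaries
RANGE_SHARDS = ["shard-A", "shard-B", "shard-C"]   # hypothetical shard names

def range_shard(user_id: int) -> str:
    """Range-based: find the first range whose upper bound covers the key."""
    idx = bisect.bisect_left(RANGE_UPPER_BOUNDS, user_id)
    if idx == len(RANGE_SHARDS):
        raise KeyError(f"user_id {user_id} is outside every configured range")
    return RANGE_SHARDS[idx]

GEO_SHARDS = {"US": "shard-us", "EU": "shard-eu"}  # hypothetical region map

def geo_shard(region: str) -> str:
    """Geo-based: route by the user's region."""
    return GEO_SHARDS[region]

print(range_shard(500))    # shard-A
print(range_shard(1001))   # shard-B
print(geo_shard("EU"))     # shard-eu
```

Hash-based routing spreads keys evenly but scatters adjacent IDs, while range-based routing keeps neighbors together at the cost of possible hotspots.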

Challenges:

  • Hotspots & unbalanced traffic
  • Rebalancing and resharding
  • Cross-shard queries and joins
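
To make the resharding challenge concrete, the sketch below counts how many keys change shards when a fifth node is added. It compares plain modulo hashing with a consistent hash ring, which is one common mitigation and not the only one. The `HashRing` class, node names, and virtual-node count are illustrative assumptions.

```python
import bisect
import hashlib

def stable_hash(key: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

def modulo_shard(key: str, num_shards: int) -> int:
    return stable_hash(key) % num_shards

class HashRing:
    """Consistent hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=100):
        # Each node owns `vnodes` points on the ring to smooth out the load.
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        # A key belongs to the first ring point clockwise from its hash.
        idx = bisect.bisect_right(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]

keys = [f"user-{i}" for i in range(10_000)]

# Going from 4 to 5 shards with modulo hashing remaps most keys.
moved_modulo = sum(modulo_shard(k, 4) != modulo_shard(k, 5) for k in keys)

# With a ring, only keys claimed by the new node move.
before = HashRing(["a", "b", "c", "d"])
after = HashRing(["a", "b", "c", "d", "e"])
moved_ring = sum(before.lookup(k) != after.lookup(k) for k in keys)

print(f"modulo: {moved_modulo / len(keys):.0%} of keys moved")
print(f"ring:   {moved_ring / len(keys):.0%} of keys moved")
```

With modulo hashing, a key stays put only when its hash gives the same remainder for 4 and 5, so roughly four out of five keys are expected to move. The ring should move only about the new node's share.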

📌 Common Interview Example: “How would you shard a user database for 1 billion users?”


🔹 2. Replication

Replication keeps copies of the same data on multiple nodes, which provides fault tolerance and improves read throughput.

Types of replication:

  • Master-Slave (Leader-Follower): writes go to the leader; reads can be served by followers
  • Multi-Master: all nodes can accept writes (conflict resolution needed)
  • Leaderless: any replica can accept reads and writes; used in Dynamo-style systems like Cassandra and Riak

Trade-offs:

  • Data consistency vs. availability
  • Lag in replicas (eventual consistency)
  • Conflict handling (last write wins, vector clocks)
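
The two conflict-handling approaches named above behave quite differently, which a short sketch can show. It assumes a vector clock is a plain dict mapping node IDs to counters and that a write is a `(timestamp, value)` tuple. Both representations are illustrative choices, not any particular database's format.

```python
from typing import Dict, Tuple

VectorClock = Dict[str, int]  # node id -> number of events seen from that node

def compare(a: VectorClock, b: VectorClock) -> str:
    """Return how clock `a` relates to clock `b`."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b, so b can safely replace a
    if b_le_a:
        return "after"       # b happened before a
    return "concurrent"      # neither saw the other: a real conflict

def last_write_wins(a: Tuple[int, str], b: Tuple[int, str]) -> Tuple[int, str]:
    """Keep the write with the higher timestamp; the other is silently dropped."""
    return max(a, b, key=lambda write: write[0])

print(compare({"n1": 2, "n2": 1}, {"n1": 2, "n2": 2}))  # before
print(compare({"n1": 3}, {"n2": 1}))                    # concurrent
print(last_write_wins((10, "draft"), (12, "final")))    # (12, 'final')
```

Last write wins is simple but can discard a concurrent update, and it depends on clocks being reasonably synchronized. Vector clocks detect the conflict and leave the resolution to the application.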

📌 Common Interview Example: “Design a messaging app that doesn't lose messages if a node crashes.”


🔹 3. Distributed Queues

Message queues decouple services and help with asynchronous processing.

Popular tools: Kafka, RabbitMQ, AWS SQS

Design considerations:

  • At-least-once vs. exactly-once semantics
  • Message ordering and partitioning
  • Dead-letter queues and retries
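
A rough sketch of how retries, a dead-letter queue, and at-least-once delivery fit together is shown below. It uses an in-memory `deque` rather than a real broker, so nothing here is the Kafka, RabbitMQ, or SQS API. The message shape, the `MAX_ATTEMPTS` budget, and the `consume` helper are assumptions made for illustration.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry budget before a message is dead-lettered

def consume(queue, handler, dead_letters, processed_ids):
    """Drain the queue with retries, a dead-letter list, and duplicate detection."""
    while queue:
        msg = queue.popleft()
        if msg["id"] in processed_ids:
            continue  # at-least-once delivery can redeliver; skip duplicates
        try:
            handler(msg)
            processed_ids.add(msg["id"])
        except Exception:
            msg["attempts"] = msg.get("attempts", 0) + 1
            if msg["attempts"] >= MAX_ATTEMPTS:
                dead_letters.append(msg)   # give up and park it for inspection
            else:
                queue.append(msg)          # retry later, behind other messages

handled = []

def handler(msg):
    if msg["body"] == "bad":
        raise ValueError("cannot process message")
    handled.append(msg["id"])

queue = deque([
    {"id": 1, "body": "ok"},
    {"id": 2, "body": "bad"},
    {"id": 1, "body": "ok"},   # duplicate delivery of message 1
])
dead_letters, processed_ids = [], set()
consume(queue, handler, dead_letters, processed_ids)

print(handled)                           # [1]
print([m["id"] for m in dead_letters])   # [2]
```

Tracking processed IDs makes the consumer idempotent, which is a common way to get effectively-once processing on top of at-least-once delivery. Note that re-queuing a failed message at the back gives up strict ordering.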

📌 Common Use Cases: Email delivery, video processing, log ingestion, metrics pipelines


🔹 4. Consensus & Coordination

In distributed systems, you need coordination to:

  • Elect leaders
  • Agree on configuration changes
  • Maintain consistency across replicas

Common algorithms and tools:

  • Raft: simpler to reason about than Paxos
  • Paxos: foundational but complex
  • ZooKeeper / etcd: production-ready coordination services built on consensus protocols (ZAB and Raft, respectively)
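
The core voting rule behind Raft-style leader election can be sketched in a few lines: each node grants at most one vote per term, and a candidate wins only with a strict majority. This is a deliberately simplified model. It omits Raft's log up-to-date check, timeouts, and networking, and the `Node` and `run_election` names are hypothetical.

```python
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.current_term = 0
        self.voted_for = None

    def request_vote(self, term, candidate_id):
        """Grant a vote at most once per term; reject stale terms."""
        if term < self.current_term:
            return False
        if term > self.current_term:
            # A newer term resets this node's vote.
            self.current_term = term
            self.voted_for = None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False

def run_election(candidate, peers):
    """The candidate starts a new term, votes for itself, and asks every peer."""
    candidate.current_term += 1
    candidate.voted_for = candidate.node_id
    term = candidate.current_term
    votes = 1 + sum(p.request_vote(term, candidate.node_id) for p in peers)
    cluster_size = len(peers) + 1
    return votes > cluster_size // 2   # strict majority required

nodes = [Node(f"n{i}") for i in range(1, 6)]
n1, n2, n3 = nodes[0], nodes[1], nodes[2]

print(run_election(n1, nodes[1:]))   # True: all four peers grant their vote
print(n3.request_vote(1, "n2"))      # False: n3 already voted for n1 in term 1
```

Because two majorities of the same cluster always overlap, and each node votes only once per term, at most one leader can win any given term.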

📌 Example: “How do microservices agree on a leader for writing to a shared DB?”


🔹 5. Eventual Consistency & Conflict Resolution

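In high-availability systems, eventual consistency is often preferred over strong consistency: replicas accept reads and writes without waiting on each other, and converge once updates propagate.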

Techniques:

  • Vector clocks
  • CRDTs (Conflict-Free Replicated Data Types)
  • Read repair & hinted handoff (used in Cassandra)
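
As one concrete instance of the CRDT idea above, here is a minimal grow-only counter (G-Counter). Each replica increments only its own slot, and merging takes the per-replica maximum, so replicas converge regardless of the order in which merges happen. The class and replica names are illustrative.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merged with max()."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}   # replica id -> increments observed from that replica

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other):
        # Per-slot max is commutative, associative, and idempotent,
        # which is what lets replicas converge without coordination.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    @property
    def value(self):
        return sum(self.counts.values())

a, b = GCounter("replica-a"), GCounter("replica-b")
a.increment(2)      # two updates land on replica A
b.increment(3)      # three concurrent updates land on replica B

a.merge(b)
b.merge(a)
a.merge(b)          # merging again changes nothing

print(a.value, b.value)   # 5 5
```

Unlike last write wins, no update is lost here. The trade-off is that the data type has to be designed so that merges always make sense.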

📌 Example: “Design Dropbox. How would you resolve conflicting file updates?”


🛠️ Supporting Infrastructure

Load Balancing

  • DNS round robin, L4 (TCP), or L7 (HTTP) load balancing
  • Global load balancers (e.g., for multi-region deployments)
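
The simplest policy a load balancer can apply is round robin over healthy backends, sketched below. The `RoundRobinBalancer` class and backend names are hypothetical, and in a real deployment health checks, not manual calls, would drive `mark_down` and `mark_up`.

```python
import itertools

class RoundRobinBalancer:
    """Cycle through backends in order, skipping any marked unhealthy."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._cycle = itertools.cycle(self._backends)

    def mark_down(self, backend):
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call.
        for _ in range(len(self._backends)):
            backend = next(self._cycle)
            if backend in self._healthy:
                return backend
        raise RuntimeError("no healthy backends available")

lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
print([lb.pick() for _ in range(4)])   # ['app-1', 'app-2', 'app-3', 'app-1']

lb.mark_down("app-2")
print([lb.pick() for _ in range(3)])   # ['app-3', 'app-1', 'app-3']
```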

Service Discovery

  • Helps services locate each other dynamically
  • Tools: Consul, Eureka, Kubernetes DNS

Monitoring & Logging

  • Logging: ELK Stack, Fluentd
  • Metrics: Prometheus, Grafana
  • Tracing: Jaeger, OpenTelemetry

🧠 How These Appear in Interviews

| Topic | Sample Interview Question |
|---|---|
| Sharding | “Design a scalable database for user profiles” |
| Replication | “Ensure high availability in a payment system” |
| Queues | “Process millions of image uploads in near-real time” |
| Consensus | “Design a distributed lock system for microservices” |
| Eventual Consistency | “Design Amazon’s shopping cart” |

⚖️ Trade-Offs to Always Consider

| Dimension | Trade-off Options |
|---|---|
| Consistency | Strong vs. eventual |
| Latency | Synchronous vs. asynchronous processing |
| Availability | Single region vs. multi-region |
| Durability | In-memory vs. persistent storage |
| Fault tolerance | Single AZ vs. cross-region replication |

✅ Summary Table

| Concept | Why It Matters |
|---|---|
| Sharding | Enables horizontal scaling of storage |
| Replication | Improves fault tolerance and read speed |
| Queues | Decouple services and enable parallelism |
| Consensus | Enables coordination and system reliability |
| Eventual Consistency | Increases availability in distributed systems |