Design a System for Sorting Large Datasets - Distributed Sorting at Scale
Pankaj Tanwar (@the2ndfloorguy)
This is a classic distributed system design problem. It tests your understanding of distributed algorithms, fault tolerance and performance optimization.
At first look, it seems very straightforward. Divide the data, sort it locally and then merge. Sounds easy, right? Nope. The real fun begins when you dig deeper - handling node crashes, optimizing network calls, managing memory efficiently, and dodging performance bottlenecks that can slow everything down.
I won’t spoil the fun by giving everything away upfront. Let’s get started.
Problem Statement
You're given 1TB of unsorted 64-bit integers stored in a distributed database. You have access to a cluster of 1000 compute nodes, each having 1.5 GB of RAM. Design a system that can sort the entire dataset as efficiently as possible.
The problem is well defined, and quite detailed - which is a great start.
Clarifying Questions
Before you jump into the design, it's always worth pausing for clarifying questions. The answers can shape the entire direction of the solution.
| Question | Answer | Notes |
|---|---|---|
| What kind of integer data are we sorting? What's the size of each data element? | 64-bit integers | Each element is 8 bytes. Much simpler to work with. Thank the interviewer. |
| Is the data completely random? Any known distribution or pattern? | Completely random | Bucket or radix sort is out. Need comparison-based sorting. Good thing: even work distribution. |
| What do we do with the data once sorting is done? Store it somewhere or stream it? | Stream to a downstream processing system | We can implement a true streaming merge algorithm. |
| What's the performance expectation? What are we optimizing for? | Sort as fast as possible | No resource constraint. Parallelize aggressively. |
| Are all 1000 nodes in the same data center? Network latency? | Skip for now | Avoid chatty protocols just to be safe. |
| Read throughput on the source DB? | ~50 GB/s | With parallel reads, we can pull the full 1 TB in ~20 seconds. |
| Fault tolerance expectation? | Handle node failures gracefully. Tolerate up to 10% failures. | Adds complexity. Need failure detection and work redistribution. |
| Can we use the full 1.5 GB RAM? Any reservation for OS/processes? | Yes, full 1.5 GB available | Great. With 1 GB of data per node, we have ~0.5 GB of headroom for sorting logic. |
| Any preference for a specific sorting algorithm? | No | The sorting algorithm isn't the focus. Spend more time on the system design aspects. |
Let’s quickly wrap up everything we know so far -
- 1TB of randomly distributed 64-bit integers (roughly 125 billion elements)
- 1000 nodes with 1.5 GB RAM each - the full memory is available to us
- Minimize total sorting time
- Stream sorted data to downstream system
- Tolerate up to 10% node failures
- 50 GB/s read throughput
- Distributed coordination > sorting algorithm
Functional Requirements
- Even data distribution, 1GB per node
- Each node sorts data independently
- Combine sorted chunks into a single sorted stream
- Manage entire process with proper coordination
Non-functional Requirements
- Complete sorting in minimal time
- Handle node failures gracefully
- High availability
- Memory efficient, stay within 1.5 GB per node
Folks, we'll skip separate back-of-the-envelope calculations here - the important numbers (125 billion elements, 1 GB per node, ~20 seconds of parallel read time) already fell out of the clarifying questions above. Not worth repeating.
Core Entities
Let's talk about the core entities involved here. First, there's a controller - it's the boss that manages everything. It splits the big dataset into smaller chunks and assigns one to each worker node. Each worker sorts its chunk locally, in parallel. Then the controller merges all the sorted chunks into a single sorted stream using a k-way merge.
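Before digging into each entity, here's a rough sketch of that end-to-end flow in Python. Every call here (`split_into_ranges`, `load_and_sort`, `k_way_merge`, and so on) is a hypothetical API, used only to make the shape of the system concrete:

```python
def run_distributed_sort(controller, nodes):
    """End-to-end flow: split the source data, let every worker sort its
    chunk, then stream the k-way merge to the downstream system."""
    ranges = controller.split_into_ranges(parts=len(nodes))   # ~1 GB per worker
    for node, data_range in zip(nodes, ranges):
        node.load_and_sort(data_range)        # worker pulls its slice and sorts it locally
    controller.wait_until_all_ready(nodes)    # wait for every worker to report READY
    yield from controller.k_way_merge(nodes)  # single globally sorted stream
```

In a real system the `load_and_sort` calls would be fired asynchronously so all 1000 workers run in parallel; the sketch keeps them sequential for readability.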
Node
It's a compute unit. Each node loads its assigned portion of the data from the database and keeps it locally in a min-heap, which gives O(log n) extraction and supports efficient in-place sorting.
```
Node {
  id: string                  // unique worker identifier
  data: Array<int>            // locally loaded 64-bit integers (~1 GB)
  heap: MinHeap               // heap used for the local heapsort
  status: IDLE | LOADING | SORTING | READY | FAILED
  lastHeartbeat: timestamp    // last heartbeat the controller received from this node
  bufferSize: int             // size of the spare in-memory buffer (batching, redistributed slices)
}
```
Controller
This is our man. It orchestrates the entire distributed sorting process: the data assignment strategy, failure handling, and the k-way merge. The design will evolve as we dive deeper and tackle fault tolerance.
```
Controller {
  id: string
  isLeader: boolean                     // true for the elected leader (see Version 2)
  dataRanges: Map<nodeId, DataRange>    // which slice of the source each node owns
  tournamentTree: TournamentTree        // k-way merge structure over the workers' streams
  failedNodes: Set<string>              // workers declared dead via missed heartbeats
  sortedCount: long                     // number of values already streamed downstream
}

DataRange {
  start: int            // start offset of the assigned slice in the source
  end: int              // end offset of the assigned slice
  assignedNode: string
}
```
High Level Design
We will design this system in steps. We'll begin with a basic version – that just works. Then, we'll identify its limitations and improve upon it. This way, we will gradually evolve the design into a more robust solution.
This step-by-step, iterative approach shows that you can start simple, think critically about trade-offs, and improve the design. It earns you brownie points.
Version 1 - The Basic Setup
We start small. The controller divides the 1 TB into 1000 equal chunks, one per worker. Each worker pulls its part from the source and stores it locally in memory. Each node then uses heapsort, which is fast and space-efficient (O(1) extra space).
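A minimal sketch of that local step, in plain Python for illustration (a real worker would use a compiled, cache-friendly implementation). It assumes the node's ~125 million integers are already loaded into an in-memory array:

```python
def _sift_down(a, start, end):
    """Restore the max-heap property for the subtree rooted at `start`,
    looking only at elements a[start..end]."""
    root = start
    while 2 * root + 1 <= end:
        child = 2 * root + 1
        if child + 1 <= end and a[child] < a[child + 1]:
            child += 1                        # pick the larger child
        if a[root] < a[child]:
            a[root], a[child] = a[child], a[root]
            root = child
        else:
            return

def heapsort(a):
    """In-place heapsort: O(n log n) time, O(1) extra space."""
    n = len(a)
    for start in range(n // 2 - 1, -1, -1):   # build a max-heap bottom-up
        _sift_down(a, start, n - 1)
    for end in range(n - 1, 0, -1):           # move the max to the end, shrink the heap
        a[0], a[end] = a[end], a[0]
        _sift_down(a, 0, end - 1)
```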
Once all workers are done sorting, the controller starts the final merge. Here's where things get interesting - and a little fancy, too. It performs a k-way merge using a tournament tree: it pulls the smallest number from each node and inserts those candidates into the tree. This lets us find the global minimum quickly - that becomes the first number in the final sorted stream.
The controller streams the final result as it merges, so nothing needs to be stored.
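As a simplified stand-in for that merge loop (Python's `heapq.merge` uses a plain min-heap internally rather than a tournament tree, but the streaming behaviour is the same; `emit` is a hypothetical callback for the downstream consumer):

```python
import heapq

def stream_sorted(node_streams, emit):
    """Lazily k-way merge the workers' already-sorted streams and push each
    value downstream the moment it becomes the global minimum."""
    for value in heapq.merge(*node_streams):
        emit(value)
```

Nothing is ever fully materialized on the controller - it only ever holds one candidate value per worker.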
Good part
- Simple and straightforward approach
- Parallel processing
- Memory efficient, heap sort with O(1) extra space
- Failover via reassigning to existing nodes, no need to spin up new nodes
Problems
- Single controller, so single point of failure
- Weak fault tolerance when worker nodes fail
- A single controller handling 1000 connections during the merge
Design Decisions
- Why heap sort over quick sort? Heap sort guarantees O(1) extra space, while quicksort needs O(log n) stack space on average (and O(n) in the worst case)
- Why a fancy tournament tree instead of a simple min-heap? A tournament tree needs only log(k) comparisons per extracted element vs roughly 2*log(k) for the equivalent min-heap operations (a sketch follows below)
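For the curious, here's a minimal winner-tree sketch (one common way to implement a tournament tree). After each extraction only the path from the changed leaf back to the root is replayed, which is roughly log(k) comparisons:

```python
INF = float("inf")

class TournamentTree:
    """Winner tree over k sorted streams; ~log(k) comparisons per extraction."""

    def __init__(self, streams):
        self.iters = [iter(s) for s in streams]
        k = len(self.iters)
        size = 1
        while size < k:
            size *= 2
        self.size = size
        # Current head of each stream, padded with +inf sentinel leaves.
        self.heads = [next(it, INF) for it in self.iters] + [INF] * (size - k)
        # tree[1] is the root; tree[i] stores the leaf index winning that subtree.
        self.tree = [0] * (2 * size)
        for i in range(size):
            self.tree[size + i] = i
        for i in range(size - 1, 0, -1):
            l, r = self.tree[2 * i], self.tree[2 * i + 1]
            self.tree[i] = l if self.heads[l] <= self.heads[r] else r

    def pop(self):
        winner = self.tree[1]
        value = self.heads[winner]
        if value == INF:
            return None                       # every stream is exhausted
        # Advance the winning stream, then replay only its leaf-to-root path.
        self.heads[winner] = next(self.iters[winner], INF)
        node = (self.size + winner) // 2
        while node >= 1:
            l, r = self.tree[2 * node], self.tree[2 * node + 1]
            self.tree[node] = l if self.heads[l] <= self.heads[r] else r
            node //= 2
        return value
```

Usage is the same as the streaming merge above: keep calling pop() until it returns None.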
It's a good start. But the interviewer will ask: what happens if the controller node itself fails? And how do we handle the merge bottleneck?
Version 2 - Making the System Fail-Safe with a Controller Group
Now we get serious about reliability. Let's fix problems with the initial version one-by-one.
Instead of 1 controller, we have a 3-node controller group. They use leader election via the Raft consensus algorithm to pick who is in charge. If one dies, the others take over.
Single point of failure – solved.
It also handles node failures much better. While the workers are busy sorting, the controller keeps an eye on them: each node sends regular heartbeats. If a node stops responding, the controller reassigns its work to the others.
Trick - for example, if a worker dies, its 1 GB chunk is split into 1 MB slices and spread across the rest. Since each node has spare buffer space (roughly 0.5 GB of headroom), the extra slices fit in memory.
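A rough sketch of that detection-and-redistribution step. The 10-second timeout, the byte-offset (half-open) ranges, and the helper names are all assumptions for illustration:

```python
import time

HEARTBEAT_TIMEOUT_S = 10       # assumed timeout; tune for the real network
SLICE_BYTES = 1_000_000        # 1 MB slices, as described above

def detect_failures(nodes, now=None):
    """Return the ids of workers whose last heartbeat is older than the timeout."""
    now = time.time() if now is None else now
    return [nid for nid, n in nodes.items()
            if now - n.lastHeartbeat > HEARTBEAT_TIMEOUT_S]

def redistribute(failed_range, healthy_node_ids):
    """Split a dead worker's [start, end) range into 1 MB slices and hand them
    out round-robin, so each survivor absorbs only a tiny extra share."""
    assignments, start, i = [], failed_range.start, 0
    while start < failed_range.end:
        end = min(start + SLICE_BYTES, failed_range.end)
        assignments.append((healthy_node_ids[i % len(healthy_node_ids)], (start, end)))
        start, i = end, i + 1
    return assignments
```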
For the merge phase bottleneck, we add buffering. Instead of pulling one value at a time from each worker, the controller pulls in batches of 100.
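A minimal sketch of that batching on the controller side, assuming each worker exposes a hypothetical next_batch RPC that returns its next `batch_size` sorted values (empty when exhausted):

```python
def batched_stream(node_client, batch_size=100):
    """Yield one worker's sorted values one by one, while fetching them over
    the network in batches so the merge doesn't pay a round trip per value."""
    while True:
        batch = node_client.next_batch(batch_size)   # hypothetical RPC
        if not batch:
            return
        yield from batch
```

The tournament tree (or heapq.merge) then consumes one batched_stream per worker exactly as before.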
We can also add a message queue layer like Kafka or RabbitMQ between the controller and nodes. It handles heartbeats, data requests, retries, etc. And thanks to built-in durability, we don't lose messages due to network hiccups.
Good Part
- Achieves fault tolerance via leader election
- Handles node failures gracefully
- Reduced network overhead via batching (far fewer round trips per value)
- Messages don't get lost, thanks to the durable queue
Problems
- Slightly higher memory usage due to buffering on every node
- Still single merge node (leader)
- Increased network complexity
It still has some issues, but it solves the critical reliability problems.
O((n/k) log(n/k)) per node for the local sort plus O(n log k) for the k-way merge - O(n log n) total work, with the local sorting fully parallelized across the cluster.
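Counting comparisons makes it clear why the merge doesn't change the asymptotics (n elements in total, k nodes):

$$
\underbrace{k \cdot \tfrac{n}{k}\log\tfrac{n}{k}}_{\text{local sorts}} \;+\; \underbrace{n\log k}_{\text{k-way merge}} \;=\; n\left(\log\tfrac{n}{k} + \log k\right) \;=\; n\log n
$$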
Interviewer Follow-ups & Deep Dive Questions
At this stage, the interviewer will try to check your understanding of distributed computing topics. Some of the follow-up questions might look like this -
How will you handle 50% of the nodes failing?
- Bad answer - Add more nodes
- Good answer - We will need to degrade gracefully. Reduce the data scope or implement disk-based sorting
- Best answer - The system hits a hard limit. I would implement graceful degradation: checkpoint progress and pause, spin up new nodes to replace the failed ones, and if that's not possible, switch to an external merge sort using disk and accept the performance hit (a minimal sketch of that fallback follows below).
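A minimal sketch of that external-merge-sort fallback. `read_chunks` is a hypothetical generator that yields lists of integers sized to whatever still fits in RAM; the runs are spilled as plain text purely for readability:

```python
import heapq, os, tempfile

def external_sort(read_chunks):
    """Sort chunk-by-chunk in memory, spill each sorted run to disk,
    then lazily k-way merge the runs into one sorted stream."""
    run_paths = []
    for chunk in read_chunks():                  # hypothetical chunked reader
        if not chunk:
            continue
        chunk.sort()
        f = tempfile.NamedTemporaryFile("w", suffix=".run", delete=False)
        f.write("\n".join(map(str, chunk)) + "\n")
        f.close()
        run_paths.append(f.name)

    def run_iter(path):
        with open(path) as fh:
            for line in fh:
                yield int(line)

    yield from heapq.merge(*(run_iter(p) for p in run_paths))
    for p in run_paths:
        os.remove(p)
```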
Can we use MapReduce for this?
- Bad answer - Yeah, sure
- Good answer - MapReduce could work, but it adds overhead with intermediate disk writes and doesn't leverage our RAM advantage
- Best answer - MapReduce is overkill. It's better suited to complex transformations, and this is a basic sorting problem. It would add unnecessary disk I/O, framework overhead for no reason, and less control over memory management.
How would you test this system before taking it to production?
- Bad answer - Run it with sample data
- Good answer - Unit tests for the algorithms, integration tests, and performance testing
- Best answer - Multi-phase testing: unit tests (tournament tree, heap operations), component tests, integration tests, performance testing, chaos engineering (random node failures, network and memory faults), load testing, etc. (a tiny example follows below)
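As one concrete example of that first layer, a small property-style unit test for the tournament-tree sketch from earlier - merge a handful of random sorted runs and compare against a trusted single-machine sort:

```python
import random

def test_tournament_tree_matches_sorted():
    # 13 sorted runs of random 64-bit integers
    runs = [sorted(random.getrandbits(64) for _ in range(200)) for _ in range(13)]
    expected = sorted(v for run in runs for v in run)

    tree = TournamentTree(runs)
    merged = []
    while (value := tree.pop()) is not None:
        merged.append(value)

    assert merged == expected
```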
Conclusion
On the surface, it's "just sort numbers". But underneath, it’s a full-on distributed systems problem. What started as a simple idea quickly turned into building a real system, dealing with failures, balancing load, and making tricky trade-offs along the way.
Now that I've wrapped up my ramblings, I invite your comments. If you find any technical inaccuracies, please let me know. I'm active on X (Twitter) as @the2ndfloorguy - and if you're interested in what an unfunny & strange programmer will do next, see you there!