LLD: Deduplicate a Huge Dataset
Summary
I was asked to design a system to deduplicate a large dataset that cannot fit into memory, using disk-based processing and hashing techniques.
Full Experience
This was asked for an SDE-2/3 role at a non-FAANG US company (a data-related org); the role is based in Bangalore (BLR).
Time : 90 mins
Question :
The task is to de-duplicate a dataset stored on disk. The dataset is too large to load into memory (RAM) for processing.
Input :
You have a file which contains an ID on each line.
Every ID can appear one or more times.
Example :
1
2
1
3
4
2
Output :
A new file which contains an ID on each line, where every ID is unique.
There is no need to preserve the original order of IDs.
4
1
3
2
You are not allowed to add more servers or more memory (RAM) to solve this!
Hint :
Only memory (RAM) is limited, not your disk (you can use as much disk space as you need).
Solution :
Basic Idea :
You need a hash-based approach (similar to consistent hashing), where IDs are hashed and mapped onto a hash space/ring.
You create multiple small files on disk, where each file is responsible for a part of the hash space and is small enough that any single file can be loaded into memory.
You read the input file as a stream, hash each ID to find its file on disk, load that file into memory (into some set data structure), append the ID to the file if it is not already present, then close the file and discard the set. A sketch of this idea is shown below.
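Here is a minimal Python sketch of the same idea, written as two passes (first partition the stream by hash, then deduplicate each partition with a set). The file names and the NUM_PARTITIONS value are illustrative assumptions, not part of the original question.

```python
import hashlib
import os

# Tunable assumption: pick enough partitions that any single partition
# comfortably fits into the available RAM.
NUM_PARTITIONS = 256


def partition_index(id_str: str) -> int:
    """Map an ID onto the hash space and return its partition index."""
    digest = hashlib.md5(id_str.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS


def deduplicate(input_path: str, output_path: str, tmp_dir: str = "partitions") -> None:
    os.makedirs(tmp_dir, exist_ok=True)
    part_paths = [os.path.join(tmp_dir, f"part_{i}.txt") for i in range(NUM_PARTITIONS)]

    # Pass 1: stream the input once and route every ID to its partition file.
    part_files = [open(p, "w") for p in part_paths]
    try:
        with open(input_path) as infile:
            for line in infile:
                id_str = line.strip()
                if id_str:
                    part_files[partition_index(id_str)].write(id_str + "\n")
    finally:
        for f in part_files:
            f.close()

    # Pass 2: each partition fits in memory, so load it into a set
    # (which drops duplicates) and append the unique IDs to the output.
    with open(output_path, "w") as outfile:
        for path in part_paths:
            with open(path) as part:
                unique_ids = {line.strip() for line in part if line.strip()}
            for id_str in unique_ids:
                outfile.write(id_str + "\n")
            os.remove(path)  # clean up the temporary partition file


if __name__ == "__main__":
    deduplicate("input.txt", "output.txt")
```

Because every occurrence of a given ID hashes to the same partition, duplicates can never be split across two partitions, so deduplicating each partition independently is sufficient.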
---
The expectation was to implement the solution in a language of your choice, or at a minimum to come up with the idea/solution.
Interview Questions (1)
Deduplicate Huge Dataset Using Disk-Based Hashing
Given a large file containing IDs (one per line), some appearing multiple times, generate a new file with only unique IDs. The constraint is that the entire dataset cannot fit into memory (RAM). You're allowed to use as much disk space as needed but cannot increase RAM or add more servers.
Input Format:
1
2
1
3
4
2
Output Format:
4
1
3
2
Constraints:
- Cannot load entire dataset into memory.
- Can use additional disk space freely.
- Order preservation is not required.
Expected Approach:
Use a hash-based strategy such as consistent hashing to partition IDs across multiple smaller temporary files. Process the input stream by hashing each ID and directing it to the appropriate temporary file. Load each temporary file into memory one at a time, deduplicate its contents using a set-like structure, and write back the unique values.
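One detail an interviewer may probe is how many temporary files to create. A hedged sketch of one way to size the partition count from the input file size and a RAM budget; the safety factor is a rough guess to cover per-entry set overhead and hash skew, not a value from the original question.

```python
import math
import os


def choose_num_partitions(input_path: str, memory_budget_bytes: int,
                          safety_factor: float = 4.0) -> int:
    """Pick a partition count so each partition should fit well within RAM.

    Assumes partition size is roughly (file size / partition count); the
    safety factor pads for in-memory set overhead and uneven hashing.
    """
    file_size = os.path.getsize(input_path)
    return max(1, math.ceil(file_size * safety_factor / memory_budget_bytes))
```

For example, a 100 GB input with a 1 GB memory budget and a safety factor of 4 yields about 400 partitions of roughly 256 MB of raw data each.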