Data Engineer || NielsenIQ || 2 YOE
Summary
I interviewed for a Data Engineer position at NielsenIQ with 2 years of experience. The interview covered data file formats, Airflow, advanced Python, SQL, and PySpark, and included several hands-on problem-solving challenges.
Full Experience
During my interview for the Data Engineer role, the discussion primarily revolved around my 2 years of experience and the core technologies used in data engineering. The interview session was structured around several key areas:
Data Engineering Concepts:
I was asked about different data file formats, specifically Delta files and ORC, and their characteristics. A key discussion point was the difference between SCD Type 2 and Delta files.
Airflow:
Questions focused on Airflow functions and how to effectively design Directed Acyclic Graphs (DAGs) to handle both sequential and parallel job execution.
Advanced Python:
The interviewer delved into advanced Python functionalities beyond basic lambda functions, such as Decorators and Generators. Additionally, I was tested on concepts like inheritance and static methods.
SQL Questions:
I was presented with a dataset containing trip information and asked to write several SQL queries:
- Identify the longest trip per vehicle in terms of time spent on the road.
- Compare each trip’s distance with the previous trip’s distance for the same vehicle. I was given a hint involving "unbounded preceding 1 row".
- Provide the most optimized solution to drop duplicates from a table, indicating that solutions using DISTINCT or INSERT OVERWRITE were not considered sufficiently optimized.
PySpark Questions:
A CSV file structure was provided, and I was given a series of tasks to perform using PySpark:
- Load the dataset into a Spark DataFrame, inferring the schema.
- Convert the pickup_datetime and dropoff_datetime columns to timestamp type.
- Implement a User Defined Function (UDF) to classify trips as 'short', 'medium', or 'long' based on trip_duration with specific time thresholds.
Overall, the interview was quite thorough, covering a wide range of topics pertinent to a Data Engineer role.
Interview Questions (10)
Discuss different data file formats, specifically Delta files and ORC.
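The practical contrast worth drawing here: ORC is a plain columnar file format, while Delta layers a transaction log on top of Parquet files to add ACID writes, schema enforcement, and time travel. A minimal PySpark sketch of writing both, assuming the delta-spark package is installed; the paths and sample data are placeholders:

```python
from pyspark.sql import SparkSession

# Delta needs the delta-spark dependency plus these two session settings.
spark = (
    SparkSession.builder.appName("format-demo")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "A"), (2, "B")], ["id", "value"])

# ORC: a columnar file format; a write simply produces immutable files.
df.write.format("orc").mode("overwrite").save("/tmp/trips_orc")

# Delta: columnar files plus a _delta_log, enabling ACID MERGE/UPDATE/DELETE.
df.write.format("delta").mode("overwrite").save("/tmp/trips_delta")
```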
Explain the differences between Slowly Changing Dimension Type 2 (SCD2) and Delta Lake file format.
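The distinction I would draw: SCD Type 2 is a data-modelling pattern (keep full history with current/expired markers), while Delta is a storage format whose MERGE support makes SCD2 practical to implement in place. A hypothetical sketch of the "expire the old row" half of an SCD2 merge on a Delta table; the path, updates_df, and the column names (customer_id, address, is_current, end_date) are all assumptions:

```python
from delta.tables import DeltaTable

# Assumes `spark` is a Delta-enabled session and `updates_df` holds the
# incoming batch of changed customer records.
dim = DeltaTable.forPath(spark, "/tmp/dim_customer")

(
    dim.alias("t")
    .merge(
        updates_df.alias("s"),
        "t.customer_id = s.customer_id AND t.is_current = true",
    )
    # Expire the current version whenever a tracked attribute changed.
    .whenMatchedUpdate(
        condition="t.address <> s.address",
        set={"is_current": "false", "end_date": "current_date()"},
    )
    .execute()
)
# A second step (not shown) appends the changed rows as new current versions.
```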
Explain Airflow functions and demonstrate how to design DAGs for sequential and parallel job execution.
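A minimal sketch of the kind of DAG I described, assuming a recent Airflow 2.x; DAG id and task names are placeholders. extract runs first, the two transforms fan out in parallel, and load waits for both:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="sequential_and_parallel_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = EmptyOperator(task_id="extract")
    transform_a = EmptyOperator(task_id="transform_a")
    transform_b = EmptyOperator(task_id="transform_b")
    load = EmptyOperator(task_id="load")

    # Sequential edge, then a Python list to fan out in parallel.
    extract >> [transform_a, transform_b]
    # Both transforms must finish before load starts.
    [transform_a, transform_b] >> load
```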
Discuss advanced Python concepts such as Decorators, Generators, inheritance, and static methods.
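A compact refresher covering the decorator, generator, and static-method parts of the question in plain Python (inheritance omitted for brevity):

```python
import functools
import time

def timed(func):
    """Decorator: wraps a function to report how long it took."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
        return result
    return wrapper

def squares(n):
    """Generator: yields values lazily instead of building a full list."""
    for i in range(n):
        yield i * i

@timed
def sum_of_squares(n):
    return sum(squares(n))

class Converter:
    @staticmethod
    def km_to_miles(km):
        # Static method: lives in the class namespace, needs no self/cls.
        return km * 0.621371

sum_of_squares(1_000_000)
print(Converter.km_to_miles(5.0))
```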
Given a Trips table with columns TripID, VehicleID, StartTime, EndTime, and DistanceKM, write a SQL query to identify the longest trip (in terms of time spent on the road) for each vehicle.
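One way to answer, expressed through Spark SQL so all the examples here stay in Python; the same window-function query works in most SQL engines. Assumes Trips is registered as a table/view and StartTime/EndTime are timestamps:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rank each vehicle's trips by time on the road and keep the top one.
longest_trips = spark.sql("""
    SELECT TripID, VehicleID, StartTime, EndTime, DistanceKM
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (
                   PARTITION BY VehicleID
                   ORDER BY unix_timestamp(EndTime) - unix_timestamp(StartTime) DESC
               ) AS rn
        FROM Trips t
    ) ranked
    WHERE rn = 1
""")
longest_trips.show()
```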
Given a Trips table, write a SQL query to compare each trip’s distance with the previous trip’s distance for the same vehicle. I was also given a hint that mentioned "select vehicleID" and "unbounded preceding 1 row", pointing toward window functions.
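The hint points at window frames, but LAG() is the idiomatic way to fetch the previous trip's distance per vehicle. Again sketched through Spark SQL against the assumed Trips table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

comparison = spark.sql("""
    SELECT VehicleID,
           TripID,
           DistanceKM,
           -- previous trip's distance for the same vehicle, in trip order
           LAG(DistanceKM) OVER (
               PARTITION BY VehicleID ORDER BY StartTime
           ) AS prev_distance_km
    FROM Trips
""")
comparison.show()
```

From there the comparison itself is a simple subtraction or CASE expression on DistanceKM and prev_distance_km.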
Given a table, describe the most optimized solution to drop duplicates. The interviewer was not convinced by DISTINCT or INSERT OVERWRITE approaches, suggesting a need for a more efficient method (e.g., using ROW_NUMBER() or CTEs).
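A sketch of the ROW_NUMBER()-in-a-CTE approach the question alludes to; the partition and ordering columns are assumptions and should match whatever defines a duplicate in the real schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keep exactly one row per TripID, choosing which copy survives via ORDER BY.
deduped = spark.sql("""
    WITH ranked AS (
        SELECT t.*,
               ROW_NUMBER() OVER (
                   PARTITION BY TripID      -- columns that define a duplicate
                   ORDER BY StartTime       -- which copy to keep
               ) AS rn
        FROM Trips t
    )
    SELECT * FROM ranked WHERE rn = 1
""").drop("rn")
```

Unlike DISTINCT, this lets you pick deterministically which copy survives, and some engines (e.g. SQL Server) allow a DELETE through the same CTE so duplicates are removed in place rather than by rewriting the table.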
Given a CSV dataset (pickup_datetime,dropoff_datetime,passenger_count,trip_distance,trip_duration), load it into a Spark DataFrame and infer the schema from the data.
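The loading step, with the file path as a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

trips_df = (
    spark.read
    .option("header", True)       # first row carries the column names
    .option("inferSchema", True)  # sample the file to guess column types
    .csv("/data/trips.csv")       # placeholder path
)
trips_df.printSchema()
```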
From the loaded Spark DataFrame, convert the pickup_datetime and dropoff_datetime columns to timestamp type.
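Continuing from the DataFrame loaded above; the timestamp format pattern is an assumption, so adjust it to the actual data:

```python
from pyspark.sql import functions as F

trips_df = (
    trips_df
    .withColumn("pickup_datetime",
                F.to_timestamp("pickup_datetime", "yyyy-MM-dd HH:mm:ss"))
    .withColumn("dropoff_datetime",
                F.to_timestamp("dropoff_datetime", "yyyy-MM-dd HH:mm:ss"))
)
```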
Implement a User Defined Function (UDF) in PySpark to classify trips as 'short', 'medium', or 'long' based on the trip_duration column. The classification rules are: Short: duration ≤ 15 minutes; Medium: 15 < duration ≤ 30 minutes; Long: duration > 30 minutes.
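A UDF matching the stated thresholds. I am assuming trip_duration is stored in seconds (as in the well-known NYC taxi dataset); drop the division if it is already in minutes:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def classify_trip(duration_seconds):
    # Returns None for missing durations so nulls propagate cleanly.
    if duration_seconds is None:
        return None
    minutes = duration_seconds / 60.0
    if minutes <= 15:
        return "short"
    if minutes <= 30:
        return "medium"
    return "long"

trips_df = trips_df.withColumn("trip_category", classify_trip("trip_duration"))
```

Worth mentioning in the interview: a native F.when(...).otherwise(...) expression does the same job without the Python serialization overhead of a UDF, but the question explicitly asked for a UDF.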