Microsoft Senior Applied Scientist (L63) Interview Experience
Summary
I completed a multi‑round interview process for a Senior Applied Scientist role at Microsoft, passing the phone screen and four onsite rounds covering ML fundamentals, coding, and problem solving, and ultimately received an offer.
Full Experience
Background: Senior Data Scientist with ~3 YOE
Interview Process
- Phone Screen (60 min)
  - Format: Coding + Problem Solving
  - Problem Solving: Behavioral scenarios and use cases
  - Coding: Min Stack + follow-ups (a sketch appears after this list)
  - Outcome: Passed to onsite
- Onsite Loop (4 rounds, 60 min each)
  - Note: For two of the rounds, the recruiter's prep material did not match the actual format.
- Round 1: ML Fundamentals + ML Coding
  - Actual Format: As described
  - ML Coding: Implement K-means from scratch
  - Follow-up: How would you vectorize this implementation? (I struggled a bit with matrix broadcasting)
- Round 2: ML Problem Solving + ML System Design
  - Actual Format: ML fundamentals + coding (no system design)
  - ML Questions (that I remember):
    - Reinforcement learning: Thompson sampling vs. epsilon-greedy, explore vs. exploit trade-offs
    - Calibration: Platt scaling
    - Imbalanced data: downsampling the majority class
  - Coding: Find the maximum number of points on a line, given a 2-D array of points. I spent time handling floating-point precision loss but arrived at the optimized solution.
- Round 3: Data Analysis + Applied Sciences
  - Actual Format: ML questions + coding
  - ML Questions:
    - Offline metrics higher than online: why, and how to address it?
    - Data drift: covariate shift vs. label drift
    - Statistical tests for drift detection
    - Cold-start problem for new ads
    - Explore/exploit trade-offs
    - BERT vs. GPT: architectures and differences
    - Off-policy learning: "You have logged data from a model trained on an old policy; how would you fit a new model to update the policy?" (I found this confusing)
  - Coding: Implement self-attention and masked self-attention. I got the mask syntax slightly wrong, but the code was otherwise correct and optimal.
- Round 4: Problem Solving + Coding (HackerRank)
  - Format: As described
  - Coding: Merge intervals
  - ML Fundamentals: Bias-variance trade-off, bagging, boosting, calibration, drift
  - Behavioral: Standard behavioral questions (I don't remember the specifics)
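For reference, here's a minimal Min Stack sketch along the lines of what the phone screen asked (my own implementation; the follow-ups varied):

```python
class MinStack:
    """Stack with O(1) push/pop/top/get_min: each entry stores the value
    paired with the minimum of the stack at the time it was pushed."""

    def __init__(self):
        self._stack = []  # list of (value, min_so_far) pairs

    def push(self, x):
        current_min = min(x, self._stack[-1][1]) if self._stack else x
        self._stack.append((x, current_min))

    def pop(self):
        return self._stack.pop()[0]

    def top(self):
        return self._stack[-1][0]

    def get_min(self):
        return self._stack[-1][1]
```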
Key Takeaways
- Prepare for coding in every round
- ML fundamentals are crucial: the specific topics depend on the team and role, so prepare for them thoroughly
- Coding spans theory to implementation: be ready for everything from LeetCode-style problems to implementing ML algorithms from scratch
Outcome
Offer
Interview Questions (13)
Implement K-means from scratch
Implement the K‑means clustering algorithm from scratch. After the implementation, discuss how you would vectorize the algorithm to improve performance, especially handling matrix broadcasting.
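A minimal NumPy sketch of the vectorized version (function signature and defaults are my own; the broadcast in the distance computation is the part I stumbled on):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Vectorized K-means. X: (n, d) array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # init from data
    for _ in range(n_iters):
        # Broadcasting (n, 1, d) - (k, d) -> (n, k, d); squared distances (n, k).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return centroids, labels
```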
Find maximum number of collinear points
Given a 2‑D array of points, find the maximum number of points that lie on a single straight line. The solution should handle floating‑point precision issues efficiently.
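One standard way to avoid float precision issues, assuming integer coordinates and distinct points, is to bucket point pairs by their slope reduced to lowest terms instead of a float ratio; a sketch:

```python
from collections import defaultdict
from math import gcd

def max_points_on_line(points):
    """For each anchor, count other points by reduced slope (dx/g, dy/g).
    Integer ratios avoid the precision loss of float slopes."""
    best = min(len(points), 1)
    for i, (x1, y1) in enumerate(points):
        slopes = defaultdict(int)
        for x2, y2 in points[i + 1:]:
            dx, dy = x2 - x1, y2 - y1
            # Normalize sign so (1, -2) and (-1, 2) map to the same key.
            if dx < 0 or (dx == 0 and dy < 0):
                dx, dy = -dx, -dy
            g = gcd(dx, dy)  # nonzero because points are distinct
            slopes[(dx // g, dy // g)] += 1
        if slopes:
            best = max(best, max(slopes.values()) + 1)
    return best
```

This is O(n^2) overall, which as far as I know is the expected optimal complexity here.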
Thompson sampling vs. epsilon‑greedy
Explain the differences between Thompson sampling and epsilon‑greedy in reinforcement learning, and discuss the explore vs. exploit trade‑off.
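A toy Bernoulli-bandit sketch contrasting the two (the reward rates are made-up numbers for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_ctr = np.array([0.04, 0.05, 0.07])  # hypothetical per-arm reward rates
n_arms = len(true_ctr)

# Thompson sampling state: Beta(1, 1) posterior per arm.
a, b = np.ones(n_arms), np.ones(n_arms)

def thompson_pick():
    # Sample a plausible rate from each arm's posterior and play the argmax;
    # exploration falls out of posterior uncertainty, no tuning knob needed.
    return int(np.argmax(rng.beta(a, b)))

# Epsilon-greedy state: explore uniformly with fixed probability eps.
eps, counts, means = 0.1, np.zeros(n_arms), np.zeros(n_arms)

def eps_greedy_pick():
    return int(rng.integers(n_arms)) if rng.random() < eps else int(np.argmax(means))

for _ in range(10_000):
    arm = thompson_pick()                 # or eps_greedy_pick()
    r = float(rng.random() < true_ctr[arm])
    a[arm] += r; b[arm] += 1.0 - r        # Thompson posterior update
    counts[arm] += 1                      # eps-greedy running-mean update
    means[arm] += (r - means[arm]) / counts[arm]
```

The key contrast to articulate: epsilon-greedy explores at a fixed rate forever, while Thompson sampling's exploration naturally decays as the posteriors concentrate.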
Calibration with Platt scaling
Describe how Platt scaling works for calibrating the probabilities of a binary classifier.
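A minimal sketch, assuming held-out scores from the uncalibrated classifier (Platt scaling is essentially a 1-D logistic regression on those scores):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(val_scores, val_labels):
    """Fit p = sigmoid(a * s + b) on held-out (score, label) pairs."""
    lr = LogisticRegression()
    lr.fit(np.asarray(val_scores).reshape(-1, 1), val_labels)
    return lambda s: lr.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]
```

In scikit-learn, CalibratedClassifierCV(method="sigmoid") wraps the same idea; either way, the calibrator should be fit on data the base model never trained on.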
Handling imbalanced data via downsampling
How would you handle an imbalanced dataset by downsampling the majority class?
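A minimal sketch of random downsampling on the training split (binary labels assumed; the names are mine):

```python
import numpy as np

def downsample_majority(X, y, ratio=1.0, seed=0):
    """Keep all minority rows and a random subset of majority rows so that
    majority:minority is about `ratio`. Apply to the training split only."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    maj, mino = classes[np.argmax(counts)], classes[np.argmin(counts)]
    maj_idx = np.flatnonzero(y == maj)
    min_idx = np.flatnonzero(y == mino)
    keep = rng.choice(maj_idx, size=int(ratio * len(min_idx)), replace=False)
    idx = rng.permutation(np.concatenate([keep, min_idx]))
    return X[idx], y[idx]
```

Worth mentioning in an interview: downsampling shifts the base rate, so predicted probabilities need recalibration afterwards (which ties back to the Platt scaling question).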
Offline metrics higher than online metrics
Why might offline evaluation metrics be higher than online (A/B test) metrics, and how would you address this discrepancy?
Data drift: covariate shift vs. label drift
Differentiate between covariate shift and label drift as forms of data drift, and explain how each can be detected.
Statistical tests for drift detection
What statistical tests can be used to detect data drift in a production ML system?
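A sketch of two common choices with SciPy: the two-sample Kolmogorov-Smirnov test for continuous features and a chi-square test for categorical ones (PSI is another industry favorite, omitted here):

```python
import numpy as np
from scipy import stats

def numeric_drift(ref, live, alpha=0.01):
    """Two-sample KS test: are the reference and live samples drawn
    from the same continuous distribution?"""
    return stats.ks_2samp(ref, live).pvalue < alpha

def categorical_drift(ref_counts, live_counts, alpha=0.01):
    """Chi-square test: do live category frequencies match the
    proportions observed in the reference window?"""
    ref_counts = np.asarray(ref_counts, dtype=float)
    live_counts = np.asarray(live_counts, dtype=float)
    expected = ref_counts / ref_counts.sum() * live_counts.sum()
    return stats.chisquare(live_counts, f_exp=expected).pvalue < alpha
```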
Cold start problem for new ads
Explain the cold‑start problem when introducing new advertisements into a recommendation system and possible mitigation strategies.
BERT vs. GPT architectures
Compare the architectures of BERT and GPT models, highlighting key differences in design and typical use‑cases.
Off‑policy learning from logged data
Given logged data generated by an old policy, how would you train a new model to update the policy (off‑policy learning)?
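One standard answer is importance weighting (inverse propensity scoring): reweight each logged example by how likely the new policy is to take the logged action relative to the old one. A hedged sketch of the value estimate, assuming the logging policy's action propensities were recorded:

```python
import numpy as np

def ips_value(rewards, logged_propensities, new_policy_probs, clip=10.0):
    """Off-policy estimate of a candidate policy's value from logged data.
    rewards[i]: observed reward for the logged action,
    logged_propensities[i]: P(action | context) under the old policy,
    new_policy_probs[i]: P(same action | context) under the new policy.
    Clipping the weights trades a little bias for much lower variance."""
    w = np.minimum(np.asarray(new_policy_probs) / np.asarray(logged_propensities), clip)
    return float(np.mean(w * np.asarray(rewards)))
```

The same weights can also be used as per-example sample weights when fitting the new model on the logged data.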
Implement self‑attention and masked self‑attention
Implement the self‑attention mechanism and its masked variant as used in transformer models.
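A minimal single-head sketch in NumPy; the causal branch is exactly the mask syntax I fumbled (set the strictly upper triangle to -inf before the softmax):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv, causal=False):
    """Single-head (optionally masked) self-attention. X: (seq, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # (seq, seq)
    if causal:
        # Mask future positions: strictly upper triangle -> -inf.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    # Row-wise softmax (stable: subtract the per-row max first).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```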
Merge intervals
Given a collection of intervals, merge all overlapping intervals and return the resulting list of non‑overlapping intervals.
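A short sketch of the standard sort-and-sweep approach:

```python
def merge_intervals(intervals):
    """Sort by start, then extend or append as we sweep. O(n log n)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # overlap: extend last
        else:
            merged.append([start, end])              # gap: start new interval
    return merged

# merge_intervals([[1, 3], [2, 6], [8, 10], [15, 18]]) -> [[1, 6], [8, 10], [15, 18]]
```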