My Experiences at Databricks
While working on the batch scheduling team at Facebook, I noticed something unusual. Facebook was migrating all of its data workloads from the in-house Hive system to a new system based on open-source Spark. This was strange because I had heard countless times from our engineers that open-source tools simply can't operate at Facebook scale. So why was Facebook adopting Spark instead of building something from scratch like it usually did? Even though I had no idea what Spark was back then, I started to think there was something special about it. My interest in Spark eventually led me to join Databricks, which was founded by the creators of Spark.
It was obvious when I joined Databricks that my time there would be drastically different from my time at Facebook. Most of my knowledge about Facebook's internal ecosystem became worthless overnight. For 2 months I was completely lost when my colleagues discussed public tooling and related jargon like Kubernetes DaemonSets, NAT Gateways, and BoneCP. Onboarding was tough but I was learning so much, which is exactly what I wanted. And this time the knowledge I was gaining about open-source technology and public tools would serve me for the rest of my career. I realized in retrospect that relying on Facebook's world-class tools had weakened some of my engineering skills because those tools abstracted away so much detail and complexity. Ramping up at Databricks also made me fluent in Scala, which gave me an entirely new perspective on programming (see separate post).
I also noticed that Databricks heavily exposed me to trends in the data industry and other tools/companies in the market. By listening to coworker conversations I would learn about things like the JDBC protocol, dbt, and Apache Arrow. Everyone I met at Databricks was extremely sharp and invested in the company mission. Managers, PMs, and even salespeople would have long technical debates, and no topic was ever too detailed to discuss. Even our CEO would ask questions about details like JVM startup time. My mentors and managers stunned me with their knowledge and brilliance - I didn't realize engineers could be so good! I was inspired to work hard so that one day I could become like them. In fact, the reason I even started this blog was that I saw the value of my mentor's excellent technical communication skills and wanted to grow myself in this area.
I started my Databricks journey on the Serverless team. This was a natural team to join given my prior experience in resource management at Facebook. The Serverless team was born because customers were tired of babysitting the VMs in their Spark clusters. Before this team existed, customers had to manually adjust the cluster size or terminate the cluster depending on resource utilization. Additionally, customers had to manually recover machines in the cloud that became unresponsive for various reasons like spot kills, OOMs, etc. The Serverless team built cluster autoscaling, autorecovery, autotermination, and driver health monitoring features to relieve customers of these operational tasks (see patent). I found these features really interesting and important, so I volunteered to help migrate them to a new microservice to improve scalability. Soon afterwards there was a re-org and I moved to the Clusters team along with these features that I now owned.
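To make the autoscaling idea concrete, here is a toy sketch of the kind of decision such a feature automates. The function name, thresholds, and step sizes below are all hypothetical illustrations, not the actual Databricks algorithm: grow the cluster when utilization is high, shrink it when utilization is low, and clamp to the configured bounds.

```python
def next_cluster_size(current_workers: int, utilization: float,
                      min_workers: int, max_workers: int) -> int:
    """Toy autoscaling policy: grow on high utilization, shrink on low.

    `utilization` is the fraction of worker capacity in use (0.0-1.0).
    The thresholds and doubling/halving steps are illustrative only.
    """
    if utilization > 0.8:        # busy: add capacity
        target = current_workers * 2
    elif utilization < 0.3:      # idle: remove capacity
        target = current_workers // 2
    else:                        # steady state: hold
        target = current_workers
    # Never scale outside the user-configured bounds.
    return max(min_workers, min(max_workers, target))
```

This is exactly the kind of knob-turning that customers previously had to do by hand on a live cluster.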
The Clusters team was responsible for provisioning and managing the VMs, disks, and other resources needed to create Spark clusters. Working on this team made me fluent in Kubernetes and cloud tooling like S3, Terraform, and IAM. Initially I was put on a new project to move our VM acquisition and lifecycle management logic into a new microservice. This was needed to unlock the next level of scale because Databricks usage was skyrocketing and we were trying to keep up with customer demand. I got to design and build many interesting components like our VM garbage collector and our VM pool capacity management algorithm. After this project was completed, I volunteered to upgrade all our primary databases from MySQL 5.6 → 5.7. This project was painful but it helped me internalize the best practices needed to migrate critical components. I also added the company's first MySQL read replicas to help our persistence layer support growing traffic that was mostly read-heavy. This project taught me more about the MySQL binlog, managing replication lag, and improving database performance by tuning configurations like the transaction log disk flush delay.
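Managing replication lag in practice often comes down to routing reads away from replicas that have fallen too far behind. Here is a minimal sketch of that routing decision; the function name and threshold are hypothetical, and in a real deployment the lag values would come from something like `Seconds_Behind_Source` in MySQL's `SHOW REPLICA STATUS` output.

```python
def choose_read_endpoint(replica_lag_seconds: dict,
                         max_lag_seconds: float,
                         primary: str = "primary") -> str:
    """Pick the least-lagged replica for a read, falling back to the primary.

    `replica_lag_seconds` maps replica name -> current replication lag in
    seconds. If every replica exceeds `max_lag_seconds` (or there are no
    replicas), the read goes to the primary to avoid serving stale data.
    """
    eligible = {name: lag for name, lag in replica_lag_seconds.items()
                if lag <= max_lag_seconds}
    if not eligible:
        return primary
    return min(eligible, key=eligible.get)
```

A policy like this lets the read-heavy traffic fan out to replicas while bounding how stale a response can be.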
In 2020 Databricks started to focus on supporting BI (Business Intelligence) workloads because Snowflake had shown this market was massive. Some of our customers were already using Databricks for BI, even though our platform wasn’t specifically designed for this use case. Low latency is a must-have for a good BI experience because a live human is waiting for the UI to respond. To make BI queries faster, my team started a new project responsible for provisioning compute more quickly. At the time, it took roughly 2 minutes to start a cluster and our goal was to reduce it to 10 seconds. We accomplished this by introducing a new type of cluster that was backed by VMs from a Databricks AWS account (as opposed to VMs from a customer’s AWS account). This meant we could make latency-reducing optimizations like caching pre-warmed VMs and eagerly starting Spark without needing customer approval or spending customer money. I was the second full-time contributor on this project, and the team quickly grew in size and scope. I can confidently say this was the most interesting and exciting project I’ve worked on in my career so far. I learned a lot about Kubernetes, multi-tenant security, and managing backwards compatibility issues.
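The pre-warmed VM idea can be sketched as a simple pool: serve cluster-start requests from already-booted VMs when possible, and fall back to a slow cold boot only when the pool is empty. The class and timings below are hypothetical illustrations (roughly mirroring the 2-minute vs. 10-second numbers above), not the actual system.

```python
import collections

class WarmVMPool:
    """Toy warm pool: hand out pre-booted VMs to cut cluster start latency."""

    COLD_BOOT_SECONDS = 120   # illustrative: booting a fresh VM + Spark
    WARM_HANDOFF_SECONDS = 10  # illustrative: attaching a pre-warmed VM

    def __init__(self, prewarmed_vm_ids):
        self._pool = collections.deque(prewarmed_vm_ids)

    def acquire(self):
        """Return (vm_id, estimated_startup_seconds) for a cluster start."""
        if self._pool:
            return self._pool.popleft(), self.WARM_HANDOFF_SECONDS
        # Pool exhausted: fall back to booting a fresh VM the slow way.
        return "cold-vm", self.COLD_BOOT_SECONDS

    def refill(self, vm_id):
        """Background process adds a freshly warmed VM back to the pool."""
        self._pool.append(vm_id)
```

The key enabler described above is that these VMs live in a Databricks-owned account, so the pool can be kept warm without spending any individual customer's money.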
Other Contributions Inspired By Facebook
My experiences at Facebook gave me opinions about how Databricks should scale certain things. Because I had previously worked at a much larger company with mature tooling and engineering practices, I felt like I could almost see the future at Databricks. Others who have previously worked at big tech companies likely have similar stories.
When joining Databricks, I couldn't help but notice our configuration system was much worse than Facebook's. Service configurations at Databricks were injected into our microservice containers via environment variables, which meant Kubernetes would restart the container upon any config change. K8s ConfigMaps were not an option due to some technical reasons I don't remember. We didn't have zero-downtime service restarts at the time because we served some APIs using state stored in memory, so engineers would usually ship config changes with releases to minimize downtime. This meant it would typically take 2 weeks to flip a flag to enable some feature in production. At Facebook, thousands of engineers merged config changes to prod within minutes every day. Databricks had briefly experimented with LaunchDarkly, but quickly stopped because too many outages were caused by flipping flags without proper testing or diligence. I tried to persuade the company to reconsider using dynamic configurations but it was deemed too risky.
Then about a year later, I noticed that our founding engineer had published a design doc for a tool to deploy emergency config changes without restarting the container. I reached out to him and we built a prototype during a hackathon that let engineers make emergency config changes to prod within 2 minutes. Our prototype made dynamic configuration changes safe because we merged all config changes into source control, which ensured auditability, peer review, and CI verification. We won a company stability prize for our prototype and it was soon productionized by another teammate. A few months later, our tool was commonly used to mitigate large outages across the company. This was a satisfying experience and made me grateful for the perspective Facebook gave me.
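The runtime half of that idea (applying a reviewed config change to a live process without restarting it) can be illustrated with a toy in-process reloader. The class and method names here are hypothetical and stand in for the real tool, which sourced its payloads from the merged contents of source control:

```python
import json
import threading

class DynamicConfig:
    """Toy dynamic config store: apply config changes at runtime without a
    process restart. A real system would watch a file or poll a service;
    here `reload()` is handed the new payload directly."""

    def __init__(self, initial: dict):
        self._lock = threading.Lock()
        self._values = dict(initial)

    def get(self, key, default=None):
        with self._lock:
            return self._values.get(key, default)

    def reload(self, payload: str):
        """Atomically replace the config from a JSON payload, e.g. the
        reviewed and CI-verified config file merged into source control."""
        new_values = json.loads(payload)
        with self._lock:
            self._values = new_values
```

The safety came from the pipeline in front of `reload()`: every payload had already passed peer review and CI before it could reach a running service.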
Thoughts on Snowflake
Many people ask about my thoughts on Snowflake and if I worry about the competition. Obviously I’m extremely biased, but the short answer is no. What Snowflake has done is incredible, but there are a few reasons why I think Databricks will be more successful. Some of my reasons are objective and some are just personal opinions. Note the statements below might be based on outdated or even incorrect data. These are just my 2 cents and when in doubt you should trust more rigorous sources.
Databricks currently holds the world record for the TPC-DS data warehouse performance benchmark, which means that Databricks is beating Snowflake at its own game in both performance and cost. Databricks is not a data warehouse company - its primary focuses have been data science, ETL, and machine learning. Snowflake still claims they have faster benchmark results, but you can read this post to see why this isn’t true and verify things for yourself.
Talent + Expertise
Databricks engineering talent is unmatched by any other company I've seen. The company has managed to recruit a good portion of the world's top database talent. For example, Databricks recently hired the director in charge of Google's Spanner, which is probably the most sophisticated database in the world today. I've also seen Databricks consistently recruit and collaborate with top researchers specializing in databases. Access to top talent and the latest developments is crucial when competing in a technical field that is heavily disrupted by new innovations.
Vision + Architecture
In the long run, I think Databricks will win because it has the better architecture. Snowflake's data warehouse architecture requires 2 copies of data to be stored and maintained. Databricks does analytics on data in place, and this data typically resides in blob storage. The Databricks architecture is inherently better because it is:
- Less expensive since only 1 copy of the data is required
- More realtime since analytics are done on the source of truth as opposed to an outdated copy that was synced to the warehouse
- More flexible since it can handle unstructured data. This is especially important for enabling newer use cases like machine learning
- Less complex since there are no data syncing tools and associated operational overhead involved
If you want to hear a less biased perspective, this article provides a pretty fair comparison of Snowflake and Databricks. These days it appears that both companies are copying each other as Snowflake attempts to support ML and lakehouse while Databricks tries to improve its BI experience.
Reflections & Lessons Learned
Here are a few lessons I’ve internalized after reflecting on my time at Databricks.
- If you're early in your career, optimize for learning and personal growth instead of promotions and money. Of course not everyone has the privilege to do this, but when possible I always recommend it. This approach is more fun and also more closely correlated with skill, which eventually leads to promotions and money anyway. Looking back now, joining Databricks was the best financial decision I've ever made even though that wasn't my goal. Even if the company hadn't been a success, I would have no regrets because knowledge and skill are priceless to me. Luckily though, it seems that solving interesting and difficult problems in software tends to make money.
- The importance of working with top talent cannot be overstated. This factor is more important to me than the pay, company financials, and prestige. If the people are smart, I have more faith that the other factors will work themselves out.
- Don’t settle for work that isn’t interesting. Early in my career I became painfully aware that the technical problems I enjoyed weren’t important to my team. I kept switching teams until my interests aligned with the business needs. Each time I switched, I learned more about myself and got closer to the work I actually enjoyed. I’m glad I kept searching instead of settling because now I’m very satisfied with my career.