Managing software at scale is a challenge that every growing organization must confront. Spotify, with its 675 million users, 100 million tracks, and operations in over 180 markets, has faced these challenges head-on. In this article, we’ll further explore Spotify’s journey to improve developer experience, leverage automation, and manage software at scale through fleet management. Their story offers useful lessons for companies aiming to streamline operations, cut costs, and reduce engineering toil, and it also clarifies the technical requirements necessary to achieve this level of automation.
Since its launch in 2008, Spotify has grown to 2,700+ engineers working from all parts of the globe. The team performs an astounding average of 4,100+ production deployments every single day. In 2024 alone, Spotify processed 652,000 automated pull requests across its repositories, managed more than 42,000 data pipelines and captured 1.6 trillion events per day — illustrating the sheer scale and velocity of its development efforts.
However, operating at this scale brings significant challenges. Imagine tackling common scenarios like:
These types of migrations are critical but time-consuming, often spanning months. The burden of maintenance work — migrating dependencies, updating frameworks, patching vulnerabilities, and deprecating outdated technologies — was falling on individual teams, leading to fragmentation and inefficiencies.
In its early years, Spotify’s adoption of autonomous development teams, called squads, that operated like startups allowed for rapid innovation. Squads made their own decisions, built their own software, and moved fast. However, as the company grew and each team maintained its own tools, frameworks and processes, this autonomy created a fragmented technology ecosystem.
Over time, the number of software components grew quickly, and maintaining them became a burden. Engineers faced toil: necessary but repetitive work that detracted from building new features. Spotify refers to this type of engineering toil as “the work that gets between me and what I want to do.”
To address fragmentation, Spotify inverted the problem. Instead of squads handling maintenance individually, the platform organization automated it. The platform team adopted a new mantra: “We take the pain away, so our users (engineers) don’t have to perform this toil.”
This shift became a company objective and was guided by three core principles:
Standardizing technology involved a multifaceted approach. As part of the Golden Tech framework, Spotify built a tech radar tool defining a set of technology standards for everyone to build on, which they called golden technologies. Spotify also standardized how they build and deploy different types of software through golden paths, or self-paced guide books. To further enable and facilitate standardization, the team also implemented software templates within the Golden Tech framework that streamline creation of back-end services, pipelines, websites, or any other type of software.
To ensure company-wide adoption of Golden Tech and unify its technology stack, Spotify developed a tool that measures the quality, reliability, and alignment of the software ecosystem. The tool incentivizes teams to reach certification levels by passing checks for their software, progressing toward the “golden state.” The closer teams get, the more automated, free maintenance updates they receive.
Beyond individual teams, the tool provides department-wide visibility into overall tech health and certification levels. It also tracks their progress and how each department aligns with organizational goals.
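A check-based certification model like the one described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Spotify's tool: the `Component` fields, the three checks, and the level names in `LEVELS` are all invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical component metadata; the field names are illustrative only.
@dataclass
class Component:
    name: str
    has_owner: bool
    uses_golden_framework: bool
    test_coverage: float  # fraction between 0.0 and 1.0


# Each check is a (label, predicate) pair. These checks are invented examples.
Check = tuple[str, Callable[[Component], bool]]

CHECKS: list[Check] = [
    ("has an owner", lambda c: c.has_owner),
    ("uses a golden framework", lambda c: c.uses_golden_framework),
    ("test coverage >= 80%", lambda c: c.test_coverage >= 0.8),
]

# Hypothetical level names, ordered from lowest to highest.
LEVELS = ["none", "bronze", "silver", "gold"]


def certification_level(component: Component) -> str:
    """Map the number of passed checks to a level (one level per check)."""
    passed = sum(1 for _, check in CHECKS if check(component))
    return LEVELS[passed]


def department_summary(components: list[Component]) -> dict[str, int]:
    """Count components at each level to give a department-wide view."""
    summary = {level: 0 for level in LEVELS}
    for component in components:
        summary[certification_level(component)] += 1
    return summary
```

Counting passed checks is the simplest possible scoring rule. A real system would more likely require specific checks for each level, but the shape is the same: checks are data, so new ones can be added without changing the scoring code.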
This tool and the others (such as the Spotify tech radar, software templates, and CI/CD plugins) are built as plugins for Backstage, Spotify’s developer portal, which serves as a central hub for tools and creates a single pane of glass for platform engineering solutions. This interoperability unlocks compound value across the organization by simplifying the developer experience.
Once Spotify standardized and centralized its tech ecosystem, the next step was automating maintenance. Enter Fleet Shift, a tool designed to perform large-scale code changes across the fleet.
Fleet Shift automates updates across repositories by executing shifts — scripts that modify code at scale. Here’s how it works:
This automation has dramatically reduced migration timelines. For example, upgrading Spotify’s internal service framework used to take 200 days — Fleet Shift reduced this to less than seven days.
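The idea of a shift can be sketched in a few lines of Python. This is a minimal illustration, not Fleet Shift's actual implementation or API. The `ProposedChange` type, the `bump_dependency` and `run_shift` functions, and the in-memory `fleet` dictionary are all hypothetical. A real system would work on cloned repositories and open pull requests rather than return a list.

```python
import re
from dataclasses import dataclass


@dataclass
class ProposedChange:
    """A file rewrite that would become a pull request (hypothetical type)."""
    repo: str
    path: str
    new_content: str


def bump_dependency(content: str, package: str, new_version: str) -> str:
    """Rewrite 'package==<old>' pins to 'package==<new_version>'."""
    pattern = re.compile(rf"^({re.escape(package)}==)\S+$", re.MULTILINE)
    # A function replacement avoids escaping issues in the version string.
    return pattern.sub(lambda m: m.group(1) + new_version, content)


def run_shift(
    fleet: dict[str, dict[str, str]], package: str, new_version: str
) -> list[ProposedChange]:
    """Apply the transformation to every file in every repository.

    `fleet` maps repo name -> {file path -> file content}; this in-memory
    stand-in for real repositories is an assumption of the sketch.
    """
    changes = []
    for repo, files in fleet.items():
        for path, content in files.items():
            updated = bump_dependency(content, package, new_version)
            if updated != content:  # only propose changes that do something
                changes.append(ProposedChange(repo, path, updated))
    return changes
```

Keeping the transformation (`bump_dependency`) separate from the fan-out (`run_shift`) means the same loop can run any script, which matches the description of shifts as scripts that modify code at scale.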
Achieving fleet management at scale requires a robust technical foundation that includes the following.
Spotify transitioned to declarative infrastructure to simplify configuration management across thousands of services. To apply automated changes to the infrastructure, the configuration needed to be data (e.g., JSON, YAML) rather than code (e.g., TypeScript, HCL). Declarative infrastructure enables:
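To make the data-versus-code distinction concrete, the sketch below shows how a tool can safely edit configuration stored as JSON: parse it, change one field, and write it back. The `replicas` field and the `set_min_replicas` function are hypothetical and used only for illustration. The same edit against configuration written as arbitrary TypeScript or HCL would require parsing and understanding a program.

```python
import json


def set_min_replicas(config_text: str, minimum: int) -> str:
    """Raise the (hypothetical) 'replicas' field to at least `minimum`."""
    config = json.loads(config_text)
    if config.get("replicas", 0) < minimum:
        config["replicas"] = minimum
    # Stable key order keeps automated diffs small and reviewable.
    return json.dumps(config, indent=2, sort_keys=True)


original = '{"service": "playlist-api", "replicas": 1}'
print(set_min_replicas(original, 3))
```

Because the change is a plain data transformation, it can be applied identically across every service that follows the same schema.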
Managing dependencies across thousands of services requires:
With hundreds to thousands of migrations per year, Spotify reimagined its CI/CD platform to support:
One of the biggest challenges wasn’t building the software — it was changing the culture. Engineers had to trust automated updates happening without their direct involvement.
To make this work, Spotify emphasized strong engineering practices. Test coverage and automation are non-negotiable. Engineers must write tests to catch future issues, an internal principle known as the Beyoncé Rule: “If you liked it, then you should have put a test on it.”
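In practice, the Beyoncé Rule means that any behavior a team cares about is pinned down by a test, so an automated change that breaks it fails in CI instead of reaching production. The example below is a generic sketch. The `format_duration` function is hypothetical and not taken from Spotify's codebase.

```python
def format_duration(ms: int) -> str:
    """Render a track length in milliseconds as M:SS."""
    minutes, seconds = divmod(ms // 1000, 60)
    return f"{minutes}:{seconds:02d}"


def test_format_duration() -> None:
    # These assertions are the "test on it": if an automated dependency
    # or framework update changes this behavior, the build fails.
    assert format_duration(0) == "0:00"
    assert format_duration(61_000) == "1:01"
    assert format_duration(225_500) == "3:45"
```

A test runner such as pytest would discover and run `test_format_duration` automatically on every pull request, whether the pull request comes from a person or a bot.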
Additionally, the infrastructure platform team had to shift from passive to active product ownership. Instead of merely introducing new platform capabilities, it now owns adoption and uses fleet management to roll out changes that delight the engineers who use its platform.
Fleet Shift has authored over 1.8 million contributions, with a 3:1 ratio of bot contributions to human contributions, and climbing.
Shifts range from simple version bumps to complex multi-repo transformations. For example, when Spotify discovered a critical security vulnerability in Log4j, Fleet Shift updated 80% of the fleet in less than 11 hours — a task that would have taken weeks or months manually.
Spotify continues to refine fleet management, focusing on:
Spotify’s journey offers useful insights for organizations managing software at scale, including:
Spotify’s fleet management journey is a testament to the power of standardization, automation, and cultural change in managing software at scale. By reducing engineering toil and empowering developers, you can build a foundation for long-term innovation.
If there’s one takeaway from Spotify’s story, it’s this: Automation isn’t just a tool — it’s a mindset. Embracing it can transform how your organization builds, maintains, and innovates at scale.
To learn more, check out the first article in this series, Fleet First for Better Developer Experience and Faster Software Delivery.