Beyond the Music: Lessons from Spotify's Fleet Management Journey

Managing software at scale is a challenge that every growing organization must confront. Spotify, with its 675 million users, 100 million tracks, and operations in over 180 markets, has faced these challenges head-on. In this article, we’ll further explore Spotify’s journey to improve developer experience, leverage automation, and manage software at scale through fleet management. Their story offers useful lessons for companies aiming to streamline operations, cut costs, and reduce engineering toil, and it also clarifies the technical requirements necessary to achieve this level of automation.

The scale of the challenge

Since its launch in 2008, Spotify has grown to 2,700+ engineers working from all parts of the globe. The team performs an astounding average of 4,100+ production deployments every single day. In 2024 alone, Spotify processed 652,000 automated pull requests across its repositories, managed more than 42,000 data pipelines and captured 1.6 trillion events per day — illustrating the sheer scale and velocity of its development efforts.

However, operating at this scale brings significant challenges. Imagine tackling common scenarios like:

Updating the Java runtime across all backend services
Upgrading the internal service framework

These types of migrations are critical but time-consuming, often spanning months. The burden of maintenance work — migrating dependencies, updating frameworks, patching vulnerabilities, and deprecating outdated technologies — was falling on individual teams, leading to fragmentation and inefficiencies.

The problem: Engineering toil and fragmentation

In its early years, Spotify organized its engineers into autonomous development teams called squads, which operated like startups and enabled rapid innovation. Squads made their own decisions, built their own software, and moved fast. However, as the company grew and each team maintained its own tools, frameworks and processes, this autonomy created a fragmented technology ecosystem.

Over the years, the number of software components grew quickly and created a maintenance burden. Engineers faced toil: necessary but repetitive work that detracted from building new features. Spotify refers to this type of engineering toil as “the work that gets between me and what I want to do.”

The solution: A shift to fleet management

To address fragmentation, Spotify inverted the problem. Instead of squads handling maintenance individually, the platform organization automated it. The platform team adopted a new mantra: “We take the pain away, so our users (engineers) don’t have to perform this toil.”

This shift became a company objective and was guided by three core principles:

Standardize the technology stack through the Golden Tech framework
Help teams adopt Golden Tech to ensure consistency
Leverage automation to remove maintenance toil from engineering teams

Standardizing through the Golden Tech framework

Standardizing technology involved a multifaceted approach. As part of the Golden Tech framework, Spotify built a tech radar tool defining a set of technology standards for everyone to build on, which they called golden technologies. Spotify also standardized how they build and deploy different types of software through golden paths, or self-paced guidebooks. To further enable and facilitate standardization, the team also implemented software templates within the Golden Tech framework that streamline the creation of backend services, pipelines, websites, or any other type of software.

Measuring progress and adoption

To ensure company-wide adoption of Golden Tech and unify its technology stack, Spotify developed a tool that measures the quality, reliability, and alignment of the software ecosystem. The tool incentivizes teams to reach certification levels by passing checks for their software, progressing toward the "golden state." The closer teams get, the more automated, free maintenance updates they receive.

Beyond individual teams, the tool provides department-wide visibility into overall tech health and certification levels. It also tracks each department's progress and how well it aligns with organizational goals.
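As a rough illustration of check-based certification, here is a minimal Python sketch. The level names (bronze, silver, gold), the check fields, and the function name are all hypothetical; the article does not describe how Spotify's tool is implemented.

```python
from typing import Callable, Dict, List, Tuple

Component = Dict[str, object]
Check = Callable[[Component], bool]

# Hypothetical levels, ordered from lowest to highest.
# Each level lists the checks a component must pass to reach it.
LEVELS: List[Tuple[str, List[Check]]] = [
    ("bronze", [lambda c: bool(c.get("owner"))]),
    ("silver", [lambda c: c.get("has_tests") is True]),
    ("gold", [lambda c: c.get("uses_golden_tech") is True]),
]


def certification_level(component: Component) -> str:
    """Return the highest level whose checks, and all lower levels' checks, pass."""
    level = "none"
    for name, checks in LEVELS:
        if not all(check(component) for check in checks):
            break  # a failed level blocks every level above it
        level = name
    return level


service = {"owner": "squad-x", "has_tests": True, "uses_golden_tech": False}
print(certification_level(service))
```

Because levels are cumulative in this sketch, a team always sees the single next check standing between it and the higher level.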

This measurement tool, like the Spotify tech radar, software templates, and CI/CD plugins, is built as a plugin for Backstage. Backstage is Spotify’s developer portal, which serves as a central hub for tools, creating a single pane of glass for platform engineering solutions. This interoperability unlocks compound value across the organization by simplifying the developer experience.

Fleet Shift: Automating large-scale code changes

Once Spotify standardized and centralized its tech ecosystem, the next step was automating maintenance. Enter Fleet Shift, a tool designed to perform large-scale code changes across the fleet.

How Fleet Shift works

Fleet Shift automates updates across repositories by executing shifts — scripts that modify code at scale. Here’s how it works:

Engineers define a Shift, i.e. a set of instructions for code changes, and package it as a Docker container.
Fleet Shift executes the Shift as a Kubernetes job, cloning repositories, applying transformations, and creating pull requests.
Fleet Shift listens for errors or success. Automated checks ensure that changes are safe, and PRs can be merged automatically if CI/CD tests pass.

This automation has dramatically reduced migration timelines. For example, upgrading Spotify’s internal service framework used to take 200 days — Fleet Shift reduced this to less than seven days.
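To make the idea of a shift concrete, here is a minimal Python sketch of the transformation step only. It assumes a hypothetical `java.version=<n>` property in each repository's build file and models repositories as in-memory strings. Cloning, the Docker packaging, the Kubernetes job, and pull request creation are omitted, and none of the names below come from Fleet Shift itself.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ShiftResult:
    repo: str
    changed: bool
    new_content: str


def bump_java_version(build_file: str, target: str) -> str:
    """Rewrite a hypothetical `java.version=<n>` property to the target version."""
    return re.sub(r"(?m)^(java\.version=)\S+$", rf"\g<1>{target}", build_file)


def run_shift(repos: Dict[str, str], target: str) -> List[ShiftResult]:
    """Apply the transformation to every repository's build file.

    In a real system, each changed file would become a pull request.
    """
    results = []
    for name, content in repos.items():
        updated = bump_java_version(content, target)
        results.append(ShiftResult(name, updated != content, updated))
    return results


repos = {
    "service-a": "name=service-a\njava.version=17\n",
    "service-b": "name=service-b\njava.version=21\n",
}
for result in run_shift(repos, "21"):
    print(result.repo, "needs a pull request" if result.changed else "already up to date")
```

Keeping the transformation a pure function of file content makes it easy to test in isolation before running it against the whole fleet.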

Technical requirements for fleet management

Achieving fleet management at scale requires a robust technical foundation that includes the following.

Declarative infrastructure

Spotify transitioned to declarative infrastructure to simplify configuration management across thousands of different services at scale. To be able to apply automated changes to the infrastructure, the configuration needed to be data (e.g. JSON, YAML) rather than code (e.g. TypeScript, HCL). Declarative infrastructure enables:

Automated updates across all services
Consistency in provisioning and configuration
Easier automation of infrastructure changes
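The reason data-based configuration is easier to automate can be shown with a short Python sketch. The service names and the `runtime` and `replicas` fields are hypothetical; the point is only that a JSON config can be parsed, changed, and written back mechanically, which is much harder when configuration is arbitrary code.

```python
import json
from typing import Dict


def set_runtime(config_text: str, runtime: str) -> str:
    """Parse a JSON service config, update one field, and serialize it back."""
    config = json.loads(config_text)
    config["runtime"] = runtime  # hypothetical field name
    return json.dumps(config, indent=2, sort_keys=True)


# Hypothetical per-service configs, as they might be stored in each repository.
configs: Dict[str, str] = {
    "playlist-service": '{"name": "playlist-service", "runtime": "java17", "replicas": 3}',
    "search-service": '{"name": "search-service", "runtime": "java17", "replicas": 5}',
}

# The same edit is applied uniformly; every other field is left untouched.
updated = {name: set_runtime(text, "java21") for name, text in configs.items()}
print(updated["playlist-service"])
```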

Version control and dependency management

Managing dependencies across thousands of services requires:

Centralized version control and dependency management to track updates. An example of this is Spotify's BOM, which helps manage the versions of Spotify's most important dependencies.
Automated dependency upgrades to ensure that security patches and framework updates are applied seamlessly.
Golden Tech adoption, where standardized libraries and frameworks reduce fragmentation.
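The idea behind a bill of materials (BOM) can be sketched in a few lines of Python. This is an illustration under assumptions, not Spotify's BOM: the library names and version numbers are made up, and services here declare only which dependencies they use while the central BOM decides the versions.

```python
from typing import Dict, List


def resolve(bom: Dict[str, str], dependencies: List[str]) -> Dict[str, str]:
    """Look up each declared dependency's version in the central BOM."""
    missing = [d for d in dependencies if d not in bom]
    if missing:
        raise KeyError(f"not managed by the BOM: {missing}")
    return {d: bom[d] for d in dependencies}


# Hypothetical central BOM: one place that pins versions for the fleet.
bom = {"http-client": "2.4.1", "metrics": "1.9.0"}

# Services list dependencies without versions.
service_a = ["http-client", "metrics"]
service_b = ["metrics"]

print(resolve(bom, service_a))

# A single change to the BOM updates every service that uses the library.
bom["metrics"] = "1.9.1"
print(resolve(bom, service_b))
```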

CI/CD at scale

With hundreds to thousands of migrations per year, Spotify reimagined its CI/CD platform to support:

Massive build and deployment volumes
Automated testing pipelines to validate changes
Continuous integration, ensuring services remain in a golden state

Cultural shift: Trusting automation

One of the biggest challenges wasn’t building the software — it was changing the culture. Engineers had to trust automated updates happening without their direct involvement.

To make this work, Spotify emphasized strong engineering practices. Test coverage and automation are non-negotiable. Engineers must write tests to catch future issues, an internal principle known as the Beyoncé Rule: “If you liked it, then you should have put a test on it.”

Additionally, the infrastructure platform team had to shift from passive to active product ownership. Instead of merely introducing new platform capabilities, it now owns adoption and uses fleet management to roll out changes that delight the engineers who use its platform.

The results: Automation at scale

Fleet Shift has authored over 1.8 million contributions, with a 3:1 ratio of bot contributions to human contributions, and climbing.

Shifts range from simple version bumps to complex multi-repo transformations. For example, when Spotify discovered a critical security vulnerability in Log4j, Fleet Shift updated 80% of the fleet in less than 11 hours — a task that would have taken weeks or months manually.

The future of fleet management at Spotify

Spotify continues to refine fleet management, focusing on:

Cleaning up production environments by retiring experimental and deprecated software to reduce cloud costs and security risks
Exploring monorepos and moving from a polyrepo world to a monorepo structure for better dependency management
Leveraging LLMs and AI and researching how AI can further reduce engineering toil

Lessons for technology leaders

Spotify’s journey offers useful insights for organizations managing software at scale, including:

Standardization accelerates automation. Define a Golden Tech framework.
Declarative infrastructure simplifies updates. Use data-based configurations.
CI/CD must scale with automation. Optimize pipelines for frequent deployments.
Test automation is non-negotiable. Ensure reliability across all changes.
Cultural change is just as important as technical change. Be disciplined about product thinking to earn the engineers’ trust in automation.
Invest in developer experience. Tools that centralize resources and streamline workflows make it easier for engineers to adopt new standards and practices. A great developer experience accelerates adoption and reduces resistance to change.

Spotify’s fleet management journey is a testament to the power of standardization, automation, and cultural change in managing software at scale. By reducing engineering toil and empowering developers, you can build a foundation for long-term innovation.

If there’s one takeaway from Spotify’s story, it’s this: Automation isn’t just a tool — it’s a mindset. Embracing it can transform how your organization builds, maintains, and innovates at scale.

To learn more, check out the first article in this series, Fleet First for Better Developer Experience and Faster Software Delivery.
