We are back from Portland after a fantastic three days at Monitorama. It’s been amazing to be part of this community that has grown over the last few years. As we’ve worked to continually improve our monitoring solution, develop best practices for our on-call engineers, and deliver awesome features to our customers, it’s always nice to have a community to share key learnings and discuss new ideas.
Monitorama is one of the best events for connecting with, engaging, and learning from our broader community. We had lots of great conversations ranging from monitoring in theory to monitoring in reality.
Many of our conversations centered on how monitoring evolves once you’ve figured out how to scale your infrastructure. Using a mix of open-source tools and commercial solutions, organizations are now dealing with the reality of monitoring and managing their cloud infrastructure, microservices, and apps. This requires ingesting and analyzing a high volume of metrics from hundreds of web services, while also dealing with high cardinality, meaning a very large number of distinct time series once every combination of dimension values is counted.
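To make the cardinality problem concrete, here is a rough back-of-the-envelope sketch in Python. The dimension names and counts are hypothetical, chosen only for illustration, and the result is an upper bound, since not every combination of values shows up in practice.

```python
from math import prod

# Hypothetical dimensions attached to a single metric, with the number of
# distinct values each one can take. These figures are illustrative only.
dimension_values = {
    "service": 200,
    "host": 50,
    "endpoint": 20,
    "status_code": 5,
}

# Upper bound on distinct time series for one metric: every combination of
# dimension values is potentially its own series.
series_per_metric = prod(dimension_values.values())

# Adding one more dimension multiplies the bound rather than adding to it.
with_container_id = series_per_metric * 10  # e.g. 10 containers per host

print(f"{series_per_metric:,} potential series for one metric")
print(f"{with_container_id:,} after adding a 10-value dimension")
```

The point of the arithmetic is that each new dimension multiplies the series count, which is why cardinality tends to grow faster than the infrastructure itself.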
High cardinality is a technical hurdle for everyone. While already supporting class-leading scale, we’re working hard behind the scenes to make SignalFx the ultimate observability solution. New improvements to our TSDB storage systems, our metadata systems, and our real-time SignalFlow™ analytics will soon ensure that our customers never have to compromise on scale, context, granularity or visibility into their production environments.
We know (from our own experience!) what it takes to build a real-time monitoring and alerting solution. Getting to scale is time and resource intensive, and we applaud those looking to modernize their monitoring solutions. However, we would be remiss if we didn’t share some key considerations for those starting down the path of building their own cloud monitoring:
- Upfront and incremental infrastructure costs. The nature and quantity of the data being ingested and stored require a significant amount of infrastructure, especially storage. Compromises can be made to lower costs, but when users complain that queries take too long, it is common to spend more on high-end hardware to improve performance.
- A dedicated monitoring team. We heard about teams ranging from a few to more than a dozen full-time engineers dedicated to infrastructure monitoring. Those engineers often end up focusing exclusively on the operational aspects of the system rather than building out new features.
- Real-time data and insight. While there is a time and place for ingesting and aggregating logs, we’re living in a world where fast-evolving anomalies can quickly turn into outages. For example, evaluating alert conditions only after the data has been fully collected and stored in a database means it may be minutes before you can start taking action.
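As an illustration of the last point, the sketch below evaluates an alert condition on each data point as it arrives instead of querying a database after the fact. It is a minimal Python example under our own assumptions: the `ThresholdAlert` class, the `evaluate_stream` helper, and the sample values are hypothetical and do not represent SignalFx or SignalFlow internals.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class ThresholdAlert:
    """Hypothetical rule: fire after `consecutive` points above `threshold`."""
    threshold: float
    consecutive: int
    _streak: int = 0  # running count of consecutive breaching points

    def observe(self, value: float) -> bool:
        """Update state with one point; return True when the alert fires."""
        if value > self.threshold:
            self._streak += 1
        else:
            self._streak = 0
        # Fire exactly once, on the point that completes the streak.
        return self._streak == self.consecutive


def evaluate_stream(
    points: Iterable[Tuple[int, float]], alert: ThresholdAlert
) -> Iterator[int]:
    """Yield the timestamp of each point at which the alert fires."""
    for timestamp, value in points:
        if alert.observe(value):
            yield timestamp


# Illustrative (timestamp, cpu_percent) points arriving one at a time.
points = [(0, 40.0), (1, 92.0), (2, 95.0), (3, 97.0), (4, 50.0)]
fired = list(evaluate_stream(points, ThresholdAlert(threshold=90.0, consecutive=3)))
print(fired)
```

Because the rule keeps only a small running state per series, the decision is made the moment the third breaching point arrives, with no wait for a batch write and a follow-up query.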