Splunk at Weill Cornell Medical College
"Splunk helped us push the needle in the haystack to the top of the stack." - Josh Gluck, Assistant Director, Network and Communication Services Group
Solution Areas: Security, Network Management, Server Management
Stop the bleeding - before it starts
Weill Cornell Medical College is among the top-ranked clinical and medical research centers in the country.
Weill Cornell's network operations and security teams are challenged with maintaining the availability and security of the network and telecoms infrastructure including WAN, LAN and VoIP services.
When a security incident is suspected, Weill Cornell's team needs to quickly assess the scope of the issue. Before Splunk, that often meant checking more than 500 separate log files, and while the team searched, an attack could proceed unchecked.
With Splunk deployed, it's now possible to understand the full extent of exposure in just a few minutes.
About Weill Cornell Medical College
Founded in 1898 and affiliated with what is now New York-Presbyterian Hospital since 1927, Weill Cornell Medical College is among the top-ranked clinical and medical research centers in the country. In addition to offering degrees in medicine, Weill Cornell has PhD programs in biomedical research and education at the Weill Graduate School of Medical Sciences. With neighboring Rockefeller University and the Sloan-Kettering Institute, it has also established a joint MD-PhD program for students to intensify their pursuit of Weill Cornell's triple mission of education, research, and patient care.
Challenges
Josh Gluck is the Assistant Director of the Network and Communications Services Group at Weill Cornell. He and his team of 12 analysts, network engineers and architects are responsible for the college's telecoms infrastructure. This includes the WAN, LAN and VoIP networks.
In addition to Josh's team, there is a security team that owns policy and forensics while Josh's team operates security devices as part of the network infrastructure.
What's the prognosis?
When a security intrusion is suspected or IDS alerts fire, the priority for the network and security teams is to establish the extent of exposure as quickly as possible. Is there an attacker present whose session needs to be terminated? Are there backdoors to be closed? Did the attacker access any systems? What did they do on those systems? Was confidential data seen?
Before Splunk, answering these questions meant manually filtering through more than 500 different logfiles in different locations on different servers. This could take hours, during which the organization had no idea of its exposure and was potentially undergoing active harm.
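To make the manual approach concrete, here is a minimal sketch of that kind of search: walking a directory of collected logfiles and pulling out every line that mentions a suspect address. It is an illustration only, not the team's actual procedure; the function name `search_logs`, the `*.log` naming pattern, and the directory layout are all assumptions.

```python
from pathlib import Path


def search_logs(root, needle, pattern="*.log"):
    """Return (file, line_number, line) for every log line containing needle.

    Hypothetical helper: assumes the logs have already been copied
    under one directory tree and end in ".log".
    """
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        # errors="replace" keeps the scan going if a file has odd bytes
        with path.open(errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if needle in line:
                    hits.append((str(path), line_number, line.rstrip("\n")))
    return hits
```

Even with a helper like this, every new question means another full pass over every file, and the logs first have to be gathered from each server by hand.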
What you don't know can hurt you
With IT data so scattered, administrators would naturally look at it only when they needed to investigate an incident. Yet IDS alerts and other correlation would only catch what they already knew to look for.
Josh knew that if administrators could effectively review activity proactively and get to know the patterns of normal behavior, they would pick up on the clues of both security and operational problems much earlier, before any harm could occur. But first he had to get all this data into a tool that made reviewing it practical.
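The idea of learning normal behavior and spotting departures from it can be sketched in a few lines. The example below is a hypothetical illustration, not how Splunk or the Weill Cornell team does it: it assumes daily event counts per host are already available, and the function name `unusual_hosts` and the three-standard-deviation threshold are assumptions chosen for the example.

```python
from statistics import mean, pstdev


def unusual_hosts(history, today, threshold=3.0):
    """Flag hosts whose event count today departs from their own baseline.

    history: {host: [past daily event counts]}  (assumed input shape)
    today:   {host: event count so far today}
    """
    flagged = []
    for host, count in today.items():
        past = history.get(host, [])
        if len(past) < 2:
            # No baseline yet: a new host is worth a look by itself
            flagged.append(host)
            continue
        baseline = mean(past)
        # Fall back to 1.0 so a perfectly flat history does not divide the
        # comparison down to zero tolerance
        spread = pstdev(past) or 1.0
        if abs(count - baseline) > threshold * spread:
            flagged.append(host)
    return sorted(flagged)
```

A check like this only works once the data from every system sits in one place, which is the problem Josh had to solve first.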
Splunk at Weill Cornell
Josh and his team selected and purchased Splunk in 2006 as part of an upgrade to their security process and infrastructure. It was clear that Splunk's instantaneous search would solve the problems of investigations, while the interactive interface would make proactive log review a reality. However, a few weeks after the purchase they still hadn't put it into production because of the press of other projects (not unusual in a busy IT group). But one incident changed all that...
Wait, don't we have Splunk?
Late one afternoon, the team was in the midst of a serious security investigation of an incident in progress. The incident started at 2pm. Four hours later, everyone was still grepping through logfiles, yet no one was any closer to figuring out the source or impact of the intrusion. Then a member of the team suddenly remembered - they had just bought a Splunk license for exactly this sort of incident! Too bad it wasn't implemented yet - but maybe that wouldn't be a problem.
The team downloaded and installed Splunk on a server with some spare capacity and copied over all the relevant log files they had been manually reviewing for the last few hours.
The installation and data load took less than 15 minutes, and with a few searches they closed the discovery phase of the investigation, saving many hours of manual work.
On a night when no one expected to get much sleep, the entire team got home for a full night's rest.
"Splunk proved its value the first time we used it for a security incident."
Always live
After that first incident, getting Splunk into live production became a priority. Markus Bronnimann, resident Unix Architect, quickly installed a production Splunk server and brought up live data inputs from across the environment. Splunk currently takes in live syslog and tails application logfiles across more than 275 hosts, including Cisco switch infrastructure, the university firewall, Apple Xserve web servers, Solaris mail systems, Red Hat application servers, Windows file servers, and even some AIX-based clinical systems.
Trust your gut
Now that Splunk is implemented, ad hoc searches happen all the time, based on hunches or suspicions that the team has about what's going on. Network seems a little slow today? The network administrator can run a search for recent events on the switch and maybe pick up on a configuration change or flaky interface.
Virus got to a machine? A quick search will show where it tried to go.
Under observation
All operations personnel now make a morning practice of data reviews for their systems - especially those that are new or have had recent configuration changes.
For example, Weill Cornell recently rolled out a new phone system. Splunk let the team proactively look for errors, dropped calls and other quality problems, and fix them before users could complain.