Case Study: FreshDirect
"Thanks to Splunk, issues are quickly identified and resolved before they become problems that affect our customers." - Robert Reilly, Manager of Systems Engineering
Solution Areas: Application Management, Service Desk
The clock's ticking, customers are waiting
FreshDirect is the leading online fresh food and meals provider serving the metropolitan New York and New Jersey region. Before Splunk, FreshDirect struggled with a lot of the same challenges in maintaining availability as other online retailers.
- Regular changes to the site continually introduced new problems. Investigating the problems took too long because the evidence was scattered and developers lacked direct access.
- Content problems would go undetected. Customer order confirmation emails would go undelivered by their ISPs and FreshDirect's phones would start ringing.
- Back office application failures would impact operational efficiency and cash flow because they wouldn't be discovered fast enough.
But FreshDirect found that these challenges weren't insurmountable once they discovered Splunk. Today, failures still happen, but they are found and fixed fast enough that customers barely see a hiccup.
About FreshDirect
FreshDirect is the leading online fresh food and meals provider serving the metropolitan New York and New Jersey region. They operate a plant and headquarters in Long Island City and run their servers out of a Manhattan colocation facility. Their site runs on homegrown J2EE applications deployed on WebLogic with an Oracle database back end and SUSE Linux on VMware.
Challenges
Robert Reilly is FreshDirect's Manager of Systems Engineering. He leads a team of 3 administrators responsible for all production support. Before Splunk, his team spent a large proportion of its time logging in to one host after another looking for the source of site problems as well as capturing data for developers who didn't have the ability to access locked-down production systems themselves.
Robert also knew that other teams were similarly frustrated in their ability to effectively use logs and other IT data. The content team was troubled that typos resulting in broken links and missing images were not systematically detected despite the existence of access logs full of 404s. The plant team had a hard time chasing down failures in its automated conveyor belts and other systems.
It's gotta be in here somewhere
When customers reported intermittent server errors while using FreshDirect's online store, customer support would escalate to Robert's systems administration team. The systems administrators would then have to log in to each of dozens of virtual hosts separately and grep through the WebLogic server logs at the command line to find the errors and identify which hosts needed a restart or code rollback.
The calls are coming in
FreshDirect sends out order confirmations via email. When customers don't get their emails, they start calling to find out if their groceries are going to show up. Those calls cost money and customer goodwill. The OpenNMS monitoring system alerts systems administrators when the mail queues start filling up with undelivered email. But to actually fix the problem and release the email, systems administrators needed about 30 minutes to grep through qmail logs to identify the major domains that were backed up and find alternate MTAs for those domains that were accepting email.
Rollout blues
Like most innovative online merchants, FreshDirect is constantly rolling out new features on its site. Unfortunately every change brings new issues. Developers who need to debug the problems don't have access to production systems for security, stability and compliance reasons. Before Splunk, Robert's team of administrators would spend 4-5 hours following each code deployment servicing dozens of requests from developers for access to logs, the output of diagnostic commands and copies of configuration files and scripts.
Death by a thousand little cuts
Customers make a lot of judgments about whether they trust an online retailer and whether they want to spend more time at their store based on the way the site looks.
FreshDirect's content team wanted to get alerts on all 404 errors in order to find and fix all of these little issues before they impacted the customer experience. But FreshDirect's web log analysis tool filters out errors to calculate site statistics. Lacking any other tool to report and alert on many millions of web requests every day, they had to instead waste many hours testing the site content manually.
Taking the house down
As a high-profile site, FreshDirect is a target for hackers. The biggest nuisance security issue is attempted denial of service attacks: malicious traffic that floods the site, drives up bandwidth costs and puts a high load on the application servers, which could cause legitimate customers to encounter server busy errors. Before Splunk, combing through the web server logs to find the source IPs for the attacks took 4-5 hours because of the huge volume of logs. Unlike an indexed search, grep has to scan every line for matches.
Putting the mortar in clicks & mortar
It's not just web applications that fail. FreshDirect orders are fulfilled by automated conveyor and pick and ship applications in its Long Island City plant. Before Splunk, plant operators had to look through logs from these systems manually to find problems.
Cash is king
The grocery business is known for razor thin margins where cash flow is critical on a daily basis. FreshDirect generates its cash via nightly settlement transactions recorded in custom application logs. Sometimes these would fail, and wouldn't be noticed until the morning. There wasn't any good way to monitor this before Splunk because the failures were silent.
Splunk at FreshDirect
Storefront deployment
The primary Splunk deployment runs in the Manhattan colocation facility that hosts FreshDirect's storefront. Splunk runs on every WebLogic virtual host and forwards the WebLogic server logs in real time to a central Splunk indexing and search server. Splunk also indexes web access logs from FreshDirect's Netscaler web proxies, qmail MTA logs, and the nightly settlement application logs.
Plant deployment
A second Splunk deployment runs in the Long Island City plant location, indexing logs from plant systems including Diamond Phoenix conveyors. Distributed search allows users to follow transactions from the storefront into the plant.
Daily Use
With Splunk in place, FreshDirect has been able to meet all of its IT data challenges by putting Splunk into the hands of all of its systems administrators, developers, content team, plant operators and management.
Finding application problems
Now, when customers report application errors, systems administrators run a single search to retrieve all errors and see what application servers are causing the error, shaving problem investigations from 30 minutes or more down to less than 5 minutes.
Investigating mail delivery issues
When OpenNMS alerts that the mail queues are full, a single search brings back a report of every domain with queued messages. Another click on each domain finds other MTAs for the domain that are accepting mail. Administrators usually reroute the messages before customers even notice, saving hundreds of customer service calls.
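The first step of that investigation, finding which domains have deliveries backing up, can be illustrated outside Splunk. The Python below is only a sketch, not FreshDirect's actual search: it assumes the standard qmail-send log wording ("starting delivery N: msg M to remote address" followed later by "delivery N: deferral: ..."), the function name deferred_domains is hypothetical, and the sample lines are invented for illustration.

```python
import re
from collections import Counter

# Assumed qmail-send log wording; timestamps are omitted from the sample lines.
START = re.compile(r"starting delivery (\d+): msg \d+ to remote (\S+)")
RESULT = re.compile(r"delivery (\d+): (success|failure|deferral):")

def deferred_domains(lines):
    """Count deferred remote deliveries per recipient domain."""
    in_flight = {}        # delivery number -> recipient domain
    deferred = Counter()
    for line in lines:
        started = START.search(line)
        if started:
            address = started.group(2)
            in_flight[started.group(1)] = address.rsplit("@", 1)[-1].lower()
            continue
        finished = RESULT.search(line)
        if finished:
            domain = in_flight.pop(finished.group(1), None)
            if domain and finished.group(2) == "deferral":
                deferred[domain] += 1
    return deferred.most_common()

# Invented sample lines, for illustration only.
sample = [
    "starting delivery 1: msg 101 to remote alice@example.net",
    "starting delivery 2: msg 102 to remote bob@example.org",
    "delivery 1: deferral: Sorry,_I_wasn't_able_to_establish_an_SMTP_connection._(#4.4.1)/",
    "delivery 2: success: 192.0.2.10_accepted_message./",
    "starting delivery 3: msg 103 to remote carol@example.net",
    "delivery 3: deferral: Sorry,_I_wasn't_able_to_establish_an_SMTP_connection._(#4.4.1)/",
]

for domain, count in deferred_domains(sample):
    print(domain, count)
```

Pairing each result line with its "starting delivery" line is what turns raw log lines into a per-domain report; finding an alternate MTA for a backed-up domain is a separate lookup that this sketch does not attempt.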
Direct developer access
Since rolling Splunk out, Robert hasn't had any more calls from developers needing production data. They log in to Splunk and proactively watch trends after each code rollout, discovering problems earlier than before. And if they get an escalation, they have all the data they need at their fingertips.
"Splunk allows us to concentrate on debugging the problem and not being log butlers for our support team."
Monitoring for content problems
Robert has scheduled some search-based alerts in Splunk to email the content team a list of URLs and referrers with 404s.
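The report such an alert delivers is essentially a count of 404 responses grouped by URL and referrer. The Python below is a minimal sketch of that grouping, not the search FreshDirect runs: it assumes access logs in the common "combined" format (the case study does not give the proxy log format), the function name report_404s is hypothetical, and the sample lines are invented.

```python
import re
from collections import Counter

# Assumed "combined" access log layout; the real proxy log format is not stated.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?:\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)"'
)

def report_404s(lines):
    """Count (url, referrer) pairs for requests that returned 404."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("status") == "404":
            counts[(match.group("url"), match.group("referrer"))] += 1
    return counts.most_common()

# Invented sample lines, for illustration only.
sample = [
    '10.0.0.1 - - [01/Jan/2008:10:00:00 -0500] "GET /img/missing.jpg HTTP/1.1" 404 0 "http://example.com/produce" "Mozilla"',
    '10.0.0.2 - - [01/Jan/2008:10:00:01 -0500] "GET /index.jsp HTTP/1.1" 200 512 "-" "Mozilla"',
    '10.0.0.3 - - [01/Jan/2008:10:00:02 -0500] "GET /img/missing.jpg HTTP/1.1" 404 0 "http://example.com/produce" "Mozilla"',
]

for (url, referrer), count in report_404s(sample):
    print(count, url, referrer)
```

Keeping the referrer alongside the URL matters for a content team: the referrer points to the page that contains the broken link or image reference, which is where the typo has to be fixed.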
Alerting on settlement failures
Robert set up a deceptively simple search-based alert in Splunk to proactively notify him if the nightly settlement didn't happen. The search looks for the settlement event by its event type, and alerts when there are 0 events.
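The logic is worth spelling out because it inverts the usual pattern: a silent failure leaves no error to match, so the alert fires on the absence of the expected event. The Python below is a minimal sketch of that check under stated assumptions, not Robert's Splunk search: the function name settlement_missing, the 24-hour look-back window and the sample timestamps are all hypothetical.

```python
from datetime import datetime, timedelta

def settlement_missing(event_times, now, window=timedelta(hours=24)):
    """Return True when no settlement event falls inside the look-back window."""
    recent = [t for t in event_times if now - window <= t <= now]
    return len(recent) == 0

# Hypothetical check time and event timestamps, for illustration only.
now = datetime(2008, 1, 2, 6, 0)
ran_last_night = [datetime(2008, 1, 2, 1, 30)]
last_ran_two_days_ago = [datetime(2007, 12, 31, 1, 30)]

print(settlement_missing(ran_last_night, now))         # False: settlement was seen
print(settlement_missing(last_ran_two_days_ago, now))  # True: raise the alert
```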
Investigating denial of service attacks
If OpenNMS alarms on a spike in the number of WebLogic active sessions, Robert immediately searches his Netscaler web proxy logs in Splunk. The DNS resolution in these logs enables him to quickly find connections from outside the New York / New Jersey area and home in on the offenders so they can be blocked. Splunk's lightning fast indexed search cuts investigation time down to 15 minutes or less.
Physical plant monitoring
FreshDirect's plant operations team has set up some simple alerts based on searches of their conveyor belt logs for errors. Whenever a new problem arises, they configure a new alert in Splunk to catch it proactively the next time.
Exploring customer and application behavior
FreshDirect's VP in charge of reporting and architecture, one of the company's founders, has also picked up Splunk. He uses it as a window into customer and application behavior, both watching for trends and following individual user sessions to gain insight into ways to make the site experience better. While FreshDirect has other tools to do web stats, Splunk provides an entirely different perspective with the ability to follow the trail of any activity on an ad hoc basis.
Results
FreshDirect has reaped huge benefits from Splunk within the first 6 months of deployment.
Increased availability
- 90% reduction in application server error isolation time
- Faster settlement error correction improves cash flow
- Broken content fixed faster improving shopping experience
- Proactive incident identification following code rollouts
- Denial of service attacks isolated and blocked in minutes instead of hours
Improved customer service
- Problems found and fixed before customers complain
Lower costs
- Hundreds of customer service calls/week averted by rapid investigation of queued order confirmation emails
- 4-5 hours/week of SA time eliminated by providing direct developer access to production data
Bottom Line
Real-time search has made FreshDirect's entire team more self-sufficient and effective and enabled them to drive major improvements in overall service quality and availability while controlling operational cost. And Robert, who found and implemented Splunk, has been promoted. Coincidence?