What is Machine Data?
At Splunk we talk a lot about machine data, and by that we mean all of the data generated by the applications, servers, network devices, security devices and other systems that power your business. Machine data contains a definitive record of the activity and behavior of your customers, users, transactions, applications, servers, networks, and so on.
Machine data is more than just logs -- it's configuration data, data from APIs and message queues, change events, the output of diagnostic commands and more. It's also a much wider variety of log data than network security and compliance-centric Log Management and Security Information and Event Management systems (SIEMs) are built to handle.
Splunk users know that there are thousands of distinct log formats, many from custom applications, that are critical to diagnosing service problems, detecting sophisticated security threats and demonstrating compliance. Here we've outlined some of the most important machine data sources and what they can tell you about your IT infrastructure and the behavior of your users or would-be attackers.
But remember, this list is just the starting point. Every environment has its unique footprint of machine data and logs are only part of the picture.
Application Logs
Most homegrown and packaged applications write local logfiles, often via logging services built into middleware: J2EE application servers like WebLogic, WebSphere and JBoss, .NET, PHP and more. These files are critical for day-to-day debugging of production applications by developers and application support. They're also often the best way to report on business and user activity and detect fraud scenarios, since they have all the details of transactions. When developers put timing information into their log events, they can also be used to monitor and report on application performance.
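As a rough illustration of putting timing information into log events, the sketch below wraps a unit of work and logs how long it took as key=value pairs. The logger name "orders", the field names and the helper functions are hypothetical choices made for this example, not a prescribed format.

```python
import logging
import time

# Hypothetical logger name; a real application would use its own.
logger = logging.getLogger("orders")

def format_event(action, elapsed_ms, **fields):
    """Build a key=value log line that carries timing information."""
    parts = [f"action={action}", f"elapsed_ms={elapsed_ms:.1f}"]
    # Sort the extra fields so the line layout is stable across calls.
    parts += [f"{key}={value}" for key, value in sorted(fields.items())]
    return " ".join(parts)

def timed(action, func, **fields):
    """Run func(), log how long it took, and return its result."""
    start = time.perf_counter()
    try:
        return func()
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info(format_event(action, elapsed_ms, **fields))
```

Calling timed("checkout", process_order, user="u1") would log one event per call, with process_order standing in for whatever function the application already has. Events in this shape can later be searched and aggregated to report on response times.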
Web Access Logs
Web access logs report every request processed by a web server: what client IP it came from, what URL was requested, what the referring URL was, and data regarding the success or failure of the request. They're most commonly processed to produce web analytics reports for marketing: daily counts of visitors, most requested pages, and the like.
They're also invaluable as a starting point to investigate a user-reported problem, since the log of a failed request can establish the exact time of an error. Web logs are fairly standard and well structured. The only challenge is sheer volume, with busy websites routinely experiencing billions of hits a day.
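To make those fields concrete, here is a minimal sketch that pulls the client IP, URL, referrer and status out of a single access log line. It assumes the widely used "combined" layout; servers can be configured differently, and the sample line is made up for illustration.

```python
import re

# Pattern for the common "combined" access log layout (an assumption;
# a given server may be configured to write a different layout).
COMBINED = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_access_line(line):
    """Return the fields of one access log line, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # HTTP status codes of 400 and above indicate a failed request.
    fields["failed"] = fields["status"] >= 400
    return fields

# Made-up sample line, using documentation-reserved addresses and domains.
SAMPLE = (
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /cart/checkout HTTP/1.1" 500 1024 '
    '"https://example.com/cart" "Mozilla/5.0"'
)
```

Parsing SAMPLE gives a failed request with its timestamp, which is the kind of detail that pins down when a user-reported error happened.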
Web Proxy Logs
Nearly all enterprises, service providers, institutions and government organizations that provide employees, customers or guests with web access use some type of web proxy to control and monitor that access. Web proxies log every web request made by users through the proxy. They may include corporate usernames and URLs hit. These logs are critical for monitoring and investigating 'terms of service' abuses or violations of corporate web usage policy, and are also a vital component of effective monitoring and investigation of data leakage.
Call Detail Records
Call Detail Records (CDRs), Charging Data Records and Event Data Records are some of the names given to events logged by telecoms and network switches. CDRs contain useful details of the call or service that passed through the switch, such as the number making the call, the number receiving the call, call time, call duration, type of call, etc. As communications services move to Internet Protocol-based services, this data is also referred to as IPDRs, containing details such as IP address, port number, etc. The specs, formats and structure of these files vary enormously, and keeping pace with all the permutations has traditionally been a challenge. Yet the data they contain is critical for billing, revenue assurance, customer assurance, partner settlements, marketing intelligence and more. Splunk can quickly index the data and combine it with other business data to enable users to derive new insights from this rich usage information.
Clickstream Data
Use of a webpage on a website is captured in clickstream data. This provides insight into what a user is doing, useful for usability analysis, marketing and general research. Formats for this data are non-standard and actions can be logged in multiple places, such as the web server, routers, proxy servers, ad servers, etc. Existing monitoring tools look at a partial view of the data, from a specific source. Existing web analytics and data warehouse products often sample the data, so they miss the complete view of behavior, and they provide no real-time analysis.
Message Queues
Message queuing technologies like TIBCO, JMS and AquaLogic are used to pass data and tasks between service and application components on a publish/subscribe basis. Subscribing to these message queues is a good way to debug problems in complex applications: you can see exactly what the next component down the chain received from the prior component. Separately, message queues are increasingly being used as the backbone of logging architectures for applications.
Packet Data
Data generated by networks is processed using tools such as tcpdump and tcpflow, which generate pcap data and other useful packet-level and session-level information. This information is necessary to handle performance degradation, timeouts, bottlenecks or suspicious activity that indicates that the network may be compromised or the object of a remote attack.
Configuration Files
There's no substitute for actual, active system configuration to understand how the infrastructure has been set up. Past configs are needed when debugging failures that occurred in the past and which may recur in the future. When configs change, it's important to know what changed and when, whether the change was authorized, and whether a successful attacker compromised the system to plant backdoors, time bombs or other latent threats.
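One simple way to see what changed between two saved copies of a config file is a line-by-line diff. The sketch below uses Python's standard difflib module; the function name and the idea of keeping periodic snapshots as text are assumptions made for this illustration.

```python
import difflib

def config_changes(old_text, new_text):
    """Return (added, removed) lines between two config snapshots."""
    added, removed = [], []
    # ndiff prefixes each output line with a two-character code:
    # "+ " for lines only in the new text, "- " for lines only in the old.
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return added, removed

# Made-up snapshots of an SSH daemon config, taken at two points in time.
PREVIOUS = "PermitRootLogin no\nPort 22\n"
CURRENT = "PermitRootLogin yes\nPort 22\n"
```

Comparing PREVIOUS with CURRENT surfaces the single changed setting. Whether that change was authorized still has to be answered by correlating it with change records and login activity.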
Database Audit Logs and Tables
Databases contain some of the most sensitive corporate data - customer records, financial data, patient records and more. Audit records of all database queries are vital to have in order to understand who accessed or changed what data when. Database audit logs are also useful to understand how applications are using databases to optimize queries. Some databases log audit records to files, while others maintain audit tables accessible via SQL.
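Where audit records live in a table, the "who changed what, and when" question becomes a query. The sketch below uses an in-memory SQLite database with a hypothetical audit_log table; real databases define their own audit schemas, so the table name, column names and rows here are illustrative only.

```python
import sqlite3

def changes_to_table(conn, table_name):
    """List (username, action, event_time) for non-read actions on one table."""
    rows = conn.execute(
        "SELECT username, action, event_time FROM audit_log "
        "WHERE object_name = ? AND action != 'SELECT' "
        "ORDER BY event_time",
        (table_name,),
    )
    return rows.fetchall()

# Hypothetical audit table populated with made-up rows for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_log "
    "(event_time TEXT, username TEXT, action TEXT, object_name TEXT)"
)
conn.executemany(
    "INSERT INTO audit_log VALUES (?, ?, ?, ?)",
    [
        ("2023-10-10T13:00:00", "app_svc", "SELECT", "patients"),
        ("2023-10-10T13:05:00", "jdoe", "UPDATE", "patients"),
        ("2023-10-10T13:06:00", "jdoe", "DELETE", "invoices"),
    ],
)
```

Calling changes_to_table(conn, "patients") returns only the UPDATE by jdoe, leaving out routine reads by the application account.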
Filesystem Audit Logs
The sensitive data that's not in databases is on filesystems, often being shared. In some industries such as healthcare, the biggest data leakage risk is consumer records on shared filesystems. Different operating systems, third party tools and storage technologies provide different options for auditing read access to sensitive data at the filesystem level. This audit data is a vital datasource for monitoring and investigating access to sensitive data.
Management and Logging APIs
Increasingly, vendors are exposing critical management data and log events through both standardized and proprietary APIs rather than by logging to files. Check Point firewalls log via the OPSEC Log Export API (OPSEC LEA). Virtualization vendors including VMware and Citrix expose configurations, logs and system status via their own APIs.
OS Metrics, Status and Diagnostic Commands
Operating Systems expose critical metrics like CPU and memory utilization and status information using command-line utilities like ps and iostat on Unix and Linux and perfmon on Windows. This data is usually harnessed by server monitoring tools but rarely persisted, yet it is potentially invaluable for troubleshooting, analyzing trends to discover latent issues and investigating security incidents.
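Persisting this kind of output can be as simple as running the command on a schedule and stamping each line with the collection time. The sketch below is one way to do that; the function name, the output layout and the demo command are assumptions made for illustration.

```python
import subprocess
import sys
from datetime import datetime, timezone

def snapshot(command):
    """Run a diagnostic command and return its output as timestamped lines."""
    stamp = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    # Prefix every output line with the collection time so it can be
    # stored and searched later as a time-series event.
    return [f"{stamp} {line}" for line in result.stdout.splitlines()]

# Portable stand-in command for trying the function on any platform.
PORTABLE_DEMO = [sys.executable, "--version"]
```

On Unix or Linux, a call such as snapshot(["ps", "-eo", "pid,pcpu,pmem,comm"]) would capture per-process CPU and memory figures. Appending the returned lines to a file keeps a history that would otherwise be lost.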
Syslog, WMI and More...
There are countless other useful and important machine data sources beyond this list - source code repository logs, physical security logs, etc. You still need your firewall and IDS logs to report on network connections and attacks. Your OS logs, including Unix and Linux syslog and the Windows event logs, record who's logged into your servers, what administrative actions they've taken, when services start and stop and when kernel panics happen. Logs from DNS, DHCP and other network services record who's assigned what IP address and how domains are resolved. Syslog from your routers, switches and network devices records the state of your network connections and failures of critical network components. The point is that there's more to machine data than just logs, and a far more diverse range of logs than traditional Log Management solutions can support.