Published Date: March 16, 2023
Data warehouses are electronic systems for storing and integrating data, particularly corporate information, in central repositories where it can then be mined and analyzed to gain actionable business insights.
The concept originated in the late 1980s when IBM researchers Barry Devlin and Paul Murphy developed what they called a “business data warehouse.” For decades afterward, data warehouses were seen as constantly expanding server farms where all company data would be aggregated, cleansed and served up to meet the needs of analysts and employees. They were seen as hosts for corporate “golden records,” meaning they were ground central for the optimization of all business-critical historical data.
Despite serving this very real purpose, however, the concept of data warehousing did develop a stigma over time. As the amount of data exploded, the cost and complexity of managing it in data warehouses similarly skyrocketed. Industry pundits began to question the data warehousing model and wondered if it would have to give way to something else.
Not surprisingly, when cloud computing erupted and offered more scalable, flexible, and cost-effective ways of data storage, data warehousing did, in fact, begin to change. Cloud service providers emerged about 10 years ago providing all but unlimited scalability, manageability, data analysis and security for organizations looking to offload the cost and complexity of running their own server farms. Today, an estimated 30 percent of data warehousing workloads run in the cloud, and that is expected to rise to over 60 percent by 2024.
(Want to learn more about how to secure, operate and innovate faster across your cloud landscape? Read about the Splunk platform here.)
In this article, we will discuss the different types of data warehouses, how they compare to similar technologies, and how to get started building the right data warehouse for your company’s needs.
How does a data warehouse work?
Data warehouses involve networks of on-premises and cloud-based servers wherein organizations store massive amounts of data. That information goes through a process known as ETL (extract, transform and load) so it can be accessed, analyzed and mined for insights.
For example, if you had customer information sitting in a CRM database somewhere in a company, that data is likely written in that system’s format. It might be out of date. And it might even be redundant, meaning the same information is sitting on multiple servers.
When preparing raw data for use in a data warehouse, it’s first extracted from multiple data sources then transformed into a common digital format. During this “staging” process, it is also filtered, cleansed and sorted to eliminate inaccurate, imprecise and repetitive attributes. Once the transformed data is loaded and secured in the data warehouse, it is made available to authorized end users through a frontend that provides access to data analytics functions.
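As a rough illustration of these staging steps, the short Python sketch below extracts customer records from two hypothetical source systems, transforms them into a common format, removes out-of-date duplicates, and "loads" the result (here, it simply prints the rows). The record layouts and field names are invented for the example.

```python
from datetime import datetime

# Extract: raw customer records pulled from two hypothetical source systems,
# each with its own field names, stale entries, and duplicates.
crm_records = [
    {"id": "C-001", "name": "ada lovelace", "updated": "2023-01-15"},
    {"id": "C-001", "name": "Ada Lovelace", "updated": "2022-06-01"},  # older duplicate
]
billing_records = [
    {"customer_id": "C-002", "full_name": "Grace Hopper", "last_update": "2023-02-20"},
]

def transform(records, id_key, name_key, date_key):
    """Map each source's fields onto one common format and parse dates."""
    return [
        {
            "customer_id": r[id_key],
            "name": r[name_key].strip().title(),
            "updated": datetime.strptime(r[date_key], "%Y-%m-%d"),
        }
        for r in records
    ]

# Transform: convert both sources to the shared schema, then keep only the
# newest record per customer to drop out-of-date and redundant rows.
staged = transform(crm_records, "id", "name", "updated") + \
         transform(billing_records, "customer_id", "full_name", "last_update")

latest = {}
for row in staged:
    current = latest.get(row["customer_id"])
    if current is None or row["updated"] > current["updated"]:
        latest[row["customer_id"]] = row

# Load: a real pipeline would write these rows into the warehouse;
# here we simply print the cleansed, deduplicated result.
for row in latest.values():
    print(row)
```

In a real pipeline, the load step would write to the warehouse's staging tables, and a dedicated ETL tool would typically handle scheduling, error handling, and lineage tracking.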
What are the different types of data warehouses?
There are three main types of data warehouses. They include:
- Enterprise Data Warehouses (EDWs): EDWs are heavy-duty central repositories for all an organization’s internal and external data sources. Housed on premises or in the cloud, they often tap into multiple data sources, such as enterprise resource planning (ERP), customer relationship management (CRM), finance and operations databases. EDWs also act as staging areas for data aggregation and cleansing.
- Operational Data Stores (ODSs): An ODS is used on a temporary basis for lightweight data integration and processing for activities like operational reporting, online transaction processing and customer service. The ODS does not usually replace an EDW. Rather, it prepares data for eventual storage in an EDW.
- Data Marts: A data mart is a condensed repository customized for specific groups or lines of business within an organization. They usually sit on Windows, Unix, or Linux servers. As opposed to EDWs or even ODSs, data marts typically draw on just one or a very few sources. They are designed to provide easy access to frequently requested data. Think of them as subsets of larger EDWs.
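To make the "subset of a larger EDW" idea concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a warehouse and exposes a data mart as a view; the table, columns, and sample rows are purely illustrative.

```python
import sqlite3

# An in-memory database stands in for the enterprise data warehouse (EDW).
edw = sqlite3.connect(":memory:")
edw.execute("CREATE TABLE sales (region TEXT, business_unit TEXT, amount REAL)")
edw.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("US-East", "insurance", 1200.0),
        ("US-West", "insurance", 950.0),
        ("US-East", "banking", 3100.0),
    ],
)

# The "data mart": the same warehouse data, narrowed to the insurance line
# of business so that team sees only the slice it needs.
edw.execute(
    "CREATE VIEW insurance_mart AS "
    "SELECT region, amount FROM sales WHERE business_unit = 'insurance'"
)

print(edw.execute("SELECT * FROM insurance_mart").fetchall())
# [('US-East', 1200.0), ('US-West', 950.0)]
```

A real data mart might instead be a separately provisioned database fed from the EDW, but the principle is the same: one group gets easy access to just the data it frequently requests.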
What is data warehouse architecture?
Data warehouses typically depend on single-tier, two-tier or three-tier architectures. The objective of a single-tier approach is to minimize how much data is stored. A two-tier approach separates physically available sources from the data warehouse, but because it is not easily expandable and struggles to support large numbers of users, it is not commonly employed.
The most popular approach is the three-tier architecture, which includes:
- Bottom tier: In this tier, data warehouse servers collect, clean, and transform data from various sources across the organization. Metadata is created during transformation to speed search and query. Meanwhile, ETL elements help aggregate the processed data in standardized formats.
- Middle tier: This tier relies upon an online analytic processing (OLAP) computing model to organize and map large amounts of data in ways that enable analysts to view it from different perspectives using a simple querying language.
For instance, if an insurance industry analyst asks a system to compare the number of weather-related residential claims in Florida and Louisiana during May and October, this layer would make it possible to retrieve the data more quickly and efficiently (a query sketch follows this list).
- Top tier: This final tier is the front-end client layer. It is often enabled by powerful, dashboard-driven software for visualizing, analyzing, and presenting insights from data mining efforts.
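To make the insurance example above concrete, the sketch below shows the kind of aggregation the middle tier answers with a query language. The claims table, its columns, and the sample rows are invented, and a production OLAP engine would typically serve this from precomputed aggregates rather than scanning raw rows as SQLite does here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claims (state TEXT, month INTEGER, cause TEXT, num_claims INTEGER)"
)
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?, ?)",
    [
        ("FL", 5, "weather", 120), ("FL", 10, "weather", 340),
        ("LA", 5, "weather", 95),  ("LA", 10, "weather", 280),
        ("FL", 5, "theft", 40),    # non-weather rows are filtered out below
    ],
)

# Compare weather-related claims in Florida and Louisiana for May and
# October, grouped by state and month (two different "perspectives").
rows = conn.execute(
    """
    SELECT state, month, SUM(num_claims)
    FROM claims
    WHERE cause = 'weather' AND state IN ('FL', 'LA') AND month IN (5, 10)
    GROUP BY state, month
    ORDER BY state, month
    """
).fetchall()

for state, month, total in rows:
    print(state, month, total)
```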

What are some data warehouse components?
There are at least five key components of a data warehouse to understand. These include:
- Central database: Many data warehouses rely upon databases built around a relational database management system (RDBMS). This is a collection of datasets that uses Structured Query Language (SQL) to organize multiple datasets into tables, records, and columns so data can be easily accessed and consumed (a minimal sketch appears after this list).
- ETL platforms or tools: Data warehouses typically use third-party platforms or tools to handle all the functions that occur during the staging process. ETL processes can be written manually in-house, but third-party platforms and tools often offer more features, better visualization, and graphical user interfaces. This latter point is important because ETL processes can involve a variety of data scientists, analysts, engineers, and developers working together to transform data and get it ready for widespread use.
- Metadata: Some say metadata is “data about data.” It tags certain kinds of data – much as an index directs readers to specific information in a book – so it can be quickly and readily found and accessed. There are three categories of data warehouse-related metadata:
- Business metadata: Tagged information about the business itself.
- Technical metadata: This involves information like database table and column names, data types, indexes, ETL jobs, and so forth.
- Operational metadata: Describes how data is processed and accessed.
- Query tools: Data has no use if it cannot be found, so query or search tools are a necessary part of every data warehouse. Most major cloud service providers offer variations of these tools, and there are several open-source options available as well. There are four common types of query tools, including:
- Query and reporting tools, which are often built into data warehouse platforms to extract real-time data and historical information and insights in report formats.
- Application development tools for writing custom programs to hunt for particular types of data.
- Data mining tools for discovering patterns and trends in data.
- OLAP tools, which are analytical processing applications for looking at data from multiple perspectives.
- Data warehouse bus architecture: This determines the flow of data in the warehouse. Data flow includes:
- Inflow, which helps with extracting, cleansing and loading new data into the warehouse.
- Downflow, a process tied to archiving and data backup.
- Upflow, a process for adding value to data in the warehouse to make it more useful for data analytics and decision support.
- Outflow, the process of making data warehouse information available through query and analysis tools.
- Meta flow, the process that moves metadata – the data describing what is stored in the warehouse and how it is processed – through the system.
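The central database, technical metadata, and query tool components described above can be sketched together using Python's built-in sqlite3 module. The customers table is hypothetical, and SQLite's PRAGMA table_info is used only as a stand-in for the much richer metadata catalogs real warehouses maintain.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")  # stand-in for the central RDBMS

# Central database: data organized into tables, records, and columns.
warehouse.execute(
    "CREATE TABLE customers (customer_id TEXT PRIMARY KEY, name TEXT, region TEXT)"
)
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("C-001", "Ada Lovelace", "EMEA"), ("C-002", "Grace Hopper", "AMER")],
)

# Technical metadata: table and column names, data types, and so forth.
# SQLite exposes this through PRAGMA table_info; a warehouse would keep a
# richer metadata catalog, but the idea is the same.
for cid, name, col_type, *_ in warehouse.execute("PRAGMA table_info(customers)"):
    print(f"column {name!r} has type {col_type}")

# Query/reporting tool: a simple report pulled straight from the warehouse.
report = warehouse.execute(
    "SELECT region, COUNT(*) FROM customers GROUP BY region"
).fetchall()
print(report)
```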
What is the difference between a database and a data warehouse?
Databases and data warehouses both store data, but their purposes differ.
The key difference is that databases are collections of related data from specific parts of an organization and are meant to record data that can be served up quickly and simply. For example, a bank might use a database to serve up customer information and account-related information like deposits and withdrawals. Similarly, an airline might use a database to provide reservation and itinerary information to passengers. A healthcare company might use a database to review patient survey results.
Data warehouses are information systems for storing and analyzing data from numerous data sources. That same bank would turn to a data warehouse to manage its customer-impacting operations and resources more effectively. The airline in the example might use it for maximizing crew assignments or route planning. And, that healthcare company might use it to identify ways to improve patient experience.
From a technical standpoint, databases and data warehouses use different underlying technologies. Databases, for instance, rely upon Online Transactional Processing (OLTP) for fast, real-time transactional jobs, such as recording deposits, reservations, and account updates. Data warehouses, conversely, use Online Analytical Processing (OLAP) to quickly analyze large volumes of sometimes disparate data for tasks like planning, budgeting, and forecasting. This is the technology that enables analysts to look at data from different perspectives and in multiple contexts.
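As a rough illustration of that contrast (not any particular product's behavior), the sketch below runs a transactional-style, record-at-a-time operation and an analytical-style aggregation against the same toy table; the table and values are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE transactions (account_id TEXT, kind TEXT, amount REAL)")
db.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [("A-100", "deposit", 500.0), ("A-100", "withdrawal", 120.0),
     ("A-200", "deposit", 900.0)],
)

# OLTP-style work: small, fast, record-at-a-time operations that keep the
# business running (here, posting one new withdrawal for one account).
db.execute("INSERT INTO transactions VALUES (?, ?, ?)", ("A-200", "withdrawal", 75.0))

# OLAP-style work: scanning and summarizing many rows at once to answer an
# analytical question (total deposits vs. withdrawals across all accounts).
summary = db.execute(
    "SELECT kind, SUM(amount) FROM transactions GROUP BY kind"
).fetchall()
print(summary)
```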
For more on the differences between databases and data warehouses, check out Splunk’s /learn blog, authored by Austin Chia, Founder of AnyInstructor.com.
What is the difference between a data lake and a data warehouse?
A data lake is a data repository for large amounts of raw data stored in its original format — a term coined by James Dixon, then chief technology officer at Pentaho.
Data lakes can contain a mix of structured, semi-structured, and unstructured data, while a data warehouse usually contains only structured data. In a data lake, which is typically built on a big data platform, data is stored without being cleansed, tagged, or manipulated. They are much less organized, which offers specific advantages and disadvantages in use.
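One way to picture the difference is how schemas are applied: a lake keeps raw records in their original formats and imposes structure only when the data is read, while a warehouse enforces a schema before anything is loaded. The sketch below is purely illustrative; the record formats and field names are invented.

```python
import json

# Data lake: raw events land in their original formats (JSON, CSV-like
# strings, free text) with no cleansing or tagging up front.
data_lake = [
    '{"device": "sensor-7", "temp_c": 21.4}',      # JSON from an IoT device
    "2023-03-01,store-12,receipt,19.99",           # CSV line from a POS system
    "customer called about a billing question",    # unstructured note
]

# Data warehouse: only records that fit the agreed schema are loaded.
warehouse_rows = []

def load_sensor_reading(raw):
    """Schema-on-write: validate and shape the record before it is stored."""
    record = json.loads(raw)
    return {"device": str(record["device"]), "temp_c": float(record["temp_c"])}

for raw in data_lake:
    try:
        warehouse_rows.append(load_sensor_reading(raw))
    except (ValueError, KeyError):
        # Non-conforming data stays in the lake until someone defines how to
        # structure it; the warehouse only accepts cleansed, typed rows.
        pass

print(warehouse_rows)   # [{'device': 'sensor-7', 'temp_c': 21.4}]
print(len(data_lake))   # the lake still holds all three raw records
```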
Want to learn more about the different types of big data storage like data lakes and data warehouses? Visit the learn blog article here, or the Data Insider article here.
What is a cloud data warehouse?
Cloud data warehouses are central repositories of information hosted and managed by service providers. They are usually created for business intelligence (BI) decision making and analytics purposes. Like the cloud itself, they offer organizations almost infinite scalability as well as global accessibility for distributed workforces. Cloud data warehouses also allow organizations to scale up or down on a pay-as-you-go subscription basis, and because the provider manages the environment, they never have to worry about not having access to critical IT skills. The downside that organizations traditionally cite is that moving data to the cloud means ceding some control of it. More risk-averse companies, therefore, tend to shift non-vital data to the cloud but keep private and sensitive information on premises.
Cloud data warehouses can store information from almost any type of source, including CRM tools, internet of things (IoT) devices (more on IoT Monitoring here), and point-of-sale terminals. They also take advantage of BI tools, columnar data stores, data quality and integration tools, massively parallel processing, and structured query language (SQL), among other technologies.

Traditional data warehouses have been around for decades, but the emergence of cloud computing, new concepts in data warehousing design, and improvements in BI and analytics capabilities are taking this technology to a whole new level. While many organizations have the internal resources to identify, deploy, and manage these information supercenters, it is often advisable to consider bringing in outside consultant partners to help. Data warehouses, after all, are becoming more complex and costly. And no organization can afford the poor performance, outages, and other mistakes that can render a data warehouse ineffective.
