A data warehouse is typically used to connect and analyze business data from heterogeneous sources.

Data Warehousing and Online Analytical Processing

Jiawei Han, ... Jian Pei, in Data Mining (Third Edition), 2012

4.1.6 Extraction, Transformation, and Loading

Data warehouse systems use back-end tools and utilities to populate and refresh their data (Figure 4.1). These tools and utilities include the following functions:

Data extraction, which typically gathers data from multiple, heterogeneous, and external sources.

Data cleaning, which detects errors in the data and rectifies them when possible.

Data transformation, which converts data from legacy or host format to warehouse format.

Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.

Refresh, which propagates the updates from the data sources to the warehouse.

Besides cleaning, loading, refreshing, and metadata definition tools, data warehouse systems usually provide a good set of data warehouse management tools.

Data cleaning and data transformation are important steps in improving the data quality and, subsequently, the data mining results (see Chapter 3). Because we are mostly interested in the aspects of data warehousing technology related to data mining, we will not get into the details of the remaining tools, and recommend interested readers to consult books dedicated to data warehousing technology.
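As a concrete illustration, the following minimal Python sketch wires these back-end functions together. The table names, column layout, and conversion logic are invented for the example and are not taken from the chapter.

```python
# Minimal sketch of the back-end ETL functions described above.
# All table names, columns, and helpers are illustrative, not from any product.
import sqlite3

def extract(conn_src):
    """Data extraction: gather rows from a (possibly external) source."""
    return conn_src.execute("SELECT id, region, amount, currency FROM sales").fetchall()

def clean(rows):
    """Data cleaning: detect simple errors and rectify them when possible."""
    cleaned = []
    for id_, region, amount, currency in rows:
        if amount is None:                      # unrecoverable error: skip the record
            continue
        region = (region or "UNKNOWN").strip().upper()
        cleaned.append((id_, region, float(amount), currency or "USD"))
    return cleaned

def transform(rows):
    """Data transformation: convert from source format to warehouse format."""
    usd_rate = {"USD": 1.0, "EUR": 1.1}         # illustrative conversion rates
    return [(id_, region, amount * usd_rate.get(cur, 1.0))
            for id_, region, amount, cur in rows]

def load(conn_dwh, rows):
    """Load: consolidate, compute a summary view, and build an index."""
    conn_dwh.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales (id INTEGER, region TEXT, amount_usd REAL)")
    conn_dwh.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    conn_dwh.execute(
        "CREATE VIEW IF NOT EXISTS v_sales_by_region AS "
        "SELECT region, SUM(amount_usd) AS total_usd FROM fact_sales GROUP BY region")
    conn_dwh.execute("CREATE INDEX IF NOT EXISTS ix_fact_sales_region ON fact_sales(region)")

def refresh(conn_src, conn_dwh):
    """Refresh: propagate the source updates to the warehouse."""
    load(conn_dwh, transform(clean(extract(conn_src))))
```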

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780123814791000046

Analytics Adoption Roadmap

Nauman Sheikh, in Implementing Analytics, 2013

Lesson 4: Efficient Data Acquisition

Data warehouse systems have built robust capabilities for handling various forms of data coming from different systems on different schedules. This capability is called ETL—extract, transform, and load—but we will use ETL as a noun referring to a capability of moving data between a source and a target and applying some data processing logic along the way. ETL has grown into a vast field involving all aspects of data management and has become a small industry in its own right, called data integration (Thoo, Friedman & Beyer, 2012). ETL teams have become very good at linking to operational systems; they maintain existing integrations with those systems and an established mechanism for receiving and sending data. Once this capability is in place, accessing data and serving the various data needs of IT and business teams becomes fairly efficient, removing one of the biggest obstacles in data analysis.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780124016965000074

Introduction to Data Warehousing

Daniel Linstedt, Michael Olschimke, in Building a Scalable Data Warehouse with Data Vault 2.0, 2016

1.1.2 Data Warehouse Systems

A data warehouse system (DWH) is a data-driven decision support system that supports the decision-making process in a strategic sense and, in addition, operational decision-making, for example real-time analytics to detect credit card fraud or on-the-fly recommendations of products and services [8]. The data warehouse provides nonvolatile, subject-oriented data that is integrated and consistent to business users on all targeted levels. Subject orientation differs from the functional orientation of an ERP or operational system by its focus on a subject area for analysis. Examples of subject areas for an insurance company are customer, policy, premium, and claim. The subject areas product, order, vendor, bill of material, and raw materials, on the other hand, are examples for a manufacturing company [9, p29]. This view of an organization allows the integrated analysis of all data related to the same real-world event or object.

Before business users can use the information provided by a data warehouse, the data is loaded from source systems into the data warehouse. As described in the introduction of this chapter, the integration of the various data sources within or external to the organization is in many cases performed on the business keys. This becomes a problem if a business object, such as a customer, has different business keys in each system. This might be the case if a customer number in an organization is alphanumeric but one of the operational systems only allows numeric values for business keys. Other problems occur when the database of an operational system contains dirty data, which is often the case when data is invalid or outdated or when no business rules are in place. Examples of dirty data include typos, transmission errors, or unreadable text that has been processed by OCR. Before such dirty data can be presented to a business user in traditional data warehousing, the data must be cleansed, which is part of the loading process of a data mart. Other issues include different data types or character encodings of the data across source systems [9, p30f]. However, there are exceptions to this data cleansing: for example, when data quality itself is to be reported to the business user.
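As an illustration of the business key problem, the following Python sketch harmonizes a hypothetical alphanumeric customer number with the numeric-only key of one operational system. The key formats and system names are assumptions made for the example.

```python
# Sketch: harmonizing business keys before loading (formats are illustrative).
# The organization-wide key is alphanumeric ("CUST-000042"); one operational
# system can only store the numeric part (42).
def to_enterprise_key(source_system, local_key):
    """Map a local business key to the organization-wide customer number."""
    if source_system == "crm":              # already alphanumeric
        return local_key.strip().upper()
    if source_system == "billing":          # numeric-only system
        return "CUST-{:06d}".format(int(local_key))
    raise ValueError("unknown source system: " + source_system)

assert to_enterprise_key("billing", "42") == "CUST-000042"
assert to_enterprise_key("crm", " cust-000042 ") == "CUST-000042"
```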

Another task that is often performed when loading data into the data warehouse is some aggregation of raw data to fit the required granularity. The granularity of data is the unit of data that the data warehouse supports. An example of different granularities is the difference between a salesman and a sales region. In some cases, business users only want to analyze the sales within a region and are not interested in the sales of a given salesman. A reason for this might also be legal constraints, for example an agreement or legal binding with a labor union. In other cases, business analysts actually want to analyze the sales of a salesman, for example when calculating the sales commission. In most cases, data warehouse engineers aim to load data at the finest granularity possible, to allow multiple levels of analysis. In some cases, however, the operational systems only provide raw data at a coarse granularity.
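The following Python sketch illustrates such an aggregation from salesman granularity to region granularity; the sample rows and field names are illustrative only.

```python
# Sketch: aggregating sales from salesman granularity to region granularity
# during the load, e.g. when per-person figures must not be exposed.
from collections import defaultdict

sales_by_salesman = [                      # illustrative source rows (finest grain)
    {"salesman": "Jones", "region": "North", "amount": 1200.0},
    {"salesman": "Meyer", "region": "North", "amount":  800.0},
    {"salesman": "Rossi", "region": "South", "amount": 1500.0},
]

sales_by_region = defaultdict(float)       # coarser grain stored in the warehouse
for row in sales_by_salesman:
    sales_by_region[row["region"]] += row["amount"]

print(dict(sales_by_region))               # {'North': 2000.0, 'South': 1500.0}
```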

An important characteristic of many data warehouses is that historic data is kept. All data that has been loaded into the data warehouse is stored and made available for time-variant analysis. This allows the analysis of changes to the data over time and is a frequent requirement of business users, e.g., to analyze the development of sales in a given region over the last quarters. Because the data in a data warehouse is historic and, in most cases, no longer available in the source system, the data is nonvolatile [9, p29]. This is also an important requirement for the auditability of an information system [10, p131].
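A minimal Python sketch of this nonvolatile, time-variant storage idea is shown below; the record layout and the as-of lookup are simplified assumptions, not the book's implementation.

```python
# Sketch: nonvolatile, time-variant storage. Instead of overwriting a record,
# every load appends a new version with its load date, so changes can be
# analyzed over time. Field names are illustrative.
from datetime import date

customer_history = []                          # append-only; nothing is deleted

def load_customer(business_key, city, load_date):
    customer_history.append({"key": business_key, "city": city, "load_date": load_date})

load_customer("CUST-000042", "Hamburg", date(2015, 1, 1))
load_customer("CUST-000042", "Berlin",  date(2015, 7, 1))   # customer moved

def as_of(business_key, point_in_time):
    """Time-variant analysis: the state of the customer as of a given date."""
    versions = [r for r in customer_history
                if r["key"] == business_key and r["load_date"] <= point_in_time]
    return max(versions, key=lambda r: r["load_date"]) if versions else None

print(as_of("CUST-000042", date(2015, 3, 1))["city"])        # Hamburg
```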

The next section introduces enterprise data warehouses, which are a further development of data warehouses and provide a centralized view of the entire organization.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128025109000015

Master Data Management

Daniel Linstedt, Michael Olschimke, in Building a Scalable Data Warehouse with Data Vault 2.0, 2016

9.4 Operational vs. Analytical Master Data Management

Master data can be used in two types of organizational systems. First, the data is used by operational systems as common reference data within the application. The application uses the reference data from MDM as business objects within the business processes that are implemented in the operational system. It enriches the data with new information that has been collected during the execution of these processes. It also uses business keys from MDM to manage its references to the business data when it stores transactions within its own database.

However, this means that businesses must decide which departments or functional units are allowed to change the master data in MDM in order to avoid unwanted or unauthorized changes to the reference data used throughout the business [9]. The usage of master data in operational systems is shown in Figure 9.3.


Figure 9.3. Use of master data by operational systems [9].

The figure shows master data from an airport that is used by multiple operational systems. Each operational system uses a partial set of master data in the local scope of the application. Such an operational system is called a master data subscriber, because it subscribes to the master data and to changes to its entities. Because operational systems share the master data from a central location, the data within the operational systems becomes integrated, often through a business key that was defined in the MDM application. In some cases, the operational system might update master data with new information. These changes occur within the business processes implemented in the operational system. No system, however, will write transactional information to the MDM application. Instead, the transactional data remains in the operational system only. In order to load it into the data warehouse, it has to be collected independently by ETL routines as part of the data warehouse loading process. We discuss this in more detail in Chapter 11, Data Extraction.

The data warehouse system is another subscriber of master data. Often, operational systems don't use all master data, or they modify it locally. Therefore, the data warehouse is interested in both the centrally stored version of master data and the master data that is used and enriched in local applications. For that reason, it loads master data from both locations: the central MDM application and all operational source systems. The master data is often used to source dimensional entries, while the transactional data is used to source fact tables. However, there is some master data that is created and maintained only for the data warehouse itself. This case is called analytical master data and includes master data types such as the following:

Business rule parameters: Many business rules that are implemented in the data warehouse to transform raw data into useful information are based on parameters. For example, tax rates change over time and need to be adjusted frequently. Also, a flight delay is currently defined as a flight that arrives at (or departs from) the gate 15 minutes or more after the scheduled time [10]. Because this definition might change in the future, it is not good practice to encode such parameter values in the ETL jobs or virtual views directly. Instead, the use of MDM allows business users to modify these definitions on their own, without IT involvement.

Defining groups and bins: In some cases, it is sufficient to identify whether a flight is delayed or not (again, if the flight is 15 minutes late). In other cases, a more detailed analysis is required. For that reason, the Bureau of Transportation Statistics (BTS) has defined departure and arrival delay groups that measure the delay in intervals of 15 minutes. Table 9.1 shows the delay group definitions. In order to map the actual delay of a flight to this definition, the table needs to be stored in MDM and enriched with numerical limits (minimum and maximum number of minutes) that can be used in ETL to map the value to the definition. For example, if a flight is 65 minutes late, it falls within the 60-to-74-minute range of group number 4 and is therefore mapped to this delay group (a mapping sketch follows this list).

Table 9.1. BTS Delay Groups

Code  Description
-2    Delay < -15 minutes
-1    Delay between -15 and -1 minutes
 0    Delay between 0 and 14 minutes
 1    Delay between 15 and 29 minutes
 2    Delay between 30 and 44 minutes
 3    Delay between 45 and 59 minutes
 4    Delay between 60 and 74 minutes
 5    Delay between 75 and 89 minutes
 6    Delay between 90 and 104 minutes
 7    Delay between 105 and 119 minutes
 8    Delay between 120 and 134 minutes
 9    Delay between 135 and 149 minutes
10    Delay between 150 and 164 minutes
11    Delay between 165 and 179 minutes
12    Delay ≥ 180 minutes

Codes and descriptions: Often, source systems use codes that are easily understood by business users. Examples of such codes are IATA airport codes. However, the sheer number of these three-letter codes makes it complicated and error-prone to handle reports that use the codes alone. Therefore, such codes are often enriched with readable captions and other important attributes, such as a sort order.

Hierarchy definitions: MDM can be used to define organizational hierarchies, product definitions (bills of materials), and other frequently used hierarchies. Having them in the MDM application allows business users to modify these definitions without IT involvement.

Date and other calendar information: The calendar is another example of a hierarchy that can be defined using MDM. Using MDM allows the end user to modify the names and definitions of holidays or business seasons if needed. It is also possible to modify the beginning and ending dates of the financial calendar when required by organizational changes, e.g., a takeover.

Technical parameters: The data warehouse is installed on top of one or multiple servers, as discussed in Chapter 8, Physical Database Design. Some parts of this environment can be controlled by business users and technical users, such as systems administrators, by using MDM. For example, it is possible to define external scripts that need to be called during loads; set the name of the environment (“Development,” “Test,” “Production”) that is displayed on technical reports and dashboards; maintain source system details such as FTP addresses or database names; and set the date and time formats or time zones to be used. It is also possible to configure the users who should be informed if problems occur during loads, such as the business owner or data steward.
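To make the first two items more concrete, the following Python sketch shows how a delay threshold and the delay groups of Table 9.1 might be read as parameters and applied during ETL. The hard-coded values here are stand-ins for values maintained in MDM, and the data structures are assumptions made for the example.

```python
# Sketch: analytical master data driving ETL logic. The delay threshold and the
# BTS delay groups of Table 9.1 are assumed to come from the MDM application;
# here they are simply hard-coded stand-ins.
DELAY_THRESHOLD_MIN = 15                      # business rule parameter from MDM

# (group code, min minutes, max minutes), enriched with numerical limits in MDM
DELAY_GROUPS = ([(-2, None, -16), (-1, -15, -1), (0, 0, 14)]
                + [(g, g * 15, g * 15 + 14) for g in range(1, 12)]
                + [(12, 180, None)])

def is_delayed(delay_minutes):
    """Apply the simple delay business rule (15 minutes or more)."""
    return delay_minutes >= DELAY_THRESHOLD_MIN

def delay_group(delay_minutes):
    """Map the actual delay of a flight to its BTS delay group code."""
    for code, lo, hi in DELAY_GROUPS:
        if (lo is None or delay_minutes >= lo) and (hi is None or delay_minutes <= hi):
            return code
    return None

assert is_delayed(65) and delay_group(65) == 4    # 65 minutes falls in group 4 (60-74)
```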

Because this information is used only for the data warehouse, it is not fed back into operational systems. However, in some cases, it might actually become used by operational systems, transforming the analytical master data into operational master data.

In addition to these analytical use cases, it is also common practice to enrich operational data with additional attributes that are only required by analytical systems. This could be a classification number or tag that is attached to passengers. Instead of adding the new field to the operational system, which would require substantial time and effort even though the field is only used within the analytical system, it is added to a new entity in the MDM application. If new passengers are added to the operational system, they are also added to the MDM database using a staging process, similar to those used in data warehousing. Business users can classify those new passengers or other business entities, and the added data is used in the data warehouse downstream.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B978012802510900009X

Automated Decisions and Business Innovation

Nauman Sheikh, in Implementing Analytics, 2013

Decision Automation and Intelligent Systems

Now that we have seen how operations run, generate data, and how that data is stored and analyzed, analytics models come into play. Analytics models are put to use through decision strategies. Let's look at how these strategies move out of the analytics team and into the real world, so that actual business operations benefit from analytics—the culmination of value from data through the democratization of analytics.

Learning versus Applying

The purpose of data warehouse and analytics systems is to analyze data, learn from that analysis, and use that knowledge and insight to optimize business operations. Optimizing business operations really means changing the business processes so they yield greater value for the organization. The business processes are automated through operational systems, so these changes require operational system changes. However, since we established in Chapter 1 that analytics has to do with a future event, the business process changes are limited to well-defined events and the response to those events. Just to ensure that there is no confusion, modifying or improving a business process can also mean improving the information flow in a process, integrating various disconnected processes, or eliminating redundant or duplicate steps from the flow. That is the field of business process management, and it has nothing to do with analytics models and decision strategies. The business process optimization, redesign, or reengineering within the purview of analytics is limited to automated decisions driven from business rules that take input from analytics models. These business rules are embedded in the operational system, and therefore the application of analytics input has to be implemented within the operational system or as an add-on component or extension of the operational system.

Decision automation has two dimensions: the learning, and the application of that learning to business operations. The learning is where the data warehouse, data analysis, and analytics models come into play, while applying that learning to actual business activity is where decision strategies come into play, and they have to be embedded or tightly integrated into the operational system. There is a school of thought known as active data warehousing that suggests this decision making should be done in the data warehouse, since all relevant data is available there to make a determination comprehensively, from all perspectives. This requires the data warehouse to be connected in real time to the operational system for receiving the event and responding with a decision. That just increases the complexity of integration: the rules or knowledge used to make the decision must have been derived beforehand, so why integrate with the data warehouse in real time? The active data warehousing approach works well in campaign management systems, where the line between operational data and analytical data is blurred. It is not a recommended approach when additional layers of analytics are involved. If the decisions are based entirely on triggers or thresholds that the data warehouse is tracking in real time, then active data warehousing may work. Strategy integration is a superior and simpler approach and decouples the data warehouse from live events and decisions. The results still go into the data warehouse within the analytics datamart, but there is tighter monitoring control and a simpler interface for strategy modification and testing.

Figure 5.3 represents this learning versus applying. If we split the diagram, the left side (going top to bottom) is the learning dimension, and the right side is the applying dimension. From a system architecture perspective, the nature of the two is quite different, and therefore a modular approach allows for greater flexibility in the choice of toolset, the choice of monitoring and control, and the operational SLAs to support the business.


Figure 5.3. Decision strategy technical architecture.

Strategy Integration Methods

Earlier in this chapter we used a simple decision strategy example for a consumer car loan. That example was presented in its algorithmic form (a combination of nested IF_THEN_ELSE statements), as well as in visual form (see Figure 5.1). Looking at that simple business rule, it should be obvious that adding such extended logic to the operational system workflow is not a technical challenge at all. The analytics tool will always stay outside as a black box or a standalone component. However, if there are several scenarios that need their own strategies, or there is a complex situation where multiple models are also involved, this type of extended coding within the operational system may become too complicated to maintain.
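To give a flavor of that algorithmic form, here is a hypothetical nested IF_THEN_ELSE strategy in Python; the thresholds and outcomes are invented and do not reproduce Figure 5.1.

```python
# Hypothetical car loan decision strategy in the nested IF_THEN_ELSE style
# described above; all thresholds and actions are purely illustrative.
def car_loan_decision(credit_score, debt_to_income, requested_amount):
    if credit_score >= 720:
        if debt_to_income <= 0.35:
            return "APPROVE"
        else:
            return "APPROVE_WITH_CONDITIONS"      # e.g. larger down payment
    elif credit_score >= 640:
        if requested_amount <= 20000 and debt_to_income <= 0.30:
            return "APPROVE_WITH_CONDITIONS"
        else:
            return "REFER_TO_UNDERWRITER"
    else:
        return "DECLINE"

print(car_loan_decision(700, 0.28, 18000))        # APPROVE_WITH_CONDITIONS
```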

On the other hand, a pure-play strategy management tool may be too expensive. Therefore, one method is to embed the strategy rules in the operational system software. Another is to buy a strategy tool. The downside of embedding the code is that monitoring and auditing, as well as modifications, testing, and what-if scenarios, will be extremely difficult to carry out. Strategy tools do solve this problem, as they have visual interfaces for strategy design and internal version control, but the integration with the analytics tool, data warehouse, and operational system has its own costs and implementation overheads. The following is an innovative alternative to the two methods described above.

ETL to the Rescue

ETL (extract, transform, and load) is a small industry comprising specialized software, comprehensive data management methodology, and human expertise within the data warehouse industry. It came into being with the data warehouse, since data needed to be pulled out of operational systems and loaded into data warehouse systems. We refer to ETL as a noun encompassing all aspects of “data in motion,” whether single records or large data sets, messaging or files, real-time or batch. ETL then becomes the glue that holds the entire analytics solution together.

All ETL tools now have GUI-based development environments and provide the capabilities of a modern software development tool. If we look closely at the two treelike strategies in Figures 5.1 and 5.2, they look very similar to how a data processing or dataflow program looks in ETL tools. Therefore, an ETL tool can be used to design and implement strategies. ETL has its own processing server and integration interfaces to all source systems, and there is plenty of expertise available within data warehouse teams. So the recommended integration is through an ETL layer that receives a real-time event from the operational system, prepares the input record around the event, and invokes the analytics model. The model returns an output that ETL feeds into the strategy, runs the data through the strategy, and reaches a decision. It then deposits that decision back into the operational system. This is a very simple integration for ETL developers who have been building the information value chain through the Information Continuum. Remember, if the prerequisite layers of the hierarchy are not in place, jumping into the analytics levels of the Information Continuum will only yield short-term and sporadic success. For a sustained benefit from analytics, the Information Continuum has to be followed through each level, and that will automatically bring the ETL maturity needed to implement decision strategies and integrate with the operational systems.
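The following Python sketch outlines this event-to-decision flow; the event fields, the score function, and the write-back interface are placeholders standing in for whatever the ETL tool, model, and operational system actually expose.

```python
# Sketch of the ETL-layer integration described above: receive an event,
# prepare the model input, invoke the analytics model, run the decision
# strategy, and write the decision back. All interfaces are placeholders.
def handle_event(event, score, write_decision):
    features = {                                    # prepare input record around the event
        "credit_score": event["credit_score"],
        "debt_to_income": event["debt"] / max(event["income"], 1),
    }
    model_output = score(features)                  # invoke the analytics model
    if model_output >= 0.8:                         # decision strategy (simplified)
        decision = "APPROVE"
    elif model_output >= 0.5:
        decision = "REFER_TO_UNDERWRITER"
    else:
        decision = "DECLINE"
    write_decision(event["id"], decision)           # deposit decision in the operational system
    return decision

# Example wiring with stub interfaces:
decision = handle_event({"id": 1, "credit_score": 710, "debt": 900, "income": 4000},
                        score=lambda features: 0.83,
                        write_decision=lambda event_id, d: None)
print(decision)                                     # APPROVE
```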

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780124016965000050

Data Cube Technology

Jiawei Han, ... Jian Pei, in Data Mining (Third Edition), 2012

Publisher Summary

This chapter focuses on data cube technology. Data warehouse systems provide online analytical processing (OLAP) tools for interactive analysis of multidimensional data at varied granularity levels. OLAP tools typically use the data cube and a multidimensional data model to provide flexible access to summarized data. A data cube allows the data to be explored interactively in a multidimensional way through OLAP operations such as drill-down (to see more specialized data, such as total sales per city) or roll-up (to see the data at a more generalized level, such as total sales per country). Although the data cube concept was originally intended for OLAP, it is also useful for data mining. Multidimensional data mining is an approach to data mining that integrates OLAP-based data analysis with knowledge discovery techniques. It is also known as exploratory multidimensional data mining and online analytical mining (OLAM). It searches for interesting patterns by exploring the data in multidimensional space. Users can interactively drill down or roll up to varying abstraction levels to find classification models, clusters, predictive rules, and outliers. The chapter concentrates on methods for data cube computation and methods for multidimensional data analysis. Precomputing a data cube (or parts of a data cube) allows for fast access to summarized data. Given the high dimensionality of most data, multidimensional analysis can run into performance bottlenecks. Therefore, it is important to study data cube computation techniques. Data cube technology provides many effective and scalable methods for cube computation. Studying these methods also helps in the understanding and further development of scalable methods for other data mining tasks, such as the discovery of frequent patterns.
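A toy Python example of roll-up and drill-down on sales data is shown below; the data and grouping functions are illustrative and unrelated to the cube computation methods covered in the chapter.

```python
# Toy sketch of roll-up and drill-down on a tiny sales cube (illustrative data).
from collections import defaultdict

sales = [("Canada", "Vancouver", 500), ("Canada", "Toronto", 700),
         ("USA", "Chicago", 400), ("USA", "New York", 900)]

def aggregate(rows, keyfunc):
    """Sum the measure for each group produced by keyfunc."""
    totals = defaultdict(int)
    for country, city, amount in rows:
        totals[keyfunc(country, city)] += amount
    return dict(totals)

drill_down = aggregate(sales, lambda country, city: (country, city))  # total sales per city
roll_up    = aggregate(sales, lambda country, city: country)          # total sales per country

print(roll_up)      # {'Canada': 1200, 'USA': 1300}
```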

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780123814791000058

The Data Warehouse/Operational Environment Interface

W.H. Inmon, ... Mary Levins, in Data Architecture (Second Edition), 2019

Changed Data Capture

Yet another variation on the classical interface between operational systems and data warehouse systems is what is termed the CDC option. “CDC” stands for “changed data capture.” For high-performance online transaction environments, it is difficult or inefficient to scan the entire database every time data need to be refreshed into the data warehouse environment. In these environments, it makes sense to determine what data need to be updated in the data warehouse by examining the log tape or journal tape. The log tape is created for the purposes of online backup and recovery in the eventuality of a failure during online transaction processing. But the log tape also contains all the data that need to be updated in the data warehouse. The log tape is read offline and is used to gather the data that need to be propagated to the data warehouse.

Fig. 8.3.5 depicts the CDC option.


Fig. 8.3.5. Transaction processing systems, changed data capture option.
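The following Python sketch illustrates the principle of changed data capture with an in-memory change log standing in for the log tape; the log format and the apply logic are simplified assumptions.

```python
# Sketch of the CDC option: instead of scanning the whole source database,
# only the entries of the transaction log (here a plain list of change records)
# are read offline and propagated to the warehouse. Format is illustrative.
change_log = [
    {"op": "INSERT", "key": 101, "row": {"customer": "A", "balance": 50}},
    {"op": "UPDATE", "key": 101, "row": {"customer": "A", "balance": 75}},
    {"op": "DELETE", "key": 102, "row": None},
]

warehouse = {102: {"customer": "B", "balance": 10}}     # current warehouse state

def apply_changes(log, target):
    """Propagate only the changed data to the warehouse copy."""
    for entry in log:
        if entry["op"] in ("INSERT", "UPDATE"):
            target[entry["key"]] = entry["row"]
        elif entry["op"] == "DELETE":
            target.pop(entry["key"], None)

apply_changes(change_log, warehouse)
print(warehouse)        # {101: {'customer': 'A', 'balance': 75}}
```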

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128169162000280

Introduction

Philip A. Bernstein, Eric Newcomer, in Principles of Transaction Processing (Second Edition), 2009

Data Warehouse Systems

TP systems process the data in its raw state as it arrives. Data warehouse systems integrate data from multiple sources into a database suitable for querying.

For example, a distribution company decides each year how to allocate its marketing and advertising budget. It uses a TP system to process sales orders that includes the type and value of each order. The customer database tells each customer’s location, annual revenue, and growth rate. The finance database includes cost and income information, and tells which product lines are most profitable. The company pulls data from these three data sources into a data warehouse. Business analysts can query the data warehouse to determine how best to allocate promotional resources.

Data warehouse systems execute two kinds of workloads: a batch workload to extract data from the sources, cleaning the data to reconcile discrepancies between them, transforming the data into a common shape that’s convenient for querying, and loading it into the warehouse; and queries against the warehouse, which can range from short interactive requests to complex analyses that generate large reports. Both of these workloads are quite different from TP, which consists of short updates and queries. Also unlike TP, a data warehouse’s content can be somewhat out-of-date, since users are looking for trends that are not much affected by the very latest updates. In fact, sometimes it’s important to run on a static database copy, so that the results of successive queries are comparable. Running queries on a data warehouse rather than a TP database is also helpful for performance reasons, since data warehouse queries would slow down update transactions, a topic we’ll discuss in some detail in Chapter 6. Our comparison of system styles so far is summarized in Figure 1.12.


Figure 1.12. Comparison of System Types. Transaction processing has different characteristics than the other styles, and therefore requires systems that are specially engineered to the purpose.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9781558606234000019

Physical Data Warehouse Design

Daniel Linstedt, Michael Olschimke, in Building a Scalable Data Warehouse with Data Vault 2.0, 2016

8.4.3 Memory Options

The database server uses physical memory for caching pages from disk. While operational database systems often deal with small transactions, data warehouse systems deal with large queries (referring to the amount of data being touched by the query). In addition, a query often requires multiple passes to deal with large tables; having the table already in memory can greatly improve the performance [23]. If SQL Server doesn’t have enough memory available to complete the operation, it uses hard disk storage, for example by using page files, tempdb or re-reading database pages from disk. Therefore, the more RAM the data warehouse system provides to Microsoft SQL Server, the better.
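As a hedged example, the following Python snippet (using pyodbc) inspects the server's physical memory and caps SQL Server's memory consumption; the connection string and the 24 GB value are illustrative and should be adapted to the actual environment.

```python
# Sketch: inspecting and capping SQL Server memory from Python via pyodbc.
# The connection string and the 24 GB cap are illustrative assumptions;
# 'max server memory (MB)' and sys.dm_os_sys_memory are standard SQL Server
# facilities, but verify the settings against your environment before use.
import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=dwh01;"
                      "DATABASE=master;Trusted_Connection=yes", autocommit=True)
cur = conn.cursor()

# How much physical memory does the server see?
cur.execute("SELECT total_physical_memory_kb / 1024 AS total_mb FROM sys.dm_os_sys_memory")
print("Physical memory (MB):", cur.fetchone().total_mb)

# Leave headroom for the OS and other processes; give the rest to SQL Server.
cur.execute("EXEC sp_configure 'show advanced options', 1; RECONFIGURE;")
cur.execute("EXEC sp_configure 'max server memory (MB)', 24576; RECONFIGURE;")
```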

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128025109000088

Scalable Data Warehouse Architecture

Daniel Linstedt, Michael Olschimke, in Building a Scalable Data Warehouse with Data Vault 2.0, 2016

2.1.1 Workload

The enterprise data warehouse (EDW) is “by far the largest and most computationally intense business application” in a typical enterprise. EDW systems consist of huge databases, containing historical data with volumes ranging from multiple gigabytes to terabytes of storage [4]. Successful EDW systems face two issues regarding the workload of the system: first, they experience rapidly increasing data volumes and application workloads and, second, an increasing number of concurrent users [5]. In order to meet the performance requirements, EDW systems are implemented on large-scale parallel computers, such as massively parallel processing (MPP) or symmetric multiprocessor (SMP) system environments and clusters, and on parallel database software. In fact, most medium- to large-size data warehouses would not be implementable without large-scale parallel hardware and parallel database software to support them [4].

Handling the requested workload requires more than parallel hardware and parallel database software, however. The logical and physical design of the databases also has to be optimized for the expected data volumes [6–8].
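The following toy Python example illustrates the partition-parallel principle behind such systems: each worker aggregates its own data partition and the partial results are merged. It is a didactic sketch, not a model of any particular MPP engine.

```python
# Toy illustration of partition-parallel aggregation: each worker aggregates
# its own partition, then the partial results are combined, instead of one
# process scanning everything.
from multiprocessing import Pool

def partial_sum(partition):
    """Aggregate one partition locally (runs on a separate worker process)."""
    totals = {}
    for region, amount in partition:
        totals[region] = totals.get(region, 0) + amount
    return totals

def combine(partials):
    """Merge the per-partition results into the final answer."""
    final = {}
    for p in partials:
        for region, amount in p.items():
            final[region] = final.get(region, 0) + amount
    return final

if __name__ == "__main__":
    partitions = [[("North", 10), ("South", 5)], [("North", 7)], [("South", 3), ("West", 2)]]
    with Pool(processes=3) as pool:
        print(combine(pool.map(partial_sum, partitions)))   # {'North': 17, 'South': 8, 'West': 2}
```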

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128025109000027

What are data warehouses used for?

A data warehouse is specially designed for data analytics, which involves reading large amounts of data to understand relationships and trends across the data. A database is used to capture and store data, such as recording details of a transaction.

What is heterogeneous in data warehouse?

Heterogeneous data is integrated into the data warehouse. Clients or business end users interact with the data warehouse for analysis instead of the various data sources. Source publication: Data Warehousing, OLAP, and Data Mining: An Integrated Strategy for Use at FAA.

Which data warehouse function entails obtaining data from a variety of heterogeneous sources?

These tools and utilities include the following functions: data extraction, which typically gathers data from multiple, heterogeneous, and external sources.

What kind of data does a data warehouse usually process?

Data warehouses are solely intended to perform queries and analysis and often contain large amounts of historical data. The data within a data warehouse is usually derived from a wide range of sources such as application log files and transaction applications.