Data warehousing:
The analytic foundation

What is a data warehouse?

A data warehouse is a design pattern or data architecture that tracks integrated, consistent, and detailed data over time, establishing relationships between them using metadata and schema.

The origins of the data warehouse

The data warehouse concept started in 1988 when Barry Devlin and Paul Murphy published their groundbreaking paper in the IBM Systems Journal. Their vision sparked a need for more specific definitions of database implementations, which Bill Inmon and Ralph Kimball provided in the early 1990s – and Gartner further clarified definitions in 2005. Now any discussion on data warehousing also includes how or where a data warehouse solution is implemented, such as within the cloud, or spanning on-premises and cloud in a hybrid manner.

A data warehouse isn’t a collection of tables or measured in terabytes. It’s a design pattern, a data architecture with many characteristics:

Subject oriented
Reflects business entities and processes that the organization works with daily. The level of detail in the subject area is what is important: if detailed data is there, it is a data warehouse. If summary or only limited data is there, it is a data mart.

Integrated, consistent
Data formats and values are standardized across all tables to ensure complete, accurate data that users can understand. It must also have integrity: e.g., it cannot have purchasing transactions without a corresponding customer record.

Nonvolatile history
A warehouse captures data changes and tracks data changes over time. All data is kept and does not change with transactional updates. Whether traditional, hybrid, or cloud, a data warehouse is effectively the “corporate memory” of its most meaningful data.

Metadata, schema, dictionary

Organizations typically capture all there is to know about customers and sales transactions, which are linked and also connected with product descriptions, inventories, bills of material, support records, and marketing campaigns. Subject areas are made up of relational tables with column headers and rows of data. Column headers, called metadata because they describe the context for values in that column, contain the names of fields and their data type such as date, currency, time, integers, and text. Every record stored has at least one key field that allows for random direct access. All table definitions – the metadata – together are called a schema and are stored in the data dictionary. An entire, fully-built data warehouse is 4,000 to 7,000 relational tables organized by topic areas.

How does data get into the data warehouse?

Data sources
It’s not uncommon to have 200 or even 500 different applications sending data to the warehouse, which consolidates and integrates all such data into the subject areas. The warehouse gets input from applications such as enterprise resource planning (ERP), customer relationship management (CRM), and supply chain management (SCM).

Clickstream data from mouse clicks on web pages are another source, as is sensor data from machinery vehicles, and so on. There is also unstructured data such as JSON (JavaScript Object Notation) which does not conform to rows and columns but is still captured in the data warehouse.

Data loading
Data is loaded into the warehouse in a continuous process – typically all day long. Data loading leads to the business purpose of the warehouse: the foundation for finding answers to questions. Data scientists apply advanced mathematics to find patterns and anomalies, while business analysts use reports and dashboards with visualization.

Data loading segues to the business purpose of delivering data to the business users: getting insight and finding answers to business problems. Data scientists apply advanced mathematics to large amounts data to find patterns and anomalies. Multi-dimensional analysis is where all the data is highly summarized, enabling fast review of rollups by region, city, sales person, and product sold, for example. Executives and business analysts (or "citizen data scientists") use reports and dashboards with visualization, all pulling from the source of governed data: the data warehouse.

Data integration
Before the data goes into the data warehouse database, it passes through the data integration step, a complex process that rationalizes data from multiple sources into a single result. Originally this was called extract, transform, and load (ETL) because the data had to be pulled from the source, refined, then loaded into data warehouse relational tables.

Data cleansing
Modern integration processes include data cleansing, which involves detecting and correcting corrupt or inaccurate records. Errors occur due to faulty inputs, hardware corruption, or simple human error. The data integration task combines the best, most accurate and most complete data from multiple applications into a clean, reliable “golden record” in the warehouse.

The Teradata data warehouse

Since inception, Teradata has been a key player in the evolution of analytic databases and data warehousing. In fact, Teradata Vantage – formerly known as Teradata Database – and the approach that Teradata takes for data warehousing are often a central part of customers’ rise to the top of their respective industries, whether in communications, media and entertainment, financial services, healthcare and life sciences, retail, utilities, manufacturing, travel and transportation, and more.

The Cloud Data Warehouse and Teradata Vantage

Both Teradata and the industry at large evolved to incorporate the benefits of cloud deployment and scalability. Teradata Vantage, the company’s flagship offering, builds on the strong foundation of Teradata Database and incorporates advanced analytic capabilities acquired with Aster Data in 2011.

Vantage is available for Amazon Web Services (AWS), Microsoft Azure, Google Cloud, Teradata infrastructure (Teradata Cloud or Customer Cloud), and commodity hardware running VMware virtualization software.

Cloud Analytics - AWS Amazon Web Services

Use AWS infrastructure with Teradata Vantage

Cloud Analytics - Microsoft Azure Microsoft Azure

Combine Azure resources with Teradata Vantage

Cloud Analytics - Google Cloud Google Cloud

Leverage Google Cloud with Teradata Vantage

Take your analytics into the cloud