What is a Data Mart?
The data mart is a subject-oriented slice of the data warehouse logical model serving a narrow group of users. Many data marts only need a subset of data from the full tables in the data warehouse. For example, a mart may only have sales transactions, products, and inventory records. Most data marts only have 5-20 tables instead of 4,000.
Data Mart Fact Tables
The number of tables in a data mart has no relevance to the size of the database. The main tables–called fact tables–can be 100s of terabytes of call detail records for a telecommunications company, for instance. The data mart itself can be huge, but it is a narrow selection of all the data available in the data warehouse.
Data marts are often denormalized, capturing only summaries of data by sorting it and aggregating a result table, usually throwing away detail data. Some data marts are completely reloaded weekly or monthly; it is relatively easy to delete all the data and refresh it so that reports only look at the last 30 days of transactions.
Data Mart and Star Schema
Data marts and the star schema are inextricably linked. Imagine rows and columns of data in five spreadsheets. Four of the spreadsheets are connected via key fields that match the largest sheet called the fact table. Imagine the fact table is 50 million records, which really doesn’t fit in a spreadsheet, so they are stored in data mart tables. Most data marts have 5-10 tables in this star schema design, and the small tables on the star arms are called dimension tables.
Data Mart and Snowflake Schema
Dimensions are small tables with vital information. The fact table is where the bulk of the data is stored, maybe billions of records, and can be joined to the customer table to get actual customer name and address fields. A variation, the snowflake schema, has more than multiple fact tables linked together by key fields. Each fact table has only four or five dimension tables; a diagram of the set of tables and relationships resembles a snowflake–but is still a data mart.
What’s the difference between a data mart and data warehouse?
The distinction between data marts and data warehouses is about subject areas and integration, separated by schema complexity, not database size. That’s why the questions that can be asked are 100 times more sophisticated with a data warehouse than with a data mart because all the data is in the warehouse.
There are many “puzzle pieces” in the data warehouse, all the integrated tables grouped into subject areas. The data warehouse doesn’t have to be huge; it can be only five terabytes of storage. Or, it could have hundreds of terabytes of records. An alternative is storing three huge tables in a data mart.
Data Mart and Loading, Moving Data
There are many great tools for data integration and many great relational databases to hold the data – and dozens of excellent tools for analyzing the data. Fortunately, the movement of data to business intelligence (BI) tools is not labor intense nor is a lot of data transferred. Small amounts of data get sent to the BI tool for display in reports or dashboards.
The real costs are in moving data and transforming it for other purposes. Extracting and transforming data is costly and often slow. The integration step is intense in terms of labor and compute processing – but the alternative is to give business users broken, incomplete, or inaccurate data. To be clear, the fastest way to get business users to abandon a data warehouse or data mart is to give them dirty and incomplete data. If they can’t trust the data, they will revert to their spreadsheets. This explains why data in a data warehouse is so valuable – and why proliferation of data marts throughout an organization is risky when it comes to truly having the ability to get one’s arms around the actual state of the business.