Top down and bottom up approach data warehouse

There is a silly discussion about whether a top down or bottom up

approach is best when building the data warehouse environment. The top

down approach is commonly attributed to Bill Inmon and the Corporate

Information Factor (CIF). The bottom up approach is commonly attributed

to Ralph Kimball and the dimensional modeling crowd.

However, I fail to understand why the controversy exists.

Even the Data Warehouse Institute (http://www.dw-institute.com) has

gotten into the controversy. They published a TDWI Flashpoint on August

27, 2003 with comments from leading consultants on this topic. Pieter

Mimno (Mimno, Myers, and Holum) provides the bottom up approach and John

Ladley (KI Solutions) provides the rebuttal (so to speak). David Wells,

Director of Education at The Data Warehouse Institute, provides some

context setting about the debate.

The big question, it seems to me, is not so much which approach is best,

but which approach makes the most sense in your organization.

The top down approach has source systems feeding a data warehouse which

feeds the data marts.

The bottom up approach has source systems feeding a staging area which

feeds the data marts.

I really don't see much difference between the two. Both have source

systems being fed to a big storage area (call it a data warehouse, call

it a staging area) and the users access a subset of the data (call it a

data mart) to do their work. They both have the same problems in

development. The only question I think remains is whether that big

storage area is normalized or de-normalized. Since the user community

rarely, if ever, touches this big storage area, then who cares? To me,

it becomes a matter of preference, except to the two sides of the issue.

So, why the controversy? I think it allows debate in a community of

intellectuals. I like to debate the issues, but I build data warehouse

environments based on organizational needs and return on investment.

But the big reason I think the debate continues is the most common

answer to my question to ETL vendors - The answer is "We do data marts,

not data warehouses". Most of the tool vendors are not quite up to

dealing with large amounts of data in short time frames yet. To play in

this marketplace, you need to deal with data (lots and lots of data).

To move data from a big storage area to the data marts is not really a

time consuming process, and that is what these tools do best.

Having built data warehouses using both approaches, I still don't

understand the controversy. I prefer the half-way point. And if I were

a betting man, I would bet most people agree with me.

TELL ME WHAT YOU THINK

Please take a moment, and let me know what you think about this

newsletter, and what topics you would like to see covered. We've

included only six quick questions:

http://www.accelacomm.com/jlp/DMSNL/1/10000846/

This story, "Building a data warehouse: The top down, bottom up debate" was originally published by ITworld.

Data Warehouse design approaches are very important aspect of building data warehouse. Selection of right data warehouse design could save lot of time and project cost.

There are two different Data Warehouse Design Approaches normally followed when designing a Data Warehouse solution and based on the requirements of your project you can choose which one suits your particular scenario. These methodologies are a result of research from Bill Inmon and Ralph Kimball.

Bill Inmon – Top-down Data Warehouse Design Approach

“Bill Inmon” is sometimes also referred to as the “father of data warehousing”; his design methodology is based on a top-down approach. In the top-down approach, the data warehouse is designed first and then data mart are built on top of data warehouse.

The above image depicts how the top-down approach works.

Below are the steps that are involved in top-down approach:

Data is extracted from the various source systems. The extracts are loaded and validated in the stage area. Validation is required to make sure the extracted data is accurate and correct. You can use the ETL tools or approach to extract and push to the data warehouse.
Data is extracted from the data warehouse in regular basis in stage area. At this step, you will apply various aggregation, summerization techniques on extracted data and loaded back to the data warehouse.
Once the aggregation and summerization is completed, various data marts extract that data and apply the some more transformation to make the data structure as defined by the data marts.

Ralph Kimball – Bottom-up Data Warehouse Design Approach

Ralph Kimball is a renowned author on the subject of data warehousing. His data warehouse design approach is called dimensional modelling or the Kimball methodology. This methodology follows the bottom-up approach.

What is a Data Warehouse?

A Data Warehouse (DWH) is an essential component of a business intelligence system that enables users to perform various tasks and analyze large volumes of data. It typically contains multiple data sources such as transaction applications and log files. A DWH is generally considered an organization’s single source of truth.

It can be categorized as a specific area of comfort that supports the management’s process by storing large volumes of data. A DWH can also help an organization make informed decisions by analyzing the data.

Characteristics Of a Data Warehouse:

There are four characteristics of a DWH. They allow analysts to make informed decisions by analyzing the data.

It denotes that the data warehousing process is focused on a more specified issue. It focuses on a specific subject or topic instead of an organization’s ongoing operations. It also delivers a clear, precise, and concise summary of the issue by excluding information that isn’t relevant to the decision-making process.

2. Time-variant:

The DWH is consistent within a specified period. This means that the data in the DWH is uploaded monthly, weekly, hourly, etc., and it does not change within that period.

3. Non-volatile:

When you add new data, the previous data is not truncated. Since a DWH is separate from an operational database, any regular changes in the operational database are not reflected in the DWH. People will understand what has happened because of the lack of volatility. It clarifies the results of the investigation.

4. Integrated:

In a DWH, integration entails establishing a consistent unit of measurement for all related data from various databases. It reduces data redundancy. This means that if the same data has different names under different topics, DWH will identify all of them under one name.

Key Functions of a Data Warehouse:

A DWH serves as a data repository, with data held by an entity that provides data backup services. It lowers the cost of the storage system and even the backup data at the organizational level. It maintains information about the tables with high granular transaction levels that are monitored to define data warehousing approaches.

Here are some functions of a DWH:

Data consolidations	It refers to collecting data from all the sources in an organization, cleaning it, and combining it in a single location.
Data cleaning	It is a process that ensures the correctness of data. The accuracy, integration, and consistency are checked.
Data integration	It combines data from different sources in an organization under a single view to the users.
Data extraction	It is the process of extracting data from an organization for further use.
Data transformation	It is the process of converting some data into a usable format.
Data loading	It is the process of loading the data into a storage system.
Refreshing	It is the process of updating the data.

Important: Data cleaning and transformation are crucial to improving data quality and data mining results.

Implementation Of a Data Warehouse:

Implementing a DWH entails creating and deploying a DWH to collect and arrange company data for analytical querying and reporting.

In other words, DWH implementation refers to building and implementing a DWH system in a company. Data warehousing is one of the most critical components of an organization’s business intelligence process. The deployment of data warehousing necessitates a set of procedures that must be followed meticulously.

It aids in developing a path or road map that we must follow to reach the goals and objectives we have set forth. There’s a reasonable probability that the project will fail if there’s no good planning. While planning, we need to consider the technical, launch, and user requirements. Some important considerations are as follows:

– Data backup
– Restoration and recovery of data
– Data sources to be used
– Combining data
– Storing data
– Cost estimations

2. Data Gathering:

Although data is readily available, not all of it is useful to a business. Data gathering is collecting data from various sources to be utilized for data analysis and reporting. Some things to consider are:

– Data quality
– Data joining
– The health of the ETL tools

Data analysis is creating and extracting relevant insights from a day’s worth of data. Most things to consider during this step are the same as data gathering. Others are:

– Sources from which the data is collected
– ETL tools and process
– Processes to analyze the data
– Testing

The greater the number of insights, the greater the efficiency of business decisions, and these judgments will determine the organization’s future.

– Efficiency of data gathering
– Efficiency of data analysis
– Metadata
– Factors affecting the decisions

Different Approaches for Implementing a Data Warehouse:

There are usually two types of approaches used to implement a DWH: the Kimball approach and the Inmon approach.

A DWH can be designed and developed using the Kimball approach, which consists of established procedures, processes, and techniques. The Kimball data model, created by Ralph Kimball, takes a bottom-up approach to DWH architecture design, with data marts generated first based on business requirements. The principal data sources are then examined.

An Extract, Transform, and Load (ETL) tool retrieves various data formats from multiple sources and loads them into the relational database server’s staging area. After data is uploaded to the DWH’s staging area, the following step is to load it into a de-normalized dimensional DWH model. This architecture divides data into fact tables (numerical transactional data) and dimension tables (reference information that supports facts).

The star schema refers to the combination of a fact table and multiple-dimensional tables. Kimball dimensional modeling enables users to build a variety of star schemas to meet various reporting requirements. The benefit of a star structure is that simple dimensional-table queries may be executed instantly.

Kimball’s DWH lifecycle approach recommends conformed data dimensions to combine data. Within a DWH, it can be found as a primary dimension table shared across different fact tables (such as customer and product) or as the same dimension tables in several Kimball data marts. This ensures that a single piece of data is used consistently across all facts.

ADVANTAGES:

– The abstraction is at a higher level, so the DWH architecture requires a shorter time to implement
– Low initial cost
– Can be set up immediately
– The demand for data integration is specific to the business area in question

DISADVANTAGES:

– The idea of a single source of truth is lost
– Irregularities can occur
– Performance issues
– Incorporating vast amounts of old data into a data warehouse is time-consuming.

Bill Inmon, known as the “Father of Data Warehousing,” devised the idea of creating a DWH that begins with the creation of a corporate DWH data model, which specifies the primary topic areas and entities that the company deals with, such as customers, products, and vendors. According to Bill Inmon, a DWH is a “subject-oriented, non-volatile, integrated, time-variant collection of data in support of management’s choices.”

The normalized form is used to develop entity structure in the Bill Inmon design approach, reducing data repetition to the greatest extent possible. As a result, business needs are clearly identified, and data update irregularities are avoided. Furthermore, the advantage of this top-down approach in database architecture is that it is resistant to business changes and provides a dimensional view of data across data marts.

The physical model is then built, which adheres to the normalized structure. Bill Inmon’s model establishes a single source of truth for the entire organization. The model’s normalized structure makes data loading easier. However, querying with this setup is difficult due to the various tables and linkages.

ADVANTAGES:

– Data is integrated
– Low data redundancy
– Simplifies business processes
– Easier to update
– Greater flexibility

DISADVANTAGES:

– More complex
– Initial setup and delivery are time-taking
– Additional ETL operations required
– Requires experts

Wrap Up

Data warehouses are very crucial when it comes to big data management and analytics. It acts as the backbone of the infrastructure that lets you store and retrieve all the data. Not only do they provide you solutions to important problems such as data integration and ETL, but they’re also quite good at storing huge chunks of structured data efficiently.

Today in this article, we have learned what a data warehouse is and explored it in detail. We have also seen the characteristics of a data warehouse and the different approaches it can be implemented in. The Kimball and Inmon approaches are among the most used approaches in the industry, but feel free to explore other approaches as well.

That’s it for today! Hope you enjoyed reading the article.