This article describes the stages of loading data into a data warehouse.
There are several stages in the process:
Preparation for Extraction
Data can come in various formats. It can be extracted directly from the source database, or it may be loaded from files. When we extract data directly, all we need to do is check that the connection is working.
This is usually done automatically by the ETL automation tool. In the case of files, we need to obtain the files first. They might be stored on a remote FTP server or somewhere on the web. Those files have to be copied to a location where they can be accessed by the ETL tool.
Some ETL tools offer complete automation of business processes, including full support for file operations.
Extraction
SQL can be used to extract data from a database, and a text parser can extract data from fixed-width or delimited files.
Extracting data often involves the transfer of large amounts of data from source operational systems. Such operations can impose significant processing loads on the databases involved and should be performed during a period of relatively low system load or overnight.
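Both extraction paths described above can be sketched as follows. The table name, column names, and pipe delimiter are illustrative assumptions; a real extract would be driven by the source system's actual schema and file layout.

```python
# Sketch of the extraction step: pull rows from a source database
# with plain SQL, and parse a delimited extract file with the csv module.
import csv
import io
import sqlite3

def extract_from_db(conn: sqlite3.Connection) -> list:
    """Extract source rows with a SQL query."""
    return conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()

def extract_from_delimited(text: str, delimiter: str = "|") -> list:
    """Parse a delimited extract file (first line = header) into dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
```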
Cleansing and Transforming Data
Data cleansing is the process of checking data against a predefined set of rules.
For example:
- Checking date formats
- Checking field length
- Pattern validation
- Data type checks
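The four checks listed above can be expressed as a small rule-based validator. Each check below mirrors one rule from the list; the specific field names, the ISO date format, and the SKU pattern are assumptions made for the example.

```python
# Minimal rule-based cleansing: return the list of rule violations
# for one record (an empty list means the record is clean).
import re
from datetime import datetime

def validate_record(rec: dict) -> list:
    errors = []
    try:  # date format check (yyyy-mm-dd assumed)
        datetime.strptime(rec.get("order_date", ""), "%Y-%m-%d")
    except ValueError:
        errors.append("bad date format")
    if len(rec.get("country_code", "")) != 2:  # field length check
        errors.append("bad field length")
    if not re.fullmatch(r"[A-Z]{3}-\d{4}", rec.get("sku", "")):  # pattern validation
        errors.append("bad pattern")
    if not isinstance(rec.get("quantity"), int):  # data type check
        errors.append("bad data type")
    return errors
```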
Many data transformations can be performed while extracting data from the source systems. However, there are often additional tasks to execute before loading the data into the data warehouse: reconciling inconsistent data from heterogeneous sources, completing other formatting and cleansing tasks, and generating surrogate keys.
Some typical data transformations include:
- Joining multiple fields into one field (Address 1 + Address 2 + Address 3).
- Generating calendar tables.
- Mapping data from one representation to another, such as Female to 1 and Male to 0.
- Transforming data from multiple representations to a single representation, such as a common format for telephone numbers.
- Creating surrogate keys.
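The transformations listed above can be sketched in one small function: concatenating address fields, mapping coded values, normalizing phone numbers to a common format, and assigning surrogate keys. The field names and the digits-only phone format are assumptions for illustration.

```python
# Sketch of typical transformations applied before loading.
import itertools
import re

_surrogate = itertools.count(1)  # simple sequential surrogate-key generator

def transform(rec: dict) -> dict:
    return {
        # generated surrogate key
        "surrogate_key": next(_surrogate),
        # join Address 1 + Address 2 + Address 3 into one field, skipping blanks
        "address": " ".join(
            p for p in (rec.get("address1"), rec.get("address2"), rec.get("address3")) if p
        ),
        # map one representation to another (Female -> 1, Male -> 0)
        "gender": {"Female": 1, "Male": 0}[rec["gender"]],
        # normalize many phone formats to a single representation (digits only)
        "phone": re.sub(r"\D", "", rec["phone"]),
    }
```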
Loading the Data
Once all the data has been cleansed and transformed into a structure consistent with the data warehouse requirements, it is ready for loading into the data warehouse.
The initial load of the data warehouse consists of populating the tables in the data warehouse schema and then checking that the data is ready for use.
After the data has been loaded into the data warehouse database, verify referential integrity between dimension and fact tables to ensure that all records relate to appropriate records in other tables. The DBA should verify that every record in a fact table relates to a record in each dimension table that will be used with that fact table.
Verifying integrity in the reverse direction is not necessary; however, in some cases it might be desirable to remove unrelated data, for example when a dimension table has several times more records than the fact table.
Dealing with Null and Empty Strings
Most queries that retrieve data from the data warehouse use inner joins between the fact and dimension tables. If a dimension table contains data that does not apply to all facts, you must include a record in the dimension table that the remaining fact table rows can relate to. For example, the surrogate key 0 in the dimension table can be used to represent a null value or an empty string.
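A minimal sketch of this technique: during the load, fact rows whose dimension value is null or empty are assigned the reserved surrogate key 0, so inner joins still return them. The function name and lookup structure are assumptions for the example.

```python
# Map a source value to its dimension surrogate key; null and empty
# values resolve to the reserved "unknown member" key 0.
def resolve_dim_key(value, lookup: dict) -> int:
    if value is None or value == "":
        return 0  # reserved row in the dimension table for null/empty
    return lookup[value]
```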
To verify referential integrity in a star schema, a simple SQL query can count the rows returned when all appropriate dimension tables are joined to the fact table using inner joins. The number of rows returned by this query should match the number of rows in the fact table. With large data warehouses, this check might have performance implications and should be executed outside of normal working hours.
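The row-count check described above can be demonstrated with a toy star schema. Here a tiny sqlite3 database with a single dimension stands in for a real warehouse; the table and column names are illustrative assumptions.

```python
# Compare the fact table's row count with the row count of an
# inner join against its dimension; a mismatch indicates fact rows
# with no matching dimension record.
import sqlite3

def referential_integrity_ok(conn: sqlite3.Connection) -> bool:
    fact_count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
    joined_count = conn.execute(
        """SELECT COUNT(*)
           FROM fact_sales f
           JOIN dim_customer d ON f.customer_key = d.customer_key"""
    ).fetchone()[0]
    return fact_count == joined_count
```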