Data Warehouse

  • Data quality

    What is data quality and why is it important?

    It is a measure of how well business data practices satisfy standards. Good, reliable data can be used to increase efficiency, support decision making, and boost profitability.

    Poor data quality leads to wasted time, conflicting information, bad decisions, and a massive decrease in efficiency.

    Many organizations implement strict data controls at the point of entry. However, quite often this is not enough. For example, when data is loaded into the data warehouse or moved from one application to another, additional validation and transformation rules must be applied.

    Data Quality Strategy.

    The primary objective of any data integration solution is to assemble data from one or more data sources, validate it, and transform it into a standard format.

    There are three major steps in implementing a data quality strategy:

    • Profiling - helps you check whether a source data file or table extract meets the expected format, such as the file encoding and the format and order of fields
    • Cleansing - once data successfully passes profiling, it is still necessary to cleanse it, for example by removing empty values or irrelevant information (a simple SQL sketch of such a rule follows this list)
    • Auditing - one of the most important parts of data integration; it gives IT departments the history of data transformation, profiling, and cleansing, which can be used in import failure investigations
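
    As a simple illustration of a cleansing rule, expressed here as plain SQL rather than through the tool (the StagingCustomers table is hypothetical), rows with empty key values can be removed before loading:

    -- Remove staging rows with an empty customer key before loading (hypothetical table)
    DELETE FROM StagingCustomers
    WHERE CustomerID IS NULL
       OR LTRIM(RTRIM(CustomerID)) = '';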

    Advanced ETL Processor Data validator

    Advanced ETL Processor is able to check any data, including date formats, postcodes and phone numbers, or validate against a list of values. It has more than 190 data validation functions, and it can be extended by using regular expressions. It is an enterprise data integration solution that lets you quickly validate and process large volumes of data while preserving and enhancing data quality.

    Validation Editor

    Simple data cleansing rule

    Simple Data validation rule

    More about data validator...

  • Example of data warehouse workflow

    Background:

    This article was created as an answer to one of our ETL forum posts; we hope it will help other users in a similar situation.

    "The scenario is based on files landing in a monitored directory. At this stage, I am not worried about volume or frequency. But will be handling each file individually for internal reasons. Each file has a unique name but a common format of xxx-yyyymmddhhmmssnnnn.ext

    "xxx" represents a client prefix (eg. abc, bcd, cde). The format of each file for the same client will be the same. But the format is different for each client. The file extension '.ext' will generally be .csv, .xml, or even .xls.

    The file transformations between clients will be different. However, the output is always the same format and the target is a SQL Server database table.

    I am developing each of the transformations in isolation, but with the same basic structure.

    My idea is to have a single monitor looking at a single folder for *.* That then executes a single Package based on the creation of a file. The Monitor passes the file to the Package that then does some environment management, put the source file in a staging location, then based on the file prefix conditionally (Probably through a CASE action) call the associated transformation and load to the SQL server and then clean up."

    NOTES:

    The monitor (Directory Monitor) is a Windows service which constantly monitors a directory for changes. When a change happens, it adds a predefined action to the execution queue. The monitor passes the file name into the action as a variable named <Directory Monitor File Name>.

    Solution:

    Example of data warehouse workflow

    "Get ClientID" script: extracts first three characters of the file name. The Result is stored in '<ClientID>' variable.

    var
      FileName: String;
      ClientID: String;
    begin
      { File name supplied by the Directory Monitor }
      FileName := GetVariable('<Directory Monitor File Name>');
      { The client prefix is the first three characters of the file name }
      ClientID := UpperCase(SubString(FileName, 1, 3));
      SetVariable('<ClientID>', ClientID);

      Result := true;
    end;

    Check Client directory: checks whether the client directory exists and, if it does not, creates it together with its subdirectories.

    Every transformation has an ID. We use this ID to execute a specific transformation for a specific customer.

    The Transformations table holds the mapping between ClientID and TransformationID:

    CREATE TABLE [dbo].[Transformations]
    (
        [ClientID] [char](3) NOT NULL,
        [TransformationID] [int] NULL,
        CONSTRAINT [PK_Transformations] PRIMARY KEY CLUSTERED ([ClientID] ASC)
        WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
              ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    GO

    Lookup

    "Insert Into Transformations" sql script inserts a record for missing clients:

    INSERT INTO Transformations (
        [ClientID],
        [TransformationID])
    VALUES ('<ClientID>', 0)
    GO

    Once the file is ready for processing, it is moved into the Staging Area.

    Get transformation ID: executes the following SQL:

    SELECT TransformationID
    FROM Transformations
    WHERE ClientID = '<ClientID>'

    The result is saved into the <TransformationID> variable.

    Execute Transformation script:

    begin
      { Run the transformation whose ID was looked up for this client }
      ExecuteObject(GetVariable('<TransformationID>'));
      Result := true;
    end;

    Upon completion, the file is moved into the Processed Area.

    Directory Structure:

    Directory Structure

    Variable as Mask:

    Mask

  • Generating Insert Statements

    This article explains how to generate INSERT statements from any data source.

    This solution is not database-specific.

    Most databases can generate DDL for any object, but few of them allow generating INSERT statements for table data.
    The usual workaround is to use ETL tools for transferring data across servers. However, there is still a need to generate INSERT statements from tables for porting data.
    The simplest example is when a small or large amount of data needs to be taken out on removable storage media and copied to a remote location; INSERT ... VALUES statements come in handy.
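
    As a minimal sketch of what the generated output might look like (the table, columns, and values below are made up for illustration), note how identifiers with spaces are quoted and NULLs are preserved:

    INSERT INTO [dbo].[Customer List] ([Customer ID], [Customer Name], [Created], [Notes])
    VALUES (1, 'Alpha Ltd', '2011-03-14', NULL);
    INSERT INTO [dbo].[Customer List] ([Customer ID], [Customer Name], [Created], [Notes])
    VALUES (2, 'Beta & Co', '2011-04-02', 'Imported from a CSV file');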

    There are a number of scripts available to perform this data transformation task. The problem with those scripts is that all of them are database-specific and they do not work with text files.

    Advanced ETL Processor gives you a universal solution for creating Insert statements. It works with any database or file.

    Benefits and advantages

    • Data from both tables and views can be scripted
    • Support for text files
    • Table names and column names with spaces are handled
    • All datatypes are handled
    • NULLs are gracefully handled
    • Timestamp columns are handled
    • Identity columns are handled
    • Very flexible and configurable
    • Computed columns are handled
    • You can filter the rows/lines for which you want to generate INSERTs

    Advanced ETL Processor generating insert statements

    Generating Insert Statements

     

  • Loading data into the data warehouse

    This article describes the various stages of loading data into the data warehouse.


    There are several stages in the process:

    Preparation for Extraction

    Data can come in various formats. It can be extracted from the source database directly, or it may be loaded from files. When we extract data directly, all we need to do is check that the connection is working.
    This is usually done automatically by the ETL automation tool. In the case of files, we need to obtain the files first. They might be stored on a remote FTP server or somewhere on the web. Those files have to be copied to a location where they can be accessed by the ETL tool.

    Loading Data into the Datawarehouse

    Some ETL tools offer complete automation of business processes, including full support for file operations.

    Extraction.

    SQL can be used to extract the data from a database, or a text parser can be used to extract the data from fixed-width or delimited files.
    Extracting data often involves the transfer of large amounts of data from source operational systems. Such operations can impose a significant processing load on the databases involved and should be performed during a period of relatively low system load, or overnight.
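
    As a minimal sketch (the Orders table, its columns, and the cut-off date are hypothetical), an incremental extract would typically select only the rows changed since the last load:

    -- Extract only the rows changed since the last load (hypothetical table and columns)
    SELECT OrderID, CustomerID, OrderDate, Amount
    FROM Orders
    WHERE LastUpdated >= '2011-01-01'
    ORDER BY OrderID;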

    Cleansing and Transforming data.

    Data cleansing is the process of checking data against a predefined set of rules.

    For example:

    • Checking date formats
    • Checking field length
    • Pattern validation
    • Data type checks

    A lot of data transformations can be performed during the process of extracting data from the source systems. However, there are often additional tasks to execute before loading the data into the data warehouse: for example, reconciling inconsistent data from heterogeneous data sources after extraction, completing other formatting and cleansing tasks, and generating surrogate keys.
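
    As a minimal sketch of such checks, written as plain SQL Server SQL against a hypothetical StagingOrders table (Advanced ETL Processor applies these rules through its validator instead), rows failing basic cleansing rules can be listed like this:

    -- List staging rows that fail basic cleansing rules (hypothetical table and columns)
    SELECT StagingID, OrderDate, PostCode, CustomerName
    FROM StagingOrders
    WHERE ISDATE(OrderDate) = 0          -- bad date format
       OR LEN(PostCode) > 10             -- field too long
       OR CustomerName IS NULL
       OR CustomerName = '';             -- missing mandatory value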

  • Loading Images into Database

    We are constantly working on improving our ETL Software

    A file system as a data source

    In the latest version we've introduced the file system as a data source, so it is now possible to:

    • Load images into a database (a SQL sketch follows this list)
    • Process separate XML files
    • Process HL7 messages from a folder
    • Process EDI messages
    • Load the contents of a directory into the database
    • Load users' photos
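
    For illustration only, this is how a single image file can be loaded into a blob column using plain SQL Server syntax (the UserPhotos table and the file path are hypothetical; Advanced ETL Processor does this through the file system data source instead):

    -- Load one image file into a varbinary(max) column (hypothetical table and path)
    INSERT INTO dbo.UserPhotos (UserName, Photo)
    SELECT 'jsmith', BulkColumn
    FROM OPENROWSET(BULK 'C:\Photos\jsmith.jpg', SINGLE_BLOB) AS img;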

    A file system as a data target

    Perform the following ETL transformations:

    • Export data from blob fields
    • Save every record as a separate XML file
    • Export email messages
    • Generate EDI messages

    Advanced ETL Processor Screenshot

    Loading Files into the Database
  • Slowly changing dimensions

    There are several types of dimensions which can be used in a data warehouse.

    Type 0 dimension:

    It is not really a slowly changing dimension: we just load the data and keep it as it is. A typical example would be a list of postcodes.

    Type 1 slowly changing dimension:

    Overwrite the old value with the new value. This is very useful when dealing with corrections. For example, the dimension holds a list of countries, some of which were misspelt, so we want to correct them.
    We don't care about history in this case because the data was incorrect in the first place.

    Type 2 slowly changing dimension:

    Create a new record in the dimension with a new primary key. For example, when a customer changes address we create a new record. Sometimes additional fields are populated as well, such as a current record flag or an effective date.

    Type 3 slowly changing dimension:

    Overwrite the old value with the new value, and add additional data to the table, such as the effective date of the change.
    This type of dimension can be used, for example, for loans: the record is updated once when the loan is fully paid.

    Now let's have a look at practical ways of populating dimension tables:

    Dimension table Structure

    Customer Dimension

     

    Data is coming in as a huge text file, which holds orders together with customer details.

    Customer Dimension Data

     

    Customer details are duplicated, so we have to deduplicate them first.

    Customer Dimension Transformation

    Type 0 slowly changing dimension

    Check whether the record exists; if not, insert a new record.

    Type 1 slowly changing dimension

    For every incoming record:
    Count the number of records matching the key
    Then, if the record count <> 0
    Update the Customers table
    Otherwise
    Insert a new record

    Note:

    We might be performing unnecessary updates, and for large, wide tables it could be very slow. One way of avoiding this is to use a CRC (MD5) check: if the CRC is the same, no changes were made to the record and therefore there is no need to update it. A minimal SQL sketch of the update-or-insert logic is shown below.
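
    A minimal SQL Server sketch (the Customers table, its columns, and the '<...>' placeholders for incoming values are hypothetical):

    -- Type 1: overwrite the record if the key exists, otherwise insert it
    UPDATE Customers
    SET CustomerName = '<CustomerName>',
        Address = '<Address>'
    WHERE CustomerID = '<CustomerID>';

    IF @@ROWCOUNT = 0
        INSERT INTO Customers (CustomerID, CustomerName, Address)
        VALUES ('<CustomerID>', '<CustomerName>', '<Address>');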

    Type 2 slowly changing dimension

    Deduplicate the data
    Calculate the record CRC
    If this CRC exists in the database, discard the record
    If not, create a new record
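
    A minimal SQL Server sketch of the Type 2 steps above, assuming the dimension's surrogate key is generated automatically and RecordCRC holds the hash of the incoming attribute values (table, columns, and placeholders are hypothetical):

    -- Type 2: add a new dimension record only if this CRC is not already present
    IF NOT EXISTS (SELECT 1 FROM Customers
                   WHERE CustomerID = '<CustomerID>'
                     AND RecordCRC = '<RecordCRC>')
        INSERT INTO Customers (CustomerID, CustomerName, Address, RecordCRC, EffectiveDate)
        VALUES ('<CustomerID>', '<CustomerName>', '<Address>', '<RecordCRC>', GETDATE());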

    Type 3 slowly changing dimension:

    Deduplicate the data
    Calculate the record CRC
    If this CRC exists in the database, do nothing
    If not, update the record with the new data
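
    A minimal SQL Server sketch of the Type 3 steps above (again with a hypothetical Customers table and '<...>' placeholders for incoming values):

    -- Type 3: overwrite the record only when the CRC differs, recording the change date
    UPDATE Customers
    SET CustomerName = '<CustomerName>',
        Address = '<Address>',
        RecordCRC = '<RecordCRC>',
        EffectiveDate = GETDATE()
    WHERE CustomerID = '<CustomerID>'
      AND RecordCRC <> '<RecordCRC>';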

    More Information about Advanced ETL Processor

    Online tutorials

