Data quality is a measure of how well business data practices satisfy standards. Good and reliable data can be used to increase efficiency, support decision-making and boost profitability.
Poor data quality leads to wasted time, conflicting information, bad decisions and a massive decrease in efficiency. Many organizations implement strict data controls at the point of entry, but quite often this is not enough. For example, when data is loaded into the data warehouse or moved from one application to another, additional validation and transformation rules must be applied.
The primary objective of any data integration solution is to assemble data from one or more data sources, validate it and transform it into a standard format.
There are three major steps in implementing a data quality strategy:
Advanced ETL Processor is able to check any data, including date formats, postcodes and phone numbers, and can validate values against a list, among other checks. It has more than 190 data validation functions and can be extended using regular expressions. It is an enterprise data integration solution that lets you quickly validate and process large volumes of data while preserving and enhancing data quality.
More about data validator...
This article was created as an answer to one of our ETL forum posts; we hope it will help users in a similar situation.
"The scenario is based on files landing in a monitored directory. At this stage, I am not worried about volume or frequency. But will be handling each file individually for internal reasons. Each file has a unique name but a common format of xxx-yyyymmddhhmmssnnnn.ext
"xxx" represents a client prefix (eg. abc, bcd, cde). The format of each file for the same client will be the same. But the format is different for each client. The file extension '.ext' will generally be .csv, .xml, or even .xls.
The file transformations between clients will be different. However, the output is always the same format and the target is a SQL Server database table.
I am developing each of the transformations in isolation, but with the same basic structure.
My idea is to have a single Monitor looking at a single folder for *.*, which then executes a single Package based on the creation of a file. The Monitor passes the file to the Package, which then does some environment management, puts the source file in a staging location and then, based on the file prefix, conditionally (probably through a CASE action) calls the associated transformation, loads to SQL Server and then cleans up."
The monitor (Directory Monitor) is a Windows service that constantly monitors a directory for changes. When a change happens, it adds a predefined action to the execution queue. The monitor passes the file name into the action as a variable; the variable name is <Directory Monitor File Name>.
This script is the starting point: it extracts the first three characters of the file name. The result is stored in the '<ClientID>' variable.
var
  FileName: String;
  ClientID: String;
begin
  // Read the file name passed in by the Directory Monitor
  FileName := GetVariable('<Directory Monitor File Name>');
  // The first three characters of the file name are the client prefix
  ClientID := UpperCase(SubString(FileName, 1, 3));
  SetVariable('<ClientID>', ClientID);
end;
This step checks if the client directory exists; if it does not, it creates it together with its subdirectories.
Every transformation has an ID. We use this ID to execute a specific transformation for a specific customer.
This table holds the mapping between ClientID and TransformationID:
CREATE TABLE [dbo].[Transformations](
  [ClientID] [char](3) NOT NULL,
  [TransformationID] [int] NULL,
  CONSTRAINT [PK_Transformations] PRIMARY KEY CLUSTERED ([ClientID] ASC)
  WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
The following SQL script inserts a record for missing clients:
insert into Transformations ([ClientID], [TransformationID])
values ('<ClientID>', 0)
Once the file is ready for processing, it is moved into the Staging Area.
This step executes the following SQL:
Select TransformationID
from Transformations
where ClientID = '<ClientID>'
The result is saved into the <TransformationID> variable.
Upon completion, the file is moved into the Processed Area.
This step acts as the default error handler.
This solution is not database-specific.
Most databases can generate DDL for any object, but not many of them allow generating INSERT statements for table data. The workaround is to use ETL tools to transfer data across servers. However, there is still a need to generate INSERT statements from tables for porting data. The simplest example is when a small or large amount of data needs to be taken out on removable storage media and copied to a remote location; INSERT VALUES statements come in handy here.
There are a number of scripts available to perform this data transformation task. The problem with those scripts is that all of them are database-specific and they do not work with text files.
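For illustration, here is a minimal sketch of the string-concatenation approach such scripts typically take on SQL Server; the Customers table and its columns are hypothetical:

-- Build one INSERT VALUES statement per row of a hypothetical Customers table.
-- QUOTENAME wraps the value in single quotes and escapes embedded quotes;
-- NULL handling is omitted for brevity.
SELECT 'INSERT INTO Customers (CustomerID, CustomerName) VALUES ('
       + CAST(CustomerID AS varchar(10)) + ', '
       + QUOTENAME(CustomerName, '''') + ');'
FROM Customers;

Because quoting rules, data types and system functions differ between database engines, a script like this has to be rewritten for every platform, which is exactly the limitation described above.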
Advanced ETL Processor gives you a universal solution for creating Insert statements. It works with any database or file.
Advanced ETL Processor generating insert statements
Load data into the data warehouse today!
There are several stages in the process:
Data can come in various formats. It can be extracted from the source database directly, or it may be loaded from files. When we extract data directly, all we need to do is check that the connection is working; this is usually done automatically by the ETL automation tool. In the case of files, we need to obtain the files first. They might be stored on a remote FTP server or somewhere on the web. Those files have to be copied to a location where they can be accessed by the ETL tool.
Some ETL tools offer complete automation of business processes, including full support for file operations.
SQL can be used to extract the data from a database, or a text parser can be used to extract the data from fixed-width or delimited files. Extracting data often involves the transfer of large amounts of data from source operational systems. Such operations can impose significant processing loads on the databases involved and should be performed during a period of relatively low system load or overnight.
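As a simple illustration, an extract query can limit itself to rows changed since the previous run; the Orders table and LastModified column below are assumptions made for the example:

-- Pull only the rows changed since the previous extract window.
SELECT OrderID, CustomerID, OrderDate, Amount, LastModified
FROM dbo.Orders
WHERE LastModified >= '20240101'   -- previous run's cut-off
  AND LastModified <  '20240102';  -- current run's cut-off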
Data cleansing is the process of checking data against a predefined set of rules.
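For example, a cleansing rule might flag rows whose postcode does not match the expected shape; the staging table, column and rough pattern below are assumptions for the sketch:

-- Flag rows failing a very rough UK-postcode shape check.
SELECT *
FROM   staging.Customers
WHERE  Postcode NOT LIKE '[A-Z]% [0-9][A-Z][A-Z]';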
A lot of data transformations can be performed during the process of extracting data from the source systems. However, there are often additional tasks to execute before loading the data into the data warehouse: for example, reconciling inconsistent data from heterogeneous data sources after extraction, completing other formatting and cleansing tasks, and generating surrogate keys.
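Surrogate keys are typically generated by the database itself; a minimal sketch on SQL Server, using a hypothetical DimCustomer table, might look like this:

-- The surrogate key is an IDENTITY column, independent of the business key.
CREATE TABLE dbo.DimCustomer (
    CustomerKey  int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key generated by the database
    CustomerID   char(10)     NOT NULL,          -- business (natural) key from the source system
    CustomerName varchar(100) NULL
);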
We are constantly working on improving our ETL software.
In the latest version we've introduced the file system as a data source, so it is now possible to perform the following ETL transformations:
Advanced ETL Processor Screenshot
There are several types of dimensions which can be used in a data warehouse:
Type 0: It is not really a slowly changing dimension: we just load the data and keep it as it is. A typical example would be a list of postcodes.
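In the simplest case the dimension is just loaded as-is; a small sketch, with hypothetical staging and dimension tables:

-- Type 0: load the reference data once and leave it alone.
INSERT INTO dbo.DimPostcode (Postcode, Town)
SELECT Postcode, Town
FROM   staging.Postcodes;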
Type 1: Overwrite the old value with the new value. This is very useful when dealing with corrections. For example, the dimension holds a list of countries, some of which were misspelt, so we want to correct them. We don't care about history in this case because the data was incorrect in the first place.
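A minimal sketch of this overwrite-in-place approach; the DimCountry table and values are assumptions made for the example:

-- Type 1: simply overwrite the incorrect attribute; no history is kept.
UPDATE dbo.DimCountry
SET    CountryName = 'United Kingdom'
WHERE  CountryName = 'Untied Kingdom';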
Type 2: Create a new record in the dimension with a new primary key. For example, when a customer changes address we create a new record. Sometimes additional fields are populated as well, such as a current record flag or an effective date.
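A rough sketch of this pattern, assuming a DimCustomer table with a surrogate key plus IsCurrent and EffectiveDate columns (all names are made up for the example):

-- Type 2: close off the old record, then insert a new one with a new surrogate key.
UPDATE dbo.DimCustomer
SET    IsCurrent = 0
WHERE  CustomerID = 'C001'
  AND  IsCurrent  = 1;

INSERT INTO dbo.DimCustomer (CustomerID, CustomerName, Address, IsCurrent, EffectiveDate)
VALUES ('C001', 'John Smith', '10 New Street', 1, GETDATE());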
Type 3: Overwrite the old value with the new value, and add additional data to the table, such as the effective date of the change. This type of dimension can be used, for example, for loans: when a loan is fully paid, the status is overwritten and the date of the change is recorded.

Now let's have a look at practical ways of populating dimension tables:
Check if the record exists; if not, insert a new record:
For every incoming record:
- Count the number of records matching the key
- If Record Count <> 0, update the Customers table
- Otherwise, insert a new record
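In T-SQL terms this is roughly the following; the Customers table, its columns and the variable values are assumptions for the sketch:

-- Naive load: look up each incoming key, then update or insert accordingly.
DECLARE @CustomerID   char(10)     = 'C001',
        @CustomerName varchar(100) = 'John Smith',
        @Address      varchar(200) = '10 New Street';

IF EXISTS (SELECT 1 FROM dbo.Customers WHERE CustomerID = @CustomerID)
    UPDATE dbo.Customers
    SET    CustomerName = @CustomerName,
           Address      = @Address
    WHERE  CustomerID   = @CustomerID;
ELSE
    INSERT INTO dbo.Customers (CustomerID, CustomerName, Address)
    VALUES (@CustomerID, @CustomerName, @Address);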
We might be performing unnecessary updates, and for large, wide tables this could be very slow. One way of avoiding it is to use a CRC (MD5) check: if the CRC is the same, no changes were made to the record and therefore there is no need to update it. This gives two possible patterns, sketched below.
Pattern 1 (insert new or changed records):
- Deduplicate the data
- Calculate the record CRC
- If this CRC exists in the database, discard the record
- If not, create a new record
Pattern 2 (update changed records in place):
- Deduplicate the data
- Calculate the record CRC
- If this CRC exists in the database, do nothing
- If not, update the record with the new data
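A rough T-SQL illustration of the CRC idea for the second pattern, using HASHBYTES to compute an MD5 over the attribute columns; the Customers table, its RecordCRC column and the sample values are assumptions:

-- Compare a hash of the incoming attributes against the stored hash;
-- only touch the row when the hash differs.
DECLARE @CustomerID   char(10)     = 'C001',
        @CustomerName varchar(100) = 'John Smith',
        @Address      varchar(200) = '10 New Street';

DECLARE @CRC varbinary(16);
SET @CRC = HASHBYTES('MD5', CONCAT(@CustomerName, '|', @Address));

IF NOT EXISTS (SELECT 1
               FROM  dbo.Customers
               WHERE CustomerID = @CustomerID
                 AND RecordCRC  = @CRC)
    UPDATE dbo.Customers
    SET    CustomerName = @CustomerName,
           Address      = @Address,
           RecordCRC    = @CRC
    WHERE  CustomerID   = @CustomerID;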