Data can come in various formats. It can be extracted from the source database directly or it may be loaded from the files. When we extract data directly all we need to is to check if connection is working.
This usually done automatically by ETL automation tool. In case of files, we need to obtain the files first. They might be stored on the remote ftp server or somewhere in the web. Those files have to be copied to the location where the can be accessed by the ETL tool.
Some ETL Tools offer complete automation of business processes including full support for file operations
Data cleansing is a process of checking data against predefined set of rules
For example:
Most queries that retrieve data from the data warehouse use inner joins between the fact and dimension tables. If you use a dimension table containing data that does not apply to all facts, you must include a record in the dimension table that can be used to relate to the remaining fact table values. For example, for null value 0 can be used as surrogate key of dimension table and for empty string 1.
To verify referential integrity in a star schema simple SQL query can be used to count the rows returned when all appropriate dimension tables are joined to the fact table using inner joins. The number of rows returned by this query should match the number of rows in the fact table. With large data warehouses it might have some performance implications and should be executed outside of normal working hours
More Information:
This article is not database specific
Basic Star Schema of a retail shop

In the previous article we have described the ways of calculating SQL query execution time.
Let's think about what can be done about Query performance.
We have only two options
There are a lot of ways of improving hardware. You can add memory processors increase your hardrive size etc. But quite often it does not help
We have seen some configurations with 64 gigs of memory and 32 processors
With all 32 processors doing nothing but the database is running very slow.
All this because the DBA decided to use raid 5 for better protection.
The performance of you disks this the most important part of the data warehouse.
It defines it. It tells how long do you have before the system becomes unusable due to pour performance
HDtach is very useful utility for checking disk performance, you would be very surprised when compare the laptop with the servers. Modern laptops quite often much faster.
The bottom line is:
Let's go back to the Star Schema Diagram
STATE CODE field is 5 bytes long
Everybody knows that the US state abbreviation is always two characters long
if we change our definition from nvarchar(2) to nchar(2) field size changes from 5 bytes to 4
Next well know thing, that there are no international characters in US state abbreviation. There is no need to use Unicode and we can change field type from nchar(2) to char(2)
The field length is 2 bytes now.
Everybody knows that there are only 52 states in US
What we can do next, is to move STATE CODE out of the fact table in to the dimension and use STATE_ID BYTE in fact table as state identifier.
By using common sense we were able to reduce the field size five times. The same technic can be applied to the rest of fields.
Make sure your disk fast and fact table record size as small as possible
Ever wonder why it takes so long to execute the Query? It works fine now but how long will it take to run in 3 years?
This article is not database specific
Below is the basic Star schema of a retail shop
It has one fact table and 3 dimensions

Let's have a close look at the Fact table structure and particularly fields types:
The STATE_CODE field: The type is nvarchar most likely it it uses 2 bytes to store one character, plus it has nil as terminator so in total it is 5 bytes
Some of the databases store the length of string, which could take up to 2-4 bytes
Lets do the same for the rest of fields| Field Name | Field Length |
| STATE_CODE | 5 |
| PRODUCT_CODE | 21 |
| PERIOD_CODE | 13 |
| AMOUNT | 6 |
| Total row size | 45 |
Select STATE_CODE, PRODUCT_CODE, PERIOD_CODE, Sum(AMOUNT) as AMOUNT
From FACT_TABLE
GROUP BY STATE_CODE, PRODUCT_CODE, PERIOD_CODE
To answer this question we need to know our HDD performance and number of rows in the table
HDtach can easily measure the hard drive performance
Let assume that we have 10 million rows and our RAID can do 50mb per sec sequential read
450 000 000 / (50 *1048576) = 8.58 sec
Those people who have actually run the queries against the table with 10 million records would say that this number is far too optimistic.
We believe that practical aproach gives the best results:
sp_spaceused ‘Tablename’
Than use the formula table size/hdd performance to calculate the time
and the last step ran the sql during peak times and low times
and compare the results.
If you have historical data this approach would also help you estimate how long will it take to run the SQL in 3 years.
This solution is not database specific
Most data bases can generate DDL for any object but not a lot of them allow generation of INSERT statements for the table data.
The workaround is to use of ETL Tools for transferring data across servers. However, there exists a need to generate INSERT statements from the tables for porting data.
Simplest example is when small or large amount of data needs to be taken out on a removable storage media and copied to a remote location, INSERT..VALUES statements come handy.
There is a number of scripts available to perform this data transformation task. The problem with those scripts that all of them database specific and they do not work with textiles
Advanced ETL Processor gives you an universal solution for creating Insert statements. It works with any database or file.
Advanced ETL Processor generating insert statements
ODBC stands for Open Data Base Connectivity, a standard database access method developed by Microsoft Corporation.
It makes it possible to access any data from any application, regardless of which database management system (DBMS) is handling the data. ODBC manages this by inserting a middle layer, called a database driver, between an application and the DBMS.
The purpose of this layer is to translate the application's data queries into commands that the DBMS understands. For this to work, both the application and the DBMS must be ODBC-compliant -- that is, the application must be capable of issuing ODBC commands and the DBMS must be capable of responding to them.
A DSN (Data Source Name) is an identifier which defines a data source for an ODBC driver.
For example DSN for MS SQL Server would consists of:
This information is stored in windows registry and this fact introduces additional work for system administrators. Imagine following situation you have CRM application and you use ODBC DSN to connect to it. It means that on every PC the Admistrator must create ODBC DSN manually. The problem becomes worse if you have a lot of different databases and large amount of users.
One of the benefits of using DB Software Laboratory ETL tools is the support for ODBC connection strings.
Using ODBC connection string gives you the ability to connect to the database without creating an ODBC DSN.
For example to connect to MS SQL Server 2008 you may use the following connection string.
Driver={SQL Server Native Client 10.0};Server=myServerAddress;Database=myDataBase;Uid=myUsername;Pwd=myPassword;
You may find more information about connection strings at http://www.connectionstrings.com
What is Dirty Data anyway?
Other common causes of dirty data are:
Most of the problems comes when working with Text or Excel Files
![]()
In order to load data we need to make sure that format of Amount and Order_Date fields is consistent. For amount field we need to get rid of dollars, pounds and commas. It could easily done by using replace function of Advanced ETL Processor. ![]()
For ORDER_DATE field we will apply multiple date formats. Result of Date Format function is a string in 'YYYY-MM-DD HH:NN:SS.ZZZ' format ![]() Full Data Transformation:![]() Result of Data Transformation:![]()
This is just a small example how Advanced ETL Processor can help you to validate and transform data. |
||
About Advanced ETL Processor |
||
|
|
Look at the MS SQL Server code below and think what it will return
Hint:
There result is last two characters of a middle part -‘CD’
![]() |
|
|
|
|
|
|
|
|
![]() |
![]() |
You guys have got imagination and vision of how things work in real-life.
Tim Jonson,
Project Manager