You are here: HomeArticlesCalculating checksum in datawarehousing

Time dimension for PosgreSQL based data warehouse

This sql script will create an populate Time dimension for PosgreSQL based data warehouse


CREATE TABLE time_dim
(
time_key integer NOT NULL,
time_value character(5) NOT NULL,
hours_24 character(2) NOT NULL,
hours_12 character(2) NOT NULL,
hour_minutes character (2)  NOT NULL,
day_minutes integer NOT NULL,
day_time_name character varying (20) NOT NULL,
day_night character varying (20) NOT NULL,
CONSTRAINT time_dim_pk PRIMARY KEY (time_key)
)
WITH (
OIDS=FALSE
);

COMMENT ON TABLE time_dim IS 'Time Dimension';
COMMENT ON COLUMN time_dim.time_key IS 'Time Dimension PK';

insert into  time_dim

SELECT  cast(to_char(minute, 'hh24mi') as numeric) time_key,
to_char(minute, 'hh24:mi') AS tume_value,
-- Hour of the day (0 - 23)
to_char(minute, 'hh24') AS hour_24,
-- Hour of the day (0 - 11)
to_char(minute, 'hh12') hour_12,
-- Hour minute (0 - 59)
to_char(minute, 'mi') hour_minutes,
-- Minute of the day (0 - 1439)
extract(hour FROM minute)*60 + extract(minute FROM minute) day_minutes,
-- Names of day periods
case when to_char(minute, 'hh24:mi') BETWEEN '06:00' AND '08:29'
then 'Morning'
when to_char(minute, 'hh24:mi') BETWEEN '08:30' AND '11:59'
then 'AM'
when to_char(minute, 'hh24:mi') BETWEEN '12:00' AND '17:59'
then 'PM'
when to_char(minute, 'hh24:mi') BETWEEN '18:00' AND '22:29'
then 'Evening'
else 'Night'
end AS day_time_name,
-- Indicator of day or night
case when to_char(minute, 'hh24:mi') BETWEEN '07:00' AND '19:59' then 'Day'
else 'Night'
end AS day_night
FROM (SELECT '0:00'::time + (sequence.minute || ' minutes')::interval AS minute
FROM generate_series(0,1439) AS sequence(minute)
GROUP BY sequence.minute
) DQ
ORDER BY 1

Based on on information provided here

http://wiki.postgresql.org/wiki/Date_and_Time_dimensions

Loading current date into database field

Typical task pupulating database field with current date can be be done in several ways

First of all we can use GetSystemVariable calculation function:

GetSystemVariable(VariableName)

GetSystemVariable returns value of 'VARIABLENAME'.

Possible values for 'VARIABLENAME' are:

COMPUTERNAME
OSUSERNAME
DBUSERNAME
BLOCKNUMBER
LINENUMBER
RECORDNUMBER
SYSTEM_DATE
SOURCE_FILE_NAME
SOURCE_TABLE_NAME

GetSystemVariable('SYSTEM_DATE') function teturns date according to the regional settings of computer.
It very important to use currect date format form the regional settings screen we can see that the format is dd/mm/yyyy

Windows Vista Regional SettingsAnother way to determine date format of GetSystemVariable('SYSTEM_DATE') is to execute it against non date field

Mapping using MS SQL Server connection with no date format

GetSystemVariable-No-FormatData preview screen
GetSystemVariable-Data-Preview

Mapping using MS SQL Server connection with currect date format

GetSystemVariable-CurrectFormat

If we specify wrong date format I won’t be able to load the data
GetSystemVariable-WrongFormat

As you can see order date field is missing.
DataPreview-Null

Another way of getting data is by using GetSystemDate function

GetSystemDate(Format):String, Returns Current system date/time in format specified

Expression-Editor-GetSystemDate
And last way is to use database functions for example getdate() for sql server

GetDateNote:

It does not work for MS SQL Server Connections use ODBC connection instead.
You may get error message saying that field must have date format. Just assign any date format.

Slow changing dimensions

There several types of dimensions which can be used in the datawarehouse

Type 0 changing dimension:

It is not really slowly changing dimension: We just load the data and keep it as it is. Typical example of it woul be list of postcodes

Type 1 slowly changing dimension:

Overwrite the old value with the new value. This is very useful when dealing with corrections. For example the dimensions holds list of countries some of them we misspelled so we want to current it
We don’t care about the history in this case because it was incorrect at the first place.

Type 2 slowly changing dimension:

Create a new record in the dimension with a new primary key. For example when customer changes the address we will create new record. Sometimes some additional fields are populated as well such current record flag, or effective date.

Type 3 slowly changing dimension:

Overwrite the old value with the new value, and add additional data to the table such as the effective date of the change.
This type of dimensions can be used for loads when loan is fully paid.

Now lets have a look at practical ways of population dimension tables:

Dimension table Structure

Customer Dimension


Data is coming in as huge text file, which holds orders together with customer details.

Customer Dimension Data

 

Customer details are duplicated so we have to deduplicate it first

Customer Dimension Transformation

Type 0 slowly changing dimension

Check if record exist if not insert new record

Type 1 slowly changing dimension

For every incoming record
Count number of records matching the key
Than if Record Count <> 0
Update Customers table
or
Insert new record

Note:

We might be performing unnecessary updates plus for large and wide tables it could be very slow. One of the ways of avoiding doing it is using CRC(MD5) for checks. If CRC is the same no changes were mage to the record and therefore there is no need to update it

Type 2 slowly changing dimension

Deduplicate the data
Calculate Record CRC
If this CRC exist in the database discard the record
If not create new record

Type 3 slowly changing dimension:

Deduplicate the data
Calculate Record CRC
If this CRC exist in the database than do nothing
If not update record with new data

More Information about Advanced ETL Processor

Online tutorials

Data quality

What is data quality and why it is important.

it is a measure of how well business data practices satisfy standards. Good and reliable data can be used to increase efficiency, support decision making, and sky rocket profitability.

Poor data quality leads to wasting time, working with conflicting information resulting in bad decisions and massive decrease in efficiency.

Many organizations implement strict data controls at the point of entry. However quite often it is not enough. For example, when data is loaded into the data-warehouse or moved from one application to another additional validation and transformation rules must be applied.

Data Quality Strategy.

The primary objective of any data integration solution is to assemble data from one or more data sources validate and transform it into standard format.

There are three major steps in implementing data quality strategy:

  • Profiling - helps you to check if a source data file or table extract meets the format such as file encoding fields format and order
  • Cleansing - Once data successfully satisfies profiling standards, it still necessary to cleanse the data for example remove empty values or irrelevant information
  • Auditing - one ot the most important parts of data integration. It gives IT departments the history of data transformation, profiling and cleansing which can be used in importer failure investigations.

Advanced ETL Processor Data validator

Advanced ETL Processor is able to check any data including date formats, post codes, phone numbers, validating against list of values etc. It has more than 190 data validation functions, plus it can be extended by using regular expressions. It is an enterprise data integration solution that lets you quickly validate and process large volumes of data while preserving and enhancing data quality.

Validation Editor

Validation Editor

Simple data cleasing rule

Simple Data validation rule

More about data validator...

IMEX=1

Solving problems with loading data from Excel files into databases

Common problem: trying to load the data from Excel file half of the data is coming as nulls, or columns with more than 255 characters are truncated

The logic behind Excel mixed data types

As partially explained here

http://support.microsoft.com/kb/257819

ODBC/MS Jet scans first TypeGuessRows to determine field type

Here how Excel ODBC/MS Jet works

(TypeGuessRows=8 IMEX=1)

In your eight (8) scanned rows, if the column contains five (5) numeric values and three (3) text values, the provider returns five (5) numbers and three (3) null values.
In your eight (8) scanned rows, if the column contains three (3) numeric values and five (5) text values, the provider returns three (3) null values and five (5) text values.
In your eight (8) scanned rows, if the column contains four (4) numeric values and four (4) text values, the provider returns four (4) numbers and four (4) null values.
In your eight (8) scanned rows all of them less than 255 characters the provider will truncate all data to 255 characters
In your eight (8) scanned rows, if the column contains five (5) values with more length than 255 the provider will returm more than 255 characters

NOTE:

Setting IMEX=1 tells the driver to use Import mode. In this state, the registry setting ImportMixedTypes=Text will be noticed. This forces mixed data to be converted to text. For this to work reliably, you may also have to modify the registry setting, TypeGuessRows=8. The ISAM driver by default looks at the first eight rows and from that sampling determines the datatype. If this eight row sampling is all numeric, then setting IMEX=1 will not convert the default datatype to Text; it will remain numeric.

     Nobody wants to load half of the data, everybody wants to load data as it is

The only way to make import from Excel work is

Set IMEX=1 in connection string

Close any programs that are running.
On the Start menu, click Run. Type regedit and click OK.
In the Registry Editor, expand the following key depending on the version of Excel that you are running:
Excel 97
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\3.5\Engines\Excel
Excel 2000 and later versions
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel

Select TypeGuessRows and on the Edit menu click Modify.
In the Edit DWORD Value dialog box, click Decimal under Base.
Set the value to 1

Open Excel file

Make sure that the cells in the first line of the table have relevant data for example

  • mixed numbers and text characters for text fields
  • only numbers for numeric fields
  • If some of the data will be longer than 255 characters make sure that first line cell has more 255 characters otherwise it will be truncated

This solution apply to all versions of MS Excel ODBC driver, Ole DB, MS Jet, .NET, DTS and SSIS

We have spent enormous amount of time trying to get it fixed. So far we were not able to find a better solution.

The way Excel import works makes it not possible to automate it. You have to modify most of excel files manually in order to load them.

This why we are no longer using ODBC/OleDB/Ms Jet for Excel connections. Our ETL solutions work currecly with Excel all the time

  • Works directly with Excel 3.0-2007
  • No ODBC, OleDB or MS Jet Required
  • Works correctly with mixed data types
  • Works correctly with cells with more 255 characters
  • No need for IMEX=1, HDR=Yes or Registry hacks (TypeGuessRows)

  • Loads data correctly all the time + no need to edit Excel file
  • Can create Excel files in Excel 3.0-2007 format
  • Can insert data starting from specific cell
  • Can clear area before adding data into Excel
  • Can add headers

Related Microsoft KB's

  1. PRB: Excel Values Returned as NULL Using DAO OpenRecordset
  2. Data truncated to 255 characters with Excel ODBC driver
  3. Excel ODBC Driver May Determine Wrong Data Type
  4. BUG: Excel ODBC Driver Disregards the FirstRowHasNames or Header Setting
  5. How To Use ADO with Excel Data from Visual Basic or VBA

Creating Reports

Once data is went trough the Staging Area, cleansed, transformed and loaded into the Datawarehouse it is presented to the End Users.

There several different types of datawarehouse users.

  • Advanced Users
  • Regular Users
  • Casual Users

Advanced users design reports themselves and perform complex analytical tasks. They use very expensive reporting tools such as Business Objects, Crystal Reports or QlickView.
Regular users use reports designed by somebody else or they can use dashboards to monitor key parameters.
Casual users use reports from time to time or they may receive reports by email.

Quite often people just want to run one report per day for example print current warehouse stock level.

Cost of ownership

Divide licence cost of reporting software by number of reports use and take into account cost of hardware=Cost of ownership

Cost of ownership is very high for casual and some of regular users

One of the solutions is to use Active Table Editor

The administrator designs for Reports and Data Entry Forms. All this complexity is hidden from the end user. End users can only Print reports allowed by administrator

Active Table Editor allows the administrator to log in and design the look of the application for the end users. You can edit user menus, security settings, menu items, input forms and reports. All this complexity is left behind the scene for the end users. Once logged in, the end users see and edit the data, which was defined by the administrator.

Report Designer

Active Table Editor - Report Designer

Print Preview

Active Table Editor - Printing Report

Report creation wizard

Report Wizard

Loading data files from different operating systems

This article describes common problems of loading data into the database from different operating systems.

Typical situation:

You have a sales datawarehouse. Every day there are hundreds of small files to process. They come from various locations.
ETL tool downloads the files into one folder than loads them into the Datawarehouse using mask.

One morning you get a call from the Manager saying that the load failed and there several thousands of files to process...

Loading data files from different operating systems

So what could be the problem?

May be the file format has changed.

You open the file in the notepad and looks the same...

Most likely the format is different but you just cant see the difference


There are only two possible causes.

  • Different encoding  for example UTF8 instead of ASCI
  • Different end of line character for example LF instead of CR+LF

Data is coming from different data sources some users use Windows, some us MAC, some use UNIX/Lynux

Systems based on ASCII or a compatible character set use either LF  (Line feed, '\n', 0x0A, 10 in decimal) or CR (Carriage return, '\r', 0x0D, 13 in decimal) individually, or CR followed by LF (CR+LF, 0x0D 0x0A). Some rare systems, such as QNX  before version 4, used the ASCII RS (record separator, 0x1E, 30 in decimal) character as the newline character.

LF:    Multics, Unix and Unix-like systems (GNU/Linux, AIX, Xenix, Mac OS X, FreeBSD, etc.), BeOS, Amiga, RISC OS, and others
CR+LF: DEC RT-11 and most other early non-Unix, non-IBM OSes, CP/M, MP/M, DOS (MS-DOS, PC-DOS, etc.), OS/2, Microsoft Windows, Symbian OS, Palm OS
CR:    Commodore 8-bit machines, TRS-80, Apple II family, Mac OS up to version 9 and OS-9
RS:    QNX pre-POSIX implementation.

The Unicode  standard defines a large number of characters that conforming applications should recognize as line terminators:

LF:    Line Feed, U+000A
FF:    Form Feed, U+000C
CR:    Carriage Return, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)
NEL:   Next Line, U+0085
LS:    Line Separator, U+2028
PS:    Paragraph Separator, U+2029

The Solution

Use Advanced ETL Processor

Latest Version of Advanced ETL Processor and Visual Importer ETL can automatically determine source file encoding and works with multiple Line Ternimators

 

Data Reader

Page 5 of 7

Go to top