This SQL script creates and populates a Time dimension for a PostgreSQL-based data warehouse.
CREATE TABLE time_dim
(
    time_key integer NOT NULL,
    time_value character(5) NOT NULL,
    hours_24 character(2) NOT NULL,
    hours_12 character(2) NOT NULL,
    hour_minutes character(2) NOT NULL,
    day_minutes integer NOT NULL,
    day_time_name character varying(20) NOT NULL,
    day_night character varying(20) NOT NULL,
    CONSTRAINT time_dim_pk PRIMARY KEY (time_key)
);

COMMENT ON TABLE time_dim IS 'Time Dimension';
COMMENT ON COLUMN time_dim.time_key IS 'Time Dimension PK';
INSERT INTO time_dim
SELECT cast(to_char(minute, 'hh24mi') as numeric) AS time_key,
       to_char(minute, 'hh24:mi') AS time_value,
       -- Hour of the day (00 - 23)
       to_char(minute, 'hh24') AS hours_24,
       -- Hour of the day (01 - 12)
       to_char(minute, 'hh12') AS hours_12,
       -- Minute of the hour (00 - 59)
       to_char(minute, 'mi') AS hour_minutes,
       -- Minute of the day (0 - 1439)
       extract(hour FROM minute) * 60 + extract(minute FROM minute) AS day_minutes,
       -- Names of day periods
       case when to_char(minute, 'hh24:mi') BETWEEN '06:00' AND '08:29' then 'Morning'
            when to_char(minute, 'hh24:mi') BETWEEN '08:30' AND '11:59' then 'AM'
            when to_char(minute, 'hh24:mi') BETWEEN '12:00' AND '17:59' then 'PM'
            when to_char(minute, 'hh24:mi') BETWEEN '18:00' AND '22:29' then 'Evening'
            else 'Night'
       end AS day_time_name,
       -- Indicator of day or night
       case when to_char(minute, 'hh24:mi') BETWEEN '07:00' AND '19:59' then 'Day'
            else 'Night'
       end AS day_night
FROM (SELECT '0:00'::time + (sequence.minute || ' minutes')::interval AS minute
      FROM generate_series(0, 1439) AS sequence(minute)) AS minutes
ORDER BY 1;
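As a sanity check on the computed columns, here is a small sketch in Python rather than SQL, so it can run without a database. It mirrors the time_key, day_minutes and day_night expressions above for each minute of the day; the labels and ranges are taken straight from the script.

```python
from datetime import datetime, timedelta

def time_dim_row(day_minute: int) -> dict:
    """Build one time-dimension row for a minute of the day (0-1439)."""
    t = datetime(2000, 1, 1) + timedelta(minutes=day_minute)
    hhmm = t.strftime("%H:%M")
    return {
        "time_key": int(t.strftime("%H%M")),    # e.g. 13:45 -> 1345
        "time_value": hhmm,                     # 'hh24:mi'
        "hours_24": t.strftime("%H"),
        "hours_12": t.strftime("%I"),
        "hour_minutes": t.strftime("%M"),
        "day_minutes": t.hour * 60 + t.minute,  # 0 - 1439
        "day_night": "Day" if "07:00" <= hhmm <= "19:59" else "Night",
    }

rows = [time_dim_row(m) for m in range(1440)]   # one row per minute
```

Like the SQL, this produces exactly 1440 rows, and the string comparison on 'hh24:mi' works because the format sorts lexicographically.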
Based on information provided here.
A typical task, populating a database field with the current date, can be done in several ways.
First of all we can use the GetSystemVariable calculation function:
GetSystemVariable returns the value of 'VARIABLENAME'.
Possible values for 'VARIABLENAME' are:
The GetSystemVariable('SYSTEM_DATE') function returns the date according to the regional settings of the computer.
It is very important to use the correct date format. From the regional settings screen we can see that the format is dd/mm/yyyy.
Another way to determine the date format of GetSystemVariable('SYSTEM_DATE') is to execute it against a non-date field:
Data preview screen
Mapping using an MS SQL Server connection with the correct date format
If we specify the wrong date format we won't be able to load the data.
As you can see, the order date field is missing.
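The effect of a wrong date mask can be illustrated with Python's strptime. This is an analogy only: the ETL tool's format strings are different, but the failure mode is the same.

```python
from datetime import datetime

raw = "25/12/2023"  # a date in dd/mm/yyyy, matching the regional settings

# Correct mask: parses fine.
parsed = datetime.strptime(raw, "%d/%m/%Y")

# Wrong mask (mm/dd/yyyy): there is no month 25, so parsing fails;
# in an ETL load the field would simply come through empty.
try:
    datetime.strptime(raw, "%m/%d/%Y")
    wrong_mask_failed = False
except ValueError:
    wrong_mask_failed = True
```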
Another way of getting the date is by using the GetSystemDate function:
GetSystemDate(Format): String. Returns the current system date/time in the specified format.
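A rough Python equivalent of GetSystemDate, for readers unfamiliar with the tool; note the format tokens below are Python's, not the ETL tool's, and are shown only as examples.

```python
from datetime import datetime

def get_system_date(fmt: str = "%d/%m/%Y") -> str:
    """Format the current system date/time with an explicit mask."""
    return datetime.now().strftime(fmt)

stamp = get_system_date("%Y-%m-%d %H:%M")   # e.g. '2024-06-01 09:30'
```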
And the last way is to use database functions, for example getdate() for SQL Server.
This does not work for MS SQL Server connections; use an ODBC connection instead.
You may get an error message saying that the field must have a date format; just assign any date format.
There are several types of slowly changing dimensions which can be used in the data warehouse:
Type 0: not really a slowly changing dimension. We just load the data and keep it as it is. A typical example would be a list of postcodes.
Type 1: overwrite the old value with the new value. This is very useful when dealing with corrections. For example, the dimension holds a list of countries, some of them misspelled, so we want to correct them. We do not care about the history in this case because it was incorrect in the first place.
Type 2: create a new record in the dimension with a new primary key. For example, when a customer changes address we create a new record. Sometimes additional fields are populated as well, such as a current-record flag or an effective date.
Type 3: overwrite the old value with the new value, and add additional data to the table, such as the effective date of the change.
This type of dimension can be used, for example, for loans, when a loan is fully paid.
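The create-a-new-record approach (with a current-record flag and an effective date) can be sketched as follows. This is an illustration with made-up field names, not any tool's implementation: the old row is closed off and the new row gets its own surrogate key.

```python
from datetime import date

# In-memory stand-in for a customer dimension table.
dimension = [
    {"customer_key": 1, "customer_id": "C001",
     "address": "1 Old Street", "is_current": True,
     "effective_date": date(2020, 1, 1)},
]

def change_address(rows, customer_id, new_address, when):
    """Close the current version of the record and add a new one."""
    current = next(r for r in rows
                   if r["customer_id"] == customer_id and r["is_current"])
    current["is_current"] = False                 # close the old version
    rows.append({                                 # add the new version
        "customer_key": max(r["customer_key"] for r in rows) + 1,
        "customer_id": customer_id,
        "address": new_address,
        "is_current": True,
        "effective_date": when,
    })

change_address(dimension, "C001", "2 New Road", date(2023, 6, 1))
```

After the change the dimension holds two versions of the customer, and only the latest is flagged as current.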
Now let's have a look at practical ways of populating dimension tables:
Check if the record exists; if not, insert a new one:
For every incoming record
    Count the number of records matching the key
    Then if the record count <> 0
        Update the Customers table
    Else
        Insert a new record
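The steps above can be sketched with an in-memory table keyed by customer id. This is illustrative only; a real load would run UPDATE / INSERT statements against the Customers table.

```python
# Existing dimension data, keyed by the natural key.
customers = {"C001": {"name": "Alice", "city": "London"}}

def load_record(table, key, record):
    """Update the row if the key already exists, otherwise insert it."""
    if key in table:          # record count <> 0 -> update
        table[key].update(record)
    else:                     # no match -> insert new record
        table[key] = dict(record)

load_record(customers, "C001", {"city": "Leeds"})           # update
load_record(customers, "C002", {"name": "Bob", "city": "York"})  # insert
```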
We might be performing unnecessary updates, and for large and wide tables this could be very slow. One of the ways of avoiding it is using a CRC (MD5) check: if the CRC is the same, no changes were made to the record and therefore there is no need to update it.
For insert-only loads:
Deduplicate the data
Calculate the record CRC
If this CRC exists in the database, discard the record
If not, create a new record

For updates:
Deduplicate the data
Calculate the record CRC
If this CRC exists in the database, do nothing
If not, update the record with the new data
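The CRC (MD5) check can be sketched in Python: hash the whole record once and compare hashes instead of comparing every, possibly wide, column. The record layout here is made up for illustration.

```python
import hashlib

def record_hash(record: dict) -> str:
    """MD5 of the whole record, with keys sorted so the hash is stable."""
    payload = "|".join(f"{k}={record[k]}" for k in sorted(record))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# Hash stored alongside the dimension row at load time.
stored = {"name": "Alice", "city": "London"}
stored_md5 = record_hash(stored)

# Identical incoming record: same hash, so the update can be skipped.
incoming = {"name": "Alice", "city": "London"}
unchanged = record_hash(incoming) == stored_md5

# Changed incoming record: different hash, so the row must be updated.
incoming2 = {"name": "Alice", "city": "Leeds"}
changed = record_hash(incoming2) != stored_md5
```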
Data quality is a measure of how well business data practices satisfy standards. Good and reliable data can be used to increase efficiency, support decision making, and skyrocket profitability.
Poor data quality leads to wasted time, working with conflicting information, bad decisions and a massive decrease in efficiency.
Many organizations implement strict data controls at the point of entry. However, quite often this is not enough: for example, when data is loaded into the data warehouse or moved from one application to another, additional validation and transformation rules must be applied.
The primary objective of any data integration solution is to assemble data from one or more data sources, validate it, and transform it into a standard format.
There are three major steps in implementing a data quality strategy:
Advanced ETL Processor is able to check any data, including date formats, postcodes and phone numbers, validating against lists of values, etc. It has more than 190 data validation functions, and it can be extended by using regular expressions. It is an enterprise data integration solution that lets you quickly validate and process large volumes of data while preserving and enhancing data quality.
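Advanced ETL Processor's built-in checks are not reproduced here; the sketch below only illustrates the kind of rule-based validation described above, using regular expressions. The patterns are deliberately simplified.

```python
import re

# Simplified illustrative patterns (not production-grade rules).
UK_POSTCODE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$")
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def is_valid(value: str, rule: re.Pattern) -> bool:
    """Normalise the value, then test it against a validation rule."""
    return bool(rule.match(value.strip().upper()))

good_pc = is_valid("sw1a 1aa", UK_POSTCODE)
bad_pc = is_valid("12345", UK_POSTCODE)
good_date = is_valid("2023-12-25", ISO_DATE)
```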
A common problem: when trying to load data from an Excel file, half of the data comes in as nulls, or columns with more than 255 characters are truncated.
As partially explained here, ODBC/MS Jet scans the first TypeGuessRows rows to determine the field type:
In your eight (8) scanned rows, if the column contains five (5) numeric values and three (3) text values, the provider returns five (5) numbers and three (3) null values.
In your eight (8) scanned rows, if the column contains three (3) numeric values and five (5) text values, the provider returns three (3) null values and five (5) text values.
In your eight (8) scanned rows, if the column contains four (4) numeric values and four (4) text values, the provider returns four (4) numbers and four (4) null values.
In your eight (8) scanned rows, if all values are less than 255 characters long, the provider will truncate all data to 255 characters.
In your eight (8) scanned rows, if the column contains five (5) values longer than 255 characters, the provider will return more than 255 characters.
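The majority-vote behaviour described above can be simulated in a few lines. This is an illustration, not the actual Jet/ACE algorithm: within the sampled rows the majority type wins, ties go to numeric, and minority values come back as NULL.

```python
def guess_and_load(sampled):
    """Guess the column type from sampled values; null out the minority."""
    numeric = sum(1 for v in sampled if isinstance(v, (int, float)))
    text = len(sampled) - numeric
    majority_is_numeric = numeric >= text   # a tie goes to numeric
    return [v if isinstance(v, (int, float)) == majority_is_numeric
            else None
            for v in sampled]

# 5 numbers and 3 text values -> numeric wins, the text becomes NULLs.
loaded = guess_and_load([1, 2, 3, 4, 5, "a", "b", "c"])
```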
Setting IMEX=1 tells the driver to use Import mode. In this state, the registry setting ImportMixedTypes=Text will be noticed. This forces mixed data to be converted to text. For this to work reliably, you may also have to modify the registry setting, TypeGuessRows=8. The ISAM driver by default looks at the first eight rows and from that sampling determines the datatype. If this eight row sampling is all numeric, then setting IMEX=1 will not convert the default datatype to Text; it will remain numeric.
Nobody wants to load half of the data; everybody wants to load the data as it is.
Set IMEX=1 in the connection string
Close any programs that are running.
On the Start menu, click Run. Type regedit and click OK.
In the Registry Editor, expand the following key depending on the version of Excel that you are running:
Excel 2000 and later versions
Select TypeGuessRows and on the Edit menu click Modify.
In the Edit DWORD Value dialog box, click Decimal under Base.
Set the value to 1.
Open the Excel file.
Make sure that the cells in the first line of the table have relevant data, for example:
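For the first step, a typical Jet OLE DB connection string with IMEX=1 looks like this. The provider name and extended-property syntax are the standard Jet 4.0 ones; the file path is just a placeholder.

```python
# Built as a Python string purely for readability; any language or
# configuration file would carry the same value.
conn_str = (
    "Provider=Microsoft.Jet.OLEDB.4.0;"
    "Data Source=C:\\data\\sales.xls;"
    "Extended Properties='Excel 8.0;HDR=Yes;IMEX=1'"
)
```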
This solution applies to all versions of the MS Excel ODBC driver, OLE DB, MS Jet, .NET, DTS and SSIS.
We have spent an enormous amount of time trying to get this fixed; so far we have not been able to find a better solution.
The way Excel import works makes it impossible to automate: you have to modify most Excel files manually in order to load them.
This is why we are no longer using ODBC/OLE DB/MS Jet for Excel connections. Our ETL solutions now work correctly with Excel all the time.
Once the data has gone through the staging area and been cleansed, transformed and loaded into the data warehouse, it is presented to the end users.
There are several different types of data warehouse users.
Advanced users design reports themselves and perform complex analytical tasks. They use very expensive reporting tools such as Business Objects, Crystal Reports or QlikView.
Regular users use reports designed by somebody else or they can use dashboards to monitor key parameters.
Casual users use reports from time to time or they may receive reports by email.
Quite often people just want to run one report per day for example print current warehouse stock level.
Divide the licence cost of the reporting software by the number of reports used, and take into account the cost of hardware: that is the cost of ownership.
The cost of ownership is very high for casual users and some regular users.
One of the solutions is to use Active Table Editor
The administrator designs the reports and data entry forms; all this complexity is hidden from the end users, who can only print the reports allowed by the administrator.
Active Table Editor allows the administrator to log in and design the look of the application for the end users: user menus, security settings, menu items, input forms and reports. Once logged in, the end users see and edit only the data defined by the administrator.
You have a sales data warehouse. Every day there are hundreds of small files to process, coming from various locations.
The ETL tool downloads the files into one folder, then loads them into the data warehouse using a mask.
One morning you get a call from the manager saying that the load failed and there are several thousand files to process...
Maybe the file format has changed.
You open the file in Notepad and it looks the same...
Most likely the format is different but you just can't see the difference.
There are only two possible causes.
Data is coming from different data sources: some users use Windows, some use Mac, some use UNIX/Linux.
Systems based on ASCII or a compatible character set use either LF (Line feed, '\n', 0x0A, 10 in decimal) or CR (Carriage return, '\r', 0x0D, 13 in decimal) individually, or CR followed by LF (CR+LF, 0x0D 0x0A). Some rare systems, such as QNX before version 4, used the ASCII RS (record separator, 0x1E, 30 in decimal) character as the newline character.
LF: Multics, Unix and Unix-like systems (GNU/Linux, AIX, Xenix, Mac OS X, FreeBSD, etc.), BeOS, Amiga, RISC OS, and others
CR+LF: DEC RT-11 and most other early non-Unix, non-IBM OSes, CP/M, MP/M, DOS (MS-DOS, PC-DOS, etc.), OS/2, Microsoft Windows, Symbian OS, Palm OS
CR: Commodore 8-bit machines, TRS-80, Apple II family, Mac OS up to version 9 and OS-9
RS: QNX pre-POSIX implementation.
The Unicode standard defines a large number of characters that conforming applications should recognize as line terminators:
LF: Line Feed, U+000A
FF: Form Feed, U+000C
CR: Carriage Return, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)
NEL: Next Line, U+0085
LS: Line Separator, U+2028
PS: Paragraph Separator, U+2029
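A quick way to find out which of these terminators a problem file actually uses is to scan for them directly. A minimal sketch, covering only the common CR, LF and CR+LF conventions:

```python
def detect_newlines(data: bytes) -> set:
    """Report which line-ending conventions appear in the raw bytes."""
    found = set()
    if b"\r\n" in data:
        found.add("CR+LF")
    data = data.replace(b"\r\n", b"")   # strip CR+LF pairs first,
    if b"\n" in data:                   # so bare LF / CR are not
        found.add("LF")                 # double-counted
    if b"\r" in data:
        found.add("CR")
    return found

windows = detect_newlines(b"a\r\nb\r\n")   # {'CR+LF'}
unix = detect_newlines(b"a\nb\n")          # {'LF'}
mixed = detect_newlines(b"a\r\nb\nc\r")    # all three conventions
```

Running this over a failed batch quickly shows whether files from different operating systems have been mixed together.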