One of our insurance customers contacted us recently and asked for immediate assistance. They had been trying to implement XML file processing for a number of weeks using SSIS.
Our consultant was able to deliver a working solution within one day using Advanced ETL Processor.
New XML files show up every day. The client places them on an FTP server, which means each file is also available locally on our network via a UNC path.
Each nested XML tag should be a new “related” table.
The main XML tag is "Contracts"; other relevant tags are Billing, Coverage, Person, Risk and Underwriting Question.
There are a couple of data issues, such as bad dates from the customer. These should be ignored and inserted into the table as NULL.
The process should be scheduled to check a UNC folder on the network at least three times per day.
If 1 or more XML files exist, they need to be loaded into the server.
Once loaded, an email should be sent to me indicating success, failure or nothing to do.
There is some duplicated data due to the relationships within the data.
For example, each contract has Agency information, so any contract MAY use a previously inserted Agency; in this case an Update would be appropriate.
When updating data, if the same contract is loaded a second time, all related data must be deleted first.
For every nested XML tag a relevant data transformation was created.
The main problem was that the XML format was not stable: some of the files had missing tags or additional tags.
That was addressed by using custom XSLT transformations.
Then a package was created to combine all those transformations together.
We had been fighting this one with SSIS for about 2 weeks and got really behind schedule. Then I remembered we own Advanced ETL Processor, so I thought we should give it a shot. We were just so far behind that we needed this ASAP and I did not want to mess it up. The result was far beyond our expectations: they were able to do more in one day than the entire team in two weeks.
In order to speed up the process please provide as much information as you can.
Checksum is a calculated value that is used to determine the integrity of data. Checksum serves as a unique identifier for the data (a file, a text string, or a table row). If the data changes then so does the checksum value. This makes it easy to verify the integrity of the data.
Checksums are very useful in data warehousing, especially when working with very wide slowly changing dimensions. It takes less processing power to compare one checksum field than to compare all fields within the dimension.
dbSL ETL software can calculate a checksum for an entire file, a string or a row of data. It supports the following algorithms: CRC16, CRC24, CRC32, Adler32, CRC64, eDonkey, eMule, MD4, MD5, RIPEMD160, SHA1, SHA224, SHA256, SHA384, SHA512 and Whirlpool. It can return the data in HEX or BASE64 format.
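As a simple illustration of the idea (plain PostgreSQL rather than Advanced ETL Processor syntax; the customers table is hypothetical):

-- Checksum of a text string
SELECT md5('Advanced ETL Processor');

-- Checksum of an entire table row, by casting the row to text first
SELECT md5(c::text) AS row_checksum
FROM customers AS c;

If any column in the row changes, the resulting checksum changes with it.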
This SQL script will create and populate a time dimension for a PostgreSQL-based data warehouse:
CREATE TABLE time_dim
(
time_key integer NOT NULL,
time_value character(5) NOT NULL,
hours_24 character(2) NOT NULL,
hours_12 character(2) NOT NULL,
hour_minutes character(2) NOT NULL,
day_minutes integer NOT NULL,
day_time_name character varying(20) NOT NULL,
day_night character varying(20) NOT NULL,
CONSTRAINT time_dim_pk PRIMARY KEY (time_key)
);
COMMENT ON TABLE time_dim IS 'Time Dimension';
COMMENT ON COLUMN time_dim.time_key IS 'Time Dimension PK';
INSERT INTO time_dim
SELECT cast(to_char(minute, 'hh24mi') as integer) AS time_key,
to_char(minute, 'hh24:mi') AS time_value,
-- Hour of the day (00 - 23)
to_char(minute, 'hh24') AS hours_24,
-- Hour of the day (01 - 12)
to_char(minute, 'hh12') AS hours_12,
-- Minute of the hour (00 - 59)
to_char(minute, 'mi') AS hour_minutes,
-- Minute of the day (0 - 1439)
extract(hour FROM minute) * 60 + extract(minute FROM minute) AS day_minutes,
-- Names of day periods
case when to_char(minute, 'hh24:mi') BETWEEN '06:00' AND '08:29' then 'Early Morning'
when to_char(minute, 'hh24:mi') BETWEEN '08:30' AND '11:59' then 'Morning'
when to_char(minute, 'hh24:mi') BETWEEN '12:00' AND '17:59' then 'Afternoon'
when to_char(minute, 'hh24:mi') BETWEEN '18:00' AND '22:29' then 'Evening'
else 'Night'
end AS day_time_name,
-- Indicator of day or night
case when to_char(minute, 'hh24:mi') BETWEEN '07:00' AND '19:59' then 'Day'
else 'Night'
end AS day_night
FROM (SELECT '0:00'::time + (sequence.minute || ' minutes')::interval AS minute
FROM generate_series(0, 1439) AS sequence(minute)
GROUP BY sequence.minute) AS minutes
ORDER BY 1;
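Once the script has run, a timestamp can be converted into the surrogate key and looked up, for example:

-- Derive the time_key for the current timestamp
SELECT cast(to_char(now(), 'hh24mi') as integer) AS time_key;

-- Look up the dimension row for 18:45
SELECT * FROM time_dim WHERE time_key = 1845;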
Based on the information provided here.
A typical task, populating a database field with the current date, can be done in several ways.
First of all we can use the GetSystemVariable calculation function:
GetSystemVariable returns the value of 'VARIABLENAME'.
Possible values for 'VARIABLENAME' are:
The GetSystemVariable('SYSTEM_DATE') function returns the date according to the regional settings of the computer.
It is very important to use the correct date format. From the regional settings screen we can see that the format is dd/mm/yyyy.
Another way to determine the date format of GetSystemVariable('SYSTEM_DATE') is to execute it against a non-date field.
Data preview screen
Mapping using an MS SQL Server connection with the correct date format.
If we specify the wrong date format, we won't be able to load the data.
As you can see, the order date field is missing.
Another way of getting the date is by using the GetSystemDate function.
GetSystemDate(Format): String returns the current system date/time in the format specified.
The last way is to use database functions, for example getdate() for SQL Server.
This does not work for MS SQL Server connections; use an ODBC connection instead.
You may get an error message saying that the field must have a date format; just assign any date format.
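With getdate() the database itself supplies the value. A minimal T-SQL sketch, using a hypothetical orders table:

-- Populate order_date with the current date/time on insert
INSERT INTO orders (order_id, order_date)
VALUES (1, getdate());

-- Alternatively, let the database fill the field in by default
ALTER TABLE orders
ADD CONSTRAINT df_orders_order_date DEFAULT getdate() FOR order_date;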
There are several types of dimensions which can be used in a data warehouse.
Type 0: not really a slowly changing dimension. We just load the data and keep it as it is; a typical example would be a list of postcodes.
Type 1: overwrite the old value with the new value. This is very useful when dealing with corrections. For example, the dimension holds a list of countries, some of which were misspelled, so we want to correct them.
We don't care about the history in this case because it was incorrect in the first place.
Type 2: create a new record in the dimension with a new primary key. For example, when a customer changes address we create a new record. Sometimes additional fields are populated as well, such as a current record flag or an effective date.
Type 3: overwrite the old value with the new value, and add additional data to the table, such as the effective date of the change.
This type of dimension can be used, for example, with loans: once a loan is fully paid, the status is overwritten and the date of the change is recorded.
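To make the Type 2 behaviour concrete, here is a minimal SQL sketch; customer_dim, current_flag and effective_date are illustrative names:

-- Close off the existing record for customer 123
UPDATE customer_dim
SET current_flag = 'N'
WHERE customer_id = 123
AND current_flag = 'Y';

-- Insert a new record with a new primary key and the new address
INSERT INTO customer_dim (customer_key, customer_id, address, current_flag, effective_date)
VALUES (1001, 123, 'New address', 'Y', current_date);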
Now let's have a look at practical ways of populating dimension tables.
Check if the record exists; if not, insert a new record:
For every incoming record:
Count the number of records matching the key.
Then, if the record count <> 0, update the Customers table;
otherwise, insert a new record.
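In SQL terms the logic looks like this (the :placeholders and table names are illustrative):

-- Count the number of records matching the key
SELECT count(*) FROM customers WHERE customer_id = :customer_id;

-- If the count <> 0, update the existing record
UPDATE customers
SET name = :name, address = :address
WHERE customer_id = :customer_id;

-- Otherwise, insert a new record
INSERT INTO customers (customer_id, name, address)
VALUES (:customer_id, :name, :address);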
We might be performing unnecessary updates, and for large and wide tables it could be very slow. One way of avoiding this is to use a CRC (for example MD5) for the checks: if the CRC is the same, no changes were made to the record, and therefore there is no need to update it.
Deduplicate the data:
Calculate the record CRC.
If this CRC exists in the database, discard the record.
If not, create a new record.
Deduplicate the data:
Calculate the record CRC.
If this CRC exists in the database, do nothing.
If not, update the record with the new data.
Data quality is a measure of how well business data practices satisfy standards. Good and reliable data can be used to increase efficiency, support decision making and skyrocket profitability.
Poor data quality leads to wasted time, working with conflicting information, bad decisions and a massive decrease in efficiency.
Many organizations implement strict data controls at the point of entry. However, quite often this is not enough. For example, when data is loaded into the data warehouse or moved from one application to another, additional validation and transformation rules must be applied.
The primary objective of any data integration solution is to assemble data from one or more data sources, validate it and transform it into a standard format.
There are three major steps in implementing a data quality strategy:
Advanced ETL Processor is able to check any data, including date formats, post codes and phone numbers, and can validate values against a list. It has more than 190 data validation functions and can be extended using regular expressions. It is an enterprise data integration solution that lets you quickly validate and process large volumes of data while preserving and enhancing data quality.
Common problem: when trying to load data from an Excel file, half of the data comes through as NULLs, or columns with more than 255 characters are truncated.
As partially explained here
ODBC/MS Jet scans the first TypeGuessRows rows to determine each field type:
If, in your eight (8) scanned rows, the column contains five (5) numeric values and three (3) text values, the provider returns five (5) numbers and three (3) null values.
If, in your eight (8) scanned rows, the column contains three (3) numeric values and five (5) text values, the provider returns three (3) null values and five (5) text values.
If, in your eight (8) scanned rows, the column contains four (4) numeric values and four (4) text values, the provider returns four (4) numbers and four (4) null values.
If all values in your eight (8) scanned rows are shorter than 255 characters, the provider truncates all data to 255 characters.
If, in your eight (8) scanned rows, the column contains five (5) values longer than 255 characters, the provider returns more than 255 characters.
Setting IMEX=1 tells the driver to use Import mode. In this state, the registry setting ImportMixedTypes=Text will be noticed. This forces mixed data to be converted to text. For this to work reliably, you may also have to modify the registry setting, TypeGuessRows=8. The ISAM driver by default looks at the first eight rows and from that sampling determines the datatype. If this eight row sampling is all numeric, then setting IMEX=1 will not convert the default datatype to Text; it will remain numeric.
Nobody wants to load half of the data; everybody wants to load the data as it is.
Set IMEX=1 in the connection string.
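For example, a typical Jet OLE DB connection string with IMEX enabled looks like this (the file path is a placeholder):

Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\MyFolder\MyFile.xls;
Extended Properties="Excel 8.0;HDR=YES;IMEX=1"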
Close any programs that are running.
On the Start menu, click Run. Type regedit and click OK.
In the Registry Editor, expand the following key depending on the version of Excel that you are running:
Excel 2000 and later versions
Select TypeGuessRows and on the Edit menu click Modify.
In the Edit DWORD Value dialog box, click Decimal under Base.
Set the value to 1
Open the Excel file.
Make sure that the cells in the first line of the table have relevant data.
This solution applies to all versions of the MS Excel ODBC driver, OLE DB, MS Jet, .NET, DTS and SSIS.
We have spent an enormous amount of time trying to get this fixed. So far we have not been able to find a better solution.
The way Excel import works through these drivers makes it impossible to automate: you have to modify most Excel files manually in order to load them.
This is why we are no longer using ODBC/OLE DB/MS Jet for Excel connections; our ETL solutions work correctly with Excel all the time.