There are several ways of loading data into Oracle database
A conventional path load executes SQL INSERT statements to populate tables in an Oracle database.
When Advanced ETL Processor or Visual Importer performs a conventional path load, it competes equally with all other processes for buffer resources, which can significantly slow the load.
Let's consider loading data into customers table using conventional path loading.
Some of the data is already inserted and there are duplicates in our source data.
insert into customers
(customer_id,customer_name)
values
(:customer_id,:customer_name)
What would happen if primary key is violated?
If we run all our inserts within one transaction we won't be able to load any data
or we can insert one record at the time committing after every insert
This approach allows us to load as much as we can, however there is a problem with that it is slow because it involves a lot of OCI calls and consumes a lot of server resources.
We use slightly different approach at DB Software Laboratory.
We build an array of records in memory than insert entire array in one go.
if insert fails OCI lets us know that record number 52 within the array violates primary key.
If second insert failed we attempt to insert records one by one
Once we inserted 100 records one by one, we build array and try to insert entire array again
A direct path load eliminates much of the database overhead by formatting Oracle data blocks and writing the data blocks directly to the database files. It does not compete with other users for database resources, so it can usually load data at near disk speed.
There are several restrictions to using direct path load
During a direct path load
are automatically disabled
and
enabled
Note:
(Pronounced as separate letters) Short for Open DataBase Connectivity, a standard database access method developed by the SQL Access group in 1992. The goal of ODBC is to make it possible to access any data from any application, regardless of which database management system (DBMS) is handling the data. ODBC manages this by inserting a middle layer, called a database driver , between an application and the DBMS. The purpose of this layer is to translate the application's data queries into commands that the DBMS understands. For this to work, both the application and the DBMS must be ODBC-compliant -- that is, the application must be capable of issuing ODBC commands and the DBMS must be capable of responding to them
The current Oracle ODBC driver conforms to the ODBC 3.52 specifications and certified against MDAC 2.8 (Microsoft Data Access Components) on Windows 32bit and 64bit platforms. It supports all core APIs and a subset of Level 1 and Level 2 functions. Microsoft supplies the Driver Manager components for the Windows Platform. The Oracle ODBC Driver is also available as part of the Oracle Instant Client installation and can be found at
http://www.oracle.com/technology/tech/oci/instantclient.
All our ETL Tools provide full support ODBC including ODBC connection strings.
In case of loading data into Oracle it works slower than Conventional path load and Direct path load.
OLE DB (or Object Linking and Embedding, Database) is an API designed by Microsoft for accessing data from a variety of sources in a uniform manner. It is a set of interfaces implemented using the Component Object Model (COM); it is otherwise unrelated to OLE. It was designed as a higher-level replacement for, and successor to, ODBC, extending its feature set to support a wider variety of non-relational databases, such as object databases and spreadsheets that do not necessarily implement SQL.
OLE DB separates the data store from the application that needs access to it through a set of abstractions that include the datasource, session, command and rowsets. This was done because different applications need access to different types and sources of data and do not necessarily want to know how to access functionality with technology-specific methods. OLE DB is conceptually divided into consumers and providers. The consumers are the applications that need access to the data, and the provider is the software component that implements the interface and therefore provides the data to the consumer. OLE DB is part of the Microsoft Data Access Components (MDAC) stack. MDAC is a group of Microsoft technologies that interact together as a framework that allows programmers a uniform and comprehensive way of developing applications for accessing almost any data store. OLE DB providers can be created to access such simple data stores as a text file and spreadsheet, through to such complex databases as Oracle, MS SQL Server and Sybase ASE. It can also provide access to hierarchical datastores such as email systems.
However, because different data store technologies can have different capabilities, OLE DB providers may not implement every possible interface available to OLE DB. The capabilities that are available are implemented through the use of COM objects - an OLE DB provider will map the data store technologies functionality to a particular COM interface. Microsoft describes the availability of an interface as "provider-specific," as it may not be applicable depending on the database technology involved. Note also that providers may augment the capabilities of a data store - these capabilities are known as services in Microsoft parlance.
All our ETL Tools fully support OLE DB. However it is one of the slowest ways of accessing the data and when working with large datasets it tends to use a lot of memory.
More Information:
Data can come in various formats. It can be extracted from the source database directly or it may be loaded from the files. When we extract data directly all we need to is to check if connection is working.
This usually done automatically by ETL automation tool. In case of files, we need to obtain the files first. They might be stored on the remote ftp server or somewhere in the web. Those files have to be copied to the location where the can be accessed by the ETL tool.
Some ETL Tools offer complete automation of business processes including full support for file operations
Data cleansing is a process of checking data against predefined set of rules
For example:
Most queries that retrieve data from the data warehouse use inner joins between the fact and dimension tables. If you use a dimension table containing data that does not apply to all facts, you must include a record in the dimension table that can be used to relate to the remaining fact table values. For example, for null value 0 can be used as surrogate key of dimension table and for empty string 1.
To verify referential integrity in a star schema simple SQL query can be used to count the rows returned when all appropriate dimension tables are joined to the fact table using inner joins. The number of rows returned by this query should match the number of rows in the fact table. With large data warehouses it might have some performance implications and should be executed outside of normal working hours
More Information:
This article is not database specific
Basic Star Schema of a retail shop

In the previous article we have described the ways of calculating SQL query execution time.
Let's think about what can be done about Query performance.
We have only two options
There are a lot of ways of improving hardware. You can add memory processors increase your hardrive size etc. But quite often it does not help
We have seen some configurations with 64 gigs of memory and 32 processors
With all 32 processors doing nothing but the database is running very slow.
All this because the DBA decided to use raid 5 for better protection.
The performance of you disks this the most important part of the data warehouse.
It defines it. It tells how long do you have before the system becomes unusable due to pour performance
HDtach is very useful utility for checking disk performance, you would be very surprised when compare the laptop with the servers. Modern laptops quite often much faster.
The bottom line is:
Let's go back to the Star Schema Diagram
STATE CODE field is 5 bytes long
Everybody knows that the US state abbreviation is always two characters long
if we change our definition from nvarchar(2) to nchar(2) field size changes from 5 bytes to 4
Next well know thing, that there are no international characters in US state abbreviation. There is no need to use Unicode and we can change field type from nchar(2) to char(2)
The field length is 2 bytes now.
Everybody knows that there are only 52 states in US
What we can do next, is to move STATE CODE out of the fact table in to the dimension and use STATE_ID BYTE in fact table as state identifier.
By using common sense we were able to reduce the field size five times. The same technic can be applied to the rest of fields.
Make sure your disk fast and fact table record size as small as possible
Ever wonder why it takes so long to execute the Query? It works fine now but how long will it take to run in 3 years?
This article is not database specific
Below is the basic Star schema of a retail shop
It has one fact table and 3 dimensions

Let's have a close look at the Fact table structure and particularly fields types:
The STATE_CODE field: The type is nvarchar most likely it it uses 2 bytes to store one character, plus it has nil as terminator so in total it is 5 bytes
Some of the databases store the length of string, which could take up to 2-4 bytes
Lets do the same for the rest of fields| Field Name | Field Length |
| STATE_CODE | 5 |
| PRODUCT_CODE | 21 |
| PERIOD_CODE | 13 |
| AMOUNT | 6 |
| Total row size | 45 |
Select STATE_CODE, PRODUCT_CODE, PERIOD_CODE, Sum(AMOUNT) as AMOUNT
From FACT_TABLE
GROUP BY STATE_CODE, PRODUCT_CODE, PERIOD_CODE
To answer this question we need to know our HDD performance and number of rows in the table
HDtach can easily measure the hard drive performance
Let assume that we have 10 million rows and our RAID can do 50mb per sec sequential read
450 000 000 / (50 *1048576) = 8.58 sec
Those people who have actually run the queries against the table with 10 million records would say that this number is far too optimistic.
We believe that practical aproach gives the best results:
sp_spaceused ‘Tablename’
Than use the formula table size/hdd performance to calculate the time
and the last step ran the sql during peak times and low times
and compare the results.
If you have historical data this approach would also help you estimate how long will it take to run the SQL in 3 years.
This solution is not database specific
Most data bases can generate DDL for any object but not a lot of them allow generation of INSERT statements for the table data.
The workaround is to use of ETL Tools for transferring data across servers. However, there exists a need to generate INSERT statements from the tables for porting data.
Simplest example is when small or large amount of data needs to be taken out on a removable storage media and copied to a remote location, INSERT..VALUES statements come handy.
There is a number of scripts available to perform this data transformation task. The problem with those scripts that all of them database specific and they do not work with textiles
Advanced ETL Processor gives you an universal solution for creating Insert statements. It works with any database or file.
Advanced ETL Processor generating insert statements
ODBC stands for Open Data Base Connectivity, a standard database access method developed by Microsoft Corporation.
It makes it possible to access any data from any application, regardless of which database management system (DBMS) is handling the data. ODBC manages this by inserting a middle layer, called a database driver, between an application and the DBMS.
The purpose of this layer is to translate the application's data queries into commands that the DBMS understands. For this to work, both the application and the DBMS must be ODBC-compliant -- that is, the application must be capable of issuing ODBC commands and the DBMS must be capable of responding to them.
A DSN (Data Source Name) is an identifier which defines a data source for an ODBC driver.
For example DSN for MS SQL Server would consists of:
This information is stored in windows registry and this fact introduces additional work for system administrators. Imagine following situation you have CRM application and you use ODBC DSN to connect to it. It means that on every PC the Admistrator must create ODBC DSN manually. The problem becomes worse if you have a lot of different databases and large amount of users.
One of the benefits of using DB Software Laboratory ETL tools is the support for ODBC connection strings.
Using ODBC connection string gives you the ability to connect to the database without creating an ODBC DSN.
For example to connect to MS SQL Server 2008 you may use the following connection string.
Driver={SQL Server Native Client 10.0};Server=myServerAddress;Database=myDataBase;Uid=myUsername;Pwd=myPassword;
You may find more information about connection strings at http://www.connectionstrings.com
What is Dirty Data anyway?
Other common causes of dirty data are:
Most of the problems comes when working with Text or Excel Files
![]()
In order to load data we need to make sure that format of Amount and Order_Date fields is consistent. For amount field we need to get rid of dollars, pounds and commas. It could easily done by using replace function of Advanced ETL Processor. ![]()
For ORDER_DATE field we will apply multiple date formats. Result of Date Format function is a string in 'YYYY-MM-DD HH:NN:SS.ZZZ' format ![]() Full Data Transformation:![]() Result of Data Transformation:![]()
This is just a small example how Advanced ETL Processor can help you to validate and transform data. |
||
About Advanced ETL Processor |
||
|
|
![]() |
|
|
|
|
|
|
|
|
![]() |
![]() |
You guys have got imagination and vision of how things work in real-life.
Tim Jonson,
Project Manager