You are here: HomeArticlesMagic button for Business Intelligence consultant

Loading Data into Oracle

Load any data into Oracle in just a few clicks. Plus there is no need to create any scripts or learn command line tools just point and click.

Learn how you can do it now:

There are several ways of loading data into Oracle database

  • Conventional path load
  • Direct path load
  • Using Oracle ODBC Driver
  • Using Ole DB

Conventional path load

A conventional path load executes SQL INSERT statements to populate tables in an Oracle database.

When Advanced ETL Processor or Visual Importer performs a conventional path load, it competes equally with all other processes for buffer resources, which can significantly slow the load.

Let's consider loading data into customers table using conventional path loading.

Some of the data is already inserted and there are duplicates in our source data.  

insert into customers
(customer_id,customer_name)
values
(:customer_id,:customer_name)

 What would happen if primary key is violated? 

If we run all our inserts within one transaction we won't be able to load any data

  • lots of inserts
  • commit

or we can insert one record at the time committing after every insert 

  • insert one record
  • commit
  • insert one record
  • commit
  • ETC

This approach allows us to load as much as we can, however there is a problem with that it is slow because it involves a lot of OCI calls and consumes a lot of server resources.

We use slightly different approach at DB Software Laboratory.

We build an array of records in memory than insert entire array in one go.

  • insert 100 records
  • commit
  • insert 100 records
  • commit
  • insert 100 records
  • commit

if insert fails OCI lets us know that record number 52 within the array violates primary key.

  • our insert is rolled back
  • insert records from 1 till 51
  • commit
  • insert records from 53 till 100
  • commit

If second insert failed we attempt to insert records one by one

  • our insert is rolled back
  • insert record 53
  • commit
  • insert record 54
  • commit
  • ...
  • Insert record 100
  • commit

Once we inserted 100 records one by one, we build array and try to insert entire array again

Direct path load

A direct path load eliminates much of the database overhead by formatting Oracle data blocks and writing the data blocks directly to the database files. It does not compete with other users for database resources, so it can usually load data at near disk speed.

There are several restrictions to using direct path load

During a direct path load

  • CHECK constraints,
  • Referential constraints (FOREIGN KEYS),
  • Insert triggers

are automatically disabled

and

  • NOT NULL
  • UNIQUE
  • PRIMARY KEY (unique-constraints on not-null columns)

enabled 

Note:

NOT NULL constraints are checked at column array build time.
Any row that violates the NOT NULL constraint is rejected.
UNIQUE constraints are verified when indexes are rebuilt at the end of the load.
The index will be left in an Index Unusable state if a violation of a UNIQUE constraint is detected.

ODBC

(Pronounced as separate letters) Short for Open DataBase Connectivity, a standard database access method developed by the SQL Access group in 1992. The goal of ODBC is to make it possible to access any data from any application, regardless of which database management system (DBMS) is handling the data. ODBC manages this by inserting a middle layer, called a database driver , between an application and the DBMS. The purpose of this layer is to translate the application's data queries into commands that the DBMS understands. For this to work, both the application and the DBMS must be ODBC-compliant -- that is, the application must be capable of issuing ODBC commands and the DBMS must be capable of responding to them

The current Oracle ODBC driver conforms to the ODBC 3.52 specifications and certified against MDAC 2.8 (Microsoft Data Access Components) on Windows 32bit and 64bit platforms. It supports all core APIs and a subset of Level 1 and Level 2 functions. Microsoft supplies the Driver Manager components for the Windows Platform. The Oracle ODBC Driver is also available as part of the Oracle Instant Client installation and can be found at
http://www.oracle.com/technology/tech/oci/instantclient.

All our ETL Tools provide full support ODBC including ODBC connection strings.
In case of loading data into Oracle it works slower than Conventional path load and Direct path load.

OLE DB or ADO

OLE DB (or Object Linking and Embedding, Database) is an API designed by Microsoft for accessing data from a variety of sources in a uniform manner. It is a set of interfaces implemented using the Component Object Model (COM); it is otherwise unrelated to OLE. It was designed as a higher-level replacement for, and successor to, ODBC, extending its feature set to support a wider variety of non-relational databases, such as object databases and spreadsheets that do not necessarily implement SQL.

OLE DB separates the data store from the application that needs access to it through a set of abstractions that include the datasource, session, command and rowsets. This was done because different applications need access to different types and sources of data and do not necessarily want to know how to access functionality with technology-specific methods. OLE DB is conceptually divided into consumers and providers. The consumers are the applications that need access to the data, and the provider is the software component that implements the interface and therefore provides the data to the consumer. OLE DB is part of the Microsoft Data Access Components (MDAC) stack. MDAC is a group of Microsoft technologies that interact together as a framework that allows programmers a uniform and comprehensive way of developing applications for accessing almost any data store. OLE DB providers can be created to access such simple data stores as a text file and spreadsheet, through to such complex databases as Oracle, MS SQL Server and Sybase ASE. It can also provide access to hierarchical datastores such as email systems.

However, because different data store technologies can have different capabilities, OLE DB providers may not implement every possible interface available to OLE DB. The capabilities that are available are implemented through the use of COM objects - an OLE DB provider will map the data store technologies functionality to a particular COM interface. Microsoft describes the availability of an interface as "provider-specific," as it may not be applicable depending on the database technology involved. Note also that providers may augment the capabilities of a data store - these capabilities are known as services in Microsoft parlance.

All our ETL Tools fully support OLE DB. However it is one of the slowest ways of accessing the data and when working with large datasets it tends to use a lot of memory.

More Information:

Loading data into the Datawarehouse

This article describes the various stages of loading data into the data warehouse

There are several stages in the process:

Preparation for Extraction

Data can come in various formats. It can be extracted from the source database directly or it may be loaded from the files. When we extract data directly all we need to is to check if connection is working.
This usually done automatically by ETL automation tool. In case of files, we need to obtain the files first. They might be stored on the remote ftp server or somewhere in the web. Those files have to be copied to the location where the can be accessed by the ETL tool.

Loading Data into the Datawarehouse

Some ETL Tools offer complete automation of business processes including full support for file operations

Extraction.

SQL can be used to extract the data from the database or text parser to extract the data from fixed width or delimited files.
Extracting data often involves the transfer of large amounts of data from source operational systems. Such operations can impose significant processing loads on the databases involved and should be performed during a period of relatively low system load or overnight.

Cleansing and Transforming data.

Data cleansing is a process of checking data against predefined set of rules

For example:

  • Checking date formats
  • Checking field length
  • Pattern validation
  • Data type checks
A lot of data transformations can be performed during the process of extracting data from the source systems. However, there are often some additional tasks to execute before loading the data into the data warehouse. For example, reconciling inconsistent data from heterogeneous data sources after extraction and completing other formatting and cleansing tasks and generating surrogate keys.

Some typical data transformations include:
  • Joining multiple fields into one field (Address 1 + Address 2 + Address 3).
  • Generating calendar tables.
  • Mapping data from one representation to another, such as Female to 1 and Male to 0
  • Transforming data from multiple representations to a single representation, such as a common format for telephone numbers
  • Creating surrogate keys

Loading the Data

Once all the data has been cleansed and transformed into a structure consistent with the data warehouse requirements, data is ready for loading into the data warehouse.

The initial load of the data warehouse consists of populating the tables in the data warehouse schema and then checking that the data is ready for use.

After the data has been loaded into the data warehouse database, verify the referential integrity between dimension and fact tables to ensure that all records relate to appropriate records in other tables. DBA should verify that every record in a fact table relates to a record in each dimension table that will be used with that fact table.

Data integrity in the reverse order is not necessary, however in some cases it might be necessary to remove unrelated data. For example when a dimension table has several times more records than the fact table

Dealing with null and empty strings.

Most queries that retrieve data from the data warehouse use inner joins between the fact and dimension tables. If you use a dimension table containing data that does not apply to all facts, you must include a record in the dimension table that can be used to relate to the remaining fact table values. For example, for null value 0 can be used as surrogate key of dimension table and for empty string 1.

To verify referential integrity in a star schema simple SQL query can be used to count the rows returned when all appropriate dimension tables are joined to the fact table using inner joins. The number of rows returned by this query should match the number of rows in the fact table. With large data warehouses it might have some performance implications and should be executed outside of normal working hours

More Information:

Star Schema Optimisation

This article describes the ways to optimize star schemas and making your data warehouse run faster

This article is not database specific

Basic Star Schema of a retail shop

Star Schema


In the previous article we have described the ways of calculating SQL query execution time.

Let's think about what can be done about Query performance.

We have only two options

  • Improve hardware
  • Make the database server do less work

Improving Datawarehouse hardware

There are a lot of ways of improving hardware. You can add memory processors increase your hardrive size etc. But quite often it does not help

We have seen some configurations with 64 gigs of memory and 32 processors

With all 32 processors doing nothing but the database is running very slow.

All this because the DBA decided to use raid 5 for better protection.
The performance of you disks this the most important part of the data warehouse.

It defines it. It tells how long do you have before the system becomes unusable due to pour performance

HDtach is very useful utility for checking disk performance, you would be very surprised when compare the laptop with the servers. Modern laptops quite often much faster.


The bottom line is:

  • HDD performance is the key 
  • Do not use RAID 5 use RAID 10

Making the database server do less.


Let's go back to the Star Schema Diagram

STATE CODE field is 5 bytes long
Everybody knows that the US state abbreviation is always two characters long

if we change our definition from nvarchar(2) to nchar(2) field size changes from 5 bytes to 4

Next well know thing, that there are no international characters in US state abbreviation. There is no need to use Unicode and we can change field type from nchar(2) to char(2)

The field length is 2 bytes now.

Everybody knows that there are only 52 states in US

What we can do next, is to move STATE CODE out of the fact table in to the dimension and use STATE_ID BYTE in fact table as state identifier.

By using common sense we were able to reduce the field size five times. The same technic can be applied to the rest of fields.

Conclusion

Make sure your disk fast and fact table record size as small as possible

Datawarehouse Performance

This article describes the ways of estimating data warehouse performance

Ever wonder why it takes so long to execute the Query? It works fine now but how long will it take to run in 3 years?

This article is not database specific

Below is the basic Star schema of a retail shop
It has one fact table and 3 dimensions

Star Schema

Let's have a close look at the Fact table structure and particularly fields types: 

The STATE_CODE field: The type is nvarchar most likely it it uses 2 bytes to store one character, plus it has nil as terminator so in total it is 5 bytes

Some of the databases store the length of string, which could take up to 2-4 bytes

Lets do the same for the rest of fields
Field Name  Field Length
STATE_CODE 5
PRODUCT_CODE 21
PERIOD_CODE 13
AMOUNT 6
Total row size 45

So how long it will take to run the following query?

Select STATE_CODE, PRODUCT_CODE, PERIOD_CODE, Sum(AMOUNT) as  AMOUNT
From FACT_TABLE
GROUP BY STATE_CODE, PRODUCT_CODE, PERIOD_CODE

This sql will use the full scan of the table: EG it wiill read entire table and calculate summary  

To answer this question we need to know our HDD performance and number of rows in the table

HDtach can easily measure the hard drive performance

Let assume that we have 10 million rows and our RAID can do 50mb per sec sequential read

450 000 000 / (50 *1048576) = 8.58 sec

Those people who have actually run the queries against the table with 10 million records would say that this number is far too optimistic.

  • The actual row size is much bigger
  • It also holds some additional data such as row CRC or rowid's
  • Plus data is stored in blocks and they do have some additional data as well
  • Plus the table itself stores some additional information.
  • HDD performance in not realistic as well in multi-user environment

So what is the best way to estimate the time taken to execute the query?

We believe that practical aproach gives the best results:

  • Create table
  • Populate table with some dummy data
  • Count number of rows
  • Get actual table size using database specific functions

SQL Server provides a built-in stored procedure that you can run to easily show the size of a table, including the size of the indexes. Which might surprise you.

Syntax:

sp_spaceused ‘Tablename’

Than use the formula table size/hdd performance to calculate the time
and the last step ran the sql during peak times and low times
and compare the results.

If you have historical data this approach would also help you estimate how long will it take to run the SQL in 3 years.

Generating Insert Statements

This article explains how to generate INSERT statements from any data source

This solution is not database specific

Most data bases can generate DDL for any object but not a lot of them allow generation of INSERT statements for the table data.
The workaround is to use of ETL Tools for transferring data across servers. However, there exists a need to generate INSERT statements from the tables for porting data.
Simplest example is when small or large amount of data needs to be taken out on a removable storage media and copied to a remote location, INSERT..VALUES statements come handy.

There is a number of scripts available to perform this data transformation task. The problem with those scripts that all of them database specific and they do not work with textiles

Advanced ETL Processor gives you an universal solution for creating Insert statements. It works with any database or file.

Benefits and advantages

  • Data from both tables and views can be scripted
  • Support for text files
  • Table names and column names with spaces are handled
  • All datatypes are handled
  • NULLs are gracefully handled
  • Timestamp columns are handled
  • Identity columns are handled
  • Very flexible and configurable
  • Computed columns are handled
  • You can filter the rows/lines for which you want to generate INSERT's

Advanced ETL Processor generating insert statements

Generating Insert Statements

We were very impressed when we bought this ETL tool. But now, we are even more impressed, because the company constantly adding new useful ETL features.
You guys have got imagination and vision of how things work in real-life.


Tim Jonson,
Project Manager 

Using ODBC Connection strings in ETL Tools

This article describes using ODBC Connection strings in ETL Tools

What is ODBC?


ODBC stands for Open Data Base Connectivity, a standard database access method developed by Microsoft Corporation.

It makes it possible to access any data from any application, regardless of which database management system (DBMS) is handling the data. ODBC manages this by inserting a middle layer, called a database driver, between an application and the DBMS.

The purpose of this layer is to translate the application's data queries into commands that the DBMS understands. For this to work, both the application and the DBMS must be ODBC-compliant -- that is, the application must be capable of issuing ODBC commands and the DBMS must be capable of responding to them.

What is a DSN?


A DSN (Data Source Name) is an identifier which defines a data source for an ODBC driver.

For example DSN for MS SQL Server would consists of:

  • Database name
  • Server name
  • User Name
  • Password

The Problem

This information is stored in windows registry and this fact introduces additional work for system administrators. Imagine following situation you have CRM application and you use ODBC DSN to connect to it. It means that on every PC the Admistrator must create ODBC DSN manually. The problem becomes worse if you have a lot of different databases and large amount of users.

The Solution

One of the benefits of using DB Software Laboratory ETL tools is the support for ODBC connection strings.

Using ODBC connection string gives you the ability to connect to the database without creating an ODBC DSN.

For example to connect to MS SQL Server 2008 you may use the following connection string.

Driver={SQL Server Native Client 10.0};Server=myServerAddress;Database=myDataBase;Uid=myUsername;Pwd=myPassword;

Advanced ETL Processor Data Reader using ODBC DSN

Advanced ETL Processor Data Reader using Connection String

You may find more information about connection strings at http://www.connectionstrings.com

Dealing With Dirty Data

This Article describes what is Dirty Data and how to Deal with it

What is Dirty Data anyway?


In reference to databases it is a data that contain errors. Dirty data can contain such mistakes as spelling or punctuation, incorrect data associated with a field, incomplete or outdated data or even data that is duplicated in the database.

 Other common causes of dirty data are:

  • Wrong fields sizes
  • Wrong and inconsistent formats
  • Logical inconsistency like typing zipcode into phone number box
  • User Errors

Most of the problems comes when working with Text or Excel Files

Life is much easier when data source is ODBC compliant database however there are still some potential problems

Imagine that you are loading orders from different countries into your oracle data warehouse.

Part of  the data comes from text files, part from MS Excel files and some of the data is direct ODBC connection to the source database.
Some files are result of manual consolidation of multiple files

Data-warehouse Table Definition is

  • COUNTRY_ID INTEGER
  • ORDER_ID INTEGER
  • ORDER_DATE DATE
  • AMOUNT NUMBER(10.2)


Every country have different formats for ORDER_DATE and Amount field. This situation is far too familiar for many ETL Consultants

 

In order to load data we need to make sure that format of Amount and Order_Date fields is consistent.

For amount field we need to get rid of dollars, pounds and commas.

It could easily done by using replace function of Advanced ETL Processor. 

What you see is what you load

 

For ORDER_DATE field we will apply multiple date formats.

Result of Date Format function is a string in 'YYYY-MM-DD HH:NN:SS.ZZZ' format 

What you see is what you load
 

 Full Data Transformation:

 Result of Data Transformation:

 

 

This is just a small example how Advanced ETL Processor can help you to validate and transform data. 

About Advanced ETL Processor


Advanced ETL Processor
is an ETL tool designed to automate extracting data from ANY database, transform, validate it and load into ANY database . Typical usage of it would be extract data from Excel File,Validate Date Formats, Sort data, deduplicate it and load it into Oracle database, run stored procedure or Sql script, once loading is completed. Unlike Oracle SQL loader, BCP,  DTS or SSIS Advanced ETL Processor can also add new and update old records based on primary key.

More Information

More Articles...

Page 6 of 7

Go to top