Loading data files from different operating systems

This article describes common problems of loading data into the database from different operating systems.

Typical situation:

You have a sales data warehouse. Every day there are hundreds of small files to process. They come from various locations. ETL tool downloads the files into one folder than loads them into the Data warehouse using a mask.

One morning you get a call from the Manager saying that the load failed and there several thousands of files to process…

So what could be the problem?

  • Maybe the file format has changed.
  • You open the file in the notepad and looks the same…
Most likely the format is different but you just can't see the difference

There are only two possible causes.

  • Different encoding for example UTF8 instead of ASCI
  • Different end of line character for example LF instead of CR+LF

Data is coming from different data sources some users use Windows, some us MAC, some use UNIX/Lynux

Systems based on ASCII or a compatible character set use either LF (Line feed, '\n', 0x0A, 10 in decimal) or CR (Carriage return, '\r', 0x0D, 13 in decimal) individually, or CR followed by LF (CR+LF, 0x0D 0x0A). Some rare systems, such as QNX before version 4, used the ASCII RS (record separator, 0x1E, 30 in decimal) character as the newline character.

LF: Multics, Unix and Unix-like systems (GNU/Linux, AIX, Xenix, Mac OS X, FreeBSD, etc.), BeOS, Amiga, RISC OS, and others
CR+LF: DEC RT-11 and most other early non-Unix, non-IBM OSes, CP/M, MP/M, DOS (MS-DOS, PC-DOS, etc.), OS/2, Microsoft Windows, Symbian OS, Palm OS
CR: Commodore 8-bit machines, TRS-80, Apple II family, Mac OS up to version 9 and OS-9
RS: QNX pre-POSIX implementation.

The Unicode standard defines a large number of characters that conforming applications should recognize as line terminators:

LF: Line Feed, U+000A
FF: Form Feed, U+000C
CR: Carriage Return, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)
NEL: Next Line, U+0085
LS: Line Separator, U+2028
PS: Paragraph Separator, U+2029

The Solution

Use Advanced ETL Processor

Latest Version of Advanced ETL Processor and Visual Importer ETL can automatically determine source file encoding and works with multiple Line Terminators

For more technologies supported by our ETL Software see Advanced ETL Processor Versions and Visual Importer ETL Versions

Confused? Ask question on our ETL forum

  • knowledgebase/loading_data_files_from_different_operating_systems.txt
  • Last modified: 13/04/2019 17:24
  • by admin