This article describes common problems of loading data into the database from different operating systems.
You have a sales data warehouse. Every day there are hundreds of small files to process. They come from various locations.
ETL tool downloads the files into one folder than loads them into the Data warehouse using a mask.
One morning you get a call from the Manager saying that the load failed and there several thousands of files to process...
So what could be the problem?
Maybe the file format has changed.
You open the file in the notepad and looks the same...
Most likely the format is different but you just can't see the difference
There are only two possible causes.
- Different encoding for example UTF8 instead of ASCI
- Different end of line character for example LF instead of CR+LF
Data is coming from different data sources some users use Windows, some us MAC, some use UNIX/Linux
Systems based on ASCII or a compatible character set use either LF (Line feed, '\n', 0x0A, 10 in decimal) or CR (Carriage return, '\r', 0x0D, 13 in decimal) individually, or CR followed by LF (CR+LF, 0x0D 0x0A). Some rare systems, such as QNX before version 4, used the ASCII RS (record separator, 0x1E, 30 in decimal) character as the newline character.
LF: Multics, Unix and Unix-like systems (GNU/Linux, AIX, Xenix, Mac OS X, FreeBSD, etc.), BeOS, Amiga, RISC OS, and others
CR+LF: DEC RT-11 and most other early non-Unix, non-IBM OSes, CP/M, MP/M, DOS (MS-DOS, PC-DOS, etc.), OS/2, Microsoft Windows, Symbian OS, Palm OS
CR: Commodore 8-bit machines, TRS-80, Apple II family, Mac OS up to version 9 and OS-9
RS: QNX pre-POSIX implementation.
The Unicode standard defines a large number of characters that conforming applications should recognize as line terminators:
LF: Line Feed, U+000A
FF: Form Feed, U+000C
CR: Carriage Return, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)
NEL: Next Line, U+0085
LS: Line Separator, U+2028
PS: Paragraph Separator, U+2029