Automate File Processing

A Question from the Customer:

An external PLC generates txt files which must be downloaded and converted into an SQL database.
The PLC generates 4 files each day, at a size of 5 MB per file = 20 MB a day.
The file name is made of the date and time, like this: 14022719.txt; the syntax is YYMMDDHH.
The files are generated at the same 4 hours each day: 01, 07, 13 and 19.
In the files, the first data row is a time stamp: YYMMDDHHMMSS.
It is important that a record with a specific date/time is written to the SQL database only once.
The files will always be the same size.

ETL Job:

Each day the ETL converter should copy the files to a local directory via FTP, convert them to SQL, then delete them from the local directory. It could run 4 times a day, perhaps one hour after each txt file is completed on the PLC.
We cannot delete the files from the PLC, since others should also have access to the data.
The data on the PLC is deleted after 10 days using FIFO.

The problem:

How do I keep track of which files I have copied?
Could I create a TXT file on the local drive, write the file names into it and compare it with what is on the FTP drive?
But how do we do this? That is file comparison. Can the ETL tool do this?
Is there any other way to do it?
Keep in mind that the internet connection could be offline for a day or two, the server could be down,
or the PLC could be down. In some situations you will need to copy 4 files, sometimes 6 files; it depends.
You could also have gaps in the time stamps of the file names from the PLC if it has been down for a while.
I know we would then also have gaps in the SQL data, but that is OK.

The solution:

There are several ways of performing this task.

First of all, you need to speak to the company which provides the files.
It takes time to upload files, and you need to make sure that you are not downloading files while they are still being uploaded.
For example, you can ask them to put an additional marker file in place once the upload is completed.
If the marker file is there, the upload has finished and you can safely download the files.
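
As a plain illustration of this handshake (not a feature of any particular tool), here is a minimal Python sketch that checks for a marker file before downloading; the host, credentials and the done.flag name are all assumptions:

    # Sketch only: wait for a completion marker before downloading PLC files.
    # Host, credentials and "done.flag" are assumed names, not real values.
    from ftplib import FTP

    with FTP("plc-ftp.example.com") as ftp:
        ftp.login("user", "password")
        names = ftp.nlst()                   # list files in the current directory
        if "done.flag" in names:             # upload finished, safe to download
            for name in names:
                if name.endswith(".txt"):
                    with open(name, "wb") as local_file:
                        ftp.retrbinary("RETR " + name, local_file.write)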

The next step is to load the data into the database. That is easy; there are several tutorials about it.

Avoiding loading the same data twice:

First of all we need to get the source file name; this can be done using the Metadata transformation object.

[Screenshot: Metadata transformation object]

Then we need to check whether the file was loaded before; we can use the Validator "In List" object for that.

[Screenshots: In List validator configuration]

The last step is to load the data into the table and load the list of processed files into a separate table.

[Screenshot: Writer objects]

The overall logic is as follows:

  1. Download the data from the FTP server into the local folder using a file mask
  2. Run the transformation from the folder using the same mask, at the same time populating the list of processed files and checking that each file was not processed before (a sketch of this bookkeeping follows the screenshot below)
  3. Copy the processed files into a separate folder

[Screenshot: File processing transformation]
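
For readers who want to see this bookkeeping spelled out, here is a minimal Python sketch of the same idea, assuming the list of processed files is kept in a local text file; processed.txt, the folder names and the load step are all placeholders, not the tool's implementation:

    # Sketch only: process each downloaded file at most once.
    # processed.txt and the folder names are assumptions for illustration.
    import os
    import shutil

    PROCESSED_LIST = "processed.txt"

    def load_processed():
        # Return the set of file names recorded as already processed.
        if not os.path.exists(PROCESSED_LIST):
            return set()
        with open(PROCESSED_LIST) as f:
            return set(line.strip() for line in f)

    def mark_processed(name):
        # Append a file name to the processed list.
        with open(PROCESSED_LIST, "a") as f:
            f.write(name + "\n")

    processed = load_processed()
    for name in sorted(os.listdir("incoming")):
        if name.endswith(".txt") and name not in processed:
            # load_into_sql(os.path.join("incoming", name))  # hypothetical load step
            mark_processed(name)
            shutil.move(os.path.join("incoming", name), os.path.join("done", name))

This approach copes with gaps and downtime naturally, because it is driven by what is in the folder rather than by the clock.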

Please post your questions and comments on our support forum.

Processing EDIFACT Messages

What is an EDI/EDIFACT message?

EDI provides a technical basis for commercial "conversations" between two entities, either internal or external. EDI constitutes the entire electronic data interchange paradigm, including the transmission, message flow, document format, and software used to interpret the documents. EDI standards describe the rigorous format of electronic documents.

In essence, an EDI message is just a small text file which can be sent via email, HTTP or FTP.

Message Example:

UNH+199700001+INVOIC:D:93A:UN:EAN007'BGM+380+424876'DTM+137:199701111045:203'RFF+ON:334411'RFF+AAK:23149'NAD+BY+7080001000004::9++Hans Hansen AS+Storgata 1+TRONDHEIM++7005'NAD+SU+7080000366767::9++Børsterud AS+Industriveien 1+OSLO++0580'FII+RH+60731108042'RFF+VA:FORETAKSREGISTERET NO123456789MVA'CTA+AD+Lise Hansen'NAD+IV+7080001000011::9++Hans Hansen Øst AS+Grenseveien 1+HEBEKK++1406'NAD+DP+7080001000028::9++Hans Hansen Midt AS+Heggeveien 1+HEIMDAL++7080'NAD+SF+7080000000065::9++Børsterud AS, Vareutlevering+Industriveien 1+OSLO++0580'PAT+1'DTM+7:19970110:102'PAT+3'DTM+13:19970630:102'TOD+3++EXW'LIN+1++7030432630011:EN'PIA+1+7200018:SA::91'IMD+FL++:::RULLESYSTEM'IMD+C++TU'QTY+47:100'QTY+59:12'ALI+++6'MOA+66:25000,00'MOA+203:22000,00'PRI+AAB:250,00'PRI+AAA:220,00'PAC+++CT:KARTONG'TAX+7+VAT+++:::23+S'ALC+A++++PAD:::TILBUDSRABATT'PCD+3:10'MOA+8:2500,00'ALC+A++++PDE:::PALLERABATT'MOA+8:500,00'RTE+1:5,00'LIN+2++7030439770710:EN'PIA+1+2844001:SA::91'IMD+FL++:::BØRSTER'IMD+C++TU'QTY+47:200'QTY+59:36'ALI+++6'MOA+66:2000000'MOA+203:17900,00'PRI+AAB:100,00'PRI+AAA:89,50'PAC+++CT:KARTONG'ALC+A++++PAD:::TILBUDSRABATT'PCD+3:10'MOA+8:2000,00'ALC+A++++QD:::KVANTUMSRABATT'QTY+1:50'PCD+3:2,0'MOA+8:100,00'UNS+S'CNT+2:2'MOA+66:45000,00'MOA+203:39900,00'MOA+260:5100,00'MOA+131:5100,00'MOA+NET:39900,00'MOA+125:39900,00'MOA+150:9177,00'MOA+86:49077,00'MOA+129:39900,00'TAX+7+VAT++39900,00+:::23+S'MOA+176:9177,00'ALC+A++++PAD:::TILBUDSRABATT'MOA+8:4500,00'ALC+A++++PDE:::PALLERABATT'MOA+8:500,00'ALC+A++++QD:::KVANTUMSRABATT'MOA+8:100,00'UNT+76+199700001'

Here is the message format description:

UNA:+.?
    Service string advice: defines the separator characters used in the message

UNB+UNOA:1+9377778760643:ZZ+9399700000001:ZZ+050322:1237+000000001
    Interchange header, with sender GLN, addressee GLN, date, time and interchange number

UNH+00000000000459+ORDERS:D:96A:UN:EAN008
    Message header (beginning of the message)

BGM+220+10910220+92
    Purchase order number

DTM+137:20050322:102
    Order date

DTM+2:20050322:102
    Requested delivery date

NAD+SU+1217::92
    JB supplier code

NAD+ST+304::92
    Ship-to store code

NAD+IV+304::92
    Invoice-to store code

LIN+1
    Item line number

PIA+5+711719347521:VN::91
    Product identifier

QTY+21:10.0000:EA
    Quantity

PRI+AAA:40.0000::TU
    Price ex GST

UNS+S
    Section separator (start of the summary section)

CNT+2:1
    Control total (total number of item lines)

UNT+14+00000000000459
    Message trailer

UNZ+1+000000001
    Interchange trailer (end of the interchange)

Every data part is delimited by a single quote and prefixed with a few characters which indicate what kind of data it is;
for example, LIN indicates a line item.
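
To make this concrete, here is a tiny illustrative Python snippet that splits a message into its parts on the single quote and reads the prefix of each one (a real parser would also have to honour the ? release character):

    # Sketch only: split an EDIFACT message into segments and read each tag.
    message = ("UNH+199700001+INVOIC:D:93A:UN:EAN007'"
               "BGM+380+424876'DTM+137:199701111045:203'")

    for segment in message.split("'"):
        if segment:                          # skip the empty piece after the last quote
            tag = segment.split("+", 1)[0]   # e.g. UNH, BGM, DTM
            print(tag, "->", segment)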

Processing EDI Data

In order to load the data into the database we need to transform it into a more readable format.

[Screenshot: EDI processing transformation]

Data reader:

1. We can't use the fixed-width format because the message length varies.
2. A better option is to use the delimited format with a "wrong" delimiter (one which never occurs in the data), so the entire line is read and passed to the next step.

Note: we can also use File System as the reader connection type.

The next step is to turn the data so that every part becomes a separate record:

UNH+199700001+INVOIC:D:93A:UN:EAN007'BGM+380+424876'DTM+137:199701111045:203' becomes

UNH+199700001+INVOIC:D:93A:UN:EAN007 [Record 1]
BGM+380+424876                       [Record 2]
DTM+137:199701111045:203             [Record 3]

Once the data is rotated we want the UNH part to go into the UNH field, BGM into the BGM field and so on.
Since we have so many fields we will be using a subtransformation.

[Screenshot: Subtransformation]

The logic is as follows:

  1. When the field value starts with UNH+,
  2. we pass the value after UNH+ to the next step;
  3. if not, we pass null.
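
In plain Python the rule for a single tag could be sketched like this (an illustration of the same logic, not the tool's code):

    # Sketch only: route a segment into its own column based on its tag.
    def extract(segment, tag):
        prefix = tag + "+"
        return segment[len(prefix):] if segment.startswith(prefix) else None

    # extract("UNH+199700001+INVOIC:D:93A:UN:EAN007", "UNH")
    #   -> "199700001+INVOIC:D:93A:UN:EAN007"
    # extract("BGM+380+424876", "UNH") -> None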

The Keep Value transformation allows us to fill in the holes in the data.

E.g. it converts:

Value1
null
Value2
null

into:

Value1
Value1
Value2
Value2
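
The same fill-down effect can be sketched in a few lines of Python (illustrative only):

    # Sketch only: carry the last non-null value forward ("keep value").
    def keep_value(values):
        last = None
        filled = []
        for v in values:
            if v is not None:
                last = v
            filled.append(last)
        return filled

    # keep_value(["Value1", None, "Value2", None])
    #   -> ["Value1", "Value1", "Value2", "Value2"]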

The next step is to get rid of the empty values.

[Screenshot: Validation]

And deduplicate the data.

[Screenshot: Deduplicator]

The result of deduplication:

[Screenshot: Deduplication result]

Here we can convert our data even further, if we wish, using various transformations.

Please post your questions and comments on our support forum.

 Download sample

Library of objects

The latest release introduces the ability to store transformation/package objects in the repository for reuse in other transformations.
For example, if there is a series of transformations (lookup, add, if null, if empty, etc.) that is reused, we can design it once and reuse it in other transformations. This works with both Visual Importer ETL and Advanced ETL Processor.

[Screenshots: Library of Objects]

The Price of ETL software

People quite often ask us why our software is so cheap.

There are several ways to grow the business.

  1. Keeping prices high and at the same time destroying the competitors.
  2. Making products better and helping others to grow.

We prefer the second option.

As you can see from our support forum and website news, our software is constantly getting better.

Helping others to grow.

Greed is bad. Partners want to make money as well. High prices mean a limited number of partners and customers.

Imagine the following situation:

You are an expert in a certain area and have spent years doing similar tasks.
Now you are thinking: “Let’s have a look at what I have done so far and design something which covers 90 percent of the issues. Then I will have something to sell to my 20 potential customers.”

The cost of most BI software is too high for small businesses. There are a lot of common tasks which could be automated, but because the cost of the software and the cost of implementation are so high, they just can't afford it.

Our software allows our partners to create prepackaged solutions for specific tasks which are easy to implement and support.

Automating sending HTML emails

There are two ways of sending HTML emails with Advanced ETL Processor: by using a package or as part of a transformation.

Automation package sample: sending a price list to the customers.

[Screenshot: Email automation package]

Enabling support for HTML email is easy: just tick a check box.

[Screenshot: Send Email action]

A more advanced way of sending emails is to use a transformation:

  1. Anything can be used as a data source for the emails: files, tables or databases
  2. It is possible to send multiple emails in one go
  3. Complex transformations can be performed on the data before sending it
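
For context, and independent of the tool, here is a minimal Python sketch of what sending an HTML email involves; the SMTP host, addresses and content are placeholders:

    # Sketch only: send a simple HTML email; all names here are made up.
    import smtplib
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["Subject"] = "Price list"
    msg["From"] = "sales@example.com"
    msg["To"] = "customer@example.com"
    msg.set_content("Please find the latest price list below.")   # plain-text fallback
    msg.add_alternative("<h1>Price list</h1><p>Item A: 10.00</p>", subtype="html")

    with smtplib.SMTP("mail.example.com") as server:
        server.send_message(msg)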

 

[Screenshot: Email transformation]

Note:

Please use our support forum if you have any questions

Working with Greenplum

About Greenplum

The Greenplum Database builds on the foundations of open source database PostgreSQL. It primarily functions as a data warehouse and utilizes a shared-nothing, massively parallel processing (MPP) architecture. In this architecture, data is partitioned across multiple segment servers, and each segment owns and manages a distinct portion of the overall data; there is no disk-level sharing nor data contention among segments.

Source: Wikipedia.

Unlike PostgreSQL, the Greenplum database does not support the binary option of the COPY command.

Select the Text Mode option to load data into Greenplum.

[Screenshot: Greenplum Text Mode option]
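
The same constraint applies to any loader; for example, a minimal psycopg2 sketch (connection details, file name and table name are placeholders) has to stay with the default text format of COPY:

    # Sketch only: load a delimited text file into Greenplum via COPY.
    # Greenplum accepts the text/CSV COPY formats but not the binary one,
    # so the default text format is used here. All names are placeholders.
    import psycopg2

    conn = psycopg2.connect(host="gp-master", dbname="warehouse",
                            user="etl", password="secret")
    with conn, conn.cursor() as cur, open("plc_data.txt") as f:
        cur.copy_expert("COPY plc_readings FROM STDIN", f)
    conn.close()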

One day we will be able to process one million records per second

We have spent a lot of time optimizing our ETL software, and here is what we have been able to achieve so far:

Loading data into an Oracle database via direct path loading from a text file:

Version 8.3.6.1: 28,000 rec/second
Version 8.3.6.3: 46,000 rec/second
Version 8.3.6.4: 84,000 rec/second
Version 8.3.6.6: 94,000 rec/second
Version 8.3.6.8: 101,000 rec/second

[Screenshots: execution statistics for each version]

There are a lot of ways to make our software better; we will continue working on improving performance and will publish our results here.
