Data cleansing is the process of correcting data errors and removing invalid information.
Bad data example
Inconsistent values: To a human, US and USA are the same, but to a computer they are different values. This can happen when merging data from different data sources.
Missing values: UK and Germany values are missing. Most likely this data is incorrect and must be removed from the final dataset.
Data entry errors: Spain and SPain are two different values.
Uniqueness: ORDER_ID must be unique.
Inconsistent date formats: A common problem when merging data from various countries.
Non-numeric characters inside numeric fields: Also common when merging data from various countries. Can be easily corrected using the "Delete characters transformation function".
Leading and trailing spaces: The invisible enemy of a data analyst. Use the "Trim transformation function" to correct this error.
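The problems listed above can also be detected programmatically before any cleansing starts. The following is only a minimal Python sketch, not part of Advanced ETL Processor. The sample rows, the field names and the helper functions are hypothetical and exist purely for illustration.

```python
from collections import Counter

# Hypothetical sample rows; the field names are assumptions for illustration.
rows = [
    {"ORDER_ID": "1001", "Country": "US"},
    {"ORDER_ID": "1002", "Country": "USA"},
    {"ORDER_ID": "1002", "Country": " Spain "},
    {"ORDER_ID": "1003", "Country": "SPain"},
    {"ORDER_ID": "1004", "Country": ""},
]

def find_duplicates(rows, key):
    """Return the key values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

def find_untrimmed(rows, field):
    """Return values that carry leading or trailing whitespace."""
    return [row[field] for row in rows if row[field] != row[field].strip()]

def find_missing(rows, field):
    """Return the rows whose field is empty after trimming."""
    return [row for row in rows if not row[field].strip()]

def find_case_variants(rows, field):
    """Group distinct spellings that differ only by case or padding."""
    groups = {}
    for row in rows:
        value = row[field].strip()
        if value:
            groups.setdefault(value.lower(), set()).add(value)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}

duplicates = find_duplicates(rows, "ORDER_ID")
untrimmed = find_untrimmed(rows, "Country")
missing = find_missing(rows, "Country")
variants = find_case_variants(rows, "Country")
```

Note that a case-insensitive comparison catches Spain and SPain but not US and USA. Synonyms like these need an explicit lookup table.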
Data Cleansing Example
Steps to follow
- Download and install Advanced ETL Processor [Link]
- Download and unzip the example [Link]
- Create a new transformation and open the .ats file
- Double-click on the Reader object and amend the source file path
- Double-click on the Writer object and amend the target file path
- Run the transformation by pressing the green arrow.
How the Data Validation process works
The data reader loads the Excel file into memory, and the validator rejects rows with an empty Country name field.
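The same reject-on-empty rule can be sketched outside the tool. The Python example below is an illustration only: it reads CSV text instead of an Excel file so that it needs nothing beyond the standard library, the column names and sample data are assumed, and it treats a whitespace-only Country as empty, which may differ from how the validator is configured in the example project.

```python
import csv
import io

# Hypothetical input standing in for the Excel file; column names are assumed.
SOURCE = """ORDER_ID,Country,Amount
1001,USA,"$1,200.50"
1002,,£300
1003,   ,45
1004,Spain,99
"""

def validate(rows, required_field="Country"):
    """Split rows into accepted and rejected based on a non-empty field."""
    accepted, rejected = [], []
    for row in rows:
        value = (row.get(required_field) or "").strip()
        (accepted if value else rejected).append(row)
    return accepted, rejected

reader = csv.DictReader(io.StringIO(SOURCE))
accepted, rejected = validate(reader)
```

Keeping the rejected rows in their own list, rather than discarding them, makes it possible to review or report them later.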
Cleansing the data
Once bad records are rejected, the transformer performs additional cleaning:
- The Delete Characters transformation function deletes dollar sign, pound sign, comma and space characters from the Amount field.
- The Date Format transformation function reformats the Order Date field into standard ODBC format.
- The Lookup transformation function corrects Country field values.
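For readers who want to see the logic of these three steps written out, here is a rough Python equivalent. It is a sketch under stated assumptions, not the tool's implementation: the lookup table, the list of accepted input date formats and all function names are hypothetical, and "standard ODBC format" is taken to mean YYYY-MM-DD.

```python
from datetime import datetime

# Assumed lookup table and input date formats; a real project defines its own.
COUNTRY_LOOKUP = {"us": "USA", "usa": "USA", "spain": "Spain", "uk": "UK"}
INPUT_DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%d.%m.%Y")

def delete_characters(value, chars="$£, "):
    """Remove every occurrence of the listed characters."""
    return value.translate(str.maketrans("", "", chars))

def reformat_date(value):
    """Try each known input format and return the date as YYYY-MM-DD."""
    for fmt in INPUT_DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date: {value!r}")

def lookup_country(value):
    """Map known variants to one canonical spelling; pass unknowns through."""
    cleaned = value.strip()
    return COUNTRY_LOOKUP.get(cleaned.lower(), cleaned)

def cleanse(row):
    """Apply the three transformations to a single record."""
    return {
        "Amount": delete_characters(row["Amount"]),
        "Order Date": reformat_date(row["Order Date"]),
        "Country": lookup_country(row["Country"]),
    }

result = cleanse(
    {"Amount": "$ 1,200.50", "Order Date": "25/12/2023", "Country": " SPain"}
)
```

The sketch assumes day-first dates. A value such as 01/02/2023 is ambiguous between day-first and month-first conventions, so the expected format for each data source should be decided up front rather than guessed per record.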
Please contact us if you need help with transforming the data.
Visit ETL Tools Forum