MHT/MHTML File Format Transformation

  • Posts: 368
  • Karma: 2
  • Thank you received: 32

MHT/MHTML File Format Transformation was created by bruce.gibbins

Hi.

We have come across a data feed from a large client that we need to process. My guess is that they use Cognos Impromptu to generate a report, which is then wrapped in an MHT file and emailed. I have attached a sanitised version of the file we receive. We have several options, but the most likely approach is to preprocess the file with a Python script that extracts the TABLE elements and saves them as a CSV file, which AETL can then process.
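
A minimal sketch of the first step of such a preprocessing script, assuming the MHT file is a standard MIME message that Python's standard `email` package can parse. The function name `html_parts` is hypothetical; it only returns the decoded HTML bodies and leaves the table extraction to a later step.

```python
import email
from email import policy


def html_parts(mht_bytes: bytes) -> list[str]:
    """Return the decoded body of every text/html part in an MHT/MHTML document."""
    # policy.default gives EmailMessage objects, whose get_content() undoes
    # the transfer encoding (quoted-printable, base64) and the charset.
    msg = email.message_from_bytes(mht_bytes, policy=policy.default)
    bodies = []
    for part in msg.walk():  # walk() also yields the message itself
        if part.get_content_type() == "text/html":
            bodies.append(part.get_content())
    return bodies
```

With a file on disk this would be called as `html_parts(Path("sample.mht").read_bytes())`, after importing `Path` from `pathlib`.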

However, given the popularity of Cognos I suspect that this format could be quite popular in certain industry domains.

To view the attached file, simply remove the .txt extension and open it with a modern web browser.

Thanks

File Attachment:

File Name: sample.mht.txt
File Size: 27 KB
1 year 1 month ago #19743

  • Posts: 368
  • Karma: 2
  • Thank you received: 32

Replied by bruce.gibbins on topic MHT/MHTML File Format Transformation

Hi, I would like to re-raise this topic as a possible enhancement that would certainly be of benefit to us.

In our case we receive a mail message with several attachments that have been generated by an external reporting system. One of these attachments turns out to be an MHTML (multi-part MIME HTML) file.

We need to extract some tabular data from it that can then be consumed downstream in further ETL processing. I understand that these files can contain multiple content types, but in our case, and possibly for others as well, we are only interested in data wrapped in tables.

Therefore, a suggestion would be to leave the collection of the file to an upstream package action such as the IMAP4 reader, but introduce a new Transformation READER type that strips the junk, does any decoding necessary and simply returns a list of tables. The developer could then select which table is required, and that table would be used in the Transformation Mapping to the Writer.

If the file contains several tables and more than one is required, then perhaps use a READER for each required table. This would make it possible to use JOIN and GROUP transformations before the final Mapping to the Writer. In our case the tables don't seem to be identified in any way: they are just bare <table> elements, so I am not sure how to identify them other than by their position within the file (e.g. Table1, Table2, etc.).
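
To make the idea of addressing tables by position concrete, here is a rough sketch using Python's standard `html.parser`. The names `TableCollector` and `extract_tables` are hypothetical, and the sketch assumes the tables are not nested inside one another.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> in document order as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # tables[0] is "Table1", tables[1] is "Table2", ...
        self._row = None   # row currently being filled, if any
        self._cell = None  # text fragments of the current cell, if any

    def _flush_cell(self):
        # Store the finished cell with its whitespace collapsed.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._flush_cell()
            self._row = []
            self.tables[-1].append(self._row)
        elif tag in ("td", "th"):
            self._flush_cell()  # tolerate a missing </td> before the next cell
            if self._row is not None:
                self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag in ("tr", "table"):
            self._flush_cell()
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def extract_tables(html: str) -> list[list[list[str]]]:
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    return parser.tables
```

A caller would then pick a table by its index, for example `extract_tables(html)[1]` for the second table in the file.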

I am sure I have oversimplified it. But at the moment I need an external tool to get the table data we are after into a CSV file, which we then use as a READER source.
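
For the CSV step, a short sketch with Python's standard `csv` module is shown here. It assumes one table is already available as a list of rows, each row a list of strings; the function name `table_to_csv` is hypothetical. Short rows are padded so that every CSV line has the same number of columns.

```python
import csv
import io


def table_to_csv(rows: list[list[str]]) -> str:
    """Serialise one table (a list of rows of cell text) as CSV text."""
    width = max((len(row) for row in rows), default=0)
    buffer = io.StringIO()
    writer = csv.writer(buffer)  # default dialect: comma separated, CRLF line ends
    for row in rows:
        # Pad ragged rows with empty cells so the column count is constant.
        writer.writerow(list(row) + [""] * (width - len(row)))
    return buffer.getvalue()
```

When writing the result to disk, opening the file with `newline=""` avoids doubled line endings on Windows.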

Alternatively, I can call a Python script from within an AETL package, but this introduces another moving part.

Regards
Bruce
2 months 3 weeks ago #20555

  • Posts: 8125
  • Karma: 33
  • Thank you received: 510

Replied by admin on topic MHT/MHTML File Format Transformation

Am I correct in my assumption that when you say extracting data from the tables, you mean HTML tables?
Mike
2 months 3 weeks ago #20557

  • Posts: 368
  • Karma: 2
  • Thank you received: 32

Replied by bruce.gibbins on topic MHT/MHTML File Format Transformation

Hi Mike. Yes, that would be correct (for our scenario).
2 months 3 weeks ago #20560

  • Posts: 8125
  • Karma: 33
  • Thank you received: 510

Replied by admin on topic MHT/MHTML File Format Transformation

We can split the MHTML file into multiple parts.
The one you provided has only text parts.
But other files might include image, CSS or JS parts.

We would need to create a separate package object for it.
This object will allow selecting a source file and the target directory to store parts.
Please note that the parts in the file you provided have no file names, so we will use a sequential number as the name.
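
Outside the product, the same splitting idea can be sketched in Python with the standard `email` package. This is only an illustration of the approach described above, not the package object itself; the function name `split_mhtml` and the three-digit numbering scheme are assumptions.

```python
import email
import mimetypes
from email import policy
from pathlib import Path


def split_mhtml(source: Path, target_dir: Path) -> list[Path]:
    """Write every non-container part of an MHTML file into target_dir."""
    msg = email.message_from_bytes(source.read_bytes(), policy=policy.default)
    target_dir.mkdir(parents=True, exist_ok=True)
    # multipart/* nodes are only containers; the leaves hold the content.
    leaves = [part for part in msg.walk() if not part.is_multipart()]
    written = []
    for number, part in enumerate(leaves, start=1):
        # The parts carry no file names, so name them 001, 002, ... and
        # guess an extension from the content type.
        extension = mimetypes.guess_extension(part.get_content_type()) or ".bin"
        path = target_dir / f"{number:03d}{extension}"
        # decode=True undoes base64 / quoted-printable transfer encoding.
        path.write_bytes(part.get_payload(decode=True) or b"")
        written.append(path)
    return written
```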

Regarding the HTML tables.

It is not as simple as it seems.

There are two HTML tables in your example, and they do not have ids.
But you might have a situation where you have a table inside a table.
Or every cell of the table might contain another table.
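
The nesting problem can be illustrated with a small Python sketch that lists each table's position in the document together with its nesting depth. The class and function names are hypothetical. A reader that addresses tables purely by position would need a rule for whether nested tables (depth 2 and deeper) count.

```python
from html.parser import HTMLParser


class TableOutline(HTMLParser):
    """Record (position, depth) for every <table> start tag in document order."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # number of tables currently open
        self.outline = []  # e.g. [(1, 1), (2, 2)] means table 2 sits inside table 1

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            self.outline.append((len(self.outline) + 1, self.depth))

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def table_outline(html: str) -> list[tuple[int, int]]:
    parser = TableOutline()
    parser.feed(html)
    parser.close()
    return parser.outline
```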

An HTML table can be treated as simple XML.
So if it is XML, it can be converted and loaded.
The part which holds the tables in your example is not valid XML.
But it is possible to make it valid XML by applying additional transformations.

So the workflow can be:

1. Transform the data into valid XML: reader type file system, then str before, after etc. to get rid of the junk.
2. Load the XML into the database, using XSLT if necessary.
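
As a rough illustration of step 1 outside the product, the following Python sketch turns an HTML fragment into well-formed XML. It uses the standard `html.parser` and `xml.etree.ElementTree` modules rather than the string functions mentioned above, so it is a different technique with the same goal. The names `XmlBuilder` and `html_to_xml` and the wrapper element `root` are assumptions. Attributes are dropped, and an unclosed element is only closed when one of its ancestors closes, so the result is well formed but not necessarily what a browser would build.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have an end tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class XmlBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []              # stack of elements still open
        self.builder.start("root", {})   # single wrapper element

    def handle_starttag(self, tag, attrs):
        name = tag.replace(":", "_")     # avoid undeclared XML prefixes
        self.builder.start(name, {})     # attributes are deliberately dropped
        if tag in VOID_TAGS:
            self.builder.end(name)
        else:
            self.open_tags.append(name)

    def handle_endtag(self, tag):
        name = tag.replace(":", "_")
        if name not in self.open_tags:
            return                       # stray end tag: ignore it
        while self.open_tags:            # close unclosed children first
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == name:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def finish(self):
        self.close()
        while self.open_tags:            # close whatever is still open
            self.builder.end(self.open_tags.pop())
        self.builder.end("root")
        return self.builder.close()


def html_to_xml(fragment: str) -> str:
    parser = XmlBuilder()
    parser.feed(fragment)
    return ET.tostring(parser.finish(), encoding="unicode")
```

The resulting string can be parsed by any XML tool, which makes the XSLT step in item 2 possible.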

I appreciate that my answer is vague. If we find a better solution, we will let you know.
Mike
2 months 2 weeks ago #20579

  • Posts: 1138
  • Karma: 3
  • Thank you received: 135

Replied by Peter.Jonson on topic MHT/MHTML File Format Transformation

FYI

I am trying to find a reliable way of extracting the data from HTML tables.
Peter Jonson
Support Analyst
2 months 2 weeks ago #20580
