The problem with HTML is that, although it looks like XML, it is often not well-formed.

Here is more information:

Luckily, there is a very useful utility called Tidy:

It converts HTML into well-formed XML.

Usage example:

tidy -config config.txt -f errs.txt -m "my-html-file.html"
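
If the conversion has to be driven from a script, the same call can be wrapped in a few lines of Python. This is only a sketch: the helper names build_tidy_command and run_tidy are made up for illustration, and it assumes the tidy executable is on the PATH and follows the usual exit-code convention (0 for success, 1 for warnings, 2 for errors).

```python
import subprocess


def build_tidy_command(html_path, config_path="config.txt", errors_path="errs.txt"):
    """Build the same argument list as the command line shown above."""
    # -config: read options from a file, -f: write errors and warnings to a file,
    # -m: modify the input file in place
    return ["tidy", "-config", config_path, "-f", errors_path, "-m", html_path]


def run_tidy(html_path, config_path="config.txt", errors_path="errs.txt"):
    """Run tidy on html_path in place; return True unless tidy reports errors."""
    result = subprocess.run(build_tidy_command(html_path, config_path, errors_path))
    # Assumed convention: 0 = success, 1 = warnings only, 2 = errors
    return result.returncode in (0, 1)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. Because -m overwrites the input file, it is safer to run this on a copy first.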

Sample config file for HTML Tidy:

indent: auto
indent-spaces: 2
wrap: 72
markup: yes
output-xml: yes
input-xml: no
show-warnings: yes
numeric-entities: yes
quote-marks: yes
quote-nbsp: yes
quote-ampersand: no
break-before-br: no
uppercase-tags: no
uppercase-attributes: no
char-encoding: latin1
new-inline-tags: cfif, cfelse, math, mroot,
  mrow, mi, mn, mo, msqrt, mfrac, msubsup, munderover,
  munder, mover, mmultiscripts, msup, msub, mtext,
  mprescripts, mtable, mtr, mtd, mth
new-blocklevel-tags: cfoutput, cfquery
new-empty-tags: cfelse

Once the HTML has been converted into XML, you can use XSLT to extract the data you need.
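
To give a rough idea of what extraction from the tidied output looks like, here is a minimal sketch. It is not XSLT: it uses Python's standard xml.etree.ElementTree module as a stand-in, since any XML tool can read the file once it is well-formed. The sample document and the extract_links helper are hypothetical, and the sketch assumes the output elements are not namespaced.

```python
import xml.etree.ElementTree as ET

# Hypothetical tidied output: well-formed XML without namespaces.
TIDIED = """<html>
  <head><title>Sample page</title></head>
  <body>
    <p>See <a href="https://example.com/a">first</a> and
       <a href="https://example.com/b">second</a>.</p>
  </body>
</html>"""


def extract_links(xml_text):
    """Return (href, link text) pairs for every <a> element that has an href."""
    root = ET.fromstring(xml_text)
    return [
        (a.get("href"), (a.text or "").strip())
        for a in root.iter("a")  # iterates in document order
        if a.get("href") is not None
    ]


links = extract_links(TIDIED)
```

A parser like this only works on well-formed input, which is why the Tidy step comes first. The numeric-entities: yes setting in the sample config also helps here, because named entities such as &nbsp; are not defined in plain XML.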

Here is an example:

Further reading:

It is also possible to convert HTML using scripts:

