
Extracting Text from a PDF File

Download and install Python from https://www.python.org/downloads

Open a command prompt and run:

pip install pdfminer

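This installs PDFMiner, a Python library for working with PDF files. The pdfminer package targets Python 2; on Python 3, the community-maintained fork pdfminer.six is the usual replacement.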

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner can report the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It also has an extensible PDF parser that can be used for purposes other than text analysis.

https://pypi.python.org/pypi/pdfminer/

PDFMiner includes a very useful script called pdf2txt.py. It can convert the contents of a PDF file to text, XML or HTML.

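This command extracts the text from test.pdf and writes it to test.txt. The script path shown assumes Python 2.7 installed in C:\Python27; adjust it to match your installation: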

python C:\Python27\Scripts\pdf2txt.py -o test.txt -t text test.pdf
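
The same command can be driven from a Python script, which helps when several files or output formats are involved. The following is a minimal sketch, not part of PDFMiner itself: the function names build_pdf2txt_command and convert are hypothetical, and it assumes pdf2txt.py should be run with the same interpreter that runs the script. Only the -o and -t options shown above are used.

```python
import subprocess
import sys

# Output formats named in this article; the value is passed to pdf2txt.py via -t.
FORMATS = ("text", "xml", "html")


def build_pdf2txt_command(script_path, pdf_path, out_path, fmt="text"):
    """Return the argument list for one pdf2txt.py run (hypothetical helper)."""
    if fmt not in FORMATS:
        raise ValueError("unsupported format: %s" % fmt)
    # Assumption: reuse the interpreter running this script to run pdf2txt.py.
    return [sys.executable, script_path, "-o", out_path, "-t", fmt, pdf_path]


def convert(script_path, pdf_path, out_path, fmt="text"):
    """Run pdf2txt.py; raises subprocess.CalledProcessError if it exits non-zero."""
    subprocess.check_call(
        build_pdf2txt_command(script_path, pdf_path, out_path, fmt)
    )
```

For example, calling convert(r"C:\Python27\Scripts\pdf2txt.py", "test.pdf", "test.html", "html") would request HTML output instead of plain text, assuming the script lives at that path.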

  • Last modified: 14/07/2021 09:07
  • by admin