Python - How to read a PDF file

By xngo on June 7, 2019

Python by itself doesn't have a native module to read PDF files, so we have to rely on a third-party tool: Apache Tika. The Apache Tika™ toolkit can extract text from over a thousand different file types, such as PPT, XLS and PDF.

Installation

pip install tika

Tika-Python is a Python binding to the Apache Tika™ REST services, which lets you call Tika directly from Python code.
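Because Tika-Python talks to a Tika server that runs on Java, a Java runtime has to be installed on the machine. The snippet below is a minimal sketch of a quick check before you start; the function name java_available is just an illustrative helper and is not part of the tika package.

```python
import shutil


def java_available():
    """Return True if a `java` executable can be found on the PATH."""
    # shutil.which returns the full path of the executable, or None if absent.
    return shutil.which('java') is not None
```

If java_available() returns False, install a Java runtime before calling Tika.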

Read a PDF file

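from tika import parser

# Parse the PDF. The result is a dictionary that includes the extracted text.
raw = parser.from_file('sample.pdf')

# The 'content' entry holds the plain text of the document.
print(raw['content'])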
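The extracted text often comes with leading blank lines, and the 'content' entry can be None or missing when Tika finds no text, for example in a scanned PDF that contains only images. Here is a small sketch that handles both cases; the helper names extract_text and read_pdf_text are made up for this example and are not part of Tika.

```python
def extract_text(parsed):
    """Return the stripped text from a Tika result dict, or '' if there is none."""
    # 'content' may be None or absent when no text could be extracted.
    content = parsed.get('content') or ''
    return content.strip()


def read_pdf_text(path):
    """Parse the file at `path` with Tika and return its cleaned-up text."""
    # Imported here so that extract_text can be used without Tika installed.
    from tika import parser
    return extract_text(parser.from_file(path))
```

With this in place, print(read_pdf_text('sample.pdf')) prints the text without the surrounding whitespace, and prints an empty line when the PDF has no extractable text.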

About the author

Xuan Ngo is the founder of OpenWritings.net. He currently lives in Montreal, Canada. He loves to write about programming and open source subjects.