Converting PDF and DOCX files to text


For a data analysis project on CVs, I needed to convert PDF and DOCX (Microsoft Word) files to plain-text (TXT) files in bulk. The aim was to apply machine learning algorithms to the data.

The Apache Tika library made this easy: it converted hundreds of documents in only a few seconds.

To install Tika on macOS, I used Homebrew and a single command in the Terminal (prerequisite: a Java JDK):

brew install tika

To get the list of available commands:

tika --help
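To confirm that both the JDK and Tika are reachable from the Terminal before converting anything, a small check like the one below can help. This is only a sketch: probe is a hypothetical helper name, not something Tika or Homebrew provides, and it simply reports whether a command exits successfully.

```shell
# Hypothetical helper: run a command quietly and report whether it succeeded.
# $1 = label to print, remaining args = command to try
probe() {
  label=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "$label: ok"
  else
    echo "$label: not available"
    return 1
  fi
}

# java -version fails if no JDK is installed;
# tika --version prints the Apache Tika version number.
probe "Java" java -version || true
probe "Tika" tika --version || true
```

If either line reports "not available", the install step above is the place to look first.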

All the documents I wanted to convert were placed in a folder named input on my Desktop. The following command converted every .pdf and .docx document in it to a .txt document in the folder output.

tika --text -i ~/Desktop/input/ -o ~/Desktop/output/
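As a quick sanity check after the batch run, the number of source documents can be compared with the number of text files produced. The sketch below is not part of Tika: count_files is a hypothetical helper, and the two paths are assumed to match the folders used in the command above.

```shell
# Hypothetical helper (not part of Tika): count regular files under a
# directory that match the given find(1) expression.
# $1 = directory, remaining args = find expression
count_files() {
  dir=$1; shift
  find "$dir" -type f "$@" | wc -l | tr -d '[:space:]'
}

# Assumed locations, matching the tika command above.
input_dir="$HOME/Desktop/input"
output_dir="$HOME/Desktop/output"

# Only report when both folders exist.
if [ -d "$input_dir" ] && [ -d "$output_dir" ]; then
  n_in=$(count_files "$input_dir" \( -iname '*.pdf' -o -iname '*.docx' \))
  n_out=$(count_files "$output_dir" -iname '*.txt')
  echo "source documents: $n_in, text files: $n_out"
fi
```

If the two numbers differ, some documents were probably skipped or could not be parsed, and are worth checking by hand.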

That’s it!