
For a data analysis project on CVs, I needed to convert PDF and DOCX (Microsoft Word) files to plain text files (TXT) in bulk. The aim was to apply machine learning algorithms to the extracted text.
The Apache Tika library allowed me to convert these documents easily (hundreds of documents in only a few seconds).
To install Tika on macOS, I used Homebrew and a single command in the Terminal (prerequisite: a Java JDK):
brew install tika
To get the list of available commands:
tika --help
All the documents that I wanted to convert were placed in a folder named input. With the following command, all .pdf and .docx documents were converted to .txt documents in the folder output:
tika --text -i ~/Desktop/input/ -o ~/Desktop/output/
That’s it!
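To check that nothing was skipped, a small shell sketch like the one below can compare the number of input documents with the number of text files produced. The function names (count_docs, count_txt, check_conversion) are made up for this example and are not part of Tika, and the sketch assumes that each converted document yields exactly one .txt file somewhere under the output folder.

```shell
# Hypothetical helpers, not part of Tika.

# Count the PDF and DOCX files under a folder (case-insensitive extensions).
count_docs() {
    find "$1" -type f \( -iname '*.pdf' -o -iname '*.docx' \) | wc -l | tr -d ' '
}

# Count the text files under a folder.
count_txt() {
    find "$1" -type f -iname '*.txt' | wc -l | tr -d ' '
}

# Print both counts; succeed only if they match.
# Assumption: one .txt file is produced per converted document.
check_conversion() {
    in_count=$(count_docs "$1")
    out_count=$(count_txt "$2")
    echo "input documents: $in_count, text files: $out_count"
    [ "$in_count" -eq "$out_count" ]
}
```

With these functions loaded in the shell, running check_conversion ~/Desktop/input ~/Desktop/output prints the two counts and returns a non-zero status if they differ, which usually means some documents could not be parsed.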