Tabula for extracting table data from PDFs

Have you ever come across a PDF filled with useful data, but wanted to play around with that data yourself? In the past if I had that problem, I'd type the table out manually. This has some disadvantages:

  • it is extremely boring
  • it's likely that mistakes will get made, especially if the table is long and extends over several pages
  • it takes a long time

I recently discovered a tool that solves this problem: Tabula. It works on Windows and Mac and is very easy and intuitive to use. Simply take your page of data:

A page listing Kandahar's provincial council election polling stations from a few years back. Note the use of English and Dari scripts. Tabula handles all this without problems.

Then import the file into Tabula's web interface. It's surprisingly good at autodetecting where tables and table borders are, but you can do it manually if need be:

ScreenShot 2018-01-17 at 15.56.25.png

Then check that the data has been correctly scraped, select formats for export (from CSV to JSON etc):

ScreenShot 2018-01-17 at 15.57.19.png

And there you have it, all your data in a CSV file ready for use in R or Python or just a simple Excel spreadsheet:

ScreenShot 2018-01-17 at 15.57.50.png

Note that even though the interface runs through a browser, none of your data touches external servers. All the processing and stripping of data from PDFs is done on your computer, and isn't sent for processing to cloud servers. This is a really nice feature and I'm glad they wrote the software this way.

I haven't had any problems using Tabula so far. It's a great time saver. Highly recommended.