Tabula for extracting table data from PDFs

Have you ever come across a PDF filled with useful data and wanted to play around with that data yourself? In the past, if I had that problem, I'd type the table out manually. This has some disadvantages:

  • it is extremely boring
  • it's likely that mistakes will get made, especially if the table is long and extends over several pages
  • it takes a long time

I recently discovered a tool that solves this problem: Tabula. It works on Windows and Mac and is very easy and intuitive to use. Simply take your page of data:

A page listing Kandahar's provincial council election polling stations from a few years back. Note the use of English and Dari scripts. Tabula handles all this without problems.

Then import the file into Tabula's web interface. It's surprisingly good at autodetecting where tables and table borders are, but you can do it manually if need be:

[Screenshot: selecting the table area in Tabula]

Then check that the data has been extracted correctly and select a format for export (CSV, JSON and so on):

[Screenshot: preview of the extracted data with export options]

And there you have it, all your data in a CSV file ready for use in R or Python or just a simple Excel spreadsheet:

[Screenshot: the exported CSV file]
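If Python is where the data is headed next, picking up the export takes only a few lines. This is a minimal sketch, not anything Tabula ships: the `load_table` helper, the column names and the sample values are all made up for illustration, and an in-memory string stands in for the exported file so the snippet runs on its own.

```python
import csv
import io


def load_table(source):
    """Read CSV text from an open file-like object into a list of dicts."""
    reader = csv.DictReader(source)
    return [dict(row) for row in reader]


# Stand-in for an exported file; the column names and values are invented.
sample = io.StringIO(
    "station_id,district,voters\n"
    "101,Example District A,1200\n"
    "102,Example District B,950\n"
)

rows = load_table(sample)

# Every value arrives as a string, so convert before doing arithmetic.
total_voters = sum(int(row["voters"]) for row in rows)
print(len(rows), total_voters)
```

With a real export you would pass an open file instead, for example `open("tabula-export.csv", newline="", encoding="utf-8")`. Both the file name and the UTF-8 encoding are assumptions here; check them against your own file, especially if the table mixes scripts as the Kandahar one does.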

Note that even though the interface runs in a browser, none of your data touches external servers. All the extraction happens on your own computer; nothing is sent to the cloud for processing. This is a really nice feature and I'm glad they wrote the software this way.

I haven't had any problems using Tabula so far. It's a great time saver. Highly recommended.

Pet Peeve: Tech Switching

I read a decent amount of tech media/press. Barely a day goes by when there isn't someone in my RSS feed explaining how they dropped application X for application Y. This seems to happen most often for frequently-used applications or workflows like scheduling/calendars or email.

I won't call out the specific blog post that set me writing this post, but suffice it to say that I wish there was a clause (in the contract of life) forcing tech writers or bloggers to state why the application they're singing the praises of is better than the one they were using up to now. Specifically, are there any new features, or does it just look shinier? Also, have you been using it for longer than a day or two?

I'm pretty solid and stable in the applications I use. It'll take something pretty seismic to rid me of DevonThink or Tinderbox or Mailmate. But if you catch me flip-flopping in my tech-related writing, please call me out on it.

DevonThink Resurgent

There has never been a better time to get into DevonThink and Tinderbox. Winterfest 2016 is on, and you can get 25% reductions on both those apps, as well as a number of other really useful pieces of software like Scrivener, TaskPaper, Bookends, Scapple and PDFPen.

If you’re unsure whether DevonThink is something you’d be interested in, they have a 150-hours-of-use free trial for all their different apps. The MacPowerUsers podcast just released a useful overview of the current state of the app, in the form of an interview with Stuart Ingram. ScreenCastsOnline also published the first part of a trilogy of video learning materials on DevonThink.

If you’re a Mac user who is perhaps uncomfortable with Evernote’s privacy policies or just seeking to get more out of the data you’ve stored on your hard drive, give DevonThink a try.

Highlights + DevonThink = Pretty Great

I’m late to the Highlights party, but I’m glad I got here.

Like many readers of this blog, I get sent (and occasionally read) a lot of PDFs. In fact, I did a quick search in DevonThink, and I am informed that I have 52,244 PDFs in my library. These are a mix of reports, archived copies of websites, scanned-and-OCRed photos and a thousand-and-one things in between.

Thus far, my workflow has been to read PDFs on my Mac. Any notes I took while reading the file were written up manually in separate files. I would laboriously copy and paste whatever text snippet or quotation I wanted to preserve along with its page reference. These would be fed into DevonThink’s AI engine and magic would happen.

Now, post-Highlights-installation, my workflow is much less laborious. I can highlight passages in-app, export all the quotations as separate text or HTML files and have DevonThink go do its thing without all the intermediary hassle. If you’re a professional researcher or writer using DevonThink as your notes database (and quite frankly, if not, why not?), the Highlights app will probably please you.