Data

Tabula for extracting table data from PDFs

Have you ever come across a PDF filled with useful data, but wanted to play around with that data yourself? In the past if I had that problem, I'd type the table out manually. This has some disadvantages:

  • it is extremely boring
  • it's likely that mistakes will get made, especially if the table is long and extends over several pages
  • it takes a long time

I recently discovered a tool that solves this problem: Tabula. It works on Windows and Mac and is very easy and intuitive to use. Simply take your page of data:

A page listing Kandahar's provincial council election polling stations from a few years back. Note the use of English and Dari scripts. Tabula handles all this without problems.

Then import the file into Tabula's web interface. It's surprisingly good at autodetecting where tables and table borders are, but you can do it manually if need be:

[Screenshot]

Then check that the data has been scraped correctly and select a format for export (CSV, JSON, etc.):

[Screenshot]

And there you have it, all your data in a CSV file ready for use in R or Python or just a simple Excel spreadsheet:

[Screenshot]
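If Python is your next stop, reading the exported CSV takes only a few lines. What follows is a minimal sketch rather than anything Tabula itself provides: the `load_table` helper and the `polling_stations.csv` filename are made up for illustration, and it assumes the export is UTF-8 with the column names in the first row.

```python
import csv
from pathlib import Path


def load_table(csv_path):
    """Read a CSV export into a list of dicts, one per table row.

    Assumes the first non-empty row holds the column names.
    """
    # utf-8-sig reads plain UTF-8 and also tolerates a byte-order mark,
    # so non-Latin scripts such as Dari come through intact.
    with Path(csv_path).open(newline="", encoding="utf-8-sig") as handle:
        rows = [[cell.strip() for cell in row] for row in csv.reader(handle)]

    # Drop rows that are completely empty (a common page-break artefact).
    rows = [row for row in rows if any(row)]
    if not rows:
        return []

    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


# Hypothetical usage:
# stations = load_table("polling_stations.csv")
# print(len(stations), "rows loaded")
```

Stripping whitespace and dropping blank rows up front means the long, multi-page tables that are so painful to retype arrive in a usable state.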

Note that even though the interface runs through a browser, none of your data touches external servers. All the processing and stripping of data from PDFs is done on your computer, and isn't sent for processing to cloud servers. This is a really nice feature and I'm glad they wrote the software this way.

I haven't had any problems using Tabula so far. It's a great time saver. Highly recommended.

DevonThink Resurgent

There has never been a better time to get into DevonThink and Tinderbox. Winterfest 2016 is on, and you can get 25% reductions on both those apps, as well as a number of other really useful pieces of software like Scrivener, TaskPaper, Bookends, Scapple and PDFPen.

If you’re unsure whether DevonThink is something you’d be interested in, they have a 150-hours-of-use free trial for all their different apps. The MacPowerUsers podcast just released a useful overview of the current state of the app, in the form of an interview with Stuart Ingram. ScreenCastsOnline also published the first part of a trilogy of video learning materials on DevonThink.

If you’re a Mac user who is perhaps uncomfortable with Evernote’s privacy policies or just seeking to get more out of the data you’ve stored on your hard drive, give DevonThink a try.

Highlights + DevonThink = Pretty Great

I’m late to the Highlights party, but I’m glad I got here.

Like many readers of this blog, I get sent (and occasionally read) a lot of PDFs. In fact, I did a quick search in DevonThink, and I am informed that I have 52,244 PDFs in my library. These are a mix of reports, archived copies of websites, scanned-and-OCRed photos and a thousand-and-one things in between.

Thus far, my workflow has been to read PDFs on my Mac. Any notes I took while reading the file were written up manually in separate files. I would laboriously copy and paste whatever text snippet or quotation I wanted to preserve along with its page reference. These would be fed into DevonThink’s AI engine and magic would happen.

Now, post-Highlights-installation, my workflow is much less laborious. I can take highlights in-app, export all the quotations as separate text or HTML files and have DevonThink go do its thing without all the intermediary hassle. If you’re a professional researcher or writer using DevonThink as your notes database (and quite frankly, if not, why not?), the Highlights app will probably please you.

PhD Tools: DevonThink for File Storage and Discovery

[This is part of a series on the tools I used to write my PhD. Check out the other parts here.]

Discovering similar notes in one of my DevonThink databases


I first heard about DevonThink in the same breath as Tinderbox. They go together, though they serve different purposes. Some people want to make an either/or decision about which to use. I see them as sufficiently different to assess them on their own merits and as per your usage scenario.

As with all tools, you should come to the decision table with a set of features that you're looking for. Don't just shop around for new things for the sake of newness or for the sake of having a really great set of tools. These programmes are not cheap. Luckily almost all of them come with generous trial versions or periods, but I don't recommend 'newness' as a feature of any particular merit.

DevonThink (I use the Pro Office version) is a place to store your files and notes. It can, I think, take any file you can throw at it. It comes with software for processing PDFs into fully-searchable documents (OCR software, in other words), which is part of the reason why the license for the Pro Office version of the programme is so expensive.

If you're anything like me, you're drowning in PDF documents. They all come with helpful names like "afghanistan_final_report_02_16.pdf" and unless you have a rigorous file hierarchy and sorting system, you'll probably be unable to find the file you need. Using the basic file hierarchy system for storage also doesn't help when you want to store the same file in multiple folders (e.g. what if a report is about both Afghanistan and Tunisia?). DevonThink has a feature which allows you to store a file in multiple locations without saving two copies of it. Any changes or annotations you make in one location will automatically show up in the other.

You might ask yourself why you would need DevonThink and Tinderbox (see this post for more). The short answer is that they store different kinds of files/data, and that DevonThink is less about thinking than about storage (to a certain extent) and discovery.

One of the key features of DevonThink Pro Office is its smart searching: its ability to suggest similar texts based on the contents of what you are looking at. It does this by means of a proprietary algorithm, so I can't really tell you how it works, but just know that it does. It works best on smaller chunks of text. For example, I was reading through a particular source from the 3-million-word-strong Taliban Sources Project database, clicked the "See also" button, and it found a source on the same topic that I would never otherwise have read, even though it didn't use a single one of the keywords I would have used to search for it. It uses semantic webs of words to figure this stuff out. Beyond a certain database size, this power becomes really useful. It can also archive websites, store anything including text, do in-text searches on e-books and so on. (Read more on how I use DevonThink for research in general here.)

I also used it a little as an archive for substantive drafts / iterations of the writeup process. That's another important part of the process: making backups of many different kinds. I never found any use for them, but at least they were there (just in case).

If you're a data and document hoarder at heart, like me, you'll soon have a DevonThink database (or several databases, split up by topic) that is bigger than you can fully comprehend, with more files than you can remember the contents of. At that point, search becomes really important. Not just a straightforward search: the ability to use 'fuzzy' terms (e.g. a search for "Afghanistan" will also find instances where it's misspelt as "Afgahistan") and boolean operators in your query is really powerful. DevonThink is an amazing search tool. The company that developed the database software also make something called DevonAgent, which is basically a power-user search tool for the internet. Google on steroids, if you will. Fully customisable, scriptable... you can really go crazy with this stuff. I use it, but my PhD wasn't really about searching things on the internet, so I didn't use it much for my research or writeup. But it's a great tool, too.
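To give a feel for what fuzzy matching means in practice, here is a small sketch using Python's standard `difflib` module. This is emphatically not how DevonThink does it (its search internals aren't public); the `fuzzy_find` helper, the 0.8 similarity cutoff and the sample notes are all assumptions made up for illustration.

```python
import re
from difflib import SequenceMatcher


def fuzzy_find(query, documents, cutoff=0.8):
    """Return (name, matched_word) pairs for documents containing a word
    that is similar enough to the query, ignoring case."""
    query = query.lower()
    hits = []
    for name, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            # ratio() is 1.0 for identical strings and falls towards 0.0
            # as the two strings share fewer characters in order.
            if SequenceMatcher(None, query, word).ratio() >= cutoff:
                hits.append((name, word))
                break  # one hit per document is enough
    return hits


# Made-up sample notes, purely for illustration.
notes = {
    "a.txt": "Report on Afgahistan elections",
    "b.txt": "Notes on Tunisia",
    "c.txt": "Afghanistan final report",
}
matches = fuzzy_find("Afghanistan", notes)
# matches contains a.txt (the misspelling) and c.txt, but not b.txt
```

Lowering the cutoff catches sloppier spellings at the cost of more false positives, which is the basic trade-off with any fuzzy search.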

In short, DevonThink is a research database tool that will help you store and find the documents that relate to your research, and do smart things to help you find sources and texts that maybe you'd forgotten you'd saved. Highly recommended for anyone working with large numbers of documents.

Walking Amman

 
 

I’ve been walking around Amman a little in the past couple of days. My poor sense of direction, combined with the city’s somewhat haphazard street layout, means I make use of digital GPS maps on a regular basis. In Europe or North America, Google Maps is my service of choice, with due acknowledgement of their general creepiness.

But I discovered yesterday that Google Maps is pretty atrocious when walking around Amman. Either their data is old and of poor quality, or the algorithm for calculating time/distance between two points is not properly calibrated for a city with many hills. If you look on Google Maps’ display, you’ll see what looks like a flat terrain. Everything can seem very close. If you look out of the window, or walk on the streets, you’ll see that hills and a highly variable topography are very much a part of the experience of the city. (This gives some idea of it).

Google Maps knows how to deal with hills or variable terrain. After all, San Francisco, close to their centre of operations, is a pretty hilly city and I found the maps and the estimated timings worked pretty well when I was there last year. Which suggests to me that the problem isn’t that Google forgot to take into account topography but rather that the data is poor.

I’m studying data science-y things these days, so I thought a bit about how they might improve this data. Some possible solutions:

  1. They’re already monitoring all the data coming from app usage etc, so why not track whether their estimates match up with how long people actually take to walk certain streets/routes? Mix that in with the topography data, and average it all out.
  2. They could send out more cars. I don’t know how accurate the map data for driving in Amman is, but some anecdotal accounts suggest that it suffers from similar problems. This is probably too expensive, and I’m assuming it’d be preferable to find a solution that doesn’t require custom data collecting of this kind. Maybe something for when the world has millions of driverless cars all powered by Google’s software, but for now it’s impractical as a solution.
  3. Find some abstract solution based on satellite-acquired topographic data which takes better account of gradients of roads etc.
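To make the third idea a little more concrete, here is a rough sketch of how gradient could feed into a walking-time estimate. It uses Tobler's hiking function, a well-known rule of thumb that puts walking speed at 6·exp(−3.5·|slope + 0.05|) km/h. I have no idea what Google actually uses, so treat this purely as an illustration; the function names and the example route segments are invented.

```python
import math


def tobler_speed_kmh(slope):
    """Walking speed in km/h from Tobler's hiking function.

    slope is rise over run (e.g. 0.10 for a 10% climb, negative downhill).
    Speed peaks at 6 km/h on a gentle 5% descent.
    """
    return 6.0 * math.exp(-3.5 * abs(slope + 0.05))


def walking_minutes(segments):
    """Estimate walking time for (horizontal_metres, climb_metres) segments."""
    total_hours = 0.0
    for horizontal_m, climb_m in segments:
        if horizontal_m <= 0:
            continue  # ignore degenerate segments
        slope = climb_m / horizontal_m
        total_hours += (horizontal_m / 1000.0) / tobler_speed_kmh(slope)
    return total_hours * 60.0


# Invented example: the same 1 km walk, flat versus up and down hills.
flat_route = [(1000, 0)]
hilly_route = [(400, 60), (300, -30), (300, 45)]
flat_estimate = walking_minutes(flat_route)
hilly_estimate = walking_minutes(hilly_route)
# hilly_estimate comes out longer than flat_estimate for the same distance
```

The per-segment slopes could in principle come from satellite elevation data, and the constants could be tuned against observed walking times as in the first idea.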

For the moment, Google Maps is a pretty poor user experience for a pedestrian. Yesterday evening I was walking back home from the centre of town. The walk would, Google told me, take only 12 minutes. 40+ minutes later I arrived home.

Others have noted this same problem and suggested an alternative: OpenStreetMap data. The data isn’t tied to a particular app, but I downloaded one along with the offline mapping data for Jordan/Amman. It seems pretty good at first glance, and I’ll be testing it out in the coming days. I’m interested to learn why it seems to perform better. My initial hypothesis is that its data is just better than that which Google Maps is using.

Taliban public punishments, 1996–2001

 

Executions are a recurrent motif in how historians, journalists and analysts have chosen to write about the Afghan Taliban. See the opening to Dexter Filkins’ The Forever War as one example, or this Reuters piece from May 1999. I wanted to study the role of executions and public punishments in the Taliban’s government for a while, but lacked data to place the anecdotes into some sort of context.

This short overview is a compilation of sources relating to the Taliban’s public punishments, 1996–2001. It is compiled from publicly available sources as well as from the materials gathered as part of the Taliban Sources Project. I think it is as complete an overview as is possible to get from these public sources, given that the Taliban weren’t shy about publicising their ‘public justice project’ – indeed, for them, the publicity was the point – and that we have multiple complete newspaper runs for the time they were in power. This was collated and triangulated with sources from Associated Press, Agence France Presse, BBC Monitoring and the Afghan Islamic Press news agency.

As a brief summary, I was able to find 101 incidents in total that chronicled the deaths of 119 individuals. I included some instances of public punishment not resulting in death, but this wasn’t really the focus of my search so their numbers may be underrepresented in the list. As another caveat, I was of course only looking at public executions, not anything that went on in secret as part of intelligence or domestic security operations and so on. Kabul, Kandahar and Herat were the most prominent locations for incidents and executions, with over half of the total coming from those three provinces alone. (Note that this may reflect a bias in which incidents were reported from the provinces.)

In any case, I wanted to present the raw data here alongside a timeline and another chart or two in case this is useful for other researchers/analysts. If you find I’ve missed an event, please drop me a line via email or on twitter and I’ll be sure to add it to the database.

Now head over here for an interactive timeline, charts and the raw data...