Coding

Fuzzy Searching and Foreign Name Recognition

Here's something that happens fairly often: I'll be reading a book and someone's name is mentioned. I'll think to myself that it'd be useful at this point to get a bit of extra information before I continue reading. I hop over to DevonThink to do a full-text search over all my databases. I let the search run for a short while, but nothing comes up. I tweak the name slightly to see if a different spelling brings more results. That works a bit better, but I have to tweak the spelling several times before I can really claim the search has been exhaustive.

Anyone who's done work in and on a place where a lot of material is generated without fixed spellings for transliteration will know this problem. In Afghanistan, it ranges from people's names -- Muhammad, Mohammad, Muhammed, Mohammed etc -- to place and province names -- Kunduz, Konduz, Kondoz, Qonduz, Qhunduz etc.

DevonThink actually has a 'fuzzy search' option that you can toggle, but it isn't clear to me how it works or whether it's reliable as a replacement for a more systematic approach.

As I'm currently doing more and more work in Python, I was considering what my options would be for building my own fuzzy search tool.

My first thought was to be prescriptive about the rules and transformations at play when people make different spelling choices. The Kunduz example above shows that vowels are a key point of contention: the 'u' can also be spelt 'o'. The 'K' at the beginning could, in certain circumstances, become 'Q' or 'Qh'. These rules could then be coded into a system that collects all the possible spelling variations of a particular string and searches the database for each one.
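Here's a minimal sketch of what that rule-based approach might look like. The substitution rules below are just the couple from the Kunduz example, nothing like a complete set:

import itertools

# illustrative substitution rules only -- a real system would need many more
RULES = {
    'u': ['u', 'o'],
    'K': ['K', 'Q', 'Qh'],
}

def spelling_variants(name):
    # for each character, list its possible substitutes (or just itself)
    options = [RULES.get(char, [char]) for char in name]
    # the cartesian product of those options gives every possible variant
    return {''.join(combo) for combo in itertools.product(*options)}

print(spelling_variants('Kunduz'))
# {'Kunduz', 'Konduz', 'Qundoz', 'Qhonduz', ...}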

Following a bit of duckduckgo-ing around, I've since learnt that this problem has been discussed quite extensively and that various solutions have been proposed. One commonly referenced option is a Python package called 'FuzzyWuzzy'; it uses a mathematical metric called the Levenshtein distance to measure how similar two strings are. I imagine there are many other possible metrics one could use to quantify how much two strings resemble one another.
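A quick sketch of how that looks in practice, assuming FuzzyWuzzy is installed (pip install fuzzywuzzy). Scores run from 0 to 100, and the numbers in the comments are only indicative:

from fuzzywuzzy import fuzz, process

# similarity score between two strings, out of 100
print(fuzz.ratio('Kunduz', 'Qonduz'))
# e.g. around 67

# rank a list of known spellings by how closely they match a query
spellings = ['Kunduz', 'Konduz', 'Kondoz', 'Qonduz', 'Kandahar']
print(process.extract('Qhunduz', spellings, limit=3))
# the three closest spellings, each paired with a similarity score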

I imagine the most accurate solution is a mixture of both approaches. You want something that is agnostic about content for situations where you don't have domain knowledge. (I happen to have read a lot of the material relating to Afghanistan, so I know that these variations exist and that a single entity unites the various spellings of Kunduz, for example.) But you probably also want to code in some common rules for things that come up often. (See this article, for example, on the confusion over spellings of Muslim names and how it leads to law enforcement mistakes.)

I may end up coding up a version that has high accuracy on Afghan names because it's a scenario in which I often find myself, but I'll have to explore the other more mathematically-driven options to see if I can find a happy medium.

Installing PostgreSQL on a Mac

PostgreSQL is an open-source relational database system. It has been around for a while, and is in the middle of a sort of revival. Installing Postgres on your own system can be a little difficult; last time I tried, I was helped through the process while doing the Udacity Intro to Programming Nanodegree.

Recently I had to reinstall Postgres, and the Dataquest lessons that guided me through it this time included some useful improvements to the process.

Postgres.app is an application you can install on your Mac which simplifies a lot of the legwork, particularly when setting up new databases, servers and so on.

If you want a commonly used Python library for interfacing with Postgres, psycopg2 is a good option. You can install it easily with Anaconda:

conda install psycopg2
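Once that's installed, talking to a local database only takes a few lines. Here's a minimal sketch, assuming a database called 'test_db' already exists (the database and table names are just placeholders):

import psycopg2

# connect to a local database (Postgres.app runs the server on localhost)
conn = psycopg2.connect(dbname='test_db')
cur = conn.cursor()

# create a table, insert a row, and read it back
cur.execute('CREATE TABLE IF NOT EXISTS places (id serial PRIMARY KEY, name text);')
cur.execute('INSERT INTO places (name) VALUES (%s);', ('Kunduz',))
cur.execute('SELECT * FROM places;')
print(cur.fetchall())

conn.commit()
conn.close()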

Making and shuffling lists in Python

I discovered some useful functions the other day while working through one of the Dataquest guided projects. They all relate to lists and arrays and use NumPy. I'm listing them here mainly as a note to my future self.

import numpy as np

# arange returns an array of n items, counting up from 0
np.arange(3)
# returns array([0, 1, 2])

# with two arguments, it counts from the first value up to (but not including) the second
np.arange(3, 7)
# returns array([3, 4, 5, 6])

# a third argument sets a step size between values
np.arange(2, 9, 2)
# returns array([2, 4, 6, 8])

# these are slightly different; they shuffle rather than sort
# if you want an array of numbers in a random order:

np.random.permutation(10)
# returns the numbers 0-9 in an array, randomly shuffled

# you can also pass non-numeric lists into `permutation`
letters = ['a', 'b', 'c']
np.random.permutation(letters)
# returns something like array(['b', 'a', 'c'], dtype='<U1')
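One way I can imagine using this (a sketch of my own, not from the lessons): shuffling two parallel lists together by permuting their shared indices, so the pairs stay lined up:

import numpy as np

questions = ['capital of Balkh?', 'capital of Herat?', 'capital of Kandahar?']
answers = ['Mazar-i-Sharif', 'Herat City', 'Kandahar City']

# permute the indices once, then use them to reorder both lists together
shuffled_indices = np.random.permutation(len(questions))
for i in shuffled_indices:
    print(questions[i], '->', answers[i])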

Tabula for extracting table data from PDFs

Have you ever come across a PDF filled with useful data and wanted to play around with that data yourself? In the past, when I had that problem, I'd type the table out manually. This has some disadvantages:

  • it is extremely boring
  • it's likely that mistakes will get made, especially if the table is long and extends over several pages
  • it takes a long time

I recently discovered a tool that solves this problem: Tabula. It works on Windows and Mac and is very easy and intuitive to use. Simply take your page of data:

A page listing Kandahar's provincial council election polling stations from a few years back. Note the use of English and Dari scripts. Tabula handles all this without problems.

Then import the file into Tabula's web interface. It's surprisingly good at autodetecting where tables and table borders are, but you can do it manually if need be:

[Screenshot: selecting the table area in Tabula's web interface]

Then check that the data has been correctly extracted and select a format for export (CSV, JSON etc.):

[Screenshot: previewing the extracted data before export]

And there you have it, all your data in a CSV file ready for use in R or Python or just a simple Excel spreadsheet:

[Screenshot: the exported CSV opened in a spreadsheet]
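From there it's a one-liner to get the data into Python. A small sketch, assuming the export was saved as polling_stations.csv (a made-up filename) and pandas is installed:

import pandas as pd

# load the CSV that Tabula exported
df = pd.read_csv('polling_stations.csv')
print(df.head())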

Note that even though the interface runs in a browser, none of your data touches external servers: all the processing and extraction happens on your own computer, and nothing is sent to the cloud. This is a really nice feature and I'm glad they wrote the software this way.

I haven't had any problems using Tabula so far. It's a great time saver. Highly recommended.