Fuzzy Searching and Foreign Name Recognition

Here's something that happens fairly often: I'll be reading something in a book and someone's name is mentioned. I'll think to myself that it'd be useful at this point to get a bit of extra information before I continue reading. I hop over to DevonThink to do a full-text search over all my databases. I let the search compute for a short while, but nothing comes up. I tweak the name slightly to see if a slightly different spelling brings more results. That works a bit better, but I have to tweak the spelling several times until I can really claim the search has been exhaustively performed.

Anyone who's done work in and on a place where a lot of material is generated without fixed spellings for transliteration will recognise this problem. In Afghanistan, this ranges from people's names -- Muhammad, Mohammad, Muhammed, Mohammed etc -- to place and province names -- Kunduz, Konduz, Kondoz, Qonduz, Qhunduz etc.

DevonThink actually has a 'fuzzy search' option that you can toggle but it isn't clear to me how it works or whether it's reliable as a replacement for a more systematic approach.

As I'm currently doing more and more work using Python, I was considering what my options would be for making my own fuzzy search emulator.

My first thought was to be prescriptive about the various rules and transformations that happen when people make different spelling choices. The Kunduz example from above reveals that vowels are a key point of contention: the 'u' can also be spelt 'o'. The 'K' at the beginning could also, in certain circumstances, become 'Q' or 'Qh'. These various rules could then be coded in a system that would collect all the possible spelling variations of a particular string and then search the database for all the different variations.
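As a rough illustration of that rule-based idea, here is a minimal sketch. The two rule tables are assumptions drawn only from the Kunduz example (an initial 'K' may become 'Q' or 'Qh', and any 'u' may become 'o'); the names `INITIAL_RULES`, `ANYWHERE_RULES` and `spelling_variants` are made up for this example, and a real rule set would need to be much larger.

```python
from itertools import product

# Hypothetical rules, taken only from the Kunduz example.
INITIAL_RULES = {"k": ["k", "q", "qh"]}   # applied to the first letter only
ANYWHERE_RULES = {"u": ["u", "o"]}        # applied wherever the letter occurs


def spelling_variants(name):
    """Return every spelling the rules can produce for `name`, sorted."""
    options = []
    for position, letter in enumerate(name.lower()):
        if position == 0 and letter in INITIAL_RULES:
            options.append(INITIAL_RULES[letter])
        else:
            # Letters without a rule are kept as they are.
            options.append(ANYWHERE_RULES.get(letter, [letter]))
    # product() walks through every combination of the per-letter choices.
    return sorted({"".join(combo).capitalize() for combo in product(*options)})


variants = spelling_variants("Kunduz")
print(len(variants))  # 12
print(variants[:4])   # ['Kondoz', 'Konduz', 'Kundoz', 'Kunduz']
```

Each variant could then be fed to the search in turn. The number of variants multiplies with every rule that applies, so longer names with many ambiguous letters would quickly produce a lot of searches.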

Following a bit of duckduckgo-ing around, I've since learnt that there are quite extensive discussions of this problem, as well as a number of proposed approaches to solving it. One commonly referenced option is a Python package called 'FuzzyWuzzy'; it uses a metric called the Levenshtein distance to measure how similar two strings are. I imagine that there are many other metrics that one could use to detect how much two strings resemble one another.
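To make the Levenshtein distance a bit more concrete, here is a plain-Python sketch of the standard dynamic-programming calculation. This is not FuzzyWuzzy's own code or scoring; the `similarity` helper in particular is just one assumed way of turning the distance into a 0-to-1 score.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the prefix of a seen so far
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical 0-to-1 score: 1.0 means identical, ignoring case."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a.lower(), b.lower()) / longest


print(levenshtein("Kunduz", "Konduz"))   # 1
print(levenshtein("Kunduz", "Qhunduz"))  # 2
```

The appeal of this kind of metric is that it needs no knowledge of Afghan names at all: 'Kunduz' and 'Konduz' come out as close simply because only one character differs.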

I imagine the most accurate solution is a mixture of both approaches. You want something that is agnostic about content for situations where you don't have domain knowledge. (I happen to have read a lot of the materials relating to Afghanistan, so I know that these variations of names exist and that there is a single entity that unites the various spellings of Kunduz, for example.) But you probably want to code in some common rules for things which come up often. (See this article, for example, on the confusion over spellings of Muslim names and how this leads to law enforcement mistakes.)
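A mixture of the two might look something like the sketch below: apply a few domain rules first to collapse known variants onto one key, then fall back on a general-purpose similarity measure for anything the rules miss. The `REPLACEMENTS` list and the 0.8 threshold are assumptions chosen for illustration, and the fallback uses `difflib.SequenceMatcher` from the standard library rather than the Levenshtein distance, purely to keep the example dependency-free.

```python
from difflib import SequenceMatcher

# Hypothetical normalisation rules, applied in order. They cover only the
# Kunduz and Muhammad examples mentioned in this post.
REPLACEMENTS = [("qh", "k"), ("q", "k"), ("o", "u"), ("e", "a")]


def normalise(name):
    """Collapse known spelling variants onto a single lowercase key."""
    key = name.lower()
    for old, new in REPLACEMENTS:
        key = key.replace(old, new)
    return key


def is_match(query, candidate, threshold=0.8):
    """True if the rules unify the two names or they are otherwise similar."""
    a, b = normalise(query), normalise(candidate)
    if a == b:
        return True  # the domain rules were enough
    # Content-agnostic fallback for variants the rules do not know about.
    return SequenceMatcher(None, a, b).ratio() >= threshold


print(normalise("Qhunduz"))              # kunduz
print(is_match("Mohammed", "Muhammad"))  # True
print(is_match("Kunduz", "Kandahar"))    # False
```

Rules this blunt will also merge names that really are different, so the rule list and the threshold would both need tuning against real data.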

I may end up coding up a version that has high accuracy on Afghan names because it's a scenario in which I often find myself, but I'll have to explore the other more mathematically-driven options to see if I can find a happy medium.

Tweeting to the Void

I've previously written about how I turned off Facebook's news feed. I keep an account with Facebook because people occasionally contact me there. It is also an unfortunate truth that many companies in Jordan (where I live) or in the wider Middle East only have representation on Facebook instead of their own website. (Why they insist on doing this baffles me and is perhaps a topic for a future post).

I have long preferred Twitter as a medium for filtering through or touching -- however obliquely -- things going on at any particular moment. I have no pretensions to actively follow every single tweet to pass through my feed. Rather, it's something I dip into every now and then.

Increasingly in recent months, I found myself growing dissatisfied with the pull it often has on me. It has become something of a truism to state that 'twitter isn't what it once was', but there's less and less long-term benefit in following discussions as and when they happen.

RescueTime tells me that I spent 86 hours and 16 minutes on Twitter in 2017 -- just under a quarter of an hour each day. That feels like a lot to me.


Enter 'Tweet to the Void'. This is a Chrome extension. (For Firefox and other browsers, I have to imagine things like this exist.) When I visit twitter.com, the feed is not visible. All I see is somewhere to post a tweet if that's what I want to do. (There is still some value in posting blogposts and articles there, since I know some people don't use RSS). Of course, I can always turn off the extension with ease, but adding this extra step has effectively neutralised Twitter for me. 

Try it; see how you feel about having something standing in the way of your social media fix. Let me know how you get on.

Installing PostgreSQL on a Mac

PostgreSQL is an open-source relational database system that you query with SQL. It has been around for a while, and is in the middle of a sort of revival. Installing Postgres on your own system can be a little difficult. Last time I tried, I was helped through the process while doing the Udacity Intro to Programming Nanodegree.

Recently I had to reinstall Postgres, and there were some useful improvements to the process when guided through it in my Dataquest lessons.

Postgres.app is an application you can install on your Mac which simplifies a lot of the legwork, particularly when setting up new databases, servers and so on.

If you want a Python library for interfacing with Postgres, psycopg2 is a commonly used option. You can install it easily with Anaconda:

conda install psycopg2
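Once psycopg2 is installed, a quick way to check that Python can talk to the server is to ask it for its version. The sketch below assumes a server listening on localhost at port 5432; the database name and user in the usage comment are placeholders you would swap for your own, and `connection_settings` and `server_version` are names made up for this example.

```python
def connection_settings(dbname, user, host="localhost", port=5432):
    """Bundle the keyword arguments that psycopg2.connect() accepts."""
    return {"dbname": dbname, "user": user, "host": host, "port": port}


def server_version(settings):
    """Connect, run one query and return the server's version string."""
    import psycopg2  # imported here so the helper above works without it

    conn = psycopg2.connect(**settings)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT version();")
            return cur.fetchone()[0]
    finally:
        conn.close()  # always release the connection


# Usage, with placeholder names (requires a running server):
# print(server_version(connection_settings("mydatabase", "myusername")))
```

If the call raises an `OperationalError`, the usual suspects are a server that isn't running or a database name that doesn't exist yet.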

Making and shuffling lists in Python

I discovered some useful functions the other day while trying to solve one of the Dataquest guided projects. These all relate somehow to lists and use NumPy. I'm listing them here mainly as a note for my future self.

import numpy as np

# np.arange(n) returns an array of n integers starting at 0
np.arange(3)
# ---- returns array([0, 1, 2])

# a variation on the previous one: pass a start and a stop (stop is excluded)
np.arange(3, 7)
# ---- returns array([3, 4, 5, 6])

# a third argument adds a step between values
np.arange(2, 9, 2)
# ---- returns array([2, 4, 6, 8])

# these are slightly different; they shuffle rather than generate
# if you want a run of numbers in random order:
np.random.permutation(np.arange(1, 10))
# ---- returns the numbers 1-9 in an array, randomly ordered

# you can also pass non-numeric lists into `permutation`
letters = ['a', 'b', 'c']
np.random.permutation(letters)
# ---- returns something like array(['b', 'a', 'c'])

Tabula for extracting table data from PDFs

Have you ever come across a PDF filled with useful data, but wanted to play around with that data yourself? In the past if I had that problem, I'd type the table out manually. This has some disadvantages:

  • it is extremely boring
  • it's likely that mistakes will get made, especially if the table is long and extends over several pages
  • it takes a long time

I recently discovered a tool that solves this problem: Tabula. It works on Windows and Mac and is very easy and intuitive to use. Simply take your page of data:

A page listing Kandahar's provincial council election polling stations from a few years back. Note the use of English and Dari scripts. Tabula handles all this without problems.

Then import the file into Tabula's web interface. It's surprisingly good at autodetecting where tables and table borders are, but you can do it manually if need be:


Then check that the data has been correctly scraped and select a format for export (from CSV to JSON etc):


And there you have it, all your data in a CSV file ready for use in R or Python or just a simple Excel spreadsheet:


Note that even though the interface runs through a browser, none of your data touches external servers. All the processing and stripping of data from PDFs is done on your computer, and isn't sent for processing to cloud servers. This is a really nice feature and I'm glad they wrote the software this way.
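Once the CSV has been exported, reading it into Python takes only a few lines with the standard library. This is a minimal sketch: the sample rows and column names are made up to stand in for a real Tabula export, and the file name in the closing comment is a placeholder.

```python
import csv
import io


def load_table(source):
    """Read CSV text from an open file object into a list of row dicts."""
    return list(csv.DictReader(source))


# Made-up sample data standing in for a Tabula export.
sample = io.StringIO("station_id,name\n1,Example School\n2,Example Clinic\n")
rows = load_table(sample)
print(len(rows))        # 2
print(rows[0]["name"])  # Example School

# With a real export, open the file with an explicit encoding so that
# non-Latin scripts such as Dari are read correctly:
# with open("tabula-export.csv", newline="", encoding="utf-8") as f:
#     rows = load_table(f)
```

Every value comes back as a string, so numeric columns need converting before you do any arithmetic on them.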

I haven't had any problems using Tabula so far. It's a great time saver. Highly recommended.

Language Learner's Journal: Meaningful Leisure

[This is a continuation of Taylor's blog series where she details some of the week-in-week-out lessons that she learns through her Arabic studies and coaching work together with me. For other posts in the series, click here.] 

If the first phase of my Arabic study in Jordan was intensive textbook fusha and the second was track-switching ammiya classes, this third and current could be called meaningful leisure, or, hanging out around town a lot and making friends. 

When I went to Bombay for an extended stay in 2010, a journalism colleague gave me a piece of advice: "Take everyone up on their offer to hang out with you." It may sound "duh," but over the years living abroad, I've seen how foreigners spend their free time in ways that often diverge from how residents in a given city do so. When we, as gringos in Rio, may have wanted to go to foreign film festivals or paragliding over the beach, many of our Brazilian peers would be going to baby showers, a classmate's thesis defense, or Outback Steakhouse. All of those activities are great ones, and I think the spirit of my colleague's advice was: If you want to get to know a culture, let your host take the lead and show you how they spend their free time.

That means over the past few weeks, I've sat on the sidewalk in front of a gift shop with a delightful young sculptor and a store clerk, my partners in very unstructured language exchanges that break when one of them needs to pop into the shop to attend to a client. I went for a 6:30 a.m. workout with two of the fastest runners in Amman, a pair of brothers I met at a sunset race in Wadi Rum as we waited in the dunes watching for headlamps of other runners finishing. I went to a capoeira performance at Jadal cafe that was held in commemoration of the nakba; I was pleased with how accessible the discussion after the performance was for me, particularly when an older man in the audience vigorously questioned the capoeiristas as to why they needed to do someone else's sport when they could do dabke.

Alex often talks about "islands" of vocabulary, and I thought about that as I spent more time with the same people and found I could make good guesses about the words they're using. (As I crossed the finish line at the race, other runners asked me ايش كان مركزك؟ though I certainly hadn't run fast enough to place. It was satisfying, though, to deduce what they were saying.) The store clerk and I talk often about money and salaries, since she hustles to work two jobs to help her family out.

I could be more purist; I speak plenty of English in these interactions. I'm still searching for the point of equilibrium between taking advantage of each opportunity I get to speak in Arabic while (of course!) having genuine friendships with peers with whom I share interests (running, yoga, current events, feminism, vegetarianism, pets). Plenty of the vocabulary and references regarding those topics are in English, not to mention the people who are interested in them often read and speak in English about them. I don't believe every friendship needs to be instrumentalized for one's language-learning goals (though I believe even more strongly that such an attitude should not be a lofty cover for native English speakers kicking back and relaxing). When I told Alex about my happy sidewalk sessions, which qualify more as bilingual shooting-the-shit than a proper language exchange, he said: You're doing the real thing, rather than practicing for it.

Some working notes, now, on practice:

I've been happy with my second time around testing out language exchanges; I've used the website Conversation Exchange, which I had suspected could be out of use judging by its retro web design but is actually popping. I'm pretty strict about where I meet the person, i.e., it needs to be as quiet as possible (a first exchange at Indoor cafe across from the University of Jordan was really hard to decipher and, from my point of view, turned into disjointed monologues rather than a conversation because I couldn't hear her well).

I think the exchanges, for my current level, are less experimental zones and more consolidation ones. That is to say, I don't take risks and reach for vocabulary I'm shaky on but work with what I know decently. That's why I like coupling the exchanges with private classes, which I go to twice a week and which are a better place for reaching and experimenting. I also think that in a language exchange it is useful to ask my partner "is the way I said that correct?" but not productive to ask "why?" I save those questions for my teacher.

Alex encouraged me to discover certain transition phrases (على فكرة... على كل حال... بالرغم من) and put them into practice in my speech, which gives the impression of being more fluent and conversant than I am. This has been a fun exercise with my private teacher, since I take the English phrases I want and try to describe to her a situation in which I might use them.

I'm on board with the many lines of criticism telling us that we need to make an active effort to start unplugging our lives before we turn into cyborgs; that said, having a round of friends here I chat with on Facebook or Whatsapp has indeed been great practice for seeing spelled out how people are saying what I hear each day. In conversations, I still feel like I rarely could repeat back word-for-word what someone has said to me, even if I usually get the message through key words and context.

I bought Diwan Baladna, an ammiya vocabulary book organized by subject matter. I really like it – my hope is that it will help me turn a lot of passive vocabulary into active vocabulary. I have a quibble with the audio component: it is read too fast, in long audio files that make it tedious to isolate the word I want. (And having sample sentences is far better than English translations!)

And finally, as per Alex's encouragement, I continue to avoid dictionaries and translation apps. I make ample use of Reverso Context, but only after I've read a message or passage several times through, and usually I'm using it to confirm that my guess at a word's meaning is right. Especially when it comes to Whatsapp and chatting, the majority of messages I am receiving are ones that involve words I know well (Want to meet at this time? How far did you run today? I have foul and rice my mom made, want some? It's veg.)