Fuzzy Searching and Foreign Name Recognition

Here's something that happens fairly often: I'll be reading something in a book and someone's name is mentioned. I'll think to myself that it'd be useful at this point to get a bit of extra information before I continue reading. I hop over to DevonThink to do a full-text search over all my databases. I let the search compute for a short while, but nothing comes up. I tweak the name slightly to see if a slightly different spelling brings more results. That works a bit better, but I have to tweak the spelling several times until I can really claim the search has been exhaustively performed.

Anyone who's done work in and on a place where a lot of material is generated without fixed spellings for transliteration will recognise this problem. In Afghanistan, this ranges from people's names -- Muhammad, Mohammad, Muhammed, Mohammed etc -- to place and province names -- Kunduz, Konduz, Kondoz, Qonduz, Qhunduz etc.

DevonThink actually has a 'fuzzy search' option that you can toggle but it isn't clear to me how it works or whether it's reliable as a replacement for a more systematic approach.

As I'm currently doing more and more work using Python, I was considering what my options would be for making my own fuzzy search emulator.

My first thought was to be prescriptive about the various rules and transformations that happen when people make different spelling choices. The Kunduz example from above reveals that vowels are a key point of contention: the 'u' can also be spelt 'o'. The 'K' at the beginning could also, in certain circumstances, become 'Q' or 'Qh'. These various rules could then be coded in a system that would collect all the possible spelling variations of a particular string and then search the database for all the different variations.
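As a rough sketch of what that rule-based approach might look like in Python: the two rule tables below are hypothetical and cover only the Kunduz example above (a real rule set for Afghan names would need many more entries), and the function names are my own for illustration.

```python
from itertools import product

# Hypothetical substitution rules, taken only from the Kunduz example.
# Each key is a letter; the value lists the spellings it may take.
VOWEL_RULES = {"u": ["u", "o"]}
INITIAL_RULES = {"k": ["k", "q", "qh"]}  # applied to the first letter only


def spelling_variants(name):
    """Return every spelling reachable by applying the rules to `name`."""
    options = []
    for index, char in enumerate(name.lower()):
        if index == 0 and char in INITIAL_RULES:
            options.append(INITIAL_RULES[char])
        elif char in VOWEL_RULES:
            options.append(VOWEL_RULES[char])
        else:
            options.append([char])
    # Combine one choice per position, then de-duplicate and sort.
    return sorted({"".join(parts).capitalize() for parts in product(*options)})


def find_variants_in(text, name):
    """List the variants of `name` that actually occur in `text`."""
    lowered = text.lower()
    return [v for v in spelling_variants(name) if v.lower() in lowered]


print(spelling_variants("Kunduz"))
print(find_variants_in("The road to Qonduz was closed.", "Kunduz"))
```

The number of variants multiplies with every rule that applies, so a longer name with several ambiguous letters would quickly produce a long list of search terms.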

Following a bit of duckduckgo-ing around, I've since learnt that there are quite extensive discussions of this problem, and that a number of solutions have been proposed. One commonly referenced option is a Python package called 'FuzzyWuzzy'; it uses a metric called the Levenshtein distance (the minimum number of single-character insertions, deletions or substitutions needed to turn one string into another) to measure how similar two strings are. I imagine that there are many other possible metrics that one could use to detect how much two strings resemble one another.
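To make the metric concrete, here is a plain-Python sketch of the Levenshtein distance itself. It is not FuzzyWuzzy's API or its scoring formula; the `similarity` helper, which scales the distance to a 0-1 score, is my own assumption for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits needed to turn a into b."""
    # previous[j] holds the distance between the prefix of a processed
    # so far and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to 0..1, where 1.0 means identical (ignoring case)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


for other in ["Konduz", "Kondoz", "Qhunduz", "Kandahar"]:
    print(other, levenshtein("Kunduz", other), round(similarity("Kunduz", other), 2))
```

A search would then keep any candidate whose score clears a threshold; where that threshold should sit is something to tune against real data.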

I imagine the most accurate solution is a mixture of both approaches. You want something that is agnostic about content in the case of situations where you don't have domain knowledge. (I happen to have read a lot of the materials relating to Afghanistan, so I know that these variations of names exist and that there is a single entity that unites the various spellings of Kunduz, for example). But you probably want to code in some common rules for things which come up often. (See this article, for example, on the confusion over spellings of Muslim names and how this leads to law enforcement mistakes).
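A minimal sketch of how such a mixture might work, under my own assumptions: a handful of hypothetical normalisation rules collapse known variations first, and a generic string-similarity score (here `difflib.SequenceMatcher` from the standard library, standing in for whichever metric you prefer) catches whatever the rules miss. The rules and the 0.8 threshold are illustrative, not tested against real data.

```python
import re
from difflib import SequenceMatcher

# Hypothetical normalisation rules: (regex pattern, replacement).
# They are applied in order to a lower-cased name.
RULES = [
    (r"^qh", "k"),  # Qhunduz -> kunduz
    (r"^q", "k"),   # Qonduz  -> konduz
    (r"o", "u"),    # konduz  -> kunduz
    (r"e", "a"),    # muhammed -> muhammad
]


def normalise(name):
    """Collapse known spelling variations into one canonical key."""
    key = name.lower()
    for pattern, replacement in RULES:
        key = re.sub(pattern, replacement, key)
    return key


def is_same_name(a, b, threshold=0.8):
    """True if the rules unify the names, or if they are similar enough."""
    key_a, key_b = normalise(a), normalise(b)
    if key_a == key_b:
        return True
    # Fall back to a content-agnostic similarity score between 0 and 1.
    return SequenceMatcher(None, key_a, key_b).ratio() >= threshold


print(is_same_name("Kunduz", "Qhunduz"))     # unified by the rules
print(is_same_name("Mohammed", "Muhammad"))  # unified by the rules
print(is_same_name("Kunduz", "Kundus"))      # caught by the similarity score
print(is_same_name("Kunduz", "Kandahar"))
```

The rules encode domain knowledge and are cheap to check; the similarity fallback is what keeps the search useful for spellings nobody anticipated.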

I may end up coding up a version that has high accuracy on Afghan names because it's a scenario in which I often find myself, but I'll have to explore the other more mathematically-driven options to see if I can find a happy medium.

Highlights + DevonThink = Pretty Great

I’m late to the Highlights party, but I’m glad I got here.

Like many readers of this blog, I get sent (and occasionally read) a lot of PDFs. In fact, I did a quick search in DevonThink, and I am informed that I have 52,244 PDFs in my library. These are a mix of reports, archived copies of websites, scanned-and-OCRed photos and a thousand-and-one things in between.

Thus far, my workflow has been to read PDFs on my Mac. Any notes I took while reading the file were written up manually in separate files. I would laboriously copy and paste whatever text snippet or quotation I wanted to preserve along with its page reference. These would be fed into DevonThink’s AI engine and magic would happen.

Now, post-Highlights-installation, my workflow is much less laborious. I can take highlights in-app, export all the quotations as separate text or HTML files and have DevonThink go do its thing without all the intermediary hassle. If you’re a professional researcher or writer using DevonThink as your notes database (and quite frankly, if not, why not?) the Highlights app will probably please you.

PhD Tools: Goodreads for Cross-Pollination

[This is part of a series on the tools I used to write my PhD. Check out the other parts here.]

During the period I was working most intensely on my PhD writeup, I read over 100 books. I put that number out there not as a confrontation, but as an illustration that reading is important to ensure you don't get lost in a small box of your own creation. Judging purely from my own experience and from sporadic conversations with a loose handful of fellow PhD candidates, this can be a real problem.

Reading widely and about issues and problems wholly unrelated to your field of study is, I believe, the hallmark of a curious mind. If I meet someone for the first time and I'm assessing their work, I'm far more likely to be interested in the last ten books they've read than many other data points. Even the fact that someone is taking time to read, and to read diversely, is an important indicator for me.

I think I can date my adoption of this books-and-ideas-for-cross-fertilisation to when I read Steven Johnson's book Where Good Ideas Come From. He makes a strong case for a more deliberate approach to how you develop and cultivate ideas in your thinking life. (The book is short and highly suggestive of specific approaches to work. I'd recommend it if this kind of thing interests you).

I've found that things that I don't track and monitor tend to fall by the wayside. Hence Goodreads and Beeminder and a number of other tracking tools. Goodreads allows you to set how many books you want to read each year and then keeps a convenient little widget reminding you how far ahead of or behind your goal you are. If you want a bit more of a 'sting' for non-compliance, you can hook up Beeminder and you'll be kept honest that way.

Reading books on unrelated topics was something I would do in the afternoons or evenings after my Four Perfect Hours. The time would be mine and I could read without any sense of guilt or that I wasn't making progress on my PhD writeup. No, I'd done my work in the morning, so now I could read to my heart's content.

Encounters with books are encounters with other ideas, other minds. It refreshes your approach and your sense of perspective -- both so important for your PhD. Give it a try! See how you can add in some reading time to your daily routine. Even 30 minutes before bed each evening adds up in the end.

PhD Tools: "Always return to your primary sources"

[This is part of a series on the tools I used to write my PhD. Check out the other parts here.]

This phrase became a kind of mantra for me during the final write-up of my PhD. Friends and colleagues have since become accustomed to my frequent invocation of this phrase. I wrote up a longish blogpost which stemmed from my frustration at the takeup of primary sources and their use by fellow researchers and analysts in the Afghan context.

With regards to my PhD, I often felt that when I reached a point where I was stuck, the thing that would unstick me was a return to the primary sources. For my specific project, I was lucky to have a rich variety of sources on which to rely. Some may not have this luxury, but for all but the most stalwart of abstract theorists, there is going to be some kind of primary data on which you are basing your research work and writeup.

Thus, whenever you get stuck or you feel your writing starts becoming too self-referential and circular in its logic, go back to the primary sources. I think you'll find this helpful, and you'll return to your writing reinvigorated with new ideas and approaches.

PhD Tools: Freewriting and Journalling to Think Through your Work

[This is part of a series on the tools I used to write my PhD. Check out the other parts here.]

A few years back, I read a book with the (intentionally) provocative title, Write Your Dissertation in 15 Minutes A Day. I was travelling back to Afghanistan from a short stay in Europe, and I was sat in Istanbul airport, waiting for my connecting flight. I remember the moment quite clearly, because a long wait time plus a delay didn't faze me. I was sucked into the book and the idea that the author presented. (There's also another good one along a similar theme: How to Write a Lot: A Practical Guide to Productive Academic Writing by Paul Silvia.)

Basically, she explained how writing for a very short amount of time each day, taking the time to think through whatever was going on with your research on paper instead of in your head, was a trick that would really help your work. It's not a new idea, this technique of freewriting. When you take this time, these 15 or 20 minutes, you aren't writing a section of your thesis itself, you're writing almost a note to yourself about how it is going, what you think are important things you need to consider, whether this is a useful line of inquiry and so on.

Since that day, I've incorporated this kind of writing much more often as a general practice. There's a great service run by all-round make-useful-things-for-everyone-to-benefit-from person Buster Benson called 750Words. It sends you a friendly reminder every day to write 750 words on its site. There's all sorts of gamification and encouragement of writing streaks etc, and while writing the middle sections of my PhD, I would check in to 750words.com every day at the start of the morning to journal out my current research position and think through whatever problems I was about to face in my work that coming day.

It may feel a bit redundant at times, but I've found the practice really useful. Give it a try. You might find that it works for you.

PhD Tools: Pen and Paper

[This is part of a series on the tools I used to write my PhD. Check out the other parts here.]

It's worth also talking in general terms about pen and paper. Readers of this blog would be right in considering me as someone who uses many different digital tools. Yet I am also a firm advocate for the use of paper and pen.

I've written before about my use of a four-color pen. This was one of the more useful discoveries of 2015.

Using pen and paper offers the opportunity for slowing down and thinking in different ways about particular problems. Needless to say, pen and paper as a tool is firmly 'distraction-free', perhaps unless you're someone who likes to doodle.

I like working on problems from different perspectives throughout my attempts to tackle whatever complexities arise. For this reason, I'll spend some time outlining, some time free-writing, some time structuring and restructuring things I've already written, some time talking things through with a third-party, and some time making mindmaps or lists of ideas with pen and paper.

The full handwritten overview of all my PhD chapters, glued to a large white sheet of paper

This cycling through different ways of composition / thinking on paper is something I developed over time, and it was in part a product of my time in Kandahar. Electricity was in limited supply, as was the internet, and some days there would simply be no way to write on a laptop. Sometimes even the laptop wouldn't start because the temperature in our little room on the roof was too hot. So I developed things to do during those downtimes, so that I wasn't completely hampered from working. The interruptions and lack of power were such a prominent feature of life that to allow yourself to be dictated to by them would be to never complete anything.

So I would read books or articles on my Kindle. I would make lists in my notebooks. I would make lists of things to look up when the internet or electricity came back. I would make lists of tasks. I would outline sections of whatever I was writing. I would have focused discussions with Felix about a particular section or issue. Pen and Paper was at the centre of all of this, and I took that on to my life when I returned to places with constant streams of electricity and internet connectivity.

I've actually found that I'm the most useful and productive (in a holistic sense) when I'm in that disconnected mode, without the reliance on the internet to look everything up, and forced to just forge ahead with the hard work of thinking.

A particular model for this was the work of Erich Auerbach and his book Mimesis: The Representation of Reality in Western Literature, which he wrote from Istanbul during the Second World War without access to many sources. As Edward Said explains in his Introduction:

"He explains in the concluding chapter of Mimesis that, even had he wanted to, he could not have made use of the available scholarly resources, first of all because he was in wartime Istanbul when the book was written and no Western research libraries were accessible for him to consult, second because had he been able to use references from the extremely voluminous secondary literature, the material would have swamped him and he would never have written the book. Thus along with the primary texts that he had with him, Auerbach relied mainly on memory and what seems like an infallible interpretive skill for elucidating relationships between books and the world they belonged to."

My hunch is that the limitations on his work process and on his access to sources were among the things that made that book so great.

Pen and paper don't need batteries. So give it a try. Go somewhere new, or somewhere you feel like your energy gets recharged, take a notebook with you and make notes. You can always type them up later on, but for now, just write and think.

PhD Tools: Turn Off the Internet with Freedom

[This is part of a series on the tools I used to write my PhD. Check out the other parts here.]

Freedom does one thing and it does it well: turning off the internet (or parts of it). It removes temptation by giving you a time slot where the internet is turned off (and no way to turn it back on) on both your laptop and your phone. [Note: at the current moment there is no Android version of Freedom, but it's been a long time coming so I imagine that will be released in the near-term future -- a recent twitter query suggested "end of the summer"].

You can run it on an ad hoc basis -- i.e. you decide that you want 30 minutes of 'freedom' starting now, click, and then you've turned off the internet. OR you can pre-schedule those times (my preference), so that you can say: every Monday to Friday, I want the internet turned off from 5am to 12 noon. That time will thus be core time for writing, reading or some other kind of productive work, free from distractions and interruptions.

You can tweak the settings so that you're not turning off the entire internet. You can make your own custom blacklist of sites that you know are kryptonite for you. (RescueTime is a great way of coming up with that list of which sites you're sinking too many hours into, especially when you have a few months of data). I don't particularly like this selective blocking because there's always going to be a new site of some kind or other that I haven't preemptively added to my blacklist. I don't need any access to the internet for my work, actually, so it's easiest to just turn it off completely.

In short, Freedom is great for aligning your goals (i.e. write words for my PhD every day) with a reality in which there are many shiny sites and videos and social media streams to follow. If you can find a way to turn that all off (or down to as minimal a level as possible) you'll get a lot more done and feel better at the same time.

PhD Tools: RescueTime for Time Tracking

[This is part of a series on the tools I used to write my PhD. Check out the other parts here.]

RescueTime is a passive activity tracker for what you do on your laptop (or Android phone -- limitations in Apple's iOS mean that it's not possible to have the same detail in app usage tracking from iPhones and iPads). It sits in the background, watching where you spend your time. You can visit the website to see your stats, or it also sends you a weekly summary of what you did.

I've experimented with various kinds of activity and time trackers in the past, and my experience is that if you have to actively turn it on and off when you start and stop what you're doing, you'll probably forget. Also, that method of time tracking isn't particularly good at noting when you go down a hole of Youtube distraction that one time you have to search for something online. With RescueTime, you can be sure to be delivered an accurate summary of all the ways you are inefficient and wasting your time. (So much shame).

So it's good for tracking the amount of time you're writing (tasks are rated from very unproductive to very productive on a 5-point scale) and it's good for tracking what sites you're visiting during the day. This can be linked up to other sites, like Beeminder, to enforce some kind of time limit. RescueTime also has a version of site blocking where you can say, for example, if I spend over 1 hour on "very distracting time" or "watching videos", block all my internet for the next 3 hours (or something like that). Or you can hook it up to Beeminder and say, as I did: if I'm not writing for 2 hours every day, then take my money. (By 'writing' I mean actually typing and adding words to the page; it knows when you're staring at a page versus actually typing, by the way.)

A lot of your PhD writeup and research will probably be digital-based, so RescueTime is ideal for keeping you honest as to exactly how much work you're getting done. It's easy to have a false sense of all the hours of work you're supposedly doing.

You can keep a little window open somewhere on your screen that shows your real-time 'productivity score' (also compared with the previous day or week). That way, if you're at all competitive, you'll try to beat your own score and try to keep your score high. It may seem stupid, but these little tricks are unfortunately necessary in some cases, particularly when dealing with a long multi-year project. You don't get any marks for having developed a useful workflow that allows you to get your work done, but still, PhDs are as much a test of your ability to carry out this kind of long-term research as they are a test of your specific research skills and argumentative/analytic capability.

One other thing I used from the RescueTime features: internet autoblock first thing in the day. This was before I discovered Freedom App (more later on this), mind, but it was a good substitute. I found that if I somehow managed to hold off from using the internet in the mornings, then my work day would be measurably better (better meaning I actually wrote things and got engaged in the tasks at hand instead of falling down some rabbit hole of distraction, or responding to some "urgent" email). So I set up RescueTime to turn off the internet for 2 hours once my computer had been on and active for 1 minute each day. That way, by the time my laptop had started up and I was ready to do things, the internet was already off. I'll talk more about how it's useful to turn off the internet in a separate whole post on "Deep Work" in a few days. Another good setting: allocating 30 minutes or 1 hour to "very distracting" sites per day. That way you have some leeway to waste time, but not enough that it's going to markedly ruin your ability to work that day.

RescueTime is free for most of the features I described above. Anyone working as a writer/academic etc of some kind ought probably to have it installed, I think, if only to be more aware of how they're spending their time. Go try it out! It's free so you have no excuse!

PhD Tools: Mellel for Layout and Final Presentation

[This is part of a series on the tools I used to write my PhD. Check out the other parts here.]

Mellel is what I use for the final formatting of documents. It might be overkill for some, but in the case of my PhD, the extra features really saved me some time and headaches.

At first glance, there isn't much to distinguish it from something like Pages.app or Microsoft Word. Mellel is a word processor. It allows you to format how the text is presented on the page. The level of control over those decisions is what distinguishes Mellel over the free/default alternatives.

For example, styling formatting for certain types of text is easy in Mellel. Want to change the way all headings of a certain level are formatted? Mellel can do this. Want to manage the formatting of Arabic, Pashto and Dari text without worrying that things will come out the wrong direction? Mellel is designed to handle these right-to-left languages and scripts. Want to do things with bibliography formatting and scanning? Mellel plays well with Bookends and the other reference managers. Similarly with things like your Table of Contents: Mellel handles it all with style (literally!).

An alternative to Mellel is Nisus Writer Pro. As far as I can make out there isn't that much difference between the two. Mellel also has a version for iPad so you can work on documents on the go as well.

PhD Tools: Bookends for Managing References

[This is part of a series on the tools I used to write my PhD. Check out the other parts here.]

My PhD included references to 479 individual sources. It's well known that formatting issues often plague students just prior to submission of their dissertations. A reference manager can help solve most of these problems.

When I began my PhD, I was using Sente, a Mac-only programme, but towards the end I transitioned to Bookends. There's no particular reason for the change, other than that Bookends has a slightly sparer design.

Different journals and universities require different formatting of references and sources. Bookends (or whatever you choose) is an easy way to stay on top of these formatting issues.

It connects easily (via a shortcut) to Scrivener or many other word processing tools that are commonly used. If you have many references like me, you can colour code them to make it easy to see what's what at a glance (see the image above for part of the database I used for my PhD).

I don't, however, use Bookends as a repository for PDFs and documents. You can do this, technically, but it's not ideal. You're far better off keeping your reference manager for what it does well, and then having a separate file system for your PDFs and other documents (like DevonThink, for example).

PhD Tools: Scrivener for Writing Long Things

[This is part of a series on the tools I used to write my PhD. Check out the other parts here.]

I spent several years with this particular file...

Scrivener is the go-to tool for anyone working on longer structured pieces of fiction or non-fiction. It's great for structuring your work as well as the writing itself.

When you write a PhD, it's important to keep word counts in mind from the beginning, otherwise you'll be left with hundreds of thousands of words and only 80,000 that you're permitted to submit to the university and your examining committee. Scrivener allows you to manage the word counts of individual sections and their sub-sections (see the image above). It offers a variety of ways of displaying these word counts, setting goals and generally staying on top of this important metric. Of course, PhDs are more than just the number of words you manage to type, but I've met enough people who wrote too much to know that this is a common problem.

Scrivener also excels at structuring texts. You have 80,000 or maybe 100,000 words to write, so you split it up into chapters, but then those chapters must be split into chunks of roughly 500-1000 words as well. You can do this structuring using a corkboard-style visual interface (that I never use much and don't particularly like, but am fully willing to concede that some people do like and use it) or a more standard outline tool.

(Note, too, that there are 1001 other bells and whistles that come along with these core functions. It's highly customisable and adaptable to your specific needs. You can tag, show selective views of your text etc etc to your heart's content. There is also an iOS version for your iPhone / iPad that some people who are more mobile might find useful).

Another thing that PhDs seem to involve is references and footnotes. Scrivener works beautifully together with the major bibliographical reference managers (Bookends, Sente etc) so you can rest assured that you won't have any trouble there.

Finally, it's easy to get things out of Scrivener, when the time comes. Sometimes you just want a copy of a single chapter to show to your supervisor, minus incomplete footnotes and in-text notes or annotations to yourself. Such a custom export is easy to set up. Similarly, when you're finished with the drafting and want to work on the presentation (more on Mellel in a separate post) somewhere else, it's easy to export exactly as you want.

PhD Tools: DevonThink for File Storage and Discovery

[This is part of a series on the tools I used to write my PhD. Check out the other parts here.]

Discovering similar notes in one of my DevonThink databases

I first heard about DevonThink in the same breath as Tinderbox. They go together, though they serve different purposes. Some people want to make an either/or decision about which to use. I see them as sufficiently different to assess them on their own merits and as per your usage scenario.

As with all tools, you should come to the decision table with a set of features that you're looking for. Don't just shop around for new things for the sake of newness or for the sake of having a really great set of tools. These programmes are not cheap. Luckily almost all of them come with generous trial versions or periods, but I don't recommend 'newness' as a feature of any particular merit.

DevonThink (I use the Pro Office version) is a place to store your files and notes. It can, I think, take any file you can throw at it. It comes with software for processing PDFs into fully-searchable documents (OCR software, in other words) which is part of the reason why the license for the Pro Office version of the programme is so expensive.

If you're anything like me, you're drowning in PDF documents. They all come with helpful names like "afghanistan_final_report_02_16.pdf" and unless you have a rigorous file hierarchy and sorting system, you'll probably be unable to find one file or the other. And using the basic file hierarchy system for storage doesn't help you with situations like when you want to store the same file in multiple folders (e.g. what if a report is about both Afghanistan and Tunisia?). (DevonThink has a feature which allows you to store the files in multiple locations, but without saving two copies of the file. Any changes or annotations you make in one file will automatically be transferred to the other).

You might ask yourself why you would need DevonThink and Tinderbox (see this post for more). The short answer is that they store different kinds of files/data, and that DevonThink is less about thinking than about storage (to a certain extent) and discovery.

One of the key features of DevonThink Pro Office is its smart searching algorithms, its ability to suggest similar texts based on the contents of what you are looking at, etc. It does this by means of a proprietary algorithm, so I can't really tell you how it works, but just know that it does. It works best on smaller chunks of text. In this way, I was reading through a particular source from the 3 million-word-strong Taliban Sources Project database and then I clicked the "See also" button and it had found a source I would never otherwise have read on the same topic, even though it didn't even use one of the keywords I would have used to search for it. It uses semantic webs of words to figure this stuff out. Anyway, beyond a certain database size, this power becomes really useful. It can also archive websites, store anything including text, do in-text searches on e-books etc etc. (Read more on how I use DevonThink for research in general here.)
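Since the real algorithm is proprietary, the following is emphatically not how DevonThink works. It is only a sketch of one generic, well-known way to rank "similar" texts: count the words in each note and compare the counts with cosine similarity. The function names and sample notes are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lower-cased words; a crude stand-in for real text analysis."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(counts_a, counts_b):
    """Score from 0 (no shared words) to 1 (same word proportions)."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def see_also(current_text, notes, top_n=3):
    """Return titles of the notes most similar to `current_text`."""
    target = word_counts(current_text)
    scored = [
        (cosine_similarity(target, word_counts(text)), title)
        for title, text in notes.items()
    ]
    scored.sort(reverse=True)
    return [title for score, title in scored[:top_n] if score > 0]


notes = {
    "a": "taliban court rulings in kandahar",
    "b": "recipe for bread and butter",
    "c": "court rulings and judges",
}
print(see_also("rulings by a taliban court", notes))
```

Because the score depends on all the words two texts share, a note can rank highly even if it lacks the one keyword you would have typed into a search box.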

I also used it a little as an archive for substantive drafts / iterations of the writeup process. That's another important part of the process: making backups of many different kinds. I never found any use for them, but at least they were there (just in case).

If you're a data and document hoarder at heart, like me, you'll soon have a DevonThink database (or several databases, split up by topic) that is bigger than you can fully comprehend, and whose contents you can no longer remember. At that point, search becomes really important. Not just a straightforward search: the ability to put 'fuzzy' terms (i.e. if you search for "Afghanistan" it'll also find instances where it's incorrectly spelt "Afgahistan") and boolean operators into your query is really powerful. DevonThink is an amazing search tool. The company that developed the database software also makes something called DevonAgent, which is basically a power-user search tool for the internet. Google on steroids, if you will. Fully customisable, scriptable... you can really go crazy with this stuff. I use it, but my PhD wasn't really about searching things on the internet, so I didn't use it much for my research or writeup. But it's a great tool, too.

In short, DevonThink is a research database tool that will help you store and find the documents that relate to your research, and do smart things to help you find sources and texts that maybe you'd forgotten you'd saved. Highly recommended for anyone working with large numbers of documents.