
Telling Cats from Dogs

One of the main ways that training models with neural networks differs from traditional (imperative) programming can be illustrated with a specific task: say you want a computer to tell you whether a given photo shows a cat or a dog.

An imperative approach might be to draw up a mega list of the features that distinguish cats from dogs and try to encode those differences as explicit rules. But even telling the computer how it should recognise those features is a potentially massive project. How would you even go about that?
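To make the contrast concrete, here's a purely hypothetical sketch of what those hand-written rules might look like; every helper function in it is imaginary, which is rather the point:

    # Purely illustrative: a rule-based classifier. None of these
    # feature-detection helpers exist, and writing any one of them
    # (how do you detect pointed ears in pixels?) is itself a huge project.
    def classify(photo):
        if has_pointed_ears(photo) and has_whiskers(photo):
            if has_vertical_pupils(photo):
                return "cat"
        if has_floppy_ears(photo) or is_panting(photo):
            return "dog"
        # ...hundreds more rules, each with its own exceptions
        return "unknown"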

Instead, with neural networks, we turn that pattern on its head. Instead of this:

[Diagram: the traditional pattern, where a hand-written program turns inputs into results.]

We have something closer to this:

[Diagram: the machine-learning pattern, where the model learns its behaviour from inputs plus labelled examples.]

So the neural network learns from the data you provide: you give it a bunch of images which you've pre-labelled to say 'this one is a cat and that one is a dog' and so on.

If you use transfer learning, too, you can even start from a pretrained model (one that's already pretty good at recognising features in images) and fine-tune it to get really good at the specific task you need it to do. (That's exactly what you do in the first chapter of Howard and Gugger's Deep Learning for Coders.)
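In fastai (the library the book uses), that whole fine-tuning workflow fits in a handful of lines. Here's a sketch along the lines of the book's first-chapter example, using the Oxford-IIIT Pets dataset; exact API names may differ between fastai versions:

    # A sketch of transfer learning with fastai, close to the book's
    # first-chapter example; assumes a recent fastai version.
    from fastai.vision.all import *

    # Download the Oxford-IIIT Pets dataset
    path = untar_data(URLs.PETS) / 'images'

    # In this dataset, cat images have capitalised filenames
    def is_cat(filename):
        return filename[0].isupper()

    dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path), valid_pct=0.2, seed=42,
        label_func=is_cat, item_tfms=Resize(224))

    # Start from a ResNet pretrained on ImageNet, then fine-tune it
    # on the cat/dog labels for one epoch
    learn = vision_learner(dls, resnet34, metrics=error_rate)
    learn.fine_tune(1)

That's the whole thing: a pretrained model, your labelled images, and one fine-tuning pass.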

Removing Barriers: Deep Learning Edition

I've been re-reading Jeremy Howard & Sylvain Gugger's Deep Learning for Coders with fastai and PyTorch and I really appreciate the reminder that a lot of barriers to entry into the Deep Learning space can be productively put to one side.

Gatekeepers make four big claims:

  1. You need lots of maths to use Deep Learning to solve problems
  2. You need lots of data (think prodigious, Google-sized quantities) to use Deep Learning
  3. You need lots of expensive computers and custom hardware to use Deep Learning
  4. You need a PhD, preferably in Maths or Physics or some computation-heavy science

Needless to say, it's not that maths or more data or better hardware won't help or improve your experience. But saying that you shouldn't start without those things is inaccurate and unhelpful.

If you are a domain expert in something that has nothing to do with Deep Learning or data science, you probably have a lot of low-hanging-fruit problems that powerful techniques like Deep Learning could solve.

Tabula for extracting table data from PDFs

Have you ever come across a PDF filled with useful data and wanted to play around with that data yourself? In the past, when I had that problem, I'd type the table out manually. This has some disadvantages:

  • it is extremely boring
  • mistakes are likely, especially if the table is long and extends over several pages
  • it takes a long time

I recently discovered a tool that solves this problem: Tabula. It works on Windows and Mac and is very easy and intuitive to use. Simply take your page of data:

A page listing Kandahar's provincial council election polling stations from a few years back. Note the use of English and Dari scripts. Tabula handles all this without problems.

Then import the file into Tabula's web interface. It's surprisingly good at autodetecting where tables and table borders are, but you can do it manually if need be:

[Screenshot: Tabula's web interface auto-detecting the table boundaries.]

Then check that the data has been correctly scraped and select a format for export (CSV, JSON, and so on):

[Screenshot: previewing the extracted data and choosing an export format.]

And there you have it, all your data in a CSV file ready for use in R or Python or just a simple Excel spreadsheet:

[Screenshot: the extracted data open in a spreadsheet.]
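If you'd rather script the extraction than click through the web interface, the same engine is also available via a Python wrapper, tabula-py (a separate install that needs Java). A minimal sketch, with hypothetical filenames:

    # Minimal sketch using tabula-py (pip install tabula-py; needs Java).
    # The filenames are hypothetical.
    import tabula

    # Read every table Tabula detects into a list of pandas DataFrames
    tables = tabula.read_pdf("polling_stations.pdf", pages="all")
    print(tables[0].head())

    # Or convert straight to CSV, mirroring the export step above
    tabula.convert_into("polling_stations.pdf", "polling_stations.csv",
                        output_format="csv", pages="all")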

Note that even though the interface runs through a browser, none of your data touches external servers. All the processing and extraction of data from PDFs happens on your computer; nothing is sent to cloud servers. This is a really nice feature and I'm glad they wrote the software this way.

I haven't had any problems using Tabula so far. It's a great time saver. Highly recommended.

Pet Peeve: Tech Switching

I read a decent amount of tech media/press. Barely a day goes by without someone in my RSS feed explaining how they dropped application X for application Y. This seems to happen most often for frequently-used applications or workflows like scheduling/calendars or email.

I won't call out the specific blog post that set me writing this one, but suffice it to say that I wish there were a clause (in the contract of life) forcing tech writers and bloggers to state why the application they're singing the praises of is better than the one they were using up to now. Specifically, are there any new features, or does it just look shinier? And have you been using it for longer than a day or two?

I'm pretty solid and stable in the applications I use. It'll take something pretty seismic to pry me away from DevonThink or Tinderbox or MailMate. But if you catch me flip-flopping in my tech-related writing, please call me out on it.

DevonThink Resurgent

There has never been a better time to get into DevonThink and Tinderbox. Winterfest 2016 is on, and you can get 25% reductions on both those apps, as well as a number of other really useful pieces of software like Scrivener, TaskPaper, Bookends, Scapple and PDFPen.

If you’re unsure whether DevonThink is something you’d be interested in, they offer a 150-hours-of-use free trial for all their different apps. The Mac Power Users podcast just released a useful overview of the current state of the app, an interview with Stuart Ingram. ScreenCastsOnline also published the first part of a trilogy of video tutorials on DevonThink.

If you’re a Mac user who is perhaps uncomfortable with Evernote’s privacy policies or just seeking to get more out of the data you’ve stored on your hard drive, give DevonThink a try.

Highlights + DevonThink = Pretty Great

I’m late to the Highlights party, but I’m glad I got here.

Like many readers of this blog, I get sent (and occasionally read) a lot of PDFs. In fact, I did a quick search in DevonThink, and I am informed that I have 52,244 PDFs in my library. These are a mix of reports, archived copies of websites, scanned-and-OCRed photos and a thousand-and-one things in between.

Thus far, my workflow has been to read PDFs on my Mac. Any notes I took while reading the file were written up manually in separate files. I would laboriously copy and paste whatever text snippet or quotation I wanted to preserve along with its page reference. These would be fed into DevonThink’s AI engine and magic would happen.

Now, post-Highlights-installation, my workflow is much less laborious. I can take highlights in-app, export all the quotations as separate text or HTML files, and have DevonThink go do its thing without all the intermediary hassle. If you’re a professional researcher or writer using DevonThink as your notes database (and quite frankly, if not, why not?) the Highlights app will probably please you.