Data

On the interpretability of models

May 28, 2021 in Deep Learning, Tech, Science

A common criticism of deep learning models is that they are 'black boxes'. You put data in one end as your inputs, the argument goes, and you get some predictions or results out the other end, but you have no idea why the model gave you those predictions.

Ways of interpreting learning in computer vision models - credit https://thedatascientist.com/what-deep-learning-is-and-isnt/

This has something to do with how neural networks work: you often have many layers that are busy with the 'learning', and each successive layer may be able to interpret or recognise more features or greater levels of abstraction. In the above image, you can get a sense of how the earlier layers (on the left) are learning basic contour features and then these get abstracted together in more general face features and so on.

Some of this also has to do with the fact that when you train your model, you do so assuming that the model will be used on data that the model hasn't seen. In this (common) use case, it becomes a bit harder to say exactly why a certain prediction was made, though there are a lot of ways we can start to open up the black box.

Diagnosing Diabetes with Weka & Machine Learning

June 18, 2018 in Coding, Science, Tech

[I mentioned two weeks ago that I was working to dive into the practical uses of machine learning algorithms. This is the first of a series of posts where I show what I’ve been working on.]

The Pima Indians dataset is well-known among beginners to machine learning because it is a binary classification problem and has nice, clean data. The simplicity made it an attractive option. In what follows I’ll be mostly following a process outlined by Jason Brownlee on his blog.

The Pima Indian population are based near Phoenix, Arizona (USA). They have been heavily studied since 1965 on account of high rates of diabetes. This dataset contains measurements for 768 female subjects, all aged 21 years and above. The attributes are as follows, and I list them here since they weren’t explicitly stated in the version of the data that came with Weka and I only found them after a bit of digging online:

preg - the number of times the subject had been pregnant
plas - the concentration of blood plasma glucose (two hours after drinking a glucose solution)
pres - diastolic blood pressure in mmHg
skin - triceps skin fold thickness in mm
insu - serum insulin (two hours after drinking glucose solution)
mass - body mass index (weight in kg / (height in m)**2)
pedi - ‘diabetes pedigree function’ (a measurement I didn’t quite understand but it relates to the extent to which an individual has some kind of hereditary or genetic risk of diabetes higher than the norm)
age - in years

This video gives a bit of helpful context to the data and the test subjects:

https://www.youtube.com/watch?v=pN4HqWRybwk

I also came across a book by David H. DeJong called “Stealing the Gila: The Pima Agricultural Economy and Water Deprivation, 1848-1921” which describes how the diverting of water and other policies “reduced [the Pima] to cycles of poverty, their lives destroyed by greed and disrespect for the law, as well as legal decisions made for personal gain.” It looks like a really interesting read.

The Problem

The idea with this data set is to take the attributes listed above, combine them with the labelling (i.e. we know who has been diagnosed with diabetes and who hasn’t) and figure out the pattern as much as we can. Can we figure out if someone is likely to have diabetes just by taking a few of these measurements?

The promise of machine learning and other related statistical tools is that we can learn from the data that we have to make testing more useful. Perhaps we only need your height, genetic risk factor and skin thickness to make such a prediction? (Unlikely, but still, perhaps…). If we emerge from our study with a statistical model, how well does it perform? How much can we generalise from the data? What would be an acceptable accuracy in the medical context? Is it 80% or is it 99.99%? The former would save millions of dollars in test costs but would throw lots of errors; the latter would be highly accurate but it might be expensive to calculate the model.

The use case for this specific case would maybe be to identify at-risk individuals who are on the way to a diagnosis of diabetes and intervene somehow. Our motivation here is clear: people don’t want to be diabetic, so how early can we catch this transition? It would save governments money, expose fewer people to unnecessary tests and improve their quality of life.

I’m not a doctor, but to solve this problem manually would seem to require monitoring of blood tests (glucose and insulin levels), perhaps looking at exercise and diet, and also weight. At scale across the population of an entire country, for example, this seems like it might get expensive and/or too much for one person to process in their head. The data isn’t too large or complex, but it still seems like something you’d want to automate to some extent.

There are some potential ethical issues around the data. Everything offered as part of the table of data is anonymised, but there are some outliers (see below) that I have to believe wouldn’t be too hard to find. Whatever model comes from this data will likely have only limited applicability, since the data is drawn only from women. I also noticed that while the data is no longer available on the UCI Machine Learning Repository website, it still comes packaged with Weka. There was a notice on the UCI site (which I can no longer seem to be able to locate) stating that the permission to host the data had expired. It is unclear to me what’s going on with the permission there.

Data Preparation

Exploring the data using Weka’s explorer tool plus the attribute list above we can see that we have some blood test data, some non-blood body measurements and this genetic marker (presumably achieved through either blood tests or interview questions about family history). As I was working to understand the various attributes, it occurred to me that for this to be really useful, we’d want our model to work on data that wasn’t derived from blood tests; they’re expensive and they’re invasive. I didn’t get round to doing that for this round of exploration but it’d be high up on my wishlist next time I return to this data.

There are only 768 instances, so it’s still quite a small data set, especially in the context of machine learning examples. This is probably explained by the fact that it’s real medical data (so there are consent issues) plus the fact that it is several decades old and the processing power available then didn’t lend itself to processing mega-huge sets.

Thinking about what attributes might be removed to make a simpler model, I first thought that maybe the number-of-pregnancies might be dispensable, but then I thought about the number of hormonal and other changes that happen and I guess actually it is probably quite important.

There were some outliers in the data that I identified as needing further consideration / processing before we get our model trained:

There were some women who had been pregnant 16 or 17 times. They were on the far edge of the long tail, but I ended up leaving them in for the model rather than deleting them completely.
There were 5 people who had 0 as their result for ‘plas’, which seems to be an error. I decided to remove these.
There were 35 people who had 0 as their blood pressure, which seems to be an error.
There were 227 people with 0mm skin thickness. This is possible, but I think it’s more likely that no measurement was taken, at least for a lot of them.
There were 11 people who are listed with a body mass index of 0. That seems to be an error.

After I’d identified these various outliers I decided to make a series of transformations to the whole set. From this I’d emerge with three broad versions of the data:

the baseline dataset, with nothing removed or changed
the outliers removed completely and replaced with NaN values
the outliers replaced with mean averages for each particular attribute
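For anyone who would rather script these versions than click through them, here is a minimal Python sketch of the second and third variants. It assumes that a 0 means "no measurement was taken", and the readings are made up for illustration rather than taken from the real dataset.

```python
import math
from statistics import mean

def zeros_to_nan(values):
    """Treat 0 as 'no measurement' and mark it as NaN."""
    return [math.nan if v == 0 else float(v) for v in values]

def zeros_to_mean(values):
    """Replace 0 with the mean of the non-zero measurements."""
    observed = [v for v in values if v != 0]
    fill = float(mean(observed))
    return [fill if v == 0 else float(v) for v in values]

# Hypothetical blood-pressure readings, not taken from the real dataset.
pres = [72, 0, 64, 80, 0]

print(zeros_to_nan(pres))   # the two zeros become nan
print(zeros_to_mean(pres))  # the two zeros become the mean of 72, 64 and 80
```

The mean is calculated from the non-zero readings only, so the placeholder zeros don't drag the replacement value down.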

For each of these broad versions, moreover, I prepared three separate versions:

all values normalised (ranges and values for all attributes transformed to being from 0-1 instead of being in their original ranges. i.e. maximum weight as 1 and minimum weight as 0 etc)
all values standardised (set the mean for the data as being zero and the standard deviation to 1)
all values normalised and standardised (i.e. both transformations applied)
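The first two transformations are simple enough to sketch in a few lines of Python. This is an illustration of the idea rather than what Weka's filters do internally. Using the population standard deviation is my assumption, and the ages are invented.

```python
from statistics import mean, pstdev

def normalise(values):
    """Rescale so the minimum maps to 0 and the maximum maps to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardise(values):
    """Shift to a mean of 0 and scale to a standard deviation of 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return [(v - mu) / sigma for v in values]

# Hypothetical ages, not taken from the real dataset.
ages = [21, 30, 45, 81]

print(normalise(ages))
print(standardise(ages))
```

Both functions would divide by zero on an attribute where every value is identical, so a real script would need to guard against that case.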

Producing these various versions of the data was something I learned from Brownlee’s book, “Machine Learning Mastery With Weka”. It turned out to be somewhat fiddly to do in Weka. In particular, every time you want to open up a file to apply transformations the default folder it remembers is often several folders down in the folder hierarchy. By the ninth transformation (there were nine sets in total, by the end of this process) I was ready for a more functional / automated approach to these data conversions!

Weka does offer some nice tools for the initial exploration of the data. Here you can see two charts that are generated in the ‘explorer’ application. First we have a series of simple bar charts visualising all the individual attributes. Then we have a plot matrix showing how all the various attributes correlate to each other (or not, as was mostly the case for this data set).

Visualisation of Pima Indians dataset attributes (auto-generated in Weka)

Plot matrix showing visualisation of correlations between all attributes (auto-generated using Weka)

Choosing Algorithms and Training the Model

Given that I’m very much at the beginning of my machine learning journey, I don’t have any strong sense of which algorithms might be more appropriate or not for this particular data set. I knew that this is a classification problem and not regression (i.e. we’re trying to decide whether people have diabetes or not — two categories — instead of predicting where people fall on a scale / spectrum) so that ruled out a few options, but really the field was wide open.

Jason Brownlee advises taking a sample of around ten different algorithm families to get an initial sense of whether there are any clear outliers (either over-performers or under-performers). Once I have a better sense of the overall space, I can then tweak things, or double down on a particular algorithm family to select a more limited feature set perhaps.

For this algorithmic spot-check I chose 12 algorithms, sampling from all the main families as I currently understand them. Running this set, I immediately came across an error message: Weka was telling me that the functions.LinearRegression algorithm doesn’t work for classification problems. I removed that and reran the tests.

When doing this kind of test, it helps to have a baseline accuracy figure against which you can compare how much these fancy algorithms are improving predictions. In Weka, this is called the ZeroR algorithm: it simply predicts the most common class for every instance, which here means saying that nobody has diabetes. For this dataset, it got 65.11% accuracy, which isn’t bad all things considered!
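The baseline idea is easy to reproduce by hand. The sketch below uses the class counts commonly reported for this dataset (500 negative, 268 positive). The label names are my assumption about how the classes are spelled in the file.

```python
from collections import Counter

def zero_r(labels):
    """Return the majority class and the accuracy of always predicting it."""
    majority, count = Counter(labels).most_common(1)[0]
    return majority, count / len(labels)

# Class counts as commonly reported for this dataset; label names are assumed.
labels = ["tested_negative"] * 500 + ["tested_positive"] * 268

majority, accuracy = zero_r(labels)
print(majority, f"{accuracy:.2%}")
```

Run on the whole dataset at once, this gives a figure a hair under the one Weka reported. I presume the small difference comes from Weka averaging over cross-validation folds rather than scoring everything in one go.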

(Note that everything here is being run through k-fold cross-validation where training and test data are kept separately, and then this is repeated ten times. The final results are averaged out between them. Weka does this all with great ease, making it pleasant to conform to best practices when it comes to data science).
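To make the splitting concrete, here is a rough Python sketch of how ten folds could be carved out of 768 instances. It is a plain shuffled split. As far as I know Weka also stratifies the folds by class, which this sketch does not do. The function name and seed are just for illustration.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(768, k=10))
for train, test in splits:
    print(len(train), len(test))
```

Each instance lands in the test set exactly once, and the model never sees its test fold during training. Repeating the whole thing ten times would just mean calling the function with ten different seeds and averaging the scores.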

This figure shows how logistic regression was the best performing algorithm out of the box at 77.47% accuracy. I read somewhere that it often performs well on binary classification problems, so this didn’t surprise me. Support Vector Machines (listed as SMO in the Weka GUI) are also supposedly quite good for binary problems and it was only two-thirds of a percent behind logistic regression. Using Weka’s tools for statistical analysis of the result, I came to the conclusion that LMT, logistic regression, SGD and SMO were all worth further exploration and tinkering.

For example, I tried the following with the Support Vector Machines algorithm:

tweaking the value of c (complexity) to see if 0.25 performed better than 0.75, for example. It turned out that 0.5 was the sweet spot for the c value.
trying different kernels - I tried most of the options listed in Weka and they all performed pretty poorly. In particular, RBF (radial basis) was really poor.

None of my tweaks really seemed to improve the accuracy of the model. I imagine that some of the algorithms function better with more data, but I am not in a position to generate more.

The next step was to try some ensemble methods where the predictions made by multiple models are combined. In particular, bagging/bootstrap, boosting and voting were all recommended as options to try out.

You can see here that ultimately none of those outperformed logistic regression, which was surprising to me. I’m not at a place where my statistical understanding can explain why that’s the case — ensemble methods seemed to offer the best of many worlds — but I can’t really argue against the results.

Finally, I tried the MultiLayerPerceptron to throw one possible implementation of Deep Learning at the problem. This performed pretty poorly as per default configuration.

Findings / Conclusions

The best accuracy I was able to achieve on this data set was using a logistic regression model. This performed with 77.47% accuracy (standard deviation of 4.39%). Taking two standard deviations either side of the mean, we can restate this as an accuracy of between 68.69% and 86.25% on unseen data. This is slightly disappointing since it isn’t that much better than the ZeroR algorithm.
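The range comes from taking the mean accuracy plus or minus two standard deviations. That is my reading of how the figures were derived, and it assumes the fold results are roughly normally distributed. The arithmetic is:

```python
mean_accuracy = 77.47  # percent, logistic regression
std_dev = 4.39         # percent

# Roughly 95% of results fall within two standard deviations of the mean,
# assuming they are approximately normally distributed.
lower = mean_accuracy - 2 * std_dev
upper = mean_accuracy + 2 * std_dev

print(f"{lower:.2f}% to {upper:.2f}%")
```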

Towards the latter stages of my work on this problem, I came across a blog post by someone who used a neural network to reach results of 95% accuracy on this same data set, showing that there are models that bring dramatically improved performance. I don’t understand neural networks enough to be able to evaluate what he did (i.e. to know whether this is simply overfitting or actually a performant / real improvement on my results). Nevertheless, it seems like a significant improvement.

As my first big push to work on a real data set using machine learning tools, this process was instructional in the following ways:

Weka is easy to use and it makes some of the best practices in data science no-brainers to implement
Constructing the various data sets, implementing the experiments to compare the algorithms and so on was made slightly tedious by the GUI. If I wanted to run through many more variations it would have been prohibitively tiresome to have to manually click through all the options.
Weka is slow (or maybe my linux laptop is slow). Some of the algorithm sets I tried (Support Vector Machines, for example, or ensemble methods using SVMs) took 20+ minutes to run. The data set wasn’t huge at all, so I have to imagine that a real ‘big data’ set would make this kind of quick incremental exploration and iteration difficult to practice. Weka is, of course, a Java app and I’m running that on my Mac. I suspect that if I were to run similar algorithms through Python (or even better, C) on my Mac I’d get significant performance improvements.
I have very little sense of the variation between the various algorithms, what each one does and where the strengths and weaknesses lie. I want to tackle this from two directions: improving my baseline understanding of statistics and also just getting more experience implementing them for practical problems such as this one.

The next problem I want to tackle is that of the UCI soybean dataset. Each instance describes properties of a crop of soybeans and the task is to predict which of the 19 diseases the crop suffers from. Again, the dataset isn’t huge but it is a multi-class classification problem so there are new challenges to be tackled there.

Tabula for extracting table data from PDFs

January 17, 2018 in Afghanistan, Coding, Productivity, Tech, Useful Tools

Have you ever come across a PDF filled with useful data, but wanted to play around with that data yourself? In the past if I had that problem, I'd type the table out manually. This has some disadvantages:

it is extremely boring
it's likely that mistakes will get made, especially if the table is long and extends over several pages
it takes a long time

I recently discovered a tool that solves this problem: Tabula. It works on Windows and Mac and is very easy and intuitive to use. Simply take your page of data:

A page listing Kandahar's provincial council election polling stations from a few years back. Note the use of English and Dari scripts. Tabula handles all this without problems.

Then import the file into Tabula's web interface. It's surprisingly good at autodetecting where tables and table borders are, but you can do it manually if need be:

Then check that the data has been correctly scraped, select formats for export (from CSV to JSON etc):

And there you have it, all your data in a CSV file ready for use in R or Python or just a simple Excel spreadsheet:

Note that even though the interface runs through a browser, none of your data touches external servers. All the processing and stripping of data from PDFs is done on your computer, and isn't sent for processing to cloud servers. This is a really nice feature and I'm glad they wrote the software this way.

I haven't had any problems using Tabula so far. It's a great time saver. Highly recommended.

Trello to Markdown: a Chrome extension

January 17, 2017 in Useful Tools

Turns out, whenever you need something on the internet, someone else has already made it.

I was looking around for a way to get several hundred Trello notes (and their descriptions) into a format where I could work on the texts offline. (Trello doesn't have an offline mode.)

I found this excellent extension (made by 'Trapias') which allows you to get your data out of Trello. Click here for the Chrome extension itself and here for the source code over on Github. You can export to HTML, Markdown, Excel and OPML. This is a great set of options, and there are all sorts of advanced selections you can make.

To use:

(1) Go to the board menu -> more -> print and export

(2) a new button will appear called TrelloExport. (Previously there would just have been a JSON export option).

10 podcasts to learn about data science and programming

January 11, 2017 in Coding, Podcast, Useful Tools

When learning a new skill or indulging a new interest, I like to find out who is podcasting about that thing. I do this early on in the process since it is hard to get a sense of the full landscape of a particular skill or issue without reading widely. The internet, of course, is a great way to find out about all the different nooks and crannies of a particular community. That doesn't help me much when I'm cooking, though, or when I'm walking in town. Podcasts are an opportunity to broaden my exposure to those topics while doing something else, usually something physical.

Here are some of my favourite podcasts relating to data science and programming:

Becoming a Data Scientist Podcast

Hosted by Renee Teate, this was my first exposure to smart data professionals talking about their work. I think the podcast is meant to be modelled on the book Data Scientists at Work. (I've read the book and thoroughly enjoyed it, for much the same reasons as I enjoyed this podcast). I'd recommend going back to the first episode and listening to them all. You can get a good overview from listening to just these episodes. Renee has recently resumed podcasting for season two and she also runs a data-science resource online empire that is filled with useful materials.

CodeNewbie

I'm working my way through all of these, listening through from the very beginning (over 120 of them). At the beginning, I find it's useful to hear from people who started from nothing and who found a way to use code in their life or career. Not all of the guests are equally interesting, but the podcast is well produced and there's still loads of useful information buried in the overall corpus.

Learn to Code With Me

This podcast is in its second season. Just like CodeNewbie, the podcasts tend to focus on the journeys of people who have started from nothing. The guests are variable in terms of how interesting I find them, but I enjoy it nonetheless.

Partially Derivative

This is one of the more popular data science podcasts out there at the moment. Episodes range from discussion panels to interviews to deep-dives on a particular topic. This is great for keeping up with the buzz or gossip in the data science world and they often have pretty senior guests from the world of data science and business.

Start Here: Web Development

This podcast also has a Ruby/Rails sidekick podcast. I've just started the Web Development series and am finding it a useful overview. I'm not coming at this all as a complete beginner, so I'm not sure of the extent to which it'd be useful if that's you, but each episode covers a particular area of programming and web development. Episodes end with an assignment or set of tasks to do to practice or learn whatever they were discussing. This is a nice combination of approaches and as a quick revision of some web development fundamentals, this is a really good place to start.

Talk Python To Me

Python is the language that I've studied and coded most with, so this podcast allows me some contact with people who are light-years ahead of me in terms of their skills. It also introduces frameworks and personalities that are important within the Python community, so if Python's your thing, this is probably a useful podcast to listen to.

Data Skeptic

The contents of this podcast fairly frequently go right over my head, but it's good to be exposed to the ideas being discussed. There are interviews with people about particular issues and mini-dives explaining a particular data science feature or area.

Full Stack Radio

This is less content-rich than some of the other podcasts in this list, and there's sometimes too much focus on the marketing/business side of things, but it's nevertheless an enjoyable set of discussions about web development and programming in general. Guests often speak about meta-issues relating to productivity, management and so on.

Greater Than Code

This is one of my favourites. It's quite a new podcast, with only 14 episodes as of writing, but the hosts take care to bring in a diverse range of guests. In particular, it's a breath of fresh air to have lots of female coders on, since most of the 'successful' coding podcasts tend to be heavily dominated by men and their male guests.

O'Reilly Data Show

This podcast includes a lot of the latest technologies, big trends and big-name guests. Some episodes are too vague and unspecific, but as a big name in the data science podcast crowd, this is a fairly good place to go for the orthodoxy of many institutions and individuals involved in the space.

Please let me know if I'm missing any good podcasts that would be appropriate for someone at the beginning end of their programming journey.

Talking DevonThink with Gabe Weatherhead

December 26, 2016 in Podcast, Tech, Useful Tools

I’ve been on a bit of a DevonThink kick these past weeks, and the catalyst for all of this was a conversation I had with Gabe Weatherhead (@macdrifter over on Twitter, though that account is no longer active).

You can listen to the full episode on your podcast player of choice or over on the Sources and Methods site. Towards the end of the episode we get into the weeds on how he uses DevonThink Pro Office and several other pieces of software. I’m looking forward to hearing Gabe’s much-anticipated appearance on MacPowerUsers in January, since I imagine he’ll go into even more detail there.

We also discussed social media and some of the ways he found himself drifting away from commonly-used sites like Facebook and Twitter. For me, this was the most interesting part of the podcast.

DevonThink Resurgent

December 19, 2016 in Useful Tools, Tech

There has never been a better time to get into DevonThink and Tinderbox. Winterfest 2016 is on, and you can get 25% reductions on both those apps, as well as a number of other really useful pieces of software like Scrivener, TaskPaper, Bookends, Scapple and PDFPen.

If you’re unsure if DevonThink is something you’d be interested in, they have a 150-hours-of-use free trial for all their different apps. MacPowerUsers podcast just released a useful overview of the current state of the app — an interview with Stuart Ingram. ScreenCastsOnline also published the first part of a trilogy of video learning materials on DevonThink.

If you’re a Mac user who is perhaps uncomfortable with Evernote’s privacy policies or just seeking to get more out of the data you’ve stored on your hard drive, give DevonThink a try.

Highlights + DevonThink = Pretty Great

December 09, 2016 in Books, Productivity, Tech, Useful Tools

I’m late to the Highlights party, but I’m glad I got here.

Like many readers of this blog, I get sent (and occasionally read) a lot of PDFs. In fact, I did a quick search in DevonThink, and I am informed that I have 52,244 PDFs in my library. These are a mix of reports, archived copies of websites, scanned-and-OCRed photos and a thousand-and-one things in between.

Thus far, my workflow has been to read PDFs on my Mac. Any notes I took while reading the file were written up manually in separate files. I would laboriously copy and paste whatever text snippet or quotation I wanted to preserve along with its page reference. These would be fed into DevonThink’s AI engine and magic would happen.

Now, post-Highlights-installation, my workflow is much less laborious. I can take highlights in-app, export all the quotations as separate text or HTML files and have DevonThink go do its thing without all the intermediary hassle. If you’re a professional researcher or writer using DevonThink as your notes database (and quite frankly, if not, why not?) the Highlights app will probably please you.

Knot 1: Solitary Confinement & Digital Security

December 02, 2016 in Knot, Books, Science, Tech

This is the first of what I hope will be a regular feature on this blog. Knot will link to recent things that I’ve been reading. It will include short articles as well as books and other things on the internet.

Articles

Martin Garbus — “America’s Invisible Inferno” (NYRB)

This NYRB review of ‘Hell Is A Very Small Place’ covers the use of solitary confinement in the American prison system. It ends with unearned optimism, but nevertheless is a useful reminder of the importance of the issue.

Mary Catherine O’Connor — “The latest weapon in the fight against illegal fishing? Artificial intelligence” (The Guardian)

Interesting short piece on the use of machine learning combined with data science competition site, Kaggle, to try to improve the efficiency and accuracy of fishing inspections. Hardly a day passes without a new initiative like this being launched, merging data science and crowdsourced solutions. I look forward to being able to contribute (one day in the hopefully-not-so-distant future).

Meeri Kim — “How a researcher used big data to beat her own ovarian cancer” (Washington Post)

More data science, though this is more at the N=1 edge of things than the previous fishery data story. The more data we are able to access and generate, the more there will be need for ways of analysing and processing these complex interactions. A lot of money is being invested in finding ways to automate this for non-technical users, though I imagine we’re still a little way away from that utopia.

Check Point — “More Than 1 Million Google Accounts Breached by Gooligan” (CheckPoint Blog)

I’m not quite sure why this isn’t a bigger story. I’ve long believed that anything stored by Google (in particular, emails) will at some point be leaked in a hack or by some disgruntled employee. This latest hack is pretty close to that scenario, with the caveat that we don’t know what was taken. Takeaway: move away from the Google ecosystem where possible.

Scott Gilbertson — “HTTPS is not a magic bullet for Web security” (Ars Technica)

This is a few months old but offers a useful overview on HTTPS technology, what it can help with (and what it is less useful for). The headline is actually a bit deceptive since the article seems to make a strong case for the use of HTTPS.

Shane Parrish — “The Simple Plan To Read More” (Farnam Street / Personal Growth)

I’ve long ago taken this advice on board, but it’s a useful reminder. I enjoy reading, and I enjoy the diversity that comes from reading widely and broadly, so keeping up the pace and seeing it more as a habit to be done most days is, to my mind, the right way of looking at it.

Books

I read three books this past week. Jim Klopman’s Balance Is Power was my easy introduction to some of the science behind why it is useful to train balance in a focused way as a skill.

I finished the second volume in Elena Ferrante’s tetralogy, The Story of a New Name, though I didn’t enjoy it quite as much as the first. I suspect I’ll have to return to it in a few weeks after the dust has settled.

Finally, I devoured Paul Kalanithi’s When Breath Becomes Air, a beautifully written exploration of death, medicine, how we go through all these things as people, how illness affects not only the body but the mind and the spirit as well. It reminded me of Atul Gawande’s Being Mortal, which covered similar themes albeit with less of an intimately tragic outcome. I highly recommend giving When Breath Becomes Air a read.

Sometimes Exist helps me feel better about myself :)

Seeing The Forest But For The Trees: On Exist.io

October 11, 2016 in Productivity, Tech, Useful Tools

So many services, so much data. How can I make sense of it all? You’ve probably had this thought yourself on occasion. You have your Fitbit data, your calendar data, your email data, your task management system, your last.fm music data and so on. All of these things exist in their own silos.

If you’re halfway intrepid, you’ve maybe even made efforts to liberate your data from these prisons. Best case scenario: you’re left with a half-complete .csv file that is incompatible with all the other .csv files you’ve downloaded from other services. Now, if you want to hook all these files up together, you’ll have to clean them up, massage the various tables and forms so that they match up. And then you’ll have to perform a whole host of calculations in order to figure out how you’re doing, or what the data has to tell you.
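To give a flavour of that massaging, here is a small Python sketch that lines up two exports by date. The column names and values are invented for illustration. Real services each have their own headers and date formats, which is exactly where the tedium comes from.

```python
import csv
import io

# Hypothetical exports; real services use their own column names and formats.
steps_csv = "Date,Steps\n2016-10-01,8042\n2016-10-02,3120\n"
sleep_csv = "day,minutes_asleep\n2016-10-02,412\n2016-10-01,455\n"

def load(text, date_column):
    """Index the rows of a CSV export by their date column."""
    return {row[date_column]: row for row in csv.DictReader(io.StringIO(text))}

steps = load(steps_csv, "Date")
sleep = load(sleep_csv, "day")

# Keep only the days that appear in both exports, in date order.
merged = [
    {
        "date": day,
        "steps": int(steps[day]["Steps"]),
        "minutes_asleep": int(sleep[day]["minutes_asleep"]),
    }
    for day in sorted(steps.keys() & sleep.keys())
]
print(merged)
```

This only works because both made-up files happen to use the same date format. In practice you would have to parse and normalise the dates first.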

Collating all this data and making sense of it can be a full-time job in and of itself. I admire those who give talks at institutions like the Quantified Self, telling tales derived from the past 20 years of financial tracking or weight data.

Personally, I don’t have time to do all of that. I need a service to help me out, something that will serve up platters of charts and correlations that address the main broad questions that I might address of these various data streams, notably:

how am I feeling?
how am I sleeping?
am I moving enough?
am I getting enough work done?

Now you might say that it’s easy to look at each of the services individually and you’d be right. I can look at a chart over at Fitbit.com and I’ll see something that shows me all the data for the past 12 months:

This is nice, but it doesn’t tell me a great deal, aside from the fact that I seem to be walking less in Amman than when I lived next to a huge forest in a provincial part of Holland. No surprises there. Also, a whole year is a bit too much for me to process. It doesn’t really help me calibrate my current actions and my plan going forward.

Enter Exist. I’ve been using Exist for a little over a year now, checking in on the site pretty much every day. Their own explanation for what they do is:

“We turn numbers into insights. We collect data from the services you already use and find trends and correlations in the results.”

You hook up all your web services, and your Exist dashboard will be populated with pretty graphs, useful correlations and various other goodies. These are the services currently supported:

Your dashboard page has a bunch of different panels which aggregate data from different parts of your life. For instance, my activity panel for today looks like this:

I usually go for a walk in the evenings, so I’ll most likely end today with a higher count, but you get the idea. There are panels for activity, productivity, sleep, mood, workouts, health, location, social media, music and weather.

So the dashboard is where you can come to get a snapshot of how things stand in the short term. Exist also offers correlations and suggestions based on combining and crunching the numbers from all the various parts of your data stream. This is often most interesting when it comes to your mood data. Check out these recent correlations that Exist crunched from my self-reported mood data alongside the other automated streams:

There are so many different types of charts and graphs; it’s difficult to pick my favourite ones. Here you can see my slow but steady battle to increase the amount of time I sleep every day, for example:

The mood data is self-reported. Exist sends you an email every day at a time of your choosing and you respond with a number from 1–5 alongside some comments about your subjective experience of the day. These comments are also then processed into useful charts and prompts. Weird things I’ve learned from Exist’s number crunching:

the less I sleep, the more likely I am to have a ‘perfect’ day (i.e. 5 on the 1–5 mood scale)
the more it rains or snows, the happier I am
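Exist doesn't publish exactly how it calculates these correlations, so the following is only a sketch of the general idea, assuming a plain Pearson correlation between two daily series. The function name and the sample values are invented for illustration and are not real data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Made-up values for one week: hours slept and the 1-5 mood score.
hours_slept = [7.5, 6.0, 8.0, 5.5, 6.5, 7.0, 5.0]
mood_score = [3, 4, 3, 5, 4, 3, 5]
r = pearson(hours_slept, mood_score)
```

A value of r near -1 would correspond to "the less I sleep, the better my mood", a value near +1 to the opposite, and a value near 0 to no linear relationship. With only a handful of days, any such number should be read with caution.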

You can use Exist in a bunch of different ways, depending on what you hope to get out of it. For my own situation, I love how it encourages awareness of my current patterns on the short- to medium-term timescale (i.e. one week to three months). For things like sleep, where it’s easy to forget whether you’re managing to do what you seek to do (i.e. sleep more, or get up earlier etc), Exist is really excellent.

There's a sizeable userbase who actively contribute to shaping the future of the company / service through feedback. There are a bunch of things I'd love to see happen with Exist -- changing the 1-5 scale for mood feedback to 1-10 as a start -- but the site is actively developed, and I'm really happy with all the new features that have been added in the year or so since I signed up.

You can find out the future / expansion plans that Belle and Josh have for Exist here. There’ll be more services, and correlations and integrations will be more usefully presented. All of this is as you’d expect. Belle and Josh are good people (together, they make up HelloCode which is the Australia-based parent entity that produces Exist, among other things). I spoke to Belle for a Sources and Methods podcast episode; that’ll be out soon, I hope, so keep your eyes out for that.

In my interactions with Belle / Josh I’ve been really impressed with the service that they offer through Exist and their other products. I put them in my personal category of ‘nice, kind people seeking to do good things in the world through offering really useful services’. Like the team who run Beeminder and GMB Fitness, Belle and Josh offer something that saves me time and adds meaning on a daily basis to my life. Exist is a paid service. If you already run a bunch of the services listed above in the diagram, you’ll probably benefit from Exist’s charts and correlations. I hope you’ll consider signing up for their free one-month trial.

PhD Tools: DevonThink for File Storage and Discovery

August 26, 2016 in Useful Tools, PhD, Productivity, Books

[This is part of a series on the tools I used to write my PhD. Check out the other parts here.]

Discovering similar notes in one of my DevonThink databases

I first heard about DevonThink in the same breath as Tinderbox. They go together, though they serve different purposes. Some people want to make an either/or decision about which to use. I see them as sufficiently different to assess them on their own merits and as per your usage scenario.

As with all tools, you should come to the decision table with a set of features that you're looking for. Don't just shop around for new things for the sake of newness or for the sake of having a really great set of tools. These programmes are not cheap. Luckily almost all of them come with generous trial versions or periods, but I don't recommend 'newness' as a feature of any particular merit.

DevonThink (I use the Pro Office version) is a place to store your files and notes. It can, I think, take any file you can throw at it. It comes with software for processing PDFs into fully-searchable documents (OCR software, in other words) which is part of the reason why the license for the Pro Office version of the programme is so expensive.

If you're anything like me, you're drowning in PDF documents. They all come with helpful names like "afghanistan_final_report_02_16.pdf" and unless you have a rigorous file hierarchy and sorting system, you'll probably struggle to find any given file. And using the basic file hierarchy system for storage doesn't help you in situations where you want to store the same file in multiple folders (e.g. what if a report is about both Afghanistan and Tunisia?). DevonThink has a feature which allows you to store a file in multiple locations without saving two copies of it. Any changes or annotations you make in one location will automatically show up in the other.

You might ask yourself why you would need DevonThink and Tinderbox (see this post for more). The short answer is that they store different kinds of files/data, and that DevonThink is less about thinking than about storage (to a certain extent) and discovery.

One of the key features of DevonThink Pro Office is its smart searching algorithms, its ability to suggest similar texts based on the contents of what you are looking at, etc. It does this by means of a proprietary algorithm, so I can't really tell you how it works, but just know that it does. It works best on smaller chunks of text. For example, I was reading through a particular source from the 3 million-word-strong Taliban Sources Project database and then I clicked the "See also" button and it had found a source on the same topic that I would never otherwise have read, even though it didn't use a single one of the keywords I would have used to search for it. It uses semantic webs of words to figure this stuff out. Anyway, beyond a certain database size, this power becomes really useful. It can also archive websites, store anything including text, do in-text searches on e-books etc etc. (Read more on how I use DevonThink for research in general here.)

I also used it a little as an archive for substantive drafts / iterations of the writeup process. That's another important part of the process: making backups of many different kinds. I never found any use for them, but at least they were there (just in case).

If you're a data and document hoarder at heart, like me, you'll soon have a DevonThink database (or several databases, split up by topic) that is too big to comprehend fully, and you won't remember what is inside the files. At that point, search becomes really important. Not just a straightforward search: the ability to put 'fuzzy' terms (e.g. if you search for "Afghanistan" it'll also find instances where it's incorrectly spelt "Afgahistan") and boolean operators into your query is really powerful. DevonThink is an amazing search tool. The company that developed the database software also make something called DevonAgent, which is basically a power-user search tool for the internet. Google on steroids, if you will. Fully customisable, scriptable... you can really go crazy with this stuff. I use it, but my PhD wasn't really about searching things on the internet, so I didn't use it much for my research or writeup. But it's a great tool, too.
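DevonThink's search is proprietary, so the following is not how it works internally. It is just a small Python sketch of what 'fuzzy' matching means, using the standard library's difflib module. The function name, the sample sentence and the 0.8 cutoff are assumptions chosen for illustration.

```python
import difflib
import re


def fuzzy_find(term, text, cutoff=0.8):
    """Return the words in `text` that closely resemble `term`."""
    # Group the original spellings by their lower-cased form.
    by_lower = {}
    for word in set(re.findall(r"[A-Za-z]+", text)):
        by_lower.setdefault(word.lower(), []).append(word)
    # difflib scores similarity from 0 to 1; keep candidates above the cutoff.
    close = difflib.get_close_matches(
        term.lower(), list(by_lower), n=10, cutoff=cutoff
    )
    return sorted(word for key in close for word in by_lower[key])


sample = "A report on Afgahistan and Tunisia, plus notes on Afghanistan."
matches = fuzzy_find("Afghanistan", sample)
```

Here the misspelt "Afgahistan" is found alongside the correct spelling, while unrelated words such as "Tunisia" fall below the cutoff. Lowering the cutoff catches more typos at the cost of more false matches.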

In short, DevonThink is a research database tool that will help you store and find the documents that relate to your research, and do smart things to help you find sources and texts that maybe you'd forgotten you'd saved. Highly recommended for anyone working with large numbers of documents.

Walking Amman

August 19, 2016 in Jordan, Tech

I’ve been walking around Amman a little in the past couple of days. My poor sense of direction, combined with the city’s somewhat haphazard street layout, means I make use of digital GPS maps on a regular basis. In Europe or North America, Google Maps is my service of choice, with due acknowledgement of their general creepiness.

But I discovered yesterday that Google Maps is pretty atrocious when walking around Amman. Either their data is old and of poor quality, or the algorithm for calculating time/distance between two points is not properly calibrated for a city with many hills. If you look on Google Maps’ display, you’ll see what looks like a flat terrain. Everything can seem very close. If you look out of the window, or walk on the streets, you’ll see that hills and a highly variable topography are very much a part of the experience of the city. (This gives some idea of it).

Google Maps knows how to deal with hills or variable terrain. After all, San Francisco, close to their centre of operations, is a pretty hilly city and I found the maps and the estimated timings worked pretty well when I was there last year. Which suggests to me that the problem isn’t that Google forgot to take into account topography but rather that the data is poor.

I’m studying data science-y things these days, so I thought a bit about how they might improve this data. Some possible solutions:

They’re already monitoring all the data coming from app usage etc, so why not track whether their estimates match up with how long people actually take to walk certain streets/routes? Mix that in with the topography data, and average it all out.
They could send out more cars. I don’t know how accurate the map data for driving in Amman is, but some anecdotal accounts suggest that it suffers from similar problems. This is probably too expensive, and I’m assuming it’d be preferable to find a solution that doesn’t require custom data collecting of this kind. Maybe something for when the world has millions of driverless cars all powered by Google’s software, but for now it’s impractical as a solution.
Find some abstract solution based on satellite-acquired topographic data which takes better account of gradients of roads etc.
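To illustrate how much gradients can change a walking estimate, here is a small Python sketch using Tobler's hiking function, a published rule of thumb relating walking speed to slope. This is a technique swapped in for illustration; there is no suggestion that Google Maps uses it. The function names and the segment format (horizontal distance and elevation change, both in metres) are assumptions.

```python
import math


def tobler_speed_kmh(slope):
    """Tobler's hiking function: walking speed (km/h) for a slope (rise/run).

    Speed peaks at 6 km/h on a gentle downhill (slope of -0.05) and falls
    off exponentially as the path gets steeper in either direction.
    """
    return 6.0 * math.exp(-3.5 * abs(slope + 0.05))


def walking_minutes(segments):
    """Estimate walking time for (distance_m, elevation_change_m) segments."""
    total_hours = 0.0
    for distance_m, climb_m in segments:
        if distance_m <= 0:
            continue  # skip degenerate segments
        slope = climb_m / distance_m
        total_hours += (distance_m / 1000.0) / tobler_speed_kmh(slope)
    return total_hours * 60.0


# The same 1 km walked on the flat, and with a hypothetical 150 m climb.
flat = walking_minutes([(1000, 0)])
uphill = walking_minutes([(1000, 150)])
```

Under this rule the uphill kilometre takes well over one and a half times as long as the flat one, which is the kind of gap a map that treats the terrain as flat will miss.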

For the moment, Google Maps is a pretty poor user experience as a pedestrian. Yesterday evening I was walking back home from the centre of town. The walk would, Google told me, take only 12 minutes. 40+ minutes later I arrived home.

Others have noted this same problem and suggested an alternative: OpenStreetMap data. The data is unattached to a particular app, but I downloaded one alongside the offline mapping data for Jordan/Amman. It seems pretty good at first glance, and I’ll be testing it out in the coming days. I’m interested to learn why it seems to perform better. My initial hypothesis is that its data is just better than that which Google Maps is using.

Vanishing Interest in Afghanistan

November 01, 2015 in Afghanistan, Journalism

Afghanistan has been fading from the international media map for several years. This chart (courtesy of Google Trends) -- illustrating search interest and media publication -- shows how the peak of late 2009 has been followed by a slow decline, one set to continue as international media outlets continue their pullout in favour of newer, flashier conflicts.