Coding

Fuzzy Searching and Foreign Name Recognition

Here's something that happens fairly often: I'll be reading a book and someone's name is mentioned. I'll think to myself that it'd be useful at this point to get a bit of extra information before I continue reading. I hop over to DevonThink to do a full-text search over all my databases. I let the search run for a short while, but nothing comes up. I tweak the name slightly to see if a different spelling brings more results. That works a bit better, but I have to tweak the spelling several times before I can really claim the search has been exhaustive.

Anyone who's done work in and on a place where a lot of material is generated without fixed spellings for transliteration will know this problem. In Afghanistan, it ranges from people's names -- Muhammad, Mohammad, Muhammed, Mohammed etc -- to place and province names -- Kunduz, Konduz, Kondoz, Qonduz, Qhunduz etc.

DevonThink actually has a 'fuzzy search' option that you can toggle, but it isn't clear to me how it works or whether it's reliable as a replacement for a more systematic approach.

As I'm currently doing more and more work in Python, I was considering what my options would be for building my own fuzzy search tool.

My first thought was to be prescriptive about the rules and transformations at play when people make different spelling choices. The Kunduz example above shows that vowels are a key point of contention: the 'u' can also be spelt 'o'. The 'K' at the beginning could, in certain circumstances, become 'Q' or 'Qh'. These rules could then be coded into a system that collects all the possible spelling variations of a particular string and searches the database for each one.
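Here's a minimal sketch of what that rule-based approach might look like. The substitution rules below are just the couple from the Kunduz example, nothing like a complete set:

import itertools

# illustrative substitution rules only -- a real system would need many more
RULES = {
    'u': ['u', 'o'],
    'K': ['K', 'Q', 'Qh'],
}

def spelling_variants(name):
    # for each character, list its possible substitutes (or just itself)
    options = [RULES.get(char, [char]) for char in name]
    # the cartesian product of those options gives every possible variant
    return {''.join(combo) for combo in itertools.product(*options)}

print(spelling_variants('Kunduz'))
# {'Kunduz', 'Konduz', 'Qundoz', 'Qhonduz', ...}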

Following a bit of duckduckgo-ing around, I've since learnt that this problem has been discussed quite extensively and that various solutions have been proposed. One commonly referenced option is a Python package called 'FuzzyWuzzy'; it uses a mathematical metric called the Levenshtein distance to measure how similar two strings are. I imagine there are many other possible metrics one could use to quantify how much two strings resemble one another.
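A quick sketch of how that looks in practice, assuming FuzzyWuzzy is installed (pip install fuzzywuzzy). Scores run from 0 to 100, and the numbers in the comments are only indicative:

from fuzzywuzzy import fuzz, process

# similarity score between two strings, out of 100
print(fuzz.ratio('Kunduz', 'Qonduz'))
# e.g. around 67

# rank a list of known spellings by how closely they match a query
spellings = ['Kunduz', 'Konduz', 'Kondoz', 'Qonduz', 'Kandahar']
print(process.extract('Qhunduz', spellings, limit=3))
# the three closest spellings, each paired with a similarity score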

I imagine the most accurate solution is a mixture of both approaches. You want something that is agnostic about content for situations where you don't have domain knowledge. (I happen to have read a lot of the material relating to Afghanistan, so I know that these variations exist and that a single entity unites the various spellings of Kunduz, for example.) But you probably also want to code in some common rules for things that come up often. (See this article, for example, on the confusion over spellings of Muslim names and how it leads to law enforcement mistakes.)

I may end up coding up a version that has high accuracy on Afghan names because it's a scenario in which I often find myself, but I'll have to explore the other more mathematically-driven options to see if I can find a happy medium.

Installing PostgreSQL on a Mac

PostgreSQL is an open-source relational database system. It has been around for a while, and is in the middle of a sort of revival. Installing Postgres on your own system can be a little difficult; last time I tried, I was helped through the process while doing the Udacity Intro to Programming Nanodegree.

Recently I had to reinstall Postgres, and the Dataquest lessons that guided me through it this time included some useful improvements to the process.

Postgres.app is an application you can install on your Mac which simplifies a lot of the legwork, particularly when setting up new databases, servers and so on.

If you want a commonly used Python library for interfacing with Postgres, psycopg2 is a good option. You can install it easily with Anaconda:

conda install psycopg2
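Once that's installed, talking to a local database only takes a few lines. Here's a minimal sketch, assuming a database called 'test_db' already exists (the database and table names are just placeholders):

import psycopg2

# connect to a local database (Postgres.app runs the server on localhost)
conn = psycopg2.connect(dbname='test_db')
cur = conn.cursor()

# create a table, insert a row, and read it back
cur.execute('CREATE TABLE IF NOT EXISTS places (id serial PRIMARY KEY, name text);')
cur.execute('INSERT INTO places (name) VALUES (%s);', ('Kunduz',))
cur.execute('SELECT * FROM places;')
print(cur.fetchall())

conn.commit()
conn.close()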

Making and shuffling lists in Python

I discovered some useful functions the other day while working through one of the Dataquest guided projects. They all relate to lists and arrays and use NumPy. I'm listing them here mainly as a note to my future self.

import numpy as np

# arange returns an array of n items, counting up from 0
np.arange(3)
# returns array([0, 1, 2])

# with two arguments, it counts from the first value up to (but not including) the second
np.arange(3, 7)
# returns array([3, 4, 5, 6])

# a third argument sets a step size between values
np.arange(2, 9, 2)
# returns array([2, 4, 6, 8])

# these are slightly different; they shuffle rather than sort
# if you want an array of numbers in a random order:

np.random.permutation(10)
# returns the numbers 0-9 in an array, randomly shuffled

# you can also pass non-numeric lists into `permutation`
letters = ['a', 'b', 'c']
np.random.permutation(letters)
# returns something like array(['b', 'a', 'c'], dtype='<U1')
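One way I can imagine using this (a sketch of my own, not from the lessons): shuffling two parallel lists together by permuting their shared indices, so the pairs stay lined up:

import numpy as np

questions = ['capital of Balkh?', 'capital of Herat?', 'capital of Kandahar?']
answers = ['Mazar-i-Sharif', 'Herat City', 'Kandahar City']

# permute the indices once, then use them to reorder both lists together
shuffled_indices = np.random.permutation(len(questions))
for i in shuffled_indices:
    print(questions[i], '->', answers[i])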

Tabula for extracting table data from PDFs

Have you ever come across a PDF filled with useful data and wanted to play around with that data yourself? In the past, when I had that problem, I'd type the table out manually. This has some disadvantages:

  • it is extremely boring
  • it's likely that mistakes will get made, especially if the table is long and extends over several pages
  • it takes a long time

I recently discovered a tool that solves this problem: Tabula. It works on Windows and Mac and is very easy and intuitive to use. Simply take your page of data:

A page listing Kandahar's provincial council election polling stations from a few years back. Note the use of English and Dari scripts. Tabula handles all this without problems.

Then import the file into Tabula's web interface. It's surprisingly good at autodetecting where tables and table borders are, but you can do it manually if need be:

[Screenshot: selecting the table area in Tabula's web interface]

Then check that the data has been correctly extracted and select a format for export (CSV, JSON etc.):

[Screenshot: previewing the extracted data before export]

And there you have it, all your data in a CSV file ready for use in R or Python or just a simple Excel spreadsheet:

[Screenshot: the exported CSV opened in a spreadsheet]
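From there it's a one-liner to get the data into Python. A small sketch, assuming the export was saved as polling_stations.csv (a made-up filename) and pandas is installed:

import pandas as pd

# load the CSV that Tabula exported
df = pd.read_csv('polling_stations.csv')
print(df.head())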

Note that even though the interface runs in a browser, none of your data touches external servers: all the processing and extraction happens on your own computer, and nothing is sent to the cloud. This is a really nice feature and I'm glad they wrote the software this way.

I haven't had any problems using Tabula so far. It's a great time saver. Highly recommended.