Alex Strick van Linschoten

Machine Learning with Weka

June 04, 2018 in Coding, Useful Tools

Learning to program is an infinite process. The field is as open and wide as you can imagine, and you are mostly constrained by your imagination.

I spent much of May getting my mind around Go. I took Todd McLeod's Go course on Greater Commons and learned a great deal. The course was somewhat short on practical implementation, however, and I'm eager to do things with what I learned. More on that in due course.

A parallel strand of my studies has been in statistics and more advanced applications of statistical methods, i.e. machine learning. I had done a bit of this in the past, but my poor foundation in basic statistics didn't serve me well. I am now rectifying that through Andy Field's excellent textbook.

For machine learning I decided to take a step back from the programming and use a graphical interface to start with. There are great APIs / tools available for this in most languages you can think of, but I wanted more of a solid foundation in the workflows around machine learning and the kinds of analysis that get done.

I read through a good deal of Jason Brownlee's blog (https://machinelearningmastery.com/) as well as his book on Weka, and he made a good case for why Weka is a good place to start.

I have noted a number of steps to move through in sequence, at the same time recognising that data analysis is often non-sequential. I expect to expand and/or redefine these steps over time.

Kaggle is one of the major hubs for machine learning practice (and learning) and I wanted to reengage there. The first data set they usually have you work on comes from the Titanic disaster. You take the full roster of people who boarded, including data points like their economic class, where they were staying on board the ship and their age/gender etc., and use anything and everything in terms of tools to predict who survived and who didn't. I had used this data set in the past when I was studying ML with Python.

My initial idea, therefore, was to take the .csv files from the Kaggle competition and use them in Weka to come up with some predictions. Unfortunately, there are some idiosyncrasies in the .csv file that make this difficult. Some of the attributes / columns in the data (like names) contain punctuation marks, which makes parsing the CSV data non-trivial. Weka uses ARFF files as standard but has the option to parse CSV data. It ran into quite a few errors when trying to crunch through the Titanic data, and no amount of basic fiddling would fix it.
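To make the punctuation problem concrete, here is a minimal Python sketch. It is not a diagnosis of what Weka's CSV loader actually does internally; it only shows why a name field with an embedded comma trips up any parser that splits on commas without honouring quotes. The sample rows are written from memory in the style of the Kaggle file and trimmed to six columns, so treat them as illustrative. `naive_split` and `parse_rows` are hypothetical helper names.

```python
import csv
import io

# Illustrative rows in the style of the Kaggle Titanic train.csv,
# trimmed to six columns (an assumption, not a copy of the real file).
SAMPLE = '''PassengerId,Survived,Pclass,Name,Sex,Age
1,0,3,"Braund, Mr. Owen Harris",male,22
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38
'''


def naive_split(line):
    """Split on every comma, ignoring quotes (the fragile approach)."""
    return line.split(",")


def parse_rows(text):
    """Parse with the standard csv module, which honours quoted fields."""
    return list(csv.reader(io.StringIO(text)))


header, *rows = parse_rows(SAMPLE)
first_line = SAMPLE.splitlines()[1]

# The header has 6 columns, but the naive split sees 7 fields in the
# first data row because the comma inside the quoted name counts too.
print(len(header), len(naive_split(first_line)), len(rows[0]))
print(rows[0][3])
```

A quote-aware reader keeps "Braund, Mr. Owen Harris" as one field, which is the behaviour any preprocessing step would need to preserve.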

Reading around a little, it seems that others have noted this problem in the past. One blog post tackled the problem head-on, but the solution didn't really help me much in the short term. I'm now somewhat stuck, knowing that the fix is to use another language (Python, perhaps) to range over the data and process it into a form that will be more palatable for Weka. Alternatively, I could use it as an opportunity to build a short Go programme that could perform the same function.
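The Python route might look something like the sketch below, which reads CSV text with the standard csv module and writes out a bare-bones ARFF file. Several things here are assumptions rather than anything tested against Weka: the function names (`csv_to_arff`, `quote`, `is_number`) are hypothetical, every row is assumed to have the same number of columns, header names are assumed to contain no spaces, a column is declared `numeric` only if all its non-empty values parse as numbers, and everything else is written as a single-quoted `string` with backslash escapes. Missing values become `?`, which is ARFF's marker for them. The sample rows are illustrative.

```python
import csv
import io


def is_number(value):
    """Return True if the text parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def quote(value):
    """Single-quote a string, backslash-escaping backslashes and quotes."""
    escaped = (value.replace("\\", "\\\\")
                    .replace("'", "\\'")
                    .replace('"', '\\"'))
    return "'" + escaped + "'"


def csv_to_arff(csv_text, relation):
    """Convert rectangular CSV text (with a header row) to ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    # A column is numeric only if every non-empty value parses as a number.
    numeric = []
    for i in range(len(header)):
        values = [row[i] for row in data if row[i] != ""]
        numeric.append(bool(values) and all(is_number(v) for v in values))

    lines = ["@relation " + relation, ""]
    for name, num in zip(header, numeric):
        lines.append("@attribute " + name + (" numeric" if num else " string"))
    lines += ["", "@data"]

    for row in data:
        cells = []
        for value, num in zip(row, numeric):
            if value == "":
                cells.append("?")           # ARFF missing-value marker
            elif num:
                cells.append(value)
            else:
                cells.append(quote(value))  # keeps embedded commas intact
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


# Illustrative rows only; the second one has a missing Age.
SAMPLE = '''PassengerId,Survived,Pclass,Name,Sex,Age
1,0,3,"Braund, Mr. Owen Harris",male,22
6,0,3,"Moran, Mr. James",male,
'''

print(csv_to_arff(SAMPLE, "titanic"))
```

This only gets the data into a shape Weka should be able to open. Columns such as Sex or Survived would probably still need converting from string or numeric to nominal inside Weka before a classifier will accept them.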

For the moment, I've decided to do neither. I'm going to find an alternative data set which doesn't require wrangling and fiddling. I know wrangling and fiddling is an important skill to master, but it's not the skill I'm trying to focus on right now. Luckily, between the UCI Machine Learning repository and various other places, I'm not exactly lacking for examples / other data sets. Today I'll work with the Pima Indians Diabetes data set which came built-in with Weka.

Tags: machinelearning, weka, golang, statistics