Getting and Cleaning Data is the third part of the Johns Hopkins Data Science specialization on Coursera.
As a title, this module is probably in the bottom quartile for excitement, up against the big hitters of “regression models” and “practical machine learning”. After all, I’m not sure anything with “cleaning” in the title has ever been labelled exciting!
Personally I was interested in hearing what this course had to offer, as I have more than once fallen foul of not having quality data to work with. This is a much underestimated aspect of working with data; you actually need to source it in the first place. Generally speaking, though, you’re busy thinking up what clever things you are going to do with this as yet unsourced data. Sometimes people may even tell you that you can get the data you are after… don’t believe them. You may even have a look at the data yourself and think it looks ok, but until you really start to interrogate it you won’t see that it’s probably flawed in some way. I have found this often in both academia and the workplace. It was nice to see Jeff Leek point this out fairly early on and coin a nice phrase with a lecture on “Data Motivation”.
The lectures are spread evenly, with weeks 1&2 concerned with getting data and 3&4 with cleaning data. The assignment, though, is really only concerned with the cleaning, which is a shame as it would have been nice to touch some external services that return JSON or some such.
I can’t say there was anything that really blew me away though; I think some of the material on using external data was covered in the R Programming course. Probably the main takeaway is in the depths of the assignment, with the concept of narrow vs wide data and the point that normalized data is not necessarily clean data.
Once again the lectures are read from slides, as you might expect! Jeff Leek must be a bit camera shy though, as I don’t recall any opening pieces to camera like some lecturers like to do. That said, he is perhaps more engaging than others. It is a simple thing, but if you are using slides to convey information, actively underlining important parts helps a lot. The student watching the video has some form of visual stimulation, rather than just a static slide on the screen for one minute.
I’ll be honest, I think it is a bit of a misnomer to call the “quizzes” quizzes. They’re far more thorough than what I consider a quiz to be, more like “tests”. Perhaps it is an Anglo connotation of the word that tends to imply quick-fire, brief answers to questions. I found that 5 questions could easily take an hour to complete. Consider the following example:
Cut the GDP ranking into 5 separate quantile groups. Make a table versus Income & Group. How many countries are Lower middle income but among the 38 nations with highest GDP?
Pretty involved, but this is a good thing!
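To give a flavour of what a question like this involves, here is a minimal sketch in plain Python rather than the R used on the course. The ten rows are made up for illustration and are not the course dataset, and the helper names quantile_group and cross_tab are my own. It assumes ranks run from 1 to n with no ties, so the group boundaries will not necessarily match what R's cut and quantile would give on real data.

```python
from collections import Counter

# Hypothetical (country, GDP rank, income group) rows; NOT the course data.
rows = [
    ("A", 1, "High income"),
    ("B", 2, "Lower middle income"),
    ("C", 3, "High income"),
    ("D", 4, "Upper middle income"),
    ("E", 5, "Lower middle income"),
    ("F", 6, "Upper middle income"),
    ("G", 7, "Low income"),
    ("H", 8, "Lower middle income"),
    ("I", 9, "Low income"),
    ("J", 10, "Low income"),
]


def quantile_group(rank, n_items, n_groups=5):
    """Map a 1-based rank (no ties) to a 1-based group of roughly equal size."""
    return (rank - 1) * n_groups // n_items + 1


def cross_tab(rows, n_groups=5):
    """Count countries for each (quantile group, income group) pair."""
    counts = Counter()
    for _country, rank, income in rows:
        group = quantile_group(rank, len(rows), n_groups)
        counts[(group, income)] += 1
    return counts


table = cross_tab(rows)

# Group 1 holds the highest-ranked countries, so this is the cell the
# quiz question asks about.
print(table[(1, "Lower middle income")])
```

The answer to the quiz question is then a single cell of the table: the count for the top group and the “Lower middle income” row.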
I hadn’t used the course forums for the previous two modules as I hadn’t felt the need. It was good to see they were pretty active and there was a particularly attentive community teaching assistant. Otherwise I mainly used them as a sounding board for how I was going to submit my assignment, as there was a certain amount of license students could take.
Assignment – the highlight
The assignment for this course was fairly easy to implement; it probably took about half the time compared to the previous one in the R Programming module. This is probably more in line with how long I would expect for a part-time course.
The actual activities involved merging several data files, calculating a mean and some arbitrary tidying of variable names etc. The added spice to the assignment comes with being asked to either keep the data in a “wide” format or convert it to a “narrow” one. As pointed out to students, there is no right or wrong answer; you just have to justify your choice. This part probably took me the most time, as I carried out additional background reading and sought counsel from other students on the forum before settling on a narrow format.
What worked well with this is that either answer is correct, yet the likelihood is that students who read the course background material will favor the narrow format. There’s a certain sense of satisfaction from getting a better result after going the extra mile.
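For anyone unfamiliar with the wide versus narrow distinction, a small sketch may help. It is in plain Python rather than R, and the column names (subject, activity, mean_acc_x, mean_acc_y) and values are invented for illustration, not taken from the assignment data. The hypothetical melt helper turns each measurement column of a wide row into its own narrow row.

```python
# Hypothetical wide-format rows: one row per subject/activity pair and
# one column per measurement. Names and values are made up.
wide = [
    {"subject": 1, "activity": "WALKING", "mean_acc_x": 0.28, "mean_acc_y": -0.02},
    {"subject": 2, "activity": "SITTING", "mean_acc_x": 0.26, "mean_acc_y": -0.01},
]


def melt(rows, id_keys):
    """Convert wide rows into narrow rows with 'variable' and 'value' columns."""
    narrow = []
    for row in rows:
        ids = {key: row[key] for key in id_keys}
        for key, value in row.items():
            if key not in id_keys:
                # One narrow row per measurement column.
                narrow.append({**ids, "variable": key, "value": value})
    return narrow


narrow = melt(wide, id_keys=("subject", "activity"))

for row in narrow:
    print(row)
```

Two wide rows with two measurements each become four narrow rows. The information is the same in both forms; only the layout changes, which is why either choice can be justified.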
Peer Assessment – the low light
Here we go again, what appears to be the elephant in the room for MOOCs. Once more I find myself pretty disenchanted with this. Perhaps more so than before, but with this course it is once again due to faculty’s choice of what is peer assessed. I have seen peer assessments work well for online courses in the past, when learning a new spoken language or written text for example. For marking technical submissions that can largely be composed of subjective choices it is just not sensible.
The truth of the matter is, you are getting people who are still novices in a subject to mark others. This ultimately leads to false positives or false negatives i.e. people get credit they don’t deserve or are not credited with what they have done.
We have to remember people are also paying for this course, so is it really acceptable? Is the policy that “assessment is final” and can’t be re-assessed fair? When we have tools like automated unit tests for assessing code, do we really need peer assessment for that?
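As a rough idea of what I mean by automated checks, here is a sketch using Python's built-in unittest module. It is purely illustrative: the required column names and the find_problems helper are my own assumptions, not anything the course provides, and a real grader would need to check far more than this.

```python
import unittest

# Hypothetical columns a narrow-format submission might be required to have.
REQUIRED_COLUMNS = {"subject", "activity", "variable", "value"}


def find_problems(rows):
    """Return a list of human-readable problems found in a submission."""
    problems = []
    for i, row in enumerate(rows):
        missing = REQUIRED_COLUMNS - set(row)
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
        elif not isinstance(row["value"], (int, float)):
            problems.append(f"row {i}: value is not numeric")
    return problems


class TidySubmissionTest(unittest.TestCase):
    def test_valid_submission_has_no_problems(self):
        rows = [{"subject": 1, "activity": "WALKING",
                 "variable": "mean_acc_x", "value": 0.28}]
        self.assertEqual(find_problems(rows), [])

    def test_missing_column_is_reported(self):
        rows = [{"subject": 1, "activity": "WALKING", "value": 0.28}]
        self.assertEqual(find_problems(rows),
                         ["row 0: missing columns ['variable']"])
```

Saved to a file, tests like these can be run with python -m unittest. Checks of this kind cover the mechanical parts of a submission and give every student the same result; they cannot judge whether a justification for wide or narrow is any good.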
From the assessments of others I completed, I would say 50% of people struggle with understanding what they are meant to do in the assignment; will they really be able to interpret the accompanying marking rubric any better?
The marking rubrics themselves aren’t great either; they’re so loose you can get half marks for doing next to nothing. Consider the mark allocation for submitting the Code Book (a description of the data you have processed and presented in the assignment):
12 – Yes, the student submitted a code book and it appears correct.
6 – Yes, the student submitted a code book but it appears to have major flaws.
0 – No the student did not submit a codebook.
Of the four peer assessments I carried out, two people submitted code books that were just copied and pasted from the assignment or background reading. Yet according to the rubric they scored half marks, even though they submitted little of value.
Overall I enjoyed this course, though. I particularly liked the recommended reading and additional swirl exercises (an R-based learning package), especially as they were relevant to the assignment and the fairly involved quizzes. It definitely makes for a richer learning experience than doing two of the offerings at once and the bare minimum of work, essentially making it more of a university experience where students can put in extra effort if they desire.
December is a busy month for me, so I will be picking up with Exploratory Data Analysis in the New Year.