Monday, October 7, 2013

RoboTeach: Semi-automatic method for grading a million homework assignments

From Strata:
Organize solutions into clusters and “force multiply” feedback provided by instructors

One of the hardest things about teaching a large class is grading exams and homework assignments. In my teaching days, a “large class” meant only a few hundred students (still a challenge for the TAs and instructor). But in the age of MOOCs, classes with a few (hundred) thousand students aren’t unusual.

Researchers at Stanford recently combed through over one million homework submissions from a large MOOC class offered in 2011. Students in the machine-learning course submitted programming code for assignments that consisted of several small programs (the typical submission was about 16 lines of code). While over 120,000 students enrolled, only about 10,000 completed all homework assignments (about 25,000 submitted at least one assignment).

The researchers were interested in figuring out ways to ease the burden of grading the large volume of homework submissions. The premise was that by sufficiently organizing the “space of possible solutions”, instructors could provide feedback on a few submissions, and that feedback could then be propagated to the rest.
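
To make the “force multiply” idea concrete, here is a minimal sketch in Python. It assumes submissions have already been assigned to clusters by some earlier step; the function name `propagate_feedback`, the submission ids, and the comments are all hypothetical and are not taken from the Stanford work.

```python
def propagate_feedback(cluster_of, instructor_feedback):
    """Copy each instructor comment to every submission in the same cluster.

    cluster_of: submission id -> cluster id (assumed to be precomputed).
    instructor_feedback: submission id -> comment, for the few graded submissions.
    Returns submission id -> comment for every submission whose cluster
    contains at least one graded submission.
    """
    feedback_by_cluster = {}
    for sub_id, comment in instructor_feedback.items():
        # Keep the first comment seen for a cluster.
        feedback_by_cluster.setdefault(cluster_of[sub_id], comment)
    return {
        sub_id: feedback_by_cluster[cluster_id]
        for sub_id, cluster_id in cluster_of.items()
        if cluster_id in feedback_by_cluster
    }


# Hypothetical toy data: five submissions in three clusters, two of them graded.
cluster_of = {"s1": 0, "s2": 0, "s3": 1, "s4": 1, "s5": 2}
instructor_feedback = {
    "s1": "Vectorize the loop.",
    "s3": "Off-by-one in the gradient update.",
}
propagated = propagate_feedback(cluster_of, instructor_feedback)
```

Two comments reach four submissions here, while `s5` sits in a cluster nobody has graded yet and gets nothing. How well this works in practice depends entirely on how good the clusters are.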


Domain-specific metrics
Organizing the space of homework submissions required a bit of domain expertise. The researchers settled on two dimensions: functional variability and coding style (syntactic variability). Unit test results were used as a proxy for functional variability. In the machine-learning course, unit test results were numbers, and programs were considered functionally equal if their resulting output vectors were the same. Abstract syntax trees (an AST is a tree representation of code structure) and tree edit distance were used to measure the stylistic similarity of code submissions. ...MORE
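
The two dimensions can be sketched roughly as follows. This is an illustration under stated assumptions, not the researchers' implementation: Python and its standard `ast` module are used purely for convenience, and the syntactic measure below compares flattened sequences of AST node types with `difflib` as a simple stand-in for true tree edit distance. All function names and sample data are hypothetical.

```python
import ast
import difflib
from collections import defaultdict


def group_by_output(outputs):
    """Group submission ids whose unit-test output vectors are identical."""
    groups = defaultdict(list)
    for sub_id, vector in outputs.items():
        groups[tuple(vector)].append(sub_id)
    return list(groups.values())


def node_types(source):
    """Flatten the AST of `source` into a list of node type names."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def syntactic_similarity(src_a, src_b):
    """Similarity in [0, 1] between AST node-type sequences.

    Stand-in for tree edit distance: identifiers and constants are
    ignored, and tree shape is only captured by traversal order.
    """
    matcher = difflib.SequenceMatcher(None, node_types(src_a), node_types(src_b))
    return matcher.ratio()


# Hypothetical unit-test outputs: s1 and s2 behave identically.
outputs = {"s1": [1.0, 2.5], "s2": [1.0, 2.5], "s3": [0.0, 2.5]}
functional_groups = group_by_output(outputs)

# Hypothetical submissions: a and b differ only in naming, c adds a loop.
src_a = "def f(x):\n    return x + 1\n"
src_b = "def g(y):\n    return y + 1\n"
src_c = "def h(x):\n    for i in range(3):\n        x += 1\n    return x\n"
same_style = syntactic_similarity(src_a, src_b)
different_style = syntactic_similarity(src_a, src_c)
```

Because only node types are compared, renaming variables leaves the score unchanged, which is roughly what one wants from a style measure. A real tree edit distance would also count how many node insertions, deletions, and relabelings separate the two trees.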

Code Webs: Stanford
The above figure is the landscape of ~40,000 student submissions to the same programming assignment on Coursera’s Machine Learning course. Nodes represent submissions and edges are drawn between syntactically similar submissions. Colors correspond to performance on a battery of unit tests (with red submissions passing all unit tests). In particular, clusters of similarly colored nodes correspond to multiple similar implementations that behaved in the same way (under unit tests).
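
A landscape like the one described can be approximated by linking any two submissions whose syntactic distance falls under a cutoff and then reading off the connected groups. The sketch below assumes pairwise distances are already available; the threshold value, the toy distances, and the function names are hypothetical, and using connected components as "clusters" is a simplification rather than the method behind the figure.

```python
from itertools import combinations


def build_graph(ids, distance, threshold):
    """Add an undirected edge between submissions closer than `threshold`."""
    graph = {sub_id: set() for sub_id in ids}
    for a, b in combinations(ids, 2):
        if distance(a, b) <= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return groups of mutually reachable nodes, each sorted by id."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components


# Hypothetical pairwise distances (for example, tree edit distances).
toy_distances = {
    frozenset(("s1", "s2")): 1,
    frozenset(("s3", "s4")): 2,
}


def toy_distance(a, b):
    # Pairs not listed are treated as far apart.
    return toy_distances.get(frozenset((a, b)), 9)


ids = ["s1", "s2", "s3", "s4"]
graph = build_graph(ids, toy_distance, threshold=3)
clusters = connected_components(graph)
```

Coloring each node by its unit-test result would then show whether syntactically close submissions also behave alike. Note that the all-pairs loop is quadratic in the number of submissions, so at the scale of ~40,000 submissions something smarter than this brute-force comparison would likely be needed.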