I saw this talk at ICER, and I really loved how it led to the idea of evaluating test suites in terms of "wheats" (correct implementations, which a good suite should accept) and "chaffs" (buggy implementations, which it should reject). They describe this in terms of Validity and Thoroughness.
> A suite is valid if it accepts (i.e., its assertions pass) all correct implementations... In order for a suite to be valid for all implementations of median, it must not include any assertions involving empty input lists. We can accurately identify such assertions as invalid by checking them against two correct implementations (henceforth wheats [24])... If a student asserts that implementations should produce an error on empty inputs, their suite will reject the wheat that produces 0 (and vice versa). Provided that the set of wheats completely exercises the space of underspecified behaviors permitted by the specification, accepting all wheats guarantees that a suite is valid and will accept all correct implementations.
> A suite is thorough if it rejects (i.e., its assertions do not pass) buggy implementations. We assess the thoroughness of a suite by running it against a curated set of buggy implementations (henceforth chaffs [24]). The thoroughness of a suite is measured as the proportion of chaffs it rejects. To assess test suites, the set of chaffs should include subtly buggy implementations. To assess examples, we take a different perspective: the set of chaffs should exercise logical misunderstandings that students are likely to make. For instance, to assess the thoroughness of examples for median, the set of chaffs could include implementations of mean and mode.
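To make that concrete, here's a minimal sketch of the scoring in Python. This is my own illustration, not the authors' tooling; the implementations, example values, and helper names are all invented for the demo.

```python
# My own illustration (not the authors' tooling): score a student's example
# suite for `median` against wheats (correct impls) and chaffs (buggy impls).
from statistics import mean, mode


def median_error_on_empty(xs):
    """Wheat A: a correct median that errors on empty input."""
    s = sorted(xs)
    n = len(s)
    if n == 0:
        raise ValueError("median of empty list")
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2


def median_zero_on_empty(xs):
    """Wheat B: identical, except it returns 0 on empty input."""
    return 0 if not xs else median_error_on_empty(xs)


def median_forgets_even_case(xs):
    """Chaff: never averages the two middle elements."""
    s = sorted(xs)
    return s[len(s) // 2]


wheats = [median_error_on_empty, median_zero_on_empty]
chaffs = [mean, mode, median_forgets_even_case]

# A student's "suite" as (input, expected output) examples.
student_examples = [([1, 2, 10], 2), ([7, 1, 5, 3], 4)]


def accepts(impl, examples):
    """True iff every example's assertion passes against this implementation."""
    try:
        return all(impl(list(inp)) == out for inp, out in examples)
    except Exception:
        return False  # an error counts as the assertion failing


# Valid: every wheat is accepted. Adding an assertion about the empty list
# would reject one of the two wheats and make the suite invalid.
valid = all(accepts(w, student_examples) for w in wheats)

# Thorough: measured as the fraction of chaffs rejected.
thoroughness = sum(not accepts(c, student_examples) for c in chaffs) / len(chaffs)

print(f"valid: {valid}, thoroughness: {thoroughness:.2f}")
```

This prints valid: True, thoroughness: 1.00; swap one of the examples for ([], 0) and the suite becomes invalid, because it now rejects the wheat that errors on empty input.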
I want to see this used in more curricula and tools. I'd like to find out whether there's been any follow-up on this research and how it's gone.
skrishnamurthi 706 days ago
Unfortunately, all the follow-up work I know of has been done only by us (the same group of authors). Still waiting for others to pick it up! (-:
np_tedious 706 days ago
Seems roughly analogous to consistency and completeness in a logic system
skrishnamurthi 706 days ago
More like soundness and completeness. We're comparing two different systems: the executable examples versus the ground truth. Soundness and completeness are statements of comparison between two systems, whereas consistency means that a single system doesn't derive contradictions.
jswrenn 706 days ago
Oh, whoa, I'm the author of this! Happy to answer any questions.