> A suite is valid if it accepts (i.e., its assertions pass) all correct implementations... In order for a suite to be valid for all implementations of median, it must not include any assertions involving empty input lists. We can accurately identify such assertions as invalid by checking them against two correct implementations (henceforth *wheats*)... If a student asserts that implementations should produce an error on empty inputs, their suite will reject the wheat that produces 0 (and vice versa). Provided that the set of wheats completely exercises the space of underspecified behaviors permitted by the specification, accepting all wheats guarantees that a suite is valid and will accept all correct implementations.
> A suite is thorough if it rejects (i.e., its assertions do not pass) buggy implementations. We assess the thoroughness of a suite by running it against a curated set of buggy implementations (henceforth *chaffs*). The thoroughness of a suite is measured as the proportion of chaffs it rejects. To assess test suites, the set of chaffs should include subtly buggy implementations. To assess examples, we take a different perspective: the set of chaffs should exercise logical misunderstandings that students are likely to make. For instance, to assess the thoroughness of examples for median, the set of chaffs could include implementations of mean and mode.
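My reading of the setup, as a minimal sketch (the names and harness here are my own, hypothetical, not taken from the paper): two wheats that differ only in the underspecified empty-input behavior, one chaff implementing mean, and an `assess` function that reports validity (all wheats accepted) and thoroughness (fraction of chaffs rejected).

```python
def median(xs):
    """Wheat 1: correct median; errors on empty input (one allowed behavior)."""
    if not xs:
        raise ValueError("empty input")
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def median_zero_on_empty(xs):
    """Wheat 2: identical, but returns 0 on empty input (the other allowed behavior)."""
    return 0 if not xs else median(xs)

def mean(xs):
    """Chaff: a likely logical misunderstanding of median."""
    return sum(xs) / len(xs)

def suite_passes(impl, suite):
    """A suite is a list of (input, expected) assertions."""
    try:
        return all(impl(inp) == expected for inp, expected in suite)
    except Exception:
        return False

def assess(suite, wheats, chaffs):
    """Return (valid, thoroughness) for a suite."""
    valid = all(suite_passes(w, suite) for w in wheats)
    thoroughness = sum(not suite_passes(c, suite) for c in chaffs) / len(chaffs)
    return valid, thoroughness

if __name__ == "__main__":
    # Inputs chosen so mean and median disagree: this suite is valid
    # (no empty-input assertions) and rejects the mean chaff.
    suite = [([1, 2, 9], 2), ([1, 2, 3, 10], 2.5)]
    print(assess(suite, [median, median_zero_on_empty], [mean]))  # (True, 1.0)

    # Asserting a specific empty-input behavior rejects one wheat,
    # so the suite is invalid.
    print(assess([([], 0)], [median, median_zero_on_empty], [mean]))  # (False, 1.0)
```

Note the first suite would fail to reject `mean` if it only used symmetric inputs like `[1, 2, 3]`, where mean and median coincide; thoroughness depends on picking assertions that separate the chaffs from the wheats.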
I want to see this used in more curricula and tools. I should look into whether there's been any follow-up on this research and how it's gone.