
Evaluation

Process: Objective and subjective evaluation


If we view objectivity and subjectivity of evaluation along a continuum, we can represent various assessment and scoring methods along its length.

Test items that can be evaluated objectively have one right answer (or one correct response pattern, in the case of more complex item formats). Scorers do not need to exercise judgment in marking responses correct or incorrect; they generally mark a test by following an answer key. In some cases, objective tests are scored by scanning machines and computers. Objective tests are often constructed with selected-response item formats, such as multiple-choice, matching, and true-false. An advantage of including selected-response items in objectively scored tests is that the range of possible answers is limited to the options provided by the test writer: the test taker cannot supply alternative, acceptable responses.
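As a rough illustration of marking by answer key, the sketch below scores a selected-response test by exact match, with no judgment involved. The function name, item labels, and option letters are hypothetical, chosen only for this example.

```python
def score_selected_response(responses, answer_key):
    """Count the items whose response exactly matches the answer key.

    Both arguments map an item label to the chosen option.
    An item with no response is simply counted as incorrect.
    """
    return sum(
        1 for item, correct in answer_key.items()
        if responses.get(item) == correct
    )


# Hypothetical three-item multiple-choice test.
answer_key = {"q1": "b", "q2": "c", "q3": "d"}
responses = {"q1": "b", "q2": "a", "q3": "d"}

print(score_selected_response(responses, answer_key))
```

Because every response is one of the options the test writer supplied, two scorers (or a scanning machine) applying this key cannot disagree about the result.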

Because much of what we assess in reading and listening comprehension measures is first interpreted by the test writer, some degree of subjectivity is present even in objectively scored items. For that reason, assessments of the Interpretive mode, even those composed of "one-right-answer" items, might not be placed all the way at the objective end of the continuum.

Evaluating responses objectively can be more difficult with even the simplest of constructed-response item formats. An answer key may specify the correct answer for a one-word gap-filling item, but there may in fact be multiple acceptable alternative responses to that item that the teacher or test developer did not anticipate. In classroom testing situations, teachers may perceive some responses as equally or partially correct, and apply some subjective judgment in refining their scoring criteria as they mark tests. Informal scoring criteria for short-answer items probably work well for classroom testing as long as they are applied consistently and are defensible.
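One way to keep such refined criteria consistent is to write them down as an extended key that lists each accepted response with its credit. The sketch below assumes a hypothetical gap-filling item ("Yesterday she ___ to the market."); the accepted responses and the credit values are invented for illustration and would in practice reflect the teacher's own judgment.

```python
def normalize(response):
    """Lowercase a response and collapse surrounding or repeated whitespace."""
    return " ".join(response.strip().lower().split())


def score_gap_fill(response, accepted):
    """Return the credit for a response, or 0.0 if it is not in the key.

    `accepted` maps each normalized acceptable response to its credit,
    so every paper is marked against the same recorded decisions.
    """
    return accepted.get(normalize(response), 0.0)


# Hypothetical extended key: the anticipated answer, two alternatives the
# teacher decided to accept while marking, and one partially correct form.
accepted_item_1 = {
    "went": 1.0,
    "walked": 1.0,
    "drove": 1.0,
    "gone": 0.5,
}

print(score_gap_fill("  Walked ", accepted_item_1))
```

Recording each new decision in the key as it is made is what lets the informal criteria be applied consistently and defended afterwards.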

Just as there may be few truly objective measures of second language knowledge and skill, so too is it rare to find purely subjective evaluations of performance. Allowing the subjective impressions of scorers to determine learners' grades would not be acceptable to most students, their parents, or other stakeholders. Compare this with judging a work of art: we do not usually have to justify our opinion that it is good or bad; we simply like it or we don't. Since our judgment has no significant consequences for the artist (unless we are art critics), a subjective evaluation is acceptable. Nor is it a matter of concern that the many viewers of the artwork do not agree about its quality.

In assessment, we strive to ensure two types of reliability: inter-rater (raters agree with each other) and intra-rater (a rater gives the same score to a performance rated on separate occasions). The higher the stakes, the more reliable (consistent) judgments must be. Scoring criteria, in the form of rubrics, are generally used to guide raters to arrive at the same, or nearly the same, evaluation of a product. Thus, although it is common to refer to scoring which requires human judgment as subjective evaluation, in most cases we might place it near the midpoint on our objective-subjective continuum.
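Rater consistency can be checked numerically. The sketch below computes two common agreement statistics, the proportion of exact agreement and Cohen's kappa (agreement corrected for chance). The choice of these statistics is an assumption made for illustration, not something the text prescribes, and the rubric scores are invented. The same functions serve for intra-rater reliability if the two lists hold one rater's scores from two occasions.

```python
from collections import Counter


def exact_agreement(scores_a, scores_b):
    """Proportion of performances that received the same score in both lists."""
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return matches / len(scores_a)


def cohens_kappa(scores_a, scores_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    observed = exact_agreement(scores_a, scores_b)
    n = len(scores_a)
    counts_a = Counter(scores_a)
    counts_b = Counter(scores_b)
    # Agreement expected by chance, given how often each list uses each score.
    expected = sum(counts_a[s] * counts_b[s] for s in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both lists use a single identical score throughout
    return (observed - expected) / (1 - expected)


# Hypothetical rubric scores (1-4) given by two raters to six performances.
rater_a = [3, 2, 4, 3, 1, 2]
rater_b = [3, 2, 3, 3, 1, 1]

print(round(exact_agreement(rater_a, rater_b), 3))
print(round(cohens_kappa(rater_a, rater_b), 3))
```

Kappa is lower than raw agreement because some matches would occur by chance alone; how high either value needs to be depends on the stakes of the assessment.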

In rated assessments, the scoring criteria form an integral part of the evaluation. Specialists in language testing often identify three key components in performance assessment. These components are:

  • Tasks that are effective in eliciting the performance to be assessed.
  • Rating criteria to evaluate the quality of the performance. The criteria reflect the relative importance of various aspects of the performance, and are appropriate for the population being assessed.
  • Raters who are trained to apply the criteria and can do so consistently.
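The idea that rating criteria reflect the relative importance of different aspects of a performance can be sketched as a weighted average of criterion ratings. The criterion names, weights, and 1-4 scale below are hypothetical; an actual rubric would define its own.

```python
def weighted_rubric_score(ratings, weights):
    """Combine per-criterion ratings into one score using importance weights.

    `ratings` and `weights` must name exactly the same criteria. The result
    is on the same scale as the individual ratings.
    """
    if set(ratings) != set(weights):
        raise ValueError("ratings and weights must cover the same criteria")
    total_weight = sum(weights.values())
    return sum(ratings[c] * weights[c] for c in weights) / total_weight


# Hypothetical rubric: two criteria count twice as much as the third.
weights = {"task_completion": 2, "comprehensibility": 2, "accuracy": 1}
ratings = {"task_completion": 4, "comprehensibility": 3, "accuracy": 2}

print(round(weighted_rubric_score(ratings, weights), 2))
```

Making the weights explicit gives every rater the same rule for combining criteria, which supports the consistency described in the third component.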

Rating criteria and rater training are the topics of the next pages.


Center for Advanced Research on Language Acquisition (CARLA) • 140 University International Center • 331 17th Ave SE • Minneapolis, MN 55414