Research and analysis

Reporting of measurement uncertainty and reliability for US educational and licensure tests: Summary

Published 16 May 2013

Applies to England, Northern Ireland and Wales

1. Overview

Overview by: Andrew Watts, AlphaPlus Consultancy Ltd.

The writers of this report were asked by Ofqual to investigate how assessment agencies in the United States report uncertainty information when making public the results of their assessments. The report focuses on tests and examinations given to students in school, and tests given to people wanting to enter certain professions – licensure tests. The team gathered the information by searching the documents of educational and assessment organisations, and by making personal contact with people responsible for the assessments.

2. Background

This report was written by researchers in the US to explain how measurement uncertainty in low-stakes and high-stakes tests is reported there. They were asked by Ofqual to answer the questions: Is the reporting of measurement error (that is, imprecision in scores and grades) common or typical in the US, or is it uncommon or atypical? And, if it is common or typical, how is it usually done? The researchers focused on educational tests, which are given to students still at school, and licensure tests, which are taken by people who want to practise in a profession. The study was carried out using web searches to find the relevant documents and telephone calls to the agencies which conduct the tests and publish the results. The study also looked at how the general public responds to information about measurement error being reported alongside examination and test results.

The US ‘rule book’ of test development is the Standards for Educational and Psychological Testing (1999), published by the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME). These standards specify that the degree of measurement uncertainty must be reported, and US courts have followed the standards in their rulings. Thus US test publishers are careful to ensure that their practices are consistent with the standards – and for some government-commissioned tests this kind of reporting is required as part of the contract.

But, if all the tests referred to in the report follow the standards, what does reporting to the public mean? How is it done? The researchers reported a variety of ways in which the testing agencies implement this requirement. A common practice is to report more widely only the most easily digestible test result statistics, and to put the more technical statistics – such as those specifying measurement imprecision – into ‘technical’ and/or ‘test development’ reports, which are not widely disseminated.

3. Educational tests

The researchers found that substantial uncertainty information was provided when the results of four major kinds of US educational test were reported. For the state exams linked to the ‘No Child Left Behind’ programme (reading and mathematics tests across seven grade levels, and science across three), practices vary from state to state but all states publish technical information in manuals linked to the tests. These give estimates of measurement error in scores, and information about the consistency and accuracy of performance classifications. Where the marking of the tests has required more decision making from markers, i.e. when the answers are not multiple choice or objectively marked, the manuals also give information about ‘inter-rater reliability’ (showing whether the markers were marking to the same standards as each other).
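
As a simple illustration of what an inter-rater reliability statistic measures, the sketch below computes Cohen’s kappa, one widely used agreement index, for two hypothetical markers. The marks and the choice of statistic are illustrative assumptions, not details taken from any manual discussed in the report.

# Illustrative sketch: Cohen's kappa corrects the raw agreement rate
# between two markers for the agreement expected by chance alone.
# The marks below are invented for demonstration only.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same set of responses."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: proportion of responses given the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal score distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[s] * counts_b[s] for s in counts_a) / n**2
    return (observed - expected) / (1 - expected)

# Two markers each scoring ten constructed-response answers on a 0-3 scale.
marker_1 = [3, 2, 2, 1, 0, 3, 2, 1, 1, 2]
marker_2 = [3, 2, 1, 1, 0, 3, 2, 2, 1, 2]
print(f"Cohen's kappa: {cohens_kappa(marker_1, marker_2):.2f}")  # 0.71

A kappa near 1 indicates the markers are applying the same standards; a value near 0 indicates agreement no better than chance.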

Many of the states highlight possible imprecision in the results alongside the students’ scores on the published parent/student reports. High-school leaving examinations also give similar amounts of uncertainty information, as do the two major college/university entrance exams, the SAT and ACT. The National Assessment of Educational Progress (NAEP) federal exams survey students’ performance across the US. They assess the performance of different groups but do not give individual results to students. The researchers found that these are the most transparent in terms of the information provided about how the test papers and questions performed. The NAEP results are made easier to interpret with numerous interpretative reports and online tools.

4. The public response

The report points out that parents are not directly given the full range of measurement uncertainty information about their children’s tests – but that those who want to see it can find it in the published test manuals. Much of it is highly technical and dull. The researchers reported that one state examination’s technical manual came to 800 pages. The public are more interested in other information about the tests, for example descriptions of the content to be covered and the kinds of question that will be asked. The report concludes that the availability of uncertainty information appears to match the level of demand for it.

5. Licensure exams

The researchers investigated licensure exams for entrants to medicine, accounting, nursing, law and teaching. The amount of information about levels of measurement uncertainty varies across these exams. Some report uncertainty information directly to students, along with more technical information in the test manuals; some just publish the manuals; some prepare reports which give substantial detail about the tests, but which are not released to the public.

The nursing exams use computer-based, multiple-choice questions, and the tests are variable-length, adaptive tests. That is, all candidates are asked a minimum number of questions, but the program gives each candidate different questions to answer according to their success in answering previous questions. This builds greater certainty into the exam because, where the estimate of a candidate’s ability remains uncertain, the program asks that candidate more questions. The accounting exams are of a similar type and the assessment agency provides a range of uncertainty information. The candidate reports contain a statement about possible sources of measurement error, and confidence bands are used to emphasise the role of measurement imprecision. The medical exams also publish a great deal of psychometric information, and examinees are provided with diagnostic information accompanied by confidence bands.
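
The sketch below illustrates the general idea of a variable-length adaptive test under a simple Rasch (one-parameter logistic) model: testing continues until the ability estimate is precise enough. It is not the algorithm used by any of the licensure exams discussed; the item bank, stopping rules and simulated candidate are invented for illustration.

# A minimal sketch of variable-length adaptive testing under a Rasch model.
# All numbers (item difficulties, thresholds, simulated candidate) are
# invented for demonstration; no real exam's algorithm is reproduced here.
import math
import random

def p_correct(ability, difficulty):
    """Rasch model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def standard_error(ability, administered):
    """SE of the ability estimate = 1 / sqrt(test information)."""
    info = sum(p_correct(ability, d) * (1 - p_correct(ability, d))
               for d in administered)
    return 1.0 / math.sqrt(info) if info > 0 else float("inf")

def update_ability(ability, administered, responses, steps=50, lr=0.1):
    """Crude gradient ascent on the log-likelihood of the responses."""
    for _ in range(steps):
        grad = sum(r - p_correct(ability, d)
                   for d, r in zip(administered, responses))
        ability += lr * grad
    return ability

random.seed(0)
item_bank = [random.uniform(-3, 3) for _ in range(200)]  # item difficulties
true_ability = 0.8                                       # simulated candidate
ability, administered, responses = 0.0, [], []

# Keep testing until the estimate is precise enough (or limits are hit).
MIN_ITEMS, MAX_ITEMS, TARGET_SE = 15, 60, 0.35
while len(administered) < MAX_ITEMS:
    # Pick the unused item closest in difficulty to the current estimate:
    # under the Rasch model that item is the most informative one.
    item = min((d for d in item_bank if d not in administered),
               key=lambda d: abs(d - ability))
    administered.append(item)
    responses.append(int(random.random() < p_correct(true_ability, item)))
    ability = update_ability(ability, administered, responses)
    if (len(administered) >= MIN_ITEMS
            and standard_error(ability, administered) <= TARGET_SE):
        break

print(f"Items given: {len(administered)}, "
      f"estimate: {ability:.2f} (SE {standard_error(ability, administered):.2f})")

The key point the paragraph describes is visible in the stopping rule: a candidate whose ability estimate remains uncertain keeps receiving items, so the final score carries a known, bounded level of imprecision.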

The researchers looked at tests for prospective teachers developed by NES and Pearson. They found that little technical information was made available about these tests, either in manuals or in reports for candidates. Reports are written for the states involved, but these are not made public. The law licensure exams that were examined also did not make technical information about the tests public; they published only material informing candidates about the content of the examinations.

6. Levels of transparency in the US

Generally, the report concludes that the level of transparency in the US is greater for educational tests than for licensure tests. It is also greater for objective tests than for those which require more marker judgement, and more information is given for larger-scale exams than for smaller ones.

On the methods by which information about the imprecision of scores and grades might be given, the report mentions graphical score bands showing ranges within which scores would be expected to fall if the testing were replicated, and statements at the bottom of score reports explaining the concept of measurement imprecision. Students are sometimes told that score imprecision might result from, for example, the sampling of test items used on a given day, the impact of guessing, or how they are feeling on a particular day. Imprecision due to the subjectivity of scoring constructed-response items is almost always reported in technical manuals, as is imprecision associated with the setting of performance standards (i.e. cut scores). Generally, interpretive guides caution against over-interpreting test scores because of measurement error.
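
As a worked illustration of how such score bands can be derived, the sketch below uses the classical test theory relationship SEM = SD × √(1 − reliability) and reports an observed score plus or minus a multiple of the SEM. The test statistics are hypothetical, not taken from any exam in the report.

# Illustrative sketch of a score band built from the standard error of
# measurement (SEM). Under classical test theory, SEM = SD * sqrt(1 - r),
# where r is the test's reliability. All figures below are invented.
import math

def score_band(observed, sd, reliability, z=1.0):
    """Return the (low, high) band around an observed score.

    z = 1.0 gives roughly a 68% band; z = 1.96 roughly a 95% band.
    """
    sem = sd * math.sqrt(1 - reliability)
    return observed - z * sem, observed + z * sem

# A hypothetical test scaled with standard deviation 15 and reliability 0.91:
# SEM = 15 * sqrt(0.09) = 4.5, so a score of 112 is reported as 112 +/- 4.5.
low, high = score_band(observed=112, sd=15, reliability=0.91)
print(f"Reported band: {low:.0f}-{high:.0f}")  # about 108-116

Bands of this kind are what the graphical score reports mentioned above display: a range, rather than a single point, within which the candidate’s score would be expected to fall on replication.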