Report: States Should Focus on Score Validity When Using Automated Scoring for Assessments of Spoken English Proficiency


PRINCETON, New Jersey, June 1, 2017 /PRNewswire-HISPANIC PR WIRE/ — A new report from Educational Testing Service (ETS) urges states, districts and schools to focus on score validity as they move to automate scoring of speaking proficiency for the growing number of K-12 English learners. The authors offer recommendations for the most effective approaches to produce meaningful scores.


The report – Approaches to Automated Scoring of Speaking for K-12 English Language Proficiency Assessments – is a guide for states and districts that are considering the use of automated scoring of speaking for K-12 English learners. It reviews the main design considerations (the components of state-of-the-art automated speech scoring systems and the speaking tasks to which they have been applied) and implementation considerations (test delivery, score reporting, and costs), and makes recommendations to help states determine a path forward. The report is the fifth in a series concerning English language proficiency (ELP) assessments and English learners produced by ETS Research & Development.

“There are a number of considerations that states need to balance,” says lead author Keelan Evanini. “They need to think about key measurement issues like validity, reliability and fairness, and of course there are practical considerations related to usability and efficiency. But above all, it is essential that the test is based on a meaningful representation of a student’s speaking skill.”

The authors write that speaking remains uniquely challenging to score, but can provide essential information about students’ language proficiency. Automated scoring systems can ease the burden of scoring, but care must be taken to implement an automated scoring system that is appropriate to the skills being assessed.

The most effective automated speech scoring systems provide broad coverage of the speaking construct, including pronunciation, fluency, intonation, vocabulary, and grammar. Depending on the nature of the speaking task, they can produce scores that match those of human scorers, with the variation between automated and human scores no greater than that between two human scorers. The authors remind us that automated scoring systems are not yet fully mature for most task types that elicit spontaneous speech, because their ability to assess the content of the spoken response is still developing.

“The need to assess spontaneous speech effectively is important because current ELP standards, which are written to correspond with College and Career Readiness (CCR) standards, often emphasize aspects of language proficiency that are required to express higher order skills involving critical thinking,” explains co-author Kenji Hakuta. “These aspects of speaking proficiency are best assessed via tasks that elicit spontaneous speech and match the target language of the classroom. Therefore, it is crucial to find a balance between these constraints when designing tasks to be included in an ELP assessment that employs automated scoring.”

One key decision that states need to make when considering automated scoring systems is whether such systems will be used as the sole scoring mechanism for a speaking assessment (fully automated) or together with human rating (a hybrid approach). The authors recommend a hybrid approach for most assessments in order to ensure full construct coverage.

“In hybrid approaches, the practical benefits of automated scoring with respect to cost savings and score turnaround are not as striking as they are in the fully automated model,” says co-author Maurice Cogan Hauck, “but the automated scoring technology does considerably reduce the amount of human labor needed. Furthermore, some studies have shown that the combination of human and automated scores can result in more reliable scores than using either human or automated scores alone. A hybrid approach also helps ensure that speaking scores are valid for their intended purpose.”
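One simple way scores can be combined in a hybrid approach is a weighted average of the human rating and the automated score. This sketch is an illustration, not the report's method; the weighting scheme and score scale are assumptions.

```python
# Illustrative sketch (not from the report): combining a human rating with
# an automated score. The 60/40 weighting is an arbitrary example.
def hybrid_score(human: float, machine: float, human_weight: float = 0.6) -> float:
    """Return a weighted average of a human rating and an automated score."""
    if not 0.0 <= human_weight <= 1.0:
        raise ValueError("human_weight must be between 0 and 1")
    return human_weight * human + (1.0 - human_weight) * machine

# Example: a human rater gives 4.0, the automated system gives 3.5
print(round(hybrid_score(4.0, 3.5), 2))  # 3.8
```

In practice, hybrid designs more often route different task types to human or automated scoring rather than averaging two scores per response, but the reliability gains the quote mentions arise from drawing on both sources of evidence.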

In addition to providing schematics for developing an automated speech scoring system, the authors provide recommendations on pilot testing, sample size, and length of development prior to implementation. With the use of automated speech scoring come requirements for common test delivery platforms (desktop computers or tablets) and standardized processes for capturing high-quality digital recordings of student speech samples. The report also reviews both up-front and recurring costs for implementation.

Recommendations for States Considering Automated Scoring

Given the considerable advances in automated scoring technology, as well as the large and growing number of K-12 EL students who need to be assessed, now is an opportune time to consider the potential of using automated scoring to assess the speech of K-12 EL students. The authors make the following recommendations:

  • Validity of the assessment should be the first and foremost consideration.
  • Automated scoring needs to be considered during the test design phase so that the test can include speaking tasks that elicit evidence of the targeted knowledge, skills, and abilities in a way that is compatible with state-of-the-art automated scoring capabilities.
  • To provide full construct coverage for a given set of ELP standards, a hybrid approach, in which responses to some tasks are scored by human raters and responses to other tasks are scored by an automated system, is likely to be optimal.
  • States should consider not only the most obvious benefits, such as cost savings and faster score turnaround, but also some less readily apparent benefits, e.g., more reliable scores under the hybrid approach and the possibility of providing detailed information about a student’s speaking proficiency in real time to teachers and students for use as formative feedback, supporting learning and the ongoing monitoring of student progress.

“As the state of the art in the fields of automated speech recognition and automated speech scoring continues to advance, the ability of the automated speech scoring systems to provide a valid assessment of K-12 EL speaking proficiency across a wide variety of task types is expected to continue to increase. Therefore potential users of the technology should routinely reevaluate decisions about the appropriateness of automated speech scoring for a given assessment,” the report concludes.

Copies of the report may be downloaded free of charge from the Wiley Online Library.

Download other papers in the series:

  1. Creating a Next-Generation System of K–12 English Learner (EL) Language Proficiency Assessments
  2. Conceptualizing Accessibility for English Language Proficiency Assessments
  3. Next-Generation Summative English Language Proficiency Assessments for English Learners: Priorities for Policy and Research
  4. Key Issues and Opportunities in the Initial Identification and Classification of English Learners

About ETS
At ETS, we advance quality and equity in education for people worldwide by creating assessments based on rigorous research. ETS serves individuals, educational institutions and government agencies by providing customized solutions for teacher certification, English language learning, and elementary, secondary and postsecondary education, and by conducting education research, analysis and policy studies. Founded as a nonprofit in 1947, ETS develops, administers and scores more than 50 million tests annually — including the TOEFL® and TOEIC® tests, the GRE® tests and the Praxis Series® assessments — in more than 180 countries, at over 9,000 locations worldwide.


SOURCE Educational Testing Service