Automated assessment, including the use of Artificial Intelligence (AI), is one of the latest education tech solutions. It speeds up exam marking, removes human biases, and is as accurate as, and at least as reliable as, human examiners. As innovations go, this one is a real game-changer for teachers and students.
However, it has understandably been met with a lot of questions and sometimes scepticism in the ELT community – can computers really mark speaking and writing exams accurately?
The answer is a resounding yes. Students from all parts of the world already take AI-graded tests. PTE Academic and Versant tests, for example, provide unbiased, fair and fast automated scoring for speaking and writing exams, irrespective of where the test takers live, or what their accent or gender is.
This article will explain the main processes involved in AI automated scoring and make the point that AI technologies are built on the foundations of consistent expert human judgements. So, let’s clear up the confusion around automated scoring and AI and take a look at how it can help teachers and students alike.
AI versus traditional automated scoring
First of all, let’s distinguish between traditional automated scoring and AI. When we talk about automated scoring, generally we mean scoring items that are either multiple-choice or cloze items. You may have to reorder sentences, choose from a drop-down list or insert a missing word, that sort of thing. These question types are designed to test particular skills, and automated scoring ensures that they can be marked quickly and accurately every time.
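As a minimal sketch of this kind of traditional automated scoring, the Python snippet below compares each response with a fixed answer key. The function name, the item IDs and the sample answers are all hypothetical, chosen only for illustration; they are not taken from any real test.

```python
def score_objective_items(answer_key, responses):
    """Mark each response against a fixed key; return (total correct, per-item marks)."""
    marks = {}
    for item_id, correct in answer_key.items():
        given = responses.get(item_id, "")  # a missing answer counts as wrong
        # Ignore case and surrounding spaces so "b" and " B " both match "B".
        marks[item_id] = given.strip().lower() == correct.strip().lower()
    return sum(marks.values()), marks


# Hypothetical key: one multiple-choice item, one cloze item, one drop-down item.
answer_key = {"q1": "B", "q2": "went", "q3": "C"}
responses = {"q1": "b", "q2": "goed", "q3": "C"}

total, marks = score_objective_items(answer_key, responses)
print(total, marks)
```

Because every item has one known correct answer, the same response always receives the same mark.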
While automatically scored items like these can be used to assess receptive skills such as listening and reading comprehension, this kind of scoring cannot mark the productive skills of writing and speaking. Every student response in writing and speaking items will be different, so how can computers mark them?
This is where AI comes in.
We hear a lot about how AI is increasingly being used in areas where there is a need to deal with large amounts of unstructured data efficiently and accurately, as in medical diagnostics, for example. In language testing, AI-based software is used to grade written and oral tests.
How AI is used to score speaking exams
The first step is to build an acoustic model for each language that can recognize speech and convert it into waveforms and text. While this technology used to be very unusual, most of our smartphones can do this now.
These acoustic models are then trained to score every single prompt or item on a test. We do this by using human expert raters to score the items first, using double marking. They score many hundreds of oral responses for each item, and these ‘Standards’ are then used to train the engine.
Next, we validate the trained engine by feeding in many more human-marked items and checking that the machine scores are very highly correlated with the human scores. If this doesn’t happen for any item, we remove the item, as it is essential to match the standard set by human markers. We expect a correlation of between 0.95 and 0.99. That means the machine scores track the human-marked samples almost exactly.
This is incredibly high compared to the reliability of human-marked speaking tests. In essence, we use a group of highly expert human raters to train the AI engine, and then their standard is replicated time after time.
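The validation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the Pearson correlation coefficient as the measure of agreement, takes 0.95 as the cut-off (the lower end of the range quoted above), and works on made-up scores. The function names and data are hypothetical and do not describe the actual scoring engine.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 scores each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / sqrt(var_x * var_y)


def items_to_keep(validation, threshold=0.95):
    """Keep only items whose machine scores correlate with human scores at or above threshold."""
    return [
        item_id
        for item_id, (human, machine) in validation.items()
        if pearson(human, machine) >= threshold
    ]


# Made-up validation data: item_id -> (human scores, machine scores).
validation = {
    "item_a": ([1, 2, 3, 4, 5], [1.1, 2.0, 2.9, 4.2, 5.0]),  # machine tracks humans closely
    "item_b": ([1, 2, 3, 4, 5], [3, 1, 4, 2, 3]),            # machine disagrees with humans
}

kept = items_to_keep(validation)
print(kept)
```

In this toy data set, only the item whose machine scores closely follow the human scores passes the check; the other would be removed from the test.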