Addressing Artificial Intelligence issues in Automatic Speech Recognition that affect Transcription Accuracy

#39526

Speaker(s)

  • Jeremy Brassington, CEO, Habitat Learn Inc
  • Daniel Goerz, North American CEO, Habitat Learn Inc

Session Details

  • Length of Session: 1-hr
  • Format: Lecture
  • Expertise Level: Intermediate
  • Type of session: General Conference

Summary

Automatic Speech Recognition (ASR) system integrity may be compromised by hidden biases within the AI models employed. These biases often stem from inaccurate representation of different populations within the collected data; research from USC suggests that up to 38% of training data, such as generic sentences, lacks fairness. We will illustrate issues relating to gender and accent in English, the impact they have on academic transcriptions, and provide insight into an automated evaluation system that identifies and addresses the problem.

Abstract

Automatic Speech Recognition (ASR) systems play a growing role in academia, where they facilitate the transcription of lectures, discussions, and academic presentations. However, their effectiveness depends on their ability to accurately interpret a range of accents and other features affecting diction, including age and gender. External noise, technology use, and other environmental factors can lead to inaccurate transcriptions, alongside machine learning biases embedded in ASR models which, without correction and feedback, perpetuate disparities and hinder accessibility.

We present a bias-aware evaluation platform with fine-tuning of ASR models to mitigate such issues. Speech data collected for AI training may inadequately represent certain demographic groups, and ASR systems often struggle to accurately transcribe non-standard accents. Our first trial used 10,000 recordings, 5,000 each from women and men. The second used 6,000 sentence-length recordings, 2,000 each for English, Welsh, and Scottish accents. Word Error Rate (WER), a common metric of ASR accuracy, was used to compare the improvements provided by fine-tuning: WER is the sum of errors (substitutions, deletions, and insertions) divided by the total number of words in the reference transcript. The lower the WER, the better.
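
As a minimal, illustrative sketch (not the speakers' own evaluation code), WER can be computed as a word-level edit distance; the function name and example sentences below are assumptions for demonstration only:

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """WER = (substitutions + deletions + insertions) / words in the reference."""
        ref, hyp = reference.split(), hypothesis.split()
        # Word-level Levenshtein distance via dynamic programming.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                            # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                            # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)] / len(ref)

    # One substitution and one deletion across five reference words -> WER = 0.4
    # (multiply by 100 for percentage-style scores such as those reported below).
    print(word_error_rate("the lecture starts at nine", "a lecture starts at"))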

Limitations include lack of contextual understanding, missed semantic errors, and omissions. Fine-tuning a pre-trained model on equal numbers of male and female recordings produced WER scores up to five times better than the originals. The 1,000 female-speaker test transcripts improved from an average WER of 24.65 to 5.1; the 1,000 male-speaker samples improved from 23.74 to 4.34.

When accents were checked using fine-tuned data with appropriate numbers of text samples for each accent, similar WER reductions occurred. ASR models can be checked automatically for whether their training data is sufficiently diverse and representative to correct transcription errors, and fine-tuning can address other aspects of biased data.
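
The kind of automated check described above can be sketched as follows; this is an assumed illustration (the group labels, sample records, and the 5-point threshold are hypothetical), reusing word_error_rate() from the earlier sketch:

    from statistics import mean

    # Hypothetical evaluation records: (demographic group, reference transcript, ASR output).
    samples = [
        ("female", "the lecture starts at nine", "the lecture starts at nine"),
        ("male",   "please open your textbooks", "please open your text books"),
        # ... thousands more recordings per group in a real evaluation
    ]

    def group_wer(records):
        """Average WER per demographic group."""
        by_group = {}
        for group, ref, hyp in records:
            by_group.setdefault(group, []).append(word_error_rate(ref, hyp))
        return {group: mean(scores) for group, scores in by_group.items()}

    scores = group_wer(samples)
    gap = max(scores.values()) - min(scores.values())
    if gap > 0.05:  # flag a gap larger than 5 WER points (expressed here as a fraction)
        print(f"Possible bias detected: per-group WER {scores}, gap {gap:.2f}")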

Keypoints

  1. Automated evaluation of bias issues affecting transcription accuracy is possible, with defined limitations.
  2. Fine-tuning of training data to encompass diverse populations can improve word error rates in transcriptions.
  3. Different biases in AI training data require specific types of labelled data to improve outcomes.

Disability Areas

Cognitive/Learning, Deaf/Hard of Hearing

Topic Areas

Artificial Intelligence, Assistive Technology

Speaker Bio(s)

Jeremy Brassington

Jeremy has 21 years' experience in assistive technology as owner and founder of Conversor, an assistive listening device company, and co-founder of Habitat Learn, a speech-to-text software and services company. Jeremy is a Board Member of the Assistive Technology Industry Association.

Daniel Goerz

Daniel has spent 15 years supporting students in higher education with assistive technology. He founded Note Taking Express, a remote note-taking service company, in 2013. Note Taking Express merged with Habitat Learn in 2018.