You think we said what? A comparison of multiple AI’s attempts at captioning a folk music performance

Scheduled at 9:00 am in Matchless on Wednesday, November 19.

#41106

Speaker(s)

  • Laura Ciporen, Digital Content Accessibility Manager, McGraw Hill Education

Session Details

  • Length of Session: 1-hr
  • Format: Lecture
  • Expertise Level: All Levels
  • Type of session: General Conference

Summary

Running various automated captioning tools on a video of a 30-minute folk music set yielded some wildly different, and some interestingly similar, results. While AI is generally a black box, the mistakes it makes can shed a few specks of light on how each tool works and hint at where its data comes from. See how an assortment of captioning tools fared, laugh and cry at the misinterpretations, and witness how the conference’s transcribers handle a few snippets of harmony singing.

Abstract

On Valentine’s Day, 2025, a folk duo with "tight harmonies and loose shtick" performed for the first time at a sci-fi convention. Before posting the amateurly filmed video for public view, I ran it through various automated caption generators to get a starting point for creating the captions. Things did not go as I had hoped. Cues that would help humans understand the content proved useless to AI. Chunks of content were skipped with no indication that any sound had been made. Bizarre nonsense was confidently generated. And yet, key details were picked up, uncommon names were interpreted with surprising acuity, and surprising tools were surprisingly reliable. While it certainly didn’t save me time, this experiment taught me a great deal about the different strengths and weaknesses of various AI tools from Otter.ai to YouTube to Microsoft Word and more. Come have a laugh, get provocative insight into the data that goes into some AIs’ black boxes, uncover some straightforward room for improvement, and hear small snippets of folk singing. As a bonus, learn where this advice came from: “Tinny lamb….We can drive away the silence of the gravy. We all fear that we all fall silent in the end. Sing.”

Key Points

  1. Automated captioning tools vary wildly in their interpretation of the same content.
  2. Doing less work on transcripts before providing them as sources for caption timing yields better results.
  3. Always research the format of the output of captioning tools before committing to using them.

Disability Areas

Cognitive/Learning, Deaf/Hard of Hearing

Topic Areas

Artificial Intelligence, Captioning/Transcription, Uncategorized

Speaker Bio(s)

Laura Ciporen

IAAP CPACC. Creating inclusive content (with WCAG-based accessibility) for higher education since 2015.
