OpenSpeaks Archives, a language archive for Wikimedia projects


Wikipedia cannot become the sum of all human knowledge until the vehicles of that knowledge, the human languages spoken worldwide, are also well represented on Wikipedia.

Most of the world’s languages are oral, not written. Text cannot express the nuances of a language as fully as audio or video can. However, audio or video without transcription is merely representational to a non-native speaker. Although 7,164 languages are spoken worldwide, there are only 354 Wikipedias. We need descriptive audio or video for each spoken language, and that media needs subtitles to make it comprehensible to non-native speakers and to readers with disabilities. That is an overly ambitious goal, but such goals are often reached one tiny step at a time. We’re launching OpenSpeaks Archives, an open and public digital language multimedia archive optimised for Wikipedia and Wikimedia projects, focusing on lesser-resourced languages.

In our first pilot, we published audio and video in five native spoken languages: Kusunda (Gejmehac Gipan) from Nepal, and Baleswari-Odia, Bonda (Remosam), Ho and Van Gujjari from India. Each video file is subtitled in multiple languages, including at least one local official language and English. Some of the videos also have closed captions in the language spoken. The videos have enriched Wikipedia in over 20 languages as well as Wiktionary, Wikisource, Wikidata, and Wikimedia Commons.

Ladura Singh Haiburu, a Ho-language speaker, demonstrating and saying the names of body parts (subtitled in English)

The pilot contributed to the maiden launch of Wiki Loves Languages, an edit-a-thon to grow knowledge of languages and their speakers in Wikimedia projects. A collaboration with two international archives is underway, in which they will acquire the media and distribute it among their networks. The source footage is all unused archival media from five documentaries: Gyani Maiya (2019), Remosam (2019), Mage Porob (2019), MarginalizedAadhaar (2021) and Nani Ma (2022). The approach and methodology are from OpenSpeaks. The pilot made extensive use of open source software and identified a list of technological gaps that hinder language documentation.

Overall workflow

We combed through the raw audio and video recordings from 2014 in our private archive for content relevant to Wikipedia or Wiktionary. We often recorded audio and video separately and synced them using a non-linear video editor. We made dummy subtitles to identify pauses in spoken sentences and sent those, along with roughly edited videos, to language experts. They watched the videos, edited the subtitles and sent back drafts. We further edited the videos for content and audio, trimming unnecessary parts and adding relevant B-roll. We translated the subtitles and checked the translations with the language experts. Once the subtitles were finalised, the videos were exported, converted into WebM, uploaded to Wikimedia Commons, and embedded in Wikipedia articles and other places. We checked with the language experts multiple times for accuracy throughout the process.
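The dummy-subtitle step can be illustrated with a small sketch. The post does not say how pauses were detected, so everything here is an assumption: the function names (find_speech_cues, to_srt), the thresholds, and the input, which is taken to be a list of loudness values (0.0 to 1.0), one per 100 ms frame, already extracted from the audio by some other tool. It splits the recording wherever the loudness stays low for long enough and writes placeholder cues in SubRip (.srt) form for a language expert to fill in.

```python
def find_speech_cues(levels, frame_ms=100, silence_threshold=0.05, min_pause_ms=300):
    """Return (start_ms, end_ms) pairs for runs of speech separated by pauses.

    levels: one loudness value per frame (assumed input, 0.0 to 1.0).
    A pause is min_pause_ms or more of frames below silence_threshold.
    """
    min_pause_frames = max(1, -(-min_pause_ms // frame_ms))  # ceiling division
    cues = []
    start = None      # index of the first loud frame in the current run
    last_loud = None  # index of the most recent loud frame
    for i, level in enumerate(levels):
        if level >= silence_threshold:
            if start is None:
                start = i
            last_loud = i
        elif start is not None and i - last_loud >= min_pause_frames:
            # The pause is long enough: close the current cue.
            cues.append((start * frame_ms, (last_loud + 1) * frame_ms))
            start = None
    if start is not None:
        # The recording ended while someone was still speaking.
        cues.append((start * frame_ms, (last_loud + 1) * frame_ms))
    return cues


def srt_timestamp(ms):
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(cues, placeholder="[speech]"):
    """Render cues as SRT text with placeholder lines to be edited by hand."""
    blocks = []
    for number, (start_ms, end_ms) in enumerate(cues, start=1):
        blocks.append(
            f"{number}\n{srt_timestamp(start_ms)} --> {srt_timestamp(end_ms)}\n{placeholder}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    # Made-up loudness values: two short utterances separated by a 400 ms pause.
    demo_levels = [0, 0, 0.5, 0.6, 0.4, 0, 0, 0, 0, 0.7, 0.8, 0]
    print(to_srt(find_speech_cues(demo_levels)))
```

Working in whole milliseconds avoids floating-point drift in the timestamps. The threshold and minimum pause length would likely need tuning for each recording, since background noise and speaking pace vary.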

Gaps this archive addresses

OpenSpeaks Archives will focus on five critical aspects of media production, plus one optional aspect:

  1. Descriptive, natural speech recording: a speaker speaking about any topic in conversational language
  2. Recordings without background music: to keep spoken words clear unless the recording is of a musical performance
  3. Recordings professionally edited for content: multiple videos merged if needed; unnecessary parts (e.g. interviewer’s questions) moderately trimmed and B-roll inserted (video only) without distorting the flow of speech, so that each recording is independent and comprehensive; audio lightly cleaned to amplify the speaker’s voice and reduce noise. All such edits keep the speech natural.
  4. Output files transcribed and subtitled, with subtitles translated into English: spoken sentences transcribed or subtitled in a neighbouring majority language and in English using closed captions (which allow multilingual subtitling), not burned-in subtitles
  5. Highest-quality recording uploaded and embedded in Wikimedia projects: videos uploaded as WebM and audio as WAV (lossless) to Wikimedia Commons, and videos also as .mov (exported losslessly from the editing suite where possible) to the Internet Archive or a similar open knowledge online library; each video used in as many Wikipedia and other Wikimedia project entries as possible
  6. Media archived in a noted GLAM institution (optional): cataloguing the media in a noted GLAM institution’s online catalogue helps increase its citation count, which in turn strengthens its standing as a reliable source.

Software tools wishlist

An experimental browser-based offline subtitle editor was used for multilingual subtitling in the first pilot of OpenSpeaks Archives

This is a list of software we wish we had. Every community-based language archivist would need all or most of the tools on it. We did not have the resources to build full-fledged tools, so we used open source command-line scripts, mostly Python-based. Many archivists might not be comfortable with such workflows, so standalone tools are sorely needed. Ideally, some would be browser-based and independent of the operating system (working even on smartphones or tablets), and would work offline for archivists in remote places or with very low internet bandwidth. We plan to work on these and invite others to contribute, too.

  1. Audio/video-to-dummy subtitle creator: Identifies pauses between sentences to create dummy subtitles, which can later be manually edited.
  2. Offline, browser-based subtitle editor: A simple, browser-based, offline subtitle editor that creates video subtitles by playing, pausing, and typing.
  3. Audio/video file duration calculator: Totals the duration of audio/video files, which helps with budgeting.
  4. Video bitrate calculator (and converter): Works out the bitrate needed to keep a draft audio/video file within a file size limit (e.g. for file sharing on messaging applications) and converts the file accordingly, so drafts can be sent back and forth between video editors and language experts.
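The arithmetic behind items 3 and 4 is simple enough to sketch. This is a minimal illustration, not the scripts used in the pilot: the function names, the 96 kbit/s audio bitrate, and the 2% container overhead are all assumptions, and durations are taken as numbers in seconds rather than read from media files. The target video bitrate is the size limit converted to kilobits, divided by the duration, minus the audio bitrate.

```python
def format_duration(total_seconds):
    """Format a duration in seconds as H:MM:SS, rounded to the nearest second."""
    minutes, seconds = divmod(int(round(total_seconds)), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def target_video_kbps(size_limit_mb, duration_seconds, audio_kbps=96, overhead=0.02):
    """Return a video bitrate (kbit/s) that should keep a file under size_limit_mb.

    size_limit_mb uses decimal megabytes (1 MB = 8000 kbit).
    audio_kbps and overhead are assumed defaults; adjust them to the real export.
    """
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    usable_kbits = size_limit_mb * 8000 * (1 - overhead)
    video_kbps = usable_kbits / duration_seconds - audio_kbps
    if video_kbps <= 0:
        raise ValueError("size limit is too small for this duration and audio bitrate")
    return int(video_kbps)  # round down to stay under the limit


if __name__ == "__main__":
    # Hypothetical clip lengths in seconds, e.g. for estimating subtitling effort.
    clip_seconds = [312.4, 95.0, 1840.6]
    print("Total footage:", format_duration(sum(clip_seconds)))

    # A five-minute draft that has to fit in a 16 MB attachment.
    print("Video bitrate (kbit/s):", target_video_kbps(16, 300))
```

Rounding the bitrate down and reserving a little overhead errs on the side of a smaller file, which matters more than quality for a draft that only needs to be reviewed.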

As this pilot nears completion, we plan to expand to more languages and to involve more community archivists and Wikimedians. We already have recordings in 20+ low-resourced languages, made with informed consent from the interviewees. Our top priority is to bring some of those recordings to Wikimedia projects.
