Reaching out to Indonesia’s indigenous voices: WikiKathā

WikiKathā’s logo (Gunarta (WMID), CC BY 4.0)

Language documentation through digital media has been carried out in various ways, including through Wikimedia projects by contributors all around the world. In Indonesia, several programmes have already documented regional languages using Lingua Libre and the local versions of Wiktionary: WikiDialek (2020) for the Minangkabau language, WikiPandir (2022) for the Banjarese language, WikiSora (2022) for the Sundanese language, and most recently WikiTutur (2024–present), organised for various languages across the western part of the country. However, the languages documented through those programmes so far mostly belong to the “major languages” [1] group, and the participants were mainly people living in big cities who have easier access to technology. Many “minor languages” [2] of indigenous peoples are still not documented digitally, especially on Wikimedia platforms. Hence, WikiKathā was created to reach out to Indonesia’s indigenous voices that are still not documented on Wikimedia platforms.

What is WikiKathā?

Participants of the WikiKathā programme in Halmahera, October 2025 (Bangrapip, CC BY-SA 4.0)

WikiKathā is a language documentation programme organised by Wikimedia Indonesia in collaboration with Aliansi Masyarakat Adat Nusantara (the Alliance of Nusantaran Indigenous Communities), an Indonesian non-governmental organisation advocating for indigenous peoples’ rights. Our main aim is to enrich the audio documentation and lemmas on the Indonesian version of Wiktionary (Wikikamus) for several endangered languages of more remote indigenous communities of Indonesia. The name combines wiki, as in most other Wikimedia projects, with the Sanskrit word कथा kathā ‘speech, story’, which was later borrowed into Malay/Indonesian as kata ‘word’. Hence, we believe that this programme is not about documenting voices merely as kata (words), but also as voices carrying a kathā (story) about the communities behind them.

The programme is planned for eight regions across Indonesia, covering several languages, and has been running since August 2025. The regions are as follows:

  • Balikpapan, East Kalimantan for Balik language [lbx] (August 2025)
  • Jambi for Orang Rimba [kvb] and Jambi Malay [jax] languages (September 2025)
  • Halmahera, North Moluccas for Isam-Pagu [pgu] and O’Hongana Manyawa (Inner Tobelorese) [tuj] languages (October 2025)
  • Banyuwangi, East Java for Osing language [osi] (planned)
  • Mataram, West Nusa Tenggara for Sasak language [sas] (planned)
  • Tenggarong, East Kalimantan either for Tenggarong [vkt] or Kota Bangun [mqg] varieties of Kutai language (planned)
  • Riau for Sakai language and Talang Mamak variety of Malay language (planned)
  • Tarakan, North Kalimantan for one of the three varieties of Punan language, i.e., Punan Merah [puf], Punan Merap [puc], or Punan Tubu [puj] (planned)

Sessions in three of the regions listed above have been held successfully: Balikpapan (August 2025), Jambi (September 2025), and Halmahera (October 2025). Up to October 2025, we have produced more than 3500 audio files for the five languages covered in those sessions, which previously had only a few lemmas, or none at all, on Wiktionary. The programme is still ongoing.

The programme consists of different activities involving indigenous native speakers and youths, including:

Lemma collection carried out by participants together with the native speakers of the Orang Rimba language (Bangrapip, CC BY-SA 4.0)
  • Lemma collection: This step aims to gather the lemmas before the recording process. Usually, we prepare a separate, well-ordered database of vocabulary from existing references to be verified by the native speakers. Where references for vocabulary are lacking, we provide a list of important words to use as a basis for collecting the words from the native speakers.
  • Audio documentation: The audio documentation is usually carried out with Lingua Libre, an automated online tool that allows us to record audio files and upload them to Wikimedia Commons. In cases where it is unfeasible to use Lingua Libre, manual recording with recording devices may be used instead.
  • Expanding Wiktionary: A mini-training on editing Wiktionary shows the participants, especially the local youth, how to enter the previously collected lemmas into Wiktionary. Such a mini-training is also organised for Wikimedians and other participants from outside the indigenous communities to raise awareness of the existence of those communities.
  • Feedback session: A discussion session with participants is an integral part of the programme, used to exchange ideas and gather advice from the community for further projects.

Challenges encountered

One of the native speakers of Balik language participating in WikiKathā, August 2025 (Wadaihangit, CC BY-SA 4.0)

Despite the promising opportunities of this programme, we have also encountered some challenges during the sessions. So far, these include:

  • Unfamiliarity of native speakers with writing systems: As an automated tool, Lingua Libre allows us to record hundreds of words in a row quickly, simply by pronouncing each word displayed on the screen in turn. However, in the cases we have encountered, many (not all) native speakers, who live mainly in remote and relatively isolated areas and may not have had access to formal education, are not familiar with reading in any writing system. In this case, we have to be more patient and record every single word one by one.
  • Lack of good internet access: Internet access is an integral part of our digital language documentation effort through Wikimedia projects, such as Lingua Libre and Wiktionary. However, in some regions, good internet access is scarce, and it is then unfeasible to use Lingua Libre to make the recordings. The Lingua Libre developer team might consider an offline recording feature that allows us to record without a connection, store the recordings temporarily in local storage, and upload them once we are online again.
  • Lack of a standardised orthography: This obstacle is not specific to the WikiKathā programme; it is a common problem for other digital language documentation projects through Wikimedia projects in Indonesia. Although Indonesia is linguistically diverse, most of its regional languages do not have a standardised orthography. Some published dictionaries use their own orthography, which may differ from the orthography used in other dictionaries of the same language. This makes it challenging to choose the orthography to be used for the lemmas on Wiktionary. In this case, we usually choose either a so-called “dictionary orthography”, which we deem to be the most suitable for the language, or the popular spelling used by many people for daily purposes. We hope that this programme will raise awareness of this problem and that local governmental language institutions will take it into consideration.

Lessons learnt

O’Hongana Manyawa (Inner Tobelorese) youth recording words with the native speakers of the language, Halmahera, October 2025 (Bangrapip, CC BY-SA 4.0)

As previously mentioned, this programme seeks to gather not only kata (words), but also the kathā (stories) behind them. Through this programme, we have learnt various stories about the communities and about why their languages need to be documented and preserved. We hope that this programme can be a starting point for other local heritage preservation projects.

Let us begin with the Balik tribe of East Kalimantan, whose territory is being pushed aside by the development of Indonesia’s new planned capital. This development has displaced some members of the community, so the community has become more scattered. Members who move to other places and live side by side with people of other ethnic groups tend to give up their native tongue in favour of a language more widely accepted in a diverse society, i.e., Indonesian. This decreases the number of speakers of their native tongue, which may lead to its extinction. In addition, the natural ecosystem, which is an integral part of the tribe’s life, is being transformed into a city with buildings, which might also endanger local flora and fauna species. The lesson from the WikiKathā session we organised in East Kalimantan in August 2025 is that there is a strong connection between the Balik language and the natural environment: most of the words we gathered during the session are closely related to the local natural ecosystem, including its flora and fauna species. Hence, we learnt the importance of preserving both the language and the ecosystem as the heritage of Balik forefathers, so that they can survive in the digital age.

The Orang Rimba tribe of Jambi, Sumatra, has its own story. Although they are geographically, linguistically, and historically close to the surrounding Malays, who are predominantly Muslim, the Orang Rimba retain their own distinct local wisdom in their language. During the WikiKathā session organised in Jambi in September 2025, we gathered Orang Rimba words, some of which relate to their traditional local beliefs and rituals. We learnt that despite being linguistically close to the Malays, their isolation in more remote areas makes their customs entirely different from those of their neighbours. Their adherence to their beliefs and rituals, which have become an integral part of the community, signifies a firm relationship between their language and their customs. Hence, we hope that WikiKathā can be a starting point for future projects documenting their customs, in order to expand open and free knowledge.

Two North Moluccan tribes, the Isam (Pagu) people and the O’Hongana Manyawa (Inner Tobelorese, or literally ‘the people of the forest’) people, have a different story. Their languages belong to the West Papuan language family. Ethnologue classifies these two languages as endangered because their youth are shifting to languages deemed more prestigious, either North Moluccan Malay or Indonesian, as the elders of the tribes also stated. The negative attitude of some members of the communities, who feel embarrassed by their languages and deem them ‘backward languages’, only aggravates the situation. Therefore, a digital language documentation effort is needed to prevent, or at least slow down, their extinction. WikiKathā, as a digital language documentation programme that allows the languages to exist on the internet, especially on Wikimedia projects, not only documented the languages but also inspired the communities to feel proud to preserve their ancestors’ heritage. Moreover, this programme inspired one of the participants, who happens to be from another indigenous community, the Tabaru people, to make the same effort for his language.

The stories we learnt differ from one community and region to another, which reflects the diversity across the country. We are sure that there are many more stories from other communities in other regions, and we hope to learn them in the future.

Conclusion

WikiKathā, as a digital language documentation effort by means of Wikimedia projects, is aimed at documenting the languages of indigenous communities that are relatively little known because they live in remote and relatively isolated locations. WikiKathā exists as a means to reach out to the voices of indigenous communities and to help them survive in the digital era. Through this programme, we hope that their representation on the internet and within our digital society becomes more visible and better preserved. Moreover, this programme exists not only to document local heritage, but also to inspire the communities to preserve it. We also hope that this project can be a starting point for various other projects to document the heritage of indigenous communities digitally, since we believe that there is a kathā (story) behind every kata (word). We understand that WikiKathā, as carried out so far, has both strengths and weaknesses in its organisation, and we keep learning from them in order to organise it better. At the end of the day, this effort aims to expand open and free knowledge.

Notes

  1. “Major languages” here refers to languages that are well known or have a vast number of speakers.
  2. “Minor languages” here refers to lesser-known languages of indigenous communities, some of whom live in more remote and relatively isolated areas.
