By Medari J Tham
Language not only reflects our identity but also serves as a medium through which our culture is transmitted and preserved. In this technological age, where connectivity has no borders, language usage has expanded exponentially. For instance, WhatsApp alone handles nearly 100 billion messages daily. This raises an important question: how has Khasi fared on digital platforms compared to other languages in India and across the globe? Understandably, English continues to dominate, accounting for about 49.1 per cent of all websites, followed by Spanish, German, and Japanese, each at six per cent or below. In the Indian context, Hindi accounts for just 0.1 per cent of web content, while Khasi’s online presence remains negligible.
Although Khasi was removed from UNESCO’s list of endangered languages in 2012 and declared “safe” due to its widespread use in education, media and religious domains, it is imperative that language preservation and propagation efforts accelerate amid this global surge in digital language use. Being aware of the current status of Khasi’s digital presence and tapping into the opportunities offered by language technology and open knowledge platforms will help ensure that the language continues to thrive as an integral part of our identity.
The presence of online Khasi newspapers, news channels, and blogs is encouraging and indicates that language usage is vibrant and active. Khasi has also made its mark on social media through platforms such as YouTube, Facebook, and Instagram. YouTube statistics are especially striking: Batesi TV has about 11 lakh subscribers and 97 crore total views; T7 News Channel has around 7.7 lakh subscribers and 58 crore views; U Nongsaiñ Hima has a little more than 7 lakh subscribers and 54 crore views.
The growing field of Natural Language Processing (NLP), where machines learn to understand human language using artificial intelligence and linguistics, has transformed the digital landscape of languages worldwide. English, being the technological lingua franca, has advanced rapidly, with tools such as spell and grammar checkers, automatic speech transcription for video subtitles, voice assistants like Siri, Google Assistant, and Alexa, and conversational AIs like ChatGPT and Gemini enhancing writing and communication. Given these developments, it is important to ask to what extent these technological advancements have influenced the Khasi language and the Khasi-speaking community, and what opportunities lie ahead for language preservation in the digital era.
ChatGPT now supports dozens of major world languages and can attempt to understand and generate text in many others, including Khasi, though with varying reliability. Other conversational AIs—such as Google’s Gemini, DeepSeek, and Meta AI—also provide some level of Khasi support, each with different degrees of fluency and accuracy. Google Translate, one of the world’s largest multilingual translation systems with over 500 million daily users and support for more than 100 languages, added Khasi to its platform in June 2024. While the platform offers many features—such as camera translation, speech translation, document and website translation, offline support, phrasebook functions, and transcription—Khasi currently benefits mainly from text-based translation. This is due to the limited availability of digital data for Khasi, making it a low-resource language from an AI perspective.
There is, however, a growing number of native researchers contributing to Khasi language technology through tools such as part-of-speech taggers (which automatically assign grammatical tags to words), shallow parsers (which identify noun phrases and verb phrases), automatic text summarization, and English-to-Khasi machine translation systems. The backbone of conversational AI and translation technology is the Large Language Model, which relies on massive datasets. For languages such as English, German, and Mandarin, available digital resources amount to billions or trillions of tokens collected from the Internet. This abundance has been a decisive factor in enabling models like ChatGPT and Gemini to perform strongly in these languages.
Corpus development for Khasi through building machine-readable collections of texts and speech has gained momentum in recent years. Researchers are increasingly recognizing the need for large, diverse, and openly accessible datasets to support modern AI tools. For Khasi, continued growth in corpus size, diversity, and availability will be crucial for improving translation systems, speech technologies, and future NLP applications.
In April 2025, the Government of Meghalaya signed an MoU with the Digital India Bhashini Division to integrate Khasi and Garo into the Bhashini platform. Bhashini, an initiative under the Ministry of Electronics & Information Technology (MeitY), aims to leverage AI-powered language technologies to build an inclusive digital ecosystem that transcends linguistic barriers. This partnership marks a promising step toward the inclusion of non-scheduled languages alongside the 22 scheduled languages already supported by the platform. Bhashini offers services such as text-to-text translation, video translation, voice translation, text-to-speech, speech-to-text, and other AI-enabled features. While Khasi is already listed as a supported language in the mobile application, the actual functionality currently available for Khasi is limited to text-based translation.
One of Bhashini’s key initiatives is community-driven crowdsourcing, where volunteers contribute speech samples, translations, and validations of AI-generated outputs. For Khasi, crowdsourcing could involve thousands of speakers providing English–Khasi translations, recording speech samples, and validating AI outputs. Community participation will accelerate corpus development and ensure inclusivity by representing diverse accents, dialects, and natural speech patterns. Such efforts can significantly reduce the time required to develop language resources.
Another promising opportunity for Khasi speakers lies in contributing to Open Knowledge platforms. The recently concluded Bahu Bhasa 2025 festival, organized by Open Knowledge Initiatives (OKI) at IIIT Hyderabad, brought together experts working on language diversity and equitable access to knowledge. OKI is actively engaged in strengthening and expanding the open knowledge and technology ecosystem. Discussions explored how policy, governance, and digital technologies influence language use. One session highlighted how AI tools such as aakhor AI have empowered Assamese users with digital typing, content creation, and inclusive knowledge access. Other discussions showcased community-driven projects in documentation, preservation, and revitalization, with examples such as Wiki Loves and Bengali Wikisource, both part of the Wikimedia movement, one of the world’s most remarkable volunteer-driven efforts to make knowledge freely accessible to everyone.
In another session, the Language Technologies Research Center (LTRC), IIIT Hyderabad, demonstrated its work on machine translation among Indian languages, including Khasi. The Science Gallery Bengaluru (SGB), part of the Science Gallery International Network pioneered by Trinity College Dublin, highlighted the importance of public engagement in science—moving beyond traditional museum formats by blending art, design, technology, and research.
Another strong example of community-driven engagement is OSM Kerala, an OpenStreetMap community in Kerala built on the principle that mapping should belong to everyone. It is a vibrant and active volunteer group contributing to the mapping of Kerala’s geography, infrastructure, and cultural landmarks on OpenStreetMap, the world’s largest open, crowdsourced mapping platform. The common message across these discussions was clear: digital inclusion, language preservation, and equitable access for small languages are achievable when communities take the lead.
In this context, the State government’s hosting of a Regional AI Impact Conference on December 3, 2025 is timely. It presents a unique opportunity to accelerate the digital presence of Khasi (and Garo), strengthen AI language tools, and empower students, researchers, and the wider public.
(Dr. Medari J. Tham works in Language Technology and can be reached at [email protected])





