Thursday, July 17, 2025

PhD on natural language processing for Khasi language


SHILLONG, April 5: Medari Janai Tham, who recently received her doctorate from the Department of Computer Science and Engineering, Assam Don Bosco University, for her thesis ‘Shallow Parsing for Khasi’, completed under the supervision of Prof Pushpak Bhattacharyya of IIT Bombay, has created a Khasi annotated corpus.
Natural language processing, or NLP, is the application of computational techniques to the analysis and synthesis of human language, both speech and text.
The development of a corpus, a collection of machine-readable text sampled to be representative of a particular language, is an essential step in building NLP systems for that language.
Such corpora exist for languages such as English, German, Chinese, Hindi, Bengali, Punjabi, etc. However, not all of these corpora are easily accessible.
In English, the most widely used corpus is the British National Corpus (BNC) and it is popular among researchers due to its accessibility.
Where Khasi is concerned, there are no such publicly available corpora, and hence it is referred to as a resource-poor language insofar as the application of NLP is concerned.
A major contribution in this field has been made with the release of the Khasi annotated corpus titled “Tham Khasi annotated corpus”, which is freely accessible through the European Language Resources Association (ELRA) via the link http://catalog.elra.info/en-us/repository/browse/ELRA-W0321/.
The corpus is manually tagged using the BIS (Bureau of Indian Standards) POS (Part-of-Speech) tagset formulated for Khasi, so that its tagging is standardised with that of other Indian languages.
Details of the corpus, including the annotation scheme and the development of the Khasi NLP tools, appear in research papers published as part of Tham’s PhD and are available at https://grammarkhasi.in, the companion website of the book Ka Grammar Khasi Da Ka Jingdr by the same author, published by Macmillan Education, India.
Other contributions from the same scholar include the BIS Khasi tagset, a hybrid Khasi POS tagger, an HMM Khasi POS tagger, an NLTK Khasi POS tagger, an HMM Khasi shallow parser, a Khasi shallow parser using a bi-directional gated recurrent unit, and a seminar report on ‘Towards Standardization of Khasi language for Computational Purposes’, available on the aforementioned website.
Some of the NLP tools for Khasi are available online at https://medaritham.pythonanywhere.com, where users and researchers can run any Khasi sentence and verify the response of the taggers and parser.
