Tuesday, May 13, 2025

PhD on natural language processing for Khasi language


SHILLONG, April 5: Doctorates are many, but Medari Janai Tham, who recently received hers from the Department of Computer Science and Engineering, Assam Don Bosco University, has gone a step further and created a Khasi annotated corpus. Her thesis, ‘Shallow Parsing for Khasi’, was completed under the supervision of Prof Pushpak Bhattacharyya of IIT Bombay.
Natural language processing, or NLP, is the application of computational techniques to the analysis and synthesis of human language, both speech and text.
The development of a corpus, a collection of machine-readable text sampled to be representative of a particular language, is an essential step in building NLP systems for that language.
Such corpora exist for languages such as English, German, Chinese, Hindi, Bengali, Punjabi, etc. However, not all of these corpora are easily accessible.
In English, the most widely used corpus is the British National Corpus (BNC) and it is popular among researchers due to its accessibility.
Where Khasi is concerned, there have been no such publicly available corpora, and hence it is referred to as a resource-poor language insofar as the application of NLP is concerned.
A major contribution in this field has been made with the release of the Khasi annotated corpus titled “Tham Khasi annotated corpus”, which is freely accessible through the European Language Resources Association (ELRA) via the link http://catalog.elra.info/en-us/repository/browse/ELRA-W0321/.
The corpus is manually tagged using the BIS (Bureau of Indian Standards) POS (part-of-speech) tagset formulated for Khasi, so that its tagging is standardised with that of other Indian languages.
The details of the corpus, including the annotation scheme and the development of the Khasi NLP tools, are set out in research papers published as part of Tham’s PhD and available on https://grammarkhasi.in, the companion website of the book Ka Grammar Khasi Da Ka Jingdr  by the same author, published by Macmillan Education, India.
Other contributions from the same scholar include the BIS Khasi tagset, a hybrid Khasi POS tagger, an HMM Khasi POS tagger, an NLTK Khasi POS tagger, an HMM Khasi shallow parser, a Khasi shallow parser using a bi-directional gated recurrent unit, and a seminar report on ‘Towards Standardization of Khasi language for Computational Purposes’, available on the aforementioned website.
Some of the NLP tools for Khasi are available online at https://medaritham.pythonanywhere.com, where users and researchers can enter any Khasi sentence and check the output of the taggers and parser.
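For readers unfamiliar with what a POS-tagged corpus makes possible, the sketch below shows a most-frequent-tag baseline tagger learned from a handful of tagged sentences. It is only an illustration and not Tham's method: the taggers named in the article (HMM, hybrid, NLTK-based) are more sophisticated. The sample sentences are hypothetical English placeholders rather than Khasi text from the corpus, the tag labels are written in the style of the BIS tagset, and the function names `train_unigram_tagger` and `tag` are invented for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: English placeholder tokens with BIS-style tag labels.
# This is NOT taken from the Tham Khasi annotated corpus.
TAGGED_SENTENCES = [
    [("she", "PR_PRP"), ("reads", "V_VM"), ("books", "N_NN")],
    [("he", "PR_PRP"), ("reads", "V_VM"), ("letters", "N_NN")],
    [("books", "N_NN"), ("help", "V_VM"), ("children", "N_NN")],
]


def train_unigram_tagger(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag_label in sentence:
            counts[word.lower()][tag_label] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(tokens, model, default_tag="N_NN"):
    """Tag each token; unseen words fall back to an assumed default tag."""
    return [(tok, model.get(tok.lower(), default_tag)) for tok in tokens]


model = train_unigram_tagger(TAGGED_SENTENCES)
tagged = tag(["she", "reads", "letters"], model)
print(tagged)
```

A baseline of this kind ignores context entirely; HMM taggers such as those mentioned above additionally model how likely one tag is to follow another.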
