Thursday, November 21, 2024

PhD on natural language processing for Khasi language


SHILLONG, April 5: Medari Janai Tham, who recently received her doctorate from the Department of Computer Science and Engineering, Assam Don Bosco University, for her thesis ‘Shallow Parsing for Khasi’, completed under the supervision of Prof Pushpak Bhattacharyya of IIT Bombay, has created a Khasi annotated corpus.
Natural language processing, or NLP, is the application of computational techniques to the analysis and synthesis of human language, both speech and text.
The development of a corpus, a collection of machine-readable text sampled to be representative of a particular language, is an essential step in building NLP systems for that language.
Such corpora exist for languages such as English, German, Chinese, Hindi, Bengali, Punjabi, etc. However, not all of these corpora are easily accessible.
For English, the most widely used corpus is the British National Corpus (BNC), which is popular among researchers because of its accessibility.
For Khasi, there have been no such publicly available corpora, and hence it is referred to as a resource-poor language insofar as NLP applications are concerned.
A major contribution in this field has been made with the release of the Khasi annotated corpus titled “Tham Khasi annotated corpus”, which is freely accessible through the European Language Resources Association (ELRA) via the link http://catalog.elra.info/en-us/repository/browse/ELRA-W0321/.
The corpus is manually tagged using the BIS (Bureau of Indian Standards) POS (part-of-speech) tagset formulated for Khasi, so that the tagging is standardised with that of other Indian languages.
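For readers unfamiliar with POS-annotated text, the following is a minimal sketch of how such a corpus can be read and summarised. It is not taken from the Tham corpus: the `token/TAG` layout, the English placeholder tokens, and the function names `parse_tagged_line` and `tag_counts` are all assumptions made for illustration, and the tag labels merely follow the general BIS naming style (for example `N_NN` for a common noun and `V_VM` for a main verb).

```python
from collections import Counter

# Hypothetical sample in "token/TAG" form. The English tokens are placeholders,
# and the file layout is an assumption, not the actual format of the corpus.
SAMPLE = """\
the/DM_DMD boy/N_NN eats/V_VM rice/N_NN ./RD_PUNC
she/PR_PRP sings/V_VM ./RD_PUNC
"""


def parse_tagged_line(line):
    """Split one 'token/TAG token/TAG ...' line into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last slash only, so tokens containing '/' survive.
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


def tag_counts(text):
    """Count how often each tag occurs across all non-empty lines."""
    counts = Counter()
    for line in text.splitlines():
        if line.strip():
            counts.update(tag for _, tag in parse_tagged_line(line))
    return counts


sentences = [parse_tagged_line(l) for l in SAMPLE.splitlines() if l.strip()]
counts = tag_counts(SAMPLE)
print(sentences[0])
print(counts.most_common())
```

A tag frequency table of this kind is a common first check on an annotated corpus, since it shows at a glance which categories dominate and which are rare.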
Details of the corpus, including the annotation scheme and the development of the Khasi NLP tools, are given in research papers published as part of Tham’s PhD and hosted on https://grammarkhasi.in, the companion website of the book Ka Grammar Khasi Da Ka Jingdr  by the same author, published by Macmillan Education, India.
Other contributions from the same scholar include the BIS Khasi tagset, a hybrid Khasi POS tagger, an HMM (hidden Markov model) Khasi POS tagger, an NLTK Khasi POS tagger, an HMM Khasi shallow parser, a Khasi shallow parser using a bidirectional gated recurrent unit, and a seminar report on ‘Towards Standardization of Khasi language for Computational Purposes’, available on the aforementioned website.
Some of the NLP tools for Khasi are available online at https://medaritham.pythonanywhere.com, where users and researchers can run any Khasi sentence and check the output of the taggers and parser.
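To give a sense of how an HMM POS tagger works in general, here is a minimal sketch of the standard technique: transition and emission counts are collected from tagged sentences, and the Viterbi algorithm then picks the most probable tag sequence for a new sentence. This is not Tham's implementation. The toy training data uses English placeholder tokens with BIS-style tag labels, add-one smoothing is a simplifying assumption, and the names `train_hmm`, `log_prob` and `viterbi` are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical toy data: English placeholder tokens with BIS-style tags.
TRAIN = [
    [("the", "DM_DMD"), ("boy", "N_NN"), ("eats", "V_VM"), ("rice", "N_NN")],
    [("she", "PR_PRP"), ("eats", "V_VM"), ("rice", "N_NN")],
    [("the", "DM_DMD"), ("girl", "N_NN"), ("sings", "V_VM")],
]


def train_hmm(sentences):
    """Collect tag-to-tag transition counts and tag-to-word emission counts."""
    trans = defaultdict(lambda: defaultdict(int))  # trans[prev_tag][tag]
    emit = defaultdict(lambda: defaultdict(int))   # emit[tag][word]
    vocab = set()
    for sent in sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab


def log_prob(table, given, event, n_outcomes):
    """Add-one smoothed log P(event | given)."""
    row = table.get(given, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + 1) / (total + n_outcomes))


def viterbi(words, trans, emit, tags, vocab):
    """Return the most probable tag sequence for `words` under the HMM."""
    if not words:
        return []
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot reserves mass for unseen words
    # best[i][tag] = (log-prob of best path ending in tag at i, previous tag)
    best = [{}]
    for t in tags:
        score = (log_prob(trans, "<s>", t, n_tags)
                 + log_prob(emit, t, words[0], n_words))
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            prev_tag, score = max(
                ((p, best[i - 1][p][0] + log_prob(trans, p, t, n_tags))
                 for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = (score + log_prob(emit, t, words[i], n_words), prev_tag)
    # Follow back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return list(reversed(path))


trans, emit, tags, vocab = train_hmm(TRAIN)
predicted = viterbi(["the", "girl", "eats", "rice"], trans, emit, tags, vocab)
print(predicted)
```

On this toy data the sentence "the girl eats rice", which does not appear in the training set, is tagged `DM_DMD N_NN V_VM N_NN`. A real tagger would be trained on the full annotated corpus and would need a more careful treatment of unknown words than add-one smoothing.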
