Sunday, December 15, 2024

PhD on natural language processing for Khasi language

SHILLONG, April 5: Medari Janai Tham, who recently received her doctorate from the Department of Computer Science and Engineering, Assam Don Bosco University, has created a Khasi annotated corpus. Her thesis, ‘Shallow Parsing for Khasi’, was completed under the supervision of Prof Pushpak Bhattacharyya of IIT Bombay.
Natural language processing, or NLP, is the application of computational techniques to the analysis and synthesis of human language, both speech and text.
The development of a corpus, a collection of machine-readable text sampled to be representative of a particular language, is an essential step in building NLP systems for that language.
Such corpora exist for languages such as English, German, Chinese, Hindi, Bengali, Punjabi, etc. However, not all of these corpora are easily accessible.
For English, the most widely used corpus is the British National Corpus (BNC), which is popular among researchers because of its accessibility.
For Khasi, there have been no such publicly available corpora, and hence it is referred to as a resource-poor language as far as the application of NLP is concerned.
A major contribution in this field has been made with the release of the Khasi annotated corpus titled “Tham Khasi annotated corpus”, which is freely accessible through the European Language Resources Association (ELRA) via the link http://catalog.elra.info/en-us/repository/browse/ELRA-W0321/.
The corpus is manually tagged using the BIS (Bureau of Indian Standards) part-of-speech (POS) tagset formulated for Khasi, so that the tagging is standardised with that of other Indian languages.
The details of the corpus, including the annotation scheme and the development of the Khasi NLP tools, are set out in research papers published as part of Tham’s PhD. They are available on https://grammarkhasi.in, the companion website of the book Ka Grammar Khasi Da Ka Jingdr by the same author, published by Macmillan Education, India.
Other contributions from the same scholar include the BIS Khasi tagset, a hybrid Khasi POS tagger, an HMM Khasi POS tagger, an NLTK Khasi POS tagger, an HMM Khasi shallow parser, a Khasi shallow parser using a bi-directional gated recurrent unit, and a seminar report on ‘Towards Standardization of Khasi language for Computational Purposes’, available on the aforementioned website.
Some of the NLP tools for Khasi are available online at https://medaritham.pythonanywhere.com, where users and researchers can run any Khasi sentence and check the response of the taggers and parser.
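For readers curious how a manually tagged corpus feeds a POS tagger, the idea can be sketched with a very simple baseline that assigns each word the tag it most often carries in the training data. This is only an illustration and not the method used in Tham's hybrid, HMM or NLTK taggers. The tokens are placeholders rather than real Khasi words, the tag labels are merely written in the style of BIS tags, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Learn, for every token, the tag it carries most often in the corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for token, tag in sentence:
            counts[token][tag] += 1
    # Keep only the single most frequent tag per token.
    return {token: tags.most_common(1)[0][0] for token, tags in counts.items()}


def tag_tokens(model, tokens, default_tag="UNK"):
    """Tag each token with its learned tag, or default_tag if it was never seen."""
    return [(token, model.get(token, default_tag)) for token in tokens]


# Toy hand-tagged corpus: placeholder tokens and illustrative tag labels only.
toy_corpus = [
    [("tok_a", "N_NN"), ("tok_b", "V_VM")],
    [("tok_a", "N_NN"), ("tok_c", "JJ")],
    [("tok_b", "V_VM"), ("tok_a", "V_VM")],
]

model = train_unigram_tagger(toy_corpus)
result = tag_tokens(model, ["tok_a", "tok_b", "tok_z"])
print(result)
```

A baseline of this kind ignores context entirely, which is why statistical taggers such as HMM models, which also weigh the sequence of neighbouring tags, are generally preferred once a tagged corpus is available.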