Saturday, May 4, 2024

PhD on natural language processing for Khasi language


SHILLONG, April 5: Medari Janai Tham, who recently received her doctorate from the Department of Computer Science and Engineering, Assam Don Bosco University, has created a Khasi annotated corpus. Her thesis, ‘Shallow Parsing for Khasi’, was completed under the supervision of Prof Pushpak Bhattacharyya of IIT Bombay.
Natural language processing, or NLP, is the application of computational techniques to the analysis and synthesis of human language, both speech and text.
The development of a corpus, which is a collection of machine-readable text sampled to be representative of a particular language, is an essential step in building NLP systems for a language.
Such corpora exist for languages such as English, German, Chinese, Hindi, Bengali, Punjabi, etc. However, not all of these corpora are easily accessible.
For English, the British National Corpus (BNC) is among the most widely used corpora and is popular with researchers because of its accessibility.
For Khasi, no such corpora have been publicly available, which is why it is referred to as a resource-poor language insofar as NLP applications are concerned.
A major contribution in this field has been made with the release of the Khasi annotated corpus titled “Tham Khasi annotated corpus”, which is freely accessible through the European Language Resources Association (ELRA) via the link http://catalog.elra.info/en-us/repository/browse/ELRA-W0321/.
The corpus is manually tagged using a BIS (Bureau of Indian Standards) POS (part-of-speech) tagset formulated for Khasi, so that the tagging is standardised with that of other Indian languages.
The details of the corpus, including the annotation scheme and the development of the Khasi NLP tools, are given in research papers published as part of Tham’s PhD and available on https://grammarkhasi.in, the companion website of the book Ka Grammar Khasi Da Ka Jingdr by the same author, published by Macmillan Education, India.
Other contributions from the same scholar include the BIS Khasi tagset, a hybrid Khasi POS tagger, an HMM Khasi POS tagger, an NLTK Khasi POS tagger, an HMM Khasi shallow parser, and a Khasi shallow parser using a bidirectional gated recurrent unit, as well as a seminar report on ‘Towards Standardization of Khasi language for Computational Purposes’ available on the aforementioned website.
Some of the NLP tools for Khasi are available online at https://medaritham.pythonanywhere.com, where users and researchers can enter any Khasi sentence and check the output of the taggers and parser.
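For readers unfamiliar with how an HMM (hidden Markov model) POS tagger works in general, the following is a minimal sketch in Python. It is not Tham's implementation. The training sentences are placeholder English tokens rather than data from the Tham Khasi annotated corpus, the tag names (PR_PRP, V_VM, N_NN) are BIS-style labels assumed for illustration, and the function names train_hmm and viterbi are hypothetical. The tagger counts tag-to-tag transitions and tag-to-word emissions in a tagged corpus, then uses the Viterbi algorithm to pick the most probable tag sequence for a new sentence.

```python
import math
from collections import Counter, defaultdict

# Toy training data (assumption): placeholder English tokens with
# BIS-style tags. NOT taken from the Tham Khasi annotated corpus.
TRAIN = [
    [("she", "PR_PRP"), ("eats", "V_VM"), ("rice", "N_NN")],
    [("he", "PR_PRP"), ("reads", "V_VM"), ("books", "N_NN")],
    [("children", "N_NN"), ("eat", "V_VM"), ("rice", "N_NN")],
]

START = "<s>"  # pseudo-tag marking the start of a sentence


def train_hmm(sentences):
    """Count tag transitions and word emissions from tagged sentences."""
    trans = defaultdict(Counter)  # trans[previous_tag][tag] -> count
    emit = defaultdict(Counter)   # emit[tag][word] -> count
    vocab = set()
    for sent in sentences:
        prev = START
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags = sorted(emit)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unseen words

    # best[i][tag] = (log score of best path ending in tag, previous tag)
    best = [{
        t: (log_prob(trans[START], t, n_tags)
            + log_prob(emit[t], words[0], n_words), None)
        for t in tags
    }]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = log_prob(emit[t], words[i], n_words)
            column[t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, n_tags) + e, p)
                for p in tags
            )
        best.append(column)

    # Follow back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


trans, emit, vocab = train_hmm(TRAIN)
print(viterbi(["she", "eats", "rice"], trans, emit, vocab))
# -> ['PR_PRP', 'V_VM', 'N_NN']
```

A real tagger would be trained on a manually tagged corpus such as the one described above, where the quality and size of the annotations largely determine accuracy. Add-one smoothing is used here only to keep the sketch short.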
