Design and Development of an Application for Sindhi Language Information Retrieval System

Authors

  • Mohsin Raza Shah, Assistant Professor, Computer Science, College Education Department, Government of Sindh
  • Amjad Ali Mahesar, Lecturer, College Education Department, Government of Sindh

Abstract

This paper presents a system that integrates tokenization, rule-based stemming, and indexing for efficient document retrieval. Sindhi is an Indo-Aryan language spoken by some 44.8 million people worldwide, yet it remains one of the most computationally under-resourced languages in South Asia. Although the volume of online Sindhi-language content is large and growing, a complete automatic Information Retrieval (IR) system for Sindhi is still needed. This paper describes the design, development, and evaluation of Sindhi IRS, the first automatic Information Retrieval System for the Sindhi language. The system combines five major NLP preprocessing modules: Unicode normalization, tokenization, stop-word removal, morphological stemming, and inverted-index document indexing. Its core component is the Sindhi Rule-Based Stemmer (SRBS), which uses a vocabulary of 5,327 words and 38 linguistic rules to reduce inflected Sindhi words to their roots with 84.85% accuracy (Shah, 2016). The system ranks documents using TF-IDF weighting and cosine similarity, and it was tested on a corpus of 86,733 Sindhi words drawn from various types of documents. Experimental results show that Sindhi IRS reaches a mean average precision of 0.6538 on single-word queries and that its performance scales well to two-word and sentence-length queries. The system provides foundational infrastructure for future Sindhi NLP applications such as question answering, sentiment analysis, text summarization, and machine translation.

Keywords: Sindhi Language, Information Retrieval System, NLP, Stemming, Morphological Analysis, Indexing, TF-IDF, Tokenization, Stop-Word Removal, Low-Resource Language Processing
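
The pipeline summarized in the abstract (normalization, tokenization, stop-word removal, stemming, inverted indexing, then TF-IDF ranking with cosine similarity) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the stop-word set, the suffix list, the whitespace tokenizer, the smoothed IDF formula, and the toy Latin-script documents are all assumptions chosen for illustration, and the two placeholder suffixes stand in for the 38 SRBS rules, which are not reproduced here.

```python
import math
import unicodedata
from collections import Counter, defaultdict

# Hypothetical placeholders, NOT the paper's stop-word list or SRBS rules.
STOP_WORDS = {"the", "of"}
SUFFIXES = ("ing", "s")


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Unicode-normalize, tokenize on whitespace, drop stop words, stem."""
    text = unicodedata.normalize("NFC", text)
    tokens = text.lower().split()
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(preprocess(text)).items():
            index[term][doc_id] = tf
    return index


def rank(query, index, n_docs):
    """Return (doc_id, score) pairs sorted by cosine similarity of TF-IDF vectors."""
    # Smoothed IDF (an assumption) so terms found in every document keep weight.
    idf = {t: math.log(n_docs / len(p)) + 1.0 for t, p in index.items()}

    # Squared Euclidean norm of each document's TF-IDF vector.
    doc_norm_sq = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            doc_norm_sq[doc_id] += (tf * idf[term]) ** 2

    # Query vector restricted to terms that exist in the index.
    q_weights = {
        t: tf * idf[t]
        for t, tf in Counter(preprocess(query)).items()
        if t in index
    }
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    if q_norm == 0:
        return []

    # Accumulate dot products by walking only the relevant posting lists.
    dots = defaultdict(float)
    for term, q_w in q_weights.items():
        for doc_id, tf in index[term].items():
            dots[doc_id] += q_w * tf * idf[term]

    scores = {
        d: dot / (math.sqrt(doc_norm_sq[d]) * q_norm) for d, dot in dots.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy documents in Latin script, used only to exercise the code path.
DOCS = {
    "d1": "rivers of sindh",
    "d2": "the river indus flows",
    "d3": "poetry readings",
}
INDEX = build_index(DOCS)
for doc_id, score in rank("river", INDEX, len(DOCS)):
    print(doc_id, round(score, 3))
```

In this toy run the query matches d1 and d2 only, and d1 ranks first because its vector has fewer competing terms. A real Sindhi system would replace the whitespace tokenizer and suffix list with script-aware tokenization and the stemmer's actual rule set.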

Published

2026-03-29

How to Cite

Mohsin Raza Shah, & Amjad Ali Mahesar. (2026). Design and Development of an Application for Sindhi Language Information Retrieval System. `, 5(01), 3130–3140. Retrieved from https://assajournal.com/index.php/36/article/view/1654