SerbianStemmer

Stemmer for the Serbian language, created for my master's thesis and rewritten in Python


Stemmer for Serbian language

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. Presented here is a suffix-stripping stemmer for Serbian, a highly inflectional language.

Many natural language processing tasks require that all forms of a word with the same meaning be mapped to the same form. This process of fitting words to usable forms is called normalization. Several techniques can perform normalization, such as equivalence classes, case-folding, stemming and lemmatization.
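As a small illustration of the simpler techniques, here is a minimal sketch of case-folding combined with equivalence classes. The `EQUIVALENCE_CLASSES` table and the `normalize` function are hypothetical names chosen for this example and are not part of the stemmer itself.

```python
# Hypothetical equivalence classes, for illustration only:
# each known variant maps to one canonical representative.
EQUIVALENCE_CLASSES = {
    "u.s.a.": "usa",
    "u.s.": "usa",
}


def normalize(token: str) -> str:
    """Case-fold a token, then map it to its equivalence-class representative."""
    folded = token.casefold()
    return EQUIVALENCE_CLASSES.get(folded, folded)


tokens = ["Beograd", "BEOGRAD", "beograd"]
normalized = [normalize(t) for t in tokens]
print(normalized)
```

After normalization the three spellings collapse to a single form, so they count as the same term. Stemming goes one step further by also collapsing different inflected forms of the same word.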

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.

History of stemming

The first published stemming algorithm was written by Julie Beth Lovins in 1968. The most popular stemmer for English was written by Martin Porter in 1980. It became the de facto standard for English stemming, and in 2000 Martin Porter received the Tony Kent Strix award for his work on stemming and information retrieval. Until 2000, almost all work on stemming concerned English, since 60% of the World Wide Web is in English. After 2000, a need emerged to build natural language processing applications for other languages as well. In 2000, Martin Porter published his work on the Snowball framework for creating stemming algorithms, together with stemming algorithms for several languages. Rules for stemming Serbian are still not precisely defined.

Algorithm

This is a rule-based stemmer for Serbian. It is based on the stemmer by Kešelj and Šipka, whose rules we reduced and improved (from 1000 rules to 300). The algorithm is available in PHP and Python; only the Python version is available on GitHub. For more information about the algorithm, and for the PHP version, please check this paper. If you use this stemmer, please cite this paper:

Milošević, Nikola. "Stemmer for Serbian language." arXiv preprint arXiv:1209.4471 (2012).

@article{milovsevic2012stemmer,
  title={Stemmer for Serbian language},
  author={Milo{\v{s}}evi{\'c}, Nikola},
  journal={arXiv preprint arXiv:1209.4471},
  year={2012}
}
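To give a feel for how a rule-based suffix stemmer works in general, here is a minimal sketch. The `RULES` table, the `MIN_STEM_LENGTH` threshold and the `stem` function are all assumptions made for this illustration; they are not the roughly 300 rules or the actual code of this project, which are described in the paper.

```python
# Hypothetical rules for illustration only; NOT the rule set from the paper.
# Each rule maps a word ending to the text that replaces it.
RULES = {
    "ovima": "",
    "ama": "",
    "ima": "",
    "om": "",
    "e": "",
    "a": "",
    "u": "",
}

# Assumed guard so that very short words are not reduced to nothing.
MIN_STEM_LENGTH = 3


def stem(word: str) -> str:
    """Apply the longest matching suffix rule that leaves a long enough stem."""
    word = word.lower()
    # Try longer suffixes first so the most specific rule wins.
    for suffix in sorted(RULES, key=len, reverse=True):
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + RULES[suffix]
            if len(candidate) >= MIN_STEM_LENGTH:
                return candidate
    # No rule applied: return the (lowercased) word unchanged.
    return word


for form in ["grada", "gradu", "gradom", "gradovima"]:
    print(form, "->", stem(form))
```

With these toy rules, the four inflected forms of "grad" (city) all reduce to the same stem. A real rule set has to be much larger and more careful, because a short list like this one will also strip endings from words where they are part of the stem.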

Requirements

NLTK is necessary for tokenization. You can download it from here.
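As a rough sketch of the tokenization step, the snippet below splits a Serbian sentence into tokens with NLTK's `wordpunct_tokenize`. Which NLTK tokenizer this project actually calls is not stated here, so that choice is an assumption; `wordpunct_tokenize` was picked because it is regex-based and needs no extra data downloads. The fallback branch is only there so the example still runs where NLTK is not installed.

```python
import re

try:
    # Regex-based tokenizer from NLTK; requires no additional corpora.
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Fallback for environments without NLTK, using the same pattern
    # that NLTK's WordPunctTokenizer is built on.
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)


sentence = "Ovo je primer rečenice."
tokens = wordpunct_tokenize(sentence)
print(tokens)
```

Each resulting token can then be passed to the stemmer one at a time; punctuation tokens would normally be filtered out first.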

Authors and Contributors

This project was created by Nikola Milosevic (@nikolamilosevic86) as his master's project at the School of Electrical Engineering, University of Belgrade.

Support or Contact

Having trouble with this project? Contact us at nikola.milosevic@manchester.ac.uk and we’ll help you sort it out.