Term candidate extractor
The foundation for building any dictionary is its headword list, i.e. a list of entries and terms. Specialized corpora are available for an ever-increasing number of scientific fields, and with extraction tools it will be possible to quickly extract terminological candidates from them. Terminologists and domain experts will then review and process these candidates. This simplifies and shortens the initial phase of creating a terminology dictionary.
To work, the extractor needs basic language technologies for Slovene, namely tokenization, lemmatization and morphosyntactic annotation. The statistical evaluation of termhood also requires lemmatized n-gram frequency lists from a reference corpus, which in our case will be Gigafida 2.0.
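The role of the reference frequency lists can be illustrated with a minimal sketch. It assumes a simple log-ratio of relative frequencies as the termhood measure, which is only one common option and is not necessarily the measure the extractor will use. The names `ngrams` and `termhood_scores`, the add-one smoothing, the single `reference_total` shared by all n-gram lengths, and the toy reference counts are all assumptions made for illustration; real counts would come from the reference corpus.

```python
from collections import Counter
from math import log


def ngrams(lemmas, n):
    """Return all contiguous n-grams of a lemma sequence as space-joined strings."""
    return [" ".join(lemmas[i:i + n]) for i in range(len(lemmas) - n + 1)]


def termhood_scores(domain_lemmas, reference_counts, reference_total,
                    max_n=2, min_freq=2):
    """Rank lemmatized n-grams by how much more frequent they are in the
    domain corpus than in the reference corpus (hypothetical measure)."""
    scores = {}
    for n in range(1, max_n + 1):
        counts = Counter(ngrams(domain_lemmas, n))
        total = sum(counts.values())
        for gram, freq in counts.items():
            if freq < min_freq:
                continue  # skip rare n-grams, which give unreliable ratios
            domain_rel = freq / total
            # Add-one smoothing so n-grams unseen in the reference corpus
            # do not cause a division by zero.
            reference_rel = (reference_counts.get(gram, 0) + 1) / (reference_total + 1)
            scores[gram] = log(domain_rel / reference_rel)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: lemmas of a tiny domain text and invented reference counts.
domain_lemmas = ["strokoven", "izraz", "biti", "strokoven", "izraz", "in", "biti"]
reference_counts = {"biti": 1000, "in": 800, "izraz": 5}
ranked = termhood_scores(domain_lemmas, reference_counts, reference_total=10000)
```

In this toy example the bigram "strokoven izraz" ranks first because it never occurs in the invented reference counts, while the common verb "biti" ranks last among the scored candidates.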
The extraction module will contain two basic tools. The first will extract terminological candidates as a list that the user will be able to process further, and the second will highlight terms in texts. When using the automatic terminology extraction system, the user will be able to include publicly available external resources from the national open access infrastructure or use only their own texts.
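As a rough illustration of the second tool, the sketch below marks known terms in running text. It is a simplification under stated assumptions: it matches surface forms with a regular expression, whereas highlighting inflected Slovene forms would require matching on lemmas. The function name `highlight_terms` and the `[[ ]]` markers are hypothetical.

```python
import re


def highlight_terms(text, terms, left="[[", right="]]"):
    """Wrap every occurrence of a known term in marker strings.

    Matching is case-insensitive and on whole words only. Longer terms are
    tried first so that a multi-word term wins over a shorter term it contains.
    """
    if not terms:
        return text
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: left + match.group(0) + right, text)


marked = highlight_terms(
    "The term extractor finds each term.",
    ["term", "term extractor"],
)
```

Sorting the terms by length before building the pattern is a deliberate choice here: regular expression alternation takes the first alternative that matches, so without it "term" would be marked inside "term extractor".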
The term candidate extraction module will be available on the terminology portal, while the extractor itself will be available either as an online service or as a local download via GitHub, which will include the code and all instructions.