Software tool and training datasets for annotation of Slovene texts
Natural language processing rests on tools that ascribe linguistic features to texts: automated procedures segment texts into tokens and sentences, assign each word its base form (lemma), part-of-speech tag and morphosyntactic features, and, at a higher level, add information on syntactic relations, semantic roles in sentences and the like. The underlying technologies for these tools are still changing: deep neural networks have had great success recently but are being overtaken by large pre-trained deep language models for contextual word embeddings. In the project, we will upgrade the existing annotation tools for Slovene at the listed levels and unite them into a single open-source tool that connects the annotation to the current annotation pipeline.
Annotation tools are developed with the help of training datasets, in which corpus texts are manually assigned the types of information that the program is then expected to assign automatically. The training corpus available for Slovene is ssj500k, which currently comprises 500,000 words. It is manually annotated at the levels of tokenization, sentence segmentation, morphosyntactic descriptors (MSDs) and lemmas. About half of the corpus is also annotated for dependency syntax following the JOS and Universal Dependencies systems, for named entities and for verb phrases, and about a quarter of it for semantic roles. In the project, we will expand the training corpus to 1,000,000 words, manually annotate the additional corpus texts, and finally add coreference and relation annotation, which is important for language processing at the semantic level.
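To make the annotation levels concrete, the following minimal sketch reads one sentence written in CoNLL-U, the tab-separated format used by Universal Dependencies, and prints the lemma, part-of-speech tag, MSD and dependency relation of each token. The sample sentence and its tags are hand-written for illustration and are not taken from ssj500k; the names Token, parse_feats and parse_sentence are hypothetical helpers, not part of any project tool.

```python
from dataclasses import dataclass

# Hand-written illustrative sample in CoNLL-U layout (assumption: NOT from ssj500k).
# Columns: ID, FORM, LEMMA, UPOS, XPOS (here a JOS-style MSD), FEATS, HEAD, DEPREL, DEPS, MISC
SAMPLE = "\n".join([
    "# text = Pes laja.",
    "1\tPes\tpes\tNOUN\tNcmsn\tCase=Nom|Gender=Masc|Number=Sing\t2\tnsubj\t_\t_",
    "2\tlaja\tlajati\tVERB\tVmpr3s\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\t0\troot\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\tZ\t_\t2\tpunct\t_\t_",
])


@dataclass
class Token:
    index: int    # position of the token in the sentence (1-based)
    form: str     # word form as it appears in the text
    lemma: str    # base form
    upos: str     # universal part-of-speech tag
    msd: str      # morphosyntactic descriptor from the XPOS column
    feats: dict   # morphological features as key/value pairs
    head: int     # index of the syntactic head, 0 for the root
    deprel: str   # dependency relation to the head


def parse_feats(field):
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if field == "_":
        return {}
    return dict(pair.split("=", 1) for pair in field.split("|"))


def parse_sentence(block):
    """Parse one CoNLL-U sentence block, assuming every token has a HEAD value."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        tokens.append(Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                            parse_feats(cols[5]), int(cols[6]), cols[7]))
    return tokens


tokens = parse_sentence(SAMPLE)
for t in tokens:
    print(t.index, t.form, t.lemma, t.upos, t.msd, t.head, t.deprel)
```

Each annotation level named above corresponds to one field: tokenization yields the rows, lemmatization the lemma, morphosyntactic tagging the MSD and features, and dependency parsing the head and relation. Layers such as named entities, semantic roles or coreference would need additional fields or a separate file and are left out of this sketch.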
In addition to comprehensive and accurate linguistic tags, the development of methods for understanding natural language also requires a series of demanding benchmark tasks that encourage the development of new approaches and allow comparison with existing methods. The frameworks that have established themselves for English are GLUE (General Language Understanding Evaluation) and the even more demanding SuperGLUE. The latter consists of tasks such as logical reasoning over given texts, question answering, word-sense disambiguation and coreference resolution. A set of 1,000,000 words from the SuperGLUE benchmark will be translated and adapted into Slovene, placing Slovene among the few languages with such a collection of language understanding tasks.