Language resources

Work package 1

In the first work package, we will upgrade Slovenian text corpora and the lexicon of word forms. We will renew the training datasets and procedures for automatic linguistic annotation of modern Slovene. The result will be updated and enlarged language resources, available both to the user community and for machine learning. The procedures and tools we develop will make future updates of Slovenian corpora faster and easier.

Goals

  • We will upgrade the language resources, which are paramount for the development of Slovene language technologies but are currently only partially annotated or too small to yield good results.
  • We will upgrade the content of the corpora, which are important for both linguistic description and linguistic research, and make them available to users in corpus concordancers.
  • We will develop a comprehensive infrastructure for the efficient and continuous further development of basic language resources for Slovene (plans, workflows for text collection, annotation guidelines, expected file formats and so on).
  • We will develop new software tools for automatic linguistic annotation, as well as tools that support manual annotation of Slovenian texts at various levels.

Software tool and training datasets for annotation of Slovene texts

Natural language processing builds on tools that assign linguistic features to texts: automated procedures segment texts into sentences and tokens, assign each word its base form (lemma), part-of-speech tag and morphosyntactic features, and, at higher levels, add information on syntactic relations, semantic roles and the like. The underlying technologies for these tools are still changing – deep neural networks have had great success recently but are being overtaken by large pre-trained deep language models for contextual word embeddings. In the project, we will upgrade the existing annotation tools for Slovene at the listed levels and unite them into a single open-source tool that connects to the current annotation pipeline.
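The annotation levels described above can be illustrated with a deliberately small sketch. It is not the project's tool: the lookup table, the class name Token and the functions split_sentences, tokenize and annotate are all hypothetical, and a real tagger would predict lemmas and tags from context with a trained model rather than look them up.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    form: str   # surface form as it appears in the text
    lemma: str  # base form
    upos: str   # part-of-speech tag


# Hypothetical toy lexicon (form -> lemma, tag), for illustration only.
TOY_LEXICON = {
    "korpus": ("korpus", "NOUN"),
    "korpusi": ("korpus", "NOUN"),
    "je": ("biti", "AUX"),
    "so": ("biti", "AUX"),
    "velik": ("velik", "ADJ"),
    "veliki": ("velik", "ADJ"),
}


def split_sentences(text):
    """Naive segmentation: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def annotate(text):
    """Return a list of sentences, each a list of annotated tokens."""
    annotated = []
    for sentence in split_sentences(text):
        tokens = []
        for form in tokenize(sentence):
            if re.fullmatch(r"[^\w\s]", form):
                tokens.append(Token(form, form, "PUNCT"))
            else:
                # Unknown words keep their form as lemma and get the tag "X".
                lemma, upos = TOY_LEXICON.get(form.lower(), (form, "X"))
                tokens.append(Token(form, lemma, upos))
        annotated.append(tokens)
    return annotated


if __name__ == "__main__":
    for sentence in annotate("Korpus je velik. Korpusi so veliki!"):
        print([(t.form, t.lemma, t.upos) for t in sentence])
```

Each level feeds the next: sentences are segmented first, then tokens, and only then are lemmas and tags attached, which is why errors at the lower levels propagate upward.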

Annotation tools are developed with the help of training datasets, in which corpus texts are manually assigned the types of information that the program is then expected to assign automatically. The training corpus available for Slovene is ssj500k, which currently comprises 500,000 words. It is manually annotated at the level of tokenization, segmentation, morphosyntactic descriptors (MSDs), and lemmas. About half of the corpus is also annotated for dependency syntax (following the JOS and Universal Dependencies systems), named entities and verb phrases, and about a quarter of it for semantic roles. In the project, we will expand the training corpus to 1,000,000 words by manually annotating additional corpus texts, and finally add coreference and relation annotation, which is important for language processing at the semantic level.
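Because only part of a training corpus may carry the higher annotation levels, it can be useful to measure how much of it is covered by each layer. The sketch below assumes the data is available in CoNLL-U, the ten-column tab-separated format used by Universal Dependencies, where an underscore marks an unspecified value; the text above does not state which file format is used, so this is an assumption. The function name layer_coverage and the sample sentences are hypothetical.

```python
COLUMNS = ("ID", "FORM", "LEMMA", "UPOS", "XPOS",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")


def layer_coverage(conllu_text):
    """Return, per CoNLL-U column, the share of tokens with a value other than '_'."""
    counts = dict.fromkeys(COLUMNS, 0)
    total = 0
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank lines separate sentences, '#' lines are comments
        fields = line.split("\t")
        if len(fields) != len(COLUMNS) or not fields[0].isdigit():
            continue  # skip malformed lines, multiword ranges (1-2), empty nodes (1.1)
        total += 1
        for name, value in zip(COLUMNS, fields):
            if value != "_":
                counts[name] += 1
    return {name: (counts[name] / total if total else 0.0) for name in COLUMNS}


# Invented sample: the first sentence has syntax annotation, the second does not.
_ROWS = [
    "# text = Korpus je velik.",
    "1 Korpus korpus NOUN _ _ 3 nsubj _ _",
    "2 je biti AUX _ _ 3 cop _ _",
    "3 velik velik ADJ _ _ 0 root _ _",
    "4 . . PUNCT _ _ 3 punct _ _",
    "",
    "# text = Korpusi rastejo.",
    "1 Korpusi korpus NOUN _ _ _ _ _ _",
    "2 rastejo rasti VERB _ _ _ _ _ _",
    "3 . . PUNCT _ _ _ _ _ _",
]
# Token rows are written with spaces for readability and converted to tabs here.
SAMPLE = "\n".join(
    row if (not row or row.startswith("#")) else "\t".join(row.split(" "))
    for row in _ROWS
)

if __name__ == "__main__":
    for column, share in layer_coverage(SAMPLE).items():
        print(f"{column}: {share:.2f}")
```

In this invented sample every token has a lemma, while only the four tokens of the first sentence have a syntactic head, so the HEAD column is covered for four of seven tokens.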

In addition to comprehensive and accurate linguistic tags, the development of methods for understanding natural language also requires a series of demanding benchmark tasks that encourage the development of new approaches and comparison with existing methods. The frameworks that have established themselves for English are GLUE (General Language Understanding Evaluation) and the even more demanding SuperGLUE. The latter consists of tasks such as logical reasoning on the basis of given texts, question answering, word-sense disambiguation and coreference resolution. A set of 1,000,000 words from the SuperGLUE benchmark will be translated and adapted into Slovene, which will put Slovene on the list of the few languages with such a collection of language understanding tasks.

Lexicon of Slovene word forms Sloleks

The lexicon of word forms Sloleks is an open-access collection of morphological and accentual features for Slovene. The current version contains data for approximately 100,000 words: manually annotated word forms, automatically assigned accents, and phonetic transcriptions. The lexicon also contains information on morphological variation, the frequency of word forms in the Gigafida 2.0 reference corpus, and automatically generated pronunciation recordings. Alongside the training datasets mentioned above, Sloleks is a fundamental language resource for Slovene language processing, and it is also important for linguistic description. The project will therefore provide several important improvements, primarily a manual review of the automatically assigned accentual features. In addition, we will make enlarging the lexicon easier and more efficient: we will develop a tool that imports automatically prepared data from selected sources and enables a quick manual linguistic review in a user-friendly interface. We will test the new tool while expanding the lexicon, which we would like to enrich with data for at least 100,000 new words.
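To make the review workflow concrete, here is a minimal sketch of how a lexicon entry and a review queue might be modelled. It is an illustration under assumptions, not the Sloleks data model: the classes WordForm and Entry, the field names, and the function review_queue are hypothetical, and the example values (tags, accented forms, frequencies) are invented rather than taken from the lexicon.

```python
from dataclasses import dataclass, field


@dataclass
class WordForm:
    form: str                       # inflected form
    msd: str                        # morphosyntactic descriptor
    accented: str                   # form with accent marks
    accent_is_manual: bool = False  # False: accent was assigned automatically
    frequency: int = 0              # corpus frequency of the form


@dataclass
class Entry:
    lemma: str
    forms: list = field(default_factory=list)


def review_queue(entries):
    """List (lemma, form) pairs that still need manual accent review,
    most frequent forms first."""
    pending = [
        (entry.lemma, form)
        for entry in entries
        for form in entry.forms
        if not form.accent_is_manual
    ]
    return sorted(pending, key=lambda pair: pair[1].frequency, reverse=True)


if __name__ == "__main__":
    # Invented example data, for illustration only.
    lexicon = [
        Entry("miza", [
            WordForm("miza", "Ncfsn", "míza", accent_is_manual=True, frequency=120),
            WordForm("mize", "Ncfsg", "míze", frequency=80),
        ]),
        Entry("korpus", [
            WordForm("korpus", "Ncmsn", "kórpus", frequency=200),
        ]),
    ]
    for lemma, form in review_queue(lexicon):
        print(lemma, form.form, form.accented, form.frequency)
```

Ordering the queue by frequency is one possible design choice: it would direct manual effort to the forms users are most likely to encounter.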

Reference corpora Gigafida, Janes and GOS

The text corpus Gigafida includes standard language (e.g. newspapers, magazines, technical texts and fiction), Janes includes user-generated content (e.g. forum posts, tweets, news comments), while GOS includes spoken Slovene (e.g. TV shows, lectures, conversation). These corpora form the foundation for both language description and prescription, as well as for language manuals, language technologies, and all kinds of procedures. The project aims to ensure their long-term upgradability, taking into account the experience of stakeholders, who use the corpora for product development, and user groups, who use the corpora for professional purposes. We will provide an infrastructure that enables continuous upgrading of the corpora: we will address the legal issues related to text acquisition, set up a website with information for text providers and a repository for the texts, and define protocols for processing the acquired texts. Since processing non-standard language poses specific challenges, we will provide a training dataset for automatic annotation of user-generated content. When planning corpus content, we will also pay special attention to the question of standardization and the related inclusion of the Slovene national minorities living abroad.

Slovene learners’ corpora Šolar and KOST and corpora of Slovene parliamentary debates

In addition to the reference corpora, we will upgrade specialized corpora as well: the corpus of school texts Šolar, the corpus of Slovene language acquisition KOST and the corpora of Slovene parliamentary debates (siParl and SlovParl). These corpora offer important insight into specific types of language use. However, they require additional steps in the preparation of corpus material: the Šolar and KOST corpora contain information on teachers’ corrections of linguistic errors, while the corpora of Slovene parliamentary debates carry metadata on speakers, types of sessions and agenda items, as well as structural and editorial annotations. To continue preparing and upgrading these corpora, it is necessary to establish procedures for continuous collection and processing of corpus material – with this in mind, we will develop new tools: one for (semi-)automatic annotation of Slovene parliamentary debates and one for manual annotation and categorization of linguistic corrections. The new tools will also be used during the project to improve the existing corpora, and will be made openly available for further use after the project ends.

Metacorpus of selected Slovenian corpora

Slovenian corpora are available for analysis in various concordancers. Currently, the corpora are available separately: users can search for information in each corpus individually; if they want to compare results, they have to aggregate the information manually, which is time-consuming and can lead to analysis errors. Different corpora typically have different metadata and may be annotated at different linguistic levels, which makes searching across them even more challenging. In the project, we will therefore survey publicly accessible Slovene corpora, make a substantiated selection of them, and then unite them into a single large corpus in which the search will be uniform and transparent. In doing so, we will unify metadata and harmonize linguistic and structural annotations between corpora. Conversions of individual corpora into the desired uniform format will be produced as well. The combined corpus will be available to the community through CLARIN.SI concordancers.
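Unifying metadata across corpora can be pictured as mapping each corpus's own field names onto one shared schema. The following is a minimal sketch under assumptions: the corpus identifiers, the source field names, the unified fields and the function harmonize are all hypothetical and do not describe the actual metadata of any Slovene corpus.

```python
# Hypothetical per-corpus mappings: source field name -> unified field name.
FIELD_MAP = {
    "corpus_a": {"naslov": "title", "leto": "year", "zvrst": "genre"},
    "corpus_b": {"title": "title", "date": "year", "text_type": "genre"},
}

# Hypothetical unified schema shared by all corpora in the combined corpus.
UNIFIED_FIELDS = ("title", "year", "genre")


def harmonize(corpus, record):
    """Map one corpus-specific metadata record onto the unified schema.

    Fields without a mapping are dropped; unified fields with no source
    value are kept as None so that every record has the same keys.
    """
    mapping = FIELD_MAP[corpus]
    unified = dict.fromkeys(UNIFIED_FIELDS)
    for source_field, value in record.items():
        target = mapping.get(source_field)
        if target is not None:
            unified[target] = value
    if unified["year"] is not None:
        # Normalize values such as "2019" or "2018-03-12" to an integer year.
        unified["year"] = int(str(unified["year"])[:4])
    unified["corpus"] = corpus  # remember where the record came from
    return unified


if __name__ == "__main__":
    print(harmonize("corpus_a", {"naslov": "Delo", "leto": "2019", "zvrst": "newspaper"}))
    print(harmonize("corpus_b", {"title": "Seja", "date": "2018-03-12"}))
```

Keeping the source corpus in every record means a uniform search can still be filtered or broken down by corpus afterwards.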

Learn more about other Work packages

Speech technologies

Semantic resources and technologies

Machine translation

© 2020. All rights reserved

Concept and implementation: ENKI, d.o.o.