Word-sense disambiguation and vector representation of words
Currently, the most successful machine approach to natural language processing and understanding is deep neural networks. To operate, these networks need texts converted into numerical form: each word is mapped to a vector in such a way that semantic similarities between words are reflected in the distances between the vectors. Contextual word embeddings are a basic prerequisite for successful natural language processing and are essential for speech recognition and generation, text summarization, question answering, word-sense disambiguation, coreference resolution, machine translation and terminology extraction. Some basic embeddings for Slovene already exist, e.g. the Slovene-only word2vec, fastText and ELMo models and multilingual embeddings such as BERT and XLM-R. Research shows that the highest-quality embeddings are obtained from the largest and highest-quality text collections. We will therefore build independent contextual word embeddings of the BERT and ELMo type based on the existing corpora Gigafida 2.0, KAS and FRENK and on the text resources collected within the project framework.
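The core idea, that semantic similarity between words maps to geometric proximity between their vectors, can be sketched with cosine similarity over toy vectors. The three-dimensional vectors and the word choices below are purely illustrative assumptions; real word2vec, fastText, ELMo or BERT embeddings have hundreds of dimensions and are learned from large corpora:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, chosen by hand for illustration only.
vec = {
    "krt": [0.9, 0.1, 0.0],    # 'mole' (the animal)
    "zival": [0.8, 0.2, 0.1],  # 'animal'
    "koza": [0.1, 0.9, 0.2],   # 'skin'
}

# A semantically related pair should score higher than an unrelated one.
print(cosine_similarity(vec["krt"], vec["zival"]))  # high (close in space)
print(cosine_similarity(vec["krt"], vec["koza"]))   # lower (far apart)
```

In a contextual model such as BERT or ELMo the vector for a word additionally depends on its sentence context, so the two senses of 'mole' receive different vectors; in static models like word2vec they share one.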
Word-sense disambiguation is the process of determining which sense of a polysemous word is used in a particular communication situation, e.g. 'mole' as a small animal or as a dark spot on the skin. First, we will determine all the senses of a word using the Slovenian WordNet, with various other dictionaries, e.g. the digital dictionary database and SSKJ, as auxiliary sources. We will base the disambiguation on several versions of the Lesk algorithm, but we expect to achieve better results with deep neural networks. We will test several deep neural network architectures and combine them with different vector embeddings suitable for highly inflected languages such as Slovene. For disambiguation purposes we will build a dataset alongside work package one: it will be manually annotated to assign a sense to individual uses of words from a given set. This dataset will be used to train machine learning models. As a final improvement of the disambiguation system, we will use the semantic network produced in this work package. The tool will be evaluated on a held-out part of the dataset. Prediction errors will be analysed and, if necessary, the training set will be supplemented with additional examples.
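The simplified Lesk algorithm selects the sense whose dictionary gloss shares the most words with the surrounding context. A minimal sketch follows, using hypothetical English glosses for the 'mole' example above; a real system would draw glosses from the Slovenian WordNet or SSKJ and normalise the highly inflected Slovene word forms, e.g. by lemmatisation, before comparing them:

```python
def simplified_lesk(context_words, senses):
    """Return the sense id whose gloss has the largest word overlap
    with the context.

    senses: mapping from sense id to gloss string (here invented for
    illustration; in practice taken from a wordnet or dictionary).
    """
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical glosses for the two senses of 'mole' from the text.
senses = {
    "mole.animal": "small burrowing animal with dark fur living underground",
    "mole.skin": "small dark spot or blemish on the human skin",
}

sentence = "the doctor examined a dark spot on her skin".split()
print(simplified_lesk(sentence, senses))  # -> mole.skin
```

The neural alternative described above replaces this bag-of-words overlap with learned contextual representations, which is why we expect it to generalise better when the context and the gloss share meaning but not surface word forms.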