Word-sense disambiguation and vector representation of words
Currently, the most successful machine approach to natural language processing and understanding is deep neural networks. To operate, these networks need texts converted into numerical form: each word is mapped to a vector in such a way that semantic similarities between words are reflected in the distances between the vectors. Contextual word embeddings are a basic prerequisite for successful natural language processing and are essential for speech recognition and generation, text summarization, question answering, word-sense disambiguation, coreference resolution, machine translation and terminology extraction. Some basic embeddings for Slovene already exist, e.g. the Slovene-only word2vec, fastText and ELMo models and multilingual embeddings such as BERT and XLM-R. Research shows that the highest-quality embeddings are obtained from the largest and highest-quality text collections. We will therefore build independent contextual word embeddings of the BERT and ELMo type based on the existing corpora Gigafida 2.0, KAS and FRENK and on the text resources collected within the project framework.
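The core idea, that semantic similarity between words maps to geometric proximity between their vectors, can be sketched with cosine similarity over toy vectors. The three-dimensional vectors and the word choices below are purely illustrative assumptions; real word2vec, fastText, ELMo or BERT embeddings have hundreds of dimensions and are learned from large corpora:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, chosen by hand for illustration only.
vec = {
    "krt": [0.9, 0.1, 0.0],    # 'mole' (the animal)
    "zival": [0.8, 0.2, 0.1],  # 'animal'
    "koza": [0.1, 0.9, 0.2],   # 'skin'
}

# A semantically related pair should score higher than an unrelated one.
print(cosine_similarity(vec["krt"], vec["zival"]))  # high (close in space)
print(cosine_similarity(vec["krt"], vec["koza"]))   # lower (far apart)
```

In a contextual model such as BERT or ELMo the vector for a word additionally depends on its sentence context, so the two senses of 'mole' receive different vectors; in static models like word2vec they share one.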
Word-sense disambiguation is the process of determining which sense of a polysemous word is used in a particular communication situation, e.g. 'mole' as a small animal or as a dark spot on the skin. First, we will determine all the senses of a word using the Slovenian WordNet, with various other dictionaries, e.g. the digital dictionary database and SSKJ, as auxiliary sources. We will base the disambiguation on several versions of the Lesk algorithm, but we expect to achieve better results with deep neural networks. We will test several deep neural network architectures and combine them with different vector embeddings suitable for highly inflected languages such as Slovene. For disambiguation purposes we will build a dataset alongside work package one: it will be manually annotated to assign a sense to individual uses of words from a given set. This dataset will be used to train machine learning models. As a final improvement of the disambiguation system, we will use the semantic network produced in this work package. The tool will be evaluated on a held-out part of the dataset. Prediction errors will be analysed and, if necessary, the training set will be supplemented with additional examples.
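The simplified Lesk algorithm selects the sense whose dictionary gloss shares the most words with the surrounding context. A minimal sketch follows, using hypothetical English glosses for the 'mole' example above; a real system would draw glosses from the Slovenian WordNet or SSKJ and normalise the highly inflected Slovene word forms, e.g. by lemmatisation, before comparing them:

```python
def simplified_lesk(context_words, senses):
    """Return the sense id whose gloss has the largest word overlap
    with the context.

    senses: mapping from sense id to gloss string (here invented for
    illustration; in practice taken from a wordnet or dictionary).
    """
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical glosses for the two senses of 'mole' from the text.
senses = {
    "mole.animal": "small burrowing animal with dark fur living underground",
    "mole.skin": "small dark spot or blemish on the human skin",
}

sentence = "the doctor examined a dark spot on her skin".split()
print(simplified_lesk(sentence, senses))  # -> mole.skin
```

The neural alternative described above replaces this bag-of-words overlap with learned contextual representations, which is why we expect it to generalise better when the context and the gloss share meaning but not surface word forms.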