Maintenance of the infrastructure centre for language resources and technologies

Work package 6

In the sixth work package, the Slovenian research infrastructure for language resources and technologies CLARIN.SI will be responsible for the public availability of the language resources that will be developed within the project. This will be done in compliance with international standards and good practices in resource formatting. The resources will be securely archived for a long period in a certified repository, and the created corpora will be available in the CLARIN.SI web concordancers.

Goals

  • We will provide technically sound language resources, publicly available in the CLARIN.SI repository together with their descriptions, and carefully verified data for linguistic analyses by researchers, students, and anyone else interested in the Slovenian language.
  • We will provide the standardized XML schemas needed to document and validate the formatting of language resources produced within the project framework.

CLARIN.SI services

The European Research Infrastructure for Language Resources and Technology (CLARIN) provides access to language resources and services for research in the humanities and social sciences, as well as in other areas that study language and linguistic data, such as artificial intelligence. The Slovenian infrastructure CLARIN.SI, with its headquarters at JSI, is a member of CLARIN ERIC and is organized as a consortium of twelve partners that unites all major institutions involved in the development or use of language resources and technologies in Slovenia.

Two kinds of online services maintained by CLARIN.SI are essential to the project: a repository and two online concordancers.

The repository enables safe long-term storage of language resources and tools. It is the second repository in Slovenia to obtain the CoreTrustSeal certificate and is also certified as a CLARIN type B centre. It currently holds more than 200 entries, 140 of which contain data for the Slovenian language that are crucial for computational linguistics.

Work on the repository includes maintaining the software and hardware, ensuring the uninterrupted operation of the system, and editorial work on new entries. Within the framework of the project, the editorial process will be extended to the validation of the data itself, as its formatting will have to conform to the developed schemas. In addition to formal validation, resources will also need to be qualitatively evaluated.
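The formal validation step can be illustrated with a minimal sketch. It is not the project's validator and does not use the project's schemas: the element names, the `REQUIRED_ATTRIBUTES` table and the `check_document` function are hypothetical, and a hand-written rule table stands in for a real XML schema so that the example needs only the Python standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set standing in for a real schema:
# allowed element name -> attributes that element must carry.
REQUIRED_ATTRIBUTES = {
    "text": {"id"},
    "s": {"id"},
    "w": {"lemma"},
}


def check_document(xml_text):
    """Return a list of problems found; an empty list means the check passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as error:
        # A document that is not well-formed cannot be checked any further.
        return [f"not well-formed: {error}"]

    problems = []
    for element in root.iter():  # visits the root and all descendants
        if element.tag not in REQUIRED_ATTRIBUTES:
            problems.append(f"unexpected element <{element.tag}>")
            continue
        missing = REQUIRED_ATTRIBUTES[element.tag] - set(element.attrib)
        for name in sorted(missing):
            problems.append(f"<{element.tag}> is missing attribute '{name}'")
    return problems


if __name__ == "__main__":
    sample = '<text id="t1"><s><w lemma="dan">dan</w></s></text>'
    for problem in check_document(sample):
        print(problem)
```

A list of problems of this kind is the sort of feedback that could be returned to authors together with an entry; an actual deployment would presumably validate against the schemas themselves with a schema-aware tool.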

Authorized editors will ensure that authors' entries comply with the requirements of the repository in terms of completeness and consistency of metadata and compliance with open standards and good practices in data formatting. This evaluation will be the basis for accepting an entry into the repository; if the editors have remarks, the entry will be returned to the authors with detailed guidelines for its improvement, or the technical deficiencies will be resolved by the CLARIN.SI staff in agreement with the authors.

CLARIN.SI also offers two online concordancers, i.e. powerful corpus analysis tools that are primarily useful for linguists. They currently offer access to 75 corpora in 27 languages, which together contain over 15 billion words. All corpora made publicly available in the repository will additionally be converted to a vertical format, which requires developing the corresponding conversion procedures. This format then serves as the basis for including the corpora in the CLARIN.SI concordancers, making them accessible to linguists for corpus analyses.
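As an illustration of what such a conversion produces, the sketch below writes a tiny annotated text in a vertical layout: one token per line with tab-separated attributes, and structures such as documents and sentences marked by XML-like tags on their own lines. The function name `to_vertical`, the choice of attributes (word, lemma, tag) and the sample data are assumptions made for the example, not the project's actual conversion.

```python
from xml.sax.saxutils import quoteattr


def to_vertical(doc_id, sentences):
    """Convert one document to vertical text.

    `sentences` is a list of sentences, each a list of
    (word, lemma, tag) tuples. This attribute set is an assumption.
    """
    lines = [f"<doc id={quoteattr(doc_id)}>"]  # structure tag with a quoted attribute
    for sentence in sentences:
        lines.append("<s>")
        for word, lemma, tag in sentence:
            # One token per line, attributes separated by tabs.
            lines.append("\t".join((word, lemma, tag)))
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [[("Dober", "dober", "ADJ"), ("dan", "dan", "NOUN")]]
    print(to_vertical("demo", sample), end="")
```

The sketch does not escape tokens that themselves begin with `<`, which a real conversion would need to handle so that they are not mistaken for structure tags.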

In addition, we will train new collaborators to work with these services, and the staff will be available to answer user questions related to the project.

Development and maintenance of XML schemas

The project will develop many valuable language resources for the Slovenian language. To support long-term use, interoperability between applications, and reuse, these resources must be uniformly encoded, taking into account international standards and recommendations. In this work package, we will upgrade the XML schemas already used for entries in the CLARIN.SI repository, and develop new ones, in order to support the language resources developed or upgraded in the project framework, especially corpora and lexicons. We will also develop and maintain descriptions and formal vocabularies of linguistic annotation labels at the level of morphology, syntax, semantic roles, named entities, and the like. In addition, we will ensure that these resources are findable, accessible in the long term, interoperable, reusable, and formatted in conformity with open standards and good practices.

Learn more about other Work packages

Language resources

Speech technologies

Semantic resources and technologies

© 2020. All rights reserved
