Speech technologies

The second work package consists of the creation of a speech database, which is the basis for the development of support tools for speech recognition. We will also develop the speech recognizer itself – the project will result in the creation of one general and two specialized speech recognizers, which will be openly available to all users, alongside descriptions of the development process.

Goals

We will create a speech database, which is crucial for the development of a fluent speech recognizer for Slovene.
We will develop support tools that are of utmost importance for making a speech recognizer and using it in practice: a tool for syntactic normalization, a tool for acoustic normalization, a tool for grapheme-to-phoneme conversion, and a tool for raw text punctuation.
We will develop one general and two specialized speech recognizers, which will enable users to create a transcription for an audio file they upload – free of charge. The information regarding the creation process will also be made available.

Speech database

The speech database, i.e. a collection of speech recordings, is the foundation for the development of a speech recognizer. As part of the project, we will create a speech database comprising 1,000 hours of speech in Slovene: read and spontaneous, prepared and improvised. It will also include recordings of speech read by a professional speaker, which will be used to develop speech synthesis.

The speech database will be collected from approximately 1,800 speakers. It will be publicly available under an open license for both non-commercial and commercial development of technologies such as voice control of devices, chatbots, smart virtual assistants (e.g. Amazon Alexa, Google Assistant and Apple Siri), automatic video subtitling and automatic speech translation. All programming code, databases and tools created in the project framework will be publicly available. They can be tested and used by individuals, research and educational institutions, non-profit organizations, state bodies, entities exercising governmental authority and companies in Slovenia and abroad.

The speech database will strengthen the position of Slovene in the information and communication technologies that enable spoken communication with machines. This will help speakers of Slovene use the most modern communication methods and participate creatively in the speech situations of the future, both at work and in leisure.

Support tools

We will develop four support tools that will later be used to build general or domain-specific speech recognizers: an acoustic normalizer, a syntactic normalizer, a punctuator and a phonemizer.

The acoustic normalizer is used to pre-process the audio signal in order to remove additive noise, i.e. non-speech sound elements that can disrupt the training of the recognizer. An acoustic normalizer can help increase the robustness of the recognizer by making it less dependent on how clean the speech signal is. It will be created using approaches based on digital signal processing and/or deep neural networks. Its dataset will be the speech corpus that is also being developed in this work package.
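
As an illustration of the digital signal processing route, the sketch below applies spectral subtraction, one classical way to suppress additive noise. It is not the project's implementation: the function names, the frame length, and the assumption that the first few frames contain only noise are all hypothetical choices made for this example. It uses non-overlapping frames and a naive DFT to stay dependency-free, so it is suitable only for short test signals.

```python
import cmath
import math
import random


def dft(frame):
    """Naive discrete Fourier transform of one frame (O(n^2), illustration only)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT; the real part is returned because the input signal is real."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def spectral_subtraction(signal, frame_len=64, noise_frames=4):
    """Subtract an estimated noise magnitude spectrum from every frame.

    Assumption (hypothetical): the first `noise_frames` frames contain noise only.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    if not frames:
        return list(signal)
    spectra = [dft(f) for f in frames]
    lead = spectra[:noise_frames]
    # Average magnitude per frequency bin over the noise-only frames.
    noise_mag = [sum(abs(s[k]) for s in lead) / len(lead) for k in range(frame_len)]
    cleaned = []
    for spec in spectra:
        new_spec = []
        for k, value in enumerate(spec):
            # Reduce the magnitude, never below zero, and keep the original phase.
            mag = max(abs(value) - noise_mag[k], 0.0)
            new_spec.append(cmath.rect(mag, cmath.phase(value)))
        cleaned.extend(idft(new_spec))
    return cleaned


def energy(samples):
    return sum(x * x for x in samples)
```

A neural approach would replace the fixed subtraction rule with a learned mapping from noisy to clean speech, which is where the speech corpus would serve as training data.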

The syntactic normalizer can be used in both text pre-processing and post-processing. Syntactic normalization is a process that transforms a text into a single canonical form that it may not have previously had. The recognizer treats numbers, dates, acronyms, and abbreviations as non-standard words whose pronunciation depends on the context. The transcriptions in the speech database will be formatted so that numbers and dates are spelled out and acronyms and abbreviations are expanded. This means that the recognizer will also return results in this format. The Gigafida 2.0 corpus, which is the intended source for the language model, will need to be adapted to match the recognizer output. Depending on the domain of the recognizer, the normalizer may also be useful in post-processing, where, for example, it will be able to convert spelled-out numbers and dates into a numeric format. The syntactic normalizer will be created using a rule-based approach, deep neural networks, or a combination of both. The updated Gigafida 2.0 corpus is expected to be used as the dataset.
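
To make the rule-based variant concrete, here is a minimal sketch of pre-processing normalization that spells out numbers from 0 to 99 and expands two abbreviations. The tables, function names and the tiny abbreviation list are illustrative assumptions, not the project's rules. In particular, it ignores the gender and case agreement of Slovene numerals, which is exactly the context dependence described above and the reason a real normalizer needs richer rules or a learned model.

```python
import re

# Base forms only; real Slovene numerals inflect for gender and case.
UNITS = ["nič", "ena", "dva", "tri", "štiri", "pet", "šest", "sedem", "osem", "devet"]
TEENS = ["deset", "enajst", "dvanajst", "trinajst", "štirinajst", "petnajst",
         "šestnajst", "sedemnajst", "osemnajst", "devetnajst"]
TENS = {2: "dvajset", 3: "trideset", 4: "štirideset", 5: "petdeset",
        6: "šestdeset", 7: "sedemdeset", 8: "osemdeset", 9: "devetdeset"}

# Hypothetical, deliberately tiny abbreviation table.
ABBREVIATIONS = {"npr.": "na primer", "itd.": "in tako dalje"}


def spell_number(n):
    """Spell out an integer in the range 0..99 in its base form."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0..99")
    if n < 10:
        return UNITS[n]
    if n < 20:
        return TEENS[n - 10]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return TENS[tens]
    # Compound numerals put the unit first: 21 -> "ena" + "in" + "dvajset".
    return UNITS[unit] + "in" + TENS[tens]


def normalize(text):
    """Expand known abbreviations, then spell out standalone 1-2 digit numbers."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    return re.sub(r"\b\d{1,2}\b", lambda m: spell_number(int(m.group())), text)
```

Post-processing would run the mapping in the opposite direction, turning spelled-out forms in the recognizer output back into digits.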

A punctuator is a tool for placing basic punctuation marks in the texts returned by a recognizer. A typical recognizer analyses an acoustic signal, recognizes phonemes, and then composes words from them. However, it cannot compose words into larger units such as sentences and clauses, so the semantic value of its transcripts is lower. The punctuator can enrich the recognized words with basic punctuation marks, such as commas, periods, question marks, and exclamation marks, thus facilitating the semantic processing of the transcriptions. The punctuator is expected to be trained using deep neural networks. The Gigafida 2.0 corpus is expected to be used as the dataset for the punctuator as well.
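
One common way to frame punctuation restoration is as sequence labelling: each word receives a label naming the punctuation mark that follows it. The sketch below shows how training pairs could be derived from already punctuated text (such as a written corpus) and how predicted labels could be turned back into text. The label names and function names are assumptions made for this example, and the neural model that would predict the labels is not shown.

```python
import re

# Hypothetical label set covering the four marks named in the text.
PUNCTUATION_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}
SENTENCE_FINAL = {"PERIOD", "QUESTION", "EXCLAMATION"}


def make_training_pairs(text):
    """Turn punctuated text into (lowercased word, label of following mark) pairs."""
    pairs = []
    for token in re.findall(r"\w+|[,.?!]", text.lower()):
        if token in PUNCTUATION_LABELS:
            if pairs:
                # Attach the mark to the word that precedes it.
                pairs[-1] = (pairs[-1][0], PUNCTUATION_LABELS[token])
        else:
            pairs.append((token, "NONE"))
    return pairs


def apply_labels(pairs):
    """Rebuild punctuated, capitalized text from (word, label) pairs."""
    symbols = {label: mark for mark, label in PUNCTUATION_LABELS.items()}
    words = []
    capitalize_next = True
    for word, label in pairs:
        if capitalize_next:
            word = word.capitalize()
        words.append(word + symbols.get(label, ""))
        capitalize_next = label in SENTENCE_FINAL
    return " ".join(words)
```

At inference time the recognizer output supplies the words, a trained model supplies the labels, and `apply_labels` produces the readable transcript.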

A phonemizer is a tool for converting a grapheme transcription into its corresponding phonemic transcription. It can serve as a method for adding missing words to a pronunciation dictionary during the recognition process. For a typical speech recognizer, the pronunciation dictionary is fundamental, as the recognizer only recognizes words that are in the dictionary. During speech recognition we often come across words that are not yet in the dictionary, so we need a process that allows them to be added, either manually or automatically. The phonemizer is expected to be built from heuristic rules, a model learned using neural networks, or a combination of both.
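
The dictionary-plus-fallback idea can be sketched as follows: look a word up in the pronunciation dictionary, and if it is missing, generate a pronunciation with simple grapheme-to-phoneme rules and store it. The rule table below covers only a handful of consonant letters and is a deliberate simplification assumed for this example; it ignores stress, vowel quality and context-dependent pronunciation, which is where a learned model would be needed.

```python
# Hypothetical one-letter rules; every other letter maps to itself.
GRAPHEME_RULES = {"č": "tʃ", "š": "ʃ", "ž": "ʒ", "c": "ts", "h": "x"}


def phonemize(word):
    """Convert a word to a list of phoneme symbols using letter-level rules."""
    return [GRAPHEME_RULES.get(letter, letter) for letter in word.lower()]


def lookup_or_add(word, lexicon):
    """Return the pronunciation from the lexicon, adding a rule-based one if missing.

    Entries that already exist (for example, manually corrected ones) are kept.
    """
    key = word.lower()
    if key not in lexicon:
        lexicon[key] = phonemize(key)
    return lexicon[key]


# Usage: a manually entered pronunciation takes precedence over the rules.
lexicon = {"wifi": ["v", "i", "f", "i"]}
lookup_or_add("šola", lexicon)
lookup_or_add("WiFi", lexicon)
```

Keeping manual entries authoritative lets automatically generated pronunciations be reviewed and corrected later without being overwritten.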

General and two specialized speech recognizers

In addition to the speech database, the most important result of this work package will be the three actual recognizers, one for general use and two for specific domains. These tools will enable human communication with machines and other forms of artificial intelligence – in Slovene. The information regarding their production will also be made available.

The general recognizer will be designed so that it can be improved alongside updates to the speech database, support tools and other learning resources. We will test different approaches to building the recognizer and different sets of hyperparameters. We will also examine how acoustic signal pre-processing affects the robustness of the recognizer and its recognition ability. Such pre-processing can provide a cleaner signal, and it can also add noise to the training dataset so that the neural network learns to recognize noise. As an alternative to building a recognizer in several stages, i.e. with a separate acoustic model, language model, and pronunciation model, contemporary end-to-end approaches are also widely adopted. With an end-to-end approach we train a single model that takes over the role of all these components. Although these approaches typically require significantly larger training datasets, there are already methods that go beyond these limitations. This raises the question of how large a training dataset will really need to be in the future to produce an end-to-end recognizer of comparable or even better quality than traditionally built composite speech recognizers.
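
Adding noise to the training data is usually done at a controlled signal-to-noise ratio (SNR). The sketch below scales a noise recording so that the mixture has a requested SNR in decibels, using SNR = 10 log10(P_speech / P_noise). The function names are hypothetical, and the signals are plain lists of samples to keep the example dependency-free.

```python
import math
import random


def mean_power(samples):
    """Average power of a signal given as a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB.

    Both signals are assumed to have the same length and the noise must not be silent.
    """
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise signal has zero power")
    # Solve 10*log10(P_speech / (scale^2 * P_noise)) = snr_db for scale.
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(speech, noise)]


# Usage: one clean utterance can yield several training examples at different SNRs.
random.seed(1)
clean = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
background = [random.gauss(0, 1) for _ in range(200)]
augmented = [mix_at_snr(clean, background, snr) for snr in (20, 10, 0)]
```

Training on a range of SNR values is one way to study how noise in the training data affects robustness, which is the comparison described above.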

In addition to the general robust recognizer, two domain-specific recognizers will be built in this work package, and we will also demonstrate their construction. Restricted-domain recognizers can be built by preparing a special domain-specific speech database and then following the same approach used to build general recognizers. Alternatively, we can reuse the acoustic model of the general recognizer, which contains general models of speech sounds and is not tied to particular words, and adapt only the language model and the pronunciation models of domain-specific words to the narrower domain. In terms of speech resources, the first approach is more demanding, as we would have to acquire an additional, sufficiently extensive specialized speech database for each individual domain. That is why we will adopt the second approach: we will use a version of the general recognizer built during the previous activities as a foundation. We will demonstrate the process on two selected domains: managing a smart home, and communicating with a virtual assistant that will be able to describe images of people's faces aloud. All the results of this work package will be made available on the portal alongside all other tools and project solutions. Looking ahead, we will also prepare a plan for further improvement of general speech recognition for Slovene, with an emphasis on education (e.g. translation of lectures in real time).
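
Adapting a language model to a narrower domain can be illustrated with linear interpolation, a standard technique in which the probabilities of a general model and a small domain model are mixed with a fixed weight. The text does not say which adaptation method the project will use, so this is only an assumed example; it uses unigram models and English smart-home phrases for readability, and all names are hypothetical.

```python
from collections import Counter


def unigram_probs(sentences):
    """Estimate word probabilities from a list of sentences by relative frequency."""
    counts = Counter(word for sentence in sentences for word in sentence.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(general, domain, domain_weight=0.7):
    """Mix two unigram models: P(w) = λ·P_domain(w) + (1 − λ)·P_general(w)."""
    vocabulary = set(general) | set(domain)
    return {word: domain_weight * domain.get(word, 0.0)
                  + (1 - domain_weight) * general.get(word, 0.0)
            for word in vocabulary}


# Usage: domain words gain probability while general vocabulary stays available.
general_model = unigram_probs(["turn on the radio", "what is the weather"])
domain_model = unigram_probs(["turn on the light", "turn off the light"])
adapted_model = interpolate(general_model, domain_model)
```

The acoustic model is left untouched in this scheme; only the text-side components change, which is why no additional domain-specific speech recordings are required.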

Learn more about other Work packages

Semantic resources and technologies

Machine translation

Terminology portal
