Aural Technology

An integrated, complete technological package, named ALLdio, is offered for audio-source information management. The package brings together the main results of the R&D activity of ALL in the field of speech and sound technology.

ALLdio encompasses three technological layers – one for Sound Analytics (ALLdio SAn), another for Automatic Speech Recognition (ALLdio ASR), and a third one for audio-source Information Engineering (ALLdio IE).

ALLdio SAn

The primary goal of ALLdio SAn is to provide a reliable technological framework for automatically revealing how an arbitrary audio stream divides into a series of distinct audio passages, each characterized by a different sort of aurally perceptible event – such as ‘speech in English’, ‘a piece of musical composition’, ‘street noise’, ‘dog barking’, and the like.

ALLdio SAn makes use of the relevant results of the state-of-the-art technologies for different facets of sound analytics. According to these, the boundaries and the nature of a given sort of audio event can be recognized on the basis of the acoustic attribute model (e.g. the voice model of a speaker or the sound model of a particular language) of that sort of event – provided that the required acoustic attribute model has been trained from proper audio samples.
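The idea of recognizing event boundaries from acoustic attribute models can be pictured with a minimal Python sketch. It is not the ALLdio SAn implementation: the `AttributeModel` interface, the `segment_stream` function and the toy models below are all assumptions made for illustration. Each hypothetical model scores a frame of audio features, every frame is labelled with its best-scoring model, and runs of identical labels are merged into passages.

```python
from itertools import groupby
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: an attribute model scores one frame of acoustic
# features; a higher score means a better match. Real models would be trained.
AttributeModel = Callable[[Sequence[float]], float]
Segment = Tuple[int, int, str]  # (first frame, one past last frame, event label)


def segment_stream(frames: Sequence[Sequence[float]],
                   models: Dict[str, AttributeModel]) -> List[Segment]:
    """Label every frame with its best-scoring model, then merge equal neighbours."""
    labels = [max(models, key=lambda name: models[name](frame)) for frame in frames]
    segments: List[Segment] = []
    start = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        segments.append((start, start + length, label))
        start += length
    return segments


# Toy models over a single made-up feature value per frame (illustration only).
toy_models: Dict[str, AttributeModel] = {
    "speech": lambda frame: -abs(frame[0] - 1.0),
    "music": lambda frame: -abs(frame[0] - 5.0),
}
toy_frames = [[1.1], [0.9], [4.8], [5.2], [1.0]]
print(segment_stream(toy_frames, toy_models))
# prints [(0, 2, 'speech'), (2, 4, 'music'), (4, 5, 'speech')]
```

In this sketch the set of recognizable events is exactly the set of keys in the model dictionary, so adding a model adds a recognizable event. A practical system would also smooth the frame-level labels before merging them.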

The state-of-the-art technologies are standalone in the sense that each focuses either on recognizing the global content (‘speech’, ‘music’, etc.) of the audio events or on detecting those segments of speech passages that are homogeneous with respect to the spoken language, the speaker, or the common acoustic characteristics of definite classes of speakers. The novelty of ALLdio SAn is that it integrates the standalone sound- and speech-analytical technologies and synthesizes them into a unified whole.

An ALLdio SAn application can embody arbitrarily many acoustic attribute models, and the services available in the application are determined by the constituents of its acoustic-attribute-model repertoire. The Audio Monitoring system is a good example of the typical services of an ALLdio SAn application.

ALLdio ASR

ALLdio ASR is a novel, generic and complete technology developed at Applied Logic Laboratory. It offers a unified framework for developing any sort of Automatic Speech Recognition application that can work in multimodal contexts and convert multilingual speech passages to text.

In terms of functionality and architecture, the core of ALLdio ASR corresponds to state-of-the-art LVCSR (Large Vocabulary Continuous Speech Recognition) technology. In both cases, speech is recognized by an STT (speech-to-text) engine that is equipped with acoustic-level knowledge about the spoken version of a particular language and with linguistic knowledge about the customary use of the written version of the same language. Like conventional LVCSR technology, ALLdio ASR aims at understanding and transcribing the speech of arbitrary persons who speak the targeted language. The novelty of ALLdio ASR originates from the idea of preconditioning the speech-to-text conversion with the speech-analytical services of ALLdio SAn. Coupling speech recognition with speech analytics helps to loosen the limitations of conventional LVCSR in recognition accuracy, and endows ALLdio ASR-based applications with certain functions that conventional LVCSR technology is unable to support.

ALLdio SAn-style speech analytics cuts the speech passages into relatively small quanta such that each quantum is homogeneous with respect to both the language spoken in it and the voice characteristics of the speaker. The voice characteristics may assign the speaker to a distinct speaker class (by gender, age, foreign accent, local dialect, etc.) and/or identify the speaker him/herself. This allows the proper linguistic and/or acoustic knowledge to be chosen for the STT engine at each speech quantum before the recognition of its content starts. The dynamic changeability of both the acoustic and the linguistic knowledge creates the conditions for ALLdio ASR-based applications to operate in multilingual contexts. In addition, the possibility of switching to the proper speaker-class-specific or speaker-specific acoustic knowledge may yield a significant improvement in the recognition accuracy of the targeted ALLdio ASR application without losing its expected speaker-independent nature.
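The per-quantum choice of knowledge described above can be sketched as a lookup that prefers the most specific acoustic model available and falls back to a speaker-independent one. This is a minimal Python sketch under stated assumptions: the `Quantum` record, the `select_knowledge` function and the `(language, variant)` keying scheme are hypothetical and are not taken from ALLdio ASR.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Quantum:
    """Hypothetical result of speech analytics for one speech quantum."""
    language: str
    speaker_class: Optional[str] = None  # e.g. a gender, age or accent class
    speaker_id: Optional[str] = None     # set only if the speaker was identified


def select_knowledge(quantum: Quantum,
                     acoustic_models: Dict[Tuple[str, Optional[str]], str],
                     language_models: Dict[str, str]) -> Tuple[str, str]:
    """Return (acoustic model, language model) for one quantum.

    Acoustic models are keyed by (language, variant); the variant None stands
    for the speaker-independent model of that language (an assumed convention).
    """
    # Try the most specific variant first, then fall back step by step.
    variants = [v for v in (quantum.speaker_id, quantum.speaker_class) if v is not None]
    variants.append(None)
    for variant in variants:
        model = acoustic_models.get((quantum.language, variant))
        if model is not None:
            return model, language_models[quantum.language]
    raise KeyError(f"no acoustic model for language {quantum.language!r}")


# Made-up model names, for illustration only.
acoustic = {("en", None): "en-generic", ("en", "female"): "en-female",
            ("hu", None): "hu-generic"}
linguistic = {"en": "en-lm", "hu": "hu-lm"}

print(select_knowledge(Quantum("en", speaker_class="female"), acoustic, linguistic))
# prints ('en-female', 'en-lm')
print(select_knowledge(Quantum("hu", speaker_class="female"), acoustic, linguistic))
# prints ('hu-generic', 'hu-lm')
```

The fallback to the `None` variant is what keeps the sketch speaker-independent: an unknown speaker is still recognized, only without the benefit of a more specific acoustic model.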

Other services of ALLdio ASR equip the end product with functions that conventional LVCSR technology is unable to support. The speaker-tagging service, for instance, allows the passages of the textual transcription to be labelled with the name of the person who is speaking there.
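As an illustration of what speaker tagging produces, the following Python sketch merges consecutive recognized quanta of the same speaker and prefixes each passage with the speaker's name. The input format (pairs of speaker name and recognized text), the `tag_transcript` function and the 'Unknown speaker' placeholder are assumptions for the example, not the format used by ALLdio ASR.

```python
from itertools import groupby
from typing import List, Optional, Tuple

# Hypothetical input: one (speaker name or None, recognized text) pair per quantum.
RecognizedQuantum = Tuple[Optional[str], str]


def tag_transcript(quanta: List[RecognizedQuantum]) -> List[str]:
    """Merge consecutive quanta of the same speaker into labelled passages."""
    passages = []
    for speaker, run in groupby(quanta, key=lambda quantum: quantum[0]):
        text = " ".join(piece for _, piece in run)
        passages.append(f"{speaker or 'Unknown speaker'}: {text}")
    return passages


# Made-up recognition results, for illustration only.
sample = [
    ("Anna", "Good morning."),
    ("Anna", "Shall we start?"),
    (None, "Yes."),
    ("Bela", "Agreed."),
]
for line in tag_transcript(sample):
    print(line)
# prints:
# Anna: Good morning. Shall we start?
# Unknown speaker: Yes.
# Bela: Agreed.
```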

Although natively prepared for LVCSR purposes, the ALLdio ASR technology can easily be configured and adapted to the specific needs of any sort of speech-recognizing application. The flexibility of ALLdio ASR is exemplified by five different sorts of use cases: minutes transcription from recorded audio data, video subtitling, a vocal commander, the group of ALLdio Dictation Systems, and the group of ALLdio Dialogue Systems.

ALLdio IE

The term ALLdio IE stands for a generic technology for recognizing the overall information content of finite or infinite streams of audio-form data, and for expressing the recognized information in terms of a natural language.

ALLdio IE relies upon the theoretical results of a novel semantic approach that has been developed at Applied Logic Laboratory and named Aural Semantics – where the concept of ‘semantics’ is used in the sense of giving textual interpretation to non-textual information.

Aural Semantics models the whole process of audio-content understanding and interpretation as a strictly supervised activity of knowledge extraction, in the course of which a (finite or infinite) stream of audio-form data is gradually mapped into an organized body of textual information.

Aural Semantics is turned into technology by making use of the offerings of ALLdio SAn and ALLdio ASR, and of several sound-technological results relevant to the subject. Complying with the aural semantic rules, the technological constituents of ALLdio IE are prepared for (1) segmenting a finite or infinite audio stream into independently understandable pieces of separate information content, (2) interpreting the audio segments in textual form on different levels of abstraction, (3) deriving the topics of speech segments from their textual interpretation, (4) estimating the informational value of the segments, and (5) reasoning about the nature and extent of any relationship between the revealed information content of the separately understood and interpreted audio pieces.

The core of an ALLdio IE-based software system is functionally and architecturally determined by the underlying Aural Semantics. However, the effective services of the targeted system can be flexibly adjusted to the specific needs of that system. The technological impact of Aural Semantics is exemplified by two ALLdio IE-based applications. The use of the Audio Reverse Engineer in the context of audio archive management creates the conditions for querying and retrieving the relevant pieces of the archive by on-demand textual searches for topics, facts or concepts, as well as for logical combinations of these. As a component of an artificial cognitive system, the Simultaneous Audio Interpreter may serve the other components of the system by continuously listening to the environment and giving the necessary information about the audible events.
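The archive searches for logical combinations of topics, mentioned above for the Audio Reverse Engineer, could be pictured as composable predicates over the topics derived for each audio piece. The Python sketch below is purely illustrative: `ArchiveSegment`, `topic`, `all_of`, `any_of`, `negate` and `search` are hypothetical names, and nothing here reflects the actual query interface of that application.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class ArchiveSegment:
    """Hypothetical archive entry: an audio piece with its derived topics."""
    segment_id: str
    topics: FrozenSet[str]


Query = Callable[[ArchiveSegment], bool]


def topic(name: str) -> Query:
    """Match segments whose derived topics contain the given topic."""
    return lambda segment: name in segment.topics


def all_of(*queries: Query) -> Query:
    """Logical AND of the given queries."""
    return lambda segment: all(query(segment) for query in queries)


def any_of(*queries: Query) -> Query:
    """Logical OR of the given queries."""
    return lambda segment: any(query(segment) for query in queries)


def negate(query: Query) -> Query:
    """Logical NOT of the given query."""
    return lambda segment: not query(segment)


def search(archive: List[ArchiveSegment], query: Query) -> List[str]:
    """Return the identifiers of all segments that satisfy the query."""
    return [segment.segment_id for segment in archive if query(segment)]


# Made-up archive content, for illustration only.
archive = [
    ArchiveSegment("a1", frozenset({"weather", "traffic"})),
    ArchiveSegment("a2", frozenset({"weather", "sports"})),
    ArchiveSegment("a3", frozenset({"politics"})),
]
print(search(archive, all_of(topic("weather"), negate(topic("sports")))))
# prints ['a1']
```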

Technological supplements

The main constituents of the ALLdio technology are complemented by appropriate toolkits for training attribute models for an ALLdio SAn application, for modelling the acoustic characteristics of the spoken version of a particular language, for modelling the common and specific ways of using the written version of that language, for reasoning about the topics of texts resulting from speech-to-text conversions, as well as for formally evaluating the quality of an ALLdio-based application.

An overview

An overview of the architecture and overall functionality of the complete ALLdio technology is available at alldio.eu. (Certain pages of the website are not accessible at present because they are under construction, completion and/or correction.)