
AI Speech in Azure

Source: My personal notes and comments from course series Introduction to AI in Azure, Get started with speech in Azure - Training | Microsoft Learn

AI speech can help with managing systems by voice, getting answers from computers to spoken questions, and creating captions for audio and video.

Two capabilities are required for those functions:

  • Speech recognition - the ability to detect and interpret spoken input
  • Speech synthesis - the ability to generate spoken output

           synthesis
  Text ---------------> Speech
   /\                     |
   |                      |
   +----------------------+
         transcription

Text -> Speech: synthesis
Speech -> Text: transcription
Can also do translation

Speech recognition takes spoken words and converts them into data that can be processed, usually by transcribing them into text. The audio is analyzed for speech patterns, which are then mapped to words. The software uses multiple models to do this work, including:

  • An acoustic model that converts the audio signal into phonemes (representations of specific sounds).
  • A language model that maps phonemes to words, usually using a statistical algorithm that predicts the most probable sequence of words based on the phonemes.
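
To make the two-model idea concrete for myself, here is a toy sketch of the language model step. Everything in it is made up for illustration: the lexicon, the bigram scores, and the `decode` function are hypothetical, and the phonemes arrive already grouped per word, which a real acoustic model would not hand over so neatly. It also picks words greedily, one at a time, instead of searching for the best whole sequence.

```python
# Toy pronunciation lexicon (hypothetical): phoneme sequence -> candidate words.
# Homophones share one entry, so the phonemes alone cannot decide between them.
LEXICON = {
    ("AY",): ["I", "eye"],
    ("W", "AA", "N", "T"): ["want"],
    ("T", "UW"): ["two", "to", "too"],
    ("G", "OW"): ["go"],
}

# Toy bigram scores (hypothetical): how likely a word is to follow the previous one.
BIGRAM = {
    ("<s>", "I"): 0.9, ("<s>", "eye"): 0.1,
    ("I", "want"): 1.0,
    ("want", "to"): 0.8, ("want", "two"): 0.15, ("want", "too"): 0.05,
    ("to", "go"): 1.0,
}


def decode(phoneme_groups):
    """Greedily pick, for each phoneme group, the word most likely to follow the previous word."""
    words = []
    prev = "<s>"  # start-of-sentence marker
    for group in phoneme_groups:
        candidates = LEXICON[tuple(group)]
        best = max(candidates, key=lambda w: BIGRAM.get((prev, w), 0.0))
        words.append(best)
        prev = best
    return words


phonemes = [["AY"], ["W", "AA", "N", "T"], ["T", "UW"], ["G", "OW"]]
print(decode(phonemes))  # ['I', 'want', 'to', 'go']
```

The point of the sketch is the homophone case: "two", "to", and "too" sound the same, so only the surrounding words let the language model choose "to".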

Speech synthesis vocalizes data, typically by converting text to speech, and requires the following information:

  • Text to be spoken
  • Voice to be used to vocalize speech

The system typically tokenizes the text and assigns phonetic sounds to each word. The phonetic transcription is then broken into prosodic units, such as phrases or sentences, to create phonemes. The phonemes are converted to an audio format, and settings such as voice, rate, pitch, and volume can be applied.
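
Settings such as voice, rate, pitch, and volume are commonly expressed in SSML (Speech Synthesis Markup Language). The sketch below builds such a document with Python's standard library only. The `build_ssml` helper is my own name, and the default voice `en-US-JennyNeural` is just an example; check which voices are available in your resource.

```python
import xml.etree.ElementTree as ET


def build_ssml(text, voice="en-US-JennyNeural", rate="medium", pitch="medium", volume="medium"):
    """Return an SSML string that wraps `text` in a voice and prosody settings."""
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": "en-US",
    })
    voice_el = ET.SubElement(speak, "voice", {"name": voice})
    prosody = ET.SubElement(voice_el, "prosody", {
        "rate": rate,      # for example "slow", "medium", "fast"
        "pitch": pitch,    # for example "low", "medium", "high"
        "volume": volume,  # for example "soft", "medium", "loud"
    })
    prosody.text = text  # special characters are escaped on serialization
    return ET.tostring(speak, encoding="unicode")


print(build_ssml("Hello, this is a test.", rate="slow", pitch="high"))
```

Building the XML with `ElementTree` instead of string formatting means characters like `&` or `<` in the text are escaped automatically.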

Azure AI Speech has different capabilities, including translation:

  • For example, English → French

Speech capabilities include:

  • Recognition - identification of speech
    • Call transcription, for example in meetings
  • Synthesis
    • Creating speech
    • Creating a voice from a recording, such as your own voice
    • Avatar + speech

The API can perform real-time or batch transcription of audio into text from a microphone or an audio file. The model used is optimized for conversation and dictation. Custom models can be used for acoustics, language, and pronunciation.
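
As a rough sketch of transcribing a short audio file over REST, the code below only builds the HTTP request and does not send it. It assumes the speech-to-text REST API for short audio and a 16 kHz PCM WAV file; the `build_recognition_request` helper and its parameter names are my own, and the key and region are placeholders for the values from your own resource.

```python
import urllib.parse
import urllib.request


def build_recognition_request(audio_bytes, key, region, language="en-US"):
    """Build (but do not send) a POST request for the short-audio speech-to-text REST endpoint."""
    query = urllib.parse.urlencode({"language": language, "format": "simple"})
    url = (
        f"https://{region}.stt.speech.microsoft.com"
        f"/speech/recognition/conversation/cognitiveservices/v1?{query}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # resource key (placeholder)
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=audio_bytes, headers=headers, method="POST")


# Placeholder values; replace with real audio bytes, key, and region.
request = build_recognition_request(b"", key="YOUR_KEY", region="westeurope")
print(request.get_method(), request.full_url)
```

To actually send it, pass the request to `urllib.request.urlopen` and parse the JSON response; with the simple format, the recognized text should be in the `DisplayText` field.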

Batch transcription runs asynchronously, and jobs are scheduled on a best-effort basis.

The API can convert text input to audible speech, which can be played through a computer speaker or written to an audio file. The voice can be selected, and the speech synthesis can be personalized.
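
A minimal sketch of writing synthesized speech to a file with the Python Speech SDK, assuming the `azure-cognitiveservices-speech` package is installed. The `synthesize_to_file` wrapper, its parameters, and the default voice are my own choices for illustration; the key and region come from your own resource.

```python
def synthesize_to_file(text, out_path, key, region, voice="en-US-JennyNeural"):
    """Synthesize `text` into an audio file at `out_path`; return True on success."""
    # Imported here so this module still loads where the SDK is not installed.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    speech_config.speech_synthesis_voice_name = voice  # select the voice

    # Write to a file instead of playing through the default speaker.
    audio_config = speechsdk.audio.AudioOutputConfig(filename=out_path)
    synthesizer = speechsdk.SpeechSynthesizer(
        speech_config=speech_config, audio_config=audio_config
    )

    result = synthesizer.speak_text_async(text).get()
    return result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted
```

A call would look like `synthesize_to_file("Hello from my notes", "hello.wav", key="YOUR_KEY", region="westeurope")`. To play through the speaker instead, `AudioOutputConfig(use_default_speaker=True)` can be used in place of the filename.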

The service has predefined voices with support for multiple languages and regional pronunciations. Neural voices, which use neural networks, produce more natural-sounding speech with intonation.

Azure AI Speech can be used through the Studio interface, the CLI, REST APIs, and SDKs.

Either a Speech resource or an Azure AI services resource is needed in Azure.