Conversational Guide to Speech Synthesis

Welcome to our conversational guide to speech synthesis. You're about to learn all the essential information about this fantastic technology. Feel free to read the contents of this guide in order or jump straight to the section that sparks your interest. Here's a list of topics covered in this guide:

Genesis of Speech Synthesis
Definition of Speech Synthesis
The Way Speech Synthesis Works
Effectiveness of Speech Synthesis
Benefits of Speech Synthesis
Challenges of Speech Synthesis
Applications of Speech Synthesis
Conclusion

What’s the Genesis of Speech Synthesis?

Here's a list of several essential facts that mark the genesis of electronic speech synthesis:

Name of the First Machine that Synthesised Human Speech Electronically
Creation Date of the First Machine that Synthesised Human Speech Electronically
Location Where the First Machine that Synthesised Human Speech Electronically Was Created
Inventors of the First Machine that Synthesised Human Speech Electronically
The Way the First Machine that Synthesised Speech Electronically Worked
Effectiveness of the First Machine that Synthesised Human Speech Electronically
Impact of the First Machine that Synthesised Human Speech Electronically

What’s the Name of the First Machine that Synthesised Human Speech?

The first machine that could electronically synthesise human speech was the Voder, short for Voice Operating Demonstrator.

When Was the First Human Speech Synthesiser Created?

The first machine that synthesised human speech electronically was created between 1937 and 1938.

Where Was the First Machine that Synthesised Human Speech Created?

The first electronic synthesiser of human speech was created at Bell Labs.

Who Invented the First Human Speech Synthesiser?

Homer Dudley invented [the first electronic speech synthesiser](#whats-the-name-of-the-first-machine-that-synthesised-human-speech).

How Did the First Machine that Synthesised Human Speech Work?

The first electronic synthesiser of human speech imitated the effects of the human vocal tract. It had two basic sound sources: a relaxation oscillator that produced a buzz for voiced sounds, such as vowels and nasals, and a noise source that produced a hiss for unvoiced sounds. The selected source sound went through 10 key-controlled band-pass filters, and the filter outputs were combined, amplified, and played on a loudspeaker.
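The signal path described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a sawtooth wave stands in for the relaxation oscillator, a white-noise `hiss` source stands in for the kind of source used for unvoiced sounds, and the centre frequencies in `BAND_CENTRES_HZ`, the filter sharpness `q`, and the sample rate are hypothetical values, not the historical ones.

```python
import math
import random

SAMPLE_RATE = 8000
# Hypothetical centre frequencies for the 10 bands (not the historical values).
BAND_CENTRES_HZ = [225, 450, 700, 1000, 1400, 1800, 2200, 2600, 3000, 3400]


def buzz(n, pitch_hz=110.0):
    """Voiced source: a sawtooth wave standing in for the relaxation oscillator."""
    return [2.0 * ((i * pitch_hz / SAMPLE_RATE) % 1.0) - 1.0 for i in range(n)]


def hiss(n, seed=0):
    """Unvoiced source: white noise."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def band_pass(samples, centre_hz, q=4.0):
    """Second-order band-pass filter (biquad with unity gain at the centre)."""
    w0 = 2.0 * math.pi * centre_hz / SAMPLE_RATE
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in samples:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def voder(source, key_levels):
    """Filter the source through each band, scale by its key level, and mix."""
    mixed = [0.0] * len(source)
    for centre_hz, level in zip(BAND_CENTRES_HZ, key_levels):
        if level == 0:
            continue  # key not pressed: this band contributes nothing
        for i, y in enumerate(band_pass(source, centre_hz)):
            mixed[i] += level * y
    return mixed


# Press the 3rd and 4th keys while the buzz source is selected.
vowel_like = voder(buzz(800), [0, 0, 1.0, 0.6, 0, 0, 0, 0, 0, 0])
```

Changing which keys are "pressed" changes the spectral shape of the output, which is roughly what the operator did by hand.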

How Effective Was the First Human Speech Synthesiser?

The first human speech synthesiser produced speech of limited quality and relied on manually controlled filters that required highly trained operators.

What Was the Impact of the First Machine that Synthesised Human Speech?

The first speech synthesiser demonstrated human voice synthesis, which laid the foundations for voice communication and helped save bandwidth and improve security by enabling voice encryption.

What’s Speech Synthesis?

Speech synthesis is the artificial production of spoken language, used to communicate with users when reading on a screen is either not possible or inconvenient. It works by converting written text into speech, so it’s often called text-to-speech (TTS). The reverse process is called automatic speech recognition.

How Does Speech Synthesis Work?

The technology behind speech synthesis has evolved over the last few decades, but no matter the level of sophistication, a speech synthesiser has to implement the following high-level steps:

Text Normalisation
Symbolic Linguistic Representation
Sound Synthesis

What’s Text Normalisation?

Text normalisation is also called text pre-processing or tokenisation. It transforms text into a canonical form by converting symbols, numbers, and abbreviations into written-out equivalents, guaranteeing consistency before further operations.
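As a rough illustration of this step, the sketch below expands a few abbreviations, symbols, and small numbers. The lookup tables `ABBREVIATIONS` and `SYMBOLS` are hypothetical and tiny, numbers are only spelled out up to 999, and lowercasing everything is a simplification that a production system would not make.

```python
import re

# Hypothetical lookup tables; a real system would use far larger ones.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}
SYMBOLS = {"%": " percent", "&": " and ", "+": " plus "}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" and " + number_to_words(rest) if rest else "")


def _expand_number(match):
    value = int(match.group())
    # Leave larger numbers untouched rather than guess how to read them.
    return number_to_words(value) if value < 1000 else match.group()


def normalise(text):
    """Convert abbreviations, symbols, and small numbers into written-out words."""
    tokens = [ABBREVIATIONS.get(token, token) for token in text.lower().split()]
    text = " ".join(tokens)
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, word)
    text = re.sub(r"\d+", _expand_number, text)
    return re.sub(r"\s+", " ", text).strip()


print(normalise("Dr. Smith paid 25% & left"))
# doctor smith paid twenty-five percent and left
```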

What’s a Symbolic Linguistic Representation?

Symbolic linguistic representation is a process that combines phonetic transcription with prosody information to process normalised text and describe each utterance. It uses symbols to represent linguistic information, such as phonetics, morphology, syntax, or semantics.
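One way to picture such a representation is as a small data structure that pairs each phonetic symbol with prosody values. The class and field names below (`Phone`, `Utterance`, `duration_ms`, `pitch_hz`) are hypothetical and chosen only for illustration; real systems store considerably more information.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Phone:
    symbol: str        # phonetic symbol, for example an IPA character
    duration_ms: int   # prosody: how long the sound lasts
    pitch_hz: float    # prosody: target pitch (0.0 for unvoiced sounds)


@dataclass
class Utterance:
    text: str                                  # the normalised text
    phones: List[Phone] = field(default_factory=list)

    def total_duration_ms(self):
        return sum(phone.duration_ms for phone in self.phones)


# The word "hi" described as two phones with made-up prosody values.
example = Utterance("hi", [Phone("h", 60, 0.0), Phone("aɪ", 180, 120.0)])
print(example.total_duration_ms())  # 240
```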

What’s Phonetic Transcription?

Phonetic transcription (also called text-to-phoneme conversion) is the visual representation of speech sounds (or phones) using symbols. The most common notation used for phonetic transcription is a phonetic alphabet, such as the International Phonetic Alphabet (IPA).

How Does Phonetic Transcription Work?

Nearly all phonetic transcription systems rely on a combination of two different approaches:

Dictionary-Based Approach
Pronunciation Rules

How Does the Dictionary-Based Approach to Phonetic Transcription Work?

The dictionary-based approach to phonetic transcription depends on a large list of words and their correct pronunciations. Determining the appropriate pronunciation requires looking up a word in a dictionary and replacing it with its pronunciation.
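A minimal sketch of the lookup, assuming a tiny hypothetical dictionary named `PRONUNCIATIONS` that uses ARPAbet-style symbols without stress markers:

```python
# Tiny hypothetical pronunciation dictionary (ARPAbet-style, no stress markers).
PRONUNCIATIONS = {
    "speech": ["S", "P", "IY", "CH"],
    "is": ["IH", "Z"],
    "fun": ["F", "AH", "N"],
}


def transcribe(text):
    """Replace each known word with its pronunciation; collect unknown words."""
    phonemes = []
    unknown = []
    for word in text.lower().split():
        if word in PRONUNCIATIONS:
            phonemes.extend(PRONUNCIATIONS[word])
        else:
            unknown.append(word)
    return phonemes, unknown


print(transcribe("Speech is fun"))
```

The `unknown` list shows the main weakness of a pure lookup: any word missing from the dictionary cannot be pronounced at all.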

How Do Pronunciation Rules Help with Phonetic Transcription?

Pronunciation rules help with phonetic transcription by applying conventions based on the spellings of words to determine the correct pronunciation. It’s similar to the "sounding out" approach to learning to read, where the teacher first teaches the letter sounds and then combines them with blending rules to build up whole words.
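The sounding-out idea can be sketched as a list of spelling-to-sound rules applied from left to right, trying longer letter groups first. The rule table `RAW_RULES` is a small hypothetical sample with ARPAbet-style symbols, nowhere near a complete rule set for English.

```python
# Hypothetical sample of letter-to-sound rules (ARPAbet-style symbols).
RAW_RULES = {
    "sh": "SH", "ch": "CH", "th": "TH", "ee": "IY", "oo": "UW",
    "a": "AE", "e": "EH", "i": "IH", "o": "AA", "u": "AH",
    "b": "B", "d": "D", "f": "F", "g": "G", "h": "HH", "k": "K",
    "l": "L", "m": "M", "n": "N", "p": "P", "r": "R", "s": "S", "t": "T",
}
# Try longer letter groups first so that "sh" wins over "s" followed by "h".
RULES = sorted(RAW_RULES.items(), key=lambda rule: -len(rule[0]))


def sound_out(word):
    """Turn a spelling into phonemes by applying the first matching rule."""
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        for letters, phoneme in RULES:
            if word.startswith(letters, i):
                phonemes.append(phoneme)
                i += len(letters)
                break
        else:
            i += 1  # no rule for this letter: skip it
    return phonemes


print(sound_out("sheep"))  # ['SH', 'IY', 'P']
```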

What’s Prosody Information?

Prosody information consists of elements of speech that aren't part of individual phonetic segments but are properties of larger units such as syllables, phrases, clauses, and sentences. Thanks to technological advancements, it's now possible to produce very natural-sounding speech that includes changes to pitch, rate, pronunciation, and inflexion using deep learning.

How Does Sound Synthesis Work?

Sound synthesis is the last step of speech synthesis. It converts the symbolic linguistic representation into sound. Simple speech synthesisers work by concatenating prerecorded speech units stored in a database. More sophisticated voice synthesisers rely on deep learning to model the human voice and produce output that resembles natural voice characteristics.
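The concatenative approach can be illustrated with the standard library alone. In this sketch, short sine tones stand in for prerecorded speech units, so the result will not sound like speech; `UNIT_DATABASE`, `fake_unit`, and the chosen frequencies are hypothetical placeholders.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16000


def fake_unit(freq_hz, duration_s=0.1):
    """Stand-in for a prerecorded speech unit: a short sine tone."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2.0 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Hypothetical unit database keyed by phoneme symbol.
UNIT_DATABASE = {
    "HH": fake_unit(300),
    "AH": fake_unit(440),
    "L": fake_unit(350),
    "OW": fake_unit(520),
}


def synthesise(phonemes):
    """Concatenate the stored unit for each phoneme into one sample list."""
    samples = []
    for phoneme in phonemes:
        samples.extend(UNIT_DATABASE[phoneme])
    return samples


def to_wav_bytes(samples):
    """Encode samples in the range [-1, 1] as 16-bit mono WAV data."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        frames = b"".join(struct.pack("<h", int(s * 32767)) for s in samples)
        wav.writeframes(frames)
    return buffer.getvalue()


audio = to_wav_bytes(synthesise(["HH", "AH", "L", "OW"]))
```

Real concatenative systems also smooth the joins between units, which this sketch leaves out.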

How Effective Is Speech Synthesis?

The effectiveness of speech synthesis depends on how closely the output resembles a natural voice that people can recognise and understand. Although the deep learning models used in modern speech synthesisers can produce spoken output that's easy to understand and can even mimic accents characteristic of specific locations, it remains distinguishable from an actual human voice.

What Are the Benefits of Speech Synthesis?

Speech synthesis makes applications more accessible, allowing people to consume and comprehend information without focusing on screens or hand gestures. Here's a list of some of the most important benefits of using text-to-speech:

Accessibility
Enhanced Learning
Improved Mobility

How Does Speech Synthesis Help to Improve Accessibility?

Text-to-speech offers people who cannot read, whether because of an impairment or literacy challenges, an alternative way to access information. It lowers communication barriers and helps to ensure that no one is left behind.

How Does Speech Synthesis Help to Enhance Learning?

By enabling multimodal presentation, in which each modality carries a specific emotional or informational aspect of communication, speech synthesis helps to improve comprehension, recall, and motivation.

How Does Speech Synthesis Help to Improve Mobility?

Text-to-speech turns digital interactions into multimedia experiences that don't require intense attention and free up hands, making it possible for people to listen to audio responses while on the go or doing other activities simultaneously.

What Are the Challenges of Speech Synthesis?

Although speech synthesis has numerous benefits, it's not free from limitations. Here's a list of some of the most important challenges of text-to-speech:

Challenges of Text Normalisation
Phonetic Transcription Challenges
Evaluation Challenges
Prosodic and Emotional Content Challenges

What Are Text Normalisation Challenges of Speech Synthesis?

Normalising text is rarely straightforward because the speech synthesiser has to convert numbers, abbreviations, and heteronyms into a form it can pronounce. Numbers and abbreviations can be read in several ways, and heteronyms are words that share a spelling but are pronounced differently depending on the context.
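To make the context problem concrete, here is a deliberately naive sketch that picks a pronunciation by counting cue words in the sentence. The `HETERONYMS` table and its cue words are hypothetical; real systems rely on part-of-speech tagging or statistical models rather than keyword matching.

```python
# Hypothetical table: each candidate pronunciation comes with context cue words.
HETERONYMS = {
    "bass": [
        ({"guitar", "play", "music"}, "B EY S"),   # the instrument
        ({"fish", "lake", "caught"}, "B AE S"),    # the fish
    ],
    "lead": [
        ({"pipe", "metal"}, "L EH D"),             # the metal
        ({"team", "will"}, "L IY D"),              # the verb
    ],
}


def pronounce(word, sentence):
    """Pick the pronunciation whose cue words overlap most with the sentence."""
    options = HETERONYMS.get(word.lower())
    if not options:
        return None  # not a known heteronym
    context = set(sentence.lower().split())
    cues, pronunciation = max(options, key=lambda option: len(option[0] & context))
    return pronunciation


print(pronounce("bass", "he caught a bass in the lake"))  # B AE S
print(pronounce("bass", "she plays bass guitar"))         # B EY S
```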

What Are Phonetic Transcription Challenges of Speech Synthesis?

Languages with regular spelling systems rely mostly on pronunciation rules and use dictionaries only for exceptions. The opposite approach, a large dictionary with rules as a fallback for unknown words, gives better results for languages with irregular spelling systems, such as English.
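The interplay between the two approaches can be sketched as a lookup order: check a dictionary of exceptions first, and fall back to rules otherwise. Both tables below (`EXCEPTIONS`, `LETTER_SOUNDS`) are tiny hypothetical samples using ARPAbet-style symbols, and the one-letter-per-sound rules are far cruder than any real rule set.

```python
# Words whose spelling does not predict their pronunciation.
EXCEPTIONS = {
    "colonel": ["K", "ER", "N", "AH", "L"],
    "one": ["W", "AH", "N"],
}
# Crude one-letter-per-sound rules, used only when no exception applies.
LETTER_SOUNDS = {
    "a": "AE", "b": "B", "d": "D", "e": "EH", "g": "G", "i": "IH",
    "n": "N", "o": "AA", "p": "P", "s": "S", "t": "T", "u": "AH",
}


def apply_rules(word):
    return [LETTER_SOUNDS[letter] for letter in word if letter in LETTER_SOUNDS]


def transcribe(word):
    """Return the phonemes and which approach produced them."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word], "dictionary"
    return apply_rules(word), "rules"


print(transcribe("pin"))      # (['P', 'IH', 'N'], 'rules')
print(transcribe("Colonel"))  # (['K', 'ER', 'N', 'AH', 'L'], 'dictionary')
```

For a language with irregular spelling, the dictionary would hold most of the vocabulary and the rules would only cover words it has never seen.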

What Are Evaluation Challenges of Speech Synthesis?

There are no universally accepted rules for evaluating the results of speech synthesis. Another problem is that the end user's device is responsible for converting synthesised voice into sound, and the audio quality that this device can produce affects the overall result.

What Are Prosodic and Emotional Content Challenges of Speech Synthesis?

People use oral and non-verbal expressions while they communicate using their voices. Recognising the meaning behind intonation, tempo, or tone of voice comes naturally to human beings. Machines, however, have to rely on additional metadata such as lexicons and SSML tags to specify these subtle emotional cues.
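As an illustration of such metadata, the sketch below builds a small SSML fragment with Python's standard XML library. It uses the SSML elements `speak`, `prosody`, `emphasis`, and `break`; the helper name `build_ssml` and its default values are hypothetical, and a complete SSML document would also carry version and namespace attributes that are omitted here.

```python
import xml.etree.ElementTree as ET


def build_ssml(sentence, emphasised_word, rate="slow", pitch="+10%", pause_ms=300):
    """Wrap a sentence in prosody markup, stress one word, and add a pause."""
    speak = ET.Element("speak")
    prosody = ET.SubElement(speak, "prosody", rate=rate, pitch=pitch)

    # Split the sentence around the first occurrence of the word to stress.
    before, _, after = sentence.partition(emphasised_word)
    prosody.text = before
    emphasis = ET.SubElement(prosody, "emphasis", level="strong")
    emphasis.text = emphasised_word
    emphasis.tail = after

    # Insert a short silence after the sentence.
    ET.SubElement(speak, "break", time=f"{pause_ms}ms")
    return ET.tostring(speak, encoding="unicode")


print(build_ssml("I really mean it", "really"))
```

Which tags and attribute values are honoured varies between speech engines, so the markup should be checked against the documentation of the engine in use.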

What Are the Applications of Speech Synthesis?

Speech synthesis combined with automatic speech recognition and natural language processing powers conversational experiences that engage users with natural-sounding voices and fluid pronunciation. This combination makes high-quality voice interactions possible and enables the creation of sophisticated voice applications.

Conclusion

Voice synthesis has come a long way from the first machine that synthesised speech electronically to modern speech synthesis. Converting text to spoken language brings many benefits, but it's not free from challenges. Text-to-speech technology is an essential building block for creating conversational systems that enable two-way voice-based communication with users.