Python speech synthesis guide

Article source: Fanfan's Python Learning Road

Author: a grain of rice

Overview of Speech Synthesis

In general, speech synthesis is the technique of producing artificial speech by mechanical or electronic means. TTS, short for Text-To-Speech, is one form of speech synthesis and a part of human-machine dialogue: it converts text, whether generated by the computer itself or supplied from outside, into intelligible and fluent speech. In this article, speech synthesis refers to TTS. The Lin Zhiling and Guo Degang navigation voices that people use in daily life are implemented with TTS.

Speech synthesis methods

Here we briefly cover the traditional methods of speech synthesis and the deep learning methods of recent years. Readers who are not interested in this part can skip it without affecting the rest of the article.

Traditional methods

The traditional methods of speech synthesis can be divided into two types: concatenation and parameterization.

  • Concatenation: the required speech fragments are selected from a speech database and joined to form the complete utterance. This method requires a large amount of recorded speech, is inflexible, and cannot produce speech segments that are not already in the database.
  • Parameterization: a recorded human voice is described by a function with parameters, and the voice is changed by adjusting those parameters. This traditional method is relatively labor-intensive.
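The concatenation idea can be illustrated with a toy sketch. It assumes a hypothetical unit database (`UNIT_DB`) in which short sine tones stand in for recorded speech fragments; a real system would store recorded units and smooth the joins. The names `tone`, `concatenate`, and `to_pcm16` are made up for this illustration.

```python
import math
import struct

SAMPLE_RATE = 16000  # samples per second (assumed for this toy example)

def tone(freq_hz, duration_ms, amplitude=0.3):
    """Return float samples of a sine tone; stands in for one recorded unit."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n)]

# Hypothetical unit database: unit label -> stored "recording".
UNIT_DB = {
    "a": tone(440.0, 100),
    "b": tone(554.0, 100),
    "c": tone(659.0, 100),
}

def concatenate(units, db=UNIT_DB):
    """Join stored units in order; a unit missing from the database is an error."""
    samples = []
    for unit in units:
        if unit not in db:
            raise KeyError(f"unit {unit!r} is not in the database")
        samples.extend(db[unit])
    return samples

def to_pcm16(samples):
    """Pack float samples in [-1, 1] as 16-bit little-endian PCM bytes."""
    ints = [int(s * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)

speech = concatenate(["a", "c", "b"])
pcm = to_pcm16(speech)
```

The `KeyError` branch mirrors the limitation described above: anything that is not in the database cannot be synthesized.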

Deep learning based methods

With the continuous development of neural networks in recent years, deep learning has also been widely applied to speech synthesis. Representative work includes:

  • WaveNet: A Generative Model for Raw Audio. WaveNet is an audio generation model based on PixelCNN that can generate sounds similar to those made by humans.
  • Tacotron: Towards End-to-End Speech Synthesis. Tacotron is a seq2seq model that consists of an encoder, an attention-based decoder, and a post-processing network.
  • Deep Voice 1: Real-time Neural Text-to-Speech. A text-to-speech system built entirely from deep neural networks.
  • Deep Voice 2: Multi-Speaker Neural Text-to-Speech. It is similar to Deep Voice 1 but offers significant improvements in audio quality. The model can learn hundreds of unique voices from less than half an hour of speech data per speaker.
  • Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. The authors propose a fully convolutional character-to-spectrogram framework that enables fully parallel computation. The framework is an attention-based sequence-to-sequence model, trained on the LibriSpeech ASR dataset.
  • Parallel WaveNet: Fast High-Fidelity Speech Synthesis. This model from Google introduces a method called Probability Density Distillation, which trains a parallel feedforward network from a trained WaveNet. The method combines the best features of Inverse Autoregressive Flow (IAF) and WaveNet: the efficient training of WaveNet and the efficient sampling of IAF networks.
  • Neural Voice Cloning with a Few Samples. This model from Baidu introduces a neural voice cloning system that learns to synthesize a person's voice from a small number of audio samples.
  • VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop. This paper from Facebook introduces a neural text-to-speech (TTS) technique that converts text to speech using voices collected in the wild.
  • Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. This paper from Google and UC Berkeley introduces Tacotron 2, a neural network architecture for text-to-speech synthesis. It consists of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model that acts as a vocoder and synthesizes time-domain waveforms from those spectrograms.
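Several of the models above predict mel-scale spectrograms, so it may help to see what "mel-scale" means numerically. The sketch below uses one commonly quoted conversion formula, mel = 2595 · log10(1 + f / 700); this choice is an assumption for illustration, other variants exist, and it is not taken from any of the papers listed. The helper names are hypothetical.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (one common formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_bands):
    """Edges (in Hz) of n_bands bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / n_bands
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 1)]

# Example: 8 bands between 0 Hz and 8 kHz (half of a 16 kHz sample rate).
edges = mel_band_edges(0.0, 8000.0, 8)
```

Bands that are equally wide in mels become wider in Hz as frequency rises, which roughly follows how human hearing resolves pitch.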

Python speech synthesis

There are many ways to do speech synthesis with Python. Here are some typical open source libraries and Chinese speech platforms for readers' reference. Since Google's services cannot be used directly, they are not included in the comparison, and speech synthesis methods specific to Windows are also out of scope.

  1. Open source library: pyttsx3

pyttsx3 is an open source offline speech synthesis library. It can be used after installing it with pip. The installation command is as follows:

$ pip install pyttsx3

Advantages: Free and easy to use

Disadvantages: The quality of the synthesized voice is average
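A minimal usage sketch for pyttsx3, assuming the package is installed and a system speech engine is available. The wrapper function `speak` and its parameter names are hypothetical; the calls inside it (`init`, `setProperty`, `say`, `save_to_file`, `runAndWait`) are the library's engine API.

```python
def speak(text, rate=150, out_file=None):
    """Speak text aloud, or save it to out_file when a path is given."""
    # Imported here so that this sketch loads even before pyttsx3 is installed.
    import pyttsx3

    engine = pyttsx3.init()           # picks the platform's default speech driver
    engine.setProperty("rate", rate)  # speaking rate, in words per minute
    if out_file is None:
        engine.say(text)              # queue the text for playback
    else:
        engine.save_to_file(text, out_file)  # queue the text for file output
    engine.runAndWait()               # block until the queued commands finish
```

Calling `speak("Hello, speech synthesis")` plays the sentence through the default audio device, and passing `out_file="hello.wav"` writes it to a file instead. Which voices are available depends on the operating system.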

2. iFLYTEK

iFLYTEK offers a wide range of voices, including special voices, and performs speech synthesis through an API. It also provides control marks for polyphonic characters, silent pauses, numbers, and English reading.

Advantages: The synthesis quality is good, and the reading of polyphonic characters, pauses, and English can be controlled flexibly.

Disadvantages: The free tier is limited to 500 calls, which is often not enough in actual use.

3. Tencent

Tencent has several platforms that provide speech synthesis interfaces, including Tencent AI Lab, Tencent Youtu, and Tencent Cloud. The synthesis quality of the Tencent AI open platform is average. Tencent Youtu is currently free to try and does not limit the number of requests, but it does not guarantee QPS. Tencent Cloud's synthesis quality is good, and its free quota is 1 million characters per month, roughly the length of the novel Journey to the West. The free quota is reset on the 1st of each month and is usually enough.



Advantages: There are many choices, among which Tencent Youtu and Tencent Cloud have better speech synthesis results

Disadvantages: Unable to control polyphonic reading, number reading, English reading and pauses

4. Alibaba Cloud

The Alibaba Cloud speech synthesis interface now uses WebSocket requests and is charged per call.



Advantages: The synthesis quality is good, and there is a rich choice of voice models.

Disadvantages: Using it in production still costs money.

5. Baidu

Baidu supports both online and offline speech synthesis. Offline synthesis can only be used on two devices after personal authentication. Online synthesis is limited by QPS and by a validity period.



Advantages: The synthesized voice quality is acceptable, it is fairly simple to use, and the free quota is enough for development and testing.

Disadvantages: Using it in production still costs money.

Example development

Here we take Tencent Cloud's speech synthesis as an example to implement a simple speech synthesis script.

  1. Log in. Log in to Tencent Cloud's official website, registering an account first if you do not have one.
  2. Real-name authentication. If you have not completed it yet, do so in the Tencent Cloud Account Center.
  3. Activate the speech synthesis service. Open the speech synthesis console and activate the speech synthesis function.



4. Enter the key management interface, click New Key to generate SecretId and SecretKey, which are used to generate signatures for API calls.



5. Use Python to call the interface for speech synthesis. TCLOUD_APP_ID, TCLOUD_SECRET_ID, and TCLOUD_SECRET_KEY are the values obtained in the previous steps. The code is as follows:

# coding=UTF-8
import base64
import hashlib
import hmac
import json
import os
import time
import urllib.parse
import uuid
import wave

import requests

# Credentials from the Tencent Cloud console (placeholders: fill in your own values).
TCLOUD_APP_ID = 0
TCLOUD_SECRET_ID = ""
TCLOUD_SECRET_KEY = ""
OUTPUT_PATH = "./audio"

def generate_sign(request_data):
    # URL used in the signing string. The value is missing from this article;
    # take it from the official Tencent Cloud documentation.
    url = ""
    sign_str = "POST" + url + "?"
    # Append the parameters in ascending key order.
    sort_dict = sorted(request_data.keys())
    for key in sort_dict:
        sign_str = sign_str + key + "=" + urllib.parse.unquote(str(request_data[key])) + '&'
    sign_str = sign_str[:-1]
    sign_bytes = sign_str.encode('utf-8')
    key_bytes = TCLOUD_SECRET_KEY.encode('utf-8')
    # HMAC-SHA1 of the signing string, Base64-encoded.
    authorization = base64.b64encode(hmac.new(key_bytes, sign_bytes, hashlib.sha1).digest())
    return authorization.decode('utf-8')

def text2wav(content):
    request_data = {
        "Action": "TextToStreamAudio",
        "AppId": TCLOUD_APP_ID,
        # Returned audio format: the Python SDK only supports pcm.
        # pcm: returns binary PCM audio, simple to use but large in data volume.
        "Codec": "pcm",
        "Expired": int(time.time()) + 3600,
        # Model type, 1: default model
        "ModelType": 1,
        # Primary language type:
        # 1: Chinese (default)
        # 2: English
        "PrimaryLanguage": 1,
        # Project ID, user-defined, default is 0.
        "ProjectId": 0,
        # Audio sample rate:
        # 16000: 16k (default)
        # 8000: 8k
        "SampleRate": 16000,
        "SecretId": TCLOUD_SECRET_ID,
        "SessionId": str(uuid.uuid1()),
        # Speech rate, range: [-2, 2], corresponding to different speech rates:
        # -2 means 0.6 times
        # -1 means 0.8 times
        # 0 means 1.0 times (default)
        # 1 means 1.2 times
        # 2 means 1.5 times
        # Values other than the integers above do not take effect and are treated as the default.
        "Speed": 0,
        "Text": content,
        "Timestamp": int(time.time()),
        # 0: Affinity female voice (default)
        # 1: Affinity male voice
        # 2: Mature male voice
        # 3: Vibrant male voice
        # 4: Warm female voice
        # 5: Emotional female voice
        # 6: Emotional male voice
        "VoiceType": 5,
        # Volume, range: [0, 10], corresponding to 11 levels. The default value is 0,
        # representing normal volume. There is no mute option.
        "Volume": 5,
    }
    signature = generate_sign(request_data)
    # print(f"signature: {signature}")
    header = {
        "Content-Type": "application/json",
        "Authorization": signature
    }
    # Request URL. The value is missing from this article; take it from the
    # official Tencent Cloud documentation.
    url = ""
    # print(request_data)
    r = requests.post(url, headers=header, data=json.dumps(request_data), stream=True)
    # print(r)
    i = 1
    t = int(time.time() * 1000)
    os.makedirs(OUTPUT_PATH, exist_ok=True)
    output_file = os.path.join(OUTPUT_PATH, f"{t}.wav")
    print(f"generate audio file: {output_file}")
    wavfile = wave.open(output_file, 'wb')
    # Mono, 16-bit samples, 16 kHz, uncompressed.
    wavfile.setparams((1, 2, 16000, 0, 'NONE', 'NONE'))
    for chunk in r.iter_content(1000):
        # If the first chunk contains "Error", treat the response as an error
        # message rather than audio.
        if i == 1 and b"Error" in chunk:
            wavfile.close()
            return ""
        i = i + 1
        wavfile.writeframes(chunk)
    wavfile.close()
    return output_file

if __name__ == "__main__":
    # Sample text (placeholder); replace it with the text to synthesize.
    text2wav("Hello, this is a speech synthesis test.")
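To see what the signing step produces without calling the service, here is a simplified standalone sketch of the same idea: sort the parameters, build one string, sign it with HMAC-SHA1, and Base64-encode the digest. The host and path `example.invalid/stream`, the secret `dummy-secret`, and the helper names are placeholders invented for illustration, not real Tencent Cloud values.

```python
import base64
import hashlib
import hmac

def build_sign_string(method, host_path, params):
    """Join the HTTP method, host/path, and key-sorted parameters into one string."""
    pairs = "&".join(f"{key}={params[key]}" for key in sorted(params))
    return f"{method}{host_path}?{pairs}"

def sign(sign_string, secret_key):
    """Return the Base64-encoded HMAC-SHA1 of sign_string under secret_key."""
    digest = hmac.new(secret_key.encode("utf-8"),
                      sign_string.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("utf-8")

# Dummy request parameters, deliberately listed out of order.
demo_params = {"Text": "hello", "Action": "TextToStreamAudio", "AppId": 123}
demo_string = build_sign_string("POST", "example.invalid/stream", demo_params)
demo_signature = sign(demo_string, "dummy-secret")
print(demo_string)
```

Because the parameters are sorted before joining, the client and the server can build the same string regardless of the order in which the fields were written, which is what makes the signature verifiable.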

You can also refer to the official SDK provided by Tencent Cloud.
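The script above receives raw PCM in chunks and wraps it in a WAV container. That step can be tried offline with the sketch below, which assumes mono 16-bit audio at 16 kHz (matching the parameters used in the script) and uses a generated sine wave in place of the streamed response. The helpers `write_wav` and `fake_pcm_chunks` are hypothetical names.

```python
import math
import os
import struct
import tempfile
import wave

def write_wav(path, pcm_chunks, sample_rate=16000):
    """Write an iterable of 16-bit mono PCM byte chunks to a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        for chunk in pcm_chunks:
            wav.writeframes(chunk)

def fake_pcm_chunks(n_chunks=4, frames_per_chunk=400, sample_rate=16000):
    """Yield sine-wave PCM chunks that stand in for a streamed HTTP response."""
    for c in range(n_chunks):
        start = c * frames_per_chunk
        samples = [int(8000 * math.sin(2 * math.pi * 440 * (start + i) / sample_rate))
                   for i in range(frames_per_chunk)]
        yield struct.pack(f"<{frames_per_chunk}h", *samples)

demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
write_wav(demo_path, fake_pcm_chunks())
```

Using `wave.open` as a context manager closes the file and finalizes the header even if writing a chunk fails, which is one reason to prefer it over calling `close()` by hand.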



Tags: Python

Posted by cbrknight on Tue, 24 May 2022 15:36:06 +0300