How to Convert Speech to Text in Python

2024-04-30

I. Introduction

Learn how to use the SpeechRecognition Python library to perform speech recognition and convert audio speech to text in Python.

II. The Speech AI Library

2.1 A Capable Speech-to-Text Library

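Speech recognition is the ability of computer software to identify words and phrases in spoken language and convert them to human-readable text. In this tutorial, you will learn how to convert speech to text in Python using the SpeechRecognition library.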

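As a result, we do not need to build any machine learning model from scratch: the library provides convenient wrappers for various well-known public speech recognition APIs (such as Google Cloud Speech API, IBM Speech To Text, etc.).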

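Note that if you do not want to use APIs and would rather run inference on machine learning models directly, be sure to check the related tutorial, in which I show how to perform speech recognition in Python with current state-of-the-art machine learning models.

Also, if you want other ways to perform ASR, check the comprehensive speech recognition tutorial.

Learn also: How to Translate Text in Python.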

2.2 Installation

Let's get started by installing the library with pip:

pip3 install SpeechRecognition pydub

Now open a new Python file and import it:

import speech_recognition as sr
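
If the import fails, the packages may have been installed into a different Python environment. The following is a minimal sketch, using only the standard library, that checks whether the two modules are importable. The helper name `is_installed` is made up for this illustration and is not part of SpeechRecognition.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if a top-level module can be imported in this environment."""
    return find_spec(module_name) is not None


# The import names differ from the pip package names:
# SpeechRecognition -> speech_recognition, pydub -> pydub
for name in ("speech_recognition", "pydub"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a module is reported as missing, running the pip command with the same interpreter (for example `python3 -m pip install ...`) usually resolves the mismatch.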

The nice thing about this library is that it supports several recognition engines:

CMU Sphinx (offline)
Google Speech Recognition
Google Cloud Speech API
Wit.ai
Microsoft Bing Voice Recognition
Houndify API
IBM Speech To Text
Snowboy Hotword Detection (offline)

We will use Google Speech Recognition here, as it is straightforward and does not require any API key.

2.3 Transcribing an Audio File

Make sure the current directory contains an audio file with English speech (if you want to follow along with me, get the audio file here):

filename = "16-122828-0002.wav"

This file was taken from the LibriSpeech dataset, but you can use any WAV audio file you want; just change the file name. Let's initialize our speech recognizer:

# initialize the recognizer
r = sr.Recognizer()

The code below loads the audio file and converts the speech to text using Google Speech Recognition:

# open the file
with sr.AudioFile(filename) as source:
    # listen for the data (load audio to memory)
    audio_data = r.record(source)
    # recognize (convert from speech to text)
    text = r.recognize_google(audio_data)
    print(text)

This takes a few seconds to finish, as it uploads the file to Google and fetches the output. Here is my result:

I believe you're just talking nonsense
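
Because the audio is uploaded over the network, a request can fail for temporary reasons. Below is a minimal sketch of a generic retry helper. `call_with_retries` and its parameters are hypothetical names introduced for this example, not part of SpeechRecognition; the library's own exception for failed requests is `sr.RequestError`.

```python
import time


def call_with_retries(func, *args, retry_on=(Exception,), attempts=3, delay=1.0, **kwargs):
    """Call func(*args, **kwargs), retrying when one of `retry_on` is raised."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay * attempt)  # wait a little longer each time


# Hypothetical usage with the recognizer from this tutorial:
# text = call_with_retries(r.recognize_google, audio_data,
#                          retry_on=(sr.RequestError,))
```

Only errors listed in `retry_on` are retried, so an `sr.UnknownValueError` (speech that could not be understood) would still surface immediately, which is usually what you want since repeating the request is unlikely to help.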

The code above works well for small or medium-sized audio files. In the next section, we will write code for large files.
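
To decide which approach a given file needs, you can read its length from the WAV header before uploading anything. This is a rough sketch using the standard `wave` module. The helper names, the demo file, and the 60-second cutoff are assumptions made for illustration; the tutorial does not define what counts as a large file.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the length of a WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def write_silence(path, seconds=2, sample_rate=16000):
    """Write a mono 16-bit WAV of silence, only so this example is runnable."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * sample_rate * seconds)


demo_path = os.path.join(tempfile.gettempdir(), "silence-demo.wav")
write_silence(demo_path)

LIMIT_SECONDS = 60  # arbitrary cutoff chosen for this sketch
duration = wav_duration_seconds(demo_path)
print("use chunking" if duration > LIMIT_SECONDS else "transcribe in one request")
```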

2.4 Transcribing Large Audio Files

If you want to perform speech recognition on a long audio file, the function below handles that quite well:

# importing libraries 
import speech_recognition as sr 
import os 
from pydub import AudioSegment
from pydub.silence import split_on_silence
# create a speech recognition object
r = sr.Recognizer()
# a function to recognize speech in the audio file
# so that we don't repeat ourselves in other functions
def transcribe_audio(path):
    # use the audio file as the audio source
    with sr.AudioFile(path) as source:
        audio_listened = r.record(source)
        # try converting it to text
        text = r.recognize_google(audio_listened)
    return text
# a function that splits the audio file into chunks on silence
# and applies speech recognition
def get_large_audio_transcription_on_silence(path):
    """Splitting the large audio file into chunks
    and apply speech recognition on each of these chunks"""
    # open the audio file using pydub
    sound = AudioSegment.from_file(path)  
    # split audio sound where silence is 500 milliseconds or more and get chunks
    chunks = split_on_silence(sound,
        # experiment with this value for your target audio file
        min_silence_len = 500,
        # adjust this per requirement
        silence_thresh = sound.dBFS-14,
        # keep 500 ms of silence at the start and end of each chunk, adjustable as well
        keep_silence=500,
    )
    folder_name = "audio-chunks"
    # create a directory to store the audio chunks
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)
    whole_text = ""
    # process each chunk 
    for i, audio_chunk in enumerate(chunks, start=1):
        # export audio chunk and save it in
        # the `folder_name` directory.
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")
        # recognize the chunk
        try:
            text = transcribe_audio(chunk_filename)
        except sr.UnknownValueError as e:
            print("Error:", str(e))
        else:
            text = f"{text.capitalize()}. "
            print(chunk_filename, ":", text)
            whole_text += text
    # return the text for all chunks detected
    return whole_text
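
Note: You need to install Pydub with pip for the above code to work.

The above function uses the split_on_silence() function from the pydub.silence module to split the audio data into chunks on silence. The min_silence_len parameter is the minimum length of silence, in milliseconds, to be used for a split.

silence_thresh is the threshold: anything quieter than this is considered silence. I have set it to the average dBFS minus 14. The keep_silence argument is the amount of silence, in milliseconds, to leave at the beginning and end of each detected chunk.

These parameters will not be perfect for all sound files, so experiment with them for your own large audio files.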
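
To get a feel for what a threshold of 14 dB below the average loudness means, here is a simplified illustration in plain Python. It computes dBFS from the RMS amplitude of 16-bit samples and is not pydub's actual implementation; the `dbfs` helper and the sample values are made up for this sketch.

```python
import math


def dbfs(samples, max_amplitude=32768):
    """Average loudness of 16-bit samples in dBFS (0 is the loudest possible)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")  # digital silence
    return 20 * math.log10(rms / max_amplitude)


speech = [8000, -8000] * 100  # made-up "loud" samples
pause = [200, -200] * 100     # made-up "quiet" samples

average = dbfs(speech + pause)
silence_thresh = average - 14  # same offset as in the function above

print(dbfs(pause) < silence_thresh)   # the pause counts as silence
print(dbfs(speech) < silence_thresh)  # the speech does not
```

Because the threshold is relative to the file's own average, a quiet recording and a loud one are treated alike. A larger offset makes the splitter stricter about what it calls silence.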

After that, we iterate over all the chunks, convert each speech segment to text, and concatenate the results. Here is an example run:

path = "7601-291468-0006.wav"
print("\nFull text:", get_large_audio_transcription_on_silence(path))

Note: You can get the 7601-291468-0006.wav file here.

Output:

audio-chunks\chunk1.wav : His abode which you had fixed in a bowery or country seat. 
audio-chunks\chunk2.wav : At a short distance from the city. 
audio-chunks\chunk3.wav : Just at what is now called dutch street. 
audio-chunks\chunk4.wav : Sooner bounded with proofs of his ingenuity. 
audio-chunks\chunk5.wav : Patent smokejacks. 
audio-chunks\chunk6.wav : It required a horse to work some. 
audio-chunks\chunk7.wav : Dutch oven roasted meat without fire. 
audio-chunks\chunk8.wav : Carts that went before the horses. 
audio-chunks\chunk9.wav : Weather cox that turned against the wind and other wrongheaded contrivances. 
audio-chunks\chunk10.wav : So just understand can found it all beholders. 
Full text: His abode which you had fixed in a bowery or country seat. At a short distance from the city. Just at what is now called dutch street. Sooner bounded with proofs of his ingenuity. Patent smokejacks. It required a horse to work some. Dutch oven roasted meat without fire. Carts that went before the horses. Weather cox that turned against the wind and other wrongheaded contrivances. So just understand can found it all beholders.

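So, this function automatically creates a folder for us, stores the chunks of the original audio file we specified there, and then runs speech recognition on all of them.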
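
You may have noticed "dutch street" in lowercase in the output above. A likely explanation is that `str.capitalize()` uppercases the first character and lowercases everything else, so any capitalization returned by the recognizer is lost. The sketch below shows an alternative that only touches the first character. `capitalize_first` is a hypothetical helper, and the sample string is an assumed recognizer output, not one taken from the actual run.

```python
def capitalize_first(text):
    """Uppercase only the first character and leave the rest untouched."""
    return text[:1].upper() + text[1:]


sample = "just at what is now called Dutch Street"  # assumed recognizer output

print(sample.capitalize())       # Just at what is now called dutch street
print(capitalize_first(sample))  # Just at what is now called Dutch Street
```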

If you want to split the audio file into fixed intervals instead, we can use the function below:

# a function that splits the audio file into fixed interval chunks
# and applies speech recognition
def get_large_audio_transcription_fixed_interval(path, minutes=5):
    """Splitting the large audio file into fixed interval chunks
    and apply speech recognition on each of these chunks"""
    # open the audio file using pydub
    sound = AudioSegment.from_file(path)  
    # split the audio file into chunks
    chunk_length_ms = int(1000 * 60 * minutes) # convert to milliseconds
    chunks = [sound[i:i + chunk_length_ms] for i in range(0, len(sound), chunk_length_ms)]
    folder_name = "audio-fixed-chunks"
    # create a directory to store the audio chunks
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)
    whole_text = ""
    # process each chunk 
    for i, audio_chunk in enumerate(chunks, start=1):
        # export audio chunk and save it in
        # the `folder_name` directory.
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")
        # recognize the chunk
        try:
            text = transcribe_audio(chunk_filename)
        except sr.UnknownValueError as e:
            print("Error:", str(e))
        else:
            text = f"{text.capitalize()}. "
            print(chunk_filename, ":", text)
            whole_text += text
    # return the text for all chunks detected
    return whole_text

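The above function splits a large audio file into 5-minute chunks. You can change the minutes parameter to suit your needs. Since my audio file is not that large, I split it into 10-second chunks: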

print("\nFull text:", get_large_audio_transcription_fixed_interval(path, minutes=1/6))

Output:

audio-fixed-chunks\chunk1.wav : His abode which you had fixed in a bowery or country seat at a short distance from the city just that one is now called. 
audio-fixed-chunks\chunk2.wav : Dutch street soon abounded with proofs of his ingenuity patent smokejacks that required a horse to work some. 
audio-fixed-chunks\chunk3.wav : Oven roasted meat without fire carts that went before the horses weather cox that turned against the wind and other wrong head. 
audio-fixed-chunks\chunk4.wav : Contrivances that astonished and confound it all beholders. 
Full text: His abode which you had fixed in a bowery or country seat at a short distance from the city just that one is now called. Dutch street soon abounded with proofs of his ingenuity patent smokejacks that required a horse to work some. Oven roasted meat without fire carts that went before the horses weather cox that turned against the wind and other wrong head. Contrivances that astonished and confound it all beholders.
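
The slicing done by the fixed-interval function can be sketched without any audio at all. The helper below is hypothetical (`chunk_bounds` is not part of pydub) and only reproduces the arithmetic: it returns the millisecond boundaries that the list comprehension in the function above would slice at, for an assumed clip length.

```python
def chunk_bounds(total_ms, minutes=5):
    """Return (start, end) millisecond pairs covering a clip of total_ms."""
    chunk_length_ms = int(1000 * 60 * minutes)  # convert minutes to milliseconds
    return [
        (start, min(start + chunk_length_ms, total_ms))
        for start in range(0, total_ms, chunk_length_ms)
    ]


# A hypothetical 35-second clip cut into 15-second chunks
print(chunk_bounds(35_000, minutes=0.25))
# [(0, 15000), (15000, 30000), (30000, 35000)]
```

Since the boundaries ignore what is being said, a cut can land in the middle of a word, which is why the fixed-interval transcript above reads less cleanly than the silence-based one.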

2.5 Reading from the Microphone

This requires PyAudio to be installed on your machine. Here is the installation process depending on your operating system:

- Windows
  You can install it directly with pip:
```
$ pip3 install pyaudio
```
- Linux
  You need to install the dependencies first:
```
$ sudo apt-get install python-pyaudio python3-pyaudio
$ pip3 install pyaudio
```
- macOS
  You need to install portaudio first, then you can install it with pip:
```
$ brew install portaudio
$ pip3 install pyaudio
```
Now let's use the microphone to convert our speech:
```
import speech_recognition as sr

# create the recognizer (repeated here so the snippet runs on its own)
r = sr.Recognizer()
with sr.Microphone() as source:
    # read the audio data from the default microphone
    audio_data = r.record(source, duration=5)
    print("Recognizing...")
    # convert speech to text
    text = r.recognize_google(audio_data)
    print(text)
```
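This will listen to your microphone for 5 seconds and then try to convert the speech to text!

It is very similar to the previous code, but here we use the Microphone() object to read audio from the default microphone, then we use the duration parameter of the record() function to stop reading after 5 seconds, and finally we upload the audio data to Google to get the output text.

You can also use the offset parameter of the record() function to start recording after offset seconds.

Also, you can recognize different languages by passing the language parameter to the recognize_google() function. For instance, if you want to recognize Spanish speech, you would use: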
```
text = r.recognize_google(audio_data, language="es-ES")
```
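Check out the supported languages in this StackOverflow answer.

III. Conclusion

As you can see, converting speech to text with this library is quite easy and simple. The library is widely used in the wild. Check the official documentation.

If you also want to convert text to speech in Python, check this tutorial.

Read also: How to Recognize Optical Characters in Images in Python. Happy coding!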