MARS5 TTS Introduction
MARS5 TTS is a state-of-the-art speech model developed by CAMB.AI for text-to-speech applications. The model is designed to handle a wide range of prosodically challenging scenarios, such as sports commentary and anime. With its two-stage AR-NAR pipeline and a novel NAR component, MARS5 TTS sets itself apart in synthetic speech generation.
Features of MARS5 TTS
High-Level Architecture
MARS5 TTS follows a high-level flow that starts from input text and a reference audio clip. An autoregressive transformer first produces coarse (L0) encoded speech features. These features, together with the text and reference, are then refined by a multinomial DDPM (denoising diffusion probabilistic model) to produce the remaining encoded codebook values. The DDPM output is finally vocoded, yielding high-quality speech.
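The flow above can be sketched as a three-stage dataflow. The stub functions below are hypothetical stand-ins for illustration only, not the real MARS5 internals:

```python
# Illustrative sketch of the MARS5 dataflow; these stubs are hypothetical
# stand-ins for the actual model components.

def ar_transformer(text, ref_audio):
    # Stage 1: autoregressive transformer predicts coarse (L0) codebook
    # indices, roughly one per output frame (dummy values here).
    return [len(text) % 1024] * 8

def ddpm_refine(text, ref_audio, l0_codes):
    # Stage 2: multinomial DDPM, conditioned on text + reference, fills in
    # the remaining codebook levels for each frame.
    return [[c, 0, 0, 0] for c in l0_codes]  # e.g. 4 codebook levels/frame

def vocoder(codes):
    # Stage 3: vocode the full codebook values back into a waveform.
    return [0.0] * (len(codes) * 320)  # e.g. 320 samples per frame

def synthesize(text, ref_audio):
    l0 = ar_transformer(text, ref_audio)
    codes = ddpm_refine(text, ref_audio, l0)
    return vocoder(codes)
```

The frame and codebook counts here are placeholders; only the order of the stages mirrors the description above.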
Innovative Components
- Two-Stage AR-NAR Pipeline: The model employs a two-stage pipeline with an autoregressive (AR) component and a non-autoregressive (NAR) component, which is uniquely designed for MARS5.
- Prosody Control: MARS5 can be guided by punctuation and capitalization in the input text, allowing for natural prosody control.
- Reference-Based Speech Generation: The model can generate speech with a specified speaker identity using a reference audio file between 2-12 seconds in length.
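Since the reference clip must fall in the 2-12 second window, it can be worth validating it before synthesis. The helper below is a hypothetical convenience, not part of the MARS5 API:

```python
def is_valid_reference(num_samples: int, sample_rate: int = 24000) -> bool:
    """Check that a reference clip is 2-12 seconds long, as MARS5 expects."""
    duration_s = num_samples / sample_rate
    return 2.0 <= duration_s <= 12.0
```

For example, a 5-second clip at 24 kHz (`is_valid_reference(24000 * 5)`) passes, while a 1-second clip does not.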
Key Advantages
- Versatility: MARS5 can handle various prosodically challenging scenarios with ease.
- Speed and Quality: The model offers both shallow (fast) and deep (high-quality) cloning options for inference.
- Flexibility: Users can tune various inference settings to optimize the output for different use cases.
MARS5 TTS Setup and Usage
Quickstart
To get started with MARS5 TTS, users can install the required Python packages and load the model using torch.hub. The model comes with easy-to-follow instructions for performing inference, including the option to use a deep clone for higher quality results.
Installation
```bash
pip install --upgrade torch torchaudio librosa vocos encodec safetensors regex
```
Loading the Model
```python
import torch, librosa

mars5, config_class = torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', trust_repo=True)
```
Perform Synthesis
```python
# Load the reference audio and its (optional) transcript
wav, sr = librosa.load('<path to arbitrary 24kHz waveform>.wav', sr=mars5.sr, mono=True)
wav = torch.from_numpy(wav)
ref_transcript = "<transcript of the reference audio>"

# Choose deep or shallow clone and configure inference settings
deep_clone = True
cfg = config_class(deep_clone=deep_clone, rep_penalty_window=100, top_k=100,
                   temperature=0.7, freq_penalty=3)

# Generate speech
ar_codes, output_audio = mars5.tts("The quick brown rat.", wav, ref_transcript, cfg=cfg)
```
MARS5 TTS in Depth
Model Architecture
The architecture of MARS5 TTS is designed to be efficient and effective, with a focus on generating natural-sounding speech with correct prosody. The model's ability to use a reference audio and transcript for deep cloning enhances its capability to produce speech that closely matches the reference's characteristics.
Technical Details
- Checkpoints: The model ships with an AR checkpoint of approximately 750M parameters and a NAR checkpoint of approximately 450M parameters.
- Hardware Requirements: Users need a GPU with enough memory to store the checkpoints and run inference with at least 750M active parameters.
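A rough back-of-envelope for fitting these checkpoints in GPU memory (the weight precision is an assumption here, not something the project states):

```python
def checkpoint_size_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Approximate checkpoint size; 2 bytes/param assumes fp16 weights."""
    return num_params * bytes_per_param / 1024**3

ar_gb = checkpoint_size_gb(750_000_000)   # AR checkpoint, ~1.4 GB in fp16
nar_gb = checkpoint_size_gb(450_000_000)  # NAR checkpoint, ~0.8 GB in fp16
```

Activations, KV caches, and the vocoder add overhead on top of this, so treat these figures as a lower bound.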
Roadmap and Tasks
The developers of MARS5 TTS are continuously working on improving the model's quality, stability, and performance. The roadmap includes tasks such as improving inference consistency, optimizing performance, and enhancing the reference audio selection process.
MARS5 TTS Community and Contributions
Community Engagement
The MARS5 TTS community is active on the forum and Discord, where users can share suggestions, feedback, and questions with the development team.
Contributing to MARS5 TTS
Contributions to the MARS5 TTS repository are welcome. The preferred way to contribute is to fork the repository on GitHub, make your changes, and submit a pull request.
License
MARS5 TTS is open-sourced under the GNU AGPL 3.0 license, with the option to request a different license by contacting the developers.
MARS5 TTS FAQs
What is the minimum hardware requirement for running MARS5 TTS?
MARS5 TTS requires a GPU with enough memory to hold both checkpoints (approximately 750M parameters for the AR model and 450M for the NAR model) and to run inference with around 750M active parameters.
Can I use MARS5 TTS without a GPU?
While a GPU is recommended for running MARS5 TTS, users without the necessary hardware can access the model through the CAMB.AI API.
How do I contribute to the development of MARS5 TTS?
To contribute to MARS5 TTS, fork the repository on GitHub, make your changes, and submit a pull request with a description of your modifications.
Is there a demo available for MARS5 TTS?
Yes, there is an online demo available for users to experience the capabilities of MARS5 TTS.
Can MARS5 TTS be used for commercial purposes?
MARS5 TTS is open-sourced under the GNU AGPL 3.0 license, which allows for commercial use with certain conditions. For more information, contact the developers.
How does MARS5 TTS handle punctuation and capitalization in the input text?
MARS5 TTS uses punctuation and capitalization to guide the prosody of the generated speech, allowing for more natural-sounding outputs.
What languages are supported by MARS5 TTS?
Currently, MARS5 TTS supports English, but the developers plan to expand language support in the future.
Can MARS5 TTS generate long-form content?
As of now, MARS5 TTS is designed for generating short-form content. However, the development team is exploring ways to enable long-form generation.
What is the difference between shallow cloning and deep cloning?
Shallow cloning is a faster inference method that does not require the transcript of the reference audio, while deep cloning produces higher quality results but requires the transcript and takes longer to generate speech.
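This trade-off can be captured in a small helper. Note that `make_clone_config` and the dictionary it returns are hypothetical illustrations of the decision logic, not the MARS5 API (which uses the `config_class` shown in the quickstart):

```python
from typing import Optional

def make_clone_config(deep_clone: bool, ref_transcript: Optional[str]):
    """Deep cloning needs the reference transcript; shallow cloning does not."""
    if deep_clone and not ref_transcript:
        raise ValueError("Deep cloning requires the transcript of the reference audio.")
    return {
        # Shallow clone: faster, no transcript needed, somewhat lower quality.
        # Deep clone: slower, transcript required, higher quality.
        "deep_clone": deep_clone,
        "ref_transcript": ref_transcript or "",
    }
```

With this logic, a shallow clone can be requested with just a reference waveform, while asking for a deep clone without a transcript fails fast instead of producing degraded output.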