Replicate Models Catalog
Pre-Curated Best-in-Class Audio AI Models - Ready to Deploy
Why Replicate Matters
Replicate has already done the hard work of identifying, hosting, and optimizing the most popular audio AI models. Its catalog is effectively a curated "best of" list, and the run counts (millions for the top models) show real usage. This is our rapid prototyping playground.
Replicate is perfect for MVP and validation, but costs compound quickly at scale. At 100K users generating 5 tracks/month each:
- MusicGen @ $0.02/run: $10,000/month
- ACE-Step @ $0.017/min (assuming 1-minute tracks): $8,500/month
- Total audio AI across all model types: an estimated $20,000-50,000/month
Strategy: Use Replicate to validate which models users love, then migrate those to self-hosted on Fly.io GPU or Modal.
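The cost figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the prices are the ones quoted in this document, and the one-minute track length is an assumption implied by the ACE-Step figure rather than something the text states.

```python
# Monthly cost projection using the prices quoted in this document.
USERS = 100_000
TRACKS_PER_USER = 5
ASSUMED_MINUTES_PER_TRACK = 1.0  # assumption: implied by the $8,500 figure


def monthly_cost_per_run(price_per_run: float) -> float:
    """Cost for models billed per run (e.g. MusicGen)."""
    return USERS * TRACKS_PER_USER * price_per_run


def monthly_cost_per_minute(
    price_per_minute: float, minutes: float = ASSUMED_MINUTES_PER_TRACK
) -> float:
    """Cost for models billed per minute of audio (e.g. ACE-Step)."""
    return USERS * TRACKS_PER_USER * minutes * price_per_minute


musicgen = monthly_cost_per_run(0.02)
ace_step = monthly_cost_per_minute(0.017)
print(f"MusicGen: ${musicgen:,.0f}/month")
print(f"ACE-Step: ${ace_step:,.0f}/month")
```

Changing `ASSUMED_MINUTES_PER_TRACK` shows how quickly per-minute pricing grows with track length: 4-minute songs would quadruple the ACE-Step line.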
Music Generation
Tier 1: Full Song Generation (with Vocals)
End-to-end song creation with lyrics, vocals, and full instrumentation.
| Model | Creator | Runs | Description | Pricing |
|---|---|---|---|---|
| minimax/music-01 | MiniMax | 470.4K | Up to 1 min of music with lyrics & vocals matching reference styles | ~$0.03/song |
| minimax/music-1.5 | MiniMax | 22.1K | Full-length songs up to 4 mins with natural vocals & rich instrumentation | ~$0.03/song |
| lucataco/ace-step | lucataco | 80.2K | State-of-the-art music foundation model. 4 min in 20s. Lyrics support. | ~$0.017/min |
High-quality instrumentals from text prompts - the bread and butter of beat making.
| Model | Creator | Runs | Description | Pricing |
|---|---|---|---|---|
| google/lyria-2 | Google | 49.3K | 48kHz stereo audio from text. SynthID watermarked. 30s output. | $0.0001/sec |
| meta/musicgen | Meta | 3.2M | Generate music from prompt or melody. 300M-3.3B params. Industry standard. | ~$0.02/run |
| stability-ai/stable-audio-2.5 | Stability AI | 9.9K | High-quality music and sound from text prompts | Variable |
| riffusion/riffusion | Riffusion | 1.1M | Music via spectrogram diffusion. Real-time capable. | ~$0.039/run |
| lucataco/magnet | lucataco | 2.8K | Non-autoregressive transformer music generation | Variable |
| stackadoc/stable-audio-open-1.0 | stackadoc | - | Open source Stable Audio for short samples & sound effects | Variable |
Music generation with harmonic control - specify chord progressions and tempo.
| Model | Creator | Runs | Description |
|---|---|---|---|
| sakemin/musicgen-stereo-chord | sakemin | 3.3K | Generate stereo music restricted to chord sequences and tempo |
| sakemin/musicgen-chord | sakemin | 3K | MusicGen with chord progression input |
| sakemin/musicgen-remixer | sakemin | 18.3K | Remix existing music with MusicGen |
| pollinations/music-gen | Pollinations | - | Music generation variant |

| Model | Creator | Runs | Description |
|---|---|---|---|
| zsxkib/flux-music | zsxkib | 8.7K | Music generation with Flux architecture |
| andreasjansson/loop-gen | andreasjansson | - | Generate fixed-BPM loops from text prompts |
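For prototyping, any of the music models above could be called through Replicate's Python client. The sketch below assumes the `replicate` package is installed and `REPLICATE_API_TOKEN` is set. The `prompt` and `duration` input names are assumptions that should be checked against each model's schema on its Replicate page, and some community models may need an explicit `owner/name:version` reference. The helper names `build_music_input` and `generate` are hypothetical.

```python
from typing import Any, Callable, Dict, Optional


def build_music_input(prompt: str, duration_s: int = 8) -> Dict[str, Any]:
    """Build the input payload. Field names are assumptions; check the model schema."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return {"prompt": prompt, "duration": duration_s}


def generate(
    model: str,
    prompt: str,
    duration_s: int = 8,
    run: Optional[Callable[..., Any]] = None,
) -> Any:
    """Run a model. `run` is injectable so the function can be tested offline."""
    if run is None:
        # Imported lazily so the helper can be used without the package installed.
        import replicate  # pip install replicate; needs REPLICATE_API_TOKEN

        run = replicate.run
    return run(model, input=build_music_input(prompt, duration_s))
```

A call such as `generate("meta/musicgen", "dusty boom bap drums, 90 bpm")` would then hit the hosted model. Passing a stub as `run` keeps unit tests free of network calls and API charges.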
Text-to-Speech / Voice Synthesis
Tier 1: Production-Grade TTS
Industry-leading text-to-speech with emotion control and voice cloning.
| Model | Creator | Runs | Description | Pricing |
|---|---|---|---|---|
| minimax/speech-02-turbo | MiniMax | 7.1M | Real-time TTS with emotion, 30+ languages, 300+ voices | $30/1M chars |
| minimax/speech-02-hd | MiniMax | 1.2M | High-fidelity TTS for voiceovers/audiobooks | $50/1M chars |
| resemble-ai/chatterbox | Resemble AI | 193.7K | Expressive speech, emotion control, instant voice cloning | Variable |
| resemble-ai/chatterbox-multilingual | Resemble AI | 5.7K | 23 languages, voice cloning, cross-language transfer | Variable |
| jaaari/kokoro-82m | jaaari | 69.5M | Lightweight 82M param TTS based on StyleTTS2 | Cheap |
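Per-character pricing is easier to reason about with a concrete script length. The sketch below uses the two MiniMax prices from the table and assumes that billed characters equal Python's `len(text)`, which may not match how the provider actually counts. The function name `tts_cost` is hypothetical.

```python
# Prices in USD per 1M characters, taken from the table above.
PRICE_PER_MILLION_CHARS = {
    "minimax/speech-02-turbo": 30.0,
    "minimax/speech-02-hd": 50.0,
}


def tts_cost(model: str, text: str) -> float:
    """Estimated cost in USD; assumes billing counts characters like len()."""
    return len(text) / 1_000_000 * PRICE_PER_MILLION_CHARS[model]


script = "x" * 1500  # stand-in for a roughly 1,500-character voiceover script
for model in PRICE_PER_MILLION_CHARS:
    print(f"{model}: ${tts_cost(model, script):.3f}")
```

At these rates a short voiceover costs cents, so the choice between turbo and HD is more about latency and fidelity than price until volume gets large.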
Zero-shot voice cloning and specialized speech synthesis.
| Model | Creator | Runs | Description |
|---|---|---|---|
| lucataco/xtts-v2 | lucataco | 4.6M | Coqui XTTS v2 - multilingual voice cloning |
| chenxwh/openvoice | chenxwh | 80.8K | MyShell OpenVoice - zero-shot voice cloning |
| adirik/styletts2 | adirik | 132K | StyleTTS2 - style-based TTS |
| suno-ai/bark | Suno AI | 303.2K | Generates speech, music, sound effects |
| afiaka87/tortoise-tts | afiaka87 | 173K | High-quality slow TTS with voice cloning |
| lucataco/orpheus-3b-0.1-ft | lucataco | 32.7K | Orpheus 3B - Llama-based expressive TTS |
| x-lance/f5-tts | x-lance | 37.3K | F5-TTS model |
| zsxkib/dia | zsxkib | 10K | 1.6B dialogue TTS with voice cloning |
| lucataco/csm-1b | lucataco | 1.1K | Sesame CSM - conversational speech model |

| Model | Creator | Runs | Description |
|---|---|---|---|
| microsoft/vibevoice | Microsoft | - | Long-form multi-speaker podcast generation (up to 90 min, 4 speakers) |
| cjwbw/voicecraft | cjwbw | 10.7K | VoiceCraft speech editing |
| cjwbw/parler-tts | cjwbw | 2.7K | Parler TTS |
| cjwbw/seamless_communication | cjwbw | 91.9K | Meta's seamless translation + TTS |
| awerks/neon-tts | awerks | 173.1K | Neon TTS |
| minimax/voice-cloning | MiniMax | 25.4K | 10-second voice cloning |
Singing Voice & RVC
Voice conversion for creating AI covers and custom vocal performances.
| Model | Creator | Runs | Description |
|---|---|---|---|
| zsxkib/realistic-voice-cloning | zsxkib | 1.3M | Create song covers with RVC v2 AI voice |
| pseudoram/rvc-v2 | PseudoRAM | 1.3M | Speech-to-speech with RVC v2 |
| replicate/train-rvc-model | Replicate | 397.7K | Train custom RVC models |
| zsxkib/create-rvc-dataset | zsxkib | 18.6K | Create RVC dataset from YouTube |
| lucataco/singing_voice_conversion | lucataco | 1.1K | Amphion DiffWaveNetSVC |
| nateraw/autotune | nateraw | 605 | Pitch correction |
Audio Understanding
Analyze, caption, and understand audio content with AI.
| Model | Creator | Runs | Description |
|---|---|---|---|
| zsxkib/kimi-audio-7b-instruct | zsxkib | - | Kimi-Audio: speech-to-text, audio Q&A, captioning, emotion tags, voice responses |
FlowState DAW Recommendations
Best model picks for hip-hop production workflows.
| Use Case | Best Model | Why |
|---|---|---|
| Full beat generation | ACE-Step or Music-1.5 | Fast, full songs with vocals |
| Instrumental loops | MusicGen or Lyria-2 | MusicGen is proven (3.2M runs); Lyria-2 outputs 48kHz stereo |
| Chord-based backing | musicgen-stereo-chord | Control over harmony |
| Scratch vocals | Chatterbox or Orpheus | Expressive, cloneable |
| Voice cloning for hooks | OpenVoice or XTTS-v2 | Zero-shot cloning |
| Vocal covers/AI voices | RVC v2 | 1.3M+ runs, proven |
| Audio understanding | Kimi-Audio-7B | Analyze audio, Q&A |
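The recommendations above could be encoded as a small routing table so the app can swap models without touching call sites. This is only a sketch: the use-case keys and the `pick_model` helper are hypothetical, and mapping "RVC v2" to the two RVC entries from the catalog is one reading of the table rather than something it states.

```python
from typing import Dict, FrozenSet, List

# Ordered candidates per use case; the first entry is the preferred model.
RECOMMENDED: Dict[str, List[str]] = {
    "full_beat": ["lucataco/ace-step", "minimax/music-1.5"],
    "instrumental_loop": ["meta/musicgen", "google/lyria-2"],
    "chord_backing": ["sakemin/musicgen-stereo-chord"],
    "scratch_vocals": ["resemble-ai/chatterbox", "lucataco/orpheus-3b-0.1-ft"],
    "hook_voice_clone": ["chenxwh/openvoice", "lucataco/xtts-v2"],
    "vocal_cover": ["zsxkib/realistic-voice-cloning", "pseudoram/rvc-v2"],
    "audio_understanding": ["zsxkib/kimi-audio-7b-instruct"],
}


def pick_model(use_case: str, unavailable: FrozenSet[str] = frozenset()) -> str:
    """Return the first recommended model that is not marked unavailable."""
    for model in RECOMMENDED[use_case]:
        if model not in unavailable:
            return model
    raise LookupError(f"no available model for use case {use_case!r}")


print(pick_model("instrumental_loop"))
print(pick_model("instrumental_loop", frozenset({"meta/musicgen"})))
```

Keeping a second candidate per use case gives a fallback if a hosted model is slow or removed, and the same table can later point at self-hosted endpoints.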
Cost Comparison (per generation)
Self-Hosting Strategy
Replicate shows us what works. Now we need to run it cheaper.
Self-Hosting Candidates (Open Source)
| Replicate Model | Open Source Version | Self-Host Cost | Savings at Scale |
|---|---|---|---|
| meta/musicgen | AudioCraft (GitHub) | ~$0.001/run | 95% |
| lucataco/ace-step | ACE-Step (GitHub) | ~$0.002/min | 88% |
| resemble-ai/chatterbox | Chatterbox (GitHub) | ~$0.0005/run | 90%+ |
| lucataco/xtts-v2 | Coqui TTS (GitHub) | ~$0.0003/run | 95%+ |
| pseudoram/rvc-v2 | RVC WebUI (GitHub) | ~$0.001/run | 90% |
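The savings column can be checked for the two rows where this document quotes a fixed Replicate price. The sketch below uses only those quoted figures. The self-host costs are the document's own estimates, and rows with "Variable" Replicate pricing are left out because there is no baseline to compare against.

```python
# (Replicate price, self-host estimate) in USD, both taken from this document.
PRICES = {
    "meta/musicgen": (0.02, 0.001),        # per run
    "lucataco/ace-step": (0.017, 0.002),   # per minute
}


def savings_pct(replicate_price: float, self_host_price: float) -> float:
    """Percentage saved per unit by self-hosting instead of using Replicate."""
    return (1 - self_host_price / replicate_price) * 100


for model, (hosted, self_hosted) in PRICES.items():
    print(f"{model}: {savings_pct(hosted, self_hosted):.0f}% savings")
```

These per-unit numbers ignore fixed costs such as idle GPU time and engineering effort, so actual savings depend on utilization.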
Sources
- replicate.com/collections/ai-music-generation
- replicate.com/collections/text-to-speech
- replicate.com/collections/sing-with-voices
- replicate.com/lucataco/ace-step
- replicate.com/blog/minimax-text-to-speech
Research compiled December 2025. Run counts and pricing subject to change. Check Replicate for current pricing.