🎯 Why Replicate Matters

Replicate has already done the hard work of identifying, hosting, and optimizing the hottest audio AI models. Their catalog represents a curated "best of" list with proven usage (millions of runs). This is our rapid prototyping playground.

⚠️ The Challenge: Cost at Scale

Replicate is perfect for MVP and validation, but costs compound quickly at scale. At 100K users generating 5 tracks/month each, that is 500K generations per month, all billed at per-run prices.
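
As a rough illustration of that scale, the sketch below multiplies out the monthly bill for each music model, using the per-generation prices listed in the cost comparison later in this document. It assumes every track is a single generation at the listed output length (for example 30 s for Lyria-2 and 60 s for ACE-Step); the function name `monthly_cost` is ours, not part of any Replicate tooling.

```python
# Per-generation prices in USD, copied from this document's cost comparison.
# Assumption: one track == one generation at the listed output length.
PRICE_PER_GENERATION = {
    "Lyria-2": 0.003,
    "ACE-Step": 0.017,
    "MusicGen": 0.02,
    "Music-1.5": 0.03,
    "Riffusion": 0.039,
}


def monthly_cost(users: int, tracks_per_user: int, price_per_generation: float) -> float:
    """Return the monthly bill in USD if every track used a single model."""
    return users * tracks_per_user * price_per_generation


if __name__ == "__main__":
    for model, price in PRICE_PER_GENERATION.items():
        print(f"{model:<10} ${monthly_cost(100_000, 5, price):>9,.0f}/month")
```

Under these assumptions the bill ranges from about $1,500/month (all Lyria-2) to about $19,500/month (all Riffusion), before any TTS or voice-conversion usage.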

Strategy: Use Replicate to validate which models users love, then migrate those models to self-hosted deployments on Fly.io GPU or Modal.

🎵 Music Generation

Tier 1: Full Song Generation (with Vocals)

End-to-end song creation with lyrics, vocals, and full instrumentation.

| Model | Creator | Runs | Description | Pricing |
|---|---|---|---|---|
| minimax/music-01 | MiniMax | 470.4K | Up to 1 min of music with lyrics & vocals matching reference styles | ~$0.03/song |
| minimax/music-1.5 | MiniMax | 22.1K | Full-length songs up to 4 mins with natural vocals & rich instrumentation | ~$0.03/song |
| lucataco/ace-step | lucataco | 80.2K | State-of-the-art music foundation model. 4 min in 20s. Lyrics support. | ~$0.017/min |

Tier 2: Instrumental Music Generation

High-quality instrumentals from text prompts - the bread and butter of beat making.

| Model | Creator | Runs | Description | Pricing |
|---|---|---|---|---|
| google/lyria-2 | Google | 49.3K | 48kHz stereo audio from text. SynthID watermarked. 30s output. | $0.0001/sec |
| meta/musicgen | Meta | 3.2M | Generate music from prompt or melody. 300M-3.3B params. Industry standard. | ~$0.02/run |
| stability-ai/stable-audio-2.5 | Stability AI | 9.9K | High-quality music and sound from text prompts | Variable |
| riffusion/riffusion | Riffusion | 1.1M | Music via spectrogram diffusion. Real-time capable. | ~$0.039/run |
| lucataco/magnet | lucataco | 2.8K | Non-autoregressive transformer music generation | Variable |
| stackadoc/stable-audio-open-1.0 | stackadoc | - | Open source Stable Audio for short samples & sound effects | Variable |

Tier 3: Chord/Melody Conditioned

Music generation with harmonic control - specify chord progressions and tempo.

| Model | Creator | Runs | Description |
|---|---|---|---|
| sakemin/musicgen-stereo-chord | sakemin | 3.3K | Generate stereo music restricted to chord sequences and tempo |
| sakemin/musicgen-chord | sakemin | 3K | MusicGen with chord progression input |
| sakemin/musicgen-remixer | sakemin | 18.3K | Remix existing music with MusicGen |
| pollinations/music-gen | Pollinations | - | Music generation variant |

Specialized

| Model | Creator | Runs | Description |
|---|---|---|---|
| zsxkib/flux-music | zsxkib | 8.7K | Music generation with Flux architecture |
| andreasjansson/loop-gen | andreasjansson | - | Generate fixed-BPM loops from text prompts |

🗣️ Text-to-Speech / Voice Synthesis

Tier 1: Production-Grade TTS

Industry-leading text-to-speech with emotion control and voice cloning.

| Model | Creator | Runs | Description | Pricing |
|---|---|---|---|---|
| minimax/speech-02-turbo | MiniMax | 7.1M | Real-time TTS with emotion, 30+ languages, 300+ voices | $30/1M chars |
| minimax/speech-02-hd | MiniMax | 1.2M | High-fidelity TTS for voiceovers/audiobooks | $50/1M chars |
| resemble-ai/chatterbox | Resemble AI | 193.7K | Expressive speech, emotion control, instant voice cloning | Variable |
| resemble-ai/chatterbox-multilingual | Resemble AI | 5.7K | 23 languages, voice cloning, cross-language transfer | Variable |
| jaaari/kokoro-82m | jaaari | 69.5M | Lightweight 82M param TTS based on StyleTTS2 | Cheap |

Tier 2: Voice Cloning & Specialty TTS

Zero-shot voice cloning and specialized speech synthesis.

| Model | Creator | Runs | Description |
|---|---|---|---|
| lucataco/xtts-v2 | lucataco | 4.6M | Coqui XTTS v2 - multilingual voice cloning |
| chenxwh/openvoice | chenxwh | 80.8K | MyShell OpenVoice - zero-shot voice cloning |
| adirik/styletts2 | adirik | 132K | StyleTTS2 - style-based TTS |
| suno-ai/bark | Suno AI | 303.2K | Generates speech, music, sound effects |
| afiaka87/tortoise-tts | afiaka87 | 173K | High-quality slow TTS with voice cloning |
| lucataco/orpheus-3b-0.1-ft | lucataco | 32.7K | Orpheus 3B - Llama-based expressive TTS |
| x-lance/f5-tts | x-lance | 37.3K | F5-TTS model |
| zsxkib/dia | zsxkib | 10K | 1.6B dialogue TTS with voice cloning |
| lucataco/csm-1b | lucataco | 1.1K | Sesame CSM - conversational speech model |

Tier 3: Specialized TTS

| Model | Creator | Runs | Description |
|---|---|---|---|
| microsoft/vibevoice | Microsoft | - | Long-form multi-speaker podcast generation (up to 90 min, 4 speakers) |
| cjwbw/voicecraft | cjwbw | 10.7K | VoiceCraft speech editing |
| cjwbw/parler-tts | cjwbw | 2.7K | Parler TTS |
| cjwbw/seamless_communication | cjwbw | 91.9K | Meta's seamless translation + TTS |
| awerks/neon-tts | awerks | 173.1K | Neon TTS |
| minimax/voice-cloning | MiniMax | 25.4K | 10-second voice cloning |

🎤 Singing Voice & RVC

Voice conversion for creating AI covers and custom vocal performances.

| Model | Creator | Runs | Description |
|---|---|---|---|
| zsxkib/realistic-voice-cloning | zsxkib | 1.3M | Create song covers with RVC v2 AI voice |
| pseudoram/rvc-v2 | PseudoRAM | 1.3M | Speech-to-speech with RVC v2 |
| replicate/train-rvc-model | Replicate | 397.7K | Train custom RVC models |
| zsxkib/create-rvc-dataset | zsxkib | 18.6K | Create RVC dataset from YouTube |
| lucataco/singing_voice_conversion | lucataco | 1.1K | Amphion DiffWaveNetSVC |
| nateraw/autotune | nateraw | 605 | Pitch correction |

🧠 Audio Understanding

Analyze, caption, and understand audio content with AI.

| Model | Creator | Runs | Description |
|---|---|---|---|
| zsxkib/kimi-audio-7b-instruct | zsxkib | - | Kimi-Audio: speech-to-text, audio Q&A, captioning, emotion tags, voice responses |

🎛️ FlowState DAW Recommendations

Best model picks for hip-hop production workflows.

| Use Case | Best Model | Why |
|---|---|---|
| Full beat generation | ACE-Step or Music-1.5 | Fast, full songs with vocals |
| Instrumental loops | MusicGen or Lyria-2 | Proven, high quality, millions of runs |
| Chord-based backing | musicgen-stereo-chord | Control over harmony |
| Scratch vocals | Chatterbox or Orpheus | Expressive, cloneable |
| Voice cloning for hooks | OpenVoice or XTTS-v2 | Zero-shot cloning |
| Vocal covers/AI voices | RVC v2 | 1.3M+ runs, proven |
| Audio understanding | Kimi-Audio-7B | Analyze audio, Q&A |
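
One way to wire these picks into an application is a small lookup from use case to an ordered list of Replicate model slugs, sketched below. The slugs come from the catalog sections of this document; the use-case keys, the `pick_model` helper, and the choice of the two RVC v2 listings for "Vocal covers" are illustrative assumptions rather than an existing FlowState API.

```python
# Ordered preferences per use case; the first entry is the primary pick.
# Keys and helper names are hypothetical; slugs are from this document's catalog.
RECOMMENDED_MODELS = {
    "full_beat": ["lucataco/ace-step", "minimax/music-1.5"],
    "instrumental_loop": ["meta/musicgen", "google/lyria-2"],
    "chord_backing": ["sakemin/musicgen-stereo-chord"],
    "scratch_vocals": ["resemble-ai/chatterbox", "lucataco/orpheus-3b-0.1-ft"],
    "voice_clone_hook": ["chenxwh/openvoice", "lucataco/xtts-v2"],
    # Assumption: "RVC v2" maps to the two RVC v2 listings in the catalog.
    "vocal_cover": ["zsxkib/realistic-voice-cloning", "pseudoram/rvc-v2"],
    "audio_understanding": ["zsxkib/kimi-audio-7b-instruct"],
}


def pick_model(use_case: str, unavailable: frozenset = frozenset()) -> str:
    """Return the first recommended model for a use case that is not marked unavailable."""
    for slug in RECOMMENDED_MODELS[use_case]:
        if slug not in unavailable:
            return slug
    raise LookupError(f"no available model for use case {use_case!r}")
```

Keeping the fallback order in data rather than in code makes it easy to reorder picks as usage numbers come in.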

💰 Cost Comparison (per generation)

| Model | Cost | Basis |
|---|---|---|
| Lyria-2 | $0.003 | 30 sec output |
| ACE-Step | $0.017 | 60 sec output |
| Music-1.5 | $0.03 | up to 4 min |
| MusicGen | $0.02 | 30 sec output |
| Riffusion | $0.039 | variable |
| Speech-02-Turbo | $0.03 | per 1K chars |
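
The figures above mix billing units (per second, per minute, per million characters), so comparing them requires normalizing to a concrete job size. The sketch below shows that arithmetic using the unit prices quoted in the catalog sections; the `UnitPrice` type and `cost` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitPrice:
    usd: float   # price in USD
    per: float   # number of units that price covers
    unit: str    # "second", "minute", or "char"


def cost(price: UnitPrice, quantity: float) -> float:
    """Cost in USD of `quantity` units at the given unit price."""
    return price.usd * quantity / price.per


# Unit prices as quoted in this document.
LYRIA_2 = UnitPrice(0.0001, 1, "second")             # $0.0001/sec
ACE_STEP = UnitPrice(0.017, 1, "minute")             # ~$0.017/min
SPEECH_02_TURBO = UnitPrice(30.0, 1_000_000, "char") # $30/1M chars

if __name__ == "__main__":
    print(f"Lyria-2, 30 s clip:        ${cost(LYRIA_2, 30):.4f}")
    print(f"ACE-Step, 1 min track:     ${cost(ACE_STEP, 1):.4f}")
    print(f"Speech-02-Turbo, 1K chars: ${cost(SPEECH_02_TURBO, 1_000):.4f}")
```

The same helper can price other job sizes, for instance a 4-minute ACE-Step track via `cost(ACE_STEP, 4)`.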

🔧 Self-Hosting Strategy

Replicate shows us what works. Now we need to run it cheaper.

📊 Phase 1: Validate with Replicate
Use Replicate during MVP to identify which models users actually use. Track generation counts per model type.
🎯 Phase 2: Identify Top 3
Find the 3 models that account for 80% of usage. These are candidates for self-hosting.
🖥️ Phase 3: Self-Host Winners
Deploy top models on Fly.io GPU ($0.50-2/hr) or Modal. 10-50x cost reduction at scale.
⚖️ Phase 4: Hybrid Architecture
Self-host high-volume models, use Replicate for long-tail/experimental features.
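
The phases above can be sketched in a few lines: count generations per model, check how much traffic the top three cover, and route each request to a self-hosted deployment or to Replicate. This is a minimal in-memory illustration; `UsageTracker`, `choose_backend`, and the backend labels are hypothetical names, and a real system would persist counts in a database or analytics pipeline.

```python
from collections import Counter


class UsageTracker:
    """Counts generations per model (Phase 1)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, model: str) -> None:
        self.counts[model] += 1

    def top_models(self, n: int = 3):
        """Return the n most-used models and their combined share of usage (Phase 2)."""
        total = sum(self.counts.values())
        top = [model for model, _ in self.counts.most_common(n)]
        share = sum(self.counts[m] for m in top) / total if total else 0.0
        return top, share


def choose_backend(model: str, self_hosted: set) -> str:
    """Route high-volume models to our own GPUs, everything else to Replicate (Phase 4)."""
    return "self-hosted" if model in self_hosted else "replicate"


if __name__ == "__main__":
    tracker = UsageTracker()
    # Made-up traffic purely for demonstration.
    sample = {"meta/musicgen": 50, "lucataco/ace-step": 20,
              "resemble-ai/chatterbox": 15, "pseudoram/rvc-v2": 10, "suno-ai/bark": 5}
    for model, n in sample.items():
        for _ in range(n):
            tracker.record(model)
    top, share = tracker.top_models(3)
    print(top, f"{share:.0%}")
    print(choose_backend("suno-ai/bark", set(top)))
```

If the top three cover less than the 80% target, that suggests usage is too spread out for self-hosting to pay off yet.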

Self-Hosting Candidates (Open Source)

| Replicate Model | Open Source Version | Self-Host Cost | Savings at Scale |
|---|---|---|---|
| meta/musicgen | AudioCraft (GitHub) | ~$0.001/run | 95% |
| lucataco/ace-step | ACE-Step (GitHub) | ~$0.002/min | 88% |
| resemble-ai/chatterbox | Chatterbox (GitHub) | ~$0.0005/run | 90%+ |
| lucataco/xtts-v2 | Coqui TTS (GitHub) | ~$0.0003/run | 95%+ |
| pseudoram/rvc-v2 | RVC WebUI (GitHub) | ~$0.001/run | 90% |
💡 Key Insight: Many popular Replicate models, including every one in the table above, have an open-source version. For those models, Replicate's value is convenience, not exclusivity. At 10K+ users, self-hosting the top 3 models could save $5,000-15,000/month.
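
The savings percentages follow directly from the two price columns, and the same numbers give a rough break-even load for a dedicated GPU. The sketch below shows both calculations; it assumes the GPU is billed hourly whether or not it is busy, and the helper names are ours.

```python
def savings_percent(replicate_price: float, self_host_price: float) -> float:
    """Percentage saved per unit by self-hosting instead of paying Replicate's price."""
    return 100 * (1 - self_host_price / replicate_price)


def break_even_runs_per_hour(gpu_usd_per_hour: float, replicate_price_per_run: float) -> float:
    """Runs per hour at which an always-on GPU costs the same as paying Replicate per run."""
    return gpu_usd_per_hour / replicate_price_per_run


if __name__ == "__main__":
    # MusicGen: ~$0.02/run on Replicate vs ~$0.001/run self-hosted.
    print(f"MusicGen savings: {savings_percent(0.02, 0.001):.0f}%")
    # ACE-Step: ~$0.017/min vs ~$0.002/min.
    print(f"ACE-Step savings: {savings_percent(0.017, 0.002):.0f}%")
    # Fly.io GPU range quoted in this document: $0.50-2/hr.
    for hourly in (0.50, 2.00):
        runs = break_even_runs_per_hour(hourly, 0.02)
        print(f"${hourly:.2f}/hr GPU breaks even at {runs:.0f} MusicGen runs/hour")
```

Below the break-even load an idle self-hosted GPU costs more than Replicate, which is one reason to keep long-tail models on Replicate.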

📚 Sources

Research compiled December 2025. Run counts and pricing subject to change. Check Replicate for current pricing.