Replicate Models - FlowState DAW


🎯 Why Replicate Matters

Replicate has already done the hard work of identifying, hosting, and optimizing the hottest audio AI models. Their catalog represents a curated "best of" list with proven usage (millions of runs). This is our rapid prototyping playground.

⚠️ The Challenge: Cost at Scale

Replicate is perfect for MVP and validation, but costs compound quickly at scale. At 100K users generating 5 tracks/month each (500K generations/month):

MusicGen @ $0.02/run: $10,000/month
ACE-Step @ $0.017/min: $8,500/month (assuming ~1 min per track)
Estimated total audio AI spend: $20,000-50,000/month
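A quick sanity check of the numbers above, as a minimal Python sketch. The helper name `monthly_cost` and the 1-minute track length used for ACE-Step are our assumptions, not anything Replicate publishes.

```python
# Assumptions taken from the text: 100K users, 5 tracks per user per month.
USERS = 100_000
TRACKS_PER_USER = 5


def monthly_cost(unit_price: float, units_per_track: float = 1.0) -> float:
    """Monthly spend for one model.

    units_per_track is 1 for per-run pricing, or the track length in
    minutes for per-minute pricing.
    """
    generations = USERS * TRACKS_PER_USER  # 500,000 generations/month
    return generations * units_per_track * unit_price


musicgen = monthly_cost(0.02)        # per-run pricing
ace_step = monthly_cost(0.017, 1.0)  # per-minute pricing, assumes 1-minute tracks

print(f"MusicGen: ${musicgen:,.0f}/month")
print(f"ACE-Step: ${ace_step:,.0f}/month")
```

Longer ACE-Step tracks scale the bill linearly, so a 4-minute default would roughly quadruple that line.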

Strategy: Use Replicate to validate which models users love, then migrate those models to self-hosted deployments on Fly.io GPU or Modal.

🎵 Music Generation

Tier 1: Full Song Generation (with Vocals)

End-to-end song creation with lyrics, vocals, and full instrumentation.

Model	Creator	Runs	Description	Pricing
minimax/music-01	MiniMax	470.4K	Up to 1 min of music with lyrics & vocals matching reference styles	~$0.03/song
minimax/music-1.5	MiniMax	22.1K	Full-length songs up to 4 mins with natural vocals & rich instrumentation	~$0.03/song
lucataco/ace-step	lucataco	80.2K	State-of-the-art music foundation model. Generates 4 min of audio in 20s. Supports lyrics.	~$0.017/min

Tier 2: Instrumental Music Generation

High-quality instrumentals from text prompts - the bread and butter of beat making.

Model	Creator	Runs	Description	Pricing
google/lyria-2	Google	49.3K	48kHz stereo audio from text. SynthID watermarked. 30s output.	$0.0001/sec
meta/musicgen	Meta	3.2M	Generate music from prompt or melody. 300M-3.3B params. Industry standard.	~$0.02/run
stability-ai/stable-audio-2.5	Stability AI	9.9K	High-quality music and sound from text prompts	Variable
riffusion/riffusion	Riffusion	1.1M	Music via spectrogram diffusion. Real-time capable.	~$0.039/run
lucataco/magnet	lucataco	2.8K	Non-autoregressive transformer music generation	Variable
stackadoc/stable-audio-open-1.0	stackadoc	-	Open source Stable Audio for short samples & sound effects	Variable

Tier 3: Chord/Melody Conditioned

Music generation with harmonic control - specify chord progressions and tempo.

Model	Creator	Runs	Description
sakemin/musicgen-stereo-chord	sakemin	3.3K	Generate stereo music restricted to chord sequences and tempo
sakemin/musicgen-chord	sakemin	3K	MusicGen with chord progression input
sakemin/musicgen-remixer	sakemin	18.3K	Remix existing music with MusicGen
pollinations/music-gen	Pollinations	-	Music generation variant

Specialized

Model	Creator	Runs	Description
zsxkib/flux-music	zsxkib	8.7K	Music generation with Flux architecture
andreasjansson/loop-gen	andreasjansson	-	Generate fixed-BPM loops from text prompts

🗣️ Text-to-Speech / Voice Synthesis

Tier 1: Production-Grade TTS

Industry-leading text-to-speech with emotion control and voice cloning.

Model	Creator	Runs	Description	Pricing
minimax/speech-02-turbo	MiniMax	7.1M	Real-time TTS with emotion, 30+ languages, 300+ voices	$30/1M chars
minimax/speech-02-hd	MiniMax	1.2M	High-fidelity TTS for voiceovers/audiobooks	$50/1M chars
resemble-ai/chatterbox	Resemble AI	193.7K	Expressive speech, emotion control, instant voice cloning	Variable
resemble-ai/chatterbox-multilingual	Resemble AI	5.7K	23 languages, voice cloning, cross-language transfer	Variable
jaaari/kokoro-82m	jaaari	69.5M	Lightweight 82M param TTS based on StyleTTS2	Cheap

Tier 2: Voice Cloning & Specialty TTS

Zero-shot voice cloning and specialized speech synthesis.

Model	Creator	Runs	Description
lucataco/xtts-v2	lucataco	4.6M	Coqui XTTS v2 - multilingual voice cloning
chenxwh/openvoice	chenxwh	80.8K	MyShell OpenVoice - zero-shot voice cloning
adirik/styletts2	adirik	132K	StyleTTS2 - style-based TTS
suno-ai/bark	Suno AI	303.2K	Generates speech, music, sound effects
afiaka87/tortoise-tts	afiaka87	173K	High-quality slow TTS with voice cloning
lucataco/orpheus-3b-0.1-ft	lucataco	32.7K	Orpheus 3B - Llama-based expressive TTS
x-lance/f5-tts	x-lance	37.3K	F5-TTS model
zsxkib/dia	zsxkib	10K	1.6B dialogue TTS with voice cloning
lucataco/csm-1b	lucataco	1.1K	Sesame CSM - conversational speech model

Tier 3: Specialized TTS

Model	Creator	Runs	Description
microsoft/vibevoice	Microsoft	-	Long-form multi-speaker podcast generation (up to 90 min, 4 speakers)
cjwbw/voicecraft	cjwbw	10.7K	VoiceCraft speech editing
cjwbw/parler-tts	cjwbw	2.7K	Parler TTS
cjwbw/seamless_communication	cjwbw	91.9K	Meta's seamless translation + TTS
awerks/neon-tts	awerks	173.1K	Neon TTS
minimax/voice-cloning	MiniMax	25.4K	10-second voice cloning

🎤 Singing Voice & RVC

Voice conversion for creating AI covers and custom vocal performances.

Model	Creator	Runs	Description
zsxkib/realistic-voice-cloning	zsxkib	1.3M	Create song covers with RVC v2 AI voice
pseudoram/rvc-v2	PseudoRAM	1.3M	Speech-to-speech with RVC v2
replicate/train-rvc-model	Replicate	397.7K	Train custom RVC models
zsxkib/create-rvc-dataset	zsxkib	18.6K	Create RVC dataset from YouTube
lucataco/singing_voice_conversion	lucataco	1.1K	Amphion DiffWaveNetSVC
nateraw/autotune	nateraw	605	Pitch correction

🧠 Audio Understanding

Analyze, caption, and understand audio content with AI.

Model	Creator	Runs	Description
zsxkib/kimi-audio-7b-instruct	zsxkib	-	Kimi-Audio: speech-to-text, audio Q&A, captioning, emotion tags, voice responses

🎛️ FlowState DAW Recommendations

Best model picks for hip-hop production workflows.

Use Case	Best Model	Why
Full beat generation	ACE-Step or Music-1.5	Fast, full songs with vocals
Instrumental loops	MusicGen or Lyria-2	Proven, high quality; MusicGen alone has 3.2M runs
Chord-based backing	musicgen-stereo-chord	Control over harmony
Scratch vocals	Chatterbox or Orpheus	Expressive, cloneable
Voice cloning for hooks	OpenVoice or XTTS-v2	Zero-shot cloning
Vocal covers/AI voices	RVC v2	1.3M+ runs, proven
Audio understanding	Kimi-Audio-7B	Analyze audio, Q&A
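One way to wire these picks into the app is a plain lookup table keyed by use case. This is only a sketch: the use-case keys and the `pick_model` helper are hypothetical names, the model IDs are the Replicate slugs from the tables above, and mapping "RVC v2" to both zsxkib/realistic-voice-cloning and pseudoram/rvc-v2 is our assumption.

```python
# Use case -> candidate Replicate models, first entry is the default pick.
RECOMMENDED_MODELS = {
    "full_beat": ["lucataco/ace-step", "minimax/music-1.5"],
    "instrumental_loop": ["meta/musicgen", "google/lyria-2"],
    "chord_backing": ["sakemin/musicgen-stereo-chord"],
    "scratch_vocals": ["resemble-ai/chatterbox", "lucataco/orpheus-3b-0.1-ft"],
    "hook_voice_clone": ["chenxwh/openvoice", "lucataco/xtts-v2"],
    "vocal_cover": ["zsxkib/realistic-voice-cloning", "pseudoram/rvc-v2"],
    "audio_understanding": ["zsxkib/kimi-audio-7b-instruct"],
}


def pick_model(use_case: str, choice: int = 0) -> str:
    """Return the model for a use case; choice=1 selects the alternate if one exists."""
    candidates = RECOMMENDED_MODELS[use_case]
    # Clamp so single-option use cases always return their only model.
    return candidates[min(choice, len(candidates) - 1)]


print(pick_model("full_beat"))
print(pick_model("instrumental_loop", choice=1))
```

Keeping the alternates in the table makes it cheap to A/B two models for the same use case during validation.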

💰 Cost Comparison (per generation)

Model	Cost	Basis
Lyria-2	$0.003	30 sec output
ACE-Step	$0.017	60 sec output
Music-1.5	$0.03	up to 4 min
MusicGen	$0.02	30 sec output
Riffusion	$0.039	variable
Speech-02-Turbo	$0.03	per 1K chars
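Because the output lengths differ, the per-generation prices above are easier to compare once normalized to cost per minute of audio. A rough sketch, assuming Music-1.5 is used at its full 4-minute length and leaving out Riffusion (variable length) and Speech-02-Turbo (priced per character); `price_per_minute` is a hypothetical helper name.

```python
# (price per generation in USD, output length in seconds) from the comparison above.
MUSIC_MODELS = {
    "Lyria-2": (0.003, 30),
    "ACE-Step": (0.017, 60),
    "Music-1.5": (0.03, 240),  # assumption: full 4-minute song
    "MusicGen": (0.02, 30),
}


def price_per_minute(price: float, seconds: float) -> float:
    """Scale a per-generation price to one minute of output audio."""
    return price * 60 / seconds


# Cheapest per minute first.
ranked = sorted(MUSIC_MODELS, key=lambda m: price_per_minute(*MUSIC_MODELS[m]))
for name in ranked:
    print(f"{name}: ${price_per_minute(*MUSIC_MODELS[name]):.4f}/min")
```

On these figures MusicGen is the most expensive per minute of audio even though its per-run price looks mid-range.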

🔧 Self-Hosting Strategy

Replicate shows us what works. Now we need to run it cheaper.

📊 Phase 1: Validate with Replicate

Use Replicate during MVP to identify which models users actually use. Track generation counts per model type.
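Tracking can start as a simple in-process counter before it graduates to real analytics. A minimal sketch; the `UsageTracker` class is a hypothetical name, and a production version would persist counts rather than hold them in memory.

```python
from collections import Counter


class UsageTracker:
    """Counts generations per Replicate model ID."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, model_id: str) -> None:
        # Call once per successful generation request.
        self.counts[model_id] += 1

    def report(self) -> list:
        # (model_id, count) pairs, most used first.
        return self.counts.most_common()


tracker = UsageTracker()
for model in ["meta/musicgen", "lucataco/ace-step", "meta/musicgen"]:
    tracker.record(model)

print(tracker.report())
```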

🎯 Phase 2: Identify Top 3

Find the 3 models that account for 80% of usage. These are candidates for self-hosting.
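Given per-model counts, the shortlist falls out of a cumulative-share pass. The sketch below uses made-up counts purely for illustration, and `models_covering` is a hypothetical helper; the real answer may turn out to be more or fewer than 3 models.

```python
def models_covering(counts: dict, share: float = 0.80) -> list:
    """Return the most-used models whose combined usage first reaches `share` of the total."""
    total = sum(counts.values())
    picked, covered = [], 0
    for model, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= share * total:
            break
        picked.append(model)
        covered += n
    return picked


# Illustrative numbers only, not real usage data.
example_counts = {
    "meta/musicgen": 5000,
    "lucataco/ace-step": 2500,
    "resemble-ai/chatterbox": 1000,
    "lucataco/xtts-v2": 900,
    "google/lyria-2": 600,
}

shortlist = models_covering(example_counts)
print(shortlist)
```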

🖥️ Phase 3: Self-Host Winners

Deploy top models on Fly.io GPU ($0.50-2/hr) or Modal, targeting a 10-50x cost reduction at scale.
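Hourly GPU billing only beats per-run pricing once the GPU is busy enough. A rough break-even sketch using the $0.50-2/hr range above and MusicGen's ~$0.02/run Replicate price; the function names are hypothetical, and the calculation ignores idle time, cold starts, and ops overhead.

```python
def break_even_runs_per_hour(gpu_hourly: float, replicate_per_run: float) -> float:
    """Runs per hour at which a dedicated GPU costs the same as Replicate."""
    return gpu_hourly / replicate_per_run


def self_hosted_cost_per_run(gpu_hourly: float, runs_per_hour: float) -> float:
    """Effective per-run cost when the GPU serves runs_per_hour generations."""
    return gpu_hourly / runs_per_hour


for hourly in (0.50, 2.00):
    runs = break_even_runs_per_hour(hourly, 0.02)
    print(f"${hourly:.2f}/hr GPU breaks even at about {runs:.0f} runs/hour")
```

Below the break-even rate, Replicate's pay-per-run pricing is still the cheaper option.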

⚖️ Phase 4: Hybrid Architecture

Self-host high-volume models, use Replicate for long-tail/experimental features.
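The hybrid split can be expressed as a small router: known high-volume models go to our own GPU workers, everything else falls through to Replicate. A sketch only; `make_router`, `gpu_worker`, and `replicate_api` are hypothetical names, and the two backends are stubs standing in for real clients.

```python
from typing import Callable, Dict

# A backend takes (model_id, payload) and returns a result identifier.
Backend = Callable[[str, dict], str]


def make_router(self_hosted: Dict[str, Backend], fallback: Backend) -> Backend:
    """Route self-hosted model IDs to their own backend, everything else to the fallback."""

    def route(model_id: str, payload: dict) -> str:
        handler = self_hosted.get(model_id, fallback)
        return handler(model_id, payload)

    return route


def gpu_worker(model_id: str, payload: dict) -> str:
    # Stub: a real version would call our self-hosted inference service.
    return f"self-hosted:{model_id}"


def replicate_api(model_id: str, payload: dict) -> str:
    # Stub: a real version would call Replicate.
    return f"replicate:{model_id}"


route = make_router(
    {"meta/musicgen": gpu_worker, "lucataco/ace-step": gpu_worker},
    replicate_api,
)

print(route("meta/musicgen", {"prompt": "boom bap drums"}))
print(route("google/lyria-2", {"prompt": "lofi keys"}))
```

Moving a model between tiers is then a one-line change to the self-hosted map.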

Self-Hosting Candidates (Open Source)

Replicate Model	Open Source Version	Self-Host Cost	Savings at Scale
meta/musicgen	AudioCraft (GitHub)	~$0.001/run	95%
lucataco/ace-step	ACE-Step (GitHub)	~$0.002/min	88%
resemble-ai/chatterbox	Chatterbox (GitHub)	~$0.0005/run	90%+
lucataco/xtts-v2	Coqui TTS (GitHub)	~$0.0003/run	95%+
pseudoram/rvc-v2	RVC WebUI (GitHub)	~$0.001/run	90%
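The savings column follows directly from the two prices. A short check for the rows where the document gives a fixed Replicate price (MusicGen at ~$0.02/run, ACE-Step at ~$0.017/min); `savings_pct` is a hypothetical helper and the self-host costs are the estimates from the table, not measured figures.

```python
def savings_pct(replicate_price: float, self_host_price: float) -> float:
    """Percentage saved per unit by self-hosting instead of using Replicate."""
    return (1 - self_host_price / replicate_price) * 100


musicgen_savings = savings_pct(0.02, 0.001)   # per run
ace_step_savings = savings_pct(0.017, 0.002)  # per minute

print(f"MusicGen: {musicgen_savings:.0f}% saved")
print(f"ACE-Step: {ace_step_savings:.0f}% saved")
```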

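💡 Key Insight: Most popular Replicate models have an open-source version; the hosted MiniMax and Google Lyria models are the main exceptions. Replicate's value is convenience, not exclusivity. Self-hosting the top 3 models could save $5,000-15,000/month once usage approaches the 100K-user scale estimated earlier.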

📚 Sources

Research compiled December 2025. Run counts and pricing subject to change. Check Replicate for current pricing.