📊 Research Overview

This document surveys the AI audio technologies suitable for FlowState DAW, from full song generation to mastering and stem separation. Each model is evaluated on cost, latency, quality, and integration complexity.

42+ Models Analyzed
12 Categories
15 Free/OSS Options
8 High Priority

🎵 Full Song Generation

End-to-end music creation from text prompts

YuE (HKUST / M-A-P) Experimental
Full song with lyrics, structure (verse/chorus/bridge). Handles Chinese & English vocals. Multi-track output with stems.
⏱️ 2-5 min generation 🎯 Music + Lyrics 🔗 Apache 2.0
ACE-Step Experimental
Lyrics-in-place generation with precise word timing. Diffusion-based generation with fast inference. Higher quality vocals than YuE.
⏱️ 30-60s generation 🎯 Precise Timing 🔗 Apache 2.0
DiffRhythm Experimental
Full-length songs (4:30+) with multi-section structure. Uses chunked generation for consistency. Best for instrumental focus.
⏱️ 1-3 min generation 🎯 Long Form 🔗 MIT License
MusicGen (Meta) $0.064/gen
Industry standard for text-to-music. Melody conditioning, stereo output. Excellent for 8-30s clips. Available via Replicate.
⏱️ 10-30s generation 🎯 Production Ready 🔗 CC-BY-NC
Stable Audio 2.0 $0.05/gen
Up to 3 minutes of high-quality audio. Excellent prompt following. Stereo 44.1kHz output. Commercial license available.
⏱️ 20-60s generation 🎯 High Quality 🔗 Commercial
Suno v4 $0.10/song
Consumer favorite. Full songs with vocals, lyrics, structure. Most "finished" sounding output. Closed API.
⏱️ 30-60s generation 🎯 Consumer Ready 🔗 Proprietary
💡
Recommendation: Start with MusicGen via Replicate for MVP. Consider self-hosting ACE-Step or DiffRhythm for free tier to manage costs at scale.
🗣️ Voice Assistant / Conversational Audio

Real-time speech understanding and generation

Sesame CSM Free/OSS
1B parameter speech model with emotional understanding. Ultra-realistic voice cloning in seconds. Low latency streaming. Apache 2.0 license.
⏱️ 50-100ms latency 🎯 Voice Clone 🏷️ HIGH PRIORITY
Kimi-Audio 7B Free/OSS
Audio understanding + generation in one model. 13M hours training data. Can analyze beats, understand music context. End-to-end speech conversation.
⏱️ 100-200ms latency 🎯 Audio Understanding 🏷️ HIGH PRIORITY
Whisper (CF Workers) $0.0005/min
Industry standard STT. 99 languages. Built into Cloudflare Workers AI. Near-instant for short commands. Excellent for DAW voice control.
⏱️ 500ms average 🎯 STT Only 🏷️ IMPLEMENTED
Chatterbox TTS Free/Self-host
Emotional TTS with 5-second voice cloning. Expressive and natural sounding. Self-hostable on Fly.io. Zero marginal cost.
⏱️ 75-150ms latency 🎯 Emotional TTS 🏷️ PLANNED
Gemini 2.5 Native Audio $0.10/min
30 HD voices, 24 languages. Direct audio output from LLM. Emotional expression. Best for premium avatar personalities.
⏱️ 200-500ms latency 🎯 Premium TTS 🔗 Google API
ElevenLabs $0.30/min
Best-in-class voice quality. Instant voice cloning. Multilingual. Professional grade for commercial applications.
⏱️ 75-250ms latency 🎯 Premium Quality 🔗 Paid API
🎯
Strategy: Use Whisper (CF Workers) for STT, Chatterbox for 95% of TTS needs (free), reserve Gemini/ElevenLabs for premium avatar personalities.
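
The TTS split in this strategy can be sketched as a small router plus a blended-cost calculation. This is only an illustration: the function names are hypothetical, the 5% premium share is read off the "95% of TTS needs" figure, and the per-minute prices are the ones quoted on the cards above. Chatterbox is treated as zero marginal cost, which leaves out the fixed cost of the machine it runs on.

```python
# Per-minute TTS prices taken from the cards above. Chatterbox is treated as
# zero marginal cost, which ignores the fixed cost of the self-hosted machine.
PRICE_PER_MIN = {"chatterbox": 0.0, "gemini": 0.10, "elevenlabs": 0.30}


def pick_tts_engine(premium_avatar: bool, premium_engine: str = "gemini") -> str:
    """Route premium avatar personalities to a paid engine, everything else to Chatterbox."""
    return premium_engine if premium_avatar else "chatterbox"


def blended_cost_per_min(premium_share: float, premium_engine: str = "gemini") -> float:
    """Average cost per synthesized minute for a given share of premium traffic."""
    if not 0.0 <= premium_share <= 1.0:
        raise ValueError("premium_share must be between 0 and 1")
    return (
        premium_share * PRICE_PER_MIN[premium_engine]
        + (1.0 - premium_share) * PRICE_PER_MIN["chatterbox"]
    )


if __name__ == "__main__":
    print(pick_tts_engine(premium_avatar=False))
    print(f"${blended_cost_per_min(0.05):.4f}/min with Gemini as the premium engine")
    print(f"${blended_cost_per_min(0.05, 'elevenlabs'):.4f}/min with ElevenLabs")
```

With 5% of synthesized minutes routed to a paid engine, the blended cost comes to about half a cent per minute with Gemini and about one and a half cents with ElevenLabs, before hosting costs.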
🎛️ Stem Separation / Source Separation

Extract vocals, drums, bass, and other instruments from mixed audio

Demucs v4 (HTDemucs) Free/Self-host
State-of-the-art open source. Hybrid Transformer architecture. 4 stems (vocals, drums, bass, other) or 6 stems. Used by iZotope and others.
⏱️ 10-60s processing 🎯 4-6 Stems 🏷️ HIGH PRIORITY
UVR5 (Ultimate Vocal Remover) Free/OSS
GUI tool with multiple model options. MDX-Net, VR Architecture. Best for extracting clean vocals. Community-driven model zoo.
⏱️ 20-90s processing 🎯 Vocal Focus 🔗 MIT License
Spleeter (Deezer) Free/OSS
Original open-source solution. 2, 4, or 5 stem separation. Older but lightweight and fast. Good for real-time preview.
⏱️ 5-15s processing 🎯 Fast/Light 🔗 MIT License
AudioShake Enterprise
Commercial-grade stem separation. Used by major labels. Highest quality for professional applications. API available.
⏱️ 10-30s processing 🎯 Professional 🔗 Paid API
🎹 MIDI Generation / Pattern AI

Generate melodies, chord progressions, and drum patterns

Magenta.js (MusicVAE) Free/Browser
Google's music ML in the browser. Melody continuation, drum pattern generation, interpolation. Zero server cost. Currently integrated!
⏱️ 50-200ms 🎯 Browser Native 🏷️ IMPLEMENTED
ChordSeqAI Free/OSS
Specialized chord progression generator. Understanding of music theory. Can generate contextually appropriate progressions.
⏱️ 100-500ms 🎯 Chords 🔗 MIT License
MusicLM MIDI Mode Research
Google's music generation with MIDI output option. Text-to-MIDI capabilities. Not publicly available but architecture is documented.
⏱️ Variable 🎯 Text-to-MIDI 🔗 Research
RAVE (IRCAM) Free/OSS
Real-time Audio Variational autoEncoder. Can be used for timbre transfer and audio manipulation. Very fast inference.
⏱️ Real-time 🎯 Timbre 🔗 Open Source
✅
Already Integrated: Magenta.js is in FlowState for drum variations and pattern interpolation. Zero cost, runs entirely in browser.
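
Pattern interpolation of this kind works on latent vectors: each pattern is encoded to a vector, intermediate vectors are blended, and each blend is decoded back to notes. The sketch below shows only the blending step as a plain linear interpolation. The vectors are made up, the function name is hypothetical, and the library itself may blend differently, so treat it as a picture of the idea, not of Magenta.js internals.

```python
from typing import List


def interpolate_latents(z_a: List[float], z_b: List[float], steps: int) -> List[List[float]]:
    """Return `steps` vectors moving linearly from z_a to z_b, endpoints included."""
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same length")
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    blends = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at pattern A, 1.0 at pattern B
        blends.append([(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return blends


if __name__ == "__main__":
    # Two made-up 2-dimensional "latents"; real models use far more dimensions.
    for vector in interpolate_latents([0.0, 2.0], [1.0, 0.0], steps=3):
        print(vector)
```

Each intermediate vector would then be passed to the model's decoder to obtain a drum pattern that sits between the two originals.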
✨ Audio Enhancement / Restoration

Noise removal, upscaling, and audio quality improvement

AudioSR Free/OSS
Audio super-resolution. Upscale low-quality audio to 48kHz. Can restore MP3 compression artifacts. Based on diffusion.
⏱️ 5-20s processing 🎯 Upscaling 🏷️ HIGH PRIORITY
DeepFilterNet Free/OSS
Real-time noise suppression. Runs on CPU efficiently. Great for cleaning vocal recordings. Can run in browser via WASM.
⏱️ Real-time 🎯 Noise Removal 🔗 MIT License
Resemble Enhance Free/OSS
AI speech enhancement. Denoising + super-resolution in one. Optimized for voice content. Good for podcast/vocal cleanup.
⏱️ 3-10s processing 🎯 Voice Enhance 🔗 MIT License
NVIDIA Maxine SDK License
Professional-grade audio effects. Noise removal, room echo cancellation, audio super-res. Requires NVIDIA GPU.
⏱️ Real-time 🎯 Professional 🔗 SDK
🎚️ Mastering / Mixing AI

Automated mixing and mastering assistance

Matchering 2.0 Free/OSS
Reference-based audio mastering. Match EQ, loudness, and stereo width to a reference track. Python library, easy to integrate.
⏱️ 5-15s processing 🎯 Reference Match 🏷️ MEDIUM PRIORITY
LANDR API $2-10/track
Industry standard AI mastering. Multiple style options. Used by millions. API available for integration.
⏱️ 30-120s 🎯 Full Master 🔗 Paid API
iZotope Ozone Plugin License
Master Assistant AI. Genre-aware mastering suggestions. Professional-grade. Plugin-based, not API accessible.
⏱️ Real-time 🎯 Professional 🔗 Plugin
CloudBounce $1-5/track
Budget AI mastering. Good for demos and quick masters. API available. Lower quality than LANDR but more affordable.
⏱️ 30-60s 🎯 Budget 🔗 Paid API
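
Reference-based mastering, as described for Matchering above, adjusts a track until its measurements match those of a reference. The loudness part of that idea can be sketched with plain RMS matching. This is a minimal illustration on raw sample lists with hypothetical function names. It is not Matchering's API, and it leaves out the EQ and stereo-width matching that the card also mentions.

```python
import math
from typing import List, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def gain_to_match_db(target: Sequence[float], reference: Sequence[float]) -> float:
    """Gain in dB that brings the target's RMS level to the reference's."""
    target_rms = rms(target)
    reference_rms = rms(reference)
    if target_rms == 0.0 or reference_rms == 0.0:
        raise ValueError("cannot match levels when either signal is silent")
    return 20.0 * math.log10(reference_rms / target_rms)


def apply_gain(samples: Sequence[float], gain_db: float) -> List[float]:
    """Scale samples by a gain given in dB."""
    factor = 10.0 ** (gain_db / 20.0)
    return [s * factor for s in samples]


if __name__ == "__main__":
    quiet_mix = [0.1, -0.1, 0.1, -0.1]
    reference = [0.2, -0.2, 0.2, -0.2]
    gain = gain_to_match_db(quiet_mix, reference)
    print(f"gain: {gain:.2f} dB, new RMS: {rms(apply_gain(quiet_mix, gain)):.3f}")
```

A real mastering chain would also limit peaks after applying the gain, since raising the level this way can push samples past full scale.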
🥁 Sample Generation / One-Shots

Generate drum hits, loops, and individual sounds

AudioCraft (AudioGen) Free/OSS
Meta's audio generation model. Text-to-sound effects. Good for generating one-shots and ambient textures.
⏱️ 5-15s generation 🎯 SFX Focus 🔗 CC-BY-NC
Riffusion Free/OSS
Stable Diffusion fine-tuned on spectrograms. Can generate short loops and riffs. Interesting for experimental sounds.
⏱️ 5-10s generation 🎯 Loops/Riffs 🔗 MIT License
Drumify Research
Specialized drum sample generation. Style-conditioned one-shot generation. Research project with code available.
⏱️ 1-3s generation 🎯 Drums Only 🔗 Research
Splash Pro Subscription
AI sample generation service. Loop and one-shot generation. Genre-specific options. Web-based tool.
⏱️ 5-20s generation 🎯 All Types 🔗 Subscription
📈 Audio Analysis / Understanding

BPM detection, key detection, genre classification

Essentia.js Free/Browser
Audio analysis in the browser via WASM. BPM, key, loudness, spectral features. Mature, and already used in production DAWs.
⏱️ Real-time 🎯 Full Analysis 🏷️ HIGH PRIORITY
Librosa Free/Python
Standard Python audio analysis library. Comprehensive features. Server-side only. Foundation for many audio ML projects.
⏱️ Variable 🎯 Full Analysis 🔗 ISC License
CLAP (Audio Embeddings) Free/OSS
Contrastive Language-Audio Pretraining. Like CLIP but for audio. Can search audio with text queries. Semantic audio understanding.
⏱️ 100-500ms 🎯 Semantic 🏷️ HIGH PRIORITY
basic-pitch Free/OSS
Spotify's polyphonic pitch detection. Audio-to-MIDI transcription. Very accurate for melodic content.
⏱️ 2-10s processing 🎯 Pitch/MIDI 🔗 Apache 2.0
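
Text-to-audio search with CLAP-style embeddings comes down to a nearest-neighbor lookup: the text query and every sample are mapped into the same vector space, and samples are ranked by cosine similarity to the query. The sketch below shows only that ranking step. The three-dimensional vectors, the file names, and the function names are invented for illustration, and real embeddings would come from the model and have many more dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_samples(
    query: Sequence[float], library: Dict[str, Sequence[float]], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Return the top_k library entries most similar to the query embedding."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Made-up embeddings standing in for model output.
    library = {
        "kick_808.wav": [0.9, 0.1, 0.0],
        "snare_tight.wav": [0.1, 0.9, 0.1],
        "pad_warm.wav": [0.0, 0.2, 0.9],
    }
    text_query_embedding = [0.8, 0.2, 0.1]  # pretend this encodes "deep kick"
    for name, score in rank_samples(text_query_embedding, library):
        print(f"{name}: {score:.3f}")
```

For a large sample library the linear scan would be replaced by a vector index, but the ranking criterion stays the same.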
🔊 Sound Design / Synthesis

Neural synthesizers and timbre manipulation

DDSP (Google) Free/OSS
Differentiable Digital Signal Processing. Neural synth with interpretable parameters. Timbre transfer. Real-time capable.
⏱️ Real-time 🎯 Synthesis 🔗 Apache 2.0
RAVE (IRCAM) Free/OSS
Real-time Audio Variational autoEncoder. Ultra-fast timbre transfer. Can morph between sounds. Works in Max/MSP, PureData.
⏱️ Real-time 🎯 Timbre 🔗 Open Source
Neutone Free Plugin
Platform for running neural audio models as plugins. Community-contributed models. Easy to deploy research models.
⏱️ Real-time 🎯 Plugin Host 🔗 Free
Vital + DALL-E Concept
Concept: Use image generation to create wavetables. Spectrogram-to-audio conversion. Experimental but interesting for sound design.
⏱️ Variable 🎯 Wavetables 🔗 Experimental
🎤 Vocal Processing

Voice conversion, pitch correction, and vocal effects

RVC (Retrieval-based Voice Conversion) Free/OSS
Real-time voice conversion. Train custom voice models with 10 min of audio. Very popular in music production community.
⏱️ Real-time 🎯 Voice Conversion 🏷️ MEDIUM PRIORITY
So-VITS-SVC Free/OSS
Singing Voice Conversion. Train models to sing in any voice. Higher quality than RVC but slower. Popular for covers.
⏱️ 2-10x slower 🎯 Singing Voice 🔗 MIT License
CREPE Free/OSS
Monophonic pitch tracking. Very accurate for vocals. Used in pitch correction. Python and TensorFlow.js versions.
⏱️ Near real-time 🎯 Pitch Track 🔗 MIT License
PYIN / Praat Free/Classic
Classic pitch detection algorithms. Lightweight, fast, proven. Good baseline for comparison. Available in Essentia.
⏱️ Real-time 🎯 Pitch 🔗 Various
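
Once a pitch tracker such as CREPE or pYIN has produced a frequency estimate, pitch correction needs to know which note that frequency is closest to and how far off it is. The sketch below uses the standard equal-temperament mapping with A4 = 440 Hz (MIDI note 69). The function names are hypothetical, and the code is independent of any of the libraries listed above.

```python
import math
from typing import Tuple

A4_HZ = 440.0
A4_MIDI = 69
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def hz_to_midi(freq_hz: float) -> float:
    """Fractional MIDI note number for a frequency in Hz."""
    if freq_hz <= 0.0:
        raise ValueError("frequency must be positive")
    return A4_MIDI + 12.0 * math.log2(freq_hz / A4_HZ)


def midi_to_hz(midi_note: float) -> float:
    """Frequency in Hz for a (possibly fractional) MIDI note number."""
    return A4_HZ * 2.0 ** ((midi_note - A4_MIDI) / 12.0)


def snap_to_semitone(freq_hz: float) -> Tuple[str, float, float]:
    """Return (note name, corrected frequency in Hz, offset of the input in cents)."""
    midi = hz_to_midi(freq_hz)
    nearest = round(midi)
    cents_off = (midi - nearest) * 100.0  # positive means the input is sharp
    name = f"{NOTE_NAMES[nearest % 12]}{nearest // 12 - 1}"
    return name, midi_to_hz(nearest), cents_off


if __name__ == "__main__":
    name, target_hz, cents = snap_to_semitone(450.0)
    print(f"450.0 Hz is {cents:+.1f} cents from {name} ({target_hz:.1f} Hz)")
```

A corrector would apply only part of the measured offset per frame, so that vibrato and slides are softened instead of flattened.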
⚙️ Infrastructure & Deployment

Platforms for running AI audio models

Replicate Pay per use
GPU cloud for ML models. Pre-built models available. Easy API. Good for prototyping. ~$0.064/gen for MusicGen.
⏱️ Cold start: 10-30s 🎯 Serverless GPU 🏷️ CURRENT
Fly.io GPU $0.50-2/hr
GPU VMs for self-hosting. L40S and A100 available. Better for sustained workloads. Good for Demucs, TTS.
⏱️ Always on 🎯 Self-host 🏷️ PLANNED
Modal Pay per use
Serverless GPU with Python-native API. Fast cold starts. Good developer experience. Popular for AI apps.
⏱️ Cold start: 1-5s 🎯 Serverless 🔗 Alternative
Hugging Face Inference Free + Paid
Run HF models via API. Dedicated endpoints for reliability. Good model selection. Free tier has rate limits.
⏱️ Variable 🎯 HF Models 🔗 API
Cloudflare Workers AI 10K neurons/day
Edge AI inference. Limited audio models (Whisper). Excellent for text tasks. Free tier generous for STT.
⏱️ 50-500ms 🎯 Edge AI 🏷️ IMPLEMENTED
RunPod $0.20-0.50/hr
Budget GPU cloud. Spot instances available. Good for batch processing. Less reliable than Replicate.
⏱️ Variable 🎯 Budget GPU 🔗 Alternative

🎯 Integration Priority Matrix

Recommended order of integration based on user impact, cost, and complexity.

Priority | Technology | Use Case | Cost | Complexity
P0 - Done | Magenta.js | Pattern variations | Free | Low
P0 - Done | Whisper (CF) | Voice commands | ~Free | Low
P0 - Done | BGE + Vectorize | Sample search | Free | Medium
P1 - Now | MusicGen (Replicate) | Text-to-music | $0.064/gen | Low
P2 - Next | Essentia.js | Audio analysis | Free | Medium
P2 - Next | Chatterbox TTS | Voice responses | Self-host | Medium
P2 - Next | Demucs | Stem separation | Self-host | High
P3 - Soon | CLAP + Audio Search | Semantic sample search | Self-host | High
P3 - Soon | Kimi-Audio 7B | Audio understanding | Self-host | High
P4 - Future | ACE-Step / YuE | Full song generation | Self-host | Very High
P4 - Future | RVC | Voice conversion | Self-host | High
P4 - Future | Sesame CSM | Voice cloning | Self-host | Medium

💰 Cost Scaling Analysis

Projected costs at different user scales with optimization strategies.

Users | Replicate (current) | Hybrid (Magenta free tier) | Self-hosted
1,000 | $320/mo | $80/mo | $50/mo
10,000 | $3,200/mo | $800/mo | $200/mo
100,000 | $32,000/mo | $4,000/mo | $1,000/mo
⚠️
Cost Strategy: Use Replicate for MVP to validate demand. Migrate to self-hosted on Fly.io GPU when reaching 5-10k users. Offer Magenta.js-based free tier with limited generations.
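
The Replicate column above is consistent with a simple linear model: users, times generations per user per month, times the $0.064 MusicGen price. Working backwards, $320/mo at 1,000 users implies about 5 generations per user per month. The sketch below encodes that model and adds a rough break-even against one always-on GPU at the Fly.io hourly range quoted earlier. The 730-hour month, the single-GPU assumption, and the function names are assumptions made for illustration, and the table's own self-hosted figures need not rest on the same assumptions.

```python
MUSICGEN_PRICE_PER_GEN = 0.064  # USD per generation, Replicate price quoted in this document
HOURS_PER_MONTH = 730  # assumption: average month with the GPU running all the time


def replicate_monthly_cost(users: int, gens_per_user: float) -> float:
    """Pay-per-use cost in USD per month."""
    return users * gens_per_user * MUSICGEN_PRICE_PER_GEN


def implied_gens_per_user(users: int, monthly_cost: float) -> float:
    """Generations per user per month implied by a row of the cost table."""
    return monthly_cost / (users * MUSICGEN_PRICE_PER_GEN)


def break_even_users(gpu_price_per_hour: float, gens_per_user: float) -> float:
    """Users at which one always-on GPU costs the same as paying per generation."""
    gpu_monthly = gpu_price_per_hour * HOURS_PER_MONTH
    return gpu_monthly / (gens_per_user * MUSICGEN_PRICE_PER_GEN)


if __name__ == "__main__":
    gens = implied_gens_per_user(1_000, 320.0)
    print(f"implied usage: {gens:.1f} generations per user per month")
    for users in (1_000, 10_000, 100_000):
        print(f"{users:>7} users: ${replicate_monthly_cost(users, gens):,.0f}/mo")
    for hourly in (0.50, 2.00):
        print(f"GPU at ${hourly:.2f}/hr breaks even near {break_even_users(hourly, gens):,.0f} users")
```

Under these assumptions the break-even falls between roughly 1,100 and 4,600 users, the same order of magnitude as the migration point named in the cost strategy. The estimate ignores whether a single GPU can actually serve that load, so it is a lower bound on where self-hosting starts to pay off.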

πŸ“ Research Notes & Sources

Research compiled December 2024. Model availability and pricing subject to change.