Research Overview
This research document surveys AI audio technologies suitable for FlowState DAW, from full song generation to mastering and stem separation. Each model is evaluated for cost, latency, quality, and integration complexity.
Full song with lyrics and structure (verse/chorus/bridge). Handles Chinese & English vocals. Multi-track output with stems.
⏱️ 2-5 min generation
🎯 Music + Lyrics
Apache 2.0
Lyrics-in-place generation with precise word timing. Single-step diffusion for fast inference. Higher quality vocals than YuE.
⏱️ 30-60s generation
🎯 Precise Timing
Apache 2.0
Full-length songs (4:30+) with multi-section structure. Uses chunked generation for consistency. Best for instrumental focus.
⏱️ 1-3 min generation
🎯 Long Form
MIT License
Industry standard for text-to-music. Melody conditioning, stereo output. Excellent for 8-30s clips. Available via Replicate.
⏱️ 10-30s generation
🎯 Production Ready
CC-BY-NC
Up to 3 minutes of high-quality audio. Excellent prompt following. Stereo 44.1kHz output. Commercial license available.
⏱️ 20-60s generation
🎯 High Quality
Commercial
Consumer favorite. Full songs with vocals, lyrics, structure. Most "finished" sounding output. Closed API.
⏱️ 30-60s generation
🎯 Consumer Ready
Proprietary
💡 Recommendation: Start with MusicGen via Replicate for the MVP. Consider self-hosting ACE-Step or DiffRhythm for the free tier to manage costs at scale.
1B parameter speech model with emotional understanding. Ultra-realistic voice cloning in seconds. Low latency streaming. Apache 2.0 license.
⏱️ 50-100ms latency
🎯 Voice Clone
🏷️ HIGH PRIORITY
Audio understanding + generation in one model. 13M hours training data. Can analyze beats, understand music context. End-to-end speech conversation.
⏱️ 100-200ms latency
🎯 Audio Understanding
🏷️ HIGH PRIORITY
Industry standard STT. 99 languages. Built into Cloudflare Workers AI. Near-instant for short commands. Excellent for DAW voice control.
⏱️ 500ms average
🎯 STT Only
🏷️ IMPLEMENTED
Emotional TTS with 5-second voice cloning. Expressive and natural sounding. Self-hostable on Fly.io. Zero marginal cost.
⏱️ 75-150ms latency
🎯 Emotional TTS
🏷️ PLANNED
30 HD voices, 24 languages. Direct audio output from LLM. Emotional expression. Best for premium avatar personalities.
⏱️ 200-500ms latency
🎯 Premium TTS
Google API
Best-in-class voice quality. Instant voice cloning. Multilingual. Professional grade for commercial applications.
⏱️ 75-250ms latency
🎯 Premium Quality
Paid API
🎯 Strategy: Use Whisper (CF Workers) for STT and Chatterbox for 95% of TTS needs (free); reserve Gemini/ElevenLabs for premium avatar personalities.
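The routing rule in this strategy can be sketched as a small function. This is only an illustration: `TtsRequest`, `choose_tts_engine`, the `premium_avatar` flag, and the engine labels are hypothetical names, not part of FlowState or of any vendor SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TtsRequest:
    """Hypothetical request shape; only the fields needed for routing."""
    text: str
    premium_avatar: bool = False


def choose_tts_engine(request: TtsRequest) -> str:
    """Send ordinary requests to the self-hosted engine.

    Only premium avatar personalities are routed to a paid API,
    mirroring the "Chatterbox by default, Gemini/ElevenLabs for premium" rule.
    """
    if request.premium_avatar:
        return "premium-api"   # placeholder label for Gemini or ElevenLabs
    return "chatterbox"        # self-hosted default


default_engine = choose_tts_engine(TtsRequest("Track armed for recording."))
premium_engine = choose_tts_engine(TtsRequest("Welcome back!", premium_avatar=True))
```

Keeping the decision in one function makes it easy to measure what share of traffic actually reaches the paid engines.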
State-of-the-art open source. Hybrid Transformer architecture. 4 stems (vocals, drums, bass, other) or 6 stems. Used by iZotope and others.
⏱️ 10-60s processing
🎯 4-6 Stems
🏷️ HIGH PRIORITY
GUI tool with multiple model options. MDX-Net, VR Architecture. Best for extracting clean vocals. Community-driven model zoo.
⏱️ 20-90s processing
🎯 Vocal Focus
MIT License
Original open-source solution. 2, 4, or 5 stem separation. Older but lightweight and fast. Good for real-time preview.
⏱️ 5-15s processing
🎯 Fast/Light
MIT License
Commercial-grade stem separation. Used by major labels. Highest quality for professional applications. API available.
⏱️ 10-30s processing
🎯 Professional
Paid API
Google's music ML in the browser. Melody continuation, drum pattern generation, interpolation. Zero server cost. Currently integrated!
⏱️ 50-200ms
🎯 Browser Native
🏷️ IMPLEMENTED
Specialized chord progression generator. Understanding of music theory. Can generate contextually appropriate progressions.
⏱️ 100-500ms
🎯 Chords
MIT License
Google's music generation with MIDI output option. Text-to-MIDI capabilities. Not publicly available but architecture is documented.
⏱️ Variable
🎯 Text-to-MIDI
Research
Real-time Audio Variational autoEncoder. Can be used for timbre transfer and audio manipulation. Very fast inference.
⏱️ Real-time
🎯 Timbre
Open Source
✅ Already Integrated: Magenta.js is in FlowState for drum variations and pattern interpolation. Zero cost; runs entirely in the browser.
Audio super-resolution. Upscale low-quality audio to 48kHz. Can restore MP3 compression artifacts. Based on diffusion.
⏱️ 5-20s processing
🎯 Upscaling
🏷️ HIGH PRIORITY
Real-time noise suppression. Runs on CPU efficiently. Great for cleaning vocal recordings. Can run in browser via WASM.
⏱️ Real-time
🎯 Noise Removal
MIT License
AI speech enhancement. Denoising + super-resolution in one. Optimized for voice content. Good for podcast/vocal cleanup.
⏱️ 3-10s processing
🎯 Voice Enhance
MIT License
Professional-grade audio effects. Noise removal, room echo cancellation, audio super-res. Requires NVIDIA GPU.
⏱️ Real-time
🎯 Professional
SDK
Reference-based audio mastering. Match EQ, loudness, and stereo width to a reference track. Python library, easy to integrate.
⏱️ 5-15s processing
🎯 Reference Match
🏷️ MEDIUM PRIORITY
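To make "match loudness to a reference track" concrete, here is a minimal sketch of the simplest piece of that idea: scaling a track so its RMS level equals the reference's. It is not the implementation of any mastering library; real tools also match EQ and stereo width and typically use perceptual loudness measures rather than plain RMS. The function names and the sine-wave test signals are assumptions made for illustration.

```python
import math


def rms(samples):
    """Root-mean-square level of a mono signal given as a list of floats."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def match_loudness(target, reference):
    """Scale `target` so that its RMS level equals that of `reference`."""
    target_rms = rms(target)
    if target_rms == 0.0:
        return list(target)  # silence cannot be scaled to a level
    gain = rms(reference) / target_rms
    return [s * gain for s in target]


# Synthetic stand-ins for audio: a quiet mix and a louder reference.
sample_rate = 8000
quiet_mix = [0.1 * math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(sample_rate)]
reference = [0.5 * math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(sample_rate)]

matched = match_loudness(quiet_mix, reference)
gain_db = 20 * math.log10(rms(matched) / rms(quiet_mix))  # applied gain in decibels
```

A production version would also need a limiter or headroom check, since a large gain can push peaks past full scale.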
Industry standard AI mastering. Multiple style options. Used by millions. API available for integration.
⏱️ 30-120s
🎯 Full Master
Paid API
Master Assistant AI. Genre-aware mastering suggestions. Professional-grade. Plugin-based, not API accessible.
⏱️ Real-time
🎯 Professional
Plugin
Budget AI mastering. Good for demos and quick masters. API available. Lower quality than LANDR but more affordable.
⏱️ 30-60s
🎯 Budget
Paid API
Meta's audio generation model. Text-to-sound effects. Good for generating one-shots and ambient textures.
⏱️ 5-15s generation
🎯 SFX Focus
CC-BY-NC
Stable Diffusion fine-tuned on spectrograms. Can generate short loops and riffs. Interesting for experimental sounds.
⏱️ 5-10s generation
🎯 Loops/Riffs
MIT License
Specialized drum sample generation. Style-conditioned one-shot generation. Research project with code available.
⏱️ 1-3s generation
🎯 Drums Only
Research
AI sample generation service. Loop and one-shot generation. Genre-specific options. Web-based tool.
⏱️ 5-20s generation
🎯 All Types
Subscription
Audio analysis in the browser via WASM. BPM, key, loudness, spectral features. Production-ready, used in production DAWs.
⏱️ Real-time
🎯 Full Analysis
🏷️ HIGH PRIORITY
Standard Python audio analysis library. Comprehensive features. Server-side only. Foundation for many audio ML projects.
⏱️ Variable
🎯 Full Analysis
ISC License
Contrastive Language-Audio Pretraining. Like CLIP but for audio. Can search audio with text queries. Semantic audio understanding.
⏱️ 100-500ms
🎯 Semantic
🏷️ HIGH PRIORITY
Spotify's polyphonic pitch detection. Audio-to-MIDI transcription. Very accurate for melodic content.
⏱️ 2-10s processing
🎯 Pitch/MIDI
Apache 2.0
Differentiable Digital Signal Processing. Neural synth with interpretable parameters. Timbre transfer. Real-time capable.
⏱️ Real-time
🎯 Synthesis
Apache 2.0
Real-time Audio Variational autoEncoder. Ultra-fast timbre transfer. Can morph between sounds. Works in Max/MSP, PureData.
⏱️ Real-time
🎯 Timbre
Open Source
Platform for running neural audio models as plugins. Community-contributed models. Easy to deploy research models.
⏱️ Real-time
🎯 Plugin Host
Free
Concept: Use image generation to create wavetables. Spectrogram-to-audio conversion. Experimental but interesting for sound design.
⏱️ Variable
🎯 Wavetables
Experimental
Real-time voice conversion. Train custom voice models with 10 min of audio. Very popular in music production community.
⏱️ Real-time
🎯 Voice Conversion
🏷️ MEDIUM PRIORITY
Singing Voice Conversion. Train models to sing in any voice. Higher quality than RVC but slower. Popular for covers.
⏱️ 2-10x slower
🎯 Singing Voice
MIT License
Monophonic pitch tracking. Very accurate for vocals. Used in pitch correction. Python and TensorFlow.js versions.
⏱️ Near real-time
🎯 Pitch Track
MIT License
Classic pitch detection algorithms. Lightweight, fast, proven. Good baseline for comparison. Available in Essentia.
⏱️ Real-time
🎯 Pitch
Various
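As an illustration of what a lightweight classic pitch detector does, the sketch below estimates the fundamental of a monophonic signal from the strongest autocorrelation peak. The card above does not name specific algorithms, so plain autocorrelation is used here as an assumed stand-in; the function name, the 50-1000 Hz search range, and the synthetic test tone are all illustrative choices.

```python
import math


def estimate_pitch_hz(samples, sample_rate, fmin=50.0, fmax=1000.0):
    """Estimate the fundamental frequency via time-domain autocorrelation.

    For each candidate lag (one period of a frequency between fmax and fmin)
    the signal is multiplied with a delayed copy of itself; the lag with the
    largest sum is taken as the period.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# A 200 Hz test tone, 0.1 s long at 8 kHz (period of exactly 40 samples).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(800)]
estimated = estimate_pitch_hz(tone, sample_rate)
```

Because the estimate is quantized to whole-sample lags, accuracy drops at high pitches; refinements such as peak interpolation address this, which is one reason such detectors serve better as a baseline than as a final tracker.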
Text-to-audio search. "Find a punchy 808 kick" actually works. CLAP for embeddings, FAISS for fast similarity search.
⏱️ 50-200ms
🎯 Semantic
🏷️ HIGH PRIORITY
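The retrieval step behind text-to-audio search can be sketched without any ML dependencies. In the sketch below the three-dimensional vectors are hand-made toy values standing in for CLAP embeddings, and the brute-force loop stands in for a FAISS index; the sample names and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, library, k=2):
    """Return the k most similar (name, score) pairs, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings: in practice an audio encoder would produce these per sample.
sample_library = {
    "808_kick_punchy": [0.9, 0.1, 0.0],
    "soft_pad": [0.0, 0.2, 0.9],
    "snare_tight": [0.5, 0.8, 0.1],
}

# Toy embedding of the text query "punchy 808 kick" (a text encoder's job).
query = [1.0, 0.2, 0.0]
results = search(query, sample_library, k=2)
```

Brute force is adequate for a few thousand samples; an approximate nearest-neighbor index becomes worthwhile as the library grows.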
Text embeddings in Cloudflare Workers AI. Use for sample metadata search. Combine with Vectorize for similarity.
⏱️ 50ms
🎯 Text Search
🏷️ IMPLEMENTED
Audio fingerprinting. Find duplicate samples, detect copyright content. Used by Shazam-like apps.
⏱️ 100-500ms
🎯 Fingerprint
LGPL
Vector database for embeddings. 5M vectors free. Works with Workers. Fast similarity search at edge.
⏱️ 20-50ms
🎯 Vector DB
🏷️ IMPLEMENTED
GPU cloud for ML models. Pre-built models available. Easy API. Good for prototyping. ~$0.064/gen for MusicGen.
⏱️ Cold start: 10-30s
🎯 Serverless GPU
🏷️ CURRENT
GPU VMs for self-hosting. L40S and A100 available. Better for sustained workloads. Good for Demucs, TTS.
⏱️ Always on
🎯 Self-host
🏷️ PLANNED
Serverless GPU with Python-native API. Fast cold starts. Good developer experience. Popular for AI apps.
⏱️ Cold start: 1-5s
🎯 Serverless
Alternative
Run HF models via API. Dedicated endpoints for reliability. Good model selection. Free tier has rate limits.
⏱️ Variable
🎯 HF Models
API
Edge AI inference. Limited audio models (Whisper). Excellent for text tasks. Free tier generous for STT.
⏱️ 50-500ms
🎯 Edge AI
🏷️ IMPLEMENTED
Budget GPU cloud. Spot instances available. Good for batch processing. Less reliable than Replicate.
⏱️ Variable
🎯 Budget GPU
Alternative
🎯 Integration Priority Matrix
Recommended order of integration based on user impact, cost, and complexity.
| Priority | Technology | Use Case | Cost | Complexity |
|---|---|---|---|---|
| P0 - Done | Magenta.js | Pattern variations | Free | Low |
| P0 - Done | Whisper (CF) | Voice commands | ~Free | Low |
| P0 - Done | BGE + Vectorize | Sample search | Free | Medium |
| P1 - Now | MusicGen (Replicate) | Text-to-music | $0.064/gen | Low |
| P2 - Next | Essentia.js | Audio analysis | Free | Medium |
| P2 - Next | Chatterbox TTS | Voice responses | Self-host | Medium |
| P2 - Next | Demucs | Stem separation | Self-host | High |
| P3 - Soon | CLAP + Audio Search | Semantic sample search | Self-host | High |
| P3 - Soon | Kimi-Audio 7B | Audio understanding | Self-host | High |
| P4 - Future | ACE-Step / YuE | Full song generation | Self-host | Very High |
| P4 - Future | RVC | Voice conversion | Self-host | High |
| P4 - Future | Sesame CSM | Voice cloning | Self-host | Medium |
💰 Cost Scaling Analysis
Projected costs at different user scales with optimization strategies.
| Users | Replicate (current) | Hybrid (Magenta free tier) | Self-hosted |
|---|---|---|---|
| 1,000 | $320/mo | $80/mo | $50/mo |
| 10,000 | $3,200/mo | $800/mo | $200/mo |
| 100,000 | $32,000/mo | $4,000/mo | $1,000/mo |
⚠️ Cost Strategy: Use Replicate for MVP to validate demand. Migrate to self-hosted on Fly.io GPU when reaching 5-10k users. Offer Magenta.js-based free tier with limited generations.
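The Replicate column is consistent with roughly five generations per user per month at the quoted $0.064 per generation (1,000 users × 5 × $0.064 = $320). That per-user rate is inferred from the table rather than stated in the source, so it appears below as an explicit assumption. The sketch reproduces the column and shows how much moving to the self-hosted figures would save at each scale; the function and constant names are illustrative.

```python
from decimal import Decimal

COST_PER_GENERATION = Decimal("0.064")  # MusicGen on Replicate, as quoted above
GENERATIONS_PER_USER = 5                # ASSUMPTION: inferred from $320 / (1,000 x $0.064)

# Self-hosted monthly figures copied from the table above.
SELF_HOSTED_MONTHLY = {
    1_000: Decimal("50"),
    10_000: Decimal("200"),
    100_000: Decimal("1000"),
}


def replicate_monthly_cost(users: int) -> Decimal:
    """Projected Replicate spend per month for a given user count."""
    return COST_PER_GENERATION * GENERATIONS_PER_USER * users


def monthly_savings(users: int) -> Decimal:
    """Difference between Replicate and self-hosted cost at a tabulated scale."""
    return replicate_monthly_cost(users) - SELF_HOSTED_MONTHLY[users]


for users in SELF_HOSTED_MONTHLY:
    print(f"{users:>7,} users: Replicate ${replicate_monthly_cost(users):,.0f}/mo, "
          f"savings if self-hosted ${monthly_savings(users):,.0f}/mo")
```

Under these assumptions the gap is about $3,000/mo at 10,000 users, which fits the suggested 5-10k user migration window. The comparison leaves out engineering and operations time for self-hosting.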
Research Notes & Sources
Research compiled December 2024. Model availability and pricing are subject to change.