AI Audio Research - FlowState DAW

📑 Table of Contents

🎵 Full Song Generation
🗣️ Voice Assistant
🎛️ Stem Separation
🎹 MIDI Generation
✨ Audio Enhancement
🎚️ Mastering
🥁 Sample Generation
📈 Audio Analysis
🔊 Sound Design
🎤 Vocal Processing
🔍 Retrieval & Search
⚙️ Infrastructure

🎵

Full Song Generation

End-to-end music creation from text prompts

YuE (HKUST / M-A-P) Experimental

Full song with lyrics, structure (verse/chorus/bridge). Handles Chinese & English vocals. Multi-track output with stems.

⏱️ 2-5 min generation 🎯 Music + Lyrics 🔗 Apache 2.0

ACE-Step Experimental

Lyrics-in-place generation with precise word timing. Diffusion-based generation tuned for fast inference. Higher quality vocals than YuE.

⏱️ 30-60s generation 🎯 Precise Timing 🔗 Apache 2.0

DiffRhythm Experimental

Full-length songs (4:30+) with multi-section structure. Uses chunked generation for consistency. Best for instrumental focus.

⏱️ 1-3 min generation 🎯 Long Form 🔗 MIT License

MusicGen (Meta) $0.064/gen

Industry standard for text-to-music. Melody conditioning, stereo output. Excellent for 8-30s clips. Available via Replicate.

⏱️ 10-30s generation 🎯 Production Ready 🔗 CC-BY-NC (model weights)

Stable Audio 2.0 $0.05/gen

Up to 3 minutes of high-quality audio. Excellent prompt following. Stereo 44.1kHz output. Commercial license available.

⏱️ 20-60s generation 🎯 High Quality 🔗 Commercial

Suno v4 $0.10/song

Consumer favorite. Full songs with vocals, lyrics, structure. Most "finished" sounding output. Closed API.

⏱️ 30-60s generation 🎯 Consumer Ready 🔗 Proprietary

💡

Recommendation: Start with MusicGen via Replicate for MVP. Consider self-hosting ACE-Step or DiffRhythm for free tier to manage costs at scale.

🗣️

Voice Assistant / Conversational Audio

Real-time speech understanding and generation

Sesame CSM Free/OSS

1B parameter speech model with emotional understanding. Ultra-realistic voice cloning in seconds. Low latency streaming. Apache 2.0 license.

⏱️ 50-100ms latency 🎯 Voice Clone 🏷️ HIGH PRIORITY

Kimi-Audio 7B Free/OSS

Audio understanding + generation in one model. 13M hours training data. Can analyze beats, understand music context. End-to-end speech conversation.

⏱️ 100-200ms latency 🎯 Audio Understanding 🏷️ HIGH PRIORITY

Whisper (CF Workers) $0.0005/min

Industry standard STT. 99 languages. Built into Cloudflare Workers AI. Near-instant for short commands. Excellent for DAW voice control.

⏱️ 500ms average 🎯 STT Only 🏷️ IMPLEMENTED

Chatterbox TTS Free/Self-host

Emotional TTS with 5-second voice cloning. Expressive and natural sounding. Self-hostable on Fly.io. Zero marginal cost.

⏱️ 75-150ms latency 🎯 Emotional TTS 🏷️ PLANNED

Gemini 2.5 Native Audio $0.10/min

30 HD voices, 24 languages. Direct audio output from LLM. Emotional expression. Best for premium avatar personalities.

⏱️ 200-500ms latency 🎯 Premium TTS 🔗 Google API

ElevenLabs $0.30/min

Best-in-class voice quality. Instant voice cloning. Multilingual. Professional grade for commercial applications.

⏱️ 75-250ms latency 🎯 Premium Quality 🔗 Paid API

🎯

Strategy: Use Whisper (CF Workers) for STT, Chatterbox for 95% of TTS needs (free), reserve Gemini/ElevenLabs for premium avatar personalities.
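As a rough illustration of the routing in the strategy above, the sketch below sends premium avatar personalities to a paid engine and everything else to Chatterbox, then estimates the paid spend. The function names are hypothetical, and the 5% premium share is an assumption (simply the remainder of the 95% figure). Per-minute prices come from the cards above, and Chatterbox is treated as zero marginal cost, which ignores the hosting bill.

```python
# Per-minute prices from the cards above. Chatterbox is self-hosted, so its
# marginal cost is treated as zero here (an assumption: hosting is not counted).
PRICE_PER_MIN = {"chatterbox": 0.0, "gemini": 0.10, "elevenlabs": 0.30}


def pick_tts_engine(is_premium_avatar: bool, premium_engine: str = "gemini") -> str:
    """Route premium avatar personalities to a paid engine, the rest to Chatterbox."""
    return premium_engine if is_premium_avatar else "chatterbox"


def blended_tts_cost(total_minutes: float, premium_share: float = 0.05,
                     premium_engine: str = "gemini") -> float:
    """Paid TTS spend when only `premium_share` of all minutes use the paid engine."""
    premium_minutes = total_minutes * premium_share
    return premium_minutes * PRICE_PER_MIN[premium_engine]


if __name__ == "__main__":
    print(pick_tts_engine(is_premium_avatar=False))
    print(round(blended_tts_cost(10_000), 2))
    print(round(blended_tts_cost(10_000, premium_engine="elevenlabs"), 2))
```

Under these assumptions, the premium share is the only lever on TTS spend, so it is worth tracking per avatar rather than globally.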

🎛️

Stem Separation / Source Separation

Extract vocals, drums, bass, and other instruments from mixed audio

Demucs v4 (HTDemucs) Free/Self-host

State-of-the-art open source. Hybrid Transformer architecture. 4 stems (vocals, drums, bass, other) or 6 stems. Used by iZotope and others.

⏱️ 10-60s processing 🎯 4-6 Stems 🏷️ HIGH PRIORITY

UVR5 (Ultimate Vocal Remover) Free/OSS

GUI tool with multiple model options. MDX-Net, VR Architecture. Best for extracting clean vocals. Community-driven model zoo.

⏱️ 20-90s processing 🎯 Vocal Focus 🔗 MIT License

Spleeter (Deezer) Free/OSS

Original open-source solution. 2, 4, or 5 stem separation. Older but lightweight and fast. Good for real-time preview.

⏱️ 5-15s processing 🎯 Fast/Light 🔗 MIT License

AudioShake Enterprise

Commercial-grade stem separation. Used by major labels. Highest quality for professional applications. API available.

⏱️ 10-30s processing 🎯 Professional 🔗 Paid API

🎹

MIDI Generation / Pattern AI

Generate melodies, chord progressions, and drum patterns

Magenta.js (MusicVAE) Free/Browser

Google's music ML in the browser. Melody continuation, drum pattern generation, interpolation. Zero server cost. Currently integrated!

⏱️ 50-200ms 🎯 Browser Native 🏷️ IMPLEMENTED

ChordSeqAI Free/OSS

Specialized chord progression generator. Understanding of music theory. Can generate contextually appropriate progressions.

⏱️ 100-500ms 🎯 Chords 🔗 MIT License

MusicLM MIDI Mode Research

Google's music generation with MIDI output option. Text-to-MIDI capabilities. Not publicly available but architecture is documented.

⏱️ Variable 🎯 Text-to-MIDI 🔗 Research

RAVE (IRCAM) Free/OSS

Real-time Audio Variational autoEncoder. Can be used for timbre transfer and audio manipulation. Very fast inference.

⏱️ Real-time 🎯 Timbre 🔗 Open Source

✅

Already Integrated: Magenta.js is in FlowState for drum variations and pattern interpolation. Zero cost, runs entirely in browser.

✨

Audio Enhancement / Restoration

Noise removal, upscaling, and audio quality improvement

AudioSR Free/OSS

Audio super-resolution. Upscale low-quality audio to 48kHz. Can restore MP3 compression artifacts. Based on diffusion.

⏱️ 5-20s processing 🎯 Upscaling 🏷️ HIGH PRIORITY

DeepFilterNet Free/OSS

Real-time noise suppression. Runs on CPU efficiently. Great for cleaning vocal recordings. Can run in browser via WASM.

⏱️ Real-time 🎯 Noise Removal 🔗 MIT License

Resemble Enhance Free/OSS

AI speech enhancement. Denoising + super-resolution in one. Optimized for voice content. Good for podcast/vocal cleanup.

⏱️ 3-10s processing 🎯 Voice Enhance 🔗 MIT License

NVIDIA Maxine SDK License

Professional-grade audio effects. Noise removal, room echo cancellation, audio super-res. Requires NVIDIA GPU.

⏱️ Real-time 🎯 Professional 🔗 SDK

🎚️

Mastering / Mixing AI

Automated mixing and mastering assistance

Matchering 2.0 Free/OSS

Reference-based audio mastering. Match EQ, loudness, and stereo width to a reference track. Python library, easy to integrate.

⏱️ 5-15s processing 🎯 Reference Match 🏷️ MEDIUM PRIORITY

LANDR API $2-10/track

Industry standard AI mastering. Multiple style options. Used by millions. API available for integration.

⏱️ 30-120s 🎯 Full Master 🔗 Paid API

iZotope Ozone Plugin License

Master Assistant AI. Genre-aware mastering suggestions. Professional-grade. Plugin-based, not API accessible.

⏱️ Real-time 🎯 Professional 🔗 Plugin

CloudBounce $1-5/track

Budget AI mastering. Good for demos and quick masters. API available. Lower quality than LANDR but more affordable.

⏱️ 30-60s 🎯 Budget 🔗 Paid API
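To make the idea of reference-based mastering concrete, here is a minimal sketch of the loudness part only: scale a target signal so its RMS level matches a reference. This is not Matchering's actual algorithm (which, per the card above, also matches EQ and stereo width); the function names are hypothetical, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def match_rms(target, reference):
    """Return a copy of `target` scaled so its RMS equals that of `reference`.

    Samples are hard-clipped to [-1.0, 1.0] after scaling; a real mastering
    chain would use a limiter instead of clipping.
    """
    target_rms = rms(target)
    if target_rms == 0.0:
        return list(target)  # silence stays silence, avoid division by zero
    gain = rms(reference) / target_rms
    return [max(-1.0, min(1.0, s * gain)) for s in target]


if __name__ == "__main__":
    quiet_mix = [0.1, -0.1, 0.1, -0.1]
    loud_reference = [0.5, -0.5, 0.5, -0.5]
    mastered = match_rms(quiet_mix, loud_reference)
    print(round(rms(mastered), 3))
```

Matching EQ and stereo width follows the same pattern (measure the reference, measure the target, apply the difference), but on spectral and mid/side representations rather than a single gain value.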

🥁

Sample Generation / One-Shots

Generate drum hits, loops, and individual sounds

AudioCraft (AudioGen) Free/OSS

Meta's audio generation model. Text-to-sound effects. Good for generating one-shots and ambient textures.

⏱️ 5-15s generation 🎯 SFX Focus 🔗 CC-BY-NC (model weights)

Riffusion Free/OSS

Stable Diffusion fine-tuned on spectrograms. Can generate short loops and riffs. Interesting for experimental sounds.

⏱️ 5-10s generation 🎯 Loops/Riffs 🔗 MIT License

Drumify Research

Specialized drum sample generation. Style-conditioned one-shot generation. Research project with code available.

⏱️ 1-3s generation 🎯 Drums Only 🔗 Research

Splash Pro Subscription

AI sample generation service. Loop and one-shot generation. Genre-specific options. Web-based tool.

⏱️ 5-20s generation 🎯 All Types 🔗 Subscription

📈

Audio Analysis / Understanding

BPM detection, key detection, genre classification

Essentia.js Free/Browser

Audio analysis in the browser via WASM. BPM, key, loudness, spectral features. Production-ready and already used in shipping DAWs.

⏱️ Real-time 🎯 Full Analysis 🏷️ HIGH PRIORITY

Librosa Free/Python

Standard Python audio analysis library. Comprehensive features. Server-side only. Foundation for many audio ML projects.

⏱️ Variable 🎯 Full Analysis 🔗 ISC License

CLAP (Audio Embeddings) Free/OSS

Contrastive Language-Audio Pretraining. Like CLIP but for audio. Can search audio with text queries. Semantic audio understanding.

⏱️ 100-500ms 🎯 Semantic 🏷️ HIGH PRIORITY

basic-pitch Free/OSS

Spotify's polyphonic pitch detection. Audio-to-MIDI transcription. Very accurate for melodic content.

⏱️ 2-10s processing 🎯 Pitch/MIDI 🔗 Apache 2.0

🔊

Sound Design / Synthesis

Neural synthesizers and timbre manipulation

DDSP (Google) Free/OSS

Differentiable Digital Signal Processing. Neural synth with interpretable parameters. Timbre transfer. Real-time capable.

⏱️ Real-time 🎯 Synthesis 🔗 Apache 2.0

RAVE (IRCAM) Free/OSS

Real-time Audio Variational autoEncoder. Ultra-fast timbre transfer. Can morph between sounds. Works in Max/MSP, PureData.

⏱️ Real-time 🎯 Timbre 🔗 Open Source

Neutone Free Plugin

Platform for running neural audio models as plugins. Community-contributed models. Easy to deploy research models.

⏱️ Real-time 🎯 Plugin Host 🔗 Free

Vital + DALL-E Concept

Concept: Use image generation to create wavetables. Spectrogram-to-audio conversion. Experimental but interesting for sound design.

⏱️ Variable 🎯 Wavetables 🔗 Experimental

🎤

Vocal Processing

Voice conversion, pitch correction, and vocal effects

RVC (Retrieval-based Voice Conversion) Free/OSS

Real-time voice conversion. Train custom voice models with 10 min of audio. Very popular in music production community.

⏱️ Real-time 🎯 Voice Conversion 🏷️ MEDIUM PRIORITY

So-VITS-SVC Free/OSS

Singing Voice Conversion. Train models to sing in any voice. Higher quality than RVC but slower. Popular for covers.

⏱️ 2-10x slower than RVC 🎯 Singing Voice 🔗 MIT License

CREPE Free/OSS

Monophonic pitch tracking. Very accurate for vocals. Used in pitch correction. Python and TensorFlow.js versions.

⏱️ Near real-time 🎯 Pitch Track 🔗 MIT License

pYIN / Praat Free/Classic

Classic pitch detection algorithms. Lightweight, fast, proven. Good baseline for comparison. Available in Essentia.

⏱️ Real-time 🎯 Pitch 🔗 Various
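Pitch trackers such as CREPE and pYIN output a frequency in Hz; pitch correction then needs the nearest note and how far off the singer is. The sketch below shows that conversion using the standard equal-temperament formula (MIDI 69 = A4 = 440 Hz). The function names are hypothetical and nothing here depends on a specific tracker.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def hz_to_midi(freq_hz: float, a4_hz: float = 440.0) -> float:
    """Convert a frequency to a fractional MIDI note number (A4 = 69)."""
    return 69.0 + 12.0 * math.log2(freq_hz / a4_hz)


def nearest_note(freq_hz: float):
    """Return (note name, offset in cents) for the closest equal-tempered note.

    A positive offset means the input is sharp, a negative one means flat.
    """
    midi = hz_to_midi(freq_hz)
    nearest = round(midi)
    cents = (midi - nearest) * 100.0
    name = f"{NOTE_NAMES[nearest % 12]}{nearest // 12 - 1}"
    return name, cents


if __name__ == "__main__":
    for freq in (440.0, 450.0, 261.63):
        name, cents = nearest_note(freq)
        print(f"{freq} Hz -> {name} ({cents:+.1f} cents)")
```

A pitch corrector would shift the audio by the negative of the cents offset, usually smoothed over time so the correction does not sound robotic.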

🔍

Retrieval & Search

Semantic sample search and audio similarity

CLAP + FAISS Free/OSS

Text-to-audio search. "Find a punchy 808 kick" actually works. CLAP for embeddings, FAISS for fast similarity search.

⏱️ 50-200ms 🎯 Semantic 🏷️ HIGH PRIORITY

BGE Embeddings (CF) $0.02/1M

Text embeddings in Cloudflare Workers AI. Use for sample metadata search. Combine with Vectorize for similarity.

⏱️ 50ms 🎯 Text Search 🏷️ IMPLEMENTED

Chromaprint / AcoustID Free/OSS

Audio fingerprinting. Find duplicate samples, detect copyrighted content. Chromaprint is the fingerprinting library behind the AcoustID lookup service.

⏱️ 100-500ms 🎯 Fingerprint 🔗 LGPL

Cloudflare Vectorize Free Tier

Vector database for embeddings. Free tier covers 5M stored vector dimensions. Works with Workers. Fast similarity search at edge.

⏱️ 20-50ms 🎯 Vector DB 🏷️ IMPLEMENTED
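The embedding-based search described in this section (CLAP + FAISS, or BGE + Vectorize) comes down to ranking stored vectors by similarity to a query vector. Below is a minimal brute-force sketch of that ranking step. The 3-dimensional vectors and file names are made up for illustration; real embeddings come from the model and have far more dimensions, and FAISS or Vectorize would replace the linear scan with an index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec, index, top_k=3):
    """Return the `top_k` (name, score) pairs most similar to `query_vec`."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy index: hypothetical sample names mapped to made-up embeddings.
SAMPLE_INDEX = {
    "808_kick_punchy.wav": [0.9, 0.1, 0.0],
    "soft_pad.wav": [0.0, 0.2, 0.9],
    "snare_tight.wav": [0.5, 0.8, 0.1],
}

if __name__ == "__main__":
    # Stand-in for the embedding of the text query "punchy 808 kick".
    query = [1.0, 0.2, 0.0]
    for name, score in search(query, SAMPLE_INDEX, top_k=2):
        print(f"{name}: {score:.3f}")
```

Because text and audio share one embedding space in CLAP, the same ranking code serves both "find samples like this text" and "find samples like this sample".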

⚙️

Infrastructure & Deployment

Platforms for running AI audio models

Replicate Pay per use

GPU cloud for ML models. Pre-built models available. Easy API. Good for prototyping. ~$0.064/gen for MusicGen.

⏱️ Cold start: 10-30s 🎯 Serverless GPU 🏷️ CURRENT

Fly.io GPU $0.50-2/hr

GPU VMs for self-hosting. L40S and A100 available. Better for sustained workloads. Good for Demucs, TTS.

⏱️ Always on 🎯 Self-host 🏷️ PLANNED

Modal Pay per use

Serverless GPU with Python-native API. Fast cold starts. Good developer experience. Popular for AI apps.

⏱️ Cold start: 1-5s 🎯 Serverless 🔗 Alternative

Hugging Face Inference Free + Paid

Run HF models via API. Dedicated endpoints for reliability. Good model selection. Free tier has rate limits.

⏱️ Variable 🎯 HF Models 🔗 API

Cloudflare Workers AI 10K neurons/day

Edge AI inference. Limited audio models (Whisper). Excellent for text tasks. Free tier generous for STT.

⏱️ 50-500ms 🎯 Edge AI 🏷️ IMPLEMENTED

RunPod $0.20-0.50/hr

Budget GPU cloud. Spot instances available. Good for batch processing. Less reliable than Replicate.

⏱️ Variable 🎯 Budget GPU 🔗 Alternative

🎯 Integration Priority Matrix

Recommended order of integration based on user impact, cost, and complexity.

Priority	Technology	Use Case	Cost	Complexity
P0 - Done	Magenta.js	Pattern variations	Free	Low
P0 - Done	Whisper (CF)	Voice commands	~Free	Low
P0 - Done	BGE + Vectorize	Sample search	Free	Medium
P1 - Now	MusicGen (Replicate)	Text-to-music	$0.064/gen	Low
P2 - Next	Essentia.js	Audio analysis	Free	Medium
P2 - Next	Chatterbox TTS	Voice responses	Self-host	Medium
P2 - Next	Demucs	Stem separation	Self-host	High
P3 - Soon	CLAP + Audio Search	Semantic sample search	Self-host	High
P3 - Soon	Kimi-Audio 7B	Audio understanding	Self-host	High
P4 - Future	ACE-Step / YuE	Full song generation	Self-host	Very High
P4 - Future	RVC	Voice conversion	Self-host	High
P4 - Future	Sesame CSM	Voice cloning	Self-host	Medium

💰 Cost Scaling Analysis

Projected costs at different user scales with optimization strategies.

Users	Replicate (current)	Hybrid (Magenta free tier)	Self-hosted
1,000	$320/mo	$80/mo	$50/mo
10,000	$3,200/mo	$800/mo	$200/mo
100,000	$32,000/mo	$4,000/mo	$1,000/mo

⚠️

Cost Strategy: Use Replicate for MVP to validate demand. Migrate to self-hosted on Fly.io GPU when reaching 5-10k users. Offer Magenta.js-based free tier with limited generations.
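The Replicate column in the table above is consistent with about 5 generations per user per month at $0.064 each. That usage figure is inferred from the table, not stated in it. The sketch below reproduces the column under that assumption and compares it with the self-hosted figures from the same table; the function names are hypothetical.

```python
REPLICATE_PRICE_PER_GEN = 0.064  # MusicGen on Replicate, from the card above
GENS_PER_USER_PER_MONTH = 5      # assumption inferred from the table, not stated

# Self-hosted monthly figures copied from the table above, keyed by user count.
SELF_HOSTED_MONTHLY = {1_000: 50.0, 10_000: 200.0, 100_000: 1_000.0}


def replicate_monthly_cost(users: int,
                           gens_per_user: float = GENS_PER_USER_PER_MONTH,
                           price_per_gen: float = REPLICATE_PRICE_PER_GEN) -> float:
    """Monthly Replicate spend if every user runs `gens_per_user` generations."""
    return users * gens_per_user * price_per_gen


def monthly_savings(users: int) -> float:
    """Replicate cost minus the table's self-hosted figure for the same scale."""
    return replicate_monthly_cost(users) - SELF_HOSTED_MONTHLY[users]


if __name__ == "__main__":
    for users in SELF_HOSTED_MONTHLY:
        cost = replicate_monthly_cost(users)
        print(f"{users:>7} users: Replicate ${cost:,.0f}/mo, "
              f"savings if self-hosted ${monthly_savings(users):,.0f}/mo")
```

Changing `gens_per_user` is the quickest way to stress-test the migration threshold, since Replicate spend scales linearly with usage while the self-hosted figures do not.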

📝 Research Notes & Sources

YuE: github.com/multimodal-art-projection/YuE
ACE-Step: ace-step.github.io
DiffRhythm: github.com/ASLP-lab/DiffRhythm
Kimi-Audio: github.com/MoonshotAI/Kimi-Audio
Sesame CSM: github.com/SesameAI/csm
Demucs: github.com/facebookresearch/demucs
Magenta.js: magenta.tensorflow.org/js
CLAP: github.com/LAION-AI/CLAP
AudioSR: github.com/haoheliu/versatile_audio_super_resolution
Matchering: github.com/sergree/matchering

Research compiled December 2024. Model availability and pricing subject to change.
