AI Integration Layer
ACE-Step, Magenta.js, Workers AI, and Smart Inference Routing
AI Philosophy
AI in FlowState should be optional, contextual, and friction-reducing. It's not about replacing the producer's creativity; it's about removing barriers and accelerating workflows.
Smart Inference Router
The router decides which model to use based on task complexity, latency requirements, and cost.
```typescript
// Illustrative types; the real AIRequest/Provider definitions live elsewhere in the codebase.
type Provider = 'cloudflare-workers-ai' | 'htdemucs' | 'stable-audio-open' | 'gemini-3-flash' | 'cloudflare-llama-4-scout';

interface AIRequest {
  type: string;                        // e.g. 'stem_separation', 'music_gen', 'chat'
  latency: 'realtime' | 'background';
  complexity: 'simple' | 'complex';
  audio_input?: boolean;
}

function routeInference(request: AIRequest): Provider {
  // TIER 1: Cloudflare Edge (realtime, simple)
  if (request.latency === 'realtime' && request.complexity === 'simple') {
    return 'cloudflare-workers-ai'; // FREE tier
  }

  // TIER 2: Free models (background, creative)
  if (request.latency === 'background') {
    if (request.type === 'stem_separation') return 'htdemucs';
    if (request.type === 'music_gen') return 'stable-audio-open';
  }

  // TIER 3: Gemini (complex, audio understanding)
  if (request.audio_input || request.complexity === 'complex') {
    return 'gemini-3-flash';
  }

  return 'cloudflare-llama-4-scout'; // FREE, good default
}
```
AI Capabilities Matrix (v0.1.99 - Implemented)
| Feature | Model | Cost | Latency | Status |
|---|---|---|---|---|
| Pattern AI | Magenta.js (browser) | FREE | 1-3s | ✅ Live |
| Music Generation | ACE-Step (with lyrics) | ~$0.05/gen | 15-25s | ✅ Live |
| SFX Generation | Stable Audio Open | ~$0.02/gen | 5-15s | ✅ Live |
| Voice Clone | OpenVoice v2 | ~$0.05/gen | 5-10s | ✅ Live |
| Text-to-Speech | Chatterbox TTS | ~$0.03/gen | 2-5s | ✅ Live |
| Stem Separation | Meta Demucs | ~$0.02/gen | 10-30s | ✅ Live |
| Audio Analysis | Essentia.js (browser) | FREE | 100-500ms | ✅ Live |
| Voice Commands | Whisper (CF Workers AI) | $0.0005/min | 500ms | ✅ Live |
| Sample Search | CLAP Embeddings | FREE (D1/Vectorize) | 50ms | ✅ Live |
| AI Chat | Gemini 3 Flash | $0.075-0.30/1M | 200-2000ms | ✅ Live |
| AI Mastering | Custom pipeline | ~$0.10/gen | 30-60s | ✅ Live |
AI Studio Architecture (v0.1.99)
The AI Studio is built on a tiered architecture. FREE features run entirely in the browser; PRO features call cloud APIs:
FREE Tier (Magenta.js + Essentia.js)
- Pattern AI: Generate drums & melodies using MusicVAE, 100% client-side
- Audio Analysis: BPM, key detection, onset analysis using Essentia.js
- Morph Slider: Interpolate between patterns using VAE latent space
- Cost: $0/month - runs in user's browser
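The Morph Slider reduces to linear interpolation in the VAE latent space. A minimal sketch of that math, with latent vectors as plain number arrays; the real feature would go through MusicVAE's encode/interpolate/decode, which is omitted here:

```typescript
// Sketch of the latent-space math behind the Morph Slider.
// A real implementation would call MusicVAE.encode/decode; here the
// latent vectors are plain number arrays and decoding is out of scope.
function lerpLatent(a: number[], b: number[], t: number): number[] {
  if (a.length !== b.length) throw new Error('latent dims must match');
  return a.map((v, i) => v + (b[i] - v) * t);
}

// Slider at 0 returns pattern A's latent, at 1 pattern B's, 0.5 an even blend.
console.log(lerpLatent([0, 2, 4], [4, 2, 0], 0.5)); // [2, 2, 2]
```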
PRO Tier (Cloud APIs)
- ACE-Step: Full song generation with lyrics support (up to 4 min)
- Stable Audio Open: SFX generation (impacts, ambience, foley, up to 47s)
- OpenVoice v2: Voice cloning from 5-10s samples, 6 languages
- Chatterbox TTS: Expressive text-to-speech for scratch vocals
- Meta Demucs: Stem separation into vocals/drums/bass/other
- Cost: ~$0.02-0.10 per generation
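The per-generation costs above can be folded into a small budgeting helper. A sketch only: the model keys are hypothetical identifiers, and the prices simply mirror the list above.

```typescript
// Hypothetical catalog of PRO-tier generators; costs mirror the list above.
const PRO_MODELS = {
  ace_step:     { label: 'ACE-Step',          costUSD: 0.05 },
  stable_audio: { label: 'Stable Audio Open', costUSD: 0.02 },
  openvoice:    { label: 'OpenVoice v2',      costUSD: 0.05 },
  chatterbox:   { label: 'Chatterbox TTS',    costUSD: 0.03 },
  demucs:       { label: 'Meta Demucs',       costUSD: 0.02 },
} as const;

type ProModel = keyof typeof PRO_MODELS;

// Rough spend estimate for a batch of generations.
function estimateSpend(counts: Partial<Record<ProModel, number>>): number {
  return Object.entries(counts).reduce(
    (sum, [model, n]) => sum + PRO_MODELS[model as ProModel].costUSD * (n ?? 0),
    0,
  );
}

// e.g. 10 songs + 20 SFX ≈ $0.90
console.log(estimateSpend({ ace_step: 10, stable_audio: 20 }).toFixed(2));
```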
Gemini 3 Flash Integration
Gemini 3 Flash (December 2025) is the core premium AI for complex tasks.
Key Capabilities
- Native Audio Input: Analyze beats, detect patterns without transcription
- 1M Token Context: Entire project history in one conversation
- Thinking Levels: MINIMAL → LOW → MEDIUM → HIGH (cost vs. quality)
- Agentic Mode: Multi-step tasks with tool calling
When to Use Gemini
- ✅ "Analyze this beat and suggest improvements"
- ✅ "Help me write a hook for this track"
- ✅ Complex multi-turn conversations
- ✅ Audio understanding tasks
- ❌ Simple transport commands (use Llama instead)
- ❌ Basic text classification
Pricing
| Thinking Level | Input | Output | Use Case |
|---|---|---|---|
| None (fast) | $0.075/1M | $0.30/1M | Quick answers |
| LOW | $0.15/1M | $0.60/1M | Standard queries |
| MEDIUM | $0.50/1M | $2.00/1M | Complex analysis |
| HIGH | $3.50/1M | $14.00/1M | Deep reasoning |
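The pricing table translates directly into a per-request cost function. A sketch assuming token counts are known up front; level names follow the table, with `none` as the no-thinking fast path:

```typescript
// Pricing from the table above, in USD per 1M tokens.
const GEMINI_PRICING = {
  none:   { inPer1M: 0.075, outPer1M: 0.30 },
  low:    { inPer1M: 0.15,  outPer1M: 0.60 },
  medium: { inPer1M: 0.50,  outPer1M: 2.00 },
  high:   { inPer1M: 3.50,  outPer1M: 14.00 },
} as const;

function geminiCostUSD(
  level: keyof typeof GEMINI_PRICING,
  inputTokens: number,
  outputTokens: number,
): number {
  const p = GEMINI_PRICING[level];
  return (inputTokens * p.inPer1M + outputTokens * p.outPer1M) / 1_000_000;
}

// A 10K-token prompt with a 1K-token answer on MEDIUM costs about $0.007.
console.log(geminiCostUSD('medium', 10_000, 1_000));
```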
Gemini 2.5 Flash Native Audio
For voice OUTPUT (avatar responses, assistant speech).
Features
- 30 HD voices in 24 languages
- Real-time streaming output
- Emotional expression
- Direct audio response (no TTS needed)
Cost Comparison
| Option | Cost/min | Quality | Latency |
|---|---|---|---|
| Gemini Native Audio | $0.10 | Excellent | 200-500ms |
| MeloTTS (Workers AI) | $0.0002 | Good | 100-200ms |
| Chatterbox (self-host) | FREE | Excellent | 75-150ms |
| ElevenLabs | $0.30 | Excellent | 75-250ms |
Cloudflare Workers AI Models
Available Models
| Model | Use Case | Free Tier |
|---|---|---|
| `@cf/openai/whisper` | Speech-to-text | 10K neurons/day |
| `@cf/meta/llama-4-scout` | Intent classification, simple queries | 10K neurons/day |
| `@cf/baai/bge-base-en-v1.5` | Text embeddings for search | 10K neurons/day |
| `@cf/stabilityai/stable-diffusion-xl` | Album art generation | 10K neurons/day |
Example: Whisper Integration
```typescript
// workers/api/transcribe.ts
// `env.AI` is the Workers AI binding declared in wrangler.toml.
export async function transcribe(audio: ArrayBuffer, env: Env) {
  const result = await env.AI.run('@cf/openai/whisper', {
    // Workers AI expects the audio payload as a plain number array
    audio: [...new Uint8Array(audio)]
  });

  return {
    text: result.text,
    language: result.detected_language,
    segments: result.segments
  };
}
```
Voice Command Pipeline
Supported Commands (MVP)
| Category | Examples |
|---|---|
| Transport | "play", "stop", "pause", "loop section" |
| Mixer | "mute track 1", "solo drums", "turn up the bass" |
| Tempo | "set BPM 90", "faster", "slower" |
| Samples | "find a snare", "add kick to track 2" |
| Project | "save project", "export", "new track" |
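Many of these MVP commands can be matched with plain rules before paying for an LLM call. A toy rule-based sketch covering a few categories; it is hypothetical, not the production pipeline, which would still fall back to Whisper plus an LLM for fuzzier phrasing:

```typescript
// Toy rule-based intent parser for a subset of the MVP voice commands.
type Command =
  | { kind: 'transport'; action: 'play' | 'stop' | 'pause' }
  | { kind: 'tempo'; bpm: number }
  | { kind: 'mixer'; action: 'mute' | 'solo'; target: string }
  | { kind: 'unknown'; raw: string };

function parseCommand(text: string): Command {
  const t = text.trim().toLowerCase();
  if (t === 'play' || t === 'stop' || t === 'pause') {
    return { kind: 'transport', action: t };
  }
  const bpm = t.match(/^set bpm (\d+)$/);
  if (bpm) return { kind: 'tempo', bpm: Number(bpm[1]) };
  const mix = t.match(/^(mute|solo) (.+)$/);
  if (mix) return { kind: 'mixer', action: mix[1] as 'mute' | 'solo', target: mix[2] };
  return { kind: 'unknown', raw: text };
}

console.log(parseCommand('set BPM 90')); // { kind: 'tempo', bpm: 90 }
```

Anything that falls through to `unknown` is what actually gets routed to a model.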
AI Assistant Design
System Prompt Template
```
You are FlowState AI, a hip-hop production assistant.

Current project context:
- Tempo: ${project.tempo} BPM
- Key: ${project.key || 'not set'}
- Tracks: ${project.tracks.length}
- Current selection: ${selection}

You can help with:
- Production tips and techniques
- Sample recommendations
- Mixing advice
- Workflow optimization

Keep responses concise and actionable.
If the user asks you to do something in the DAW,
respond with a JSON action block.
```
Conversation Flow
1. User types or speaks a query
2. Include current project context in the system prompt
3. Route to the appropriate model (Llama for simple, Gemini for complex)
4. Stream the response to the chat panel
5. If an action is needed, execute the DAW command
6. Cache the response in AI Gateway
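The context-injection step can be sketched as a pure function that renders the template above. Field names here (`trackCount`, `selection`) are illustrative, not the real project model:

```typescript
// Hypothetical project shape; the real model may differ.
interface ProjectContext {
  tempo: number;
  key?: string;
  trackCount: number;
  selection: string;
}

// Render the system prompt with the current project context filled in.
function buildSystemPrompt(p: ProjectContext): string {
  return [
    'You are FlowState AI, a hip-hop production assistant.',
    `Tempo: ${p.tempo} BPM`,
    `Key: ${p.key ?? 'not set'}`,
    `Tracks: ${p.trackCount}`,
    `Current selection: ${p.selection}`,
    'Keep responses concise and actionable.',
  ].join('\n');
}

const prompt = buildSystemPrompt({ tempo: 90, trackCount: 4, selection: 'bars 1-4' });
console.log(prompt.includes('Key: not set')); // true
```

Keeping this a pure function also makes the prompt easy to hash for AI Gateway cache keys.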
Cost Optimization Strategies
1. AI Gateway Caching
Cache common queries to reduce API calls by 40-70%.
```
// AI Gateway automatically caches based on:
// - Query similarity
// - TTL settings
// - Cache rules
//
// Configure in Cloudflare dashboard:
//   AI Gateway > flowstate > Caching > Enable
```
2. Smart Model Routing
Use Llama 4 Scout (FREE) for 80% of queries, Gemini for 20%.
3. Context Caching
Gemini offers 75% discount on repeated system prompts. Cache project context.
4. Batch Embeddings
Embed samples in batches during off-peak hours.
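Batching amounts to chunking the sample list before calling the embeddings endpoint. A minimal sketch; the batch size of 100 is an assumption, not a documented Workers AI limit:

```typescript
// Split N sample descriptions into API-sized chunks.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// 250 samples at batch size 100 → 3 requests instead of 250.
console.log(chunk(Array.from({ length: 250 }, (_, i) => i), 100).length); // 3
```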
Monthly Cost Estimate (10K users)
| Service | Baseline | Optimized |
|---|---|---|
| Workers AI | $150 | $100 |
| Gemini API | $400 | $200 |
| TTS | $100 | $0 (Chatterbox) |
| Total AI | $650 | $300 |