AI Integration Layer
Gemini, Workers AI, and Smart Inference Routing
AI Philosophy
AI in FlowState should be optional, contextual, and friction-reducing. It's not about replacing the producer's creativity; it's about removing barriers and accelerating workflows.
Key Principle: Use the cheapest model that works. Reserve premium Gemini for tasks that truly need it.
Smart Inference Router
The router decides which model to use based on task complexity, latency requirements, and cost.
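// Illustrative types inferred from the checks below; adjust to the real schema.
type Provider = 'cloudflare-workers-ai' | 'htdemucs' | 'stable-audio-open'
  | 'gemini-3-flash' | 'cloudflare-llama-4-scout';

interface AIRequest {
  type?: string;                        // e.g. 'stem_separation', 'music_gen'
  latency: 'realtime' | 'background';   // extend if more latency classes exist
  complexity: 'simple' | 'complex';
  audio_input?: boolean;
}

function routeInference(request: AIRequest): Provider {
  // TIER 1: Cloudflare Edge (realtime, simple)
  if (request.latency === 'realtime' && request.complexity === 'simple') {
    return 'cloudflare-workers-ai'; // FREE tier
  }
  // TIER 2: Free Models (background, creative)
  if (request.latency === 'background') {
    if (request.type === 'stem_separation') return 'htdemucs';
    if (request.type === 'music_gen') return 'stable-audio-open';
    // Other background work falls through to the checks below.
  }
  // TIER 3: Gemini (complex, audio understanding)
  if (request.audio_input || request.complexity === 'complex') {
    return 'gemini-3-flash';
  }
  return 'cloudflare-llama-4-scout'; // FREE, good default
}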
AI Capabilities Matrix
| Feature | Model | Cost | Latency | Tier |
|---|---|---|---|---|
| Voice Commands | Whisper (CF Workers AI) | $0.0005/min | 500ms | Edge |
| Intent Classification | Llama 4 Scout | FREE | 100ms | Edge |
| Complex Queries | Gemini 3 Flash | $0.075-0.30/1M | 200-2000ms | Premium |
| TTS Response | Chatterbox/MeloTTS | FREE | 75-150ms | Free |
| Stem Separation | HTDemucs | FREE (self-host) | 10-60s | Background |
| Beat Generation | Stable Audio Open | $0.05/gen | 30-60s | Background |
| Sample Search | BGE Embeddings | $0.02/1M | 50ms | Edge |
| Audio Analysis | Gemini 3 (audio) | $0.15/min | 1-5s | Premium |
Gemini 3 Flash Integration
Gemini 3 Flash (December 2025) is the primary premium model for complex tasks.
Key Capabilities
- Native Audio Input: Analyze beats, detect patterns without transcription
- 1M Token Context: Entire project history in one conversation
- Thinking Levels: MINIMAL → LOW → MEDIUM → HIGH (cost vs quality)
- Agentic Mode: Multi-step tasks with tool calling
When to Use Gemini
- ✅ "Analyze this beat and suggest improvements"
- ✅ "Help me write a hook for this track"
- ✅ Complex multi-turn conversations
- ✅ Audio understanding tasks
- ❌ Simple transport commands (use Llama instead)
- ❌ Basic text classification
Pricing
| Thinking Level | Input | Output | Use Case |
|---|---|---|---|
| None (fast) | $0.075/1M | $0.30/1M | Quick answers |
| LOW | $0.15/1M | $0.60/1M | Standard queries |
| MEDIUM | $0.50/1M | $2.00/1M | Complex analysis |
| HIGH | $3.50/1M | $14.00/1M | Deep reasoning |
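The pricing table can be turned into a per-request estimate. The sketch below hard-codes the rates from the table; `ThinkingLevel`, `RATES`, and `estimateCostUsd` are illustrative names, not part of any Gemini SDK.

```typescript
// Rates copied from the pricing table above, in USD per 1M tokens.
type ThinkingLevel = 'none' | 'low' | 'medium' | 'high';

const RATES: Record<ThinkingLevel, { input: number; output: number }> = {
  none: { input: 0.075, output: 0.3 },
  low: { input: 0.15, output: 0.6 },
  medium: { input: 0.5, output: 2.0 },
  high: { input: 3.5, output: 14.0 },
};

// Hypothetical helper: estimated USD cost of a single request.
function estimateCostUsd(
  level: ThinkingLevel,
  inputTokens: number,
  outputTokens: number,
): number {
  const rate = RATES[level];
  return (inputTokens * rate.input + outputTokens * rate.output) / 1_000_000;
}
```

For example, a LOW request with 2,000 input tokens and 500 output tokens comes to about $0.0006, so token volume rather than single requests drives the bill.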
Gemini 2.5 Flash Native Audio
Used for voice output (avatar responses, assistant speech).
Features
- 30 HD voices in 24 languages
- Real-time streaming output
- Emotional expression
- Direct audio response (no TTS needed)
Cost Comparison
| Option | Cost/min | Quality | Latency |
|---|---|---|---|
| Gemini Native Audio | $0.10 | Excellent | 200-500ms |
| MeloTTS (Workers AI) | $0.0002 | Good | 100-200ms |
| Chatterbox (self-host) | FREE | Excellent | 75-150ms |
| ElevenLabs | $0.30 | Excellent | 75-250ms |
Recommendation: Use Chatterbox/MeloTTS for 95% of TTS needs. Reserve Gemini Native Audio for special avatar personalities.
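To see what the 95/5 split means in dollars, here is a small sketch using the per-minute rates from the table. It assumes MeloTTS handles the non-Gemini share (self-hosted Chatterbox would make that share free apart from hosting); `blendedTtsCost` is a hypothetical helper.

```typescript
// Per-minute rates from the cost comparison table (USD).
const MELO_TTS_PER_MIN = 0.0002;
const GEMINI_AUDIO_PER_MIN = 0.1;

// Hypothetical helper: TTS cost when `geminiShare` of the minutes use
// Gemini Native Audio and the rest use MeloTTS.
function blendedTtsCost(totalMinutes: number, geminiShare: number): number {
  if (geminiShare < 0 || geminiShare > 1) {
    throw new RangeError('geminiShare must be between 0 and 1');
  }
  const geminiMinutes = totalMinutes * geminiShare;
  const meloMinutes = totalMinutes - geminiMinutes;
  return geminiMinutes * GEMINI_AUDIO_PER_MIN + meloMinutes * MELO_TTS_PER_MIN;
}
```

With 10,000 minutes per month and a 5% Gemini share, this gives roughly $51.90, compared with $1,000 if every minute went through Gemini Native Audio.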
Cloudflare Workers AI Models
Available Models
| Model | Use Case | Free Tier |
|---|---|---|
| @cf/openai/whisper | Speech-to-text | 10K neurons/day |
| @cf/meta/llama-4-scout | Intent classification, simple queries | 10K neurons/day |
| @cf/baai/bge-base-en-v1.5 | Text embeddings for search | 10K neurons/day |
| @cf/stabilityai/stable-diffusion-xl | Album art generation | 10K neurons/day |
Example: Whisper Integration
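// workers/api/transcribe.ts
// Assumes an `AI` binding in the Worker config; `Ai` comes from @cloudflare/workers-types.
interface Env {
  AI: Ai;
}

export async function transcribe(audio: ArrayBuffer, env: Env) {
  // Workers AI expects the audio as a plain array of byte values.
  const result = await env.AI.run('@cf/openai/whisper', {
    audio: [...new Uint8Array(audio)]
  });
  return {
    text: result.text,
    // The base Whisper model returns word-level timestamps rather than
    // a detected language or segments; verify against the current model docs.
    wordCount: result.word_count,
    words: result.words
  };
}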
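Sample search over the BGE embeddings listed in the model table reduces to ranking stored vectors by similarity to a query vector. A minimal in-memory sketch, assuming the vectors have already been computed; `EmbeddedSample` and `topMatches` are hypothetical names, and a production system would more likely use a vector index than a linear scan.

```typescript
interface EmbeddedSample {
  id: string;
  vector: number[];
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vector length mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / Math.sqrt(normA * normB);
}

// Return the k samples most similar to the query vector, best first.
function topMatches(
  query: number[],
  samples: EmbeddedSample[],
  k: number,
): EmbeddedSample[] {
  return samples
    .map((sample) => ({ sample, score: cosineSimilarity(query, sample.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.sample);
}
```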
Voice Command Pipeline
┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Browser    │    │   Workers    │    │  Workers AI  │    │     DAW      │
│  Microphone  │───▶│     API      │───▶│   Whisper    │───▶│    Action    │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
                              │                                       ▲
                              │         ┌──────────────┐              │
                              └────────▶│   Llama 4    │──────────────┘
                                        │    Scout     │
                                        │   (intent)   │
                                        └──────────────┘
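The pipeline above can be sketched as a single handler with each stage injected, which keeps the flow testable without network calls. The interface and function names are assumptions for illustration; the real stages would wrap Whisper, Llama 4 Scout, and the DAW command bus.

```typescript
// Each stage is injected so the flow can run against stubs in tests.
interface PipelineDeps {
  transcribe: (audio: ArrayBuffer) => Promise<string>; // e.g. Whisper on Workers AI
  classifyIntent: (text: string) => Promise<string>;   // e.g. Llama 4 Scout
  dispatch: (intent: string) => void;                  // applies the action in the DAW
}

// Hypothetical handler: audio in, intent out, DAW action as a side effect.
async function handleVoiceCommand(
  audio: ArrayBuffer,
  deps: PipelineDeps,
): Promise<string> {
  const text = (await deps.transcribe(audio)).trim();
  if (text.length === 0) return 'ignored'; // silence or empty transcription
  const intent = await deps.classifyIntent(text);
  deps.dispatch(intent);
  return intent;
}
```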
Supported Commands (MVP)
| Category | Examples |
|---|---|
| Transport | "play", "stop", "pause", "loop section" |
| Mixer | "mute track 1", "solo drums", "turn up the bass" |
| Tempo | "set BPM 90", "faster", "slower" |
| Samples | "find a snare", "add kick to track 2" |
| Project | "save project", "export", "new track" |
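Some of the commands in the table are regular enough to match without a model call. The sketch below is a deterministic fast path that could run before the Llama 4 Scout classifier; it is not the classifier itself, and the `DawCommand` shape is an assumption made for illustration.

```typescript
const TRANSPORT_ACTIONS = ['play', 'stop', 'pause'] as const;
type TransportAction = (typeof TRANSPORT_ACTIONS)[number];

// Hypothetical command shape; the real DAW action schema may differ.
type DawCommand =
  | { kind: 'transport'; action: TransportAction }
  | { kind: 'tempo'; bpm: number }
  | { kind: 'mixer'; action: 'mute' | 'solo'; track: number };

// Returns a command for exact, well-known phrasings, or null so the
// caller can fall back to the intent model.
function parseSimpleCommand(utterance: string): DawCommand | null {
  const text = utterance.trim().toLowerCase();

  const transport = TRANSPORT_ACTIONS.find((action) => action === text);
  if (transport) return { kind: 'transport', action: transport };

  const tempo = /^set bpm (\d{2,3})$/.exec(text);
  if (tempo) return { kind: 'tempo', bpm: Number(tempo[1]) };

  const mixer = /^(mute|solo) track (\d+)$/.exec(text);
  if (mixer) {
    return {
      kind: 'mixer',
      action: mixer[1] as 'mute' | 'solo',
      track: Number(mixer[2]),
    };
  }
  return null;
}
```

Looser phrasings such as "turn up the bass" return `null` here and would go to the model.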
AI Assistant Design
System Prompt Template
You are FlowState AI, a hip-hop production assistant.
Current project context:
- Tempo: ${project.tempo} BPM
- Key: ${project.key || 'not set'}
- Tracks: ${project.tracks.length}
- Current selection: ${selection}
You can help with:
- Production tips and techniques
- Sample recommendations
- Mixing advice
- Workflow optimization
Keep responses concise and actionable.
If the user asks you to do something in the DAW,
respond with a JSON action block.
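The template can be filled in by a small builder function. A minimal sketch, assuming a `ProjectContext` shape that matches the placeholders in the template; the interface and function name are hypothetical.

```typescript
// Assumed shape of the project state referenced by the template.
interface ProjectContext {
  tempo: number;
  key?: string;
  tracks: unknown[];
}

// Builds the system prompt from the template above.
function buildSystemPrompt(project: ProjectContext, selection: string): string {
  return [
    'You are FlowState AI, a hip-hop production assistant.',
    'Current project context:',
    `- Tempo: ${project.tempo} BPM`,
    `- Key: ${project.key || 'not set'}`,
    `- Tracks: ${project.tracks.length}`,
    `- Current selection: ${selection}`,
    'You can help with:',
    '- Production tips and techniques',
    '- Sample recommendations',
    '- Mixing advice',
    '- Workflow optimization',
    'Keep responses concise and actionable.',
    'If the user asks you to do something in the DAW,',
    'respond with a JSON action block.',
  ].join('\n');
}
```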
Conversation Flow
- User types or speaks query
- Include current project context in system prompt
- Route to appropriate model (Llama for simple, Gemini for complex)
- Stream response to chat panel
- If action needed, execute DAW command
- Cache response in AI Gateway
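The "if action needed" step depends on finding the JSON action block in the model's reply. The sketch below assumes the block is a single JSON object with a string `action` field embedded somewhere in the text; that format is an assumption, since the prompt only asks for "a JSON action block".

```typescript
// Assumed action shape: an `action` name plus arbitrary parameters.
interface DawAction {
  action: string;
  [param: string]: unknown;
}

// Returns the embedded action, or null when the reply is plain chat text.
function extractAction(response: string): DawAction | null {
  const start = response.indexOf('{');
  const end = response.lastIndexOf('}');
  if (start === -1 || end <= start) return null;
  try {
    const parsed: unknown = JSON.parse(response.slice(start, end + 1));
    if (
      typeof parsed === 'object' &&
      parsed !== null &&
      typeof (parsed as DawAction).action === 'string'
    ) {
      return parsed as DawAction;
    }
  } catch {
    // Not valid JSON: treat the reply as plain chat text.
  }
  return null;
}
```

Validating the parsed object before executing it keeps malformed or unexpected model output from reaching the DAW.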
Cost Optimization Strategies
1. AI Gateway Caching
Caching common queries can cut API calls by an estimated 40-70%, depending on how often identical requests repeat.
// AI Gateway serves a cached response based on:
// - Exact request match (same model and prompt)
// - TTL settings
// - Cache rules
// Configure in Cloudflare dashboard:
// AI Gateway > flowstate > Caching > Enable
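Caching can also be requested per call. The sketch below builds, but does not send, a Workers AI request routed through AI Gateway. It assumes the gateway URL scheme `gateway.ai.cloudflare.com/v1/{account}/{gateway}/workers-ai/{model}` and the `cf-aig-cache-ttl` header as described in Cloudflare's AI Gateway documentation; confirm both against the current docs. `buildGatewayRequest` and its option names are hypothetical.

```typescript
interface GatewayRequestOptions {
  accountId: string;
  gatewayId: string;       // e.g. 'flowstate'
  model: string;           // e.g. '@cf/meta/llama-4-scout'
  apiToken: string;
  prompt: string;
  cacheTtlSeconds: number; // how long the gateway may reuse the response
}

interface BuiltRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Hypothetical helper: returns the URL and fetch options for the call.
function buildGatewayRequest(opts: GatewayRequestOptions): BuiltRequest {
  const url =
    `https://gateway.ai.cloudflare.com/v1/${opts.accountId}/` +
    `${opts.gatewayId}/workers-ai/${opts.model}`;
  return {
    url,
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${opts.apiToken}`,
        'Content-Type': 'application/json',
        'cf-aig-cache-ttl': String(opts.cacheTtlSeconds),
      },
      body: JSON.stringify({ prompt: opts.prompt }),
    },
  };
}
```

The result can be passed to `fetch(built.url, built.init)`.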
2. Smart Model Routing
Route roughly 80% of queries to Llama 4 Scout (FREE) and the remaining 20% to Gemini.
3. Context Caching
Gemini offers a 75% discount on cached input tokens such as repeated system prompts, so cache the project context.
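The effect of the discount on input cost can be estimated with simple arithmetic. A sketch assuming the 75% figure above applies uniformly to every cached input token; `effectiveInputCostUsd` is a hypothetical helper.

```typescript
// Input cost in USD when part of the prompt is served from the context cache.
function effectiveInputCostUsd(
  totalTokens: number,
  cachedTokens: number,
  ratePerMillion: number,
  cacheDiscount = 0.75, // discount stated above; treat as an assumption
): number {
  if (cachedTokens < 0 || cachedTokens > totalTokens) {
    throw new RangeError('cachedTokens must be between 0 and totalTokens');
  }
  const freshTokens = totalTokens - cachedTokens;
  const billableTokens = freshTokens + cachedTokens * (1 - cacheDiscount);
  return (billableTokens * ratePerMillion) / 1_000_000;
}
```

For a 10,000-token prompt with 8,000 cached tokens at the LOW input rate of $0.15/1M, the input cost drops from about $0.0015 to about $0.0006.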
4. Batch Embeddings
Embed samples in batches during off-peak hours.
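A minimal batching sketch: the embedding call is passed in as a function, so the same loop works with Workers AI or any other backend. `chunk` and `embedInBatches` are illustrative names, and the batch size is left to the caller because the limit depends on the model.

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed texts batch by batch, preserving input order in the result.
async function embedInBatches(
  texts: string[],
  batchSize: number,
  embedBatch: (batch: string[]) => Promise<number[][]>,
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of chunk(texts, batchSize)) {
    vectors.push(...(await embedBatch(batch)));
  }
  return vectors;
}
```

Batches run sequentially here, which keeps request rates low for off-peak jobs.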
Monthly Cost Estimate (10K users)
| Service | Baseline | Optimized |
|---|---|---|
| Workers AI | $150 | $100 |
| Gemini API | $400 | $200 |
| TTS | $100 | $0 (Chatterbox) |
| Total AI | $650 | $300 |
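The totals in the table can be checked, and the savings expressed per user, with a few lines of arithmetic. The figures are copied from the table; the variable names are illustrative.

```typescript
// Monthly figures from the table above (USD, 10K users).
const baselineCosts: Record<string, number> = { workersAi: 150, gemini: 400, tts: 100 };
const optimizedCosts: Record<string, number> = { workersAi: 100, gemini: 200, tts: 0 };
const USERS = 10_000;

const sumCosts = (costs: Record<string, number>): number =>
  Object.values(costs).reduce((total, cost) => total + cost, 0);

const baselineTotal = sumCosts(baselineCosts);   // 650
const optimizedTotal = sumCosts(optimizedCosts); // 300
const savingsPercent = (1 - optimizedTotal / baselineTotal) * 100;
const optimizedPerUser = optimizedTotal / USERS;
```

The optimized plan saves about 54% and works out to roughly $0.03 per user per month.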