AI Integration - FlowState DAW

🎯 AI Philosophy

AI in FlowState should be optional, contextual, and friction-reducing. The goal is not to replace the producer's creativity but to remove barriers and accelerate workflows.

💡 Key Principle: Use the cheapest model that works. Reserve premium Gemini for tasks that truly need it.

🔀 Smart Inference Router

The router decides which model to use based on task complexity, latency requirements, and cost.

function routeInference(request: AIRequest): Provider {
  // TIER 1: Cloudflare Edge (realtime, simple)
  if (request.latency === 'realtime' && request.complexity === 'simple') {
    return 'cloudflare-workers-ai'; // FREE tier
  }

  // TIER 2: Free Models (background, creative)
  if (request.latency === 'background') {
    if (request.type === 'stem_separation') return 'htdemucs';
    if (request.type === 'music_gen') return 'stable-audio-open';
  }

  // TIER 3: Gemini (complex, audio understanding)
  if (request.audio_input || request.complexity === 'complex') {
    return 'gemini-3-flash';
  }

  return 'cloudflare-llama-4-scout'; // FREE, good default
}
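To see how the rules resolve, the sketch below runs the same router against a few sample requests. The `AIRequest` and `Provider` shapes are assumptions inferred from the fields the router reads, and the request `type` values other than `stem_separation` and `music_gen` are made up for illustration.

```typescript
// Assumed shapes, inferred from the fields the router reads.
type Provider =
  | 'cloudflare-workers-ai'
  | 'htdemucs'
  | 'stable-audio-open'
  | 'gemini-3-flash'
  | 'cloudflare-llama-4-scout';

interface AIRequest {
  type: string;
  latency: 'realtime' | 'background';
  complexity: 'simple' | 'complex';
  audio_input?: boolean;
}

// Same rules as the router above, repeated so this sketch runs on its own.
function routeInference(request: AIRequest): Provider {
  if (request.latency === 'realtime' && request.complexity === 'simple') {
    return 'cloudflare-workers-ai';
  }
  if (request.latency === 'background') {
    if (request.type === 'stem_separation') return 'htdemucs';
    if (request.type === 'music_gen') return 'stable-audio-open';
  }
  if (request.audio_input || request.complexity === 'complex') {
    return 'gemini-3-flash';
  }
  return 'cloudflare-llama-4-scout';
}

const examples: AIRequest[] = [
  { type: 'transport', latency: 'realtime', complexity: 'simple' },
  { type: 'stem_separation', latency: 'background', complexity: 'simple' },
  { type: 'beat_analysis', latency: 'realtime', complexity: 'complex', audio_input: true },
  { type: 'chat', latency: 'background', complexity: 'simple' },
];

for (const request of examples) {
  console.log(`${request.type} -> ${routeInference(request)}`);
}
// transport -> cloudflare-workers-ai
// stem_separation -> htdemucs
// beat_analysis -> gemini-3-flash
// chat -> cloudflare-llama-4-scout
```

Note that rule order matters: a background stem-separation job carries audio, but it is claimed by Tier 2 before the Tier 3 audio check runs.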

📊 AI Capabilities Matrix (v0.1.99 - Implemented)

Feature	Model	Cost	Latency	Status
🆓 Pattern AI	Magenta.js (browser)	FREE	1-3s	✅ Live
🎵 Music Generation	ACE-Step (with lyrics)	~$0.05/gen	15-25s	✅ Live
💥 SFX Generation	Stable Audio Open	~$0.02/gen	5-15s	✅ Live
🎤 Voice Clone	OpenVoice v2	~$0.05/gen	5-10s	✅ Live
🎙️ Text-to-Speech	Chatterbox TTS	~$0.03/gen	2-5s	✅ Live
🔀 Stem Separation	Meta Demucs	~$0.02/gen	10-30s	✅ Live
🔬 Audio Analysis	Essentia.js (browser)	FREE	100-500ms	✅ Live
🗣️ Voice Commands	Whisper (CF Workers AI)	$0.0005/min	500ms	✅ Live
🔍 Sample Search	CLAP Embeddings	FREE (D1/Vectorize)	50ms	✅ Live
💬 AI Chat	Gemini 3 Flash	$0.075-0.30/1M tokens	200-2000ms	✅ Live
🎹 AI Mastering	Custom pipeline	~$0.10/gen	30-60s	✅ Live

🚀 AI Studio Architecture (v0.1.99)

The AI Studio uses a tiered architecture: FREE features run entirely in the browser, while PRO features call cloud APIs.

FREE Tier (Magenta.js + Essentia.js)

Pattern AI: Generate drums & melodies using MusicVAE, 100% client-side
Audio Analysis: BPM, key detection, onset analysis using Essentia.js
Morph Slider: Interpolate between patterns using VAE latent space
Cost: $0/month - runs in user's browser
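The Morph Slider idea, blending two patterns in latent space, can be illustrated with plain linear interpolation. This is a generic sketch and not FlowState's or Magenta.js's actual code: `lerpLatent` is a hypothetical helper, and it assumes both patterns have already been encoded into latent vectors of equal length. Decoding the blended vector back into notes would still go through the VAE.

```typescript
// Hypothetical helper: blend two latent vectors.
// t = 0 returns a copy of a, t = 1 returns a copy of b.
function lerpLatent(a: number[], b: number[], t: number): number[] {
  if (a.length !== b.length) {
    throw new Error('Latent vectors must have the same length');
  }
  const clamped = Math.min(1, Math.max(0, t)); // keep the slider value in [0, 1]
  return a.map((value, i) => value + (b[i] - value) * clamped);
}

// Slider at the midpoint between two toy 4-dimensional latents.
const drumsA = [0, 1, 2, 3];
const drumsB = [4, 3, 2, 1];
console.log(lerpLatent(drumsA, drumsB, 0.5)); // [ 2, 2, 2, 2 ]
```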

PRO Tier (Cloud APIs)

ACE-Step: Full song generation with lyrics support (up to 4 min)
Stable Audio Open: SFX generation (impacts, ambience, foley, up to 47s)
OpenVoice v2: Voice cloning from 5-10s samples, 6 languages
Chatterbox TTS: Expressive text-to-speech for scratch vocals
Meta Demucs: Stem separation into vocals/drums/bass/other
Cost: ~$0.02-0.10 per generation
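One way to express the FREE/PRO split in code is a small feature registry that records where each feature runs and which plan unlocks it. This is a sketch under assumptions: the feature ids, the `StudioFeature` shape, and `canUse` are hypothetical names, not FlowState's actual API.

```typescript
type Tier = 'free' | 'pro';
type Runtime = 'browser' | 'cloud';

interface StudioFeature {
  label: string;
  tier: Tier;
  runtime: Runtime;
}

// Hypothetical registry mirroring the two tiers listed above.
const FEATURES: Record<string, StudioFeature> = {
  patternAi: { label: 'Pattern AI', tier: 'free', runtime: 'browser' },
  audioAnalysis: { label: 'Audio Analysis', tier: 'free', runtime: 'browser' },
  morphSlider: { label: 'Morph Slider', tier: 'free', runtime: 'browser' },
  songGeneration: { label: 'ACE-Step', tier: 'pro', runtime: 'cloud' },
  sfxGeneration: { label: 'Stable Audio Open', tier: 'pro', runtime: 'cloud' },
  voiceClone: { label: 'OpenVoice v2', tier: 'pro', runtime: 'cloud' },
  textToSpeech: { label: 'Chatterbox TTS', tier: 'pro', runtime: 'cloud' },
  stemSeparation: { label: 'Meta Demucs', tier: 'pro', runtime: 'cloud' },
};

// FREE features are open to everyone; PRO features need a pro plan.
function canUse(featureId: string, userTier: Tier): boolean {
  const feature = FEATURES[featureId];
  if (!feature) return false; // unknown feature ids are rejected
  return feature.tier === 'free' || userTier === 'pro';
}

console.log(canUse('patternAi', 'free'));      // true
console.log(canUse('stemSeparation', 'free')); // false
console.log(canUse('stemSeparation', 'pro'));  // true
```

Keeping `runtime` next to `tier` lets the UI decide up front whether a feature needs a network call at all.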

⚡ Gemini 3 Flash Integration

Gemini 3 Flash (December 2025) is the core premium AI for complex tasks.

Key Capabilities

Native Audio Input: Analyze beats, detect patterns without transcription
1M Token Context: Entire project history in one conversation
Thinking Levels: MINIMAL → LOW → MEDIUM → HIGH (cost vs quality)
Agentic Mode: Multi-step tasks with tool calling

When to Use Gemini

✅ "Analyze this beat and suggest improvements"
✅ "Help me write a hook for this track"
✅ Complex multi-turn conversations
✅ Audio understanding tasks
❌ Simple transport commands (use Llama instead)
❌ Basic text classification

Pricing

Thinking Level	Input (per 1M tokens)	Output (per 1M tokens)	Use Case
None (fast)	$0.075/1M	$0.30/1M	Quick answers
LOW	$0.15/1M	$0.60/1M	Standard queries
MEDIUM	$0.50/1M	$2.00/1M	Complex analysis
HIGH	$3.50/1M	$14.00/1M	Deep reasoning
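As a worked example of the pricing table, the sketch below estimates the cost of a single request from its token counts. The rates are copied from the table; `estimateCostUsd` and the `NONE` key (standing in for the "None (fast)" row) are illustrative names, and the token counts are made up.

```typescript
// USD per 1M tokens, copied from the pricing table above.
const GEMINI_PRICING = {
  NONE: { input: 0.075, output: 0.3 },
  LOW: { input: 0.15, output: 0.6 },
  MEDIUM: { input: 0.5, output: 2.0 },
  HIGH: { input: 3.5, output: 14.0 },
} as const;

type ThinkingLevel = keyof typeof GEMINI_PRICING;

function estimateCostUsd(
  level: ThinkingLevel,
  inputTokens: number,
  outputTokens: number,
): number {
  const price = GEMINI_PRICING[level];
  return (inputTokens * price.input + outputTokens * price.output) / 1000000;
}

// 2,000 input tokens and 500 output tokens at MEDIUM:
// (2000 * 0.50 + 500 * 2.00) / 1,000,000 = $0.002
console.log(estimateCostUsd('MEDIUM', 2000, 500).toFixed(4)); // "0.0020"

// The same request at HIGH:
// (2000 * 3.50 + 500 * 14.00) / 1,000,000 = $0.014
console.log(estimateCostUsd('HIGH', 2000, 500).toFixed(4)); // "0.0140"
```

At these rates the same request costs seven times more at HIGH than at MEDIUM, which is why the thinking level is worth choosing per task.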

🎙️ Gemini 2.5 Flash Native Audio

For voice OUTPUT (avatar responses, assistant speech).

Features

30 HD voices in 24 languages
Real-time streaming output
Emotional expression
Direct audio response (no TTS needed)

Cost Comparison

Option	Cost/min	Quality	Latency
Gemini Native Audio	$0.10	Excellent	200-500ms
MeloTTS (Workers AI)	$0.0002	Good	100-200ms
Chatterbox (self-host)	FREE	Excellent	75-150ms
ElevenLabs	$0.30	Excellent	75-250ms

💡 Recommendation: Use Chatterbox/MeloTTS for 95% of TTS needs. Reserve Gemini Native Audio for special avatar personalities.
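The recommendation can be written down as a small selection rule plus a cost check. This is a sketch under assumptions: the option ids, `pickTts`, and `ttsCostUsd` are hypothetical, the per-minute rates come from the comparison table, and the 1,000-minute volume is invented. Chatterbox is entered at $0 as in the table, which leaves out whatever the self-hosting itself costs.

```typescript
interface TtsOption {
  id: string;
  costPerMinUsd: number;
}

// Per-minute rates copied from the cost comparison table.
const TTS_OPTIONS: TtsOption[] = [
  { id: 'gemini-native-audio', costPerMinUsd: 0.1 },
  { id: 'melotts', costPerMinUsd: 0.0002 },
  { id: 'chatterbox', costPerMinUsd: 0 },
  { id: 'elevenlabs', costPerMinUsd: 0.3 },
];

// Gemini only for avatar personalities; otherwise the cheapest available option.
function pickTts(needsAvatarPersonality: boolean, selfHostAvailable: boolean): string {
  if (needsAvatarPersonality) return 'gemini-native-audio';
  return selfHostAvailable ? 'chatterbox' : 'melotts';
}

function ttsCostUsd(id: string, minutes: number): number {
  const option = TTS_OPTIONS.find((o) => o.id === id);
  if (!option) throw new Error(`Unknown TTS option: ${id}`);
  return option.costPerMinUsd * minutes;
}

// Cost of 1,000 minutes of speech per option.
for (const option of TTS_OPTIONS) {
  console.log(`${option.id}: $${ttsCostUsd(option.id, 1000).toFixed(2)}`);
}
// gemini-native-audio: $100.00
// melotts: $0.20
// chatterbox: $0.00
// elevenlabs: $300.00
```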

☁️ Cloudflare Workers AI Models

Available Models

Model	Use Case	Free Tier
`@cf/openai/whisper`	Speech-to-text	10K neurons/day
`@cf/meta/llama-4-scout`	Intent classification, simple queries	10K neurons/day
`@cf/baai/bge-base-en-v1.5`	Text embeddings for search	10K neurons/day
`@cf/stabilityai/stable-diffusion-xl`	Album art generation	10K neurons/day

Example: Whisper Integration

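// workers/api/transcribe.ts
export async function transcribe(audio: ArrayBuffer, env: Env) {
  // Workers AI expects the audio as a plain array of byte values.
  const result = await env.AI.run('@cf/openai/whisper', {
    audio: [...new Uint8Array(audio)]
  });

  // '@cf/openai/whisper' responds with text, word_count, words, and vtt.
  // It does not report a detected language or segments.
  return {
    text: result.text,
    wordCount: result.word_count,
    words: result.words // per-word { word, start, end } timings
  };
}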

🎤 Voice Command Pipeline

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Browser    │    │   Workers    │    │  Workers AI  │    │     DAW      │
│  Microphone  │───▶│     API      │───▶│   Whisper    │───▶│    Action    │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
                           │                                       ▲
                           │         ┌──────────────┐              │
                           └────────▶│   Llama 4    │──────────────┘
                                     │    Scout     │
                                     │   (intent)   │
                                     └──────────────┘

Supported Commands (MVP)

Category	Examples
Transport	"play", "stop", "pause", "loop section"
Mixer	"mute track 1", "solo drums", "turn up the bass"
Tempo	"set BPM 90", "faster", "slower"
Samples	"find a snare", "add kick to track 2"
Project	"save project", "export", "new track"
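To make the end of the pipeline concrete, here is a sketch of turning a transcript into a structured DAW action for a few of the commands above. In the pipeline this step is handled by Llama 4 Scout; the regex matching below is a swapped-in stand-in to show what the output could look like, and the `DawAction` shape and `parseCommand` are hypothetical.

```typescript
const TRANSPORT = ['play', 'stop', 'pause'] as const;
type TransportCommand = (typeof TRANSPORT)[number];

// Hypothetical action shape; the real DAW command format may differ.
type DawAction =
  | { kind: 'transport'; command: TransportCommand }
  | { kind: 'mute'; track: number }
  | { kind: 'setTempo'; bpm: number }
  | { kind: 'unknown'; text: string };

function parseCommand(transcript: string): DawAction {
  const text = transcript.trim().toLowerCase();

  // Transport: exact single-word commands.
  const transport = TRANSPORT.find((c) => c === text);
  if (transport) return { kind: 'transport', command: transport };

  // Mixer: "mute track 1"
  const mute = text.match(/^mute track (\d+)$/);
  if (mute) return { kind: 'mute', track: Number(mute[1]) };

  // Tempo: "set BPM 90"
  const tempo = text.match(/^set bpm (\d+)$/);
  if (tempo) return { kind: 'setTempo', bpm: Number(tempo[1]) };

  // Anything else would be handed to the intent model.
  return { kind: 'unknown', text };
}

console.log(parseCommand('Play'));          // { kind: 'transport', command: 'play' }
console.log(parseCommand('mute track 1'));  // { kind: 'mute', track: 1 }
console.log(parseCommand('set BPM 90'));    // { kind: 'setTempo', bpm: 90 }
console.log(parseCommand('make it funky')); // { kind: 'unknown', text: 'make it funky' }
```

Fixed patterns like these cannot cover phrasing such as "turn up the bass" or "faster", which is where a language model for intent classification earns its place.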

💬 AI Assistant Design

System Prompt Template

You are FlowState AI, a hip-hop production assistant.

Current project context:
- Tempo: ${project.tempo} BPM
- Key: ${project.key || 'not set'}
- Tracks: ${project.tracks.length}
- Current selection: ${selection}

You can help with:
- Production tips and techniques
- Sample recommendations
- Mixing advice
- Workflow optimization

Keep responses concise and actionable.
If the user asks you to do something in the DAW,
respond with a JSON action block.
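The template's placeholders can be filled with an ordinary template-literal function. This is a minimal sketch: `buildSystemPrompt` and the `ProjectContext` shape are assumptions inferred from the fields the template references (`tempo`, `key`, `tracks`, `selection`).

```typescript
// Assumed shape, inferred from the placeholders in the template.
interface ProjectContext {
  tempo: number;
  key?: string;
  tracks: unknown[];
}

function buildSystemPrompt(project: ProjectContext, selection: string): string {
  return [
    'You are FlowState AI, a hip-hop production assistant.',
    '',
    'Current project context:',
    `- Tempo: ${project.tempo} BPM`,
    `- Key: ${project.key || 'not set'}`,
    `- Tracks: ${project.tracks.length}`,
    `- Current selection: ${selection}`,
    '',
    'You can help with:',
    '- Production tips and techniques',
    '- Sample recommendations',
    '- Mixing advice',
    '- Workflow optimization',
    '',
    'Keep responses concise and actionable.',
    'If the user asks you to do something in the DAW,',
    'respond with a JSON action block.',
  ].join('\n');
}

// A project with no key set and two tracks.
const prompt = buildSystemPrompt({ tempo: 90, tracks: [{}, {}] }, 'bars 1-4 of track 2');
console.log(prompt);
```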

Conversation Flow

User types or speaks query
Include current project context in system prompt
Route to appropriate model (Llama for simple, Gemini for complex)
Stream response to chat panel
If action needed, execute DAW command
Cache response in AI Gateway
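The "if action needed, execute DAW command" step depends on pulling a machine-readable action out of the model's reply. The source does not define the JSON action block format, so the sketch below assumes a hypothetical `{ "action": string, "params": object }` shape embedded somewhere in the response text. `extractAction` is an illustrative helper, not part of FlowState.

```typescript
// Hypothetical action format; the real schema is not specified.
interface DawCommand {
  action: string;
  params?: Record<string, unknown>;
}

// Returns the embedded action, or null when the reply is plain chat text.
function extractAction(response: string): DawCommand | null {
  const match = response.match(/\{[\s\S]*\}/); // first "{" through last "}"
  if (!match) return null;
  try {
    const parsed: unknown = JSON.parse(match[0]);
    if (
      typeof parsed === 'object' &&
      parsed !== null &&
      typeof (parsed as DawCommand).action === 'string'
    ) {
      return parsed as DawCommand;
    }
  } catch {
    // Not valid JSON: fall through and treat the reply as plain text.
  }
  return null;
}

console.log(extractAction('Muting it now. {"action":"mute_track","params":{"track":1}}'));
// { action: 'mute_track', params: { track: 1 } }
console.log(extractAction('Try sidechaining the bass to the kick.'));
// null
```

Validating the parsed object before executing anything keeps a malformed or unexpected reply from triggering a DAW command.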

💰 Cost Optimization Strategies

1. AI Gateway Caching

Cache common queries to reduce API calls by 40-70%.

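// AI Gateway serves a cached response when an identical request is repeated.
// Caching behavior is controlled by:
// - TTL settings
// - Cache rules (per-request headers such as cf-aig-cache-ttl and cf-aig-skip-cache)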

// Configure in Cloudflare dashboard:
// AI Gateway > flowstate > Caching > Enable
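Besides the dashboard toggle, cache lifetime can be set per request. The sketch below only builds the URL and headers for a Workers AI call routed through a gateway named `flowstate` (the name used in the dashboard path above). `buildGatewayRequest`, the account id, and the token are placeholders; the URL layout and the `cf-aig-cache-ttl` header follow Cloudflare's AI Gateway conventions as I understand them, so verify against the current docs before relying on them.

```typescript
interface GatewayRequest {
  url: string;
  headers: Record<string, string>;
}

// Hypothetical helper: builds request details without sending anything.
function buildGatewayRequest(
  accountId: string,
  model: string,
  apiToken: string,
  cacheTtlSeconds: number,
): GatewayRequest {
  return {
    url: `https://gateway.ai.cloudflare.com/v1/${accountId}/flowstate/workers-ai/${model}`,
    headers: {
      Authorization: `Bearer ${apiToken}`,
      'Content-Type': 'application/json',
      // Ask the gateway to keep this response cached for the given time.
      'cf-aig-cache-ttl': String(cacheTtlSeconds),
    },
  };
}

const request = buildGatewayRequest('ACCOUNT_ID', '@cf/meta/llama-4-scout', 'API_TOKEN', 3600);
console.log(request.url);
// https://gateway.ai.cloudflare.com/v1/ACCOUNT_ID/flowstate/workers-ai/@cf/meta/llama-4-scout

// Sending it would look like:
// await fetch(request.url, { method: 'POST', headers: request.headers, body: JSON.stringify({ prompt }) });
```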

2. Smart Model Routing

Use Llama 4 Scout (FREE) for 80% of queries, Gemini for 20%.

3. Context Caching

Gemini offers a 75% discount on repeated system prompts through context caching, so cache the project context that is resent with every request.

4. Batch Embeddings

Embed samples in batches during off-peak hours.
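Batching comes down to splitting the sample list into fixed-size chunks and making one model call per chunk. In this sketch `chunk`, `embedInBatches`, the `EmbedFn` signature, and the default batch size of 50 are all assumptions. In a Worker, `embed` could wrap a call to `@cf/baai/bge-base-en-v1.5`, but the embedder is injected here so the batching logic stays independent of any one API.

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error('Batch size must be at least 1');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical embedder: takes a batch of texts, returns one vector per text.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

async function embedInBatches(
  texts: string[],
  embed: EmbedFn,
  batchSize = 50,
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of chunk(texts, batchSize)) {
    vectors.push(...(await embed(batch))); // one model call per batch
  }
  return vectors;
}

console.log(chunk(['kick', 'snare', 'hat', 'clap', '808'], 2));
// [ [ 'kick', 'snare' ], [ 'hat', 'clap' ], [ '808' ] ]
```

Running batches one after another, as `embedInBatches` does, keeps the request rate predictable, which suits an off-peak background job.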

Monthly Cost Estimate (10K users)

Service	Baseline	Optimized
Workers AI	$150	$100
Gemini API	$400	$200
TTS	$100	$0 (Chatterbox)
Total AI	$650	$300
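The totals in the table can be checked, and turned into a per-user figure, with a few lines of arithmetic. The figures are copied from the table; the variable and function names are illustrative.

```typescript
// Monthly USD figures copied from the estimate table (10K users).
const monthlyCosts = [
  { service: 'Workers AI', baseline: 150, optimized: 100 },
  { service: 'Gemini API', baseline: 400, optimized: 200 },
  { service: 'TTS', baseline: 100, optimized: 0 },
];
const USER_COUNT = 10000;

function sumColumn(column: 'baseline' | 'optimized'): number {
  return monthlyCosts.reduce((sum, row) => sum + row[column], 0);
}

const baselineTotal = sumColumn('baseline');   // 650
const optimizedTotal = sumColumn('optimized'); // 300
const savingsPercent = ((baselineTotal - optimizedTotal) / baselineTotal) * 100;

console.log(`Baseline:  $${baselineTotal} ($${(baselineTotal / USER_COUNT).toFixed(3)} per user)`);
console.log(`Optimized: $${optimizedTotal} ($${(optimizedTotal / USER_COUNT).toFixed(3)} per user)`);
console.log(`Savings:   ${savingsPercent.toFixed(1)}%`);
// Baseline:  $650 ($0.065 per user)
// Optimized: $300 ($0.030 per user)
// Savings:   53.8%
```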