Avatar Battles - FlowState DAW

🎯 Concept Overview

Avatar Battle Mode lets users rap battle anonymously using virtual avatars. Your face is hidden behind a customizable 3D character that lip-syncs in real time, and your voice can optionally be transformed.

🔥 Why This Matters: Many aspiring rappers are shy about showing their face. Avatar battles lower the barrier to entry, encourage participation, and create viral-worthy content.

⚡ Technical Pipeline

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   User's    │    │   Whisper   │    │    Voice    │    │   Avatar    │
│    Voice    │───▶│    (ASR)    │───▶│  Transform  │───▶│   Render    │
│    Input    │    │    35ms     │    │    20ms     │    │    16ms     │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                                             │                  │
                                             ▼                  ▼
                                      ┌─────────────┐    ┌─────────────┐
                                      │   WebRTC    │◀───│  Lip Sync   │
                                      │   Stream    │    │ (52 blend)  │
                                      │   to Peer   │    │             │
                                      └─────────────┘    └─────────────┘

Latency Budget

Stage	Target	Technology
Audio Capture	5ms	Web Audio API
Voice Transform	20ms	RVC/WASM
Lip Sync Inference	15ms	MediaPipe/Rhubarb
Avatar Render	16ms	Three.js @ 60fps
WebRTC Transmission	50-100ms	Cloudflare Calls
Total E2E	~150ms	Acceptable for battles
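
As a quick check on the budget, the sketch below sums the per-stage targets from the table. It assumes the stages run strictly in series (a worst-case simplification, since some stages could overlap); the names `StageBudget`, `totalLatency`, and `slowStages` are illustrative and not part of the codebase.

```typescript
// Per-stage latency targets copied from the table above, in milliseconds.
interface StageBudget {
  stage: string;
  minMs: number;
  maxMs: number;
}

const LATENCY_BUDGET: StageBudget[] = [
  { stage: 'Audio Capture', minMs: 5, maxMs: 5 },
  { stage: 'Voice Transform', minMs: 20, maxMs: 20 },
  { stage: 'Lip Sync Inference', minMs: 15, maxMs: 15 },
  { stage: 'Avatar Render', minMs: 16, maxMs: 16 },
  { stage: 'WebRTC Transmission', minMs: 50, maxMs: 100 },
];

// Sum the best-case and worst-case targets, assuming serial execution.
function totalLatency(budget: StageBudget[]): { minMs: number; maxMs: number } {
  let minMs = 0;
  let maxMs = 0;
  budget.forEach((s) => {
    minMs += s.minMs;
    maxMs += s.maxMs;
  });
  return { minMs, maxMs };
}

// Return the names of stages whose measured time exceeds their target.
function slowStages(
  measuredMs: Record<string, number>,
  budget: StageBudget[] = LATENCY_BUDGET
): string[] {
  return budget
    .filter((s) => {
      const measured = measuredMs[s.stage];
      return measured !== undefined && measured > s.maxMs;
    })
    .map((s) => s.stage);
}
```

Under that serial assumption the targets add up to 106 to 156 ms, so the ~150ms figure sits near the worst case, with network transmission as the dominant term.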

🎭 Avatar Technology Options

Technology	Type	Quality	Performance	License
TalkingHead 3D	3D WebGL	Excellent	60fps	MIT
ReadyPlayerMe	3D Avatar	Great	60fps	Free tier
MediaPipe Face	Tracking	High accuracy	<10ms	Apache 2.0
MuseTalk	2D Lip Sync	Photorealistic	30fps	Research
Live2D	2D Animation	Anime style	60fps	Commercial

💡 MVP Recommendation: TalkingHead 3D + MediaPipe for real-time face tracking. Fully client-side, no server cost.

🔊 Voice Transformation

Users can optionally transform their voice for additional anonymity and creative expression.

Effect	Technology	Latency	Quality
Pitch Shift	Web Audio API	<5ms	Good
Formant Shift	WASM DSP	10ms	Better
RVC Clone	Server-side	50-100ms	Excellent
Robot/Effects	Tone.js	<5ms	Stylized

Voice Transform Code

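// voice-transform.ts
import * as Tone from 'tone';

type VoicePreset = 'deep' | 'high' | 'robot' | 'alien';

class VoiceTransformer {
  private pitchShifter: Tone.PitchShift;
  private distortion: Tone.Distortion;
  private reverb: Tone.Reverb;

  constructor() {
    this.pitchShifter = new Tone.PitchShift();
    this.distortion = new Tone.Distortion(0);
    this.reverb = new Tone.Reverb(0.5);
    // Wire the effects in series: pitch shift -> distortion -> reverb.
    this.pitchShifter.chain(this.distortion, this.reverb);
  }

  // Connect the microphone source to `input` and route `output` onward.
  get input(): Tone.PitchShift {
    return this.pitchShifter;
  }

  get output(): Tone.Reverb {
    return this.reverb;
  }

  applyPreset(preset: VoicePreset) {
    // Reset to neutral first so a previous preset's settings do not leak through.
    this.pitchShifter.pitch = 0;
    this.distortion.distortion = 0;
    this.reverb.decay = 0.5;

    switch (preset) {
      case 'deep':
        this.pitchShifter.pitch = -5; // semitones down
        break;
      case 'high':
        this.pitchShifter.pitch = 7; // semitones up
        break;
      case 'robot':
        this.distortion.distortion = 0.3;
        break;
      case 'alien':
        this.pitchShifter.pitch = 12; // one octave up
        this.reverb.decay = 3;
        break;
    }
  }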
}
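
The preset pitch values are expressed in semitones. To make those numbers more concrete, the sketch below converts a semitone offset into a frequency ratio using the equal-temperament relation ratio = 2^(n/12). The helper names `semitonesToRatio` and `shiftedFrequency` are illustrative only.

```typescript
// Semitone offsets used by the presets above.
const PRESET_SEMITONES = { deep: -5, high: 7, robot: 0, alien: 12 } as const;

type PresetName = keyof typeof PRESET_SEMITONES;

// Equal temperament: each semitone multiplies frequency by 2^(1/12).
function semitonesToRatio(semitones: number): number {
  return Math.pow(2, semitones / 12);
}

// Frequency a given input pitch would land on after the preset's shift.
function shiftedFrequency(inputHz: number, preset: PresetName): number {
  return inputHz * semitonesToRatio(PRESET_SEMITONES[preset]);
}
```

For example, the 'alien' preset (+12 semitones) doubles every frequency, while 'deep' (-5 semitones) scales frequencies by roughly 0.75.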

👄 Lip Sync Pipeline

Real-time lip sync uses an ARKit-compatible 52-blendshape system for realistic mouth movements.

Phoneme to Viseme Mapping

Phoneme	Viseme	Blendshapes
AA, AH	Open	jawOpen: 0.7
B, M, P	Closed	mouthClose: 1.0
EE, IY	Wide	mouthSmileLeft/Right: 0.6
OO, UW	Pucker	mouthPucker: 0.8
F, V	Lip-tooth	mouthFunnel: 0.5
TH	Tongue	tongueOut: 0.3
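
One way to apply the mapping above is a pair of lookup tables. This is a minimal sketch, not the project's implementation: the weights are copied from the table, the Wide viseme is split across the left and right smile blendshapes (ARKit names them per side), and `blendshapesForPhoneme` is a hypothetical helper.

```typescript
type Viseme = 'Open' | 'Closed' | 'Wide' | 'Pucker' | 'Lip-tooth' | 'Tongue';
type BlendshapeWeights = Record<string, number>;

// Blendshape weights per viseme, taken from the table above.
const VISEME_WEIGHTS: Record<Viseme, BlendshapeWeights> = {
  Open: { jawOpen: 0.7 },
  Closed: { mouthClose: 1.0 },
  Wide: { mouthSmileLeft: 0.6, mouthSmileRight: 0.6 },
  Pucker: { mouthPucker: 0.8 },
  'Lip-tooth': { mouthFunnel: 0.5 },
  Tongue: { tongueOut: 0.3 },
};

// Phoneme symbols grouped by the viseme they map to.
const PHONEME_TO_VISEME: Record<string, Viseme | undefined> = {
  AA: 'Open', AH: 'Open',
  B: 'Closed', M: 'Closed', P: 'Closed',
  EE: 'Wide', IY: 'Wide',
  OO: 'Pucker', UW: 'Pucker',
  F: 'Lip-tooth', V: 'Lip-tooth',
  TH: 'Tongue',
};

// Returns a copy of the weights for a phoneme, or an empty object
// (neutral mouth) when the phoneme is not in the table.
function blendshapesForPhoneme(phoneme: string): BlendshapeWeights {
  const viseme = PHONEME_TO_VISEME[phoneme.toUpperCase()];
  return viseme ? { ...VISEME_WEIGHTS[viseme] } : {};
}
```

Unknown phonemes fall back to a neutral mouth rather than throwing, which keeps the render loop running if the recognizer emits a symbol outside the table.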

Lip Sync Implementation

// lip-sync.ts
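import { FaceLandmarker, FilesetResolver } from '@mediapipe/tasks-vision';

class LipSyncEngine {
  private faceLandmarker: FaceLandmarker | null = null;
  private blendshapes: Map<string, number> = new Map();

  // wasmBasePath: URL or path where the @mediapipe/tasks-vision WASM files
  // are served from (deployment-specific, supplied by the caller).
  async init(wasmBasePath: string) {
    // createFromOptions needs the resolved WASM fileset as its first argument.
    const vision = await FilesetResolver.forVisionTasks(wasmBasePath);
    this.faceLandmarker = await FaceLandmarker.createFromOptions(vision, {
      baseOptions: {
        modelAssetPath: 'face_landmarker.task',
        delegate: 'GPU'
      },
      outputFaceBlendshapes: true,
      runningMode: 'VIDEO'
    });
  }

  // timestamp is in milliseconds and must increase from frame to frame.
  processFrame(video: HTMLVideoElement, timestamp: number) {
    if (!this.faceLandmarker) {
      throw new Error('LipSyncEngine.init() must be called before processFrame()');
    }
    const results = this.faceLandmarker.detectForVideo(video, timestamp);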

    if (results.faceBlendshapes?.[0]) {
      for (const shape of results.faceBlendshapes[0].categories) {
        this.blendshapes.set(shape.categoryName, shape.score);
      }
    }

    return this.blendshapes;
  }

  // Audio-only lip sync (no camera)
  processAudio(audioLevel: number, frequency: number): Map<string, number> {
    const jawOpen = Math.min(audioLevel * 2, 1);
    const mouthSmile = frequency > 2000 ? 0.3 : 0;

    return new Map([
      ['jawOpen', jawOpen],
      ['mouthSmileLeft', mouthSmile],
      ['mouthSmileRight', mouthSmile]
    ]);
  }
}
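
The audio-only path takes an `audioLevel` and a `frequency` but does not show where they come from. A minimal sketch, assuming the inputs are a block of time-domain samples and a block of per-bin spectrum magnitudes (for example, as read from a Web Audio `AnalyserNode`); the helper names `rmsLevel` and `dominantFrequency` are hypothetical.

```typescript
// Root-mean-square level of a block of samples in the range [-1, 1].
function rmsLevel(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  samples.forEach((s) => {
    sumSquares += s * s;
  });
  return Math.sqrt(sumSquares / samples.length);
}

// Frequency (Hz) of the strongest spectrum bin.
// `magnitudes` holds fftSize / 2 bins that span 0 Hz to sampleRate / 2.
function dominantFrequency(magnitudes: Float32Array, sampleRate: number): number {
  if (magnitudes.length === 0) return 0;
  let peakIndex = 0;
  let peakValue = -Infinity;
  magnitudes.forEach((m, i) => {
    if (m > peakValue) {
      peakValue = m;
      peakIndex = i;
    }
  });
  const binWidthHz = sampleRate / (2 * magnitudes.length);
  return peakIndex * binWidthHz;
}
```

The two results can be passed straight to `processAudio(level, frequency)`. Picking the single strongest bin is a crude estimate; a smoothed or centroid-based measure may give steadier mouth shapes.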

🎮 Battle Flow

Matchmaking: Join queue, get matched by skill rating
Avatar Select: Choose/customize your avatar
Beat Selection: Both players vote on instrumental
Coin Flip: Random selection for who goes first
Round 1: Player A raps (60 seconds)
Round 2: Player B responds (60 seconds)
Round 3: Player A rebuttal (30 seconds)
Round 4: Player B rebuttal (30 seconds)
Voting: Audience votes for winner
Recording: Battle saved for replay/sharing
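
The four-round structure above can be expressed as data. The sketch below assumes the coin flip decides which participant takes the first slot, after which turns alternate; `coinFlip` and `buildRoundPlan` are illustrative names.

```typescript
type PlayerSlot = 'A' | 'B';

interface RoundPlan {
  round: number;
  player: PlayerSlot;
  durationSec: number;
}

// Round lengths from the flow above: two 60 s verses, then two 30 s rebuttals.
const ROUND_DURATIONS_SEC = [60, 60, 30, 30];

// Fair coin flip; the random source is injectable so tests can be deterministic.
function coinFlip(random: () => number = Math.random): PlayerSlot {
  return random() < 0.5 ? 'A' : 'B';
}

// Alternate turns starting with whoever won the flip.
function buildRoundPlan(firstPlayer: PlayerSlot): RoundPlan[] {
  const secondPlayer: PlayerSlot = firstPlayer === 'A' ? 'B' : 'A';
  return ROUND_DURATIONS_SEC.map((durationSec, i) => ({
    round: i + 1,
    player: i % 2 === 0 ? firstPlayer : secondPlayer,
    durationSec,
  }));
}
```

With this plan each player gets 90 seconds of mic time, and a full battle runs 180 seconds before voting, not counting countdowns and transitions.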

Battle State Machine

// battle-state.ts
type BattleState =
  | 'idle'
  | 'matchmaking'
  | 'avatar_select'
  | 'beat_select'
  | 'countdown'
  | 'round_active'
  | 'round_transition'
  | 'voting'
  | 'results'
  | 'complete';

// Player, Beat, Vote, and Recording are assumed to be defined elsewhere in the codebase.
interface Battle {
  id: string;
  state: BattleState;
  players: [Player, Player];
  currentRound: number;
  rounds: Round[];
  beat: Beat;
  votes: Vote[];
  recording: Recording | null;
}

interface Round {
  player: 'A' | 'B';
  duration: number;  // seconds
  audio: Blob | null;
  transcript: string | null;
}
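
The state list does not say which transitions are legal. The sketch below shows one plausible transition table; the specific edges (for example, returning to `idle` when matchmaking is cancelled) are assumptions inferred from the battle flow, and `canTransition` and `transition` are hypothetical helpers.

```typescript
type BattleState =
  | 'idle'
  | 'matchmaking'
  | 'avatar_select'
  | 'beat_select'
  | 'countdown'
  | 'round_active'
  | 'round_transition'
  | 'voting'
  | 'results'
  | 'complete';

// Allowed next states for each state (assumed, not taken from the source).
const TRANSITIONS: Record<BattleState, readonly BattleState[]> = {
  idle: ['matchmaking'],
  matchmaking: ['avatar_select', 'idle'], // 'idle' = queue cancelled
  avatar_select: ['beat_select'],
  beat_select: ['countdown'],
  countdown: ['round_active'],
  round_active: ['round_transition', 'voting'], // 'voting' after the last round
  round_transition: ['round_active'],
  voting: ['results'],
  results: ['complete'],
  complete: [],
};

function canTransition(from: BattleState, to: BattleState): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

// Returns the new state, or throws if the move is not allowed.
function transition(from: BattleState, to: BattleState): BattleState {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal battle transition: ${from} -> ${to}`);
  }
  return to;
}
```

Keeping the table as data makes it straightforward to validate state changes on both the client and the server from the same definition.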

📡 WebRTC Architecture

Cloudflare Calls handles the WebRTC infrastructure for real-time streaming.

┌─────────────┐         ┌─────────────────────┐         ┌─────────────┐
│  Player A   │◀───────▶│  Cloudflare Calls   │◀───────▶│  Player B   │
│  (WebRTC)   │         │     (TURN/SFU)      │         │  (WebRTC)   │
└─────────────┘         └─────────────────────┘         └─────────────┘
                                   │
                                   ▼
                        ┌─────────────────────┐
                        │   Spectators (N)    │
                        │  (WebRTC viewers)   │
                        └─────────────────────┘

Cloudflare Calls Integration

// webrtc.ts
class BattleRTC {
  private localStream: MediaStream | null = null;
  private peerConnection: RTCPeerConnection | null = null;
  private sessionId: string | null = null;

  // avatarCanvas: the canvas the avatar is rendered to, captured for recordings.
  constructor(private avatarCanvas: HTMLCanvasElement) {}

  async joinBattle(battleId: string, userId: string) {
    // Get Cloudflare Calls session token
    const response = await fetch('/api/battle/join', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ battleId, userId })
    });
    if (!response.ok) {
      throw new Error(`Failed to join battle: HTTP ${response.status}`);
    }
    const { iceServers, sessionId } = await response.json();
    this.sessionId = sessionId;

    // Setup peer connection with Cloudflare TURN
    this.peerConnection = new RTCPeerConnection({ iceServers });

    // Get user media (audio only for voice)
    this.localStream = await navigator.mediaDevices.getUserMedia({
      audio: {
        echoCancellation: true,
        noiseSuppression: true,
        autoGainControl: true
      },
      video: false  // Avatar renders locally
    });

    // Add tracks to connection
    this.localStream.getTracks().forEach(track => {
      this.peerConnection!.addTrack(track, this.localStream!);
    });
  }

  async startRecording(): Promise<MediaRecorder> {
    if (!this.localStream) {
      throw new Error('joinBattle() must be called before startRecording()');
    }

    const combinedStream = new MediaStream([
      ...this.localStream.getAudioTracks(),
      // Canvas capture for avatar (30 fps)
      this.avatarCanvas.captureStream(30).getVideoTracks()[0]
    ]);

    const recorder = new MediaRecorder(combinedStream, {
      mimeType: 'video/webm;codecs=vp9,opus'
    });
    // Begin capturing; the caller stops the recorder when the battle ends.
    recorder.start();
    return recorder;
  }
}
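
The recorder above requests VP9 in WebM, which not every browser can encode. A minimal sketch of a fallback picker follows; the candidate list and its order are assumptions, and the support check is passed in as a function so the logic can be exercised outside a browser. `pickRecordingMimeType` is a hypothetical helper.

```typescript
// Preferred container/codec combinations, best first (assumed ordering).
const RECORDING_MIME_CANDIDATES: readonly string[] = [
  'video/webm;codecs=vp9,opus',
  'video/webm;codecs=vp8,opus',
  'video/webm',
  'video/mp4',
];

// Returns the first candidate the environment reports as supported,
// or undefined if none are (let the browser choose its default then).
function pickRecordingMimeType(
  isSupported: (mimeType: string) => boolean,
  candidates: readonly string[] = RECORDING_MIME_CANDIDATES
): string | undefined {
  for (let i = 0; i < candidates.length; i++) {
    if (isSupported(candidates[i])) {
      return candidates[i];
    }
  }
  return undefined;
}
```

In the browser the check would typically be `pickRecordingMimeType((t) => MediaRecorder.isTypeSupported(t))`, with the result passed as the `mimeType` option when it is defined.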

🏆 Ranking System

Rank	ELO Range	Badge
Bronze	0 - 999	🥉
Silver	1000 - 1499	🥈
Gold	1500 - 1999	🥇
Platinum	2000 - 2499	💎
Diamond	2500+	👑
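
The tier boundaries above translate directly into a lookup. This is a small sketch using the thresholds from the table; `RANK_TIERS` and `rankForElo` are illustrative names.

```typescript
interface RankTier {
  name: string;
  minElo: number;
  badge: string;
}

// Tiers from the table above, ordered from highest to lowest threshold.
const RANK_TIERS: RankTier[] = [
  { name: 'Diamond', minElo: 2500, badge: '👑' },
  { name: 'Platinum', minElo: 2000, badge: '💎' },
  { name: 'Gold', minElo: 1500, badge: '🥇' },
  { name: 'Silver', minElo: 1000, badge: '🥈' },
  { name: 'Bronze', minElo: 0, badge: '🥉' },
];

// Returns the highest tier whose threshold the rating meets.
// Ratings below 0 are treated as Bronze.
function rankForElo(elo: number): RankTier {
  for (let i = 0; i < RANK_TIERS.length; i++) {
    if (elo >= RANK_TIERS[i].minElo) {
      return RANK_TIERS[i];
    }
  }
  return RANK_TIERS[RANK_TIERS.length - 1];
}
```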

ELO Calculation

// elo.ts
function calculateEloChange(
  winnerElo: number,
  loserElo: number,
  kFactor: number = 32
): { winner: number; loser: number } {
  const expectedWin = 1 / (1 + Math.pow(10, (loserElo - winnerElo) / 400));
  const change = Math.round(kFactor * (1 - expectedWin));

  return {
    winner: winnerElo + change,
    loser: Math.max(0, loserElo - change)
  };
}
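
The function above covers decisive results only. If an audience vote can end in a tie (an assumption; the source does not say how ties are resolved), the standard Elo update generalizes by scoring a win as 1, a draw as 0.5, and a loss as 0. The sketch below uses a hypothetical helper, `eloDelta`, that returns the rating change for player A; player B's change is the negative of it.

```typescript
// Rating change for player A given the match score from A's perspective:
// 1 = A won, 0.5 = draw, 0 = A lost. Player B's change is the negation.
function eloDelta(
  ratingA: number,
  ratingB: number,
  scoreA: 0 | 0.5 | 1,
  kFactor: number = 32
): number {
  // Probability that A wins under the Elo model.
  const expectedA = 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
  return Math.round(kFactor * (scoreA - expectedA));
}
```

With the default K-factor of 32, two 1500-rated players exchange 16 points on a win and nothing on a draw, while a 1200-rated player who upsets a 1600-rated opponent gains 29.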

💰 Cost Estimates

Component	Cost (10K battles/mo)
Cloudflare Calls (WebRTC)	$50 (1TB @ $0.05/GB)
Recording Storage (R2)	$15 (1TB @ $0.015/GB)
Whisper Transcription	$10 (Workers AI)
Avatar Assets (R2)	$5
Total	~$80/mo
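
To put the monthly figures in per-battle terms, the sketch below sums the line items from the table and divides by the 10K-battle baseline. The projection helper assumes costs scale linearly with battle count, which ignores free tiers and volume pricing; all names are illustrative.

```typescript
interface CostItem {
  component: string;
  monthlyUsd: number;
}

// Battle volume the table's estimates are based on.
const BASELINE_BATTLES = 10000;

// Monthly line items from the table above, in USD.
const MONTHLY_COSTS: CostItem[] = [
  { component: 'Cloudflare Calls (WebRTC)', monthlyUsd: 50 },
  { component: 'Recording Storage (R2)', monthlyUsd: 15 },
  { component: 'Whisper Transcription', monthlyUsd: 10 },
  { component: 'Avatar Assets (R2)', monthlyUsd: 5 },
];

function totalMonthlyCost(items: CostItem[] = MONTHLY_COSTS): number {
  return items.reduce((sum, item) => sum + item.monthlyUsd, 0);
}

function costPerBattle(items: CostItem[] = MONTHLY_COSTS): number {
  return totalMonthlyCost(items) / BASELINE_BATTLES;
}

// Rough projection for a different volume, assuming linear scaling.
function projectedMonthlyCost(battles: number): number {
  return costPerBattle() * battles;
}
```

At the baseline volume this works out to $80 per month, or a little under one cent per battle.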

🚀 MVP Scope

Phase 1 (Post-MVP)

1v1 battles only
3 preset avatars
Basic voice effects (pitch only)
Simple ELO ranking
Audio-only lip sync

Phase 2

Avatar customization
Camera-based lip sync
RVC voice cloning
Spectator mode
Battle clips for sharing

Phase 3

Tournament mode
AI judges (Gemini analysis)
Monetization (avatar skins)
Leaderboards
Battle highlights

⚠️ Development Note: Avatar Battles is a post-MVP feature. Focus on core DAW first, then add battle mode in v1.1.