From Silence to Symphony: How We Built an AI-Powered Audio Mixing Engine with FFmpeg 8 and Intelligent Agents

Creating professional meditation audio isn’t just about layering a voice track over background music. It’s an intricate dance of frequencies, dynamics, and timing that traditionally required a skilled audio engineer with years of experience.

The Challenge: When Art Meets Algorithm

Our challenge was ambitious: automate the creation of studio-quality meditation mantras by intelligently mixing voice narrations with background music—while ensuring the result felt organic, not robotic.

We didn’t want to simply apply static presets. We wanted a system that learns, adapts, and improves with every successful mix.

The Architecture: A Symphony of AI Agents

At the heart of our solution is a sophisticated multi-agent system built on the Vizra ADK (Agent Development Kit), orchestrating several specialized components:

┌─────────────────────────────────────────────────────────────┐
│                 MantraGenerationWorkflow                    │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐    ┌──────────────────────────────┐   │
│  │ Content Finder  │ →  │  MantraGenerationAgent       │   │
│  │ (FindUnmixed)   │    │  (Gemini 2.5 Flash)          │   │
│  └─────────────────┘    └──────────────────────────────┘   │
│            ↓                           ↓                    │
│  ┌─────────────────────────────────────────────────────┐   │
│  │          IntelligentFFmpegMixerTool                 │   │
│  │  ┌───────────┐  ┌────────────┐  ┌────────────────┐  │   │
│  │  │ Gemini AI │→ │ Vector RAG │→ │ FFmpeg 8 Engine│  │   │
│  │  │ Parameter │  │  Learning  │  │   Execution    │  │   │
│  │  │ Generator │  │   System   │  │                │  │   │
│  │  └───────────┘  └────────────┘  └────────────────┘  │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

The MantraGenerationAgent

Our primary agent operates as an “AI Audio Engineer” that coordinates the entire mixing process:

class MantraGenerationAgent extends BaseLlmAgent
{
    protected string $model = 'gemini-2.5-flash';
    protected ?float $temperature = 0.5;

    protected array $tools = [
        FindUnmixedContentTool::class,
        IntelligentFFmpegMixerTool::class,
        VectorMemoryTool::class,
    ];
}

The agent receives high-level instructions and autonomously decides:

  • Which voice and music content to pair together
  • When to consult past successful mixes
  • How to adjust parameters based on time-of-day context (morning energizing vs. night calming)

The Intelligence Layer: Gemini + Audio Analysis

Real-Time Audio Intelligence

Before any mixing occurs, our system performs comprehensive audio analysis using FFmpeg’s probing capabilities:

private function analyzeAudio(string $path): array
{
    $command = sprintf(
        'ffprobe -v quiet -print_format json -show_format -show_streams %s',
        escapeshellarg($path)
    );

    // Run ffprobe and decode its JSON report
    $info = json_decode((string) shell_exec($command), true) ?? [];

    return [
        'duration' => (float) ($info['format']['duration'] ?? 0),
        'bitrate' => (int) (($info['format']['bit_rate'] ?? 0) / 1000), // kbps
        'channels' => (int) ($info['streams'][0]['channels'] ?? 0),
        'sample_rate' => (int) ($info['streams'][0]['sample_rate'] ?? 44100),
    ];
}

AI-Driven Parameter Generation

The magic happens when we feed audio characteristics and actual audio files to Gemini 2.0 Flash for parameter optimization:

$response = Prism::structured()
    ->using(Provider::Gemini, 'gemini-2.0-flash-exp')
    ->withPrompt($prompt, [$voiceAttachment, $musicAttachment])
    ->withSchema($mixingParametersSchema)
    ->asStructured();
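
The `$mixingParametersSchema` is what keeps Gemini's reply machine-readable. A sketch of what such a schema could look like using Prism's schema classes (the exact field list here is illustrative, not our full parameter set):

```php
use Prism\Prism\Schema\ObjectSchema;
use Prism\Prism\Schema\NumberSchema;

// Illustrative schema: constrains Gemini to exactly the fields the mixer understands.
$mixingParametersSchema = new ObjectSchema(
    name: 'mixing_parameters',
    description: 'FFmpeg mixing parameters for a voice/music pairing',
    properties: [
        new NumberSchema('voice_delay_seconds', 'Seconds of music intro before the voice enters'),
        new NumberSchema('ducking_ratio', 'Sidechain compression ratio (3-12)'),
        new NumberSchema('voice_gain_db', 'Voice gain adjustment in dB'),
        new NumberSchema('music_gain_db', 'Music gain adjustment in dB'),
    ],
    requiredFields: ['voice_delay_seconds', 'ducking_ratio', 'voice_gain_db', 'music_gain_db'],
);
```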

Gemini analyzes both the metadata and the actual audio content, then returns optimized FFmpeg parameters tailored to the specific voice-music pairing.

The Heart of the Mix: FFmpeg 8 Filter Chains

Dynamic Filter Construction

Our system constructs sophisticated FFmpeg filter chains based on AI recommendations:

private function buildFilters(array $params): string
{
    $filters = [];

    // Voice processing: normalize, gain, convert to stereo
    $filters[] = '[0:a]loudnorm=I=-18:LRA=10:TP=-1.5,volume=1.5dB,aformat=channel_layouts=stereo[voice_norm]';

    // Voice delay (let the music establish ambiance first)
    $delayMs = $params['voice_delay_seconds'] * 1000;
    $filters[] = "[voice_norm]adelay={$delayMs}|{$delayMs}[voice_for_mix]";

    // Music processing: normalize, EQ, gain
    $filters[] = '[1:a]loudnorm=I=-19:LRA=9:TP=-2,' .
                 'highpass=f=80,lowpass=f=12000,' .
                 'equalizer=f=4000:width_type=h:width=800:g=-3,' .
                 'volume=-1dB[music_proc]';

    // The mix with intelligent weighting
    $filters[] = '[music_proc][voice_for_mix]amix=inputs=2:normalize=0:dropout_transition=2:weights=1 1[mix_combined]';

    // Final mastering: fade-in and loudness normalization
    $filters[] = '[mix_combined]afade=t=in:ss=0:d=0.12,loudnorm=I=-15:LRA=8:TP=-2[final]';

    return implode(';', $filters);
}
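
The AI-generated ducking parameters (threshold, ratio, attack, release) don't appear in the chain above; one way to wire them in is FFmpeg's `sidechaincompress` filter. A sketch, assuming parameter keys like those we store in vector memory (the exact keys and labels are illustrative):

```php
// Illustrative ducking stage via sidechaincompress.
// The voice is split: one copy drives the compressor's sidechain,
// the other remains available for the final amix.
$filters[] = '[voice_for_mix]asplit=2[voice_mix][voice_sc]';
$filters[] = sprintf(
    '[music_proc][voice_sc]sidechaincompress=threshold=%.4f:ratio=%.1f:attack=%d:release=%d[music_ducked]',
    10 ** ($params['ducking_threshold'] / 20), // dB -> linear amplitude, as sidechaincompress expects
    $params['ducking_ratio'],
    $params['ducking_attack'],   // milliseconds
    $params['ducking_release']   // milliseconds
);
$filters[] = '[music_ducked][voice_mix]amix=inputs=2:normalize=0[mix_combined]';
```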

Time-Context Aware Mixing

The system adapts mixing parameters based on when the mantra will be consumed:

| Context   | Voice Delay | Ducking Ratio  | Character                 |
|-----------|-------------|----------------|---------------------------|
| Morning   | 3-4 seconds | 4-6 (gentle)   | Energizing, music-forward |
| Afternoon | 5-6 seconds | 6-8 (moderate) | Balanced focus            |
| Night     | 6-8 seconds | 9-12 (strong)  | Voice-dominant, calming   |
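
In code, the time-context selection can be as small as a `match` on the local hour. The helper below is an illustrative sketch, not our exact implementation; the values mirror the table:

```php
// Illustrative: pick delay/ducking presets from the current hour.
private function contextPreset(int $hour): array
{
    return match (true) {
        $hour >= 5 && $hour < 12  => ['voice_delay_seconds' => 3, 'ducking_ratio' => 5],  // morning
        $hour >= 12 && $hour < 18 => ['voice_delay_seconds' => 5, 'ducking_ratio' => 7],  // afternoon
        default                   => ['voice_delay_seconds' => 7, 'ducking_ratio' => 10], // night
    };
}
```

Gemini then fine-tunes from the preset rather than starting from scratch.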

The Learning System: Vector Memory & RAG

Storing Successful Mixes

Every successful mix becomes training data for future generations:

private function storeSuccessfulMix(
    array $params,
    array $voiceInfo,
    array $musicInfo,
    AgentContext $context
): void {
    $description = sprintf(
        'Successful audio mix: Voice duration %.1fs, Music duration %.1fs. ' .
        'Ducking: %s threshold, %.2f ratio, %dms attack, %dms release.',
        $voiceInfo['duration'],
        $musicInfo['duration'],
        $params['ducking_threshold'],
        $params['ducking_ratio'],
        $params['ducking_attack'],
        $params['ducking_release']
    );

    $agent->vector()->addDocument([
        'content' => $description,
        'metadata' => [
            'parameters' => $params,
            'duration_ratio' => $voiceInfo['duration'] / max($musicInfo['duration'], 1),
            'timestamp' => now()->toIso8601String(),
        ],
        'namespace' => 'audio_mixing',
    ]);
}

Retrieval-Augmented Generation (RAG)

Before mixing new content, the system queries its vector memory for similar successful mixes:

$similarMixes = $agent->rag()->search([
    'query' => "meditation mantra mixing with {$voice->label} and {$music->label}",
    'namespace' => 'audio_mixing',
    'limit' => 3,
    'threshold' => 0.7,
]);

This means a morning energizing mantra automatically inherits parameters from past successful morning mixes—while still allowing Gemini to fine-tune for the specific content.

Quality Assurance: AI-Powered Evaluation

We don’t just generate audio—we evaluate it. Our evaluation framework uses LLM judges to assess mix quality:

Voice Clarity Assertion

class VoiceClarityAssertion extends BaseAssertion
{
    protected function getPrompt(
        string $input,
        string $output,
        ?string $expected = null
    ): string {
        return <<<PROMPT
        Evaluate voice clarity in this meditation mantra mix:

        **Clarity Criteria:**
        1. Voice Prominence (30 points)
        2. Intelligibility (30 points)
        3. Frequency Clarity (20 points)
        4. Mix Balance (20 points)

        Voice should be effortlessly intelligible and feel natural,
        as if narrator is speaking directly to listener with gentle
        ambient music in background.
        PROMPT;
    }
}

Ducking Effectiveness Assertion

class DuckingEffectivenessAssertion extends BaseAssertion
{
    // Evaluates smoothness of transitions
    // Validates music reduction when voice is present
    // Ensures meditation suitability
}

Safety First: Parameter Guardrails

AI can be creative—sometimes too creative. We enforce strict parameter bounds to prevent audio disasters:

private const SAFE_VOICE_NORMALIZATION = 'loudnorm=I=-18:LRA=10:TP=-1.5:dual_mono=true:linear=true';
private const SAFE_DUCKING_RATIO = 3;
private const SAFE_DUCKING_THRESHOLD_DB = -22;
private const MIN_VOICE_DELAY_SECONDS = 2;
private const MAX_VOICE_DELAY_SECONDS = 20;

private function enforceParameterSafety(
    array $params,
    array $voiceInfo,
    array $musicInfo
): array {
    // Ensure voice delay respects music duration
    $voiceDuration = $voiceInfo['duration'];
    $musicDuration = $musicInfo['duration'];
    $availableIntro = $musicDuration - $voiceDuration - self::MUSIC_TAIL_BUFFER_SECONDS;
    $params['voice_delay_seconds'] = max(
        self::MIN_VOICE_DELAY_SECONDS,
        min(self::MAX_VOICE_DELAY_SECONDS, (int) round($availableIntro))
    );

    // Clamp all parameters to safe ranges
    $params['ducking_ratio'] = max(3, min(12, $params['ducking_ratio']));
    $params['voice_gain_db'] = max(-2.0, min(3.0, $params['voice_gain_db']));
    $params['music_gain_db'] = max(-4.0, min(1.0, $params['music_gain_db']));

    return $params;
}

The Complete Pipeline

Here’s how a mantra is born:

flowchart TD
    A[User Request] --> B{Content IDs Provided?}
    B -->|No| C[FindUnmixedContentTool]
    B -->|Yes| D[Analyze Audio Files]
    C --> D
    D --> E[Query Vector Memory for Similar Mixes]
    E --> F[Gemini Analyzes Audio + Metadata]
    F --> G[Generate FFmpeg Parameters]
    G --> H[Apply Safety Guardrails]
    H --> I[Execute FFmpeg Mix]
    I --> J{Mix Successful?}
    J -->|Yes| K[Store in Vector Memory]
    J -->|No| L[Retry with Fallback Params]
    L --> I
    K --> M[Create Content Record]
    M --> N[Generate Social Cards]
    N --> O[Mantra Published! 🧘]

Results: The Numbers Speak

Since deploying this system:

  • Processing Time: ~30-60 seconds per mantra (vs. 15-30 minutes for manual mixing)
  • Quality Consistency: 75%+ pass rate on automated quality evaluations
  • Learning Curve: System performance improves measurably after ~50 successful mixes
  • Human Intervention: <5% of mixes require manual adjustment

What We Learned

1. AI Needs Guardrails

Gemini occasionally suggests parameters that would produce technically valid but aesthetically poor results. Our safety layer is essential.

2. Vector Memory is Powerful

The RAG system’s ability to recall successful mixes transformed our quality curve. New content benefits from all past successes.

3. Context Matters

A “good mix” for morning meditation is different from night. Time-aware parameter selection dramatically improved user satisfaction.

4. FFmpeg is Incredibly Capable

FFmpeg 8’s filter graph system is essentially a dataflow programming language for audio. Combined with AI parameter generation, it’s remarkably powerful.

What’s Next

We’re exploring:

  • Real-time audio analysis during mixing for adaptive parameter adjustment
  • User feedback loops to incorporate listener preferences into the learning system
  • Multi-track mixing for more complex compositions with ambient sounds
  • Personalized mixing profiles based on individual user listening patterns

Try It Yourself

The core concepts are transferable to any audio processing pipeline:

  1. Analyze your inputs with FFprobe
  2. Query past successes from a vector database
  3. Generate parameters with a capable LLM
  4. Apply safety bounds before execution
  5. Store successful results for future learning
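
Stitched together, those five steps form a loop along these lines; every helper name below is a placeholder for whatever your own stack provides:

```php
// Illustrative end-to-end loop; all helpers are placeholders.
$voiceInfo = analyzeAudio($voicePath);                                  // 1. ffprobe analysis
$musicInfo = analyzeAudio($musicPath);
$similar   = $vectorDb->search(describePair($voiceInfo, $musicInfo));   // 2. recall past wins
$params    = $llm->suggestParameters($voiceInfo, $musicInfo, $similar); // 3. generate
$params    = enforceParameterSafety($params, $voiceInfo, $musicInfo);   // 4. clamp to safe bounds
if (runFfmpegMix($voicePath, $musicPath, $params)) {
    $vectorDb->store(describePair($voiceInfo, $musicInfo), $params);    // 5. learn from success
}
```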

The result? A system that gets better with every mix—turning the art of audio engineering into a learnable, scalable process.


Built with Laravel 12, Filament 4, Vizra ADK, FFmpeg 8, Google Gemini, and a deep appreciation for the meditation practitioners who use our content every day. 🧘‍♀️
