From Silence to Symphony: How We Built an AI-Powered Audio Mixing Engine with FFmpeg 8 and Intelligent Agents
Creating professional meditation audio isn’t just about layering a voice track over background music. It’s an intricate dance of frequencies, dynamics, and timing that traditionally required a skilled audio engineer with years of experience.
The Challenge: When Art Meets Algorithm
Our challenge was ambitious: automate the creation of studio-quality meditation mantras by intelligently mixing voice narrations with background music—while ensuring the result felt organic, not robotic.
We didn’t want to simply apply static presets. We wanted a system that learns, adapts, and improves with every successful mix.
The Architecture: A Symphony of AI Agents
At the heart of our solution is a sophisticated multi-agent system built on the Vizra ADK (Agent Development Kit), orchestrating several specialized components:
┌─────────────────────────────────────────────────────────────┐
│ MantraGenerationWorkflow │
├─────────────────────────────────────────────────────────────┤
│ ┌─────────────────┐ ┌──────────────────────────────┐ │
│ │ Content Finder │ → │ MantraGenerationAgent │ │
│ │ (FindUnmixed) │ │ (Gemini 2.5 Flash) │ │
│ └─────────────────┘ └──────────────────────────────┘ │
│ ↓ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ IntelligentFFmpegMixerTool │ │
│ │ ┌───────────┐ ┌────────────┐ ┌────────────────┐ │ │
│ │ │ Gemini AI │→ │ Vector RAG │→ │ FFmpeg 8 Engine│ │ │
│ │ │ Parameter │ │ Learning │ │ Execution │ │ │
│ │ │ Generator │ │ System │ │ │ │ │
│ │ └───────────┘ └────────────┘ └────────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
The MantraGenerationAgent
Our primary agent operates as an “AI Audio Engineer” that coordinates the entire mixing process:
class MantraGenerationAgent extends BaseLlmAgent
{
    protected string $model = 'gemini-2.5-flash';

    protected ?float $temperature = 0.5;

    protected array $tools = [
        FindUnmixedContentTool::class,
        IntelligentFFmpegMixerTool::class,
        VectorMemoryTool::class,
    ];
}

The agent receives high-level instructions and autonomously decides:
- Which voice and music content to pair together
- When to consult past successful mixes
- How to adjust parameters based on time-of-day context (morning energizing vs. night calming)
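The time-of-day branching can be pictured as a simple lookup. This sketch is hypothetical; the delay and ducking values mirror the context table later in the post:

```shell
# Hypothetical sketch of the agent's time-of-day context selection.
# Delay/ratio values mirror the context table later in the post.
context_for_hour() {
  hour=$1
  if [ "$hour" -ge 5 ] && [ "$hour" -lt 12 ]; then
    echo "morning delay=3 ratio=4"    # energizing, music-forward
  elif [ "$hour" -ge 12 ] && [ "$hour" -lt 18 ]; then
    echo "afternoon delay=5 ratio=6"  # balanced focus
  else
    echo "night delay=7 ratio=10"     # voice-dominant, calming
  fi
}

context_for_hour 7   # morning delay=3 ratio=4
context_for_hour 22  # night delay=7 ratio=10
```

In the real system this choice is made by the agent, not a hard-coded branch, which is what lets it drift as the vector memory accumulates better examples.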
The Intelligence Layer: Gemini + Audio Analysis
Real-Time Audio Intelligence
Before any mixing occurs, our system performs comprehensive audio analysis using FFmpeg’s probing capabilities:
private function analyzeAudio(string $path): array
{
    $command = sprintf(
        'ffprobe -v quiet -print_format json -show_format -show_streams %s',
        escapeshellarg($path)
    );

    // Decode ffprobe's JSON output before reading from it.
    $info = json_decode(shell_exec($command) ?: '', true) ?? [];

    return [
        'duration' => (float) ($info['format']['duration'] ?? 0),
        'bitrate' => (int) (($info['format']['bit_rate'] ?? 0) / 1000),
        'channels' => (int) ($info['streams'][0]['channels'] ?? 0),
        'sample_rate' => (int) ($info['streams'][0]['sample_rate'] ?? 44100),
    ];
}

AI-Driven Parameter Generation
The magic happens when we feed audio characteristics and actual audio files to Gemini 2.0 Flash for parameter optimization:
$response = Prism::structured()
    ->using(Provider::Gemini, 'gemini-2.0-flash-exp')
    ->withPrompt($prompt, [$voiceAttachment, $musicAttachment])
    ->withSchema($mixingParametersSchema)
    ->asStructured();

Gemini analyzes both the metadata and the actual audio content, then returns optimized FFmpeg parameters tailored to the specific voice-music pairing.
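The schema itself is ours, so treat this as illustrative, but the structured payload that comes back looks roughly like the following. The field names match the parameters used in the mixing code; the values are made up:

```shell
# Illustrative structured-output payload; field names mirror the
# parameters referenced elsewhere in this post, values are examples.
params='{
  "voice_delay_seconds": 5,
  "voice_gain_db": 1.5,
  "music_gain_db": -1.0,
  "ducking_threshold": -22,
  "ducking_ratio": 6,
  "ducking_attack": 20,
  "ducking_release": 500
}'
printf '%s\n' "$params"
```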
The Heart of the Mix: FFmpeg 8 Filter Chains
Dynamic Filter Construction
Our system constructs sophisticated FFmpeg filter chains based on AI recommendations:
private function buildFilters(array $params): string
{
    $filters = [];

    // Voice processing: normalize, gain, convert to stereo
    $filters[] = '[0:a]loudnorm=I=-18:LRA=10:TP=-1.5,volume=1.5dB,aformat=channel_layouts=stereo[voice_norm]';

    // Voice delay (let the music establish ambiance first)
    $delayMs = $params['voice_delay_seconds'] * 1000;
    $filters[] = "[voice_norm]adelay={$delayMs}|{$delayMs}[voice_for_mix]";

    // Music processing: normalize, EQ, gain
    $filters[] = '[1:a]loudnorm=I=-19:LRA=9:TP=-2,' .
        'highpass=f=80,lowpass=f=12000,' .
        'equalizer=f=4000:width_type=h:width=800:g=-3,' .
        'volume=-1dB[music_proc]';

    // The mix with intelligent weighting
    $filters[] = '[music_proc][voice_for_mix]amix=inputs=2:normalize=0:dropout_transition=2:weights=1 1[mix_combined]';

    // Final mastering: fade-in and loudness normalization
    $filters[] = '[mix_combined]afade=t=in:ss=0:d=0.12,loudnorm=I=-15:LRA=8:TP=-2[final]';

    return implode(';', $filters);
}

Time-Context Aware Mixing
The system adapts mixing parameters based on when the mantra will be consumed:
| Context | Voice Delay | Ducking Ratio | Character |
|---|---|---|---|
| Morning | 3-4 seconds | 4-6 (gentle) | Energizing, music-forward |
| Afternoon | 5-6 seconds | 6-8 (moderate) | Balanced focus |
| Night | 6-8 seconds | 9-12 (strong) | Voice-dominant, calming |
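FFmpeg expresses ducking with the sidechaincompress filter: the music is compressed whenever the voice signal crosses the threshold. A hedged sketch of the filtergraph fragment (our production chain is assembled in PHP; the threshold here is linear gain, 0.05 ≈ -26 dB):

```shell
# Sketch of a ducking filtergraph: music ([1:a]) is compressed whenever
# voice ([0:a]) is present. Ratio/attack/release follow the "Night" row;
# in production the voice is also split (asplit) so it can appear in the mix.
ratio=9; attack=20; release=500
duck="[1:a][0:a]sidechaincompress=threshold=0.05:ratio=${ratio}:attack=${attack}:release=${release}[music_ducked]"
echo "$duck"
```

Longer release values make the music swell back more slowly after the narrator pauses, which is why the night context pairs a strong ratio with a calm overall feel.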
The Learning System: Vector Memory & RAG
Storing Successful Mixes
Every successful mix becomes training data for future generations:
private function storeSuccessfulMix(
    array $params,
    array $voiceInfo,
    array $musicInfo,
    AgentContext $context
): void {
    $description = sprintf(
        'Successful audio mix: Voice duration %.1fs, Music duration %.1fs. ' .
        'Ducking: %s threshold, %.2f ratio, %dms attack, %dms release.',
        $voiceInfo['duration'],
        $musicInfo['duration'],
        $params['ducking_threshold'],
        $params['ducking_ratio'],
        $params['ducking_attack'],
        $params['ducking_release']
    );

    $this->vector()->addDocument([
        'content' => $description,
        'metadata' => [
            'parameters' => $params,
            'duration_ratio' => $voiceInfo['duration'] / max($musicInfo['duration'], 1),
            'timestamp' => now()->toIso8601String(),
        ],
        'namespace' => 'audio_mixing',
    ]);
}

Retrieval-Augmented Generation (RAG)
Before mixing new content, the system queries its vector memory for similar successful mixes:
$similarMixes = $agent->rag()->search([
    'query' => "meditation mantra mixing with {$voice->label} and {$music->label}",
    'namespace' => 'audio_mixing',
    'limit' => 3,
    'threshold' => 0.7,
]);

This means a morning energizing mantra automatically inherits parameters from past successful morning mixes—while still allowing Gemini to fine-tune for the specific content.
Quality Assurance: AI-Powered Evaluation
We don’t just generate audio—we evaluate it. Our evaluation framework uses LLM judges to assess mix quality:
Voice Clarity Assertion
class VoiceClarityAssertion extends BaseAssertion
{
    protected function getPrompt(
        string $input,
        string $output,
        ?string $expected = null
    ): string {
        return <<<PROMPT
        Evaluate voice clarity in this meditation mantra mix:

        **Clarity Criteria:**
        1. Voice Prominence (30 points)
        2. Intelligibility (30 points)
        3. Frequency Clarity (20 points)
        4. Mix Balance (20 points)

        The voice should be effortlessly intelligible and feel natural,
        as if the narrator is speaking directly to the listener with
        gentle ambient music in the background.
        PROMPT;
    }
}

Ducking Effectiveness Assertion
class DuckingEffectivenessAssertion extends BaseAssertion
{
    // Evaluates smoothness of transitions
    // Validates music reduction when voice is present
    // Ensures meditation suitability
}

Safety First: Parameter Guardrails
AI can be creative—sometimes too creative. We enforce strict parameter bounds to prevent audio disasters:
private const SAFE_VOICE_NORMALIZATION = 'loudnorm=I=-18:LRA=10:TP=-1.5:dual_mono=true:linear=true';
private const SAFE_DUCKING_RATIO = 3;
private const SAFE_DUCKING_THRESHOLD_DB = -22;
private const MIN_VOICE_DELAY_SECONDS = 2;
private const MAX_VOICE_DELAY_SECONDS = 20;

private function enforceParameterSafety(
    array $params,
    array $voiceInfo,
    array $musicInfo
): array {
    $voiceDuration = $voiceInfo['duration'];
    $musicDuration = $musicInfo['duration'];

    // Ensure voice delay respects music duration
    // (MUSIC_TAIL_BUFFER_SECONDS is defined with the other class constants)
    $availableIntro = $musicDuration - $voiceDuration - self::MUSIC_TAIL_BUFFER_SECONDS;
    $params['voice_delay_seconds'] = max(
        self::MIN_VOICE_DELAY_SECONDS,
        min(self::MAX_VOICE_DELAY_SECONDS, (int) round($availableIntro))
    );

    // Clamp all parameters to safe ranges
    $params['ducking_ratio'] = max(3, min(12, $params['ducking_ratio']));
    $params['voice_gain_db'] = max(-2.0, min(3.0, $params['voice_gain_db']));
    $params['music_gain_db'] = max(-4.0, min(1.0, $params['music_gain_db']));

    return $params;
}

The Complete Pipeline
Here’s how a mantra is born:
flowchart TD
    A[User Request] --> B{Content IDs Provided?}
    B -->|No| C[FindUnmixedContentTool]
    B -->|Yes| D[Analyze Audio Files]
    C --> D
    D --> E[Query Vector Memory for Similar Mixes]
    E --> F[Gemini Analyzes Audio + Metadata]
    F --> G[Generate FFmpeg Parameters]
    G --> H[Apply Safety Guardrails]
    H --> I[Execute FFmpeg Mix]
    I --> J{Mix Successful?}
    J -->|Yes| K[Store in Vector Memory]
    J -->|No| L[Retry with Fallback Params]
    L --> I
    K --> M[Create Content Record]
    M --> N[Generate Social Cards]
    N --> O[Mantra Published! 🧘]

Results: The Numbers Speak
Since deploying this system:
- Processing Time: ~30-60 seconds per mantra (vs. 15-30 minutes for manual mixing)
- Quality Consistency: 75%+ pass rate on automated quality evaluations
- Learning Curve: System performance improves measurably after ~50 successful mixes
- Human Intervention: <5% of mixes require manual adjustment
What We Learned
1. AI Needs Guardrails
Gemini occasionally suggests parameters that would produce technically valid but aesthetically poor results. Our safety layer is essential.
2. Vector Memory is Powerful
The RAG system’s ability to recall successful mixes transformed our quality curve. New content benefits from all past successes.
3. Context Matters
A “good mix” for morning meditation is different from night. Time-aware parameter selection dramatically improved user satisfaction.
4. FFmpeg is Incredibly Capable
FFmpeg 8’s filter graph system is essentially a visual programming language for audio. Combined with AI parameter generation, it’s unstoppable.
What’s Next
We’re exploring:
- Real-time audio analysis during mixing for adaptive parameter adjustment
- User feedback loops to incorporate listener preferences into the learning system
- Multi-track mixing for more complex compositions with ambient sounds
- Personalized mixing profiles based on individual user listening patterns
Try It Yourself
The core concepts are transferable to any audio processing pipeline:
- Analyze your inputs with FFprobe
- Query past successes from a vector database
- Generate parameters with a capable LLM
- Apply safety bounds before execution
- Store successful results for future learning
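Steps 1, 4, and 5 of that loop can be sketched from a shell. Paths and numeric values are illustrative, and the final command is printed rather than executed:

```shell
# Minimal sketch of the FFmpeg side of the pipeline. voice.mp3/music.mp3
# and all numeric values are illustrative placeholders.
VOICE=voice.mp3
MUSIC=music.mp3

# Step 4: clamp an AI-suggested delay into the safe 2..20 s window.
delay=25
if [ "$delay" -lt 2 ]; then delay=2; fi
if [ "$delay" -gt 20 ]; then delay=20; fi
delay_ms=$((delay * 1000))

# Normalize voice, delay it, mix it over normalized music, master the result.
FILTERS="[0:a]loudnorm=I=-18:LRA=10:TP=-1.5,adelay=${delay_ms}|${delay_ms}[v];[1:a]loudnorm=I=-19:LRA=9:TP=-2[m];[m][v]amix=inputs=2:normalize=0[mix];[mix]loudnorm=I=-15:LRA=8:TP=-2[out]"

# Steps 1 and 5 would then run:
#   ffprobe -v quiet -show_entries format=duration -of csv=p=0 "$VOICE"
#   ffmpeg -y -i "$VOICE" -i "$MUSIC" -filter_complex "$FILTERS" -map "[out]" mantra.mp3
echo "$FILTERS"
```

The vector-memory and LLM steps wrap around this core, but everything that actually touches the audio reduces to one ffprobe call and one ffmpeg filtergraph.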
The result? A system that gets better with every mix—turning the art of audio engineering into a learnable, scalable process.
Built with Laravel 12, Filament 4, Vizra ADK, FFmpeg 8, Google Gemini, and a deep appreciation for the meditation practitioners who use our content every day. 🧘♀️