Orchestrating Multimedia Magic: How I Built Content Generation with Vizra ADK Workflows


In my recent work, I wasn’t just generating text; I was building immersive multimedia experiences. Whether it was generating daily briefings, educational content, or dynamic updates, relying on a single LLM prompt often fell short.

To create high-quality content that combined researched text, synthesized audio (via ElevenLabs), and generated imagery, I needed orchestration. Enter Vizra ADK Workflows.

Here is a deep dive into how I leveraged Vizra’s workflow patterns to turn a simple topic into a full multimedia package.

The Challenge: Coordination vs. Chaos

Generating multimedia requires distinct steps:

  1. Research & Writing: Ensuring factual accuracy (RAG) and engaging copy.
  2. Audio Synthesis: Converting text to speech using specific voice profiles.
  3. Visuals: Generating thumbnails or accompanying images.

Doing this linearly is slow. Doing it without structure is error-prone. I needed a system that could handle sequential logic for writing and parallel execution for asset generation.

The Solution: The Vizra Workflow

I utilized the Workflow facade provided by Vizra ADK to compose a pipeline that mixes sequential and parallel execution patterns.

1. The Architect: Sequential Planning

Everything starts with a script. I used a Sequential Workflow to ensure I had a solid foundation before generating expensive assets.

use Vizra\VizraADK\Facades\Workflow;
use App\Agents\Content\ResearcherAgent;
use App\Agents\Content\ScriptWriterAgent;

public function generateContent(string $topic)
{
    // Step 1: Research and Write
    $scriptData = Workflow::sequential()
        ->then(ResearcherAgent::class) // Uses Meilisearch Vector Store
        ->then(ScriptWriterAgent::class) // Uses Gemini Pro for reasoning
        ->run($topic);

    // ... pass to next stage
}

In this stage, the ResearcherAgent uses the VectorMemoryTool (backed by Meilisearch) to pull relevant context. The ScriptWriterAgent then formats this into a JSON structure containing a title, body, and a prompt for the image generator.
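For context, the array handed to the next stage looks roughly like this. The key names below are illustrative (chosen to match the `final_script` key read by the parallel stage later), not an exhaustive schema:

```php
// Illustrative shape of $scriptData after the sequential stage.
$scriptData = [
    'title'        => 'Why Edge Computing Matters',    // headline for the piece
    'final_script' => 'Today we are looking at...',    // narration text for the TTS agent
    'image_prompt' => 'Isometric illustration of ...', // fed to the thumbnail agent
];
```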

2. The Factory: Parallel Asset Generation

Once I had the script, I didn’t want to wait for the audio to finish before starting the image generation. Vizra’s Parallel Workflow allowed me to spin up multiple agents simultaneously.

I passed the output from the sequential step into a parallel block.

    // ... inside generateContent

    $assets = Workflow::parallel()
        ->agents([
            'audio' => VoiceOverAgent::class,
            'visual' => ThumbnailGeneratorAgent::class,
        ])
        ->run($scriptData['final_script']);

    return [
        'script' => $scriptData,
        'audio_path' => $assets['audio'],
        'image_url' => $assets['visual'],
    ];
}
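Calling the pipeline is then a one-liner from wherever content gets scheduled. This call site is hypothetical, but it shows the shape of the result:

```php
// e.g. from a queued job or controller action
$package = $this->generateContent('The basics of container networking');

// $package now holds the script plus references to the generated assets:
// ['script' => [...], 'audio_path' => '...', 'image_url' => '...']
```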

3. The Agents & Tools

The magic happens inside the specialized agents. Here is how I configured them using the Vizra ADK structure.

The Voice Over Agent (ElevenLabs)

This agent is responsible for taking text and returning a path to an MP3 file. It utilizes a custom tool I built to interface with the ElevenLabs API.

namespace App\Agents\Content;

use Vizra\VizraADK\Agents\BaseLlmAgent;
use App\Tools\Audio\ElevenLabsTtsTool;

class VoiceOverAgent extends BaseLlmAgent
{
    protected string $name = 'voice_over_specialist';

    protected string $model = 'gpt-4o-mini'; // Fast, low cost for tool calling

    protected string $instructions = <<<'INSTRUCTIONS'
        You are an audio engineer.
        1. Receive the script text.
        2. Select the appropriate voice ID based on the content tone.
        3. Use the 'text_to_speech' tool to generate the audio.
        4. Return the file path provided by the tool.
    INSTRUCTIONS;

    protected array $tools = [
        ElevenLabsTtsTool::class,
    ];
}

The Tool Implementation

The ElevenLabsTtsTool handles the actual API call, keeping my agent logic clean.

namespace App\Tools\Audio;

use Vizra\VizraADK\Contracts\ToolInterface;
use Illuminate\Support\Facades\Http;

class ElevenLabsTtsTool implements ToolInterface
{
    public function definition(): array
    {
        return [
            'name' => 'text_to_speech',
            'description' => 'Converts text to audio using ElevenLabs',
            'parameters' => [
                'type' => 'object',
                'properties' => [
                    'text' => ['type' => 'string'],
                    'voice_id' => ['type' => 'string'],
                ],
                'required' => ['text'],
            ],
        ];
    }

    public function execute(array $arguments, $context, $memory): string
    {
        // Sketch of the API call; the config key, fallback voice, and
        // storage location are my conventions, not part of Vizra ADK.
        $voiceId = $arguments['voice_id'] ?? 'default_voice';

        $response = Http::withHeaders([
                'xi-api-key' => config('services.elevenlabs.key'),
            ])
            ->post("https://api.elevenlabs.io/v1/text-to-speech/{$voiceId}", [
                'text' => $arguments['text'],
            ]);

        // Persist the MP3 bytes and hand the path back to the agent.
        $path = storage_path('app/audio/' . uniqid('tts_', true) . '.mp3');
        file_put_contents($path, $response->body());

        return json_encode(['status' => 'success', 'path' => $path]);
    }
}
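The ThumbnailGeneratorAgent follows the same agent-plus-tool shape. I haven't reproduced mine in full, but a minimal sketch looks like this (the `ImageGenerationTool` name is illustrative):

```php
namespace App\Agents\Content;

use Vizra\VizraADK\Agents\BaseLlmAgent;
use App\Tools\Images\ImageGenerationTool; // hypothetical tool, same pattern as the TTS tool

class ThumbnailGeneratorAgent extends BaseLlmAgent
{
    protected string $name = 'thumbnail_generator';

    protected string $model = 'gpt-4o-mini';

    protected string $instructions = <<<'INSTRUCTIONS'
        You are a visual designer.
        1. Receive the image prompt from the script.
        2. Use the 'generate_image' tool to create a thumbnail.
        3. Return the image URL provided by the tool.
    INSTRUCTIONS;

    protected array $tools = [
        ImageGenerationTool::class,
    ];
}
```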

Why This Approach Wins

  1. Modularity: If I want to switch from ElevenLabs to OpenAI TTS, I just swap the tool in the VoiceOverAgent. The workflow remains untouched.
  2. Speed: By parallelizing the asset generation, I cut the total processing time by nearly 50%.
  3. Observability: Vizra’s built-in tracing allows me to see exactly what the ResearcherAgent found in Meilisearch and why the ScriptWriterAgent made specific creative decisions.
  4. Maintainability: Each agent has a single responsibility. The ResearcherAgent doesn’t know about audio files, and the VoiceOverAgent doesn’t care about SEO keywords.
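As a concrete example of point 1, swapping providers is a tool-level change. Here is a sketch of an OpenAI-backed replacement; the endpoint and payload follow OpenAI's `/v1/audio/speech` API, while the class itself and the config key are hypothetical:

```php
namespace App\Tools\Audio;

use Vizra\VizraADK\Contracts\ToolInterface;
use Illuminate\Support\Facades\Http;

class OpenAiTtsTool implements ToolInterface
{
    public function definition(): array
    {
        // Same tool name and schema as ElevenLabsTtsTool, so the
        // VoiceOverAgent's instructions don't need to change.
        return [
            'name' => 'text_to_speech',
            'description' => 'Converts text to audio using OpenAI TTS',
            'parameters' => [
                'type' => 'object',
                'properties' => [
                    'text' => ['type' => 'string'],
                    'voice_id' => ['type' => 'string'],
                ],
                'required' => ['text'],
            ],
        ];
    }

    public function execute(array $arguments, $context, $memory): string
    {
        $response = Http::withToken(config('services.openai.key'))
            ->post('https://api.openai.com/v1/audio/speech', [
                'model' => 'tts-1',
                'input' => $arguments['text'],
                'voice' => $arguments['voice_id'] ?? 'alloy',
            ]);

        $path = storage_path('app/audio/' . uniqid('tts_', true) . '.mp3');
        file_put_contents($path, $response->body());

        return json_encode(['status' => 'success', 'path' => $path]);
    }
}
```

Because both tools expose the same `text_to_speech` definition, the workflow and the VoiceOverAgent remain untouched.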

Conclusion

Building complex AI features isn’t just about prompt engineering; it’s about architecture. By treating LLMs as specialized workers within a Vizra Workflow, I’ve turned a complex, multi-modal generation process into a reliable, maintainable pipeline.
