MultimodalAgents

Published on Sep 23, 2025

Scene 1 (0s)

[Audio] Welcome to this presentation. We will introduce a new Python AI agent library and framework that simplifies building workflows for the automated processing of multimodal data with AI.

Scene 2 (14s)

[Audio] Most existing AI agent libraries focus heavily on Large Language Models (LLMs), using them as a central hub even for other AI types like image or audio generation. However, not every task needs an LLM! Sometimes connecting directly to these other AIs is more efficient, and current libraries don't easily support such direct workflows. We need a library that offers flexibility for all AI types, allowing us to build streamlined applications without being limited by an LLM-centric approach.

Scene 3 (49s)

[Audio] Current AI agent libraries struggle with reliability and accuracy because of LLM inconsistencies, especially when chaining multiple steps. They also lack robust auditing and observability, which hinders accountability. Furthermore, these systems face emerging security vulnerabilities such as agent hijacking and prompt injection, alongside significant cost and scalability challenges in real-world deployment.

Scene 4 (1m 20s)

[Audio] Our new AI agent architecture moves beyond being "LLM-centric" by creating a platform-agnostic system where tasks (defined by agents, prompts, and data) are independent of the specific model or cloud provider used. This decoupling unlocks flexibility, enables easy mixing of local and cloud resources, and facilitates experimentation and optimization.
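The decoupling described here can be illustrated with a minimal sketch. It is not the MultimodalAgents API: the names `Task`, `Platform`, `EchoPlatform`, and `execute` are hypothetical, and the platform is a stand-in that returns a string instead of calling a real model. The point is only that the task description never mentions a provider.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Task:
    """Hypothetical task: describes the work without naming a model or provider."""
    agent: str          # kind of work, e.g. "text-to-image"
    prompt: str
    data: bytes = b""


class Platform(Protocol):
    """Anything that can run a Task (hypothetical interface)."""
    name: str

    def run(self, task: Task) -> str: ...


@dataclass
class EchoPlatform:
    """Stand-in backend: describes the call instead of invoking a real model."""
    name: str

    def run(self, task: Task) -> str:
        return f"[{self.name}] {task.agent}: {task.prompt}"


def execute(task: Task, platform: Platform) -> str:
    # The same task object can be handed to any platform.
    return platform.run(task)


task = Task(agent="text-to-image", prompt="a red bicycle")
local_result = execute(task, EchoPlatform("local"))
cloud_result = execute(task, EchoPlatform("cloud"))
```

Because `Task` holds no platform reference, moving the work from a local to a cloud backend changes only the second argument of `execute`.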

Scene 5 (1m 47s)

[Audio] Using cloud AI platforms. Pro: Scalability and access. You can easily access powerful AI models and scale resources on demand without significant infrastructure investment. Con: Vendor lock-in and control. Reliance on a third party can limit customization options and create dependency concerns regarding pricing, data privacy, and model availability.

Scene 6 (2m 36s)

[Audio] Using local (on-premises) AI platforms. Pro: Data privacy and customization. Full control over your data and models allows for tailored solutions and enhanced security. Con: High costs and complexity. Running locally requires significant upfront investment in hardware, expertise, and ongoing maintenance to manage the infrastructure effectively.

Scene 7 (3m 15s)

[Audio] Our solution offers a flexible approach by enabling workflows that seamlessly combine local and cloud AI platforms, giving you the best of both worlds. This allows for complete data control when needed, cost-effective offloading of specific tasks to the cloud, integration with existing tools, and easy blending of AI with traditional code processing.
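One way to picture the local/cloud mix is a small routing rule that decides, per job, where it should run. This is a sketch under assumptions, not the library's behavior: `Job`, its two flags, and `choose_platform` are hypothetical names, and the policy (private data always stays local) is just one plausible choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical description of one step in a workflow."""
    name: str
    contains_private_data: bool
    needs_large_model: bool


def choose_platform(job: Job) -> str:
    """Return "local" or "cloud" for a job (illustrative policy only)."""
    if job.contains_private_data:
        return "local"   # data control takes priority over everything else
    if job.needs_large_model:
        return "cloud"   # offload heavy work when the data is not sensitive
    return "local"       # default: avoid cloud cost for small jobs


jobs = [
    Job("analyze customer photo", contains_private_data=True, needs_large_model=True),
    Job("generate stock background", contains_private_data=False, needs_large_model=True),
    Job("resize image", contains_private_data=False, needs_large_model=False),
]
routing = {job.name: choose_platform(job) for job in jobs}
```

The last job also shows the "traditional code" case: a resize needs no AI model at all and can stay a plain local step.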

Scene 8 (3m 39s)

[Audio] Our MultimodalAgents architecture breaks down AI tasks into core components: **Platforms** (where tasks run), **Models** (the AI engines), **Agents** (task execution logic), **Media** (data inputs), and **Workflows** (complex task sequences) managed by an **Orchestrator**. The library is currently in alpha and supports basic text and image workflows, with plans to expand to other data types.
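To make the relationships between these components concrete, here is a rough data-model sketch. The class names mirror the terms in the narration, but every field, method, and handler below is an assumption made for illustration and should not be read as the library's actual classes.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Platform:
    name: str                      # where tasks run, e.g. "local" or "cloud"


@dataclass(frozen=True)
class Model:
    name: str                      # the AI engine
    platform: Platform


@dataclass(frozen=True)
class Media:
    kind: str                      # "text" or "image"
    content: str


@dataclass
class Agent:
    """Task execution logic: applies a handler to media using a model."""
    name: str
    model: Model
    handler: Callable[[Model, Media], Media]

    def run(self, media: Media) -> Media:
        return self.handler(self.model, media)


@dataclass
class Workflow:
    steps: list[Agent] = field(default_factory=list)


class Orchestrator:
    """Runs the agents of a workflow in order, passing media along."""

    def run(self, workflow: Workflow, media: Media) -> Media:
        for agent in workflow.steps:
            media = agent.run(media)
        return media


# Stub handlers standing in for real model calls.
def improve(model: Model, media: Media) -> Media:
    return Media("text", media.content + ", studio lighting")


def render(model: Model, media: Media) -> Media:
    return Media("image", f"<image by {model.name}: {media.content}>")


local = Platform("local")
workflow = Workflow([
    Agent("enhancer", Model("stub-llm", local), improve),
    Agent("painter", Model("stub-image", local), render),
])
result = Orchestrator().run(workflow, Media("text", "a cat"))
```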

Scene 9 (4m 11s)

[Audio] A simple workflow uses an LLM to enhance a basic prompt into an improved one, similar to the "magic wand" found on some platforms. The improved prompt is then used to create images with one or more models.
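The two steps can be sketched as plain functions. Both `enhance_prompt` and `generate_image` are placeholders invented for this example: the first stands in for the LLM call and the second for a text-to-image call, and the model names are made up.

```python
def enhance_prompt(prompt: str) -> str:
    """Placeholder for the LLM step that rewrites a basic prompt."""
    return f"{prompt}, highly detailed, soft natural lighting"


def generate_image(model_name: str, prompt: str) -> dict:
    """Placeholder for a text-to-image call; returns a record, not pixels."""
    return {"model": model_name, "prompt": prompt}


def run_workflow(prompt: str, model_names: list[str]) -> list[dict]:
    # Step 1: enhance once. Step 2: fan out to every image model.
    improved = enhance_prompt(prompt)
    return [generate_image(name, improved) for name in model_names]


images = run_workflow("a lighthouse at dusk", ["image-model-a", "image-model-b"])
```

Enhancing the prompt once and reusing it keeps the outputs of the different models comparable, since they all receive the same text.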

Scene 10 (4m 13s)

[Audio] With our solution, you can easily switch components within an existing workflow. For example, you can use different models or switch between local and cloud platforms.
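As an illustration of swapping components without rewriting the workflow, the sketch below keeps the choices in a small immutable configuration and derives a variant from it. `WorkflowConfig` and all model and platform names are hypothetical, not taken from the library.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WorkflowConfig:
    """Hypothetical settings for a prompt-enhance-then-render workflow."""
    llm_model: str
    llm_platform: str
    image_model: str
    image_platform: str


baseline = WorkflowConfig(
    llm_model="llm-small",
    llm_platform="local",
    image_model="image-model-a",
    image_platform="local",
)

# Same workflow, different image model, rendered in the cloud instead.
variant = replace(baseline, image_model="image-model-b", image_platform="cloud")
```

`dataclasses.replace` returns a new object, so the baseline stays available for side-by-side comparison.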

Scene 11 (4m 31s)

[Audio] Have you ever attempted to replicate an image using a text-to-image AI? Here's a simple workflow. First, use an LLM to analyze an image and generate a text-to-image prompt. Then use the generated prompt to create images with a text-to-image AI, potentially with multiple models.

Scene 12 (4m 33s)

[Audio] This straightforward workflow enables batch editing of images. For instance, suppose we want to dress an avatar in various types of clothes and place it in a background scene that matches each style.
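A batch like this boils down to generating one edit request per outfit, each paired with a matching scene. The sketch below only builds the requests. The outfit-to-scene mapping, the file name `avatar.png`, and the function `build_edit_requests` are assumptions for illustration, and the actual image-editing call is left out.

```python
# Hypothetical pairing of each outfit with a background in the same style.
OUTFIT_SCENES = {
    "a business suit": "a modern office",
    "a ski jacket": "a snowy mountain slope",
    "a summer dress": "a sunny beach",
}


def build_edit_requests(avatar_path: str, outfit_scenes: dict[str, str]) -> list[dict]:
    """Create one image-edit request per outfit, reusing the same avatar."""
    return [
        {
            "image": avatar_path,
            "prompt": f"Dress the person in {outfit} and place them in {scene}.",
        }
        for outfit, scene in outfit_scenes.items()
    ]


requests = build_edit_requests("avatar.png", OUTFIT_SCENES)
```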

Scene 13 (4m 55s)

[Audio] A real-world scenario. Suppose you work in a digital agency and want to create a portfolio of car photographs, but the license plates are visible in your images. You want to automatically remove all the license plate numbers and replace them with advertising text (in this case, we've used the string "FLUX").
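The scenario can be pictured as a detect-then-edit loop over the image set. The following is a sketch under assumptions: `detect_plate` is a placeholder with a hard-coded result where a real workflow would call a vision model, and `Box`, `EditRequest`, the file names, and the coordinates are all invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Box:
    x: int
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class EditRequest:
    image_path: str
    region: Box
    instruction: str


def detect_plate(image_path: str) -> Optional[Box]:
    """Placeholder detector; returns None when no plate is found."""
    known = {"car_01.jpg": Box(120, 340, 200, 50)}   # made-up coordinates
    return known.get(image_path)


def build_edit_requests(image_paths: list[str], text: str = "FLUX") -> list[EditRequest]:
    requests = []
    for path in image_paths:
        box = detect_plate(path)
        if box is None:
            continue   # no visible plate, leave the image untouched
        instruction = f'Replace the license plate text with "{text}"'
        requests.append(EditRequest(path, box, instruction))
    return requests


plate_edits = build_edit_requests(["car_01.jpg", "car_02.jpg"])
```

Each resulting `EditRequest` would then be handed to an image-editing model, a step this sketch deliberately omits.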

Scene 14 (5m 21s)

[Audio] The project is still in its alpha phase. The first version is available on GitHub. Feel free to contribute ideas and suggestions. Thanks for watching this video.