Data Forge: Synthetic Data Generator

Scene 1 (0s)

[Audio] Hi everyone! I'm presenting Data Forge — an AI-powered synthetic data generator for Databricks. Let me show you how we're generating test data using just natural language.

Scene 2 (13s)

[Audio] Data Forge takes a source table, understands your requirements in plain English, and generates realistic synthetic data that matches your schema. It works across SQL Editor, Notebooks, and Workflows — and includes a global toggle to switch between production and synthetic data for stress testing.

Scene 3 (1m 8s)

[Audio] Under the hood, we use foundation models like Claude or GPT-4, with adaptive batching and automatic retry logic. The function integrates with Unity Catalog for full metadata awareness.
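
The talk does not show how the adaptive batching and retry logic are implemented, so the following is only a minimal sketch of one common way to combine the two. Everything here is an assumption for illustration: the function `generate_rows`, the `generate_batch` callback standing in for the model call, and the specific policy of halving the batch size on failure and doubling it again after a success.

```python
import time
from typing import Callable, Dict, List


def generate_rows(
    total_rows: int,
    generate_batch: Callable[[int], List[Dict]],  # hypothetical stand-in for the model call
    initial_batch_size: int = 50,
    min_batch_size: int = 5,
    max_retries: int = 3,
    backoff_seconds: float = 1.0,
) -> List[Dict]:
    """Collect total_rows rows, shrinking the batch on failure and retrying."""
    rows: List[Dict] = []
    batch_size = initial_batch_size
    failures = 0  # consecutive failures since the last successful batch

    while len(rows) < total_rows:
        request = min(batch_size, total_rows - len(rows))
        try:
            batch = generate_batch(request)
            if not batch:
                # Treat an empty response as a failure so the loop cannot spin forever.
                raise ValueError("model returned an empty batch")
        except Exception:
            failures += 1
            if failures > max_retries:
                raise
            # Assumed policy: halve the batch size and back off exponentially.
            batch_size = max(min_batch_size, batch_size // 2)
            time.sleep(backoff_seconds * 2 ** (failures - 1))
            continue

        failures = 0
        rows.extend(batch[:request])  # ignore any extra rows the model returned
        # Assumed policy: grow back toward the initial size after a success.
        batch_size = min(initial_batch_size, batch_size * 2)

    return rows
```

In this sketch a failed request is retried with a smaller batch rather than the same one, on the assumption that oversized requests are a likely cause of model errors or truncated output.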

Scene 4 (1m 21s)

[Audio] This slide shows how Data Forge uses Unity Catalog (UC) to understand the table metadata, processes the user request through Databricks Jobs, and writes the generated data to the specified target table.

Scene 5 (1m 33s)

[Audio] Four simple parameters: my requirements in plain English, the source table, the number of rows, and the output location. I'm asking for customer support conversations about Databricks AI features. Look at the output: realistic support tickets about MLflow errors, Model Serving latency, and AI/BI Genie issues. Each record is unique and contextually accurate. We've also embedded this directly into Unity Catalog. From any table, click to generate synthetic data, specify your requirements, and trigger a job, all without leaving the catalog.
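
The four parameters named in this scene could be modeled roughly as below. This is a sketch under stated assumptions, not Data Forge's actual interface: the class name `GenerationRequest`, its field names, and the example table names are hypothetical, and the output location is assumed to be a table. The only external fact relied on is that Unity Catalog addresses tables with a three-level `catalog.schema.table` name.

```python
from dataclasses import dataclass


def _is_three_level_name(name: str) -> bool:
    """True if name looks like catalog.schema.table with no empty parts."""
    parts = name.split(".")
    return len(parts) == 3 and all(part.strip() for part in parts)


@dataclass(frozen=True)
class GenerationRequest:  # hypothetical name, not the tool's real API
    requirements: str   # plain-English description of the data wanted
    source_table: str   # table whose schema the synthetic data should match
    num_rows: int       # number of synthetic rows to generate
    output_table: str   # where to write the result (assumed to be a table)

    def __post_init__(self) -> None:
        if not self.requirements.strip():
            raise ValueError("requirements must not be empty")
        if self.num_rows <= 0:
            raise ValueError("num_rows must be positive")
        for name in (self.source_table, self.output_table):
            if not _is_three_level_name(name):
                raise ValueError(f"expected catalog.schema.table, got {name!r}")


# Example with made-up table names mirroring the request in the demo.
request = GenerationRequest(
    requirements="Customer support conversations about Databricks AI features",
    source_table="main.support.tickets",
    num_rows=100,
    output_table="main.support.tickets_synthetic",
)
```

Validating the request up front means a malformed table name or row count is rejected before any job is triggered.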

Scene 6 (2m 12s)

[Audio] The job completes in about 3 minutes with full lineage tracking. Your synthetic data is governed and production-ready.

Scene 7 (2m 21s)

[Audio] Thank you!