Data Forge: Synthetic Data Generators

Scene 1 (0s)

[Audio] Hi everyone. This is Data Forge: Synthetic Data Generators.

Scene 2 (8s)

[Audio] Excited to introduce Data Forge, a tool that lets our customers use synthetic data to generate realistic datasets quickly and to speed up unit testing, stress testing, scalability testing, and innovation. Data Forge is available in two forms: as a built-in Unity Catalog synthetic data feature and as a scalable AI function. It uses Unity Catalog metadata, rules, and statistical data to generate high-fidelity datasets. It works across the SQL Editor, Notebooks, Jobs, and Workflows, and it includes a Global Synthetic Mode that lets customers run synthetic data alongside production data across all tables, scaling workloads 2×, 5×, or 10× to stay future-ready.
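
To make the idea of generating data from metadata and statistics concrete, here is a minimal Python sketch. It is not Data Forge's implementation: the `ColumnProfile` structure, the `synthesize` function, and the uniform sampling are all assumptions made for illustration, standing in for whatever metadata, rules, and statistics the real feature reads from Unity Catalog. The `scale` argument mirrors the 2×, 5×, or 10× scaling mentioned in the narration.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Union

Value = Union[int, float, str]


@dataclass
class ColumnProfile:
    """Hypothetical per-column summary (name, type, simple statistics)."""
    name: str
    dtype: str  # "int", "float", or "category"
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    categories: Optional[Sequence[str]] = None


def synthesize(profiles: List[ColumnProfile], base_rows: int,
               scale: int = 1, seed: int = 0) -> List[Dict[str, Value]]:
    """Generate base_rows * scale rows that respect each column profile."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    rows: List[Dict[str, Value]] = []
    for _ in range(base_rows * scale):
        row: Dict[str, Value] = {}
        for p in profiles:
            if p.dtype == "int":
                row[p.name] = rng.randint(int(p.min_value), int(p.max_value))
            elif p.dtype == "float":
                row[p.name] = rng.uniform(p.min_value, p.max_value)
            elif p.dtype == "category":
                row[p.name] = rng.choice(list(p.categories))
            else:
                raise ValueError(f"unsupported dtype: {p.dtype}")
        rows.append(row)
    return rows


# Illustrative profiles; the column names and ranges are made up.
profiles = [
    ColumnProfile("order_id", "int", min_value=1, max_value=1_000_000),
    ColumnProfile("amount", "float", min_value=0.0, max_value=500.0),
    ColumnProfile("status", "category", categories=["open", "closed"]),
]
rows_5x = synthesize(profiles, base_rows=100, scale=5)
```

A real generator would likely sample from richer statistics (distributions, null rates, correlations) than the uniform ranges assumed here.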

Scene 3 (1m 2s)

[Audio] This architecture slide shows how Unity Catalog (UC) is used to read the metadata, how the user request is processed through Databricks Jobs, and how the synthetic data is returned, so that the whole flow works as a built-in feature.

Scene 4 (1m 16s)

[Audio] Additionally, we are exposing it as a scalable AI function that uses foundation models like Claude or GPT-4, with adaptive batching and automatic retry logic.
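
The narration does not say how the adaptive batching and retry logic work, so the following Python sketch shows only one plausible reading: request rows in batches, halve the batch size and back off when a call fails, and grow the batch again after a success. The function name `generate_rows`, the `call_model` callback, and every default value are hypothetical, not part of Data Forge.

```python
import time
from typing import Callable, Dict, List


def generate_rows(
    total_rows: int,
    call_model: Callable[[int], List[Dict]],
    initial_batch: int = 50,
    min_batch: int = 1,
    max_retries: int = 5,
    base_delay: float = 0.5,
) -> List[Dict]:
    """Collect total_rows rows, shrinking the batch size after failures."""
    rows: List[Dict] = []
    batch = initial_batch
    while len(rows) < total_rows:
        size = min(batch, total_rows - len(rows))
        for attempt in range(max_retries):
            try:
                # Truncate in case the model returns more than requested.
                rows.extend(call_model(size)[:size])
                # Success: let the batch size grow back toward the initial value.
                batch = min(max(size, 1) * 2, initial_batch)
                break
            except RuntimeError:
                # Failure: wait with exponential backoff, then retry smaller.
                time.sleep(base_delay * (2 ** attempt))
                size = max(min_batch, size // 2)
                batch = size
        else:
            raise RuntimeError(f"batch failed after {max_retries} attempts")
    return rows
```

Here `call_model` stands in for a foundation-model request that raises `RuntimeError` on a timeout or rate limit; a production version would catch the specific exceptions of whichever client library is used.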

Scene 5 (1m 27s)

[Audio] Here's the AI function in action. It takes four simple parameters: my requirements in plain English, the source table, the number of rows, and the output location. I'm asking for customer support conversations about Databricks AI features. Look at the output: realistic support tickets about MLflow errors, Model Serving latency, and AI/BI Genie issues. Each record is unique and contextually accurate.
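
The four parameters can be pictured as a small request object. This is a sketch only: the class name `SyntheticDataRequest`, its field names, the validation rules, and the prompt wording are all assumptions, since the transcript does not show the function's actual signature. It also assumes the source table is given as a three-part `catalog.schema.table` name, and the table and output names in the example are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticDataRequest:
    """Hypothetical container for the four parameters described above."""
    requirements: str      # plain-English description of the data wanted
    source_table: str      # assumed form: catalog.schema.table
    num_rows: int          # how many synthetic rows to generate
    output_location: str   # where the generated rows should be written

    def __post_init__(self) -> None:
        if not self.requirements.strip():
            raise ValueError("requirements must not be empty")
        parts = self.source_table.split(".")
        if len(parts) != 3 or not all(parts):
            raise ValueError("source_table must look like catalog.schema.table")
        if self.num_rows <= 0:
            raise ValueError("num_rows must be positive")
        if not self.output_location.strip():
            raise ValueError("output_location must not be empty")

    def to_prompt(self) -> str:
        """Render the request as an instruction for a foundation model."""
        return (
            f"Generate {self.num_rows} synthetic rows matching the schema of "
            f"{self.source_table}. Requirements: {self.requirements}"
        )


# Example values are illustrative, not taken from the demo.
request = SyntheticDataRequest(
    requirements="Customer support conversations about Databricks AI features",
    source_table="main.support.tickets",
    num_rows=25,
    output_location="main.support.tickets_synthetic",
)
prompt = request.to_prompt()
```

Validating the request before any model call keeps malformed input from consuming model capacity or triggering a job.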

Scene 6 (1m 54s)

[Audio] We've also embedded this directly into Unity Catalog through a Synthetic Data button. From any table, you can click to generate synthetic data, specify your requirements, and trigger a job, all without leaving the catalog.

Scene 7 (2m 11s)

[Audio] Generate New Data Option. UC Synthetic Data Generator UI (Dropdown - Generate New Data).

Scene 8 (2m 18s)

[Audio] Submit Form. UC Synthetic Data Generator UI (Kick Off Data Generation).

Scene 9 (2m 26s)

[Audio] Job Triggered. Synthetic Data Generator from UC UI.

Scene 10 (2m 35s)

[Audio] Job Generation Insights. UC Synthetic Data Generator UI (Backend Job).

Scene 11 (2m 43s)

[Audio] Finally, the results are back. Databricks' Lakehouse-native Data Forge lets you generate realistic, PII-safe synthetic data directly in your catalog. It's fast, governed, and cost-efficient, and it supports unit tests, QA, load testing, schema validation, issue reproduction, and cross-team collaboration. You can scale synthetic workloads 2×, 5×, or 10× alongside production with full observability.

Scene 12 (3m 14s)

[Audio] That's Data Forge: realistic test data, where your data lives. Thank you!