Name: Data Forge: Synthetic Data Generator
Uploaded: 2025-12-15
Duration: 2 min 28 s
Description: Hi everyone! I'm presenting Data Forge — an AI-powered synthetic data generator for Databricks. Let me show you how we're generating test data using just natural language.

[Audio] Hi everyone! I'm presenting Data Forge — an AI-powered synthetic data generator for Databricks. Let me show you how we're generating test data using just natural language. 

Scene 1

[Audio] Data Forge takes a source table, understands your requirements in plain English, and generates realistic synthetic data that matches your schema. It works across SQL Editor, Notebooks, and Workflows — and includes a global toggle to switch between production and synthetic data for stress testing. 

Scene 2

High-Level Overview:
Data Forge is an AI-powered synthetic data generator built for the Lakehouse, leveraging Unity Catalog metadata, rules, and learned statistical distributions to produce high-fidelity synthetic data. It is available as a scalable AI Function (ai_synthetic_data_generate) and as Unity Catalog's built-in synthetic data feature.
Supported Interfaces: Databricks SQL Editor, Databricks Notebooks, Databricks Workflows / Jobs, Lakeflow Spark Declarative Pipelines
Key Capabilities:
Native SQL Extensions - AI-driven synthetic data generation using natural language.
UC Built-in Synthetic Data - Embedded inside Unity Catalog for discovering, generating, and managing synthetic data.
Rich Metadata - Access to within-organization metadata for context-aware synthetic data generation. It can be extended cross-org and cross-cloud through metadata publishers.
Global Synthetic Mode - Lets customers run production alongside synthetic "stress-test" data across all tables, scaling workloads 2x, 5x, or 10x to be future-ready.
Team: Kaushal Vachhani, Aishwarya Ghosh, Vidhi Khaitan, Pradeep Palaniswamy, Joel Robins, Naveen, Muthukumar Lakshmanan (Laks)
Category: ✨ Trials / New User Experience
Tech Stack: React + TypeScript, Scala, Databricks Jobs, Model Serving
GitHub Repository/Architectural/Documentation Links: GitHub Link, Confluence Wiki Link
POC Stack: Python/PySpark/HTML, Model Serving
POC Scope: Demonstrate Databricks Lakehouse Synthetic Data Generator capabilities in Notebooks and the Unity Catalog UI
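
The Global Synthetic Mode capability listed above mentions scaling workloads 2x, 5x, or 10x. As a rough illustration of the arithmetic only, the sketch below computes how many synthetic rows would have to be added to reach a target scale. The function name `synthetic_rows_needed` is hypothetical, and so is the assumption that the scale factor applies to the total row count (production plus synthetic). Neither is taken from the Data Forge implementation.

```python
def synthetic_rows_needed(production_rows: int, scale: int) -> int:
    """Synthetic rows to add so a table holds `scale` times its production volume.

    Assumption (not from the source): the scale is relative to the total row
    count, so 2x means production rows plus an equal number of synthetic rows.
    """
    if production_rows < 0:
        raise ValueError("production_rows must be non-negative")
    if scale < 1:
        raise ValueError("scale must be at least 1")
    return production_rows * (scale - 1)


# The three scale factors named on the slide, applied to a made-up table size.
for factor in (2, 5, 10):
    print(f"{factor}x -> add {synthetic_rows_needed(1_000_000, factor)} synthetic rows")
```

Under this reading, a 10x stress test of a one-million-row table needs nine million synthetic rows alongside the production data.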

[Audio] Under the hood, we use foundation models like Claude or GPT-4, with adaptive batching and automatic retry logic. The function integrates with Unity Catalog for full metadata awareness.
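
The narration does not say how the adaptive batching and retry logic work, so the following is only a minimal sketch of one common approach: request rows in batches, halve the batch size after a failure, and give up after a fixed number of consecutive failures. Every name here (`generate_in_batches`, `fake_model`, the parameters) is hypothetical, and `fake_model` is a stand-in for a real foundation-model call.

```python
import time
from typing import Callable, Dict, List


def generate_in_batches(
    total_rows: int,
    generate_batch: Callable[[int], List[Dict]],
    initial_batch_size: int = 100,
    min_batch_size: int = 1,
    max_retries: int = 3,
    backoff_seconds: float = 0.0,
) -> List[Dict]:
    """Collect `total_rows` rows, shrinking the batch size whenever a call fails."""
    rows: List[Dict] = []
    batch_size = initial_batch_size
    consecutive_failures = 0
    while len(rows) < total_rows:
        request_size = min(batch_size, total_rows - len(rows))
        try:
            batch = generate_batch(request_size)
            if not batch:
                raise RuntimeError("model returned an empty batch")
        except Exception:
            consecutive_failures += 1
            if consecutive_failures > max_retries:
                raise
            # Adaptive step: retry with half the batch size, with exponential backoff.
            batch_size = max(min_batch_size, request_size // 2)
            time.sleep(backoff_seconds * 2 ** (consecutive_failures - 1))
            continue
        consecutive_failures = 0
        rows.extend(batch[:request_size])
    return rows


def fake_model(n: int) -> List[Dict]:
    """Stand-in for a model call; it rejects batches larger than 25 rows."""
    if n > 25:
        raise RuntimeError("batch too large")
    return [{"id": i} for i in range(n)]


rows = generate_in_batches(60, fake_model, initial_batch_size=100)
print(len(rows))
```

This sketch only ever shrinks the batch size. A production version might also grow it again after a run of successes, but nothing in the talk confirms either behavior.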

Scene 3

[Audio] This slide explains how UC is leveraged to understand the metadata, process the user request via Databricks Jobs, and return the data to the specified target table.

Scene 4

[image] Diagram labels: Admin, Databricks Job, Submit, Synthetic Data Generate, Send result, Write, Audit log, Invokes AI Functions, Cloud Storage, Return data

[Audio] Four simple parameters — my requirements in plain English, the source table, number of rows, and output location.
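
The four parameters named above can be pictured as a small request object. This is a sketch only: the class name, field names, and validation rules are hypothetical and do not reflect the actual signature of ai_synthetic_data_generate, which the talk does not show. The one grounded detail is that Unity Catalog tables use a three-level catalog.schema.table name. The table names in the example are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticDataRequest:
    """Hypothetical container for the four inputs described in the demo."""

    requirements: str     # plain-English description of the data wanted
    source_table: str     # table whose schema the output should match
    num_rows: int         # number of synthetic rows to generate
    output_location: str  # where the generated data should be written

    def __post_init__(self) -> None:
        if not self.requirements.strip():
            raise ValueError("requirements must not be empty")
        # Assumed check: Unity Catalog names have three parts, catalog.schema.table.
        if self.source_table.count(".") != 2:
            raise ValueError("source_table should look like catalog.schema.table")
        if self.num_rows <= 0:
            raise ValueError("num_rows must be positive")
        if not self.output_location.strip():
            raise ValueError("output_location must not be empty")


request = SyntheticDataRequest(
    requirements="Customer support conversations about Databricks AI features",
    source_table="main.support.tickets",                # made-up table name
    num_rows=100,
    output_location="main.support.tickets_synthetic",   # made-up target
)
print(request)
```

Validating the inputs before a job is submitted keeps obviously malformed requests from consuming model calls.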
I'm asking for customer support conversations about Databricks AI features. Look at the output: realistic support tickets about MLflow errors, Model Serving latency, and AI/BI Genie issues. Each record is unique and contextually accurate.
We've also embedded this directly into Unity Catalog. From any table, click to generate synthetic data, specify your requirements, and trigger a job, all without leaving the catalog.

Scene 5

[Audio] The job completes in about 3 minutes with full lineage tracking. Your synthetic data is governed and production-ready. 

Scene 6

Scene 7

Data Forge: Synthetic Data Generator

Scene 1 (0s)

Scene 2 (13s)

Scene 3 (1m 8s)

Scene 4 (1m 21s)

Scene 5 (1m 33s)

Scene 6 (2m 12s)

Scene 7 (2m 21s)