Generate Big Data

Infoveave Data Automation — Generation

You have 500 real transactions. You need 1 million rows to load test the analytics pipeline. Set Expansion Factor to 2000 to generate 1 million synthetic rows (500 × 2,000) from your real data patterns.

Realistic large-scale test data is one of the hardest problems in data engineering and analytics performance work. Generating synthetic datasets that look like real business data for load testing, pipeline benchmarking, ML model training augmentation, or capacity planning typically requires scripted data generators using libraries like Faker, SDV, or custom scripts that must be maintained as the data schema evolves. Generate Big Data takes a different approach: it starts from a real dataset that already has the right schema, value distributions, and domain context, and multiplies the row count by a configured factor. The key column is auto-incremented across synthetic rows to maintain uniqueness for join operations, while other column values are varied to produce diverse synthetic patterns. The result is a large synthetic dataset rooted in real data patterns, generated in one pipeline step without any scripting.

Input: Tabular dataset of any size that will be expanded by replicating its rows a configured number of times, with optional key column uniqueness enforcement and optional inclusion of original rows alongside the synthetic ones

Output: Tabular dataset containing synthetic rows generated by replicating and varying the input dataset according to the configured expansion factor, with the key column auto-incremented to maintain uniqueness across synthetic rows

What Generate Big Data does

Multiply a dataset's row count by a configured expansion factor to generate synthetic data for load testing, performance benchmarking, model training, and pipeline capacity validation in Infoveave. Scale a small reference dataset into a large synthetic dataset with key column uniqueness preserved, without scripting data generators or maintaining Faker-based generator scripts.

When to use Generate Big Data

  • You need a large volume of realistic test data for load testing a data pipeline, analytics query engine, or reporting system against expected production row counts, and you have a small sample of real data with the right schema and value distribution
  • You are benchmarking a data warehouse query or an ETL pipeline's throughput and need to generate millions of rows that match the production data schema to measure performance under realistic data volume conditions
  • You are training or fine-tuning a machine learning model that requires more rows than the available labeled dataset provides and can benefit from synthetic row augmentation that preserves the schema and value patterns of the real data
  • You need reproducible large test datasets for QA or integration testing environments that must mirror production schema without using actual production data, and the synthetic variation from expansion is sufficient for the test scenarios

When to avoid it

  • You need statistically rigorous synthetic data that exactly matches the real data's statistical distribution, co-occurrence patterns, and correlation structure — Generate Big Data produces pattern-varied rows but is not a statistical synthesizer; use a dedicated SDV or statistical data synthesis tool for distribution-accurate synthetic data
  • You need to generate only a few new records with specific values — this transformation is for bulk volume expansion; for targeted row creation with controlled values, use direct data entry or append steps
  • You need synthetic data for production deployment scenarios where incorrect data could cause business impact — generated synthetic data is for development, testing, and training environments only

Where it fits in your Infoveave automation

Generate Big Data is one step inside a multi-step Infoveave workflow. Chain it with other activities — no code, no manual hand-offs.

  • Prepare: Load and clean a real production-like sample dataset with the correct schema, representative value distribution, and all columns present that the downstream system expects
  • Generate (you are here): Configure the expansion factor, key column for uniqueness, and whether to include original rows in the expanded output
  • Load into Test Environment: Write the synthetic dataset to the test environment's data store, analytics engine, or model training dataset; keep synthetic data isolated from production systems
  • Benchmark: Run the pipeline, query, dashboard, or model training against the synthetic dataset at the target volume to measure throughput, latency, and capacity behavior

Build this workflow visually in Infoveave Data Automation — drag, connect, and schedule with no infrastructure setup.

How teams use Generate Big Data

Real scenarios where this transformation saves hours of manual work.

Retail

Expand 10,000 Historical Orders to 5 Million Rows for Dashboard Load Testing

A retail analytics team builds a dashboard on top of an order analytics pipeline that will eventually process over 5 million orders per year. Before production deployment, the team needs to load test the dashboard with realistic order volume. Generate Big Data uses a 10,000-row sample of real historical orders with an Expansion Factor of 500, producing a 5 million-row synthetic order dataset. The test validates query performance and chart render times at production scale.

Manufacturing

Generate Synthetic Sensor Records for Pipeline Capacity Benchmarking

A manufacturing IoT integration team builds a pipeline that will process sensor readings from 800 machines at 1-minute intervals — approximately 1.2 million readings per day. The team has 5,000 real sensor readings from 10 machines in the development environment. Generate Big Data applies an Expansion Factor of 240 to produce 1.2 million synthetic readings for a full-day pipeline throughput benchmark, validating that ingestion, transformation, and storage steps complete within the production time window.

Finance

Augment Small Transaction Dataset for Fraud Detection Model Training

A bank's data science team is training a fraud detection model but has only 8,000 labeled transactions from the pilot program. The model needs at least 500,000 training samples for the target architecture. Generate Big Data expands the 8,000-row labeled dataset by a factor of 63 (500,000 / 8,000 = 62.5, rounded up), producing 504,000 synthetic rows. The Key Column is set to TransactionID, which is auto-incremented across synthetic rows to prevent key collisions. The expanded dataset is used for model training augmentation alongside a validation split from the original rows.

See Generate Big Data in action

The input data is transformed using the configuration below. The resulting output table is ready for dashboards or downstream steps.

Expansion Factor: 3
Key Column: OrderID
Include Original: No

Input Data

OrderID | CustomerName  | ProductCategory | Amount
ORD001  | Alice Johnson | Electronics     | 1250.00
ORD002  | Bob Smith     | Clothing        | 89.50
ORD003  | Carol Davis   | Home & Garden   | 450.00

Output Data

OrderID     | CustomerName  | ProductCategory | Amount
ORD001_syn1 | Alice Johnson | Electronics     | 1287.50
ORD002_syn1 | Bob Smith     | Clothing        | 91.20
ORD003_syn1 | Carol Davis   | Home & Garden   | 462.00
ORD001_syn2 | Alice Johnson | Electronics     | 1218.75
ORD002_syn2 | Bob Smith     | Clothing        | 87.30
ORD003_syn2 | Carol Davis   | Home & Garden   | 441.00
ORD001_syn3 | Alice Johnson | Electronics     | 1301.25
ORD002_syn3 | Bob Smith     | Clothing        | 93.00
ORD003_syn3 | Carol Davis   | Home & Garden   | 472.50
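
For readers who want to reason about the shape of this output outside Infoveave, the example can be approximated in a few lines of Python. This is a minimal sketch and not Infoveave's implementation: the function name `expand_rows`, the `_syn<n>` key suffix (copied from the sample output), and the rule of varying float columns by up to ±5% are all assumptions made for illustration, so the generated amounts will not match the table above.

```python
import random


def expand_rows(rows, key_column, factor, include_original=False, jitter=0.05, seed=0):
    """Replicate each row `factor` times with a unique key and varied float values.

    Hypothetical helper: the suffix scheme and the jitter rule are assumptions.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    output = [dict(row) for row in rows] if include_original else []
    for copy_number in range(1, factor + 1):
        for row in rows:
            synthetic = dict(row)
            # Suffix the key so every synthetic copy stays unique.
            synthetic[key_column] = f"{row[key_column]}_syn{copy_number}"
            for column, value in row.items():
                # Vary float columns by up to +/- jitter (assumed variation rule).
                if column != key_column and isinstance(value, float):
                    synthetic[column] = round(value * (1 + rng.uniform(-jitter, jitter)), 2)
            output.append(synthetic)
    return output


orders = [
    {"OrderID": "ORD001", "CustomerName": "Alice Johnson", "ProductCategory": "Electronics", "Amount": 1250.00},
    {"OrderID": "ORD002", "CustomerName": "Bob Smith", "ProductCategory": "Clothing", "Amount": 89.50},
    {"OrderID": "ORD003", "CustomerName": "Carol Davis", "ProductCategory": "Home & Garden", "Amount": 450.00},
]

expanded = expand_rows(orders, key_column="OrderID", factor=3)
```

With three input rows and a factor of 3, `expanded` holds nine rows ordered the same way as the sample output: all first copies, then all second copies, then all third copies.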

Configuration

Key fields to configure in the Infoveave workflow builder. Full reference available in the documentation.

Expansion Factor

Enter the integer multiplier that determines how many synthetic row copies are generated from the input dataset. An Expansion Factor of 100 applied to a 1,000-row dataset produces 100,000 synthetic rows. The total output row count (excluding originals) is input rows times the expansion factor.
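
When the starting point is a target row count rather than a factor, the factor can be derived by dividing and rounding up. The sketch below shows that arithmetic; the function name `required_expansion_factor` is hypothetical and is not an Infoveave setting.

```python
def required_expansion_factor(input_rows, target_rows):
    """Smallest integer factor whose synthetic output reaches at least target_rows."""
    if input_rows <= 0:
        raise ValueError("input_rows must be positive")
    # Integer ceiling division avoids floating-point rounding on large counts.
    return (target_rows + input_rows - 1) // input_rows


factor = required_expansion_factor(input_rows=8_000, target_rows=500_000)
synthetic_rows = 8_000 * factor
```

For the 8,000-row fraud dataset this gives a factor of 63 and 504,000 synthetic rows; for 500 transactions and a 1 million-row target it gives 2000.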

Key Column

Select the column that serves as the uniqueness identifier. Across synthetic rows, the key column values are auto-incremented or modified to prevent key collisions between synthetic copies. This ensures the key column remains unique across the expanded dataset, enabling downstream joins and deduplication steps to function correctly.
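
Before joining the expanded dataset to other tables, it can be worth confirming that the key column really is unique. The following is a small sketch of such a check on exported rows, assuming they are available as a list of dictionaries; `duplicate_keys` is a hypothetical helper, not part of Infoveave.

```python
from collections import Counter


def duplicate_keys(rows, key_column):
    """Return the sorted key values that appear more than once in rows."""
    counts = Counter(row[key_column] for row in rows)
    return sorted(key for key, occurrences in counts.items() if occurrences > 1)


sample = [
    {"OrderID": "ORD001_syn1"},
    {"OrderID": "ORD002_syn1"},
    {"OrderID": "ORD001_syn2"},
]
collisions = duplicate_keys(sample, "OrderID")
```

An empty result means every key is unique and joins on that column will not fan out.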

Include Original

When enabled, the output dataset contains both the original input rows and the synthetic generated rows — the total row count is input rows plus (input rows times expansion factor). When disabled, the output contains only the synthetic rows without the originals. Use Include Original when the real rows are needed as a validation baseline alongside the synthetic volume.
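
The row count rule for both settings can be written as a short function, which is handy when sizing a test environment in advance. This is a sketch of the arithmetic described above; the function name `total_output_rows` is an assumption.

```python
def total_output_rows(input_rows, expansion_factor, include_original):
    """Output row count for a given expansion factor and Include Original setting."""
    synthetic_rows = input_rows * expansion_factor
    # Include Original adds the input rows on top of the synthetic ones.
    return synthetic_rows + input_rows if include_original else synthetic_rows


without_originals = total_output_rows(3, 3, include_original=False)
with_originals = total_output_rows(3, 3, include_original=True)
```

For the three-row example with a factor of 3, the output has 9 rows with Include Original disabled and 12 rows with it enabled.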

Part of Infoveave Data Automation

80+ transformations. Zero manual steps.

Generate Big Data is one of over 80 transformation activities available inside Infoveave workflows. Chain transformations together — no code, no exports, no waiting for IT.

© 2026 Noesys Software Pvt Ltd

Infoveave® is a product of Noesys

All Rights Reserved