Infoveave Data Automation — Generation
You have 500 real transactions. You need 1 million rows to load test the analytics pipeline. Set Expansion Factor to 2000 to generate 1 million synthetic rows from your real data patterns.
Producing realistic large-scale test data is one of the hardest problems in data engineering and analytics performance work. Synthetic datasets that look like real business data, whether for load testing, pipeline benchmarking, ML training augmentation, or capacity planning, typically come from scripted generators built on libraries like Faker or SDV, or from custom scripts, all of which must be maintained as the data schema evolves. Generate Big Data takes a different approach: it starts from a real dataset that already has the right schema, value distributions, and domain context, and multiplies the row count by a configured factor. The key column is auto-incremented across synthetic rows to maintain uniqueness for join operations, while other column values are varied to produce diverse synthetic patterns. The result is a large synthetic dataset rooted in real data patterns, generated in one pipeline step without any scripting.
Multiply a dataset's row count by a configured expansion factor to generate synthetic data for load testing, performance benchmarking, model training, and pipeline capacity validation in Infoveave. Scale a small reference dataset into a large synthetic dataset with key column uniqueness preserved, without scripting data generators or maintaining Faker-based scripts.
Generate Big Data is one step inside a multi-step Infoveave workflow. Chain it with other activities — no code, no manual hand-offs.
Build this workflow visually in Infoveave Data Automation — drag, connect, and schedule with no infrastructure setup.
Real scenarios where this transformation saves hours of manual work.
A retail analytics team builds a dashboard on top of an order analytics pipeline that will eventually process over 5 million orders per year. Before production deployment, the team needs to load test the dashboard with realistic order volume. Generate Big Data uses a 10,000-row sample of real historical orders with an Expansion Factor of 500, producing a 5 million-row synthetic order dataset. The test validates query performance and chart render times at production scale.
A manufacturing IoT integration team builds a pipeline that will process sensor readings from 800 machines at 1-minute intervals — approximately 1.2 million readings per day. The team has 5,000 real sensor readings from 10 machines in the development environment. Generate Big Data applies an Expansion Factor of 240 to produce 1.2 million synthetic readings for a full-day pipeline throughput benchmark, validating that ingestion, transformation, and storage steps complete within the production time window.
A bank's data science team is training a fraud detection model but has only 8,000 labeled transactions from the pilot program. The target architecture needs at least 500,000 training samples. Generate Big Data expands the 8,000-row labeled dataset by a factor of 63 (500,000 / 8,000 = 62.5, rounded up), producing 504,000 synthetic rows. The Key Column is set to TransactionID, which auto-increments across synthetic rows to prevent key collisions. The expanded dataset is used for model training augmentation alongside a validation split from the original rows.
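The Expansion Factor in each scenario above is the target row count divided by the sample size, rounded up to the next integer. A minimal Python sketch of that arithmetic follows; the helper name `required_expansion_factor` is hypothetical and is not part of Infoveave.

```python
def required_expansion_factor(input_rows: int, target_rows: int) -> int:
    """Smallest integer factor such that input_rows * factor >= target_rows."""
    if input_rows <= 0 or target_rows <= 0:
        raise ValueError("row counts must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (target_rows + input_rows - 1) // input_rows


# (sample rows, target synthetic rows) taken from the scenarios above
scenarios = {
    "retail orders": (10_000, 5_000_000),
    "sensor readings": (5_000, 1_200_000),
    "fraud training set": (8_000, 500_000),
}

for name, (rows, target) in scenarios.items():
    factor = required_expansion_factor(rows, target)
    print(f"{name}: factor {factor} -> {rows * factor:,} synthetic rows")
```

For the fraud detection case this gives a factor of 63 and 504,000 synthetic rows, slightly above the 500,000-row target.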
The input data is transformed using the configuration shown here. The output table is ready for dashboards or downstream steps.
Configuration: Expansion Factor = 3, Key Column = OrderID, Include Original = No
Input Data
| OrderID | CustomerName | ProductCategory | Amount |
|---|---|---|---|
| ORD001 | Alice Johnson | Electronics | 1250.00 |
| ORD002 | Bob Smith | Clothing | 89.50 |
| ORD003 | Carol Davis | Home & Garden | 450.00 |
Output Data
| OrderID | CustomerName | ProductCategory | Amount |
|---|---|---|---|
| ORD001_syn1 | Alice Johnson | Electronics | 1287.50 |
| ORD002_syn1 | Bob Smith | Clothing | 91.20 |
| ORD003_syn1 | Carol Davis | Home & Garden | 462.00 |
| ORD001_syn2 | Alice Johnson | Electronics | 1218.75 |
| ORD002_syn2 | Bob Smith | Clothing | 87.30 |
| ORD003_syn2 | Carol Davis | Home & Garden | 441.00 |
| ORD001_syn3 | Alice Johnson | Electronics | 1301.25 |
| ORD002_syn3 | Bob Smith | Clothing | 93.00 |
| ORD003_syn3 | Carol Davis | Home & Garden | 472.50 |
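To reason about the shape of this output outside Infoveave, the example can be approximated in a few lines of Python. This is only an illustrative sketch, not Infoveave's algorithm. The function name `expand_rows`, the `_synN` key suffix (copied from the table above), and the ±5% jitter on numeric values are all assumptions, since the page does not specify how non-key values are varied.

```python
import random


def expand_rows(rows, key_column, expansion_factor,
                include_original=False, jitter=0.05, seed=0):
    """Return synthetic copies of `rows` with unique keys (illustrative only)."""
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    out = [dict(r) for r in rows] if include_original else []
    for copy_no in range(1, expansion_factor + 1):
        for row in rows:
            synthetic = dict(row)
            # Assumption: keys are made unique with a per-copy suffix.
            synthetic[key_column] = f"{row[key_column]}_syn{copy_no}"
            for col, value in row.items():
                # Assumption: float columns are varied by up to +/- jitter.
                if col != key_column and isinstance(value, float):
                    scale = 1 + rng.uniform(-jitter, jitter)
                    synthetic[col] = round(value * scale, 2)
            out.append(synthetic)
    return out


orders = [
    {"OrderID": "ORD001", "CustomerName": "Alice Johnson",
     "ProductCategory": "Electronics", "Amount": 1250.00},
    {"OrderID": "ORD002", "CustomerName": "Bob Smith",
     "ProductCategory": "Clothing", "Amount": 89.50},
    {"OrderID": "ORD003", "CustomerName": "Carol Davis",
     "ProductCategory": "Home & Garden", "Amount": 450.00},
]

expanded = expand_rows(orders, key_column="OrderID", expansion_factor=3)
for row in expanded:
    print(row["OrderID"], row["Amount"])
```

With three input rows and a factor of 3, the sketch yields nine rows in the same order as the table, each with a distinct OrderID. The amounts will differ from the table because the jitter is random.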
Key fields to configure in the Infoveave workflow builder. Full reference available in the documentation.
Expansion Factor
Enter the integer multiplier that determines how many synthetic copies of each input row are generated. An Expansion Factor of 100 applied to a 1,000-row dataset produces 100,000 synthetic rows. The total output row count (excluding originals) is input rows times the expansion factor.
Key Column
Select the column that serves as the uniqueness identifier. Across synthetic rows, the key column values are auto-incremented or modified to prevent key collisions between synthetic copies. This ensures the key column remains unique across the expanded dataset, enabling downstream joins and deduplication steps to function correctly.
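Because downstream joins and deduplication depend on this guarantee, it can be worth confirming key uniqueness on an exported sample of the expanded dataset. The following is a small sketch using only the Python standard library; the function name `duplicate_keys` and the sample rows are hypothetical.

```python
from collections import Counter


def duplicate_keys(rows, key_column):
    """Return the key values that appear more than once, sorted."""
    counts = Counter(row[key_column] for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical expanded output with modified keys: no collisions expected.
expanded = [
    {"OrderID": "ORD001_syn1"},
    {"OrderID": "ORD002_syn1"},
    {"OrderID": "ORD001_syn2"},
]
print(duplicate_keys(expanded, "OrderID"))  # prints []

# A naive copy that leaves keys untouched would collide.
naive = [{"OrderID": "ORD001"}, {"OrderID": "ORD001"}]
print(duplicate_keys(naive, "OrderID"))  # prints ['ORD001']
```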
Include Original
When enabled, the output dataset contains both the original input rows and the synthetic generated rows — the total row count is input rows plus (input rows times expansion factor). When disabled, the output contains only the synthetic rows without the originals. Use Include Original when the real rows are needed as a validation baseline alongside the synthetic volume.
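The row count rules for Expansion Factor and Include Original can be written as a single formula. The Python sketch below restates it; the function name `output_row_count` is hypothetical and exists only to make the arithmetic concrete.

```python
def output_row_count(input_rows: int, expansion_factor: int,
                     include_original: bool) -> int:
    """Total output rows for a given configuration."""
    synthetic = input_rows * expansion_factor
    # Originals are added on top of the synthetic rows when enabled.
    return synthetic + input_rows if include_original else synthetic


# 1,000 input rows with an Expansion Factor of 100
print(output_row_count(1_000, 100, include_original=False))  # prints 100000
print(output_row_count(1_000, 100, include_original=True))   # prints 101000
```

When sizing a load test against a fixed target, remember that enabling Include Original adds one extra copy of the input on top of the synthetic rows.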
Part of Infoveave Data Automation
Generate Big Data is one of over 80 transformation activities available inside Infoveave workflows. Chain transformations together — no code, no exports, no waiting for IT.