Download ready-to-use CSV datasets curated for manufacturing, retail, supply chain, healthcare, banking, energy, and operations. Each file is lightweight, realistically structured, and designed to import straight into Infoveave — no setup required.
Data disclaimer: These are synthetic datasets inspired by publicly available industry data patterns. They are not generated or owned by Infoveave — all values are representative samples created for demonstration and learning purposes only. No real organisations, individuals, or transactions are represented.
Showing 12 datasets
25k daily POS transactions across 20 stores, 5 regions, and 8 product categories over two years. Rich time-series structure for trend decomposition, regional comparisons, and seasonal pattern analysis.
50k online orders over two years with product categories, shipping methods, return flags, and order status. Includes intentional dirty data: inconsistent category casing and ~5% null shipping_method values.
Inspired by: Kaggle · Brazilian E-Commerce (Olist)
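The two dirty-data issues described for the online orders file (inconsistent category casing and null shipping_method values) could be handled along these lines. This is a minimal standard-library sketch on an inline sample, not the actual file: the `order_id` and `category` column names, the sample values, and the `"Unknown"` fill value are all assumptions.

```python
import csv
import io

# Tiny inline sample standing in for the real file. Only shipping_method is
# named in the dataset description; the other column names are assumptions.
SAMPLE = """order_id,category,shipping_method
1001,Electronics,Standard
1002,electronics,
1003,HOME & GARDEN,Express
1004,Home & Garden,
"""

def clean_orders(rows, fill_value="Unknown"):
    """Normalise category casing and fill blank shipping_method values."""
    cleaned = []
    for row in rows:
        fixed = dict(row)
        # Collapse casing variants ("electronics", "ELECTRONICS") to one label.
        fixed["category"] = row["category"].strip().title()
        # The csv module reads a missing value as an empty string.
        if not row["shipping_method"].strip():
            fixed["shipping_method"] = fill_value
        cleaned.append(fixed)
    return cleaned

orders = clean_orders(csv.DictReader(io.StringIO(SAMPLE)))
categories = sorted({o["category"] for o in orders})
null_filled = sum(o["shipping_method"] == "Unknown" for o in orders)
print(categories, null_filled)
```

Filling nulls with a placeholder keeps the row count intact for later aggregation; dropping the rows or imputing the most common method are equally reasonable choices depending on the analysis.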
30k shift records across 20 machines over 500 days. Availability, performance, and quality sub-scores plus defect counts at the machine-shift level, ideal for pivot/aggregation to machine-level, line-level, or daily rollups.
Inspired by: Kaggle · Manufacturing OEE Dataset
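As a rough illustration of the machine-level rollup mentioned above, the sketch below combines the three sub-scores using the standard OEE definition (availability × performance × quality) and averages the result per machine. The row layout, machine IDs, and values are hypothetical, and the unweighted mean is a simplification; weighting by planned production time would usually be more accurate.

```python
from collections import defaultdict

# Hypothetical shift-level rows: (machine_id, availability, performance, quality),
# each sub-score expressed as a fraction between 0 and 1.
shifts = [
    ("M01", 0.90, 0.80, 1.00),
    ("M01", 0.80, 0.50, 1.00),
    ("M02", 0.50, 1.00, 0.50),
]

def oee(availability, performance, quality):
    """Standard OEE definition: the product of the three sub-scores."""
    return availability * performance * quality

def machine_oee(rows):
    """Unweighted mean of shift-level OEE for each machine."""
    per_machine = defaultdict(list)
    for machine_id, a, p, q in rows:
        per_machine[machine_id].append(oee(a, p, q))
    return {m: sum(v) / len(v) for m, v in per_machine.items()}

rollup = machine_oee(shifts)
for machine_id, value in sorted(rollup.items()):
    print(f"{machine_id}: {value:.2f}")
```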
20k batch inspection records with intentional DQ issues: ~3% duplicate batch IDs, null defect_type values, outlier defect_counts, and non-standard qc_result entries. Designed for data profiling and DQ rule validation.
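A first profiling pass over the issues listed above might look like the following sketch. It counts duplicate batch IDs, empty defect_type values, and qc_result entries outside an allowed set. The `batch_id` column name, the sample rows, and the `PASS`/`FAIL` vocabulary are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical rows; batch_id and the allowed qc_result values are assumptions.
records = [
    {"batch_id": "B001", "defect_type": "scratch", "qc_result": "PASS"},
    {"batch_id": "B002", "defect_type": "", "qc_result": "FAIL"},
    {"batch_id": "B002", "defect_type": "dent", "qc_result": "pass "},
    {"batch_id": "B003", "defect_type": "", "qc_result": "OK"},
]
ALLOWED_QC = {"PASS", "FAIL"}

def profile(rows):
    """Count duplicate IDs, null defect types, and non-standard QC results."""
    id_counts = Counter(r["batch_id"] for r in rows)
    return {
        "duplicate_ids": sorted(i for i, n in id_counts.items() if n > 1),
        "null_defect_type": sum(1 for r in rows if not r["defect_type"]),
        "bad_qc_result": [r["qc_result"] for r in rows
                          if r["qc_result"] not in ALLOWED_QC],
    }

report = profile(records)
print(report)
```

The check on qc_result is deliberately strict (no trimming or case folding), so that values such as `"pass "` surface as rule violations before any cleaning step normalises them.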
30k shipment records with order date, dispatch, expected and actual delivery, carrier, weight, freight cost, and on-time flag across 10 origin/destination city pairs. Suited for SLA monitoring and pivot by carrier × region.
Inspired by: Kaggle · SCMS Delivery History
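For the SLA monitoring use case, an on-time rate per carrier can be derived by comparing actual against expected delivery dates. The sketch below assumes a simple tuple layout and made-up carrier names, and treats a shipment as on time when it arrives on or before the expected date; the file's own on-time flag may use a different rule.

```python
from collections import defaultdict
from datetime import date

# Hypothetical shipments: (carrier, expected_delivery, actual_delivery).
shipments = [
    ("CarrierA", date(2024, 1, 10), date(2024, 1, 9)),
    ("CarrierA", date(2024, 1, 12), date(2024, 1, 14)),
    ("CarrierB", date(2024, 1, 11), date(2024, 1, 11)),
    ("CarrierB", date(2024, 1, 15), date(2024, 1, 15)),
]

def on_time_rate_by_carrier(rows):
    """Share of shipments per carrier delivered on or before the expected date."""
    totals = defaultdict(int)
    on_time = defaultdict(int)
    for carrier, expected, actual in rows:
        totals[carrier] += 1
        if actual <= expected:
            on_time[carrier] += 1
    return {c: on_time[c] / totals[c] for c in totals}

rates = on_time_rate_by_carrier(shipments)
print(rates)
```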
15k daily SKU-level stock snapshots across 300 SKUs and 5 warehouses. Includes stock value and a below-reorder flag — wide structure designed for unpivoting, reshaping to trend series, and warehouse-level aggregation.
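The unpivot step mentioned above can be sketched as follows. The layout is an assumption: the sample uses one stock column per warehouse (`WH1`, `WH2`), which may not match the actual file, and the output names `warehouse` and `stock_qty` are illustrative.

```python
import csv
import io

# Hypothetical wide layout: one stock column per warehouse.
WIDE = """snapshot_date,sku,WH1,WH2
2024-01-01,SKU-001,40,15
2024-01-02,SKU-001,35,20
"""

def unpivot(rows, id_cols, var_name, value_name):
    """Turn every non-identifier column into its own (variable, value) row."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            if col in id_cols:
                continue
            out = {k: row[k] for k in id_cols}
            out[var_name] = col
            out[value_name] = int(value)
            long_rows.append(out)
    return long_rows

long_rows = unpivot(
    csv.DictReader(io.StringIO(WIDE)),
    id_cols=("snapshot_date", "sku"),
    var_name="warehouse",
    value_name="stock_qty",
)
for r in long_rows:
    print(r)
```

In pandas the same reshape is usually done with `DataFrame.melt`, passing the identifier columns as `id_vars`.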
25k hospital admission records with three deliberate DQ issues: mixed date formats (ISO/US/UK), ~8% null age_group, and ~2% duplicate admission IDs. A realistic scenario for date standardisation and null-imputation workflows.
Inspired by: Kaggle · Hospital Patient Records
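Standardising mixed ISO/US/UK dates is harder than it looks, because a value such as 03/04/2024 is valid in both the US and UK conventions. The sketch below assumes ISO dates use hyphens and that US and UK dates both use slashes with a four-digit year (an assumption about the file, not a documented fact). It resolves slash dates only when one part exceeds 12 and returns `None` for anything ambiguous so those rows can be reviewed separately.

```python
from datetime import date, datetime

def parse_mixed_date(text):
    """Return a date, or None when the value is blank, invalid, or ambiguous."""
    text = text.strip()
    if not text:
        return None
    try:
        return date.fromisoformat(text)  # ISO: YYYY-MM-DD
    except ValueError:
        pass
    parts = text.split("/")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return None
    first, second = int(parts[0]), int(parts[1])
    if first > 12 >= second:
        fmt = "%d/%m/%Y"  # day first, so UK order
    elif second > 12 >= first:
        fmt = "%m/%d/%Y"  # month first, so US order
    else:
        return None  # both parts <= 12 (ambiguous) or both > 12 (invalid)
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None

samples = ["2024-03-25", "25/03/2024", "03/25/2024", "03/04/2024", ""]
parsed = [parse_mixed_date(s) for s in samples]
print(parsed)
```

If the source system for a given batch of rows is known, that context is a better tie-breaker for ambiguous values than any rule based on the digits alone.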
20k daily agent-level records across 200 agents and 7 queue types. Handles, AHT, wait time, FCR, CSAT, and escalations — a clean multi-dimensional dataset for distribution analysis and agent benchmarking.
Inspired by: Kaggle · Call Centre Performance Data
100k bank transaction records with an imbalanced fraud flag (~2%). DQ issues: ~5% null merchant_category, ~2% duplicate transaction IDs, ~1% outlier amounts. Suitable for fraud detection ML and data quality remediation.
20k customer records with tenure, contract type, charges, product count, internet type, and a churn flag correlated with behaviour features. Clean numeric features ideal for binary classification and scipy.stats correlation analysis.
Inspired by: Kaggle · Telco Customer Churn
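To make the correlation analysis concrete: the relationship between a binary churn flag and a numeric feature is commonly measured with the point-biserial correlation, which is numerically the Pearson correlation with the flag coded as 0/1. The standard-library sketch below uses made-up tenure values, not figures from the dataset; `scipy.stats.pointbiserialr` computes the same coefficient and also returns a p-value.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Illustrative values only: short-tenure customers churn, long-tenure ones stay.
tenure_months = [2, 5, 8, 30, 48, 60]
churned = [1, 1, 1, 0, 0, 0]

r = pearson(tenure_months, churned)
print(f"tenure vs churn: r = {r:.3f}")
```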
20k daily meter readings across 100 meters and 7 site types. Includes peak/off-peak split, tariff type, renewable percentage, and estimated CO₂. Ideal for time-series aggregation by site type and sustainability metric analysis.
15k employee records with department, role level, tenure, salary band, performance rating, overtime, satisfaction scores, and an attrition flag correlated with satisfaction and workload. Clean ML-ready dataset for classification and scipy.stats hypothesis testing.
Large, real-world datasets from Kaggle, UCI, and open government portals. Download directly from the original source — ideal when you need millions of rows, live data, or specific domain coverage.
Showing 12 external datasets
Over 3M taxi trip records per month with pickup/dropoff times, locations, fares, tips, and payment type. A classic EDA dataset for temporal patterns, borough-level aggregation, and fare distribution analysis.
Detailed Airbnb listing data with 70+ features including price, room type, neighbourhood, availability, and host metrics. Ideal for geospatial EDA, price distribution analysis, and outlier detection.
Classic multi-dimensional sales dataset with orders, returns, and people tables across 51 countries, 3 segments, and 3 product categories. Well suited to multi-table join analysis and Tableau-style EDA.
7 interrelated tables with 120+ columns, significant nulls, and skewed distributions. A real-world multi-table DQ challenge: imputation, join deduplication, and type correction across 300k+ applications.
2.8M accident records with real-world DQ issues: missing weather columns, inconsistent city names, duplicate incidents, and mixed coordinate formats. Good for null-handling and text standardisation.
Census dataset where missing values are encoded as '?' strings — a realistic dirty data pattern. Used for DQ standardisation and binary income classification (>50K). 48k rows with 15 mixed-type features.
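Handling the '?' placeholder typically means converting it to a true missing value at load time, before any profiling or modelling. The sketch below does this with the standard library on three inline rows; the column names and values are illustrative rather than copied from the file.

```python
import csv
import io

# Three rows in the style described above; column names are illustrative.
RAW = """age,workclass,occupation,income
39,State-gov,Adm-clerical,<=50K
50,?,Exec-managerial,>50K
38,Private, ?,<=50K
"""

def read_with_missing(handle, missing_token="?"):
    """Read CSV rows, mapping the placeholder token to None."""
    rows = []
    for row in csv.DictReader(handle, skipinitialspace=True):
        rows.append({
            key: (None if value.strip() == missing_token else value.strip())
            for key, value in row.items()
        })
    return rows

rows = read_with_missing(io.StringIO(RAW))
missing_per_column = {col: sum(r[col] is None for r in rows) for col in rows[0]}
print(missing_per_column)
```

With pandas, passing `na_values="?"` to `read_csv` achieves a similar result; stripping whitespace first matters because the token can appear with a leading space.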
79 explanatory variables describing residential homes in Ames, Iowa. Numerical, ordinal, and nominal features with non-normal distributions. Ideal for scipy.stats normality tests and regression modelling.
284k European credit card transactions with 28 PCA-transformed features and a highly imbalanced fraud label (0.17%). Well suited to anomaly detection, precision-recall optimisation, and imbalanced-class statistics with scipy.
6.5k red and white wine samples with 11 physicochemical input features and a quality score. Well-suited for scipy.stats correlation matrices, ANOVA, multi-class classification, and feature importance analysis.
1,400+ indicators across 200+ countries from 1960 to present, in wide format (one column per year). A classic unpivot exercise to convert to long format for time-series analysis.
Johns Hopkins University time-series dataset with one column per date — the canonical wide-to-long transformation exercise. Confirmed cases, deaths, and recoveries across 200+ countries.
7.7M reported crime incidents with location, type, arrest flag, and date. Good for group-by aggregation, reshape exercises, and EDA at scale. Requires joining with community-area and district lookup tables.