Free Sample Datasets for Data Analytics

Download ready-to-use CSV datasets curated for manufacturing, retail, supply chain, healthcare, banking, energy, and operations. Each file is lightweight, realistically structured, and designed to import straight into Infoveave — no setup required.

Data disclaimer: These are synthetic datasets inspired by publicly available industry data patterns. They are not generated or owned by Infoveave — all values are representative samples created for demonstration and learning purposes only. No real organisations, individuals, or transactions are represented.

Showing 12 datasets

Retail.csv

Retail Daily Sales

25k daily POS transactions across 20 stores, 5 regions, and 8 product categories over two years. Rich time-series structure for trend decomposition, regional comparisons, and seasonal pattern analysis.

25,000 rows | 9 columns | ~1.6 MB
EDA
Sales Analysis, Revenue Trends, Discount Impact, Time Series
Retail.csv

E-Commerce Orders

50k online orders over two years with product categories, shipping methods, return flags, and order status. Includes intentional dirty data: inconsistent category casing and ~5% null shipping_method values.

50,000 rows | 10 columns | ~4.1 MB
EDA, DQ — Dirty Data
Order Fulfilment, Returns Analysis, Dirty Data, Null Handling
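
The two dirty-data issues described for this file (inconsistent category casing and null shipping_method values) can be handled along the lines of the sketch below. It uses only the Python standard library. The sample rows, the `category` column name, and the choice of title case as the target convention are assumptions for illustration; only `shipping_method` is named in the description.

```python
from collections import Counter

# Hypothetical sample rows. Only `shipping_method` is a column name taken from
# the dataset description; `category` and every value here are illustrative.
rows = [
    {"category": "Electronics", "shipping_method": "Standard"},
    {"category": "electronics", "shipping_method": ""},
    {"category": "HOME ", "shipping_method": "Express"},
    {"category": "Home", "shipping_method": None},
]


def clean_category(value):
    """Trim whitespace and apply one casing convention (title case)."""
    return value.strip().title()


def is_null(value):
    """Treat None and blank strings as missing."""
    return value is None or value.strip() == ""


# Count categories after normalisation so case variants collapse together.
category_counts = Counter(clean_category(r["category"]) for r in rows)

# Share of rows with a missing shipping method.
null_share = sum(is_null(r["shipping_method"]) for r in rows) / len(rows)

print(dict(category_counts), null_share)
```

Whether missing shipping methods should be filled, flagged, or dropped depends on the analysis, so the sketch only measures them.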
Manufacturing.csv

Manufacturing OEE

30k shift-level records across 20 machines over 500 days. Availability, performance, and quality sub-scores, plus defect counts, are recorded at the machine-shift level, which makes the file well suited to pivot/aggregation into machine-level, line-level, or daily rollups.

30,000 rows | 11 columns | ~1.9 MB
Transformation, EDA
OEE, Shift Analysis, Aggregation, Pivot, Defect Tracking
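
As a rough sketch of the machine-level rollup mentioned above, the example below computes OEE per shift record using the standard definition (availability × performance × quality) and averages it per machine. The column names and sample values are assumptions, and a simple unweighted mean is used; a production rollup might weight shifts by planned run time instead.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical machine-shift records; column names are assumed, not confirmed.
records = [
    {"machine_id": "M01", "availability": 0.90, "performance": 0.80, "quality": 0.95},
    {"machine_id": "M01", "availability": 1.00, "performance": 0.50, "quality": 1.00},
    {"machine_id": "M02", "availability": 0.50, "performance": 1.00, "quality": 1.00},
]


def oee(record):
    """Standard OEE: availability x performance x quality."""
    return record["availability"] * record["performance"] * record["quality"]


# Group shift-level OEE values by machine.
by_machine = defaultdict(list)
for record in records:
    by_machine[record["machine_id"]].append(oee(record))

# Unweighted mean OEE per machine, rounded for display.
machine_oee = {m: round(mean(values), 3) for m, values in by_machine.items()}
print(machine_oee)
```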
Manufacturing.csv

Product Quality Control

20k batch inspection records with intentional DQ issues: ~3% duplicate batch IDs, null defect_type values, outlier defect_counts, and non-standard qc_result entries. Designed for data profiling and DQ rule validation.

20,000 rows | 10 columns | ~1.3 MB
DQ — Dirty Data
Dirty Data, Duplicates, Null Handling, Outliers, Data Profiling
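
Two of the profiling checks this dataset is designed for, duplicate batch IDs and outlier defect counts, might look like the sketch below. The `batch_id` column name and the sample values are hypothetical, `defect_counts` is taken from the description and may not match the actual header, and the 1.5 × IQR fence is one common outlier rule chosen here for illustration, not a rule the dataset prescribes.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical inspection records; names and values are illustrative.
batches = [
    {"batch_id": "B001", "defect_counts": 2},
    {"batch_id": "B002", "defect_counts": 3},
    {"batch_id": "B003", "defect_counts": 1},
    {"batch_id": "B003", "defect_counts": 4},   # duplicate batch ID
    {"batch_id": "B004", "defect_counts": 2},
    {"batch_id": "B005", "defect_counts": 3},
    {"batch_id": "B006", "defect_counts": 250},  # suspicious value
    {"batch_id": "B007", "defect_counts": 2},
]

# Duplicate check: any batch ID that appears more than once.
id_counts = Counter(b["batch_id"] for b in batches)
duplicate_ids = sorted(i for i, n in id_counts.items() if n > 1)

# Outlier check: values outside the 1.5 x IQR fences.
q1, _, q3 = quantiles([b["defect_counts"] for b in batches], n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [b for b in batches if not low <= b["defect_counts"] <= high]

print(duplicate_ids, [b["batch_id"] for b in outliers])
```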
Supply Chain.csv

Supply Chain Shipments

30k shipment records with order date, dispatch, expected and actual delivery, carrier, weight, freight cost, and on-time flag across 10 origin/destination city pairs. Suited for SLA monitoring and pivot by carrier × region.

30,000 rows | 12 columns | ~3.0 MB
EDA, Transformation
On-Time Delivery, Carrier Performance, Freight Cost, Pivot
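
For SLA monitoring, an on-time rate per carrier can be derived from the expected and actual delivery dates. The sketch below assumes ISO-formatted date strings and hypothetical column names (`carrier`, `expected_delivery`, `actual_delivery`); the file also ships an on-time flag, so this is mainly useful for validating that flag or changing the SLA rule.

```python
from collections import defaultdict
from datetime import date

# Hypothetical shipment rows; column names and values are illustrative.
shipments = [
    {"carrier": "CarrierA", "expected_delivery": "2024-03-05", "actual_delivery": "2024-03-04"},
    {"carrier": "CarrierA", "expected_delivery": "2024-03-05", "actual_delivery": "2024-03-07"},
    {"carrier": "CarrierB", "expected_delivery": "2024-03-10", "actual_delivery": "2024-03-10"},
]

# carrier -> [on-time count, total count]
totals = defaultdict(lambda: [0, 0])
for s in shipments:
    expected = date.fromisoformat(s["expected_delivery"])
    actual = date.fromisoformat(s["actual_delivery"])
    totals[s["carrier"]][0] += actual <= expected  # on time if not late
    totals[s["carrier"]][1] += 1

on_time_rate = {carrier: ok / n for carrier, (ok, n) in totals.items()}
print(on_time_rate)
```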
Supply Chain.csv

Inventory Levels

15k daily SKU-level stock snapshots across 300 SKUs and 5 warehouses. Includes stock value and a below-reorder flag — wide structure designed for unpivoting, reshaping to trend series, and warehouse-level aggregation.

15,000 rows | 10 columns | ~1.0 MB
Transformation
Stock Levels, Reorder Analysis, Reshape, Unpivot, Wide Format
Healthcare.csv

Healthcare Admissions

25k hospital admission records with three deliberate DQ issues: mixed date formats (ISO/US/UK), ~8% null age_group, and ~2% duplicate admission IDs. A realistic scenario for date standardisation and null-imputation workflows.

25,000 rows | 10 columns | ~1.7 MB
DQ — Dirty Data, ML / SciPy
Dirty Data, Mixed Dates, Null Imputation, Readmission Risk, ICD Codes
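
Mixed ISO/US/UK dates are harder than they look, because a value such as 03/04/2024 is valid in both the US and UK conventions. The sketch below is one cautious approach, assuming ISO dates look like YYYY-MM-DD and that both US and UK dates use slashes (the exact formats in the file are not stated here). It resolves only the unambiguous cases and returns `None` for the rest so they can be reviewed or resolved with other context.

```python
from datetime import date


def parse_mixed_date(text):
    """Return a date, or None when the value is invalid or ambiguous."""
    text = text.strip()
    try:
        return date.fromisoformat(text)            # ISO: YYYY-MM-DD
    except ValueError:
        pass
    try:
        first, second, year = (int(part) for part in text.split("/"))
        if first == second:                        # same either way
            return date(year, first, second)
        if first > 12 and second <= 12:            # can only be DD/MM/YYYY
            return date(year, second, first)
        if second > 12 and first <= 12:            # can only be MM/DD/YYYY
            return date(year, first, second)
    except ValueError:
        return None                                # malformed or impossible date
    return None                                    # ambiguous day/month order


for raw in ["2024-02-13", "13/02/2024", "02/13/2024", "03/04/2024"]:
    print(raw, "->", parse_mixed_date(raw))
```

Returning `None` instead of guessing keeps ambiguous rows visible, which matters when the parsed dates feed length-of-stay or readmission calculations.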
Operations.csv

Call Centre Performance

20k daily agent-level records across 200 agents and 7 queue types. Columns cover calls handled, average handle time (AHT), wait time, first-contact resolution (FCR), CSAT, and escalations, giving a clean multi-dimensional dataset for distribution analysis and agent benchmarking.

20,000 rows | 10 columns | ~1.1 MB
EDA
Agent Productivity, CSAT, Handle Time, Distribution Analysis, Queue Analysis
Banking.csv

Financial Transactions

100k bank transaction records with an imbalanced fraud flag (~2%). DQ issues: ~5% null merchant_category, ~2% duplicate transaction IDs, ~1% outlier amounts. Suitable for fraud detection ML and data quality remediation.

100,000 rows | 10 columns | ~7.8 MB
DQ — Dirty Data, ML / SciPy
Fraud Detection, Imbalanced Classification, Dirty Data, Duplicates, Outliers
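
With a fraud rate of roughly 2%, accuracy alone is a misleading metric. The sketch below illustrates why, using made-up labels at that ratio: a baseline that never predicts fraud scores 98% accuracy while catching no fraud at all. The function name and data are illustrative only.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (fraud = 1) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative labels: 2 fraud cases in 100 transactions (about 2%).
y_true = [1] * 2 + [0] * 98
always_legit = [0] * 100          # baseline that never flags fraud

accuracy = sum(t == p for t, p in zip(y_true, always_legit)) / len(y_true)
precision, recall = precision_recall(y_true, always_legit)

print(accuracy, precision, recall)
```

For this reason, precision, recall, or precision-recall curves are usually more informative than accuracy when evaluating a model on this file.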
Banking.csv

Customer Churn

20k customer records with tenure, contract type, charges, product count, internet type, and a churn flag correlated with behaviour features. Clean numeric features ideal for binary classification and scipy.stats correlation analysis.

20,000 rows | 11 columns | ~1.3 MB
ML / SciPy
Binary Classification, Correlation Analysis, Feature Engineering, SciPy, Logistic Regression
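
The correlation between a numeric feature and the churn flag can be sketched as a plain Pearson coefficient, which for a 0/1 flag is the same quantity as the point-biserial correlation. The example below computes it by hand so it runs without dependencies; `scipy.stats.pearsonr` returns the same coefficient along with a p-value. The tenure and churn values are invented for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Illustrative values: tenure in months and a 0/1 churn flag.
tenure = [1, 2, 3, 10, 12, 24, 30, 36]
churn = [1, 1, 1, 0, 1, 0, 0, 0]

r = pearson_r(tenure, churn)
print(round(r, 3))  # negative here: longer tenure goes with less churn
```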
Energy.csv

Energy Consumption

20k daily meter readings across 100 meters and 7 site types. Includes peak/off-peak split, tariff type, renewable percentage, and estimated CO₂. Ideal for time-series aggregation by site type and sustainability metric analysis.

20,000 rows | 10 columns | ~1.6 MB
EDA, Transformation
Time Series, Pivot, CO₂ Tracking, Peak Load, Tariff Analysis
Operations.csv

Employee Attrition

15k employee records with department, role level, tenure, salary band, performance rating, overtime, satisfaction scores, and an attrition flag correlated with satisfaction and workload. Clean ML-ready dataset for classification and scipy.stats hypothesis testing.

15,000 rows | 12 columns | ~1.0 MB
ML / SciPy, EDA
Binary Classification, Clustering, SciPy, Hypothesis Testing, HR Analytics
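
One way to test whether satisfaction differs between leavers and stayers is a two-sample test. The sketch below uses a permutation test on the difference in means, chosen because it needs only the standard library; it is a stand-in for, not a reproduction of, the parametric tests in `scipy.stats` such as `ttest_ind`. The scores, group sizes, iteration count, and seed are all illustrative assumptions.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_iter=2000, seed=0):
    """Two-sided permutation test on the absolute difference in means."""
    rng = random.Random(seed)          # fixed seed for repeatable results
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)            # reassign group labels at random
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_iter + 1)


# Illustrative satisfaction scores (1-5) for employees who left vs stayed.
left = [1, 2, 2, 1, 2, 1, 2, 2]
stayed = [4, 5, 4, 3, 5, 4, 4, 5]

p_value = permutation_p_value(left, stayed)
print(p_value)
```

A small p-value suggests the gap in mean satisfaction is unlikely under random group assignment; it does not by itself show that low satisfaction causes attrition.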

Curated External Datasets

Large, real-world datasets from Kaggle, UCI, and open government portals. Download directly from the original source — ideal when you need millions of rows, live data, or specific domain coverage.

Showing 12 external datasets

EDA, Transformation

NYC Yellow Taxi Trips (2023)

Over 3M taxi trip records per month with pickup/dropoff times, locations, fares, tips, and payment type. A classic EDA dataset for temporal patterns, borough-level aggregation, and fare distribution analysis.

~3M rows/month | 19 columns | ~500 MB/month
EDA

Airbnb Listings (NYC / London / Paris)

Detailed Airbnb listing data with 70+ features including price, room type, neighbourhood, availability, and host metrics. Ideal for geospatial EDA, price distribution analysis, and outlier detection.

50k–300k rows | 74 columns | 30–150 MB
EDA, Transformation

Global Superstore (Tableau Sample)

Classic multi-dimensional sales dataset with orders, returns, and people tables across 51 countries, 3 segments, and 3 product categories. Perfect for cross-join analysis and Tableau-style EDA.

10k rows | 24 columns | 3 MB
DQ — Dirty Data, ML / SciPy

Home Credit Default Risk

7 interrelated tables with 120+ columns, significant nulls, and skewed distributions. A real-world multi-table DQ challenge: imputation, join deduplication, and type correction across 300k+ applications.

300k+ rows | 120 columns | 700 MB
DQ — Dirty Data, EDA

US Accidents (2016–2023)

2.8M accident records with real-world DQ issues: missing weather columns, inconsistent city names, duplicate incidents, and mixed coordinate formats. Good for null-handling and text standardisation.

2.8M rows | 46 columns | 1.07 GB
DQ — Dirty Data, ML / SciPy

Adult Income — UCI

Census dataset where missing values are encoded as '?' strings — a realistic dirty data pattern. Used for DQ standardisation and binary income classification (>50K). 48k rows with 15 mixed-type features.

48k rows | 15 columns | 5 MB
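
The '?' placeholder pattern described for this dataset can be standardised by mapping it to a real missing value at load time. The sketch below reads a tiny inline sample with the standard `csv` module. The header row is added here for readability and the sample rows are illustrative; values are stripped first on the assumption that the raw file pads fields with a space after each comma.

```python
import csv
import io

# Tiny illustrative sample; the header row is an addition for this example.
raw = io.StringIO(
    "age,workclass,occupation\n"
    "39, State-gov, Adm-clerical\n"
    "54, ?, ?\n"
    "28, Private, ?\n"
)


def to_missing(value):
    """Strip padding and convert the '?' placeholder to None."""
    value = value.strip()
    return None if value == "?" else value


rows = [{k: to_missing(v) for k, v in row.items()} for row in csv.DictReader(raw)]

# Count missing values per column after standardisation.
missing_per_column = {k: sum(r[k] is None for r in rows) for k in rows[0]}
print(missing_per_column)
```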
ML / SciPy

House Prices — Advanced Regression

79 explanatory variables describing residential homes in Ames, Iowa. Numerical, ordinal, and nominal features with non-normal distributions. Ideal for scipy.stats normality tests and regression modelling.

1.5k rows | 79 columns | 1 MB
ML / SciPy

Credit Card Fraud Detection

284k European credit card transactions with 28 PCA-transformed features and a highly imbalanced fraud label (0.17%). Perfect for anomaly detection, precision-recall optimisation, and scipy imbalanced-class statistics.

284k rows | 30 columns | 150 MB
ML / SciPy

Wine Quality — UCI

6.5k red and white wine samples with 11 physicochemical input features and a quality score. Well-suited for scipy.stats correlation matrices, ANOVA, multi-class classification, and feature importance analysis.

6.5k rows | 12 columns | 2 MB
Transformation, EDA

World Bank Development Indicators

1,400+ indicators across 200+ countries from 1960 to present, in wide format (one column per year). A classic unpivot exercise to convert to long format for time-series analysis.

~400k (long) rows | 65+ columns | 120 MB
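
The wide-to-long (unpivot) step described for this dataset can be sketched as follows: every year column becomes its own row with a `year` and a `value` field, and blank cells are skipped. The identifier column names, the country and indicator labels, and the numbers are placeholders, not real World Bank data. In pandas the same reshape is typically done with `DataFrame.melt`.

```python
# Placeholder wide-format rows: one column per year, blanks for missing years.
wide = [
    {"country": "Country A", "indicator": "Indicator X", "1960": "1.0", "1961": "1.5"},
    {"country": "Country B", "indicator": "Indicator X", "1960": "", "1961": "2.5"},
]

id_cols = ("country", "indicator")

# One output row per (identifier columns, year) pair with a non-blank value.
long_rows = [
    {**{c: row[c] for c in id_cols}, "year": int(col), "value": float(val)}
    for row in wide
    for col, val in row.items()
    if col not in id_cols and val != ""
]

for r in long_rows:
    print(r)
```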
Transformation, EDA

COVID-19 Global Cases — JHU

Johns Hopkins University time-series dataset with one column per date — the canonical wide-to-long transformation exercise. Confirmed cases, deaths, and recoveries across 200+ countries.

~100k (long) rows | 1 column per date | 50 MB
Transformation, EDA

Chicago Crime Data (2001–Present)

7.7M reported crime incidents with location, type, arrest flag, and date. Good for group-by aggregation, reshape exercises, and EDA at scale. Requires joining with community-area and district lookup tables.

7.7M rows | 22 columns | 1.6 GB

© 2026 Noesys Software Pvt Ltd

Infoveave® is a product of Noesys

All Rights Reserved