# Free Sample Datasets for Data Analytics

Download ready-to-use CSV datasets curated for manufacturing, retail, supply chain, healthcare, banking, energy, and operations. Each file is lightweight, realistically structured, and designed to import straight into Infoveave — no setup required.

Data disclaimer: These are synthetic datasets inspired by publicly available industry data patterns. They are not generated or owned by Infoveave — all values are representative samples created for demonstration and learning purposes only. No real organisations, individuals, or transactions are represented.

Retail.csv

### Retail Daily Sales

25k daily POS transactions across 20 stores, 5 regions, and 8 product categories over two years. Rich time-series structure for trend decomposition, regional comparisons, and seasonal pattern analysis.

25,000 rows · 9 columns · \~1.6 MB

EDA

Sales Analysis · Revenue Trends · Discount Impact · Time Series

[Analyze with AI](/resources/sample-datasets/retail-daily-sales)

Inspired by: [Kaggle · Store Sales – Time Series Forecasting](https://www.kaggle.com/competitions/store-sales-time-series-forecasting)

Retail.csv

### E-Commerce Orders

50k online orders over two years with product categories, shipping methods, return flags, and order status. Includes intentional dirty data: inconsistent category casing and \~5% null shipping\_method values.

50,000 rows · 10 columns · \~4.1 MB

EDA · DQ — Dirty Data

Order Fulfilment · Returns Analysis · Dirty Data · Null Handling

[Analyze with AI](/resources/sample-datasets/ecommerce-orders)

Inspired by: [Kaggle · Brazilian E-Commerce (Olist)](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce)
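
The two dirty-data issues described for this dataset can be explored with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: the `category` and `order_id` column names and the sample rows are hypothetical (only `shipping_method` is named in the description), and nulls are assumed to arrive as empty strings, which may not match the real file.

```python
import csv
import io

# Hypothetical sample; the real file's column names and null encoding may differ.
SAMPLE = """order_id,category,shipping_method
1001,Electronics,Standard
1002,electronics,
1003,HOME & GARDEN,Express
1004,Home & Garden,Standard
"""

def clean_orders(text):
    """Normalise category casing and map empty shipping_method values to None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["category"] = row["category"].strip().title()
        row["shipping_method"] = row["shipping_method"].strip() or None
        rows.append(row)
    return rows

orders = clean_orders(SAMPLE)
categories = sorted({row["category"] for row in orders})
null_rate = sum(row["shipping_method"] is None for row in orders) / len(orders)
print(categories, null_rate)
```

Title-casing is only one possible normalisation; mapping variants to a fixed lookup table may be safer when category names contain acronyms.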

Manufacturing.csv

### Manufacturing OEE

30k shift-level records across 20 machines over 500 days, which works out to three records per machine per day. Each record carries availability, performance, and quality sub-scores plus defect counts, ready to pivot or aggregate into machine-level, line-level, or daily rollups.

30,000 rows · 11 columns · \~1.9 MB

Transformation · EDA

OEE · Shift Analysis · Aggregation · Pivot · Defect Tracking

[Analyze with AI](/resources/sample-datasets/manufacturing-oee)

Inspired by: [Kaggle · Manufacturing OEE Dataset](https://www.kaggle.com/datasets/podsyp/production-quality)
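
OEE is conventionally the product of availability, performance, and quality, so a machine-level rollup from shift records might look like the sketch below. The column names (`machine_id`, `availability`, `performance`, `quality`) and the sample values are assumptions, and the sub-scores are assumed to be fractions between 0 and 1 rather than percentages.

```python
from collections import defaultdict

# Hypothetical machine-shift records; sub-scores are assumed to be fractions in [0, 1].
records = [
    {"machine_id": "M01", "availability": 0.90, "performance": 0.80, "quality": 1.00},
    {"machine_id": "M01", "availability": 0.80, "performance": 0.90, "quality": 0.95},
    {"machine_id": "M02", "availability": 0.50, "performance": 1.00, "quality": 0.90},
]

def oee(record):
    """OEE for one record: availability x performance x quality."""
    return record["availability"] * record["performance"] * record["quality"]

def mean_oee_by_machine(rows):
    """Group per-shift OEE values by machine and take an unweighted mean."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["machine_id"]].append(oee(row))
    return {machine: sum(vals) / len(vals) for machine, vals in grouped.items()}

machine_oee = mean_oee_by_machine(records)
print(machine_oee)
```

The unweighted mean treats every shift equally; weighting by planned production time would be a reasonable alternative if the file provides it.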

Manufacturing.csv

### Product Quality Control

20k batch inspection records with intentional DQ issues: \~3% duplicate batch IDs, null defect\_type values, outlier defect\_counts, and non-standard qc\_result entries. Designed for data profiling and DQ rule validation.

20,000 rows · 10 columns · \~1.3 MB

DQ — Dirty Data

Dirty Data · Duplicates · Null Handling · Outliers · Data Profiling

[Analyze with AI](/resources/sample-datasets/product-quality-control)

Inspired by: [UCI ML Repository · Steel Plates Faults](https://archive.ics.uci.edu/dataset/198/steel+plates+faults)
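
A first profiling pass over the duplicate batch IDs and non-standard `qc_result` entries could be sketched as follows. The `batch_id` column name and the sample rows are hypothetical, and keeping the first occurrence of a duplicate is just one possible rule.

```python
from collections import Counter

# Hypothetical rows; `batch_id` is an assumed column name.
rows = [
    {"batch_id": "B001", "qc_result": "PASS"},
    {"batch_id": "B002", "qc_result": "fail"},
    {"batch_id": "B001", "qc_result": "PASS"},
    {"batch_id": "B003", "qc_result": " Pass "},
]

# Profile: which batch IDs occur more than once?
counts = Counter(row["batch_id"] for row in rows)
duplicate_ids = sorted(bid for bid, n in counts.items() if n > 1)

# Remediate: keep the first occurrence and standardise qc_result text.
seen = set()
deduplicated = []
for row in rows:
    if row["batch_id"] in seen:
        continue
    seen.add(row["batch_id"])
    deduplicated.append({**row, "qc_result": row["qc_result"].strip().upper()})

print(duplicate_ids, [row["qc_result"] for row in deduplicated])
```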

Supply Chain.csv

### Supply Chain Shipments

30k shipment records with order date, dispatch, expected and actual delivery, carrier, weight, freight cost, and on-time flag across 10 origin/destination city pairs. Suited for SLA monitoring and pivot by carrier × region.

30,000 rows · 12 columns · \~3.0 MB

EDA · Transformation

On-Time Delivery · Carrier Performance · Freight Cost · Pivot

[Analyze with AI](/resources/sample-datasets/supply-chain-shipments)

Inspired by: [Kaggle · SCMS Delivery History](https://www.kaggle.com/datasets/usaid-assist/supply-chain-shipment-pricing-data)

Supply Chain.csv

### Inventory Levels

15k daily SKU-level stock snapshots across 300 SKUs and 5 warehouses. Includes stock value and a below-reorder flag — wide structure designed for unpivoting, reshaping to trend series, and warehouse-level aggregation.

15,000 rows · 10 columns · \~1.0 MB

Transformation

Stock Levels · Reorder Analysis · Reshape · Unpivot · Wide Format

[Analyze with AI](/resources/sample-datasets/inventory-levels)

Inspired by: [Kaggle · Supply Chain & Inventory Analytics](https://www.kaggle.com/datasets/harshsingh2209/supply-chain-analysis)

Healthcare.csv

### Healthcare Admissions

25k hospital admission records with three deliberate DQ issues: mixed date formats (ISO/US/UK), \~8% null age\_group, and \~2% duplicate admission IDs. A realistic scenario for date standardisation and null-imputation workflows.

25,000 rows · 10 columns · \~1.7 MB

DQ — Dirty Data · ML / SciPy

Dirty Data · Mixed Dates · Null Imputation · Readmission Risk · ICD Codes

[Analyze with AI](/resources/sample-datasets/healthcare-admissions)

Inspired by: [Kaggle · Hospital Patient Records](https://www.kaggle.com/datasets/nehaprabhavalkar/av-healthcare-analytics-ii)
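
Mixed ISO/US/UK dates cannot always be resolved from the value alone: `03/04/2023` is a valid date under both slash conventions. The sketch below is one cautious approach, not the dataset's prescribed method. It assumes ISO dates use `YYYY-MM-DD` and slash dates use four-digit years, and it returns `None` for ambiguous values so they can be reviewed rather than guessed.

```python
from datetime import date, datetime

def parse_admission_date(raw):
    """Return a date, or None when the value is unparseable or ambiguous."""
    raw = raw.strip()
    try:
        return datetime.strptime(raw, "%Y-%m-%d").date()  # ISO format
    except ValueError:
        pass
    parts = raw.split("/")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return None
    first, second, year = (int(p) for p in parts)
    if first > 12 >= second:       # only DD/MM/YYYY (UK) is possible
        day, month = first, second
    elif second > 12 >= first:     # only MM/DD/YYYY (US) is possible
        day, month = second, first
    elif first == second:          # both readings give the same date
        day = month = first
    else:                          # both readings are valid but differ
        return None
    try:
        return date(year, month, day)
    except ValueError:
        return None

for value in ["2023-03-25", "25/03/2023", "03/25/2023", "03/04/2023"]:
    print(value, "->", parse_admission_date(value))
```

If another column (for example, a hypothetical source-system or site field) indicates which convention a row uses, that is a better basis for resolving the ambiguous cases.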

Operations.csv

### Call Centre Performance

20k daily agent-level records across 200 agents and 7 queue types. Handles, AHT, wait time, FCR, CSAT, and escalations — a clean multi-dimensional dataset for distribution analysis and agent benchmarking.

20,000 rows · 10 columns · \~1.1 MB

EDA

Agent Productivity · CSAT · Handle Time · Distribution Analysis · Queue Analysis

[Analyze with AI](/resources/sample-datasets/call-center-performance)

Inspired by: [Kaggle · Call Centre Performance Data](https://www.kaggle.com/datasets/mhdzahier/call-center)

Banking.csv

### Financial Transactions

100k bank transaction records with an imbalanced fraud flag (\~2%). DQ issues: \~5% null merchant\_category, \~2% duplicate transaction IDs, \~1% outlier amounts. Suitable for fraud detection ML and data quality remediation.

100,000 rows · 10 columns · \~7.8 MB

DQ — Dirty Data · ML / SciPy

Fraud Detection · Imbalanced Classification · Dirty Data · Duplicates · Outliers

[Analyze with AI](/resources/sample-datasets/financial-transactions)

Inspired by: [Kaggle · PaySim Synthetic Financial Dataset](https://www.kaggle.com/datasets/ealaxi/paysim1)
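
One common way to flag outlier amounts is Tukey's interquartile-range rule. The sketch below uses hypothetical values and is not necessarily how the outliers in this file were generated; the multiplier `k=1.5` is the conventional default and can be tuned.

```python
import statistics

# Hypothetical transaction amounts; the last value plays the role of an outlier.
amounts = [10, 12, 11, 13, 12, 14, 11, 10, 13, 5000]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

outliers = iqr_outliers(amounts)
print(outliers)
```

Because legitimate high-value transactions exist, it is usually better to flag such rows for review than to drop them, particularly when the fraud label may correlate with amount.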

Banking.csv

### Customer Churn

20k customer records with tenure, contract type, charges, product count, internet type, and a churn flag correlated with behaviour features. Clean numeric features ideal for binary classification and scipy.stats correlation analysis.

20,000 rows · 11 columns · \~1.3 MB

ML / SciPy

Binary Classification · Correlation Analysis · Feature Engineering · SciPy · Logistic Regression

[Analyze with AI](/resources/sample-datasets/customer-churn)

Inspired by: [Kaggle · Telco Customer Churn](https://www.kaggle.com/datasets/blastchar/telco-customer-churn)
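
Correlating a numeric feature with a 0/1 churn flag is a point-biserial correlation, which is numerically the same as Pearson's r with a binary variable. The sketch below computes it with the standard library so it runs without dependencies; `scipy.stats.pearsonr` or `scipy.stats.pointbiserialr` would be the usual choice and also return a p-value. The `tenure` and `churn` values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: tenure in months and a 0/1 churn flag.
tenure = [1, 3, 5, 24, 36, 48, 60, 72]
churn = [1, 1, 1, 0, 1, 0, 0, 0]

r = pearson(tenure, churn)
print(round(r, 3))  # negative: longer tenure goes with less churn in this sample
```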

Energy.csv

### Energy Consumption

20k daily meter readings across 100 meters and 7 site types. Includes peak/off-peak split, tariff type, renewable percentage, and estimated CO₂. Ideal for time-series aggregation by site type and sustainability metric analysis.

20,000 rows · 10 columns · \~1.6 MB

EDA · Transformation

Time Series · Pivot · CO₂ Tracking · Peak Load · Tariff Analysis

[Analyze with AI](/resources/sample-datasets/energy-consumption)

Inspired by: [UCI ML Repository · Individual Household Electric Power Consumption](https://archive.ics.uci.edu/dataset/235/individual+household+electric+power+consumption)

Operations.csv

### Employee Attrition

15k employee records with department, role level, tenure, salary band, performance rating, overtime, satisfaction scores, and an attrition flag correlated with satisfaction and workload. Clean ML-ready dataset for classification and scipy.stats hypothesis testing.

15,000 rows · 12 columns · \~1.0 MB

ML / SciPy · EDA

Binary Classification · Clustering · SciPy · Hypothesis Testing · HR Analytics

[Analyze with AI](/resources/sample-datasets/employee-attrition)

Inspired by: [Kaggle · IBM HR Analytics Employee Attrition](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset)

## Curated External Datasets

Large, real-world datasets from Kaggle, UCI, and open government portals. Download directly from the original source — ideal when you need millions of rows, live data, or specific domain coverage.

EDA · Transformation

### NYC Yellow Taxi Trips (2023)

Over 3M taxi trip records per month with pickup/dropoff times, locations, fares, tips, and payment type. A classic EDA dataset for temporal patterns, borough-level aggregation, and fare distribution analysis.

\~3M rows/month · 19 columns · \~500 MB/month

[View on NYC TLC Open Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)

EDA

### Airbnb Listings (NYC / London / Paris)

Detailed Airbnb listing data with 70+ features including price, room type, neighbourhood, availability, and host metrics. Ideal for geospatial EDA, price distribution analysis, and outlier detection.

50k–300k rows · 74 columns · 30–150 MB

[View on Inside Airbnb](http://insideairbnb.com/get-the-data)

EDA · Transformation

### Global Superstore (Tableau Sample)

Classic multi-dimensional sales dataset with orders, returns, and people tables across 51 countries, 3 segments, and 3 product categories. Well suited to multi-table join analysis and Tableau-style EDA.

10k rows · 24 columns · 3 MB

[View on Kaggle · Tableau Sample Data](https://www.kaggle.com/datasets/vivek468/superstore-dataset-final)

DQ — Dirty Data · ML / SciPy

### Home Credit Default Risk

7 interrelated tables with 120+ columns, significant nulls, and skewed distributions. A real-world multi-table DQ challenge: imputation, join deduplication, and type correction across 300k+ applications.

300k+ rows · 120 columns · 700 MB

[View on Kaggle · Home Credit Group](https://www.kaggle.com/competitions/home-credit-default-risk/data)

DQ — Dirty Data · EDA

### US Accidents (2016–2023)

2.8M accident records with real-world DQ issues: missing weather columns, inconsistent city names, duplicate incidents, and mixed coordinate formats. Good for null-handling and text standardisation.

2.8M rows · 46 columns · 1.07 GB

[View on Kaggle · Sobhan Moosavi](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents)

DQ — Dirty Data · ML / SciPy

### Adult Income — UCI

Census dataset where missing values are encoded as '?' strings — a realistic dirty data pattern. Used for DQ standardisation and binary income classification (>50K). 48k rows with 15 mixed-type features.

48k rows · 15 columns · 5 MB

[View on UCI ML Repository](https://archive.ics.uci.edu/dataset/2/adult)
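
Handling the `?` placeholder usually means converting it to a real null at load time. The sketch below assumes a header-less, comma-plus-space separated layout; the three rows are illustrative rather than copied from the file, and they are trimmed to five of the 15 fields to stay short.

```python
import csv
import io

# Illustrative rows trimmed to five fields; the real file has 15 fields per row.
SAMPLE = """39, State-gov, Bachelors, Adm-clerical, <=50K
54, ?, Some-college, ?, >50K
28, Private, Bachelors, Prof-specialty, <=50K
"""
COLUMNS = ["age", "workclass", "education", "occupation", "income"]

def load_rows(text):
    """Parse rows, strip padding, and replace '?' placeholders with None."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not fields:
            continue
        cleaned = [None if f.strip() == "?" else f.strip() for f in fields]
        rows.append(dict(zip(COLUMNS, cleaned)))
    return rows

adult = load_rows(SAMPLE)
missing_per_column = {c: sum(r[c] is None for r in adult) for c in COLUMNS}
print(missing_per_column)
```

Counting missing values per column before imputing shows whether the gaps are concentrated in a few fields, which affects whether dropping rows or imputing is the more sensible choice.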

ML / SciPy

### House Prices — Advanced Regression

79 explanatory variables describing residential homes in Ames, Iowa. Numerical, ordinal, and nominal features with non-normal distributions. Ideal for scipy.stats normality tests and regression modelling.

1.5k rows · 79 columns · 1 MB

[View on Kaggle Competition](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data)

ML / SciPy

### Credit Card Fraud Detection

284k European credit card transactions with 28 PCA-transformed features and a highly imbalanced fraud label (0.17%). Well suited to anomaly detection, precision-recall optimisation, and statistical analysis of imbalanced classes with SciPy.

284k rows · 30 columns · 150 MB

[View on Kaggle · ULB](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)
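
With a fraud rate of 0.17%, accuracy says very little, which is why precision and recall matter here. The worked example below uses hypothetical round numbers (100,000 transactions, 170 of them fraud) and hypothetical model results rather than figures from the dataset.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical counts: 100,000 transactions, 170 of them fraud (0.17%).
total, fraud = 100_000, 170

# A model that never predicts fraud is right 99.83% of the time but finds nothing.
always_legit_accuracy = (total - fraud) / total
always_legit = precision_recall(tp=0, fp=0, fn=fraud)

# A hypothetical model that catches 136 frauds and raises 34 false alarms.
model = precision_recall(tp=136, fp=34, fn=34)

print(always_legit_accuracy, always_legit, model)
```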

ML / SciPy

### Wine Quality — UCI

6.5k red and white wine samples with 11 physicochemical input features and a quality score. Well-suited for scipy.stats correlation matrices, ANOVA, multi-class classification, and feature importance analysis.

6.5k rows · 12 columns · 2 MB

[View on UCI ML Repository](https://archive.ics.uci.edu/dataset/186/wine+quality)

Transformation · EDA

### World Bank Development Indicators

1,400+ indicators across 200+ countries from 1960 to present, in wide format (one column per year). A classic unpivot exercise to convert to long format for time-series analysis.

\~400k rows (long format) · 65+ columns · 120 MB

[View on World Bank Open Data](https://databank.worldbank.org/source/world-development-indicators)

Transformation · EDA

### COVID-19 Global Cases — JHU

Johns Hopkins University time-series dataset with one column per date — the canonical wide-to-long transformation exercise. Confirmed cases, deaths, and recoveries across 200+ countries.

\~100k rows (long format) · 1 column per date · 50 MB

[View on GitHub · CSSEGISandData](https://github.com/CSSEGISandData/COVID-19)
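
The wide-to-long reshape can be sketched with the standard library alone. The identifier columns below follow the layout of the global time-series files in that repository (`Province/State`, `Country/Region`, `Lat`, `Long`, then one column per date); the country names and counts are made up for illustration, and the output field names are assumptions.

```python
import csv
import io

ID_COLUMNS = ["Province/State", "Country/Region", "Lat", "Long"]

# Illustrative values only; the layout mirrors the wide time-series files.
WIDE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
,Exampleland,10.0,20.0,0,2,5
North,Samplestan,30.0,40.0,1,1,3
"""

def unpivot(text):
    """Turn one-column-per-date rows into one record per (location, date)."""
    long_rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for column, value in row.items():
            if column in ID_COLUMNS:
                continue  # identifier columns are carried over, not unpivoted
            long_rows.append({
                "country": row["Country/Region"],
                "province": row["Province/State"],
                "date": column,
                "cases": int(value),
            })
    return long_rows

long_cases = unpivot(WIDE)
print(len(long_cases), long_cases[0])
```

Two wide rows with three date columns become six long records, which is the shape most time-series tooling expects.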

Transformation · EDA

### Chicago Crime Data (2001–Present)

7.7M reported crime incidents with location, type, arrest flag, and date. Good for group-by aggregation, reshape exercises, and EDA at scale. Requires joining with community-area and district lookup tables.

7.7M rows · 22 columns · 1.6 GB

[View on Chicago Data Portal](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2)

## Frequently Asked Questions

What are these sample datasets and where do they come from?

How do I import a CSV dataset into Infoveave?

Can I use these datasets to test Infoveave's Fovea AI assistant?

Are these datasets suitable for building demo dashboards?

Can I request a sample dataset for a different industry or use case?

What is the file format and are there any size restrictions?

© 2026 [Noesys Software Pvt Ltd](https://noesyssoftware.com) 

Infoveave® is a product of Noesys

All Rights Reserved