Running a Real Retail Dataset Through a Python Data Quality Workflow

In the previous article, I extended a small Python data quality ETL starter with AI-ready data preparation. The important constraint was that the workflow did not call an LLM API, generate embeddings, or train a model. It prepared structured data assets such as schema profiles, data dictionaries, validation summaries, feature-ready CSV files, and manifest files.

Previous article: Preparing AI-Ready Data Without Calling an LLM API

This follow-up focuses on the v0.7.0 update of the same project: Data Quality ETL Starter on GitHub

The new goal is to move beyond synthetic demo data and show that the same data quality workflow can process a public retail/e-commerce-style dataset locally.

This is still not a big data platform, a production retail analytics system, a benchmark leaderboard, or a public dataset redistribution repository. The goal is narrower and more practical:

    manually downloaded public retail dataset
      ↓
    prepare_real_dataset_demo.py
      ↓
    normalized retail transaction CSV
      ↓
    existing CLI validation and cleaning workflow
      ↓
    quality reports + SQLite export
      ↓
    run_real_dataset_benchmark.py
      ↓
    benchmark report + summary CSV outputs

That is a useful next step for a portfolio project because it shows the workflow can handle a more realistic dataset while still keeping data handling, scope, and reproducibility clear.

Why add a real dataset benchmark?

Earlier versions of this project used small sample files and generated synthetic order data. That is useful for testing and documentation, but it leaves one practical question:

Can the workflow handle a public dataset that was not designed specifically for this repository?

v0.7.0 adds an optional real dataset benchmark path to answer that question.

The workflow now demonstrates how to:

- take a public retail transaction dataset;
- keep the raw dataset local-only;
- map external source columns into a project-friendly schema;
- derive practical fields such as revenue and cancellation flags;
- reuse the existing CLI validation and cleaning workflow;
- generate Markdown and JSON quality reports;
- export cleaned data to SQLite;
- produce benchmark evidence and summary CSV files.

The key design choice is that the existing CLI remains the source of truth. The real dataset path does not become a separate pipeline. It prepares the source data, then passes it through the same validation and cleaning workflow used by the rest of the project.

Dataset used in v0.7.0

The default v0.7.0 dataset is the UCI Online Retail dataset.

Official source: UCI Machine Learning Repository: Online Retail

Citation: Chen, D. (2015). Online Retail [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5BW33

License note: Creative Commons Attribution 4.0 International (CC BY 4.0)

The dataset is useful for this project because it is retail/e-commerce adjacent and transaction-shaped. It includes fields that map naturally into an invoice/order-style workflow:

    InvoiceNo
    StockCode
    Description
    Quantity
    InvoiceDate
    UnitPrice
    CustomerID
    Country

The project maps those source columns into normalized snake_case columns and adds derived fields.

What is kept out of Git

This part is important. The repository does not redistribute the full raw UCI dataset. It also does not commit full normalized or cleaned real dataset outputs.

These paths are local-only:

    data/external/
    data/raw/public/
    data/output/real_dataset/

The repository keeps:

- source code;
- schema files;
- tests;
- documentation;
- screenshots;
- small sample inputs;
- instructions for running the workflow locally.

It does not keep:

- full downloaded raw datasets;
- full normalized real dataset outputs;
- full cleaned real dataset outputs;
- local SQLite files generated from real datasets;
- private customer data;
- client data;
- API credentials;
- tokens or secrets.

This keeps the repository lightweight and avoids turning it into a dataset mirror.

What v0.7.0 adds

The most relevant new files are:

    scripts/prepare_real_dataset_demo.py
    scripts/run_real_dataset_benchmark.py
    src/dq_etl_starter/real_dataset.py
    docs/data_sources.md
    docs/real_dataset_benchmark.md
    docs/limitations.md
    data/expected/online_retail_schema.json

The real dataset helper module handles the project-specific mapping and summary logic. The two scripts provide a simple local workflow:

- prepare the manually downloaded dataset into a normalized CSV;
- generate local benchmark evidence and summary outputs after the CLI quality workflow runs.

Project structure after the update

The project now has a clearer path from messy input files to public-dataset benchmark evidence:

    data-quality-etl-starter/
    ├── data/
    │   ├── expected/
    │   │   └── online_retail_schema.json
    │   └── output/
    ├── docs/
    │   ├── data_sources.md
    │   ├── limitations.md
    │   └── real_dataset_benchmark.md
    ├── screenshots/
    ├── scripts/
    │   ├── prepare_real_dataset_demo.py
    │   └── run_real_dataset_benchmark.py
    ├── src/dq_etl_starter/
    │   ├── real_dataset.py
    │   ├── cli.py
    │   ├── clean.py
    │   ├── report.py
    │   └── validate.py
    └── tests/
        ├── test_real_dataset.py
        └── test_real_dataset_benchmark.py

The real dataset path is optional. The default small sample workflows remain unchanged.

Install the project locally

Clone the repository:

    git clone https://github.com/OnerGit/data-quality-etl-starter.git
    cd data-quality-etl-starter

Create a virtual environment:

    python -m venv .venv

Activate it on macOS or Linux:

    source .venv/bin/activate

Activate it on Windows PowerShell:

    .venv\Scripts\activate

Install dependencies and the local package:

    pip install -r requirements.txt
    pip install -e .

The editable install step is useful because the project uses a src/ layout.

Step 1: Download the public dataset manually

Download the UCI Online Retail dataset from the official UCI Machine Learning Repository page. Place the file here:

    data/external/online_retail.xlsx

The project does not automatically download the dataset by default. That is intentional. For a public portfolio repository, I prefer to keep the data acquisition step explicit. It makes the source, license, citation, and local-only handling policy easier to review.

Step 2: Prepare the normalized dataset

Run the preparation script.

macOS / Linux:

    python scripts/prepare_real_dataset_demo.py \
      --raw-input data/external/online_retail.xlsx \
      --output data/output/real_dataset/online_retail_normalized.csv

Windows PowerShell:

    python scripts/prepare_real_dataset_demo.py `
      --raw-input data/external/online_retail.xlsx `
      --output data/output/real_dataset/online_retail_normalized.csv

This step reads the local source file, validates expected source columns, maps UCI columns into project-friendly names, derives additional fields, and writes a normalized CSV.

The normalized output columns are:

    invoice_no
    stock_code
    description
    quantity
    invoice_date
    unit_price
    customer_id
    country
    revenue
    is_cancellation
    source_dataset

The derived fields are simple but useful:

- revenue is derived from quantity and unit price;
- is_cancellation marks cancellation-style rows;
- source_dataset records dataset lineage.

This preparation layer is deliberately small. It does not try to perform all business logic. It only converts the external dataset into a shape that the existing project workflow can validate and clean.

Step 3: Run the existing CLI workflow

After preparation, the normalized CSV is passed into the existing CLI workflow.

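To make the shape of that normalized CSV concrete, here is a minimal sketch of the Step 2 mapping and derived fields. It is not the project's real_dataset.py. The function name normalize_row, the source_dataset label "uci_online_retail", and the two sample rows are hypothetical. The cancellation rule (invoice numbers starting with "C") follows the UCI dataset description, but treating it as this project's exact rule is an assumption, as is computing revenue as quantity times unit price rounded to two decimals.

```python
import csv
import io

# Source column -> normalized column, taken from the two column lists in the article.
COLUMN_MAP = {
    "InvoiceNo": "invoice_no",
    "StockCode": "stock_code",
    "Description": "description",
    "Quantity": "quantity",
    "InvoiceDate": "invoice_date",
    "UnitPrice": "unit_price",
    "CustomerID": "customer_id",
    "Country": "country",
}


def normalize_row(raw):
    """Map one source row to normalized names and add the derived fields."""
    missing = [col for col in COLUMN_MAP if col not in raw]
    if missing:
        raise ValueError(f"missing source columns: {missing}")

    row = {new: raw[old] for old, new in COLUMN_MAP.items()}

    # Assumption: revenue = quantity * unit price, rounded to 2 decimals.
    row["revenue"] = round(int(row["quantity"]) * float(row["unit_price"]), 2)
    # Assumption: cancellation rows have an invoice number starting with "C".
    row["is_cancellation"] = str(row["invoice_no"]).upper().startswith("C")
    # Hypothetical lineage label; the project's actual value may differ.
    row["source_dataset"] = "uci_online_retail"
    return row


# Two invented rows for illustration only; they are not taken from the UCI file.
SAMPLE_CSV = """InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
100001,A100,SAMPLE ITEM,6,2011-01-04 10:00:00,2.50,10001,United Kingdom
C100002,A100,SAMPLE ITEM,-2,2011-01-05 11:30:00,2.50,10001,United Kingdom
"""

normalized = [normalize_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]

if __name__ == "__main__":
    for row in normalized:
        print(row["invoice_no"], row["revenue"], row["is_cancellation"])
```

The sketch fails fast on missing source columns and leaves everything else, such as type checks and cleaning, to the CLI run whose commands follow.
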
macOS / Linux:

    python -m dq_etl_starter.cli run \
      --input data/output/real_dataset/online_retail_normalized.csv \
      --input-type csv \
      --schema data/expected/online_retail_schema.json \
      --output-dir data/output/real_dataset/run \
      --db-target sqlite \
      --table-name cleaned_online_retail

Windows PowerShell:

    python -m dq_etl_starter.cli run `
      --input data/output/real_dataset/online_retail_normalized.csv `
      --input-type csv `
      --schema data/expected/online_retail_schema.json `
      --output-dir data/output/real_dataset/run `
      --db-target sqlite `
      --table-name cleaned_online_retail

Expected local outputs:

    data/output/real_dataset/run/cleaned_online_retail.csv
    data/output/real_dataset/run/etl_output.sqlite
    data/output/real_dataset/run/quality_report.md
    data/output/real_dataset/run/quality_report.json

This is the most important design point in v0.7.0. The real dataset path reuses the existing validation and cleaning workflow. It does not create a special one-off script that bypasses the project architecture.

Schema for the normalized retail dataset

The schema file is:

    data/expected/online_retail_schema.json

It defines the expected normalized columns and validation rules for fields such as invoice number, stock code, quantity, invoice date, unit price, customer ID, country, revenue, cancellation flag, and source dataset.

The schema is not intended to certify the dataset as business-ready. It is a practical contract for this starter workflow:

    external retail columns
      ↓
    normalized project columns
      ↓
    expected schema rules
      ↓
    quality report

That is a useful handoff pattern because the next person can inspect both the mapping and the validation report.

Quality report

The CLI workflow writes a Markdown report and a JSON report. For the real dataset workflow, the Markdown report is written to:

    data/output/real_dataset/run/quality_report.md

The report is useful because it records what the workflow found rather than only producing a cleaned file.

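As a rough illustration of recording findings alongside a cleaned file, the sketch below computes a few summary counts over rows that have already been loaded as dictionaries. It does not reproduce the project's report.py or the structure of quality_report.json. The function name summarize_quality and the keys of the returned dictionary are hypothetical, and the sample rows are invented.

```python
from collections import Counter


def summarize_quality(rows, expected_columns):
    """Return simple findings: row count, missing values, duplicates, absent columns."""
    # Count empty or absent values per expected column.
    missing_by_column = Counter()
    for row in rows:
        for col in expected_columns:
            value = row.get(col)
            if value is None or str(value).strip() == "":
                missing_by_column[col] += 1

    # Count rows that repeat an earlier row across the expected columns.
    seen = set()
    duplicate_rows = 0
    for row in rows:
        key = tuple(row.get(col) for col in expected_columns)
        if key in seen:
            duplicate_rows += 1
        else:
            seen.add(key)

    # Expected columns that never appear in any row.
    present = set()
    for row in rows:
        present.update(row.keys())

    return {
        "row_count": len(rows),
        "missing_values": {col: missing_by_column[col] for col in expected_columns},
        "duplicate_rows": duplicate_rows,
        "missing_columns": [col for col in expected_columns if col not in present],
    }


# Invented rows for illustration only.
sample_rows = [
    {"invoice_no": "100001", "customer_id": "10001"},
    {"invoice_no": "100001", "customer_id": "10001"},
    {"invoice_no": "100002", "customer_id": ""},
]
summary = summarize_quality(sample_rows, ["invoice_no", "customer_id", "country"])

if __name__ == "__main__":
    print(summary)
```

Keeping the findings in a plain dictionary makes it straightforward to serialize them next to the cleaned output, which is the general idea behind writing a report at all.
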
Typical report sections include:

- raw row count;
- cleaned row count;
- missing values by column;
- duplicate row count;
- expected column checks;
- validation issue summaries;
- output file paths.

For client-style work, this is important. A cleaned output file alone is not enough. The workflow should also explain what was detected and what still needs review.

Step 4: Generate the real dataset benchmark report

After the CLI workflow finishes, generate a local benchmark report and summary outputs.

macOS / Linux:

    python scripts/run_real_dataset_benchmark.py \
      --normalized-input data/output/real_dataset/online_retail_normalized.csv \
      --quality-report data/output/real_dataset/run/quality_report.json \
      --output
