From Variant CSV to Review-Ready Report: A Python Workflow With Docker and GitHub Actions
DEV Community

Variant prioritisation often starts with a table. But a table alone does not answer the most important question: Which variants deserve closer review, and why?

The ClinVar Variant Prioritisation Workflow was built to answer that question with transparent scoring, validation, reporting, Docker, and CI.

Repository: GitHub

## Tech Stack

- Python
- pandas
- Pydantic
- matplotlib
- pytest
- Make
- Docker
- GitHub Actions
- mamba

## What the Workflow Does

The workflow takes a curated inherited-disease variant dataset and ranks variants using transparent evidence rules.

Each variant receives:

- priority score out of 100
- priority tier
- ranked output
- review recommendation

## Dataset Fields

The curated dataset includes:

- `variant_id`
- `gene`
- `chromosome`
- `position`
- `reference`
- `alternate`
- `consequence`
- `clinvar_significance`
- `review_status`
- `allele_frequency`
- `inheritance`
- `phenotype_match_score`
- `computational_score`
- `disease_area`

## Validation Layer

Before scoring, the workflow checks:

- required columns
- valid allele frequency values
- valid phenotype match score range
- valid computational score range
- record schema consistency

Pydantic is used for schema validation. This prevents the scoring logic from running on malformed records.

## Scoring Framework

The score is out of 100:

- ClinVar-style significance: 30
- Review status: 15
- Variant consequence: 20
- Allele frequency rarity: 15
- Phenotype match: 20

Priority tiers:

- `>= 80`: `high_priority`
- `60-79`: `moderate_priority`
- `40-59`: `low_priority`
- `< 40`: `minimal_priority`

This is not a clinical diagnostic score. It is a transparent prioritisation score for review.
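The weights and tier thresholds in the scoring framework can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code: the function names `priority_score` and `priority_tier` are hypothetical, and the article does not say how each record field is converted into points, so the sketch assumes the caller supplies a fraction between 0.0 and 1.0 for every component. It also assumes the tier bands are half-open, so a non-integer score such as 79.5 falls into `moderate_priority`.

```python
# Maximum points per evidence component, as listed in the scoring framework.
WEIGHTS = {
    "clinvar_significance": 30,
    "review_status": 15,
    "consequence": 20,
    "allele_frequency_rarity": 15,
    "phenotype_match": 20,
}


def priority_score(fractions):
    """Combine per-component fractions (0.0 to 1.0) into a score out of 100.

    ASSUMPTION: the mapping from a raw record to these fractions is not
    described in the article, so it is left to the caller.
    """
    missing = sorted(set(WEIGHTS) - set(fractions))
    if missing:
        raise ValueError(f"missing components: {missing}")
    total = 0.0
    for name, weight in WEIGHTS.items():
        fraction = fractions[name]
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"{name} fraction out of range: {fraction}")
        total += weight * fraction
    return total


def priority_tier(score):
    """Map a score out of 100 to one of the four tiers from the article."""
    if score >= 80:
        return "high_priority"
    if score >= 60:
        return "moderate_priority"
    if score >= 40:
        return "low_priority"
    return "minimal_priority"


# Illustrative input only; these fractions are made up for the example.
example = {
    "clinvar_significance": 1.0,
    "review_status": 0.5,
    "consequence": 1.0,
    "allele_frequency_rarity": 1.0,
    "phenotype_match": 0.5,
}
score = priority_score(example)  # 30 + 7.5 + 20 + 15 + 10 = 82.5
tier = priority_tier(score)      # "high_priority"
```

Keeping the weights in one dictionary means the maximum score stays auditable: the values sum to 100, and any change to a weight is visible in a single place.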
## Example Result

Top ranked variants from the current dataset:

| Rank | Variant | Gene | Consequence | Score |
|---|---|---|---|---|
| 1 | VAR010 | DMD | stop_gained | 99 |
| 2 | VAR001 | BRCA1 | stop_gained | 98 |
| 3 | VAR014 | FBN1 | splice_donor_variant | 96 |
| 4 | VAR019 | MLH1 | splice_acceptor_variant | 95 |
| 5 | VAR008 | SCN1A | frameshift_variant | 94 |

## Outputs

The pipeline generates:

- `results/tables/ranked_variants.csv`
- `results/tables/top_prioritised_variants.csv`
- `results/reports/top_variant_review_report.md`
- `results/figures/priority_score_distribution.png`
- `results/figures/priority_tier_counts.png`
- `results/figures/top_gene_priority_scores.png`

## Makefile Commands

    make test
    make score
    make report
    make figures
    make pipeline

The full pipeline loads data, validates records, scores variants, generates review outputs, and creates figures.

## Docker Workflow

    docker build -t clinvar-variant-prioritisation:latest .
    docker run --rm clinvar-variant-prioritisation:latest make test
    docker run --rm clinvar-variant-prioritisation:latest make pipeline

Docker exposed two real issues. First, `make` was missing inside the image. Second, the non-root container user could not overwrite files under `/app/results`. Both were fixed in the Dockerfile.

## CI Workflow

GitHub Actions runs:

- pytest test suite
- full pipeline
- expected output file checks

The workflow was also updated to opt into the Node.js 24 runtime.

## Documentation

The repository includes:

- `README.md`
- `docs/methods.md`
- `docs/limitations.md`
- `docs/evidence_map.md`
- `docs/reviewer_guide.md`
- `docs/evidence/`

## Main Takeaway

The project demonstrates how a small variant dataset can become a reproducible scientific workflow. It includes:

- validation
- transparent scoring
- ranked outputs
- review reporting
- visual analytics
- Docker reproducibility
- CI
- evidence tracking

The result is not a clinical diagnostic system. It is a professional bioinformatics workflow showing how variant prioritisation logic can be made transparent, reproducible, and review-ready.
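The article does not show how the CI job performs its expected output file checks. One way such a check could look is sketched below, using only the six output paths listed under Outputs. The helper name `missing_outputs` is hypothetical, and the sketch assumes it is pointed at the repository root after `make pipeline` has finished.

```python
from pathlib import Path

# Output paths listed in the article, relative to the repository root.
EXPECTED_OUTPUTS = [
    "results/tables/ranked_variants.csv",
    "results/tables/top_prioritised_variants.csv",
    "results/reports/top_variant_review_report.md",
    "results/figures/priority_score_distribution.png",
    "results/figures/priority_tier_counts.png",
    "results/figures/top_gene_priority_scores.png",
]


def missing_outputs(root=Path(".")):
    """Return the expected output paths that do not exist as files under root.

    HYPOTHETICAL helper: the repository's actual CI check may differ.
    """
    root = Path(root)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]
```

A CI step could call `missing_outputs()` and fail the job when the returned list is non-empty, for example by printing the missing paths and exiting with a non-zero status.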
