DEV Community

Can AI Reason From Marker Genes? Building a Single-Cell Benchmark From PBMC3k

Most single-cell RNA-seq examples end with this pattern:

- load data
- preprocess
- cluster cells
- generate UMAP
- rank marker genes
- assign cell labels

That workflow is useful, but it leaves one important part underdeveloped: the reasoning step. A cluster label is only meaningful if it is supported by marker-gene evidence.

The Single-Cell Marker Reasoning Benchmark turns that reasoning step into a reproducible benchmark.

Repository: GitHub

## What the Project Does

The project starts with PBMC3k single-cell RNA-seq data and runs a Scanpy-based analysis workflow. It then converts the marker-gene outputs into benchmark tasks.

The result is not just a single-cell analysis. It is an evaluation system for marker-gene reasoning.

## Dataset

The project uses PBMC3k through Scanpy.

- Raw dataset: 2700 cells × 32738 genes
- Processed dataset: 2694 cells × 2000 highly variable genes
- Clusters: 9 Leiden clusters

PBMC means peripheral blood mononuclear cells. These are immune cells from blood, which makes the dataset useful for marker-gene interpretation examples.

## Analysis Workflow

The workflow includes:

- PBMC3k loading
- QC and preprocessing
- normalisation
- log transformation
- highly variable gene selection
- PCA
- neighbour graph construction
- UMAP
- Leiden clustering
- marker-gene ranking
- marker filtering
- cluster annotation
- benchmark generation

## Why Marker Filtering Was Added

Raw marker-gene outputs can contain genes that are not ideal for reasoning tasks. Examples:

- RPS*
- RPL*
- MT-*
- MALAT1
- TPT1
- EEF1A1
- B2M

These genes may reflect ribosomal signal, mitochondrial signal, housekeeping expression, or broad background activity.

The project keeps raw marker outputs but also creates filtered marker tables so benchmark tasks are based on more biologically useful signals.
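The filtering idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual filter: the function names `is_uninformative` and `filter_markers`, the `top_n` parameter, and the example ranked list are all hypothetical, and the exclusion rules are simply the examples listed above.

```python
# Assumed exclusion rules, taken from the examples listed in the post.
# The project's real filter list may be longer or differ in detail.
EXCLUDED_PREFIXES = ("RPS", "RPL", "MT-")
EXCLUDED_GENES = {"MALAT1", "TPT1", "EEF1A1", "B2M"}


def is_uninformative(gene: str) -> bool:
    """Return True if a gene matches one of the assumed exclusion rules."""
    name = gene.upper()
    return name.startswith(EXCLUDED_PREFIXES) or name in EXCLUDED_GENES


def filter_markers(ranked_genes: list[str], top_n: int = 4) -> list[str]:
    """Keep the first `top_n` genes that pass the filter, preserving rank order."""
    kept = [g for g in ranked_genes if not is_uninformative(g)]
    return kept[:top_n]


# Hypothetical ranked marker list for one cluster (illustrative only).
raw_markers = ["RPS27", "MALAT1", "CD3D", "RPL32", "IL7R", "MT-CO1", "LTB", "CD3E"]
filtered = filter_markers(raw_markers)
print(filtered)  # ['CD3D', 'IL7R', 'LTB', 'CD3E']
```

Plain prefix matching is deliberately coarse. It would also drop genes such as RPS6KA1, a kinase rather than a ribosomal protein, so a real filter might prefer an explicit gene list or a stricter pattern. Keeping the raw list alongside the filtered one, as the project does, makes that kind of over-filtering easy to audit.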
## Example Cluster Annotations

| Cluster | Annotation | Marker Evidence |
|---|---|---|
| 0 | T cells | CD3D, CD3E, IL7R, LTB |
| 1 | CD14+ monocytes | LYZ, S100A8, S100A9, FCN1 |
| 2 | B cells | CD79A, CD79B, MS4A1, CD74 |
| 4 | NK cells | NKG7, GNLY, GZMB, PRF1 |
| 7 | Platelets | PPBP, PF4, GNG11, SDPR |

These are marker-derived working annotations, not experimentally validated ground truth.

## Benchmark Task Families

The project generates three benchmark task families.

### 1. Hidden Cluster Annotation

A solver receives marker genes and predicts the likely cell type.

Example: CD79A, CD79B, MS4A1, CD74

Expected interpretation: B cells

### 2. Marker Contradiction Detection

A solver checks whether marker evidence contradicts a proposed annotation.

Example:

- Claim: B cells
- Markers: NKG7, GNLY, GZMB, PRF1

The marker evidence supports NK or cytotoxic immune cells, not B cells.

### 3. Masked Marker Recovery

A solver receives partial marker evidence and recovers the likely biological identity. This tests reasoning under incomplete evidence.

## Public Tasks and Hidden Answers

The benchmark separates public task inputs from hidden answer keys.

- `benchmark_tasks/public/`
- `benchmark_tasks/hidden/`
- `benchmark_tasks/oracle_outputs/`

Current benchmark size:

- 16 public tasks
- 16 hidden answers

This separation prevents answer leakage and makes the benchmark more credible.

## Oracle Outputs

Oracle outputs provide reference-style answers. They include:

- predicted label
- supporting genes
- confidence
- rationale

This allows the benchmark to support future model or human solver evaluation.

## Validators and Scoring

The project includes:

- `src/scbench/validators.py`
- `src/scbench/scoring.py`
- `scripts/07_score_solver_answers.py`

The scoring logic checks whether answers match the expected label, include supporting evidence, and provide reasoning.
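To make those three checks concrete, here is a rough sketch of how a per-answer score could combine them. It is not the code in `src/scbench/scoring.py`: the `Answer` class, the `score_answer` function, and especially the `WEIGHTS` values are assumptions made for illustration, since the post does not show the real rubric.

```python
from dataclasses import dataclass, field


@dataclass
class Answer:
    """A solver's answer to one task (hypothetical structure)."""
    label: str
    supporting_genes: list[str] = field(default_factory=list)
    rationale: str = ""


# Hypothetical weights; the project's real scoring rubric is not shown in the post.
WEIGHTS = {"label": 0.6, "evidence": 0.3, "rationale": 0.1}


def score_answer(answer: Answer, expected_label: str, expected_genes: set[str]) -> float:
    """Score one answer on label match, gene overlap, and presence of a rationale."""
    label_ok = answer.label.strip().lower() == expected_label.strip().lower()
    cited = {g.upper() for g in answer.supporting_genes}
    # Fraction of the expected marker genes that the solver actually cited.
    overlap = len(cited & expected_genes) / len(expected_genes) if expected_genes else 0.0
    has_rationale = bool(answer.rationale.strip())
    total = (
        WEIGHTS["label"] * label_ok
        + WEIGHTS["evidence"] * overlap
        + WEIGHTS["rationale"] * has_rationale
    )
    return round(total, 3)


answer = Answer(
    label="B cells",
    supporting_genes=["CD79A", "MS4A1"],
    rationale="CD79A and MS4A1 are B-cell markers.",
)
score = score_answer(answer, "B cells", {"CD79A", "CD79B", "MS4A1", "CD74"})
print(score)
```

Under these assumed weights, an answer with the correct label but only partial gene evidence scores below 1.0. That is one way a run could reach full label accuracy while its average score stays lower.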
Sample scoring result:

- accuracy: 1.0
- average score: 0.923

## Testing and Reproducibility

The project includes:

- pytest
- Docker
- Makefile
- GitHub Actions CI
- evidence files

Current test status: 36 passed

The Docker workflow validates that the project can run in a clean container environment.

## Why This Project Is Different

A normal single-cell project usually produces:

- clusters
- UMAPs
- marker tables
- annotations

This project produces:

- clusters
- UMAPs
- marker tables
- filtered markers
- annotations
- benchmark tasks
- hidden answers
- oracle outputs
- validators
- scoring reports
- calibration assets
- Docker validation
- CI validation
- evidence documentation

## Main Takeaway

The project demonstrates how a single-cell RNA-seq workflow can serve as a benchmark system.

Instead of only asking:

> What are the clusters?

the benchmark asks:

> Can a solver justify the cell-type interpretation from marker-gene evidence?

That shift moves the project from analysis to evaluation.

