5 Agentic Workflows to Automate Your Data Science Pipeline
Introduction
The average data scientist spends roughly 45% of their working time on data preparation and cleaning - not on modeling, not on insight generation, not on the work that requires genuine judgment. That estimate keeps appearing across industry surveys because it keeps being true.
The tasks eating up that time - profiling columns, flagging nulls, running the same exploratory data analysis (EDA) scripts, grid-searching hyperparameters, and writing the same monitoring checks - are formulaic enough to follow explicit rules. That is precisely what makes them automatable with agents.
Agentic workflows do not replace the data scientist. They absorb the procedural weight so you can focus on the evaluative weight: deciding whether a model makes sense, whether a feature is genuinely informative, whether a finding warrants a business decision.
Platforms like Databricks have already started shipping agentic data science capabilities into their core infrastructure, with their Agent framework explicitly designed to "compress the time from question to insight." This is the direction production data teams are moving.
This article covers five concrete agentic workflows, one for each major stage of a data science pipeline. Each includes a real-world scenario, tested code patterns, and the design decisions that matter in production.
Prerequisites
All five workflows assume Python 3.10+ and familiarity with pandas, scikit-learn, and basic large language model (LLM) API usage. Specific package requirements are listed under each workflow.
For the tool-calling patterns, you need either an OpenAI API key or a local serving endpoint (Ollama, vLLM) that exposes an OpenAI-compatible API.
Core packages used across all workflows
pip install openai pandas numpy scipy scikit-learn lightgbm shap pydantic
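If you use a local server instead of the hosted API, the client only needs a different base URL. The sketch below is one way to wire that up; resolve_endpoint, make_client, and the LOCAL_LLM_API_KEY variable are hypothetical names introduced for illustration, and the URLs are the default local addresses for Ollama and vLLM, which may differ in your setup.

```python
import os

# Default local addresses; both servers expose an OpenAI-compatible API under /v1.
LOCAL_ENDPOINTS = {
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8000/v1",
}


def resolve_endpoint(backend: str = "openai") -> dict:
    """Return keyword arguments for the OpenAI client constructor."""
    if backend == "openai":
        # The SDK reads OPENAI_API_KEY from the environment on its own.
        return {}
    if backend not in LOCAL_ENDPOINTS:
        raise ValueError(f"Unknown backend: {backend!r}")
    return {
        "base_url": LOCAL_ENDPOINTS[backend],
        # The SDK requires a non-empty key even when the local server does not check it.
        "api_key": os.environ.get("LOCAL_LLM_API_KEY", "not-needed"),
    }


def make_client(backend: str = "openai"):
    """Build a client for the chosen backend (imports openai lazily)."""
    from openai import OpenAI

    return OpenAI(**resolve_endpoint(backend))
```

With a local backend, the model argument in each workflow also has to name a model your server actually hosts rather than gpt-4o-mini.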
Workflow 1: Automated Exploratory Data Analysis Agent
What it replaces: Manually loading data, computing summary statistics, visualizing distributions, inspecting nulls, detecting outliers, writing up findings. Every dataset, every time, the same script with different column names.
What the agent does instead: Loads the dataset, runs a full profile, flags issues by severity, and produces a structured Markdown report. A human reviews the findings and decides what to do about them. The agent handles everything before that review.
Architecture
The agent is built around two tools: profile_dataset produces summary statistics per column, and flag_issues classifies problems by severity. In a full Reasoning and Acting (ReAct) loop the model would choose when to call each tool; the code pattern below simplifies this to a fixed sequence, running both tools deterministically and then synthesizing their outputs into a structured report through a single language model call.
The key design decision is how the agent handles the flag_issues output: it reasons about which issues are actionable before reporting, so the output is a prioritized list rather than a raw dump.
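For reference, the control flow of a ReAct-style loop can be sketched without any API dependency. In this sketch, run_react_loop and scripted_policy are hypothetical names: scripted_policy stands in for the LLM call that would normally choose the next tool, and the two lambdas are toy stand-ins for profile_dataset and flag_issues.

```python
from typing import Callable


def run_react_loop(
    policy: Callable[[list[dict]], dict],
    tools: dict[str, Callable[..., object]],
    max_steps: int = 5,
) -> str:
    """Alternate between a policy decision (reason) and a tool call (act)."""
    history: list[dict] = []
    for _ in range(max_steps):
        decision = policy(history)
        if "final" in decision:
            return decision["final"]
        name = decision["tool"]
        if name not in tools:
            observation = f"error: unknown tool '{name}'"
        else:
            observation = tools[name](**decision.get("args", {}))
        # Each observation is fed back to the policy on the next step.
        history.append({"tool": name, "observation": observation})
    return "Stopped: step limit reached before a final answer."


def scripted_policy(history: list[dict]) -> dict:
    """Hypothetical stand-in for the LLM: profile first, then flag, then report."""
    called = [step["tool"] for step in history]
    if "profile_dataset" not in called:
        return {"tool": "profile_dataset"}
    if "flag_issues" not in called:
        return {"tool": "flag_issues", "args": {"profile": history[-1]["observation"]}}
    issues = history[-1]["observation"]
    return {"final": f"Report based on {len(history)} tool calls; issues: {issues}"}


# Toy tools for illustration only.
tools = {
    "profile_dataset": lambda: {"revenue": {"null_rate": 0.2}},
    "flag_issues": lambda profile: [
        col for col, s in profile.items() if s["null_rate"] > 0.15
    ],
}

report = run_react_loop(scripted_policy, tools)
print(report)
```

The step limit is the important design choice here: it bounds cost if the policy never emits a final answer.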
Code Pattern
# eda_agent.py
# Prerequisites: pip install openai pandas scipy
# Run: python eda_agent.py
import json
import pandas as pd
from scipy import stats
from openai import OpenAI
from dataclasses import dataclass
client = OpenAI() # Uses OPENAI_API_KEY env var
@dataclass
class ColumnIssue:
    column: str
    issue_type: str  # null_rate | skewness | dtype | high_correlation
    severity: str  # low | medium | high
    detail: str
def profile_dataset(df: pd.DataFrame) -> dict:
    """
    Generate per-column statistics.
    In production, swap this for ydata-profiling for richer output.
    """
    profile = {}
    for col in df.columns:
        col_stats = {
            "dtype": str(df[col].dtype),
            "null_rate": df[col].isnull().mean(),
            "n_unique": df[col].nunique(),
        }
        if pd.api.types.is_numeric_dtype(df[col]):
            col_stats["skewness"] = float(df[col].skew())
            col_stats["mean"] = float(df[col].mean())
            col_stats["std"] = float(df[col].std())
        elif df[col].dtype == "object":
            # Check whether a string column is really numeric data in disguise
            non_null = df[col].dropna()
            numeric_coerced = pd.to_numeric(non_null, errors="coerce")
            col_stats["looks_numeric"] = bool(
                len(non_null) > 0 and numeric_coerced.notna().mean() > 0.9
            )
        profile[col] = col_stats
    return profile
def flag_issues(profile: dict) -> list[ColumnIssue]:
    """
    Flag data quality issues from a column profile.
    Severity tiers: high = needs immediate attention, medium = worth reviewing.
    """
    issues = []
    for col, stats_dict in profile.items():
        null_rate = stats_dict.get("null_rate", 0.0)
        if null_rate > 0.15:
            issues.append(ColumnIssue(col, "null_rate", "high", f"{null_rate:.0%} of values are missing"))
        elif null_rate > 0.05:
            issues.append(ColumnIssue(col, "null_rate", "medium", f"{null_rate:.0%} of values are missing"))
        skewness = abs(stats_dict.get("skewness", 0.0))
        if skewness > 5.0:
            issues.append(ColumnIssue(col, "skewness", "high", f"Extreme skew={skewness:.1f} -- consider log transform"))
        elif skewness > 2.0:
            issues.append(ColumnIssue(col, "skewness", "medium", f"Moderate skew={skewness:.1f}"))
        # Object columns whose values are mostly numeric are likely miscoded
        if stats_dict["dtype"] == "object" and stats_dict.get("looks_numeric", False):
            issues.append(ColumnIssue(col, "dtype", "medium", "Numeric values stored as strings"))
    return issues
def run_eda_agent(df: pd.DataFrame, dataset_description: str) -> str:
    """
    Run the EDA pipeline: profile the data, flag issues, then ask the LLM
    to synthesize both outputs into a structured report.
    """
    profile = profile_dataset(df)
    issues = flag_issues(profile)
    # Format issues for the agent
    issues_text = "\n".join(
        f"- [{i.severity.upper()}] {i.column}: {i.issue_type} -- {i.detail}"
        for i in issues
    ) or "No issues detected."
    prompt = f"""You are a senior data scientist reviewing a dataset for a data science project.
Dataset: {dataset_description}
Shape: {df.shape[0]} rows x {df.shape[1]} columns
Column profile (summary stats):
{json.dumps(profile, indent=2)}
Detected issues:
{issues_text}
Write a structured EDA report with these sections:
1. DATASET OVERVIEW -- shape, dtypes, overall quality assessment (1-2 sentences)
2. HIGH PRIORITY ISSUES -- items requiring action before modeling
3. MEDIUM PRIORITY ISSUES -- items worth monitoring
4. RECOMMENDED NEXT STEPS -- ordered list of 3-5 specific actions
Be direct. Prioritize actionability over completeness."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # Low temperature for consistent structured output
    )
    return response.choices[0].message.content
# ── Run it ────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    # Example: retail transaction data
    import numpy as np
    np.random.seed(42)
    n = 5000
    df = pd.DataFrame({
        "revenue": np.random.exponential(scale=200, size=n),  # right-skewed
        "customer_age": np.random.normal(40, 12, n),
        "created_at": pd.date_range("2024-01-01", periods=n, freq="h").astype(str),
        "region_code": np.random.choice(["US", "EU", "APAC", None], size=n, p=[0.5, 0.3, 0.1, 0.1]),
        # About 10% of session counts are missing; np.nan keeps the column float-typed
        "session_count": np.where(np.random.rand(n) < 0.1, np.nan, np.random.poisson(3, n)),
    })
    report = run_eda_agent(df, "Retail transaction data with revenue, customer demographics, and session activity")
    print(report)
How to run
python eda_agent.py
Real scenario
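A retail dataset with 5000 rows and 5 columns. With the thresholds in flag_issues, region_code and session_count (each roughly 10% null) are flagged at medium severity, since the high tier starts above 15%. revenue is drawn from an exponential distribution, whose theoretical skewness is 2, so it lands right at the medium skew threshold and may or may not be flagged depending on the sample. A typical report recommends log-transforming revenue, imputing region codes by geography, and investigating the session_count null pattern before modeling.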
Workflow 2: Agentic Feature Engineering
What it replaces: Manually brainstorming feature ideas, writing pandas transformations, running correlation checks, testing each feature one at a time, and discarding the ones that don't help.
What the agent does instead: Reads column descriptions and the target variable, proposes candidate features with formulas and rationale, evaluates them by training a fast baseline model, prunes low-importance features, and explains which features survived and why.
Architecture
Two agents in sequence:
- Generator agent - proposes candidate features from column metadata. Uses an LLM call with structured JSON output.
- Evaluator agent - adds candidates to the dataframe, trains a LightGBM baseline, extracts feature importances, prunes below a threshold, and writes a plain-language summary of the selection.
The design decision here is separating generation from evaluation. The generator is creative and unconstrained; the evaluator is empirical and conservative. Running them in sequence avoids the common failure mode where an agent generates features it cannot validate.
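One way to keep the hand-off between the two agents robust is to check the generator's output before the evaluator touches it. The sketch below is an illustration rather than part of the original pipeline; clean_candidates is a hypothetical helper that drops entries with missing keys, names that are not snake_case, empty formulas, or duplicate names.

```python
import re

REQUIRED_KEYS = {"name", "formula", "rationale"}
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")


def clean_candidates(raw: list) -> list[dict]:
    """Keep only well-formed, uniquely named feature candidates."""
    seen: set[str] = set()
    valid: list[dict] = []
    for item in raw:
        # Every candidate must be a dict carrying all three expected keys.
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            continue
        name, formula = item["name"], item["formula"]
        if not isinstance(name, str) or not SNAKE_CASE.match(name) or name in seen:
            continue
        if not isinstance(formula, str) or not formula.strip():
            continue
        seen.add(name)
        valid.append(item)
    return valid


raw_candidates = [
    {"name": "spend_per_day", "formula": "monthly_spend / 30", "rationale": "Daily spend rate"},
    {"name": "no_formula", "rationale": "Missing a key"},
    {"name": "Bad Name", "formula": "monthly_spend * 2", "rationale": "Not snake_case"},
    {"name": "spend_per_day", "formula": "monthly_spend / 31", "rationale": "Duplicate name"},
]
print([c["name"] for c in clean_candidates(raw_candidates)])
```

Filtering here keeps the evaluator's error handling focused on formulas that fail at compute time rather than on malformed JSON.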
Code Pattern
# feature_agent.py
# Prerequisites: pip install openai pandas numpy lightgbm
# Run: python feature_agent.py
import json
import pandas as pd
import numpy as np
import lightgbm as lgb
from openai import OpenAI
client = OpenAI()
def generate_feature_candidates(
    column_descriptions: dict,
    target: str,
    task_type: str = "classification",
    n_candidates: int = 15,
) -> list[dict]:
    """
    Ask the LLM to propose candidate features given column descriptions
    and the prediction task. Returns a list of dicts with 'name', 'formula',
    and 'rationale'.
    """
    prompt = f"""You are a senior ML engineer performing feature engineering for a {task_type} task.
Target variable: {target}
Available columns:
{json.dumps(column_descriptions, indent=2)}
Propose {n_candidates} candidate engineered features that are likely to improve model performance.
For each feature, provide:
- name: a snake_case feature name
- formula: how to compute it from the available columns (pandas expression)
- rationale: one sentence on why this feature might help
Return a JSON object with a single key "features" containing an array of objects,
each with keys: name, formula, rationale.
Return ONLY valid JSON -- no explanation outside the JSON."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.4,
    )
    result = json.loads(response.choices[0].message.content)
    # Fall back to "candidates" in case the model ignores the requested key
    return result.get("features", result.get("candidates", []))
def evaluate_and_prune(
    df: pd.DataFrame,
    candidate_features: list[dict],
    target_col: str,
    importance_threshold: float = 0.01,
) -> tuple[list[str], list[str], dict[str, float]]:
    """
    Add candidate features to the dataframe, train a fast LightGBM baseline,
    extract feature importances, and prune below threshold.
    Returns (kept_features, pruned_features, importance_scores)
    """
    feature_df = df.copy()
    added = []
    for candidate in candidate_features:
        try:
            # Evaluate the formula string -- in production, use a safe eval sandbox
            feature_df[candidate["name"]] = feature_df.eval(candidate["formula"])
            added.append(candidate["name"])
        except Exception as e:
            # Formula failed -- skip this candidate
            print(f" Skipped '{candidate['name']}': {e}")
    if not added:
        return [], [], {}
    X = feature_df[added].fillna(0)
    y = df[target_col]
    model = lgb.LGBMClassifier(n_estimators=100, random_state=42, verbose=-1)
    model.fit(X, y)
    # Normalize importances so they sum to 1 and can be compared to the threshold
    importance_scores = dict(
        zip(added, model.feature_importances_ / model.feature_importances_.sum())
    )
    kept = [f for f in added if importance_scores.get(f, 0) >= importance_threshold]
    pruned = [f for f in added if importance_scores.get(f, 0) < importance_threshold]
    return kept, pruned, importance_scores
def explain_selection(
    kept: list[str],
    pruned: list[str],
    scores: dict[str, float],
) -> str:
    """Ask the agent to explain its selection decisions in plain language."""
    prompt = f"""You are reviewing feature selection results for an ML pipeline.
Features KEPT (above importance threshold):
{json.dumps({f: round(scores.get(f, 0), 4) for f in kept}, indent=2)}
Features PRUNED (below threshold):
{json.dumps({f: round(scores.get(f, 0), 4) for f in pruned}, indent=2)}
Write a 3-5 sentence summary of the selection outcome. Note any surprising prunings
or unexpected high-importance features. Suggest one additional feature worth testing
based on what survived."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    return response.choices[0].message.content
if __name__ == "__main__":
    column_descriptions = {
        "days_since_login": "Number of days since the customer last logged in",
        "plan_tier": "Subscription tier: basic, pro, or enterprise",
        "support_tickets_90d": "Number of support tickets opened in the last 90 days",
        "monthly_spend": "Customer's average monthly spend in USD",
    }
    candidates = generate_feature_candidates(
        column_descriptions,
        target="churned",
        task_type="classification",
        n_candidates=10,
    )
    # In production, load real customer data here
    np.random.seed(42)
    n = 3000
    df = pd.DataFrame({
        "days_since_login": np.random.randint(0, 90, n),
        "plan_tier": np.random.choice(["basic", "pro", "enterprise"], n),
        "support_tickets_90d": np.random.poisson(1.5, n),
        "monthly_spend": np.random.exponential(80, n),
        "churned": np.random.binomial(1, 0.15, n),
    })
    kept, pruned, scores = evaluate_and_prune(df, candidates, target_col="churned")
    summary = explain_selection(kept, pruned, scores)
    print(summary)
How to run
python feature_agent.py
Real scenario
Customer churn prediction, 12 input columns including days_since_login, plan_tier, support_tickets_90d, and monthly_spend. The agent proposes 15 candidates, including spend_per_day, tickets_per_spend_ratio, and login_recency_x_plan. After evaluation, 9 survive the importance threshold. The explanation calls out that tickets_per_spend_ratio has the highest importance score (0.18): "customers spending more who are also raising support tickets are a particularly high churn risk," which becomes a finding worth sharing with the product team.
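The evaluator in the code pattern passes LLM-written formulas to DataFrame.eval, and its inline comment recommends a safe eval sandbox for production. A lightweight pre-check, sketched here under the assumption that formulas only need arithmetic and comparisons over known columns, is to parse each formula with Python's ast module and reject anything outside a whitelist. is_safe_formula is a hypothetical helper, and this is a first line of defense, not a complete sandbox.

```python
import ast

# Node types a feature formula may contain: arithmetic, comparisons,
# boolean logic, column names, and literal constants. Calls, attribute
# access, and subscripts are deliberately excluded.
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.Compare, ast.BoolOp,
    ast.Name, ast.Load, ast.Constant,
    ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Pow, ast.Mod, ast.FloorDiv,
    ast.USub, ast.UAdd, ast.Not, ast.And, ast.Or,
    ast.Gt, ast.GtE, ast.Lt, ast.LtE, ast.Eq, ast.NotEq,
)


def is_safe_formula(formula: str, allowed_columns: set[str]) -> bool:
    """Return True if the formula uses only whitelisted syntax and known columns."""
    try:
        tree = ast.parse(formula, mode="eval")
    except (SyntaxError, ValueError):
        return False
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            return False
        if isinstance(node, ast.Name) and node.id not in allowed_columns:
            return False
    return True


columns = {"monthly_spend", "days_since_login", "plan_tier"}
print(is_safe_formula("monthly_spend / (days_since_login + 1)", columns))
print(is_safe_formula("__import__('os').system('ls')", columns))
```

Calling this check before feature_df.eval would let the evaluator skip a rejected candidate the same way it already skips formulas that raise an exception.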
Workflow 3: Agentic Hyperparameter Optimization
What it replaces: Grid search (exhaustive but wasteful), random search (efficient but dumb), and manual Bayesian optimization setup (powerful but boilerplate-heavy). All of these treat hyperparameter tuning as a search problem. An agent treats it as a reasoning problem.
What the agent does instead: Proposes a hyperparameter configuration, evaluates it by training the model, analyzes the metric trend across iterations, identifies which parameters are driving improvement, and adjusts the search direction accordingly, without being told to. It converges on a good configuration in far fewer iterations than grid or random search.
Architecture
One agent, one tool: train_and_evaluate. The tool takes a Pydantic-validated hyperparameter config, trains the model with 5-fold CV, and returns the area under the curve (AUC), training time, and the train/validation overfitting gap. The agent receives the full trial history at each step and reasons about what to try next. Convergence is detected when the last three AUC scores span less than 0.005.
This pattern is directly inspired by published research on agentic hyperparameter tuning that showed LLM-guided search outperforming Bayesian optimization on mid-sized classification tasks by 5-12% in fewer iterations.
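The convergence rule described above (the last three AUC scores spanning less than 0.005) is simple enough to sketch directly. The function name has_converged and its parameter names are assumptions made for illustration.

```python
def has_converged(
    auc_history: list[float],
    window: int = 3,
    tolerance: float = 0.005,
) -> bool:
    """Return True when the most recent `window` AUC scores span less than `tolerance`."""
    if len(auc_history) < window:
        return False  # Not enough trials yet to judge a trend
    recent = auc_history[-window:]
    return max(recent) - min(recent) < tolerance


print(has_converged([0.80, 0.85, 0.871, 0.872, 0.874]))  # last three span 0.003
print(has_converged([0.80, 0.85, 0.86]))                 # last three span 0.06
```

Using the span (max minus min) rather than the difference between consecutive scores means a single noisy trial inside the window is enough to keep the search going.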
Code Pattern
# hp_agent.py
# Prerequisites: pip install openai scikit-learn pydantic pandas numpy
# Run: python hp_agent.py
import json
from dataclasses import dataclass, field
from pydantic import BaseModel, Field, field_validator
from openai import OpenAI
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.datasets import make_classification
import numpy as np
client = OpenAI()
# ── Pydantic schema for structured tool input ─────────────────────────────────
# The model must return valid hyperparameters -- Pydantic catches invalid values
# before the training job starts, saving wasted compute on bad configs.
class HyperparamConfig(BaseModel):
    n_estimators: int = Field(..., ge=10, le=1000, description="Number of trees")
    max_depth: int = Field(..., ge=1, le=50, description="Max tree depth")
    min_samples_split: int = Field(..., ge=2, le=50, description="Min samples to split")
    # Union type so that both named strategies and float fractions validate
    max_features: str | float = Field(..., description="Feature fraction strategy: sqrt, log2, or a float between 0.1 and 1.0")

    @field_validator("max_features")
    @classmethod
    def validate_max_features(cls, v):
        if isinstance(v, str) and v not in ("sqrt", "log2"):
            raise ValueError("max_features string must be 'sqrt' or 'log2'")
        if isinstance(v, (int, float)) and not (0.1 <= v <= 1.0):
            raise ValueError("max_features float must be between 0.1 and 1.0")
        return v
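A minimal sketch of how the train_and_evaluate tool described under Architecture could be built on scikit-learn's cross_validate follows. The return keys (val_auc, train_auc, overfit_gap, train_seconds) and the synthetic dataset are assumptions made for illustration, not the article's actual implementation, and the params argument is a plain dict here so the block runs on its own.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate


def train_and_evaluate(params: dict, X, y, cv: int = 5) -> dict:
    """Train with k-fold CV and return the metrics the agent reasons over."""
    model = RandomForestClassifier(random_state=42, **params)
    start = time.perf_counter()
    scores = cross_validate(
        model, X, y, cv=cv, scoring="roc_auc", return_train_score=True
    )
    elapsed = time.perf_counter() - start
    val_auc = float(scores["test_score"].mean())
    train_auc = float(scores["train_score"].mean())
    return {
        "val_auc": round(val_auc, 4),
        "train_auc": round(train_auc, 4),
        # A large positive gap suggests the config is overfitting
        "overfit_gap": round(train_auc - val_auc, 4),
        "train_seconds": round(elapsed, 2),
    }


# Synthetic data for illustration only
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=42
)
result = train_and_evaluate(
    {"n_estimators": 50, "max_depth": 5, "min_samples_split": 2, "max_features": "sqrt"},
    X,
    y,
)
print(result)
```

In the agent loop, the dict passed as params would typically come from a validated HyperparamConfig instance via its model_dump() method, so invalid values are rejected before any training starts.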