Beyond QA/QC: Evals for Data Product Managers
A product manager recently shared their breakthrough on LinkedIn after completing Shreya Shankar & Hamel Husain's AI Evaluation course:
"Been hunting for repeatable frameworks for building AI products as a PM. The problem: AI products are non-deterministic by nature. Manual testing doesn't scale. I was hunting for a systematic way to catch issues before users do."
They felt "completely out of my depth surrounded by data scientists and engineers" but walked away with game-changing frameworks: the Three Gulfs framework for diagnosing AI failures, the Analyze-Measure-Improve cycle, and LLM-as-Judge setup. Their conclusion:
"As AI becomes core to our products, we can't just 'manage around' the complexity anymore."
This perfectly captures the challenge every data product manager faces today. Six months into rolling out our first AI-powered NLP annotation support tool, our VP of Clinical Data pulled me aside: "How do we know this isn't going to hallucinate?" Our data quality was perfect: 99.7% completeness, zero schema violations, 94.2% statistical accuracy. But she wasn't asking about data quality. She was asking whether our AI was actually helping researchers understand evidence, not just flag instances.
That conversation led me down the same rabbit hole that PM discovered. Traditional QA/QC processes weren't designed for genAI outputs. They were built for structured data, predictable patterns, and binary pass/fail scenarios. But AI outputs are probabilistic, contextual, and often subjective. They require an entirely different approach: evals (evaluations).

The $3M Reality Check: Why Perfect Data ≠ Perfect AI
Let me paint you a picture of what happens when you rely on traditional QA for AI products. A friend's team built an LLM-powered analytics assistant with 97% accuracy on their test set. Perfect data quality scores. Flawless demos. Three weeks after launch, customer complaints flooded in—the AI was hallucinating metrics and confidently presenting fiction as fact to C-suite executives making million-dollar decisions.
Here's what I've learned: Data quality and AI quality are not the same thing. You can have perfect data and terrible AI outputs. You can have messy data and surprisingly good AI performance. The relationship isn't linear, and traditional QA/QC tools don't capture this complexity.
Why Traditional QA/QC Fails for AI
This gap becomes a critical business issue across industries. I learned this the hard way when our medication dosage prediction model started recommending technically correct but clinically dangerous combinations. But it's not just healthcare—I've seen similar failures in customer segmentation AI that grouped users in ways that made statistical sense but zero business sense (marketing winter coats to customers in Miami because they'd bought scarves as gifts).
Our traditional QA/QC caught zero of these issues because:
- The data was clean (passed all validation rules)
- The model was accurate (95% precision on test data)
- The outputs were consistent (same inputs produced same outputs)
But the model hadn't been evaluated for real-world scenarios. It didn't understand context, business logic, or the subtle factors that make a "correct" recommendation valuable or dangerous.
That's when I realized we needed evals.
What Are Evals? The Bridge Between Technical Accuracy and Real-World Value
Evals (evaluations) are systematic assessments of AI model performance that go beyond technical accuracy to measure real-world value and safety.
After years of building data products, I've learned that evals are the hidden lever behind every successful AI system. While data product managers obsess over data quality metrics and model accuracy, evals quietly determine whether your AI will thrive in production or become a cautionary tale. They're also the hot topic in the PM space right now, as folks figure out how to build workflows that use genAI models.
Think of evals as driving tests for AI systems. Traditional QA/QC asks:
- Is the data complete? ✓
- Does it match the schema? ✓
- Are the calculations correct? ✓
- Is the pipeline running? ✓
Evals ask the questions that actually matter:
- Is the AI telling the truth or making things up?
- Would a human expert agree with this recommendation?
- Is this output helpful or harmful in the real world?
- Are we treating all user segments fairly?
- Can users understand and trust the AI's reasoning?
Just as you'd never let someone drive without passing their driving test, you shouldn't let AI make critical recommendations without passing rigorous evaluations. The difference is that when AI fails, it's not just a technical glitch; it's a potential business and ethical disaster.
The Core Eval Categories That Matter
Having implemented evals across multiple data products, I've found four critical evaluation categories that separate successful AI systems from failures:
1. Safety Evals: Does the AI output pose any risk? Example: Our drug interaction checker needed to identify dangerous combinations while avoiding false alarms that would reduce trust.
2. Relevance Evals: Are the AI outputs appropriate for the specific context? Example: Our diagnostic assistant needed to suggest relevant conditions based on patient symptoms, not just statistically probable ones.
3. Fairness Evals: Does the AI treat different user populations equitably? Example: Our readmission risk model needed to predict risk accurately across demographic groups without perpetuating disparities.
4. Explainability Evals: Can users understand and trust the AI's reasoning? Example: Our treatment recommendation engine needed to provide explanations that doctors could validate and communicate to patients.
Successful evals combine three approaches: human expert review (expensive but accurate), code-based validation (fast but limited), and AI-based evaluation (scalable but needs careful setup).
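Here's a minimal sketch of how those three approaches can sit side by side in practice. The function names, the JSON format check, the review threshold, and the `call_llm` placeholder are all illustrative assumptions, not a specific library's API.

```python
# Sketch only: code-based, AI-based, and human-review evals working together.
import json
import re


def call_llm(prompt: str) -> str:
    """Stand-in for whatever model client you actually use (OpenAI, Anthropic, local, etc.)."""
    raise NotImplementedError("plug in your LLM client here")


def code_based_eval(ai_output: str) -> bool:
    """Fast but limited: check the output is valid JSON with the fields downstream code expects."""
    try:
        parsed = json.loads(ai_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and {"metric_name", "metric_value", "explanation"} <= parsed.keys()


def ai_based_eval(ai_output: str, context: str) -> float:
    """Scalable but needs careful setup: ask a judge model for a 0-1 groundedness score."""
    prompt = (
        "You are a senior data analyst. Given this source context:\n"
        f"{context}\n\n"
        "Rate from 0 to 1 how well the following output is supported by that context. "
        "Reply with only the number.\n\n"
        f"{ai_output}"
    )
    return float(call_llm(prompt))


def needs_human_review(judge_score: float, threshold: float = 0.7) -> bool:
    """Expensive but accurate: route anything the judge is lukewarm about to an expert queue."""
    return judge_score < threshold
```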
The Four-Part Eval Formula That Actually Works
After burning myself repeatedly on bad evals, I've discovered that every effective eval contains exactly four components. Miss any of these, and you're back to crossing your fingers:
1. Setting the Role: Tell your evaluator exactly who they are and what expertise they bring
- Bad: "Evaluate this output"
- Good: "You are a senior data analyst with 10 years of experience in financial services evaluating ETL pipeline outputs"
2. Providing Context: Give the actual scenario, data, and constraints
- Bad: "Check if this is correct"
- Good: "Given this sales data from Q3, source system constraints, and the requirement for hourly updates..."
3. Stating the Goal: Define what success looks like in clear, measurable terms
- Bad: "Make sure it's good"
- Good: "Verify that aggregations are mathematically correct, business logic is properly applied, and no PII is exposed"
4. Defining Terminology: Eliminate ambiguity in your evaluation criteria
- Bad: "Check for quality"
- Good: "'High quality' means: accurate to source data, formatted for executive consumption, with clear data lineage"
Here's a real example that works:
**Role**: You are a data quality engineer evaluating automated insight generation.
**Context**:
- User Query: "Show me customer churn trends"
- Data Available: 24 months of customer data, transaction history
- AI Response: "Churn increased 47% in Q3, primarily driven by pricing changes"
**Goal**: Determine if the AI insight is:
1. Factually accurate based on the data
2. Statistically sound (not cherry-picking)
3. Actionable for business users
4. Free from hallucinations
**Terminology**:
- "Factually accurate": Numbers match source data calculations
- "Hallucination": Any claim not directly supported by provided data
- "Actionable": Includes enough context for business decisions
This structured approach helps ensure your AI evaluations capture the nuanced reasoning that traditional QA/QC misses entirely.
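To make the formula concrete, here's a small sketch of how you might template it in code. The dataclass, field names, and output format are my own illustration; the example values mirror the churn-insight eval above.

```python
# Sketch only: templating the four-part formula into a reusable judge prompt.
from dataclasses import dataclass


@dataclass
class EvalPrompt:
    role: str         # 1. Setting the role
    context: str      # 2. Providing context
    goal: str         # 3. Stating the goal
    terminology: str  # 4. Defining terminology

    def render(self) -> str:
        return (
            f"Role: {self.role}\n\n"
            f"Context:\n{self.context}\n\n"
            f"Goal:\n{self.goal}\n\n"
            f"Terminology:\n{self.terminology}\n\n"
            "Return JSON with keys 'verdict' (pass/fail) and 'reason'."
        )


churn_insight_eval = EvalPrompt(
    role="You are a data quality engineer evaluating automated insight generation.",
    context=(
        "User query: 'Show me customer churn trends'\n"
        "Data available: 24 months of customer data, transaction history\n"
        "AI response: 'Churn increased 47% in Q3, primarily driven by pricing changes'"
    ),
    goal=(
        "Determine whether the insight is factually accurate, statistically sound, "
        "actionable for business users, and free from hallucinations."
    ),
    terminology=(
        "'Factually accurate' means numbers match source data calculations; "
        "'hallucination' means any claim not directly supported by the provided data."
    ),
)

print(churn_insight_eval.render())
```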
Your 30-Day Eval Transformation Plan
I've helped twelve teams implement evals. The teams that succeeded all followed this pattern:
Week 1: Foundation
- Audit your last 20 AI failures and categorize by type
- Write simple evals for your top 3 failure modes (regex, rule-based; see the sketch after this plan)
- Add evals to your deployment pipeline with basic alerting
Week 2: Scale
- Export 100 production examples and get 5 people to label them
- Write your first LLM-as-judge eval (aim for 85%+ agreement with humans)
- Connect evals to your CI/CD and data quality dashboard
Week 3-4: Production
- Eval 10% of production traffic and compare with user feedback
- Build eval→fix→measure workflow
- Optimize slow evals and plan your next expansion
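For the Week 1 step above, a rule-based eval can be as small as a few regexes. This sketch assumes three made-up failure modes (unsupported percentages, "check the logs" non-answers, leaked email addresses); swap in the top three from your own failure audit.

```python
# Sketch only: simple rule-based evals for three illustrative failure modes.
import re


def no_unsupported_percentages(output: str, allowed: set[str]) -> bool:
    """Pass only if every percentage the AI quotes appears in the source data."""
    return set(re.findall(r"\d+(?:\.\d+)?%", output)) <= allowed


def no_non_answers(output: str) -> bool:
    """Fail vague deflections like 'check the server logs'."""
    return not re.search(r"check the (server )?logs", output, re.IGNORECASE)


def no_pii_leak(output: str) -> bool:
    """Fail anything that looks like an email address in the output."""
    return not re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", output)


sample = "Churn increased 47% in Q3. For details, check the server logs."
print(no_unsupported_percentages(sample, allowed={"47%"}))  # True
print(no_non_answers(sample))                               # False -> flag for review
print(no_pii_leak(sample))                                  # True
```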
Connecting Evals to Your Data Stack (Without Starting from Scratch)
Here's the beautiful thing: as a data PM, you already have 80% of what you need for great evals. You just need to connect the pieces differently.
Your Secret Weapons Already in Place
1. Your Data Pipeline = Your Eval Pipeline
Remember that Airflow DAG you use for ETL? Add eval steps:
```python
# Your existing pipeline
extract_data >> transform_data >> load_to_warehouse

# Becomes
extract_data >> transform_data >> run_evals >> load_to_warehouse
run_evals >> alert_team  # alert_team fires only if evals fail
```
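If it helps to see the shape of it, here's a minimal sketch of that run_evals gate as a real DAG, assuming a recent Airflow 2.x, PythonOperator tasks, and an eval suite that raises on blocking failures. The task names and placeholder callables are illustrative, not prescriptive.

```python
# Sketch only: an ETL DAG with an eval gate before loading (recent Airflow 2.x assumed).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule


def _run_evals():
    # Placeholder: run your code-based and LLM-as-judge evals against the fresh batch.
    failures: list[str] = []
    if failures:
        # Raising fails this task, which skips the load and triggers the alert task.
        raise ValueError(f"{len(failures)} blocking evals failed: {failures}")


with DAG(
    dag_id="etl_with_evals",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_data = PythonOperator(task_id="extract_data", python_callable=lambda: None)
    transform_data = PythonOperator(task_id="transform_data", python_callable=lambda: None)
    run_evals = PythonOperator(task_id="run_evals", python_callable=_run_evals)
    load_to_warehouse = PythonOperator(task_id="load_to_warehouse", python_callable=lambda: None)
    alert_team = PythonOperator(
        task_id="alert_team",
        python_callable=lambda: print("Evals failed: paging the data team"),
        trigger_rule=TriggerRule.ONE_FAILED,  # runs only when an upstream task fails
    )

    extract_data >> transform_data >> run_evals >> load_to_warehouse
    run_evals >> alert_team
```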
2. Your BI Tools = Your Eval Dashboards
Stop building custom eval dashboards. We wasted three weeks on a fancy React dashboard before realizing Metabase worked perfectly:
- Connected eval results table to Metabase
- Built standard metrics dashboards
- Set up alerts on metric degradation
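The "eval results table" piece is nothing exotic. Here's a hedged sketch, assuming a SQLAlchemy-compatible warehouse URL and a table called eval_results that Metabase (or any BI tool) queries like any other table; the column names are illustrative.

```python
# Sketch only: append eval results to a warehouse table your BI tool can query.
from datetime import datetime, timezone

import pandas as pd
from sqlalchemy import create_engine


def write_eval_results(results: list[dict], warehouse_url: str) -> None:
    """One row per eval per run, so dashboards can trend pass rates over time."""
    df = pd.DataFrame(results)
    df["evaluated_at"] = datetime.now(timezone.utc)
    engine = create_engine(warehouse_url)
    df.to_sql("eval_results", engine, if_exists="append", index=False)


# Illustrative rows produced by an eval suite
write_eval_results(
    [
        {"eval_name": "no_hallucinated_metrics", "passed": True, "score": 1.0},
        {"eval_name": "insight_is_actionable", "passed": False, "score": 0.4},
    ],
    warehouse_url="postgresql://user:password@warehouse:5432/analytics",
)
```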
3. Your Data Quality Tools = Your Eval Framework
Great Expectations, dbt tests, Datafold—whatever you're using for data quality can power your evals:
```sql
-- dbt test that became our most valuable eval
-- tests/assert_no_hallucinated_metrics.sql
SELECT
    ai_output.metric_name,
    ai_output.metric_value
FROM {{ ref('ai_generated_insights') }} ai_output
LEFT JOIN {{ ref('actual_metrics') }} actual
    ON ai_output.metric_name = actual.metric_name
WHERE actual.metric_name IS NULL
```
This simple test caught our AI inventing metrics that sounded plausible but didn't exist.
Case Study: How Evals Saved Our Data Pipeline
We built an AI that monitored data pipelines with 94% accuracy. The problem? It kept telling engineers to "check the server logs" for every issue. Technically correct, completely useless.
Traditional QA: Response time ✓, Format compliance ✓, Error rate ✓
User satisfaction: "This is worthless"
Our Eval: Measured whether diagnoses actually helped engineers fix issues—identifying root causes, providing specific next steps, and estimating fix time.
Results: 67% of responses were useless "check the logs" variations. After fixing based on eval feedback: 78% now include specific root causes, 89% provide actionable next steps, and engineer satisfaction jumped from 3.2/10 to 8.7/10.
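For illustration, here's roughly how a usefulness rubric like that could be scored. The criteria come from the eval described above, but the weights and names are my own assumptions; in practice a human grader or an LLM-as-judge decides whether each criterion is met.

```python
# Sketch only: weighted rubric for "did this diagnosis actually help an engineer?"
DIAGNOSIS_RUBRIC = {
    "identifies_root_cause": 0.5,       # e.g., "schema change in the upstream orders table"
    "gives_specific_next_steps": 0.25,  # e.g., "rerun the backfill for June onward"
    "estimates_fix_time": 0.25,         # e.g., "about 30 minutes"
}


def score_diagnosis(criteria_met: dict[str, bool]) -> float:
    """Weighted score in [0, 1]; low scores are the 'check the logs' non-answers."""
    return sum(weight for name, weight in DIAGNOSIS_RUBRIC.items() if criteria_met.get(name))


print(score_diagnosis({"identifies_root_cause": True, "gives_specific_next_steps": True}))  # 0.75
```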
Key Learning: Your evals should measure what users care about, not what's easy to measure.
Key Learning: Evals Are About User Trust, Not Technical Validation
After a year of implementing AI evaluations, here's what I wish I'd understood from the beginning: Evals aren't primarily a technical challenge—they're a user trust-building exercise.
The goal isn't to achieve perfect scores on evaluation metrics. The goal is to build enough confidence in your AI that users will actually rely on it to make better decisions.
The Trust Equation:
- Technical accuracy gets you to the table
- Business relevance gets you adopted
- Safety evaluations keep you there
- Explainability evaluations build long-term trust
Instead of asking: "How accurate is our model?"
Ask: "Do users trust this AI enough to act on its recommendations?"
Instead of asking: "What's our F1 score?"
Ask: "Are business outcomes improving when users engage with our AI?"
The most successful AI implementations focus relentlessly on user trust metrics rather than pure technical performance.
Your Next Steps: Start Your First Eval This Week
This Week:
- Pick your highest-risk AI output
- Interview 3 stakeholders about their biggest AI concerns
- Collect 20 examples where users accepted/rejected AI recommendations
Next Week:
- Write your first eval using the 4-part framework
- Test it against your examples (aim for 90% agreement with experts)
- Set up basic monitoring
Week 3-4:
- Deploy automated evaluation on live data
- Create a simple dashboard for stakeholders
- Plan your next evaluation category
The Tools That Actually Work (links and refs below)
Start with these:
- Phoenix by Arize: Free, works out of the box
- Evidently AI: Great for data drift + eval monitoring
- Your existing data tools: dbt, Airflow, Great Expectations
If you have budget:
- LangSmith: Incredible for debugging LLM apps
- Weights & Biases: If you need experiment tracking too
Avoid: Building your own eval framework from scratch, over-engineered solutions, or anything that requires changing your entire workflow.
The Bottom Line
Traditional QA/QC got us this far, but it's not enough for AI-powered data products. The stakes are too high, the outputs too complex, and the trust requirements too demanding.
Evals bridge the gap between technical accuracy and business value. They help you answer the questions that matter: Is this AI actually helping? Can we trust it? Will users adopt it?
If you're not evaluating your AI outputs beyond technical metrics, you're not managing risk, you're managing luck. And in business, luck isn't a strategy.
Key Takeaways
- Traditional QA/QC is necessary but not sufficient for AI-powered data products
- Evals are about trust-building, not just technical validation
- Start with user concerns, not technical metrics
- Safety, relevance, fairness, and explainability are the core eval categories
- Implementation should be gradual: follow the 30-day transformation plan
- Use your existing data stack as your eval foundation
- Success is measured by user adoption, not technical scores
What's your biggest concern about AI evaluation in your data products? How are you currently handling the gap between technical accuracy and user trust? I'd love to hear your experiences and challenges.
References
Essential Reading:
- Beyond Vibe Checks: A PM's Complete Guide to AI Evals - Lenny's Newsletter
- Prompt Optimization Guide - Arize
- Your Product Needs Evals - Hamel Husain
- The METRIC Framework for Healthcare AI Data Quality - npj Digital Medicine (2024)
Key Tools:
- Arize AI Phoenix - Open-source LLM observability and evaluation
- Evidently AI - ML monitoring and data drift detection
- LangSmith - LangChain's evaluation and debugging platform