2025-10-31
Episode 62: Practical AI at Work: How Execs and Developers Can Actually Use LLMs
Many leaders are trapped between chasing ambitious, ill-defined AI projects and the paralysis of not knowing where to start. Dr. Randall Olson argues that the real opportunity isn't in moonshots, but in the "trillions of dollars of business value" available right now. As co-founder of Wyrd Studios, he bridges the gap between data science, AI engineering, and executive strategy to deliver a practical framework for execution.
In this episode, Randy and Hugo lay out how to find and solve what might be considered "boring but valuable" problems, like an EdTech company automating 20% of its support tickets with a simple retrieval bot instead of a complex AI tutor. They discuss how to move incrementally along the "agentic spectrum" and why treating AI evaluation with the same rigor as software engineering is non-negotiable for building a disciplined, high-impact AI strategy.
They talk through:
- How a non-technical leader can prototype a complex insurance claim classifier using just photos and a ChatGPT subscription.
- The agentic spectrum: Why you should start by automating meeting summaries before attempting to build fully autonomous agents.
- The practical first step for any executive: Building a personal knowledge base with meeting transcripts and strategy docs to get tailored AI advice.
- Why treating AI evaluation with the same rigor as unit testing is essential for shipping reliable products.
- The organizational shift required to unlock long-term AI gains, even if it means a short-term productivity dip.
Episode 61: The AI Agent Reliability Cliff: What Happens When Tools Fail in Production
Most AI teams find their multi-agent systems devolving into chaos, but ML Engineer Alex Strick van Linschoten argues they are ignoring production reality. In this episode, he draws on insights from the LLMOps Database (750+ real-world deployments at the time; now nearly 1,000!) to show how to systematically measure and engineer constraint, turning unreliable prototypes into robust, enterprise-ready AI.
Drawing from his work at ZenML, Alex details why success requires scaling down and enforcing MLOps discipline to navigate the unpredictable "Agent Reliability Cliff." He provides the essential architectural shifts, evaluation hygiene techniques, and practical steps needed to move beyond guesswork and build scalable, trustworthy AI products.
We talk through:
- Why "shoving a thousand agents" into an app is the fastest route to unmanageable chaos
- The essential MLOps hygiene (tracing and continuous evals) that most teams skip
- The optimal (and very low) limit for the number of tools an agent can reliably use
- How to use human-in-the-loop strategies to manage the risk of autonomous failure in high-sensitivity domains
- The principle of using simple Python/RegEx before resorting to costly LLM judges
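As a rough illustration of that last point, cheap deterministic checks can run first, with an LLM judge reserved for outputs that pass them. The sketch below is an assumption made for illustration and is not taken from the episode: the names `cheap_checks` and `evaluate`, the specific rules, and the `llm_judge` callable are all hypothetical.

```python
import re
from typing import Callable, Optional

# Hypothetical rules for a support-bot reply; real rules depend on the product.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
REFUSAL_RE = re.compile(r"\bas an ai (language )?model\b", re.IGNORECASE)


def cheap_checks(reply: str, max_chars: int = 1200) -> list[str]:
    """Return the names of the deterministic checks that the reply fails."""
    failures = []
    if not reply.strip():
        failures.append("empty_reply")
    if len(reply) > max_chars:
        failures.append("too_long")
    if EMAIL_RE.search(reply):
        failures.append("leaks_email_address")
    if REFUSAL_RE.search(reply):
        failures.append("boilerplate_refusal")
    return failures


def evaluate(reply: str, llm_judge: Optional[Callable[[str], bool]] = None) -> dict:
    """Run cheap checks first; call the (costly) judge only if they all pass."""
    failures = cheap_checks(reply)
    if failures:
        return {"passed": False, "failures": failures, "judge_called": False}
    if llm_judge is None:
        return {"passed": True, "failures": [], "judge_called": False}
    verdict = llm_judge(reply)
    return {
        "passed": verdict,
        "failures": [] if verdict else ["llm_judge"],
        "judge_called": True,
    }
```

In this arrangement a reply that fails a regex rule never reaches the judge, so the expensive call is spent only on cases the simple rules cannot decide.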
Episode 60: 10 Things I Hate About AI Evals with Hamel Husain
Most AI teams find "evals" frustrating, but ML Engineer Hamel Husain argues they’re just using the wrong playbook. In this episode, he lays out a data-centric approach to systematically measure and improve AI, turning unreliable prototypes into robust, production-ready systems.
Drawing from his experience getting countless teams unstuck, Hamel explains why the solution requires a "revenge of the data scientists." He details the essential mindset shifts, error analysis techniques, and practical steps needed to move beyond guesswork and build AI products you can actually trust.
We talk through:
- The 10(+1) critical mistakes that cause teams to waste time on evals
- Why "hallucination scores" are a waste of time (and what to measure instead)
- The manual review process that finds major issues in hours, not weeks
- A step-by-step method for building LLM judges you can actually trust
- How to use domain experts without getting stuck in endless review committees
- Guest Bryan Bischof's "Failure as a Funnel" for debugging complex AI agents
If you're tired of ambiguous "vibe checks" and want a clear process that delivers real improvement, this episode provides the definitive roadmap.
Episode 59: Patterns and Anti-Patterns For Building with AI
John Berryman (Arcturus Labs; early GitHub Copilot engineer; co-author of Relevant Search and Prompt Engineering for LLMs) has spent years figuring out what makes AI applications actually work in production. In this episode, he shares the “seven deadly sins” of LLM development — and the practical fixes that keep projects from stalling.
From context management to retrieval debugging, John explains the patterns he’s seen succeed, the mistakes to avoid, and why it helps to think of an LLM as an “AI intern” rather than an all-knowing oracle.
We talk through:
- Why chasing perfect accuracy is a dead end
- How to use agents without losing control
- Context engineering: fitting the right information in the window
- Starting simple instead of over-orchestrating
- Separating retrieval from generation in RAG
- Splitting complex extractions into smaller checks
- Knowing when frameworks help — and when they slow you down
A practical guide to avoiding the common traps of LLM development and building systems that actually hold up in production.
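One way to picture "splitting complex extractions into smaller checks" is to replace a single large extraction prompt with several narrow yes/no questions. This is a minimal sketch under stated assumptions, not John's implementation: the `CHECKS` questions, the `run_checks` function, and the `ask` callable (standing in for whatever model client is in use) are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical narrow questions; each one is easy to test and debug in isolation.
CHECKS: Dict[str, str] = {
    "mentions_refund": "Does the message ask for a refund? Answer yes or no.",
    "mentions_deadline": "Does the message mention a deadline? Answer yes or no.",
    "is_angry": "Is the customer's tone angry? Answer yes or no.",
}


def run_checks(message: str, ask: Callable[[str], str]) -> Dict[str, bool]:
    """Ask one narrow question per field instead of one large extraction prompt."""
    results = {}
    for name, question in CHECKS.items():
        # `ask` is an assumed callable that sends a prompt and returns the reply text.
        answer = ask(f"{question}\n\nMessage:\n{message}")
        results[name] = answer.strip().lower().startswith("yes")
    return results
```

Because each field has its own prompt, a wrong answer points to one question rather than to an opaque all-in-one extraction.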
Episode 58: Building GenAI Systems That Make Business Decisions with Thomas Wiecki (PyMC Labs)
While most conversations about generative AI focus on chatbots, Thomas Wiecki (PyMC Labs, PyMC) has been building systems that help companies make actual business decisions. In this episode, he shares how Bayesian modeling and synthetic consumers can be combined with LLMs to simulate customer reactions, guide marketing spend, and support strategy.
Drawing from his work with Colgate and others, Thomas explains how to scale survey methods with AI, where agents fit into analytics workflows, and what it takes to make these systems reliable.
We talk through:
- Using LLMs as “synthetic consumers” to simulate surveys and test product ideas
- How Bayesian modeling and causal graphs enable transparent, trustworthy decision-making
- Building closed-loop systems where AI generates and critiques ideas
- Guardrails for multi-agent workflows in marketing mix modeling
- Where generative AI breaks (and how to detect failure modes)
- The balance between useful models and “correct” models
If you’ve ever wondered how to move from flashy prototypes to AI systems that actually inform business strategy, this episode shows what it takes.
