Scan to access slides or ask your questions

Thinking Outside The Box Bot 🤯

Beyond simple LLM completions and building robust, production-ready Al agents

🥸 About Me

Entrepreneur 💼

Deep tech and emerging technologies
Innovation in Business

Engineer 🖥️

Working in AI since 2017 (own startup)
Large scale deployments

Socials 🐊

Twitter: @rozeappletree
GitHub: @rozeappletree
LinkedIn: /in/asapanna-rakesh

🤔 What is This Talk About

Agents that survive contact with real users

Flashy demo vs. Trustworthy agent
Pattern-first, not framework-first
Progressive from failure to production ready

🤔 What is This Talk About

Other foundational concepts

Anatomy of Agents
When to use / NOT use Agents
First principles of production ready agent

🙅‍♀️ Not About

✅❌ LangGraph, CrewAI, the Agents SDK, ADK, and others ...

✅❌ Prompt Engineering ...

✅❌ Just Theory, Research Survey

BUILDING

PRODUCTION

👀 Who is This Talk For?

🧑‍💻 Engineers building real autonomy
🧓 Evaluating agent frameworks for production
🧑‍🎓 Cutting edge of applied AI

🤩 Talk will cover

🔹 Foundations, first principles, anotamy of an agent
🧭 Systematic and programmatic approach to the agent's system prompt
🤖 LLM from text generator into an agent that can act on the external world
👥 Introduce collegues to a sharp agent as it hits the wall (clean, cheap and fast)
🔧 Unpredictable toy into a reliable, engineering-grade system (graph, state, memory)
🤝 multi agent collaboratoin & complex team workflows
⚖️ Making sure not at mercy of any single provider
🌐 Out of isolated islands - MCP, A2A, A2UI, AP2
🧩 Monolithic prompts to modular, declarative skills changes how you manage agents
📜 "Spec-first" discipline, machine-readable contract before any code is executed
🧠 Solving fundamental flaw in stateless LLMs (Memory / IR)
⚠️ God-mode of agent capabilities, and keeping it from blowing up in your face
👀 Expand the agent's sensory inputs to visual and auditory world
🛡️ As agents grow in power and autonomy, we must now turn to guardrails (safety & reliability)
🧮 How do we mathematically prove this thing is actually doing its job?
🕰️ Agent as a Service (daemon loop, the watchdog, the heartbeat, the restart policy)

🌱 Foundations

Beyond Completions

Prompt ➡️ LLM ➡️ Response
Useful, but not agent 🤖

🌱 Foundations

Anatomy of an Agent

Fundamentally a constrained loop
Underpins all designs 📌
Understand ➡️ Constrain ➡️ Trust

🌱 Foundations

Stages of Loop

Stage	⚙️🛠️ What Happens	Example
Perceive / Reflect	📖 Agent reads user input, tool results, or environment	"Find recent papers on LLM agents"
Plan / Think	🧠 LLM reasons about the next step	"I should search arXiv first"
Act	Agent calls a 🔨 tool or produces 📄 output	search_arxiv(query="LLM agents")
Observe	Agent 👀 reads the tool's return value 📃	10 papers returned, titles listed

Keeps going until:
1. Final answer ✅
2. Maximum iteration 🙀
3. Exceptions 😞➡️😄

🌱 Foundations

🍀 Four Patterns Matter

Not all agents loop the same way
1. ReAct (Reasoning + Acting)
2. Chain-of-Thought (CoT)
3. Reflection
4. Plan-and-Execute

🌱 Foundations

🍀 Four Patterns Matter

ReAct (Reasoning + Acting)
- Workhorse 🏇
- Think out loud ➡️ do ➡️ observe result ➡️ next move
- Usecases involving step by step processes (eg. ❓ QnA, 🔎 research)

🌱 Foundations

🍀 Four Patterns Matter

Chain-of-Thought (CoT)
- Bureaucrat 👴
- It plans (everything) first, acts once
- Usecases where full plan should be visible (e.g., 😥 Complex reasoning, 📐 math, or 🕸️ multi-step logic)

🌱 Foundations

🍀 Four Patterns Matter

Reflection
- Scientist🧑‍🔬
- Two phase loop: Generate, Critique, Revise ♻️
- Usecases where first drafts improve with self-review. (e.g., ✒️ Writing, 👨‍💻 Code generation)

🌱 Foundations

🍀 Four Patterns Matter

Plan and Execute
- Beurocrat riding a workhorse 🎠
- Create full plan upfront, then executes each step one by one.
- Usecases where scope is extremely clear and don't need mid-process adjustments.

🌱 Foundations

🛠️ Tools 💾 Memory 🏠 State

Regardless of 4️⃣ patterns, agents needs these 3️⃣ things

🌱 Foundations

🛠️ Tools

Tool is a 🐍 function the agent can call
Tools let agents "DO THINGS" (eg. Search the web 🔎, Query a database 🗄️, Run code 🏃‍♂️)
Without tools, an agent is just a 🗣️ chatbot with delusions of competence 🤦

🌱 Foundations

💾 Memory

Without memory, every conversation starts from scratch
1. Short-term memory 🧠 (within the conversation in message list)
2. Long-term memory 🗄️ (across conversations in vector store)
3. Episodic memory 🛠️ (all past interactions, preferences and experiences)

🌱 Foundations

🏠 State

Agent's current situation
Conversation history (simple ReAct) / Structured object (complex graph-based agent)

🌱 Foundations

🙅‍♀️ When NOT to use an Agent ⚠️

💪 Powerful, but they're also 🐢 slow, 💸 expensive, and 🥲 unpredictable
👎 Flowchart with no branches
Reach for a plain pipeline when:
- Steps are fixed ✅
- Order must be strict 🔒
- Latency matters ⚡
- Simple CRUD 🧾
Reach for an agent when:
- Next step depends on result 🔁
- Request is unclear ❓
- Tools unknown up-front 🧰
- Needs self-correction 🔄

🌱 Foundations

Minimal Tool Calling Loop (OpenAI)

🌱 Foundations

Minimal Tool Calling Loop (Gemini)

🌱 Foundations

Simple ReAct Prompt for Reasearch Agent

🌱 Foundations

Simple ReAct loop for Reasearch Agent

🌱 Foundations

⬇️ Summary

🔄 AI agents are fundamentally constarined loops, not one-time LLM responses.
🧠 Core agent loop: Perceive → Plan → Act → Observe.
🏗️ This loop forms the foundation of reliable agent architectures
⚡ Key design patterns: ReAct, Chain-of-Thought, Reflection, and Plan-and-Execute
🎯 Each pattern has specific strengths and use cases.
🛠️ Tools enable agents to interact with external systems.
💾 Memory provides persistence and context across tasks.
📌 State management keeps agent behavior structured and reliable.
⚖️ Choose complex agents only when the problem requires them.
🚀 For simpler tasks, a code pipeline is often faster and more efficient.
✅ Mastering these fundamentals prepares you to build production-ready AI agents.

🧠 Practical LLM breakdown as of May 2026

Heavy model for everything — you'll burn through your budget on tasks that light weight model handles
Light model for architecture decisions — the cost savings aren't worth the quality drop

Prompt Architecture for Agents

Treat agent prompts as code, not chat.

"be helpful." ➡️ Operating manual with role, rules, data, and output format
Structure, precision, edge case handling, and version control

Prompt Architecture for Agents

Four layer prompt architecture

System Identity: Who is the agent? Define role, personality, and hard boundaries — what it will and won't do.
Instructions Operational: rules, available tools, workflow steps, error handling, and escalation paths.
Context Injection: Dynamic runtime data — user profile, session history, current time, conversation summaries.
Output Constraints: Enforce exact response format (e.g. JSON schema) so downstream systems can parse reliably.

Prompt Architecture for Agents

Persona Design

Persona is an engineering decision, not decoration
Research consistently shows that LLMs adjust their reasoning depth, formality, and risk tolerance based on the role they're given. Choosing a persona is an engineering decision with measurable consequences.
Watch out: the "thorough" persona often loses on accuracy because it tries too hard and starts inventing facts to fill gaps. Always benchmark variants against real test cases.

Prompt Architecture for Agents

Few-Shot Examples

Show, don't tell
Instead of trying to describe what you want — which gets messy fast — you just show it.
Skip few-shot when the prompt is already long (each example adds tokens and cost) or the task is simple enough that instructions alone do the job.

Prompt Architecture for Agents

Versioning & Testing

Treat prompts like production code
The hidden failure mode: Testing only the "happy path" means silent bugs lurk in production. Every edge case in real usage must be covered by eval cases — including expected refusals.
"Prompts need version control just like code. When you change a prompt, old bugs come back. When you don't track which version is running in production, debugging becomes guesswork."
Regression test cases should cover: required tool usage, keyword presence in answers, max tool-call efficiency limits, and expected refusals for out-of-scope queries.

Prompt Architecture for Agents

Best Practices

Treat agent prompts as code, not chat
Constraints make agents more reliable, not less
Test the unhappy path - include expected refusals and edge cases
Persona choices are measurable
Match thinking budget to task difficulty

Tools, Skills, and Structured Outputs

Tool Design

All major LLM providers use JSON Schema underneath
👉 Abstract away vendor-specific wrappers rather than maintaining separate tool definitions per provider
Build atomic tools that do exactly one thing
👉 If you're writing "web_search_and_summarize", stop — you've created a hidden agent inside a tool. Split it and let the agent's main loop handle orchestration

Tools, Skills, and Structured Outputs

Structured Outputs

Never parse LLM responses with regex
👉 Use Pydantic schemas and tell the model to return matching JSON — if it doesn't validate, retry
Never return a raw string from a complex tool
👉 Return a structured "SkillResult" object with success, message, data, snapshot, and error_code fields so failures are programmatically distinguishable

Tools, Skills, and Structured Outputs

Skill Organization

Package related tools into reusable Skills
👉 With shared config, API keys, rate limits, and error handling — eliminates copy-paste sprawl across agents
Use a central SkillRegistry for tool discovery and dispatch
👉 Rather than hardwiring tools into each agent

Tools, Skills, and Structured Outputs

Safety & Reliability

Assume inputs are malicious
👉 Validate file paths are within a sandbox before reading; run static AST analysis before executing code
Build retry-with-backoff and fallback strategies
👉 Before you ship, not after the first outage
Use declarative Markdown skills (SKILL.md files with YAML frontmatter)
👉 Separate behavior from infrastructure — keeps Python logic clean and lets non-engineers tune agent behavior

Tools, Skills, and Structured Outputs

The core mindset shift

Stop building "clever scripts" and build a Capability Library — atomic, testable, sandboxed, and framework-agnostic.

Handoffs and Routines

When to use handoffs (and when not to)

Use handoffs when each sub-task is independent
👉 one agent can fully handle a request, and errors are self-contained.
Use graphs or crews when tasks depend on each other
👉 gents need to collaborate on the same output, or failures require coordinated recovery.

Handoffs and Routines

Start simple, add complexity only when justified

most expensive mistake in agent design is reaching for LangGraph or CrewAI when a simple fan-out would do
👉 Costly in both engineering time and runtime. Don't overcomplicate routing until you have a concrete reason to.

Handoffs and Routines

Keep the router's job narrow

The triage agent's instructions make this explicit
👉 Your ONLY job is to route... Do NOT try to answer questions yourself. Just route."
A router that starts answering questions is no longer a router.

Handoffs and Routines

Scope specialist agents tightly

Each specialist should have narrow instructions and know when to hand back
👉 e.g., "If the question is not about billing, hand off to the triage agent."
This prevents agents from overreaching

Handoffs and Routines

Use context compression for long trajectories

Without compressing state when handing off between agents, you'll hit context window limits
👉 execution history into a compact block before handoff is a production-readiness requirement, not an optimization.

Handoffs and Routines

Use embedding-based routing to cut costs

Cheap local sentence embeddings with cosine similarity can route requests without an LLM API call
👉 avoiding latency and cost overhead on every incoming message.

Handoffs and Routines

Know the hard limit of the pattern

Handoffs break the moment one agent's output needs to feed into another's input, or a human approval step is required mid-flow.
That's the signal to introduce state and graph-based architecture — not before.

Stateful Agent Graphs

Retry needs Graph

Model agents as stateful directed graphs, not linear chains
👉 Linear designs like ReAct loops can't self-correct or retry
👉 The moment you need a retry, you need a graph with explicit nodes and edges for Try, Evaluate, and Retry logic

Stateful Agent Graphs

The question isn't if production agents fail, but how gracefully.

Design for failure explicitly
👉 Errors are not edge cases. Model recovery as first-class edges in the graph
👉 Retry paths, escalation paths, and human handoff paths

Stateful Agent Graphs

Use retry limits with escalation

Cap retries (e.g., retry_count < 2) and route beyond that limit to a human or escalation node
👉 Never let an agent loop indefinitely

Stateful Agent Graphs

Add human-in-the-loop checkpoints before mutable operations

Use interrupt_before on any node that writes, posts, or changes state in the real world.
👉 Let a human review sensitive plan before the agent acts on it

Stateful Agent Graphs

Use durable checkpointers in production

MemorySaver is fine for development, but production systems need SQLite (single machine) or PostgreSQL (distributed) so state survives process restarts.

Stateful Agent Graphs

Match the tool to the complexity

Use lightweight frameworks like smolagents for quick prototypes and one-off scripts.
Reach for LangGraph only when you need production-grade state management, persistence, and complex control flow.

Stateful Agent Graphs

Keep state as a shared, typed dictionary

Flow a single TypedDict state through the entire graph
👉 so every node reads and writes to a consistent, inspectable structure
👉 makes debugging and time-travel replay practical

Multi-Agent Collaboration

On When to Use Multi-Agent Systems

architecture is powerful, but the coordination tax is brutal.
👉 More agents means more latency, explosive API costs, hallucination cascading, and actual nightmares when trying to debug which agent screwed up first
The core principle: only use multiple agents when roles are genuinely distinct — don't add agents just because you can.

Multi-Agent Collaboration

On Debugging & Predictability

Assembly Line Patter
👉 CrewAI's explicit task pipelines are easier to debug because execution order is predictable, versus conversational frameworks where emergent flow is harder to trace. Design for debuggability from the start.

Multi-Agent Collaboration

More...

On High-Stakes Accuracy
👉 Dbeate Protocol: If you are building an agent focused on factual accuracy (like a medical or legal assistant), don't just rely on a single ReAct loop. Put the output through a debate protocol before showing it to the user
👉 Use a Generator → Challenger → Revise cycle until the challenger runs out of objections or you hit max rounds.
On Governance & Agentic Actions
👉 Agents that can take real-world actions (messaging people, raising approvals, closing tickets) need hard interrupt-before-mutate checkpoints — not polite post-action email summaries.
On Resilience
👉 Hard-locking a multi-agent crew to a single API provider is a critical flaw. Fallback logic across providers isn't optional — it's a hard requirement for production systems.

⏭️ Next steps...

🔹 Foundations, first principles, anotamy of an agent
🧭 Systematic and programmatic approach to the agent's system prompt
🤖 LLM from text generator into an agent that can act on the external world
👥 Introduce collegues to a sharp agent as it hits the wall (clean, cheap and fast)
🔧 Unpredictable toy into a reliable, engineering-grade system (graph, state, memory)
🤝 multi agent collaboratoin & complex team workflows
⚖️ Making sure not at mercy of any single provider
🌐 Out of isolated islands - MCP, A2A, A2UI, AP2
🧩 Monolithic prompts to modular, declarative skills changes how you manage agents
📜 "Spec-first" discipline, machine-readable contract before any code is executed
🧠 Solving fundamental flaw in stateless LLMs (Memory / IR)
⚠️ God-mode of agent capabilities, and keeping it from blowing up in your face
👀 Expand the agent's sensory inputs to visual and auditory world
🛡️ As agents grow in power and autonomy, we must now turn to guardrails (safety & reliability)
🧮 How do we mathematically prove this thing is actually doing its job?
🕰️ Agent as a Service (daemon loop, the watchdog, the heartbeat, the restart policy)
🔗 Know more...