Scan to access slides or ask your questions

  • qr

Thinking Outside The Box Bot ๐Ÿคฏ

Beyond simple LLM completions and building robust, production-ready Al agents

๐Ÿฅธ About Me

Entrepreneur ๐Ÿ’ผ

  • Deep tech and emerging technologies
  • Innovation in Business

Engineer ๐Ÿ–ฅ๏ธ

  • Working in AI since 2017 (own startup)
  • Large scale deployments


Socials ๐ŸŠ

  • Twitter: @rozeappletree
  • GitHub: @rozeappletree
  • LinkedIn: /in/asapanna-rakesh
Profile

๐Ÿค” What is This Talk About

๐Ÿค” What is This Talk About

๐Ÿค” What is This Talk About

Agents that survive contact with real users

  • Flashy demo vs. Trustworthy agent
  • Pattern-first, not framework-first
  • Progressive from failure to production ready

๐Ÿค” What is This Talk About

Other foundational concepts

  • Anatomy of Agents
  • When to use / NOT use Agents
  • First principles of production ready agent

๐Ÿ™…โ€โ™€๏ธ Not About

โœ…โŒ LangGraph, CrewAI, the Agents SDK, ADK, and others ...

โœ…โŒ Prompt Engineering ...

โœ…โŒ Just Theory, Research Survey

BUILDING

PRODUCTION

๐Ÿ‘€ Who is This Talk For?

  • ๐Ÿง‘โ€๐Ÿ’ป Engineers building real autonomy
  • ๐Ÿง“ Evaluating agent frameworks for production
  • ๐Ÿง‘โ€๐ŸŽ“ Cutting edge of applied AI

๐Ÿคฉ Talk will cover

  1. ๐Ÿ”น Foundations, first principles, anotamy of an agent
  2. ๐Ÿงญ Systematic and programmatic approach to the agent's system prompt
  3. ๐Ÿค– LLM from text generator into an agent that can act on the external world
  4. ๐Ÿ‘ฅ Introduce collegues to a sharp agent as it hits the wall (clean, cheap and fast)
  5. ๐Ÿ”ง Unpredictable toy into a reliable, engineering-grade system (graph, state, memory)
  6. ๐Ÿค multi agent collaboratoin & complex team workflows
  7. โš–๏ธ Making sure not at mercy of any single provider
  8. ๐ŸŒ Out of isolated islands - MCP, A2A, A2UI, AP2
  9. ๐Ÿงฉ Monolithic prompts to modular, declarative skills changes how you manage agents
  10. ๐Ÿ“œ "Spec-first" discipline, machine-readable contract before any code is executed
  11. ๐Ÿง  Solving fundamental flaw in stateless LLMs (Memory / IR)
  12. โš ๏ธ God-mode of agent capabilities, and keeping it from blowing up in your face
  13. ๐Ÿ‘€ Expand the agent's sensory inputs to visual and auditory world
  14. ๐Ÿ›ก๏ธ As agents grow in power and autonomy, we must now turn to guardrails (safety & reliability)
  15. ๐Ÿงฎ How do we mathematically prove this thing is actually doing its job?
  16. ๐Ÿ•ฐ๏ธ Agent as a Service (daemon loop, the watchdog, the heartbeat, the restart policy)

๐ŸŒฑ Foundations

Beyond Completions

  • Prompt โžก๏ธ LLM โžก๏ธ Response
  • Useful, but not agent ๐Ÿค–

๐ŸŒฑ Foundations

Anatomy of an Agent

  • Fundamentally a constrained loop
  • Underpins all designs ๐Ÿ“Œ
  • Understand โžก๏ธ Constrain โžก๏ธ Trust

๐ŸŒฑ Foundations

Stages of Loop

Stage โš™๏ธ๐Ÿ› ๏ธ What Happens Example
Perceive / Reflect ๐Ÿ“– Agent reads user input, tool results, or environment "Find recent papers on LLM agents"
Plan / Think ๐Ÿง  LLM reasons about the next step "I should search arXiv first"
Act Agent calls a ๐Ÿ”จ tool or produces ๐Ÿ“„ output
search_arxiv(query="LLM agents")
Observe Agent ๐Ÿ‘€ reads the tool's return value ๐Ÿ“ƒ 10 papers returned, titles listed
  • Keeps going until:
    1. Final answer โœ…
    2. Maximum iteration ๐Ÿ™€
    3. Exceptions ๐Ÿ˜žโžก๏ธ๐Ÿ˜„

๐ŸŒฑ Foundations

๐Ÿ€ Four Patterns Matter

  • Not all agents loop the same way
    1. ReAct (Reasoning + Acting)
    2. Chain-of-Thought (CoT)
    3. Reflection
    4. Plan-and-Execute

๐ŸŒฑ Foundations

๐Ÿ€ Four Patterns Matter

  • ReAct (Reasoning + Acting)
    • Workhorse ๐Ÿ‡
    • Think out loud โžก๏ธ do โžก๏ธ observe result โžก๏ธ next move
    • Usecases involving step by step processes (eg. โ“ QnA, ๐Ÿ”Ž research)

๐ŸŒฑ Foundations

๐Ÿ€ Four Patterns Matter

  • Chain-of-Thought (CoT)
    • Bureaucrat ๐Ÿ‘ด
    • It plans (everything) first, acts once
    • Usecases where full plan should be visible (e.g., ๐Ÿ˜ฅ Complex reasoning, ๐Ÿ“ math, or ๐Ÿ•ธ๏ธ multi-step logic)

๐ŸŒฑ Foundations

๐Ÿ€ Four Patterns Matter

  • Reflection
    • Scientist๐Ÿง‘โ€๐Ÿ”ฌ
    • Two phase loop: Generate, Critique, Revise โ™ป๏ธ
    • Usecases where first drafts improve with self-review. (e.g., โœ’๏ธ Writing, ๐Ÿ‘จโ€๐Ÿ’ป Code generation)

๐ŸŒฑ Foundations

๐Ÿ€ Four Patterns Matter

  • Plan and Execute
    • Beurocrat riding a workhorse ๐ŸŽ 
    • Create full plan upfront, then executes each step one by one.
    • Usecases where scope is extremely clear and don't need mid-process adjustments.

๐ŸŒฑ Foundations

๐Ÿ› ๏ธ Tools ๐Ÿ’พ Memory ๐Ÿ  State

  • Regardless of 4๏ธโƒฃ patterns, agents needs these 3๏ธโƒฃ things

๐ŸŒฑ Foundations

๐Ÿ› ๏ธ Tools

  • Tool is a ๐Ÿ function the agent can call
  • Tools let agents "DO THINGS" (eg. Search the web ๐Ÿ”Ž, Query a database ๐Ÿ—„๏ธ, Run code ๐Ÿƒโ€โ™‚๏ธ)
  • Without tools, an agent is just a ๐Ÿ—ฃ๏ธ chatbot with delusions of competence ๐Ÿคฆ

๐ŸŒฑ Foundations

๐Ÿ’พ Memory

  • Without memory, every conversation starts from scratch
    1. Short-term memory ๐Ÿง  (within the conversation in message list)
    2. Long-term memory ๐Ÿ—„๏ธ (across conversations in vector store)
    3. Episodic memory ๐Ÿ› ๏ธ (all past interactions, preferences and experiences)

๐ŸŒฑ Foundations

๐Ÿ  State

  • Agent's current situation
  • Conversation history (simple ReAct) / Structured object (complex graph-based agent)

๐ŸŒฑ Foundations

๐Ÿ™…โ€โ™€๏ธ When NOT to use an Agent โš ๏ธ

  • ๐Ÿ’ช Powerful, but they're also ๐Ÿข slow, ๐Ÿ’ธ expensive, and ๐Ÿฅฒ unpredictable
  • ๐Ÿ‘Ž Flowchart with no branches
  • Reach for a plain pipeline when:
    • Steps are fixed โœ…
    • Order must be strict ๐Ÿ”’
    • Latency matters โšก
    • Simple CRUD ๐Ÿงพ
    Reach for an agent when:
    • Next step depends on result ๐Ÿ”
    • Request is unclear โ“
    • Tools unknown up-front ๐Ÿงฐ
    • Needs self-correction ๐Ÿ”„

๐ŸŒฑ Foundations

Minimal Tool Calling Loop (OpenAI)

๐ŸŒฑ Foundations

Minimal Tool Calling Loop (Gemini)

๐ŸŒฑ Foundations

Simple ReAct Prompt for Reasearch Agent

๐ŸŒฑ Foundations

Simple ReAct loop for Reasearch Agent

๐ŸŒฑ Foundations

โฌ‡๏ธ Summary

  • ๐Ÿ”„ AI agents are fundamentally constarined loops, not one-time LLM responses.
  • ๐Ÿง  Core agent loop: Perceive โ†’ Plan โ†’ Act โ†’ Observe.
  • ๐Ÿ—๏ธ This loop forms the foundation of reliable agent architectures
  • โšก Key design patterns: ReAct, Chain-of-Thought, Reflection, and Plan-and-Execute
  • ๐ŸŽฏ Each pattern has specific strengths and use cases.
  • ๐Ÿ› ๏ธ Tools enable agents to interact with external systems.
  • ๐Ÿ’พ Memory provides persistence and context across tasks.
  • ๐Ÿ“Œ State management keeps agent behavior structured and reliable.
  • โš–๏ธ Choose complex agents only when the problem requires them.
  • ๐Ÿš€ For simpler tasks, a code pipeline is often faster and more efficient.
  • โœ… Mastering these fundamentals prepares you to build production-ready AI agents.

๐Ÿง  Practical LLM breakdown as of May 2026

  • Heavy model for everything โ€” you'll burn through your budget on tasks that light weight model handles
  • Light model for architecture decisions โ€” the cost savings aren't worth the quality drop

Prompt Architecture for Agents

Treat agent prompts as code, not chat.

  • "be helpful." โžก๏ธ Operating manual with role, rules, data, and output format
  • Structure, precision, edge case handling, and version control

Prompt Architecture for Agents

Four layer prompt architecture

  1. System Identity: Who is the agent? Define role, personality, and hard boundaries โ€” what it will and won't do.
  2. Instructions Operational: rules, available tools, workflow steps, error handling, and escalation paths.
  3. Context Injection: Dynamic runtime data โ€” user profile, session history, current time, conversation summaries.
  4. Output Constraints: Enforce exact response format (e.g. JSON schema) so downstream systems can parse reliably.

Prompt Architecture for Agents

Persona Design

  • Persona is an engineering decision, not decoration
  • Research consistently shows that LLMs adjust their reasoning depth, formality, and risk tolerance based on the role they're given. Choosing a persona is an engineering decision with measurable consequences.
  • Watch out: the "thorough" persona often loses on accuracy because it tries too hard and starts inventing facts to fill gaps. Always benchmark variants against real test cases.

Prompt Architecture for Agents

Few-Shot Examples

  • Show, don't tell
  • Instead of trying to describe what you want โ€” which gets messy fast โ€” you just show it.
  • Skip few-shot when the prompt is already long (each example adds tokens and cost) or the task is simple enough that instructions alone do the job.

Prompt Architecture for Agents

Versioning & Testing

  • Treat prompts like production code
  • The hidden failure mode: Testing only the "happy path" means silent bugs lurk in production. Every edge case in real usage must be covered by eval cases โ€” including expected refusals.
  • "Prompts need version control just like code. When you change a prompt, old bugs come back. When you don't track which version is running in production, debugging becomes guesswork."
  • Regression test cases should cover: required tool usage, keyword presence in answers, max tool-call efficiency limits, and expected refusals for out-of-scope queries.

Prompt Architecture for Agents

Best Practices

  • Treat agent prompts as code, not chat
  • Constraints make agents more reliable, not less
  • Test the unhappy path - include expected refusals and edge cases
  • Persona choices are measurable
  • Match thinking budget to task difficulty

Tools, Skills, and Structured Outputs

Tool Design

  • All major LLM providers use JSON Schema underneath
  • ๐Ÿ‘‰ Abstract away vendor-specific wrappers rather than maintaining separate tool definitions per provider
  • Build atomic tools that do exactly one thing
  • ๐Ÿ‘‰ If you're writing "web_search_and_summarize", stop โ€” you've created a hidden agent inside a tool. Split it and let the agent's main loop handle orchestration

Tools, Skills, and Structured Outputs

Structured Outputs

  • Never parse LLM responses with regex
  • ๐Ÿ‘‰ Use Pydantic schemas and tell the model to return matching JSON โ€” if it doesn't validate, retry
  • Never return a raw string from a complex tool
  • ๐Ÿ‘‰ Return a structured "SkillResult" object with success, message, data, snapshot, and error_code fields so failures are programmatically distinguishable

Tools, Skills, and Structured Outputs

Skill Organization

  • Package related tools into reusable Skills
  • ๐Ÿ‘‰ With shared config, API keys, rate limits, and error handling โ€” eliminates copy-paste sprawl across agents
  • Use a central SkillRegistry for tool discovery and dispatch
  • ๐Ÿ‘‰ Rather than hardwiring tools into each agent

Tools, Skills, and Structured Outputs

Safety & Reliability

  • Assume inputs are malicious
  • ๐Ÿ‘‰ Validate file paths are within a sandbox before reading; run static AST analysis before executing code
  • Build retry-with-backoff and fallback strategies
  • ๐Ÿ‘‰ Before you ship, not after the first outage
  • Use declarative Markdown skills (SKILL.md files with YAML frontmatter)
  • ๐Ÿ‘‰ Separate behavior from infrastructure โ€” keeps Python logic clean and lets non-engineers tune agent behavior

Tools, Skills, and Structured Outputs

The core mindset shift

Stop building "clever scripts" and build a Capability Library โ€” atomic, testable, sandboxed, and framework-agnostic.

Handoffs and Routines

When to use handoffs (and when not to)

  • Use handoffs when each sub-task is independent
  • ๐Ÿ‘‰ one agent can fully handle a request, and errors are self-contained.
  • Use graphs or crews when tasks depend on each other
  • ๐Ÿ‘‰ gents need to collaborate on the same output, or failures require coordinated recovery.

Handoffs and Routines

Start simple, add complexity only when justified

  • most expensive mistake in agent design is reaching for LangGraph or CrewAI when a simple fan-out would do
  • ๐Ÿ‘‰ Costly in both engineering time and runtime. Don't overcomplicate routing until you have a concrete reason to.

Handoffs and Routines

Keep the router's job narrow

  • The triage agent's instructions make this explicit
  • ๐Ÿ‘‰ Your ONLY job is to route... Do NOT try to answer questions yourself. Just route."
  • A router that starts answering questions is no longer a router.

Handoffs and Routines

Scope specialist agents tightly

  • Each specialist should have narrow instructions and know when to hand back
  • ๐Ÿ‘‰ e.g., "If the question is not about billing, hand off to the triage agent."
  • This prevents agents from overreaching

Handoffs and Routines

Use context compression for long trajectories

  • Without compressing state when handing off between agents, you'll hit context window limits
  • ๐Ÿ‘‰ execution history into a compact block before handoff is a production-readiness requirement, not an optimization.

Handoffs and Routines

Use embedding-based routing to cut costs

  • Cheap local sentence embeddings with cosine similarity can route requests without an LLM API call
  • ๐Ÿ‘‰ avoiding latency and cost overhead on every incoming message.

Handoffs and Routines

Know the hard limit of the pattern

  • Handoffs break the moment one agent's output needs to feed into another's input, or a human approval step is required mid-flow.
  • That's the signal to introduce state and graph-based architecture โ€” not before.

Stateful Agent Graphs

Retry needs Graph

  • Model agents as stateful directed graphs, not linear chains
  • ๐Ÿ‘‰ Linear designs like ReAct loops can't self-correct or retry
  • ๐Ÿ‘‰ The moment you need a retry, you need a graph with explicit nodes and edges for Try, Evaluate, and Retry logic

Stateful Agent Graphs

The question isn't if production agents fail, but how gracefully.

  • Design for failure explicitly
  • ๐Ÿ‘‰ Errors are not edge cases. Model recovery as first-class edges in the graph
  • ๐Ÿ‘‰ Retry paths, escalation paths, and human handoff paths

Stateful Agent Graphs

Use retry limits with escalation

  • Cap retries (e.g., retry_count < 2) and route beyond that limit to a human or escalation node
  • ๐Ÿ‘‰ Never let an agent loop indefinitely

Stateful Agent Graphs

Add human-in-the-loop checkpoints before mutable operations

  • Use interrupt_before on any node that writes, posts, or changes state in the real world.
  • ๐Ÿ‘‰ Let a human review sensitive plan before the agent acts on it

Stateful Agent Graphs

Use durable checkpointers in production

  • MemorySaver is fine for development, but production systems need SQLite (single machine) or PostgreSQL (distributed) so state survives process restarts.

Stateful Agent Graphs

Match the tool to the complexity

  • Use lightweight frameworks like smolagents for quick prototypes and one-off scripts.
  • Reach for LangGraph only when you need production-grade state management, persistence, and complex control flow.

Stateful Agent Graphs

Keep state as a shared, typed dictionary

  • Flow a single TypedDict state through the entire graph
  • ๐Ÿ‘‰ so every node reads and writes to a consistent, inspectable structure
  • ๐Ÿ‘‰ makes debugging and time-travel replay practical

Multi-Agent Collaboration

On When to Use Multi-Agent Systems

  • architecture is powerful, but the coordination tax is brutal.
  • ๐Ÿ‘‰ More agents means more latency, explosive API costs, hallucination cascading, and actual nightmares when trying to debug which agent screwed up first
  • The core principle: only use multiple agents when roles are genuinely distinct โ€” don't add agents just because you can.

Multi-Agent Collaboration

On Debugging & Predictability

  • Assembly Line Patter
  • ๐Ÿ‘‰ CrewAI's explicit task pipelines are easier to debug because execution order is predictable, versus conversational frameworks where emergent flow is harder to trace. Design for debuggability from the start.

Multi-Agent Collaboration

More...

  • On High-Stakes Accuracy
  • ๐Ÿ‘‰ Dbeate Protocol: If you are building an agent focused on factual accuracy (like a medical or legal assistant), don't just rely on a single ReAct loop. Put the output through a debate protocol before showing it to the user
  • ๐Ÿ‘‰ Use a Generator โ†’ Challenger โ†’ Revise cycle until the challenger runs out of objections or you hit max rounds.
  • On Governance & Agentic Actions
  • ๐Ÿ‘‰ Agents that can take real-world actions (messaging people, raising approvals, closing tickets) need hard interrupt-before-mutate checkpoints โ€” not polite post-action email summaries.
  • On Resilience
  • ๐Ÿ‘‰ Hard-locking a multi-agent crew to a single API provider is a critical flaw. Fallback logic across providers isn't optional โ€” it's a hard requirement for production systems.

โญ๏ธ Next steps...

  1. ๐Ÿ”น Foundations, first principles, anotamy of an agent
  2. ๐Ÿงญ Systematic and programmatic approach to the agent's system prompt
  3. ๐Ÿค– LLM from text generator into an agent that can act on the external world
  4. ๐Ÿ‘ฅ Introduce collegues to a sharp agent as it hits the wall (clean, cheap and fast)
  5. ๐Ÿ”ง Unpredictable toy into a reliable, engineering-grade system (graph, state, memory)
  6. ๐Ÿค multi agent collaboratoin & complex team workflows
  7. โš–๏ธ Making sure not at mercy of any single provider
  8. ๐ŸŒ Out of isolated islands - MCP, A2A, A2UI, AP2
  9. ๐Ÿงฉ Monolithic prompts to modular, declarative skills changes how you manage agents
  10. ๐Ÿ“œ "Spec-first" discipline, machine-readable contract before any code is executed
  11. ๐Ÿง  Solving fundamental flaw in stateless LLMs (Memory / IR)
  12. โš ๏ธ God-mode of agent capabilities, and keeping it from blowing up in your face
  13. ๐Ÿ‘€ Expand the agent's sensory inputs to visual and auditory world
  14. ๐Ÿ›ก๏ธ As agents grow in power and autonomy, we must now turn to guardrails (safety & reliability)
  15. ๐Ÿงฎ How do we mathematically prove this thing is actually doing its job?
  16. ๐Ÿ•ฐ๏ธ Agent as a Service (daemon loop, the watchdog, the heartbeat, the restart policy)
  17. ๐Ÿ”— Know more...

Press S to open the speaker view for notes and slide previews.