A guide to Agentic Systems

AI Product development lifecycle

The diagram below shows a typical AI development lifecycle.

[Diagram: AI Product Development Lifecycle (PDL)]

Many AI projects fail not because of bad models, but because they solve the wrong problem. Some points to consider:

  • Is the problem best solved by AI or traditional software?
  • Who is the end user?
  • What are the edge cases?
  • What are the boundaries of acceptable behaviour?

Roles to involve: AI Product Managers, Domain Experts, AI Engineers.

Building a prototype

I like the practice where AI Engineers prototype and communicate with stakeholders directly. Some points to consider here:

  • Use Notebooks or no-code tools, small datasets, and off-the-shelf models.
  • This is where you learn rather than focus on performance.
  • Document everything so that you don’t repeat mistakes in the future.
  • This is where you will do a lot of prompting and researching the market for tools that could help solve the problem; a minimal notebook-style sketch follows this list.
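
A minimal notebook-style sketch of this stage, assuming the OpenAI Python SDK and an OPENAI_API_KEY environment variable; the model name, prompt variants, and examples are placeholders, not a recommendation.

```python
# Notebook-style prototyping: try two prompt variants on a tiny dataset and
# record the outputs so the experiment is documented and repeatable.
import json

from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()

examples = [  # a handful of hand-picked inputs is enough at this stage
    "My package arrived damaged, what should I do?",
    "How do I reset my password?",
]

prompt_variants = {  # illustrative prompt variants
    "v1_plain": "Answer the customer question:\n\n{q}",
    "v2_persona": "You are a concise support agent. Answer:\n\n{q}",
}

results = []
for name, template in prompt_variants.items():
    for q in examples:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": template.format(q=q)}],
        )
        results.append({
            "prompt_variant": name,
            "input": q,
            "output": response.choices[0].message.content,
        })

# Dump the run to disk so learnings and mistakes are not lost.
with open("prototype_run.json", "w") as f:
    json.dump(results, f, indent=2)
```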

Defining performance metrics

You are solving a real business problem, so it needs to be grounded in the specific metrics you plan to optimise. Remember to align with the business before you start implementing.

  • What is it that you are trying to optimise for? E.g. reduce headcount, improve user satisfaction, increase development velocity.
  • The above is your north star output metric; now you need to split it into input metrics that can actually drive the output metric forward. E.g. reduce the average time to Customer Support ticket resolution.
  • Your application should target the input metrics, but the output metric is what really matters for the business.

Defining Evaluation Rules

  • For any node in your Agentic System topology you should have an evaluation dataset prepared: Inputs → Expected Outputs (a minimal sketch follows this list).
  • Define unacceptable responses. E.g. toxicity, hallucinations, unsafe suggestions.
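
A minimal sketch of what such a dataset and rules could look like, assuming a plain-Python representation; the node name, examples, and rule predicates are purely illustrative.

```python
# Hypothetical evaluation dataset for one node of the system: Inputs -> Expected Outputs.
eval_dataset = [
    {
        "node": "ticket_classifier",  # illustrative node name
        "input": "My card was charged twice",
        "expected_output": "billing_issue",
    },
    {
        "node": "ticket_classifier",
        "input": "The app crashes on startup",
        "expected_output": "technical_issue",
    },
]

# Unacceptable responses expressed as simple predicate rules (illustrative only).
UNACCEPTABLE = {
    "toxicity": lambda text: any(w in text.lower() for w in ("idiot", "stupid")),
    "unsafe_suggestion": lambda text: "disable your antivirus" in text.lower(),
}

def violated_rules(output: str) -> list[str]:
    """Return the names of all unacceptable-response rules the output triggers."""
    return [name for name, check in UNACCEPTABLE.items() if check(output)]
```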

Building a PoC

The goal here is to push the system out to users as soon as possible. You might be driven to transition from prompts to a functioning interface (CLI, chat UI, API, etc.).

  • Use LLM APIs from OpenAI, Google, Anthropic, X, etc. to quickly build out the first user-facing application (a minimal CLI sketch follows this list).
  • Your application could be an Excel Spreadsheet with Input-Output pairs rather than a full-fledged functioning interface. As long as it helps move the metrics forward, it is good enough to expose.
  • The feedback you get from users is key to understanding the unknown unknowns. In my experience it almost always shifts your perspective on how to improve the application.
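
One possible starting point is a minimal CLI wrapper around an LLM API. The sketch below assumes the OpenAI Python SDK; the model name and system prompt are placeholders, and the same loop could sit behind a chat UI or an API endpoint instead.

```python
# Minimal CLI PoC: read a user question, call an LLM API, print the answer.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = "You are a concise customer-support assistant."  # placeholder

def answer(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    while True:
        question = input("You: ")
        if question.strip().lower() in {"exit", "quit"}:
            break
        print("Assistant:", answer(question))
```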

Instrumenting (LLM Observability)

Log an extensive set of metadata about everything that is happening under the surface.

  • Log everything: prompts, completions, embeddings, latency, token counts, and user feedback.
  • Add additional metadata like prompt versions, user inputs, and the model versions used.
  • Make sure that the chains are properly connected and that you know the ordering of operations.
  • When working with multimodal data, log the different kinds of data involved: PDFs, images, audio, video.
  • Remember that outputs of one LLM call will often become inputs to the next one.
  • Don’t forget the user feedback! Always attach it to the traces that represent the run users were interacting with when the feedback was provided. A minimal logging sketch follows this list.
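
A minimal sketch of one such trace record, assuming plain Python and a hypothetical log_span() helper; in practice you would emit these spans through your tracing SDK of choice.

```python
# Hand-rolled trace record for one LLM call within a chain.
# Field names and the log_span() helper are illustrative, not a real SDK.
import time
import uuid

def log_span(record: dict) -> None:
    print(record)  # stand-in for shipping the record to your log store

def traced_llm_call(trace_id: str, parent_span_id: str | None,
                    prompt: str, prompt_version: str, model: str,
                    call_fn) -> dict:
    span_id = str(uuid.uuid4())
    start = time.time()
    completion, token_counts = call_fn(prompt)   # your actual LLM call goes here
    record = {
        "trace_id": trace_id,                    # links all spans of one run
        "span_id": span_id,
        "parent_span_id": parent_span_id,        # preserves the ordering of the chain
        "prompt": prompt,
        "prompt_version": prompt_version,
        "model": model,
        "completion": completion,
        "latency_s": round(time.time() - start, 3),
        "token_counts": token_counts,            # e.g. {"prompt": 120, "completion": 45}
        "user_feedback": None,                   # attached later, once feedback arrives
    }
    log_span(record)
    return record
```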

Integrating with an Observability Platform

This is where Observability platforms come into play: they help with efficient search and visualisation, as well as prompt versioning and automated evaluation capabilities.

  • Store your Evaluation rules as part of the platform, as you will later apply them to the traces.
  • Use these platforms as Prompt Registries: since your application is a chain of prompts, you will want to analyse and group the evaluation results by Prompt Group.
  • Most successful applications reach a scale at which it becomes too expensive to store all of the traces produced. Observability Platforms have smart sampling algorithms that allow you to store a subset of the incoming traces (a sampling sketch follows this list).
  • Most Observability Platforms come with their own tracing SDKs; use them for seamless Instrumentation.
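
To make the sampling point concrete, here is a sketch of a simple keep/drop rule over the hypothetical trace records from the previous section; real platforms implement much smarter strategies, so treat this as an illustration only.

```python
import random

# Illustrative sampling rule: always keep "interesting" traces, sample the rest.
# Field names are hypothetical and match the earlier trace sketch.
KEEP_RATE_FOR_HEALTHY_TRACES = 0.05  # store 5% of unremarkable traces

def should_store(trace: dict) -> bool:
    if trace.get("user_feedback") == "negative":
        return True   # never drop traces with negative user feedback
    if any(not r["passed"] for r in trace.get("eval_results", [])):
        return True   # never drop traces with failing evals
    if trace.get("latency_s", 0) > 10:
        return True   # keep slow outliers for debugging
    return random.random() < KEEP_RATE_FOR_HEALTHY_TRACES
```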

Evaluating Traced Data

Run Evals on top of the trace data.

  • Assumption: your Evaluation rules are stored in the Observability platform, traces arrive from your instrumented application, and human feedback is attached to the corresponding traces.
  • Run the Evals automatically on the traces that hit the Observability Platform.
  • Filter for the traces that have failing evals or negative human feedback; it is up to you to decide what counts as a failing eval or negative feedback (a small sketch of this step follows this list).
  • We will focus mostly on this “failing” data moving forward.
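
A minimal sketch of this step, reusing the hypothetical trace shape from earlier; the exact-match evaluator is a trivial stand-in for whatever Evals you actually run.

```python
# Run evals over traced data, attach the results, and keep the "failing"
# traces for analysis. Trace fields are hypothetical, matching earlier sketches.

def exact_match_eval(trace: dict) -> dict:
    """Trivial stand-in evaluator: does the completion match the expected output?"""
    expected = trace.get("expected_output", "")
    return {"name": "exact_match", "passed": trace["completion"].strip() == expected}

EVALS = [exact_match_eval]  # in practice: toxicity, hallucination, groundedness checks, etc.

def evaluate(trace: dict) -> dict:
    trace["eval_results"] = [run_eval(trace) for run_eval in EVALS]
    return trace

def is_failing(trace: dict) -> bool:
    failed_eval = any(not r["passed"] for r in trace["eval_results"])
    negative_feedback = trace.get("user_feedback") == "negative"
    return failed_eval or negative_feedback

def failing_traces(traces: list[dict]) -> list[dict]:
    return [t for t in map(evaluate, traces) if is_failing(t)]
```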

Evolving the application

  • Focus on Failing Evaluations and human feedback to pinpoint where improvement is needed.
  • If your current topology is not up to the task, make it more complex: Simple Prompts → RAG → Agentic RAG → Agents → Multi-agent systems.
  • Make the system more complex only if there is a hard requirement; otherwise focus on better prompt engineering, data preprocessing, and tool integration.

Expose different versions

  • Deploying new versions fast is important for a few reasons:
  1. It improves UX as the problems currently present get fixed.
  2. Some fixes will generalise to unknown problems, so you will solve multiple bugs in one shot.
  • Be sure to have strict release tests. You should always have evaluation datasets ready so that you know that what is being released is not worse than the previous version. Integrate these checks into your CI/CD pipelines (a minimal regression gate is sketched below).
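
One way to wire this into CI/CD is a regression gate over the evaluation dataset. The sketch below is pytest-style; run_eval_suite() is a placeholder you would connect to your own eval runner, and the tolerance value is arbitrary.

```python
# Hypothetical CI/CD release gate: the candidate version must not score worse
# than the currently released version on the evaluation dataset.

def run_eval_suite(version: str) -> float:
    """Placeholder: run all evals for the given prompt/model version and
    return the fraction of passing cases."""
    raise NotImplementedError

TOLERANCE = 0.01  # allow a 1 percentage-point drop to absorb eval noise

def test_candidate_is_not_worse_than_production():
    production_pass_rate = run_eval_suite(version="prod")
    candidate_pass_rate = run_eval_suite(version="candidate")
    assert candidate_pass_rate >= production_pass_rate - TOLERANCE, (
        f"Candidate pass rate {candidate_pass_rate:.2%} regressed below "
        f"production {production_pass_rate:.2%}"
    )
```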

Continuous Development and Evolution

  • Build → Trace, collect feedback → Evaluate → Focus on Failing Evals and Negative Feedback → Improve the application → Iterate.
  • As your business requirements become more complex, you might add additional functionality to the application.

Production monitoring

Configure specific alerting thresholds and enjoy the peace of mind. A minimal sketch of such a check is shown below.
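
The sketch assumes you can query aggregate metrics from your observability platform; get_metric() and send_alert() are placeholders, and the thresholds are arbitrary examples.

```python
# Hypothetical production-monitoring check, run on a schedule (e.g. every 5 minutes).

def get_metric(name: str, window_minutes: int) -> float:
    """Placeholder: query the aggregate metric from your observability platform."""
    raise NotImplementedError

def send_alert(message: str) -> None:
    """Placeholder: page on-call, post to a Slack channel, etc."""
    raise NotImplementedError

THRESHOLDS = {
    "eval_failure_rate": 0.05,      # alert if >5% of sampled traces fail evals
    "p95_latency_s": 10.0,          # alert if p95 end-to-end latency exceeds 10 seconds
    "negative_feedback_rate": 0.10, # alert if >10% of feedback is negative
}

def check_and_alert(window_minutes: int = 5) -> None:
    for metric_name, threshold in THRESHOLDS.items():
        value = get_metric(metric_name, window_minutes=window_minutes)
        if value > threshold:
            send_alert(f"{metric_name}={value:.3f} exceeded threshold {threshold}")
```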