From Models to Multi-Agent Systems
An AI agent is not a new kind of model. It is a model API call wrapped with instructions, context, tools, permissions, and a loop that keeps working until the task is done.
A Large Language Model is an external API — a remote inference service. On its own it is stateless: no memory, no file access, no ability to act.
The model sits on the right. Your code calls it. Everything else you see in this guide is the wrapper around that call.
You send text to the model API 1. The model receives your question as a sequence of tokens — nothing more.
The model predicts the most likely next tokens 2. Text in, text out. No tools, no memory, no side effects.
Between you and the model sits the agent runtime — a local process that assembles context, calls the model API, and interprets what comes back.
At its core is the Execution Engine: the component that receives your request and coordinates everything that follows.
You ask: “What is the weather in Boston?” 1. This time the question goes to the Execution Engine — not directly to the model.
The Context Assembly layer 2 gathers system instructions, conversation history, and configuration into a single model prompt — and sends it to the model 3.
The model prompt is what the model actually sees. Context assembly is the process that builds it. Change what goes in, change the behavior.
Notice the History box inside the engine — it holds every user message and model response from this session. Each new prompt includes the full conversation so far, giving the model continuity across turns.
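The assembly step above can be sketched in a few lines. This is an illustrative shape, not any specific runtime's API: the runtime rebuilds the full prompt every turn from system instructions plus the entire session history.

```python
# Sketch of context assembly: each turn, the runtime rebuilds the model
# prompt from system instructions plus every prior turn in History.
# All names here are illustrative.

def assemble_prompt(system_instructions, history, user_message):
    """Build the message list the model actually sees."""
    messages = [{"role": "system", "content": system_instructions}]
    messages.extend(history)                        # every prior turn
    messages.append({"role": "user", "content": user_message})
    return messages

history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
prompt = assemble_prompt(
    "You are a helpful agent.", history, "What is the weather in Boston?"
)
# The model sees four messages: system, two history turns, the new question.
```

Because the full history is re-sent on every call, the model's "continuity" is entirely a property of what the runtime chooses to include.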
MCP servers expose tool schemas — the runtime injects these into the context assembly so the model knows what tools are available.
This is how the model discovers it can check the weather: the tool signature is right there in the prompt.
The model sees the weather tool in its context. Instead of guessing, it returns a structured tool call 4: weather("Boston").
This is not a function the model runs — it is a request sent back to the Execution Engine.
The Execution Engine routes the tool call to the right MCP server 5. The server executes the request and returns real data.
The tool result 6 is fed back into the context and sent to the model. The model now has facts it could not have known from training alone.
With real data in context, the model produces a factual answer 7 and the runtime delivers it to you.
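The whole round trip (steps 4 through 7) can be condensed into a sketch. The model and weather server below are stand-ins with made-up data; only the control flow — tool call out, result back in, second model call — reflects the mechanism described above.

```python
import json

# Sketch of the weather round-trip: the model emits a structured tool
# call, the engine executes it, and the result re-enters the context
# for a second model call. All names and data are illustrative stubs.

def model(context):
    if "tool_result" in context:
        return {"type": "answer", "text": "It is 38°F and cloudy in Boston."}
    return {"type": "tool_call", "name": "weather",
            "arguments": {"city": "Boston"}}

def weather_server(city):
    # Stand-in for a real MCP server returning real data.
    return json.dumps({"city": city, "temp_f": 38, "sky": "cloudy"})

context = "user: What is the weather in Boston?"
step1 = model(context)                          # structured tool call (4)
result = weather_server(**step1["arguments"])   # engine routes it (5)
context += f"\ntool_result: {result}"           # result enters context (6)
step2 = model(context)                          # grounded answer (7)
```

Note that the model never executes anything: both calls are pure text-in, text-out; the engine does all the routing and execution in between.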
To the left of the runtime sits the instruction layer — project-level files that define identity, rules, and behavior. These are injected into the context assembly on every call.
The engine now also shows Memory — persistent storage that survives across sessions. While History resets when you start a new conversation, Memory lets the agent recall user preferences, project context, and past decisions indefinitely. The Context Assembly layer pulls in relevant memories and includes them in the Model Prompt, so the model can act on information from previous sessions.
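A minimal version of that History/Memory split can be sketched as follows — History is just in-process state, while Memory is anything that persists to disk. The key-value file below is an illustrative stand-in, not how any particular runtime stores memories.

```python
import json, os, tempfile

# Sketch: History lives in the process and dies with the session;
# Memory persists to disk and survives restarts. Illustrative only.

class Memory:
    def __init__(self, path):
        self.path = path

    def recall(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def remember(self, key, value):
        data = self.recall()
        data[key] = value
        with open(self.path, "w") as f:
            json.dump(data, f)

path = os.path.join(tempfile.mkdtemp(), "memory.json")
Memory(path).remember("preferred_test_runner", "pytest")

# A brand-new "session" reconstructs the same memories from disk,
# ready to be included in the next model prompt:
memories = Memory(path).recall()
```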
Instruction files 1 get loaded into context assembly. Claude Code uses CLAUDE.md. OpenAI Codex uses AGENTS.md. Cursor uses .cursor/rules.
These files define role, standards, workflow expectations, and guardrails. Swapping the instruction layer turns the same model into a security reviewer, staff engineer, or product writer.
A skill registry 2 lists available workflows in the context — names and descriptions the model can read. When relevant, the model makes a tool call to load the full skill file into context.
You ask: “Run my tests” 3. The runtime assembles instructions, tool schemas, and the skill registry into context, then sends everything to the model.
The model sees /test in the skill registry. It returns a tool call 4: load_skill("/test"). The engine reads the skill file and returns its content — instructions for running and reporting tests.
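The registry mechanism above can be sketched directly. The point of the design: every prompt carries only the cheap listing (names and descriptions), and the full skill body enters context only when the model asks for it. Names and contents below are illustrative.

```python
# Sketch of a skill registry: the prompt carries names + descriptions
# only; the full body is loaded on demand via a tool call.
# Skill names and contents are illustrative.

SKILLS = {
    "/test": {
        "description": "Run the project's test suite and report results",
        "body": "Run `pytest --cov`. Summarize failures before fixing.",
    },
}

def registry_listing():
    """What goes into every prompt: names and descriptions only."""
    return {name: s["description"] for name, s in SKILLS.items()}

def load_skill(name):
    """Executed by the engine when the model requests a skill."""
    return SKILLS[name]["body"]

listing = registry_listing()       # small, always in context
body = load_skill("/test")         # large, loaded only when relevant
```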
The test skill tells the model to run tests. The model calls bash("pytest --cov") 5 and the engine executes the command locally, returning stdout and stderr as the tool result.
This is what separates a chatbot from an agent: the ability to run real commands on your machine.
The model reads the test output, summarizes the results, and reports back 6: all 42 tests passing with 94% coverage. One prompt, one skill, one bash call.
You have already seen every piece: the runtime, the model prompt, tools, instructions, and skills. Now they work together. The persona tells the model: plan your approach, take an action, observe the result, decide if you are done.
The model still handles one prompt and one prediction at a time. After each response, the Execution Engine feeds the result back and calls the model again. The loop is not inside the model — it is the engine re-invoking it.
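That engine-side loop can be sketched in a few lines. The `fake_model` below stands in for the real model API; everything else — the loop, the tool execution, the feedback — is the engine's job, exactly as described above.

```python
# Sketch of the agentic loop: the loop lives in the engine, not the
# model. Each iteration is one prompt in, one prediction out.
# fake_model and run_tool are illustrative stand-ins.

def fake_model(context):
    # Decides based on what the engine has fed back so far.
    if "tool_result: tests passed" in context:
        return {"type": "final", "text": "Done: tests passing."}
    return {"type": "tool_call", "name": "run_tests", "args": {}}

def run_tool(name, args):
    return "tests passed"                  # stand-in for real execution

def agent_loop(user_request, max_turns=5):
    context = f"user: {user_request}"
    for _ in range(max_turns):
        action = fake_model(context)       # one prompt -> one prediction
        if action["type"] == "final":
            return action["text"]
        result = run_tool(action["name"], action["args"])
        context += f"\ntool_result: {result}"  # engine feeds result back
    return "gave up"

answer = agent_loop("Run my tests")
```

The `max_turns` cap is worth noting: real runtimes bound the loop so a confused model cannot iterate forever.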
You say: “Implement the dashboard.” 1 The request enters the engine with instructions, tool schemas, and skill registry already in context.
The model’s first move: call Jira 2 to fetch the ticket specs. The MCP server returns the requirements. With specs in context, the model forms a plan and starts executing.
With specs in context, the model starts building 3. It calls writeFile for index.html and app.js. Each call goes to the engine, executes, and returns. The model keeps going.
The model runs npm test 4. Two tests fail. This is where the loop matters: the model does not stop or ask you what to do. It reads the error, decides what to fix, and continues.
The model fixes app.js 5, re-runs the tests, and this time they all pass. Each iteration is still one prompt → one prediction. The runtime feeds the result back and the model decides: “am I done?”
All tests pass. The model reports back 6: dashboard implemented, 3 files created, 8 tests passing. One prompt from you. Multiple rounds of plan → act → observe → decide internally.
So far, one agent, one loop. But real tasks need coordination: implement a feature, review the code, run the tests. An orchestrator delegates sub-tasks to specialized agents.
Each agent is still one runtime, one loop, one model call at a time. The orchestrator decides who works on what.
Each agent gets a different persona from its instruction file: one implements features, another reviews code, a third runs tests. Same model, different instructions, different behavior.
The orchestrator delegates tasks, collects results, and re-plans. Each sub-agent runs its own agentic loop — plan, act, observe, repeat — then reports back.
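Structurally, the orchestrator is just another loop one level up, as this sketch suggests. The personas and the stubbed `run_agent` are illustrative; in a real system each sub-agent would run the full agentic loop from the previous section.

```python
# Sketch: an orchestrator delegates sub-tasks to agents that share a
# model but carry different instruction files. Names are illustrative.

PERSONAS = {
    "implementer": "You write features.",
    "reviewer": "You review code for defects.",
    "tester": "You run and report tests.",
}

def run_agent(persona, task):
    # Each sub-agent would run its own full agentic loop; stubbed here.
    assert persona in PERSONAS
    return f"[{persona}] completed: {task}"

def orchestrate(feature):
    plan = [
        ("implementer", f"implement {feature}"),
        ("reviewer", f"review {feature}"),
        ("tester", f"test {feature}"),
    ]
    # Delegate each sub-task, collect results for re-planning.
    return [run_agent(role, task) for role, task in plan]

results = orchestrate("dashboard")
```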
Here is the complete picture: multiple agent runtimes, each with its own context and tools, talking to models and connecting to MCP servers. But where does each piece actually live?
The agent runtimes, tools, and all execution happen on your machine. Your code, your files, your bash — nothing leaves unless you allow it.
Some agents run in a sandbox — a restricted environment with no network access, limited filesystem scope, and controlled permissions. This is governance built into the runtime, not bolted on.
The model API can run in three places: a cloud provider (Claude, GPT, Gemini), a local model on your machine (Ollama, llama.cpp), or a private cloud (AWS Bedrock, Azure OpenAI).
The runtime doesn’t care which — it sends the same API call. What changes is where your prompts go.
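The "runtime doesn't care" claim is concrete in code: only the endpoint changes, never the payload. The URLs below are illustrative placeholders, not real provider endpoints (though the localhost port echoes Ollama's common default).

```python
# Sketch: the runtime's call shape is identical across deployments;
# only the endpoint differs. URLs are illustrative placeholders.

ENDPOINTS = {
    "cloud":   "https://api.example-provider.com/v1/chat",
    "local":   "http://localhost:11434/v1/chat",   # Ollama-style default port
    "private": "https://bedrock.example.internal/v1/chat",
}

def build_request(deployment, messages):
    """Same payload everywhere; deployment only selects the URL."""
    return {"url": ENDPOINTS[deployment], "json": {"messages": messages}}

msgs = [{"role": "user", "content": "hi"}]
cloud = build_request("cloud", msgs)
local = build_request("local", msgs)
# Identical payloads, different destinations: the privacy question is
# purely about where the URL points.
```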
Local MCP servers run alongside your agent. They connect to databases, GitLab, internal APIs — services inside your network. Data stays local; only tool results enter context.
But “local” doesn’t always mean contained. A web-search MCP runs locally but makes Google API calls over the internet. A Puppeteer MCP fetches external URLs. The server is local — the data it touches may not be.
Some MCP servers are cloud-hosted — GitHub, Jira, Slack. Your local agent calls them via API. The connection crosses your perimeter, so permissions and scoping matter.
The MCP protocol standardizes auth and scoping: each server declares what it can do, and the runtime enforces which agents can call which tools.
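Enforcement at the runtime boundary can be sketched like this. The policy table and error handling are illustrative — this is not MCP's wire format — but the principle matches the text: the check happens in the engine, before any call reaches a server.

```python
# Sketch: the runtime checks an agent's scope before a tool call ever
# reaches an MCP server. Policy table and names are illustrative.

POLICY = {
    "reviewer-agent":    {"read_file", "search_code"},          # read-only
    "implementer-agent": {"read_file", "write_file", "bash"},
}

def dispatch(agent, tool, execute):
    """Gatekeeper: only scoped tools are ever executed."""
    if tool not in POLICY.get(agent, set()):
        raise PermissionError(f"{agent} may not call {tool}")
    return execute()

ok = dispatch("implementer-agent", "write_file", lambda: "written")

try:
    dispatch("reviewer-agent", "write_file", lambda: "written")
    blocked = False
except PermissionError:
    blocked = True   # the reviewer's write attempt never executed
```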
1. The agent can read .env files, SSH keys, and API tokens from your filesystem. Credentials are one tool call away.
2. Local MCPs can call external APIs. A “local” web-search MCP sends queries over the internet. Data crosses your perimeter.
3. Remote MCP servers hold your API tokens and receive your data. Third-party services see what you send them.
4. Cloud models receive every prompt. Where are logs stored? Is your data used for training? Who has access?
Every modern AI coding tool uses the same building blocks. What changes is the model layer, the instruction system, and where execution actually happens: in your editor, on your machine, or in a remote sandbox.
Claude Code: CLAUDE.md
OpenAI Codex: AGENTS.md and runtime policy
Cursor: .cursor/rules
GitHub Copilot: .github/copilot-instructions.md and repo context

Five questions every team should answer before putting agents into production.
Where is the model running, and who stores the logs of my prompt history?
What is the 'blast radius' if the agent goes rogue? Does it have write access?
Can I replay every action the agent took in a verifiable log?
Where are the 'human-in-the-loop' checkpoints for high-risk actions (and beware of approval fatigue)?
How do I detect, correct, and prevent the agent deviating from the role scoped in its AGENTS.md?