Last year, in 2025, I ran trainings on AI-first software engineering for over 300 developers from organizations ranging from fewer than 100 employees to over 10,000. I saw a lot. And I got to hear a lot of “how should I or we do X?” questions. Often the X was about agent instructions. The two big questions about these agent instructions were:
- What should we write in our shared agent instructions?
- How do we know our shared agent instructions are working for us?
In addition to the trainings, I also had a particularly memorable interview with a 1,000+ employee organization that was looking for AI-friendly developer consultants and freelancers to help their internal developers level up their skills with these new coding agents. I didn’t get the gig. While I didn’t receive detailed reasons why, my suspicion is that it had something to do with our different approaches to agent instructions. The client saw shared instructions as an easy solution for AI adoption. I saw them as a potential minefield.
While tooling will make it easier for people to start using coding agents efficiently, I think shared instruction crafting requires much more human coordination than we think.
I’ll offer here my opinions on the two questions listed above. A lot of my concerns are based on hypotheticals. But I’m increasingly convinced that teams that don’t look for an answer to “how do we know our shared agent instructions are working for us?” will at some point have to deal with instruction bloat and a great serving of unproductive discussions about which instructions to keep and which to trash.
What are agent instructions?
If you are using GitHub Copilot, you will know agent instructions as custom instructions. It’s the `copilot-instructions.md` file in the `.github` directory. Cursor uses the term rules and stores these instructions as different markdown files inside the `.cursor/rules` folder. Claude Code has `CLAUDE.md` for the same purpose.
The open format for these instructions is the `AGENTS.md` file, which is the only way to add instructions for OpenAI’s Codex. All of the tools listed above (besides Claude Code) support `AGENTS.md` as well.
These instructions are read automatically as you prompt your coding agent, which means your context usually contains the contents of the instruction files even if you don’t see them in your chat session. It’s a great pattern for giving your agent long-term memory, so you don’t need to explain every time what you’re working with, how to run different development commands, and how you want your code to look.
If you still have no idea what I’m talking about, check out https://agents.md/ or the documentation of your tool of choice.
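If your team uses several of these tools at once, one low-effort pattern is to keep `AGENTS.md` as the single source of truth and symlink the tool-specific filenames to it. This is a sketch under assumptions about your setup, not something every tool documents; verify that your tools follow symlinks before relying on it:

```shell
# Demo in a scratch directory; in a real repo you would run the two
# `ln -sf` lines from the repository root.
set -e
cd "$(mktemp -d)"
printf '# Project overview\nThis is a Rails app.\n' > AGENTS.md
ln -sf AGENTS.md CLAUDE.md                   # Claude Code reads CLAUDE.md
mkdir -p .github
ln -sf ../AGENTS.md .github/copilot-instructions.md   # GitHub Copilot
diff AGENTS.md CLAUDE.md && echo "in sync"   # prints "in sync"
```

This way there is exactly one file to edit and review, which matters once the team starts gating changes to it.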
What to add to shared instructions?
If you have no idea what to write in your very first instruction file, I suggest you consult the documentation of your coding agent. I’ve read the docs for the most popular tools, and they are a great starting point.
If you don’t like reading, you can initialize these instructions using your agent. For example, Claude Code has the `/init` command that prompts the agent to generate a `CLAUDE.md` file for itself. This will give you a good idea of what the instructions could look like for your project. But my suggestion is not to use this auto-generated file as your starting point. Why?
A common misconception among developers is that Claude Code knows better than you what the instructions should contain. Some developers know better but don’t want to think too hard about the instruction file, and so still delegate the work to the agent. It would seem to make sense that instructions generated by the agent are better than what we could come up with, since the agent is the one that needs to use them. Surely the LLM powering the agent has been fed training data about good instructions as well, whereas you are only now getting familiar with the subject.
While agents are great at generating coherent sentences, they don’t seem to be good at self-editing. The last time I generated an instruction file automatically, I noticed that the generated instructions were considerably more verbose compared to the documented examples from the agent provider.
Here is another uncomfortable fact for us developers who like to think in 1s and 0s: there is no single right set of instructions. You really have to develop an intuition for this stuff and be prepared to see the same instructions fail in one project and succeed in another.
Now, when you are playing around with instructions in your own projects or adding instructions but keeping them out of Git in a team project, do what works for you. I personally don’t write a single rule until I need one, but I have colleagues who are perfectly happy starting with multiple lines of instructions before their first prompt.
But when you start committing instructions to Git, you need to take things a bit more seriously. The two issues you need to watch out for are:
- Instruction bloat that decreases the quality of agentic outputs for the whole team
- Instruction bike shedding that eats up your and your team’s time
I’ll discuss instruction bloat in more depth in the next section. For now, I’ll focus on the bike shedding part of the problem.
If you’ve never heard of bike shedding, the basic idea is that we developers tend to spend more time discussing shallow topics and less time discussing deeper topics because everyone is able to contribute their opinion on the shallower topic. For more about the background of the term, check out https://en.wikipedia.org/wiki/Law_of_triviality.
Everyone who uses coding agents will at some point form opinions about instructions and what they should include. When it comes to using shared instructions, everything will go fine until you discover some conflicting opinions. This is why I recommend teams start with the most objective stuff imaginable — the things recommended across all instruction documentation from the popular tools.
In more detail, my recommendation for teams who want to create shared instructions is to start with these things and nothing else:
- Describe the architecture in 1 to 2 sentences
- List the helpful development commands you want the agent to keep running
These are things where the team should have zero disagreements. If you encounter disagreements when defining these things, you have discovered major misalignments that need to be fixed. You can’t have developers working on the same project if they have a different understanding of what the project is about and how it’s built.
This is an example of an `AGENTS.md` file I would add to a fictional project in a fictional team setting:

```
# Project overview

This is a Ruby on Rails 8 application for developer training.

## Commands

- `bin/dev` - Start development server
- `bin/rails test` - Run all tests
- `bin/rails test:system` - Run all system tests
- `bundle exec rubocop` - Check code style and formatting
```
Once we get a better footing with sharing instructions with the team (see next section), I’m comfortable with my team starting to add more “subjective” instructions about things like code style. Until that point, my fear is that these subjective instructions would lead to conversations without any data backing them, which either eat up your team’s time, cause frustrations inside your team, or both.
If you are the only one interested in this stuff, things are much easier for you. You will never enter into committee decision-making because others don’t really care what you add to the instructions. But in our last training, I made a joke that my co-trainer and I would start arguing about a specific instruction line during the first day if we were to work on a project together. My co-trainer corrected me by saying that the argument would probably ensue during the first hour instead of the first day.
Recap time: if you are generating the very first shared instruction file for your team, describe the architecture in 1 to 2 sentences and list the most used development commands. Stop there.
How to evaluate instructions?
Let’s now dive deeper into instruction bloat. If you are not yet familiar with the concept of context rot, pay really close attention. This stuff is the silent killer of the quality of your coding agent’s outputs.
I feel that developer thinking in 2024 centered on how, once we got rid of small context windows, we could finally get more out of coding agents: they could read more files and folders in our codebase and run all sorts of queries against external data sources. In 2025 we became much less constrained by context windows, and then ran into a new problem: the more context you have, the more hallucinations you end up with as well. LLMs will interpret irrelevant facts as relevant, so you generally want to limit the amount of material the agent can misinterpret.
As instructions are loaded into the coding agent’s context almost every time you start a new session or prompt your agent, you are basically working with a less streamlined context compared to having no instructions. The more instructions you load into the context, the more hallucinations you might encounter.
I’ve heard some people argue that this is just a temporary problem and future models won’t have this context rot issue. That’s all well and good. But I need to write code now, and I need to make the most out of the tools I currently have access to.
In addition to context rot, the agent is operating with a system prompt that is not visible to you, and post-training that is even less visible. You might end up adding instructions that conflict in subtle ways with the system prompt. Your instructions might also go against the training process, which in turn might limit the agent’s problem-solving skills. Obviously this is all hypothetical, but I hope you see the possible risks here, and that neither you nor I can really prove this isn’t something we should be concerned about.
If you are starting with a very slim instruction set, you won’t run into instruction bloat issues. But fast forward the AI adoption process inside your team. More and more developers are going to start having more and more suggestions about new instructions. And if you don’t have some sort of gating for these instructions, you’ll end up with massive instruction files that decrease the quality of your coding agents’ outputs. And since there are so many rules, you have no idea how to test which rules are working for you and which aren’t. You might remove all of the instructions and notice that the agent is performing equally well — or even better — without your current setup.
So how do we avoid this situation? How do we evaluate instructions and skills so that we know they are making it easier — not harder — to complete tasks with AI assistance?
From local to global
The setup that most of us would probably want is a project-specific benchmark for instruction and skill evaluation. We would save a past state of the codebase, select a set of development tasks that were added later, and then build a harness that allows us to send an agent to try to replicate those tasks with a given prompt and instructions. We would have automated tests that verify the agent’s implementations work, so we don’t need manual evaluation. This, in turn, would allow us to run many different tasks across many iterations. And most importantly, we would have quantitative data on whether a particular instruction increased or decreased the agent’s performance.
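A minimal sketch of what that harness loop could look like. Everything here is an assumption about your project, and `run_agent` is a stub for the part that would check out the past commit, invoke your agent CLI with the candidate instructions, and run the project’s automated tests:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    prompt: str  # e.g. "Add CSV export to the reports page"

def evaluate(tasks, run_agent, instructions, iterations=3):
    """Pass rate for one instruction variant across tasks x iterations.
    run_agent must return True when the task's automated tests pass."""
    passes = sum(
        run_agent(task, instructions)
        for _ in range(iterations)
        for task in tasks
    )
    return passes / (len(tasks) * iterations)

# Stub agent so the sketch runs end to end; replace with a real
# checkout -> agent run -> test suite loop.
def fake_agent(task, instructions):
    return "test" in instructions  # pretend this instruction always helps

tasks = [Task("csv-export", "Add CSV export"), Task("auth-fix", "Fix login bug")]
baseline = evaluate(tasks, fake_agent, instructions="")
candidate = evaluate(tasks, fake_agent, instructions="Run bin/rails test after edits.")
print(baseline, candidate)  # 0.0 1.0
```

Comparing `baseline` against `candidate` per instruction is what turns “this rule seems to help” into a number you can argue about.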
Building these types of benchmarks is something I have looked into and done early experimentation with. But I haven’t actually built a fully functioning benchmark like this. I can tell you that building these benchmarks is getting easier and easier as more and more great open-source tooling becomes available. But building and maintaining these benchmarks will end up taking a lot of time from you and your team.
If I were given the opportunity to build a benchmark like this for the current project I work on, I’d have to ask for a minimum of two weeks to build it. But I haven’t done this thing before, and just like you, I also suffer from optimism bias. So when I say I can get it done in two weeks, you should add plenty of buffer on top of that estimate.
I also have no idea how much maintenance this benchmark would require. I’m mostly worried about the tasks in the evaluation set expiring in the sense that they stop reflecting the current work we are doing in the project. But I have no idea how fast the tasks would expire. We might not need to update anything in the next 12 months, or we might need to revisit them every quarter.
Finally, if I were to do a return-on-investment calculation for this, I wouldn’t know what the expected return is. I have no idea about the baseline performance of agents with no instructions versus agents with non-evaluated instructions versus agents with evaluated instructions. Even if I were able to nail down the estimate, the actual value of this initiative would remain difficult to quantify.
So is there some other way we could go about evaluating instructions? I believe so. The following idea is something I’ve discussed with other developers in the past weeks but haven’t actually tried out in practice. This setup wouldn’t work in my current work context because of very context-specific reasons, but it’s something that might be very easy for you to start experimenting with in your team.
This is the idea:
- Whenever you want to add a project-wide instruction, you first add it as a local instruction (meaning you keep it in your version of the project but out of Git).
- Once you have used the instruction for a set amount of time (e.g. two weeks) and you still feel that it improves agent outputs, you propose it to the team.
- At set intervals (e.g. every month), the team decides which proposed instructions they take into evaluation. These instructions are then added to Git and become available for everyone working on the project.
- Once the team has used the instruction for a set amount of time, they discuss their findings. Did outputs seem to get better or worse? Do they want to keep it or not? Do they want to keep evaluating it with a slight tweak?
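The first step needs nothing beyond Git itself. The per-clone exclude file hides a file from version control without touching the shared `.gitignore`. The filename `CLAUDE.local.md` reflects Claude Code’s support for personal, untracked instruction files; if you use another tool, check its docs for the equivalent and adjust the name:

```shell
# Demo in a scratch repo; in your project, run the exclude line once
# from the repository root.
set -e
cd "$(mktemp -d)"
git init -q .
echo "Prefer small commits; run bin/rails test before finishing." > CLAUDE.local.md
# .git/info/exclude is a per-clone ignore list that is never committed,
# so the instruction stays local while you evaluate it:
echo "CLAUDE.local.md" >> .git/info/exclude
git status --porcelain   # prints nothing: Git does not see the file
```

When the instruction graduates to the team-wide file, delete the local copy and remove the exclude line.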
The team-wide evaluation step could be even less formal by keeping a log of suggested instructions and allowing individuals to pick which ones they want to try next. If someone else ends up liking the instruction, they can cast a vote for it to be added to the project-wide instructions.
This process adds a gate that all instructions need to pass. Since it encourages local use of instructions, I cannot see that developers would view this process as something that limits their experimentation. At least I would be happy knowing that the project-wide rules I must use have been vetted by the whole team — including me.
It’s also relatively easy to adopt or start experimenting with. No one needs to set up an evaluation harness. The biggest hurdle is probably getting agreement inside the team about how you want to manage this process. But hey, maybe this is something you could bring up in the next retrospective? Feel free to point your team to this post to explain the why and how of the idea.
Some additional notes about “from local to global”:
- Keep the number of instructions under evaluation as low as possible so that you don’t end up with too many variables in your experiments. The more instructions you evaluate at the same time, the harder it becomes to attribute any change in output quality to a specific one.
- I’m not saying that these qualitative evaluations will be free from all forms of cognitive bias. I’ve asked many teams how they handle project-wide instructions, and the answer I’ve heard from all of them is, “we add the instructions that seem to make sense.” That’s probably something that works for many teams as they start adding their first instructions. But I’m not so sure this approach will prevent instruction bloat and bike shedding.
What about skills?
Agent skills are now a thing as well. Instead of just sharing our instructions with the team, we are encouraged to share skills that can be used to basically break down large instruction sets into smaller files.
Does this solve our problem with instruction bloat and bike shedding?
When it comes to bike shedding, I can’t imagine people having fewer opinions about skills than about instructions once they become more familiar with them. So expect no fewer unproductive meetings and conversations around skills.
Instruction bloat can be managed with skills. Here are the two big buts:
- Initial evaluations show that skills are more prone to failure compared to instructions. This is partly because skills inherently require the agent to make the right skill call at some point during its task. Will this issue be solved in the future? Hopefully. But in the meantime, build your ways of working around the current capabilities of agents.
- Skills add their descriptions to the context. That’s fewer tokens than full instructions, but don’t think it means zero chance of confusing irrelevant information with relevant information. Every new skill is also a new way for the agent to get things wrong. And I don’t know where the balance between “too few skills” and “too many skills” lies.
What about AI evaluating instructions?
I’ve heard of people using coding agents to evaluate their instructions and treating that as a quality metric. Two things need to be taken into account when evaluating instructions:
- Is the instruction understandable for agents (and humans)?
- Does the instruction contribute positively on average to all agent tasks?
I can see how using agents to evaluate the first question works pretty well. But that is the easiest part to evaluate. The much harder part is whether output quality goes up or down, and the answer is not something that exists in the training data or can be reasoned out from first principles. It really does require evaluations.
Do we need “instruction engineers”?
It sounds like we need new capabilities (instruction and skill crafting) in our teams. Should some people dedicate their attention to this and become the writers of instructions?
No. If you have ever done prompt engineering for an AI-powered tool or even crafted prompts for a particular workflow, you will have noticed that it’s really important to keep the feedback loop as short as possible. You won’t get the prompts right on the first try. You will need to adjust things, try again, adjust more, try again a couple of times, and so on until you consistently get better outputs.
In practice, this means that when it comes to instructions and skills, dogfooding your instructions seems like the best way to go about it. By dogfooding, I mean that you evaluate instructions and skills by being a user of those instructions and skills. If you have centralized “instruction engineers” running around between teams, you will lose that feedback loop and the necessary intuition-building that feature developers need to cultivate.