Buying and Deploying AI Agents in the Public Sector
- Suren Nihalani
- Jan 23
- 8 min read
What’s overhyped, what’s underappreciated, and how to get off on the right foot.
GovTron is a collective of Silicon Valley engineers and product managers. We designed our explainer series to give procurement officers real-world context about what emerging tools can and can’t actually do, along with tips on successful implementation. We update these guides quarterly. This one is current to Q1 2026.
The toddler years of every new technology are loud. Thought leaders and sales folks compete to predict the future and pitch the one innovation that you absolutely must leverage to avoid being left behind. This incessant noise makes it difficult for outsiders to gauge both reality and reasonable expectations.
When it comes to AI agents specifically, there are two truths and a half-lie:
They’ve indeed restructured how we in the tech industry approach our own work. If anything, this shift has been faster and more fundamental than is commonly reported.
It's unfortunately true that many outside deployments have underwhelmed. Legacy procurement and integration approaches are a tough match for these particular tools, and many promising plans haven't survived contact with organizational realities.
Hype around AI agents replacing humans has gotten ahead of itself. They’re more a second parallel workforce that can reduce admin, speed cycle times, and lower coordination costs across large organizations—if implemented well.
The thrust of what follows is about bridging the gap. What does good implementation look like? There are no fixed barriers preventing the public sector from seeing the same results as the tech industry itself. Most of the difference is just process. And there are things that every org can do to establish the right preconditions for much happier outcomes.
But before we get to the tactical, a bit of context-setting.
Isn't GovTron just talking its own book here too?
In part, yes! But we also designed our business model so that winning contracts has minimal impact on our own finances. We just want the public institutions that we and our families use to be excellent. If our resources can help without our further involvement, great! What matters is just services getting better!
The Stakes
As a rule, the cost of getting deployment wrong rises with the underlying pace of change. A bad buildout that leaves you stuck in time effectively burns you twice: first in opportunity costs, and then in the lost internal capital and flexibility you need to try again.
Acceleration here isn't theoretical. Anthropic, the AI lab behind the widely used coding agent Claude Code, recently published a new platform that was itself 100% written by Claude Code over just a few weeks. And this happened within a year of Claude Code's initial release. We're in unprecedented territory. The upshot is that flexible implementations, ones that can incorporate the best new tools as they become available regardless of vendor, will massively outperform static "buy once, integrate once" alternatives.
While government agencies and academic institutions don't tend to conceptually group themselves with private companies, the competitive pressures they face are just as real:
A university that uses agents productively can stretch the same budget to attract more top students, improve satisfaction scores, and generally reduce admin burdens.
Even if you’re the sole agency to provide a given service, your employees vote with their feet. Orgs that make daily workflows less frictional and annoying—thus letting people focus on more rewarding work—will see non-trivial recruiting advantages.
It’s not quite adapt or perish. But the advantage curve is steep enough that (1) doing nothing comes with real costs, and (2) bad approaches can lock you into suboptimal workflows and make the next attempt both contractually and politically harder.
What if my RFP structure isn’t flexible?
There are always ways to increase your odds and decrease your pain. We cover these more in the final section. In general, we recommend fighting for the maximum optionality available to you. Any lock-in that can be avoided should be avoided—be that vendors, platforms, or any other contract that might turn into an albatross.
Glossary
A few important terms:
Models - The raw brains of any AI implementation.
Agents - The digital worker bees that complete tasks using models as their hive mind.
Orchestrators - The management layer that routes tasks, applies guardrails (e.g., permissions, policy checks, rate limits), and records activities for auditing.
APIs - The gatekeepers that models and agents interact with to access external systems like apps and databases.
MCP (Model Context Protocol) - A new standard that lets (some) APIs be accessed in a consistent, predictable way, freeing models and agents from having to look up local one-off instructions for each system.
Fine-tuning - Training models and agents on your specific context using curated examples.
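To make the relationships concrete, here is a minimal sketch in Python of how these pieces typically fit together. Everything in it is illustrative: the class names, the call_model() helper, and the task types are invented for this example rather than taken from any vendor's SDK. The point is the division of labor: the orchestrator routes and logs, the agent does one narrow job, the model supplies the reasoning, and APIs (or MCP servers) are how anything outside the system gets touched.

```python
# Illustrative sketch only. Class names, the call_model() helper, and the task
# types are invented; a real deployment would wrap an actual model provider's
# SDK and your own systems of record.

from datetime import datetime, timezone


def call_model(prompt: str) -> str:
    """Stand-in for a call to whichever model provider you are using."""
    return f"[model response to: {prompt}]"


class RecordsAgent:
    """An agent scoped to a single job: drafting replies to records requests."""

    def handle(self, task: str) -> str:
        # The model is the "brain"; an API or MCP server would be the "hands"
        # that actually pull the underlying record.
        return call_model(f"Draft a reply to this records request: {task}")


class Orchestrator:
    """The management layer: routes tasks, applies guardrails, keeps an audit trail."""

    def __init__(self) -> None:
        self.agents = {"records_request": RecordsAgent()}
        self.audit_log: list[dict] = []

    def submit(self, task_type: str, task: str, requester: str) -> str:
        # Guardrail: only known task types reach an agent; everything else is
        # escalated to a human queue.
        if task_type not in self.agents:
            self._log(task_type, requester, outcome="escalated_to_human")
            return "Routed to a human reviewer."
        result = self.agents[task_type].handle(task)
        self._log(task_type, requester, outcome="handled_by_agent")
        return result

    def _log(self, task_type: str, requester: str, outcome: str) -> None:
        # Every action is recorded so auditors can reconstruct what happened.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "task_type": task_type,
            "requester": requester,
            "outcome": outcome,
        })


if __name__ == "__main__":
    orch = Orchestrator()
    print(orch.submit("records_request", "Copy of permit #1234", "resident@example.com"))
    print(orch.submit("legal_opinion", "Is this clause enforceable?", "staff@example.gov"))
```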
What Agents Already Do Well
What forms of work can agents be reliably tasked with today?
Exposing your offerings. Users want to access your services with a minimum of effort. Agents can help you integrate with the apps and interfaces that users already know and prefer. When someone can just send a text instead of navigating your unfamiliar portal, they’ll choose the text every time.
Taming your internal knowledge. Most knowledge in any org is spread between policy binders, messy databases, and individual brains. Agents can help map this knowledge, while also flagging inconsistencies and making search much easier.
Handling your admin. Lots of tasks involve pulling (substantially) the same data, creating the same reports, and searching the same documents. Agents love repetitive tasks in ways that we humans don’t.
Freeing your talent. Employees are often stretched thin. Agents can triage requests, answer the easy ones, and generally trade out low-leverage work for the thornier edge cases that actually benefit from human input.
Ensuring better compliance. We're all fallible, and don't always follow every step in a checklist, especially the dull ones. Agents can fill the gaps by handling the sub-tasks that humans prefer to skip, and can also run automated checks to ensure nothing was missed accidentally.
Capturing hidden signals. You have lots of data that isn’t being interrogated enough. What patterns are in there calling for adjusted offerings or processes? Agents can be tasked to be curious on your behalf, and can surface findings (of varying utility) at tiny marginal cost.
As a rule: if a decision or action can be reduced to a set of fixed rules, agents are a godsend. A good agentic system will gradually swallow all the work that doesn’t require human judgment, while also ensuring that the people left to supervise the system and manage edge cases have maximum help and context available.
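To make that rule concrete, here is a deliberately tiny Python sketch of the routing logic it implies. The rule names and task fields are invented for illustration; the pattern is the point: anything covered by a written rule goes to an agent, and anything requiring judgment goes to a person, with full context attached.

```python
# Hypothetical routing sketch. Rule names and task fields are invented;
# the pattern (written rules -> agent, judgment -> human) is what matters.

FIXED_RULES = {
    "standard_renewal": lambda t: t["type"] == "renewal" and not t["changes_requested"],
    "copy_request": lambda t: t["type"] == "copy_request",
}


def route(task: dict) -> str:
    """Send rule-covered work to an agent; everything else to a human, with context."""
    for name, rule in FIXED_RULES.items():
        if rule(task):
            return f"agent:{name}"   # fully covered by a written rule
    return "human_review"            # requires judgment


print(route({"type": "renewal", "changes_requested": False}))   # agent:standard_renewal
print(route({"type": "variance", "changes_requested": True}))   # human_review
```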
What’s a real world example of this going well?
The NCIS, of network TV fame, recently began recruiting faster than the rest of the Pentagon. How? Using agents to speed up cycle times. While humans are still deciding who gets hired, every recruiting process involves dozens of individual steps, many done in series, where automating the rote ones can both free up headspace for evaluations and make each handoff happen much faster. Also of note: the agency's first attempts at integrating AI fell short. They initially relied on off-the-shelf tools and didn't prioritize testing. There are no magic solutions, only magic playbooks.
The Wrong Way to Implement
Let’s start with the big one: when we say that public-sector implementations often fail because of process, what do we mean specifically?
Agents aren't like traditional software. Implementations aren't signed off on once and then left to run for multiple years. Given that the underlying models are constantly evolving, everything downstream needs to be at least somewhat as dynamic. Tech companies are mostly already structured to accommodate this: departments explicitly plan for frequent, ongoing adjustments.
Relative to most public orgs, this difference shows up in two practical ways:
A tech company will test many agents, models, and prompts to see what works. There's little internal pressure to lock in any specific solution quickly, and high confidence that the incremental costs of serial testing will be outpaced by long-run savings.
This testing and iteration never really stops. As the models themselves change (often between official updates) and new tools become available, experiments continue. So it's not just that more options will be tried initially, but also more or less forever.
This is the scaffolding that makes any long-term deployment work. And it’s why we strongly encourage public orgs to use modular contracting as a means of setting the right preconditions for later success. It’s not about choosing the right solution in isolation. Even the most thoughtful one-time implementation will fail or fall behind in some way in a month or two. It’s about ensuring you’re setting the right foundation to be able to test widely and iterate indefinitely with a minimum of friction.
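One small technical habit supports this kind of indefinite testing: never hard-code a single model or vendor into a workflow. The Python sketch below uses invented vendor names and a placeholder complete() method (real adapters would wrap each provider's actual SDK), but it shows how a thin adapter layer turns "try the new model next quarter" into a one-line configuration change rather than a new integration project.

```python
# Sketch of a provider-agnostic model layer. Vendor names and the complete()
# signature are placeholders; in practice each adapter wraps that vendor's SDK.

from typing import Protocol


class ModelAdapter(Protocol):
    def complete(self, prompt: str) -> str: ...


class VendorA:
    def complete(self, prompt: str) -> str:
        return f"[vendor A answer to: {prompt}]"


class VendorB:
    def complete(self, prompt: str) -> str:
        return f"[vendor B answer to: {prompt}]"


# Swapping models (for a test, or because a better one shipped) is a
# configuration change, not a re-procurement.
ACTIVE_MODEL: ModelAdapter = VendorA()


def draft_reply(request: str) -> str:
    return ACTIVE_MODEL.complete(f"Draft a reply to: {request}")


print(draft_reply("When does the transit permit office reopen?"))
```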
How do I know what my modular contracting options are?
Much has changed here recently, especially at a federal level. There’s increasing top-down support for breaking down projects into smaller, iterative stages and avoiding vendor lock-in. We’re working on a new resource to cover the latest FAR overhaul, and hope to follow it with guidance on equivalent state and regional authorities across the US. In the interim, we’re happy to assign a specialist to look into your particular case.
General Implementation Tips
Even if you’re already locked into a less flexible RFP process, there’s a lot you can still do:
Many agents, one voice. While you want external users to encounter a seemingly unified experience, behind the scenes you want many agents, each narrowly tailored to one concrete job where it can (1) learn, (2) be kept in line, and (3) work well with others.
Focus on the biggest fish. There are tasks where agents already have deep experience that you can leverage (e.g., coding, customer support, and parsing databases). Given that subscription and token prices are volatile, prioritizing high-ROI projects insures your budget and makes further spending requests more attractive.
Favor defined, measurable tasks. Your early implementations should all drive some specific and unambiguous metric: completion rates, satisfaction scores, human hours saved, and so on. It's important to be able to point to specific and tangible wins.
Look at data holistically. When one key metric improves, are you monitoring possible declines in others? Some wins are tradeoffs, and tradeoffs can only be managed when you catch them in real time (or close to it). Every PM has learned this the hard way!
Future-proof your implementation. You never know when a model might be either improved or degraded (this will happen frequently, often without notice). You need to be ready for both—in your processes and your contracts.
Break things down. Every institutional action is really multiple steps in a trench coat. Don't assign entire processes. Break them down to their atomic pieces, then have your orchestrator begin gradually delegating. Say your recruiters are spending extensive time reading CVs and transcripts. If you can articulate even part of what they're looking for, you can train an agent to ask those preliminary questions for you. Anything that speeds up cycle times can have outsized benefits. (A sketch of this pattern follows this list.)
Let agents help with triage. They can be trained to understand urgency and to route accordingly, which reduces reliance on humans getting around to manual checks.
Multiply your data. In a perfect world, human employees would log all the aspects of every interaction they have with a constituent. But they don’t, and it’s rarely cost-efficient to have them try. What can you train agents to observe and document in a way that’s privacy-friendly and that staff will see as an assist and not an intrusion?
Allow for (some) creativity. Within strong and enforced boundaries, it's often good to let agents make some independent decisions where risks are low and reversibility is high. Anthropic published a pair of transparent (and charming) reviews of its initial attempts to have an AI agent run an on-campus vending machine (using older models than are available today). Many things went wrong the first time! But the gains from unexpected learnings made this a very profitable exercise.
Anticipate gaming. If your agent is accidentally giving out freebies or can be exploited for pranks, expect news to travel quickly. Make sure you've tested for possible attack vectors and also instructed an agent to flag any strange new requests coming in (especially ones sharing a common script or element).
Always be auditing. Agents produce a tremendous amount of data, and can also be tremendously useful at parsing this data. But there's still no substitute for close human analysis. Make sure that you're capturing the right data (and explicitly not capturing what you shouldn't) and regularly inspecting it. There's always gold in there.
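As promised above, here is a sketch of the "break things down" pattern using the recruiting example. Everything in it is invented for illustration (the step names, the transcript fields, the thresholds); the idea is that only the rote, rule-like steps get delegated to an agent, and everything the agent learns travels with the application to the humans who make the actual decision.

```python
# Illustrative only: one process (reviewing applications) broken into atomic
# steps, with only the rote ones delegated. Step names, fields, and thresholds
# are invented for this sketch.

def screen_transcript(application: dict) -> dict:
    """A rote pre-check an agent can run against written minimums."""
    return {
        "meets_gpa_minimum": application["gpa"] >= 3.0,
        "required_courses_present": {"STATS101", "ETHICS200"} <= set(application["courses"]),
    }


PIPELINE = [
    ("screen_transcript", "agent"),     # fixed rules -> delegate
    ("summarize_essays", "agent"),      # rote summarization -> delegate
    ("interview_and_decide", "human"),  # judgment -> stays with people
]


def process(application: dict) -> dict:
    notes = {}
    for step, owner in PIPELINE:
        if step == "screen_transcript":
            notes[step] = screen_transcript(application)
        else:
            # Remaining steps are queued for their owner, with everything the
            # agent has already gathered attached as context.
            notes[step] = f"queued for {owner}"
    return notes


print(process({"gpa": 3.4, "courses": ["STATS101", "ETHICS200", "HIST110"]}))
```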
Still have questions? Want us to cover other topics? Please reach out.