AI & US Government Software Development: A Primer (Q3 2025)
- Suren Nihalani
- Mar 7
- 7 min read
Updated: Jun 25
What does responsible AI integration look like? We update this primer every quarter to reflect the latest in tool development and emerging best practices.
The great challenge of the decade has arrived. Not only has the pace of AI progress overwhelmed even many industry insiders, it’s also the slowest that it will ever be again.
This presents an obvious difficulty for those procuring and developing software for all levels of American government: how can a new project simultaneously leverage the best AI tools, remain future-flexible, and take concerns around reliability and data safety seriously?
We’ve prepared this evergreen guide to lay out our thinking on the three aspects of this puzzle that we get the most questions about.
1. Which AI tools and platforms are now widely considered acceptable for use in government?
While there’s a bit of fuzziness between them, there are four basic layers of AI technologies:
AI foundation models are the algorithmic engines that turn training data into a set of “weights” that can be called upon to perform calculations. Most models are released in multiple versions, each with its own tradeoffs between speed, price, and performance. These models fall on a spectrum between fully open (i.e., the training data and/or weights are public) and closed (i.e., almost fully proprietary and only accessible via API).
AI tools can leverage any model, and may tap into more than one for more complex work. Though they’re sometimes made by the model developers themselves, more commonly they’re created as “wrappers” that license models for specific use cases.
AI agents are a subset of tools that take a more active role: you give them tasks instead of queries, and they produce recommended actions rather than bare answers. Agents typically use multiple AI models and tools in the same way a clever assistant might. (A minimal sketch of how these layers fit together follows this list.)
AI platforms create a larger user interface for your operational needs, where a combination of models and tools can be integrated and called upon as requested.
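To make the layering concrete, here’s a minimal, purely illustrative Python sketch of how the pieces stack: a model call sits inside a tool, an agent decides which tools to run for a task, and a platform would wrap agents like this one in a shared interface. Everything in it (the stubbed model call, the permit-backlog tool, the agent loop) is hypothetical rather than any specific vendor’s product.

```python
# Purely illustrative: the model endpoint, tool, and agent below are stand-ins,
# not any specific vendor's product.

def call_model(prompt: str) -> str:
    """Layer 1 - foundation model. In practice this is an API call to hosted
    weights, or local inference against open weights; stubbed here."""
    return f"MODEL ANSWER to: {prompt[:60]}..."


def summarize_permit_backlog(records: list[str]) -> str:
    """Layer 2 - tool: a 'wrapper' that applies the model to one specific job."""
    prompt = "Summarize these permit records:\n" + "\n".join(records)
    return call_model(prompt)


def backlog_agent(task: str, records: list[str]) -> list[str]:
    """Layer 3 - agent: given a task (not a query), it decides which tools to
    run and returns recommended actions rather than a bare answer."""
    summary = summarize_permit_backlog(records)  # the agent picks a tool
    next_steps = call_model(f"Given this summary, propose next steps: {summary}")
    return [f"Task: {task}", f"Findings: {summary}", f"Recommended actions: {next_steps}"]


if __name__ == "__main__":
    # Layer 4 - a platform would wrap agents like this in a shared interface,
    # adding logging, access controls, and a catalog of approved tools.
    for line in backlog_agent("Clear the permit backlog",
                              ["Permit 101: pending 140 days", "Permit 102: expired"]):
        print(line)
```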
When evaluating suitability for government use, the first consideration is usually data exposure. Given unusually high privacy sensitivity, even using a basic chatbot is typically forbidden unless the data stays within a closed environment that’s been built specifically for government use. While coordination between levels of government is still ongoing, expectations are that cities and states will soon (i.e., likely within 2025) have formal guidance that broadly allows them to follow federal clearances.
In the interim:
The few major model labs that don’t already do so will soon offer government implementations that don’t expose any data to their own servers. (Open-weight models like Meta’s Llama can also be freely licensed today for most use cases, and can be installed on air-gapped laptops to allow for fully secure testing playgrounds; see the sketch after this list.)
Leading AI platforms (from e.g. Palantir, SAS, AWS, Microsoft, and Google) are building out marketplaces of tools and agents approved for use on sensitive data.
The data generated by this use can also be both logged and retained for training future AI models. (This data should also be paired with digitization of internal PDFs. Lots of agencies can improve considerably at fraud detection and general efficiency by simply unlocking and centralizing information they already have.)
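On the air-gapped testing playground mentioned above, here’s a minimal sketch of what fully local inference can look like, assuming the Hugging Face transformers library and a Llama checkpoint that has already been copied onto the offline machine (the local path below is hypothetical):

```python
# Minimal sketch: run an open-weight model entirely offline on an air-gapped machine.
# Assumes the transformers library and the model files were installed/copied beforehand;
# the checkpoint path below is hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "/opt/models/llama-3-8b-instruct"  # local copy of the weights, no network needed

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, local_files_only=True)

prompt = "Draft a plain-language summary of this permit denial letter: ..."
inputs = tokenizer(prompt, return_tensors="pt")

# Generation happens entirely on the local machine, so no data leaves the device.
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because both the weights and the generation live on the local machine, nothing in this loop touches an outside server, which is what makes it a safe sandbox for experimentation.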
Our expectation is that federal agencies and departments will soon be doing far more of their work on AI platforms that ingest all their data and provide rich options for both analyzing and acting on said data. Legacy systems will increasingly exist in parallel to preserve stability while component pieces are individually migrated to more modern infrastructure.
Which bucket does something like Grok, Claude, or ChatGPT fall under?
This gets needlessly confusing in the way that’s common to engineering cultures. The leading US AI labs (e.g., xAI, Anthropic, OpenAI, Google, Meta) each have a flagship model that may or may not share a name with their primary user interface. Anthropic calls their chatbot Claude, which might be powered by, e.g., their Opus 4.0 model. OpenAI calls their chatbot ChatGPT, which can pull from a dozen or so different models like 4.5, 4o, or o3. Other models, like xAI’s Grok 3.0, also have a range of “modes” that shape the “personality” of their outputs. But this will become less important to keep track of as workflows shift to agents that simply call on whichever models and/or tools they know to be most useful for the task at hand.
2. To what degree should contractors be encouraged and/or allowed to incorporate AI in their development?
While AI has gone through some intense hype cycles, it’s instructive to consider how the largest American tech companies have begun using AI tools themselves.
As far back as October 2024, Google was already using AI to write 25% of its new code.
Many startups entering incubators today use AI-powered coding environments (IDEs) like Cursor and Windsurf to write their code for them, with developers then refining outputs as they go.
The major AI labs are all expecting 2026-2027 to be the tipping point where AI rapidly takes over their development processes (including work on model self-improvement).
Overall AI progress (output per dollar) is improving at a rate of something like 10x per year. Leading models are losing their benchmark spots as fast as music singles shift on the charts. And cost curves are following, allowing models to do far more computation, including sanity checks, at substantially lower sticker prices.
While human engineers aren’t going anywhere, the percentage who write all their own code is rapidly trending towards zero. It simply isn’t time efficient, nor is it meaningfully better for an increasing share of use cases. While frontier challenges will always exist, a model that’s learned from many thousands of past implementations will often get the rote stuff right at quite a sophisticated level of understanding.
Putting this together:
Any vendor that says they don’t use AI anywhere is going to produce a worse and more expensive product, or else they’re simply being dishonest.
Talent matters more than ever. AI rarely makes a mediocre programmer better. Bad talent will just produce mediocre work faster, which means more remedial effort later. It also takes experience and judgment to hone a sense of where AI code is suitable.
Testing is either foundational to a culture or it isn’t. No AI-produced code (nor really any code at all) should ever reach production before going through robust checks; a minimal example of such a check follows this list.
A responsible vendor will outline their approach as part of the procurement process, including extended detail about their quality control systems. They also either have a reputation for strong hiring bars or they don’t. If they do, AI will make them a substantially better partner. If they don’t, you’ll get exactly what you pay for.
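To make the testing point concrete, here’s a hypothetical sketch of the kind of gate we mean: a small utility function (which could have been written by a human or generated by an AI assistant) alongside the pytest checks it must pass before it ships. The function, its behavior, and the test names are all illustrative.

```python
# Hypothetical example of a pre-merge test gate: the function below could have been
# written by a human or generated by an AI assistant; either way, it does not ship
# until the same tests pass (e.g. by running `pytest` in the CI pipeline).
import pytest


def normalize_case_id(raw: str) -> str:
    """Normalize a case ID like ' ab-0042 ' to the canonical 'AB-42' form."""
    cleaned = raw.strip().upper()
    prefix, _, number = cleaned.partition("-")
    if not prefix.isalpha() or not number.isdigit():
        raise ValueError(f"Malformed case ID: {raw!r}")
    return f"{prefix}-{int(number)}"


def test_normalizes_whitespace_and_case():
    assert normalize_case_id("  ab-1042 ") == "AB-1042"


def test_strips_leading_zeros():
    assert normalize_case_id("AB-0042") == "AB-42"


def test_rejects_malformed_input():
    with pytest.raises(ValueError):
        normalize_case_id("1042")
```

The point isn’t this particular function; it’s that the same automated bar applies to all code regardless of who, or what, wrote it.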
Isn’t “hallucination” still a major unsolved problem?
Yes and no. Where the data/knowledge is already in a single place, AI models can now reason at something like the level of an 80th-percentile human on most tasks (and well into the 99th percentile on others). When pulling data from multiple sources and formats, there’s more room for things to break down. That weakness is fading quickly, though, as techniques like chain-of-thought and mixture-of-experts become more sophisticated and as work on data access standardization continues. All of this will improve rapidly as chips and inference techniques grow in speed, power, and efficiency, though it may be years before fully automated testing systems are widely considered reliable.
3. What are the red lines and red flags for irresponsible AI integration?
There's really just one hard red line: does your vendor have a detailed plan for ensuring that none of your sensitive data is ever exposed to an unsanctioned system? Everything else can be fixed without an irreversible loss, though of course even reversible losses matter to whether your project will come in on time and make taxpayers believe their dollars were well spent. As vendors can be excellent at buzzwording, we recommend drilling down on specifics to separate real expertise from vague awareness.
Press for concrete explanations. If a vendor tells you about, e.g., the magic of the Model Context Protocol (MCP), can they also walk you through exactly what it does and how it differs from traditional APIs? (Hint: traditional APIs are bespoke to each organization, so every integration needs its own custom glue code; MCP is an open standard meant to give AI tools one common language for connecting to data sources and tools. A minimal sketch of an MCP server follows this list.)
Ask about breadth of familiarity. Developers can easily grow too comfortable with a given tool to really experiment widely. While this is still better than sticking to legacy alternatives, being even six months behind the curve will meaningfully impact relative outcomes. Which specific processes do they have internally to ensure they aren't getting stagnant? (GovTron solves this by rotating between talent partners based on their fields of specialty, and by vetting these partners on this question specifically before proposing them for work on a given project component.)
Listen for humility. Given the pace of development, no one is going to have stayed 100% current. This is ok, and each good vendor will find their own sweet spot between following the curve and commitment paralysis. (Our practice is to say "we'll find out" and to rely on our talent partner structure to tap the people closest to the newest tools. Our principals try hard to keep up, but are comfortable being extremely human.)
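As promised above, here’s a minimal sketch of what exposing an internal data source over MCP can look like, assuming the official MCP Python SDK (the `mcp` package); the server name, the permit data, and the tool itself are hypothetical, and import paths may differ across SDK versions.

```python
# Sketch of an MCP server, assuming the official `mcp` Python SDK.
# The server name, tool, and data are hypothetical; the point is that any
# MCP-capable AI tool can discover and call `permit_status` without custom glue code.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("permit-records")

# Stand-in for an internal system of record.
_PERMITS = {"AB-1042": "approved", "AB-1043": "pending review"}


@mcp.tool()
def permit_status(permit_id: str) -> str:
    """Return the current status of a permit by its ID."""
    return _PERMITS.get(permit_id, "not found")


if __name__ == "__main__":
    mcp.run()  # speaks the standard MCP protocol (stdio transport by default)
```

The contrast with a traditional API is the point: any MCP-capable AI tool can discover and call permit_status through the shared protocol, whereas a bespoke REST endpoint would need custom integration work for each tool that wants to use it.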
Does this mean the end of vendors subcontracting their work?
Yes, though not immediately. Our philosophy at GovTron is to match clients with engineers, user researchers, and product managers who have passed the highest hiring bars in American tech. If clients wish to further reduce costs for custom development (i.e., where reusable code isn’t an option), we can also use trusted partners abroad for suitable project sections. But the same dynamic applies there as here: we’ll always be picky, as mediocre talent works against good outcomes. We do expect, though, that AI will eventually become the only required co-developer, likely by 2027 or so.