SwarmEngines.CLOUD

Research

Open problems we are funding the work to solve.

For technical reviewers. This page describes the unsolved engineering and AI-safety problems we are researching. The two narratives on this site are deliberately separate: the homepage states the research thesis, this page carries the technical narrative, and the commercial product is what makes the research economically possible. Neither stands without the other.

Thesis

Mainstream agent platforms were architected for a different customer than the 33-million-firm U.S. small-business sector.

Autonomous AI agents are about to enter the small-business market — the 33 million U.S. firms that employ fewer than 500 people and account for 99.9% of American businesses. The agent platforms that exist today were architected for a single enterprise team with engineers, a security review process, and the ability to absorb the runtime cost of a per-task language-model call.

Move that same architecture down to a five-employee HVAC company on a $300/month budget and three things break: tenant isolation, multi-agent failure modes, and operator-grade auditability. We believe these three problems are research problems, not product problems. Solving them is what enables an AI agent fleet to act on a small business's behalf — answering its phones, replying to its reviews, dispatching its technicians — without exposing it to systemic compromise, catastrophic coordination failure, or unaccountable autonomous action.

Open problem 1

Per-tenant agent isolation at SMB unit cost.

State of the art

Today's mainstream agent platforms — LangChain agent loops, OpenAI Assistants, the various GPT-Action-style tool-calling frameworks — share a runtime across customers. Tools, code execution, and tenant credentials co-exist in the same process or container. This works because the customer is typically a single engineering team that owns the deployment.

For a multi-tenant platform serving thousands of small businesses, the same architecture creates an unbounded compromise blast radius. A single prompt-injection vulnerability, a single credential exfiltration through an agent's tool call, or a single sandbox escape becomes a cross-tenant incident. The 2025 Replit and Cursor MCP incidents — where adversarial inputs reached privileged tool calls in shared runtimes — are early proofs of concept.

The known robust answer is hardware-backed per-tenant isolation: each tenant's agent runs in its own VM. AWS Lambda demonstrates that this is operationally feasible at scale using Firecracker MicroVMs. But Lambda is a stateless function platform; an agent is stateful, and pricing the stateful per-tenant case at SMB unit economics ($0.001–$0.01 per agent action) is unsolved.

Why this is hard

A stateful agent runtime is not a function invocation. It carries:

  • multi-megabyte conversation history and tool-call traces
  • dynamically loaded skill bundles (each one a chunk of code)
  • live credential leases for the customer's connected services (Stripe, Twilio, GHL)
  • in-flight retry state for partially completed actions

Cold-starting that into a fresh Firecracker microVM today takes 2–4 seconds and several hundred milliseconds of CPU work. For a real-time voice receptionist, the latency budget is ≤ 700ms end-to-end. For an SMS reply agent invoked from a webhook, the cost-per-invocation must stay below half a cent for the business model to hold. Closing both gaps simultaneously — sub-second stateful cold-start, sub-cent per-task — is the research target.
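To make the budget concrete, here is a back-of-envelope per-invocation cost model. All rates are illustrative assumptions (rounded serverless-style compute pricing and a $1-per-million-token inference rate), not quoted vendor prices; the point is that the language-model call dominates the half-cent budget, leaving only a thin margin for isolation overhead.

```python
# Back-of-envelope per-invocation cost model. All rates below are
# illustrative assumptions, not quoted vendor prices.
VCPU_SECOND_COST = 3.4e-5   # assumed $/vCPU-second
GB_SECOND_COST = 3.7e-6     # assumed $/GB-second of memory
TOKEN_COST = 1.0e-6         # assumed $/token of LLM inference

def invocation_cost(cold_start_s, exec_s, llm_tokens, vcpus=1, mem_gb=1.0):
    """Cost of one agent action: microVM compute (cold start plus
    execution) plus the per-task language-model call."""
    compute = (cold_start_s + exec_s) * (
        vcpus * VCPU_SECOND_COST + mem_gb * GB_SECOND_COST)
    inference = llm_tokens * TOKEN_COST
    return compute + inference

# A 2.5 s cold start ahead of a 3 s action with a ~3k-token LLM call:
# inference dominates, so the compute side must stay almost free.
cost_slow = invocation_cost(cold_start_s=2.5, exec_s=3.0, llm_tokens=3000)
cost_fast = invocation_cost(cold_start_s=0.8, exec_s=3.0, llm_tokens=3000)
```

Under these assumed rates the inference call alone consumes most of the $0.005 ceiling, which is why the snapshot-restore and warm-pool work below targets the residual compute and latency overhead rather than the model call itself.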

Our hypothesis

Three combined techniques:

  1. Snapshot-restore with delta-only state hydration. The per-tenant microVM image is restored from an immutable base snapshot and a small per-action delta is rehydrated from a tenant-scoped object store.
  2. Capability-scoped credential brokering. The agent runtime never holds long-lived customer credentials. It requests a short-lived (≤5 minute) action-scoped delegation from a separate credential broker, eliminating credential persistence inside the agent context.
  3. Predictive warm pools. Vertical-specific call patterns (an HVAC business gets a phone call burst at 7–9am Mountain Time) drive a scheduler that pre-warms microVMs against forecasted demand windows.
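The broker in technique 2 can be sketched as follows. This is a minimal illustrative model, not a SwarmEngines interface: the class and method names are invented, the vault here is a plain dict standing in for a real secrets manager, and a production broker would also verify the caller's identity before issuing a lease.

```python
import secrets
import time

LEASE_TTL_S = 300  # <= 5 minutes, per the hypothesis above

class CredentialBroker:
    """Holds long-lived tenant credentials; the agent runtime never
    sees them, only short-lived action-scoped lease tokens."""

    def __init__(self):
        self._vault = {}    # (tenant, service) -> long-lived secret
        self._leases = {}   # token -> (tenant, service, action, expiry)

    def store(self, tenant, service, secret):
        self._vault[(tenant, service)] = secret

    def lease(self, tenant, service, action):
        """Issue a short-lived delegation scoped to one action."""
        if (tenant, service) not in self._vault:
            raise KeyError("no credential on file")
        token = secrets.token_urlsafe(16)
        self._leases[token] = (
            tenant, service, action, time.time() + LEASE_TTL_S)
        return token

    def redeem(self, token, action):
        """Exchange a lease for the secret, only for the scoped action."""
        tenant, service, scoped_action, expiry = self._leases[token]
        if time.time() > expiry:
            raise PermissionError("lease expired")
        if action != scoped_action:
            raise PermissionError("action out of scope")
        return self._vault[(tenant, service)]
```

The agent context only ever holds the opaque `token`, so a compromised runtime has nothing long-lived to exfiltrate: the worst case is one scoped action inside a five-minute window.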

Phase I deliverable

A measured cold-start latency distribution across 1,000 stateful microVM restorations, with target p95 < 800 ms and per-invocation cost < $0.005, on a representative SMB workload trace.
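The p95 target can be scored with a nearest-rank percentile over the measured restore latencies. The sketch below uses synthetic lognormal samples as a stand-in for the 1,000 measured restorations; only the percentile logic is meant to carry over.

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least
    p percent of the distribution at or below it."""
    ordered = sorted(samples)
    rank = int(round(p / 100 * len(ordered)))
    return ordered[max(0, min(len(ordered) - 1, rank - 1))]

# Synthetic stand-in for 1,000 measured restore latencies in ms;
# Phase I replaces this with instrumented microVM restorations.
random.seed(0)
latencies = [random.lognormvariate(6.2, 0.25) for _ in range(1000)]
p95_ms = percentile(latencies, 95)
meets_target = p95_ms < 800  # the O1 target from the deliverable
```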

Cited prior art

  • Firecracker: Lightweight Virtualization for Serverless Applications (Agache et al., NSDI 2020) — the microVM substrate.
  • Faasm: Lightweight Isolation for Efficient Stateful Serverless Computing (Shillaker & Pietzuch, USENIX ATC 2020) — stateful serverless isolation.
  • Catalyzer: Sub-millisecond Startup for Serverless Computing with Initialization-less Booting (Du et al., ASPLOS 2020) — the snapshot-restore lineage.
  • AWS re:Invent 2025 sessions on Bedrock AgentCore Runtime — the production substrate we build on.

What remains research, not engineering: none of the above papers measured agent-specific stateful workloads where the state includes live credential leases and skill bundles. That is the gap we are filling.

Open problem 2

Emergent-failure containment in coordinated agent swarms.

State of the art

A single agent that hallucinates, loops, or makes a wrong tool call is a relatively well-studied failure mode. Static safety filters, LLM judges, and human-in-the-loop checkpoints all address the single-agent case.

When N agents coordinate on a shared task — one books a service appointment in Calendly, one sends the customer an SMS confirmation in Twilio, one updates the customer record in GoHighLevel — a different class of failure appears. We call this category emergent coordination failure:

  • Action conflict: two agents independently take incompatible actions on the same external resource (one cancels the appointment the other just confirmed).
  • Recursive prompting: agent A prompts agent B prompts agent A in a runaway loop with no economic stop condition.
  • Livelock: each agent waits on a state that the other has not yet produced; the swarm makes no progress but consumes inference cost.
  • Compensating-action drift: a recovery agent reverses an action whose original justification has since become valid; net effect is invisible damage to the customer's data.

These failures do not appear in single-agent test suites. They are not detectable from the model's output alone — they only appear when you observe the joint trajectory of multiple agents acting on shared state.

Why this is hard

The published work on multi-agent LLM systems (AutoGen, MetaGPT, CrewAI, the agentic benchmarks like AgentBench and SWE-bench multi-agent variants) measures task success on the happy path. There is, to our knowledge, no published runtime safety supervisor for multi-agent action conflict on production external resources, with a target detection latency that would let it intercept the second agent's action before it commits. Detection latencies in the seconds range are research-grade results; we need sub-200ms because that is the gap between "the conflicting Twilio SMS is queued" and "the conflicting Twilio SMS has been delivered to the customer's phone."

Our hypothesis

A runtime coordination supervisor that observes every agent's tool call before it executes, maintains a fast in-memory model of which external resources are being mutated in this swarm-cycle, and flags or blocks an action whose target intersects an in-flight action by a peer agent. The supervisor's decision must run in < 50 ms to fit inside the tool-call latency budget. We hypothesize this is achievable with a small classifier (not an LLM judge) trained on a corpus of synthetic and recorded multi-agent traces with labeled conflict outcomes.
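The core intersection check can be sketched deterministically. The hypothesis above calls for a trained classifier; the dict-based supervisor below is a minimal stand-in that captures only the mechanics of tracking in-flight mutations by external resource ID, with invented names throughout.

```python
# Illustrative sketch of the coordination supervisor's resource-
# intersection check. A trained classifier (per the hypothesis above)
# would replace the exact-match lookup; resource IDs are invented.

class CoordinationSupervisor:
    def __init__(self):
        self._in_flight = {}  # resource_id -> agent_id mutating it

    def begin(self, agent_id, resource_id):
        """Called before a tool call executes. Returns True to proceed,
        False to block: the target intersects a peer's in-flight action."""
        holder = self._in_flight.get(resource_id)
        if holder is not None and holder != agent_id:
            return False  # conflict: a peer is mutating this resource
        self._in_flight[resource_id] = agent_id
        return True

    def commit(self, agent_id, resource_id):
        """Called after the tool call completes; releases the resource."""
        if self._in_flight.get(resource_id) == agent_id:
            del self._in_flight[resource_id]

# Two agents touch the same appointment in one swarm-cycle:
sup = CoordinationSupervisor()
sup.begin("booking-agent", "calendly:appt-123")
blocked = not sup.begin("recovery-agent", "calendly:appt-123")
```

An in-memory lookup like this runs in microseconds, well inside the < 50 ms decision budget; the open research question is recognizing *semantic* conflicts where the resource IDs do not match exactly.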

Phase I deliverable

An open evaluation harness: 200 multi-agent scenarios drawn from real SMB workflows (booking + reminder + reply), each annotated with the gold-standard "should be blocked" or "should proceed" label. Target: ≥ 90% precision on block decisions (false positives are a customer-experience cost), with end-to-end supervisor latency p99 < 100 ms.
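Scoring the harness reduces to precision over the supervisor's block decisions. The sketch below assumes a simple (gold, predicted) tuple format for scenarios; the real harness will carry richer trace metadata.

```python
# Scoring sketch for the evaluation harness. The (gold, predicted)
# scenario tuple format is an assumption for illustration.

def block_precision(scenarios):
    """scenarios: list of (gold, predicted), each 'block' or 'proceed'.
    Precision = correct blocks / all predicted blocks."""
    predicted_blocks = [s for s in scenarios if s[1] == "block"]
    if not predicted_blocks:
        return 0.0
    correct = sum(1 for gold, pred in predicted_blocks if gold == "block")
    return correct / len(predicted_blocks)

runs = [("block", "block"), ("block", "block"),
        ("proceed", "block"),   # false positive: a customer-experience cost
        ("proceed", "proceed")]
```

Precision is the right headline metric here precisely because a false block is visible to the customer (a legitimate SMS never sent), while a missed conflict is caught by the recall side of the evaluation.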

Cited prior art

  • AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (Wu et al., 2023).
  • Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022) — single-agent guardrail lineage.
  • The Off-Switch Game (Hadfield-Menell et al., 2017) — corrigibility theory we extend to the multi-agent case.
  • Cooperative Inverse Reinforcement Learning (Hadfield-Menell et al., 2016) — formal frame for inter-agent coordination.

What remains research: all of the above either treat single-agent safety or treat coordination as an idealized game-theoretic problem. The production runtime supervisor for production tool calls on third-party APIs is an empirical research artifact that does not yet exist.

Open problem 3

Audit-by-construction for non-technical operators.

State of the art

In an enterprise deployment, the AI safety story relies on engineers reviewing logs, on SOC 2 controls maintained by a security team, and on a CISO who can read CloudTrail. None of those exist at a 5-person HVAC business. The customer is the owner-operator. They cannot read JSON logs and they cannot interpret an audit trail expressed in IAM events.

The trustworthy-AI subproblem here is: how do you give a non-technical operator the ability to verify, in plain English, what the agent did on their behalf, and the ability to reverse it if it was wrong, without giving up the autonomy that makes the product valuable in the first place?

Why this is hard

Three constraints fight each other. First, the audit must be tamper-evident: a malicious agent or a compromised process must not be able to retroactively edit its own history. Second, the audit must be replayable: an operator must be able to ask "what would have happened if I had said no to that action three hours ago" and get a deterministic answer. Third, the surface must be readable by someone who has never seen a log line, which means the underlying record cannot be a verbose machine log — it has to be a structured, semantically rich event that renders into prose.

These constraints separately are well-studied. Tamper-evidence is solved by hash-chained logs (Merkle trees, certificate transparency). Replayability is solved by deterministic event-sourcing. Plain-English rendering is solved by templated event-to-prose generators. The combination — particularly the constraint that the LLM-generated prose cannot be the source of truth and must be derivable from the structured record — is unsolved at the SMB-platform layer.

Our hypothesis

Cryptographically chained per-action event records, where each agent action emits a structured event with: (a) the agent's identity, (b) the input state hash, (c) the tool call's signed parameters, (d) the external resource ID, (e) the prior event's hash. The chain is anchored periodically to an immutable store. Operator-facing prose is rendered deterministically from the structured event by a small templated generator — the LLM is not in the audit path, eliminating hallucination as a class of audit failure.
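The chain, the tamper check, and the templated renderer can be sketched together. Field names are illustrative, the hash is SHA-256 over canonical JSON, and two pieces of the hypothesis are omitted for brevity: parameter signing in (c) and periodic anchoring to the immutable store.

```python
import hashlib
import json

def event_hash(event):
    """Deterministic hash over the canonical JSON encoding of an event."""
    return hashlib.sha256(
        json.dumps(event, sort_keys=True).encode()).hexdigest()

def append_event(chain, agent_id, input_state_hash, tool_params, resource_id):
    """Emit one structured per-action event, chained to its predecessor.
    Signing of tool_params (per (c) above) is omitted in this sketch."""
    prev = event_hash(chain[-1]) if chain else "genesis"
    event = {
        "agent": agent_id,               # (a) agent identity
        "input_state": input_state_hash, # (b) input state hash
        "params": tool_params,           # (c) tool-call parameters
        "resource": resource_id,         # (d) external resource ID
        "prev": prev,                    # (e) prior event's hash
    }
    chain.append(event)
    return event

def verify(chain):
    """Tamper check: every event must commit to its predecessor."""
    for i, event in enumerate(chain):
        expect = event_hash(chain[i - 1]) if i else "genesis"
        if event["prev"] != expect:
            return False
    return True

def render(event):
    """Deterministic templated prose; no LLM in the audit path."""
    return (f"Agent {event['agent']} acted on {event['resource']} "
            f"with parameters {event['params']}.")
```

Because `render` is a pure function of the structured event, the prose the operator reads can always be re-derived and checked against the chain; editing any past event breaks every hash downstream of it.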

Phase I deliverable

A reference implementation of the audit chain integrated with our existing AgentCore-based runtime, plus a usability study with 12 SMB owner-operators measuring whether they can correctly identify a planted erroneous agent action from the operator-facing audit view. Target: ≥ 80% identification rate at the first reading.

Cited prior art

  • Certificate Transparency (Laurie et al., RFC 6962) — public hash-chained log lineage.
  • Practical Byzantine Fault Tolerance (Castro & Liskov, 1999) — tamper-evidence heritage.
  • Concerning Trustworthy AI: A Computational Perspective (Liu et al., 2022) — survey we extend.

What remains research: usability of cryptographic audit at the small-business operator layer is, to our knowledge, an unstudied empirical question.

Phase I objectives

Five falsifiable claims. Each can fail, and the failure of any one is itself a publishable research result.

  • O1. Objective: stateful per-tenant microVM cold-start. Target: p95 < 800 ms. Risk: snapshot-delta hydration is unproven for the agent state class.
  • O2. Objective: per-action unit cost. Target: < $0.005 per action at 1,000 actions/min. Risk: the predictive warm pool may underfit vertical traffic patterns.
  • O3. Objective: capability-scoped credential brokering. Target: zero long-lived credentials in the agent context across the test suite. Risk: action scoping for arbitrary third-party APIs is heterogeneous.
  • O4. Objective: multi-agent action-conflict detection. Target: ≥ 90% precision, p99 < 100 ms. Risk: no labeled corpus exists; we will create one.
  • O5. Objective: operator-readable audit chain. Target: ≥ 80% planted-error identification rate by SMB owners. Risk: usability of cryptographic audit for this audience is untested.

Why this work matters beyond SwarmEngines

The 33-million-firm U.S. small-business sector is the part of the economy with the least capacity to safely adopt autonomous AI on its own.

It has no internal security teams, no compliance staff, and no software engineers to read the audit logs. If autonomous agents are deployed to that audience using the architecture mainstream platforms ship today — shared runtime, shared credentials, no coordination supervision, opaque automation — the resulting failure surface will be measured in compromised customer-payment systems, leaked health data at dental practices, and contract violations at small legal firms.

The research above is the safety floor required for an SMB AI agent product to exist responsibly. We are doing this research because we have to do it for our own product to be defensible. Publishing it (Phase I deliverables will go to a peer-reviewed venue and will be open-sourced where commercially compatible) is the way the rest of the industry can build on it without each platform reinventing the safety floor independently.

Contact

For preliminary measurement traces, collaboration, or questions on the work above.

research@swarmengines.com
