QCon AI Boston: Six Engineering Areas That Define AI in Production

When I first joined QCon as a program committee member, I thought the job was selecting good talks. It is not. The real work is mapping where an industry is stuck and figuring out what moves it forward.

I have been part of the QCon program committee for almost ten years. Today I chair QCon AI Boston and QCon AI New York, and serve as program co-chair for QCon London. QCon's speakers are curated by the committee, and every session is a deliberate choice. That model has worked for almost twenty years because the underlying premise has not changed: bring practitioners who are running emerging technologies in production before the rest of the industry catches up.

Applying that lens to AI in 2026, together with program committee members Meryem Arik and Hien Luu, what surfaced was not a list of trending topics. It was a set of architectural areas that cut across marketing categories. Six define where AI engineering is right now. The QCon AI Boston program is shaped around them.

The Keynotes as Editorial Spine

Martin Spier (OpenAI) opens Monday with Keeping ChatGPT Fast in the Agentic Era. ChatGPT is one of the largest AI applications in production, and at that scale, performance engineering extends well beyond GPU inference, spanning client networking, orchestration, context assembly, model routing, inference, and streaming. Agentic development compounds the problem: as tools like Codex help teams ship faster, performance regressions accumulate faster too. The end goal is agent-operated performance engineering, making the performance feedback loop as fast as the development loop.

Lizzie Matusov (Quotient) opens Tuesday with The Five Stages of AI Maturity in Engineering Organizations — Where and Why Teams Get Stuck. The research-driven maturity model asks the question most organizations are avoiding: why are productivity gains from AI tools not translating into delivery improvements? The answer, that many teams see productivity gains without corresponding improvements in delivery outcomes because they invested in tools while missing the organizational capabilities required to make them effective, is uncomfortable, and that is exactly why it needs to be a keynote.

Meryem Arik (Doubleword) closes Tuesday evening with A Few Predicted Talks From QCon AI 2030. Having co-curated this conference while running a self-hosted inference platform, Arik has the vantage point to extrapolate where these areas lead.

Infrastructure reality opens the conference. Organizational maturity follows. Forward trajectory closes it. That sequence is the editorial thesis for QCon AI Boston 2026.

1. Context Engineering Is a Platform Discipline Now

Model capability is commoditizing. What differentiates production AI systems is what reaches the model: organizational knowledge, retrieval quality, token budget management, and memory.

Andrej Karpathy framed this in mid-2025: context engineering is "the delicate art and science of filling the context window with just the right information for the next step." The engineering problem is not how you phrase the prompt. It is what information reaches the model before the prompt runs. Gartner's research points the same direction: prompt engineering alone does not hold at enterprise scale, and the investment needs to shift to context-aware architectures.

The most concrete evidence at QCon AI Boston comes from Ajay Prakash at LinkedIn. Their CAPT system is an organizational context layer built on MCP: meta-tools for dynamic discovery across hundreds of internal tools, 500+ community-authored skills, non-engineers doing data analysis without SQL. This is not a prompt engineering project with a bigger budget. It is platform infrastructure.

Ricardo Ferreira (Redis) treats context as an architectural resource with memory systems and retrieval pipelines. Fabiane Nardon (TOTVS) tackles the data layer beneath agents: when to use data lakes versus transactional systems, how semantic layers reduce hallucinations, how MCP servers should handle agent traffic patterns.

Rachel Shalom (Dell) shares practical lessons on fine-tuning embedding models, drawn from building agentic search in production. Cassie Shum (RelationalAI) presents knowledge graphs as a foundation for agentic AI, using GraphRAG for multi-step reasoning with provenance.

The separation between AI systems that work in demos and AI systems that work in production is increasingly a context infrastructure problem. The teams investing in retrieval pipelines, MCP layers, and token economics are the ones crossing that gap.

2. Inference Economics: Cheaper Tokens, Higher Bills

The token price needed to match 2023-era model performance has dropped by 50x per year, with some capability tiers declining even faster. Aggregate spending rose 320%. This is the Jevons paradox applied to inference: efficiency gains do not reduce consumption, they increase it. Agentic workflows that chain multiple sequential model calls per user action create multiplicative demand that erases per-token savings entirely.

And the physical constraints are tightening. Jordan Nanos from SemiAnalysis presents the state of the market from fabrication to token: DRAM pricing limits, CoWoS advanced packaging bottlenecks, and gigawatt-scale clusters that require behind-the-meter power generation because the grid cannot keep up. This talk alone reframes the inference cost conversation from software optimization to industrial infrastructure.

At the serving layer, KV cache management has become the bottleneck that determines whether inference is fast or unusable. Khawaja Shams (Momento) covers disaggregated prefill and cache-aware routing with live demonstrations. Sundara Ramachandran (LinkedIn) shows a prefill-only LLM ranking architecture built with SGLang under strict latency budgets. Mallika Rao (Zocdoc) presents end-to-end adaptive recommender design, from candidate generation through ranking and evaluation, delivering sub-10ms inference in production.

Then there is the economic architecture. Erik Peterson (CloudZero) brings data on hidden reasoning tokens, MCP overhead, and an $82,000 stolen-key incident. The argument: AI cost control requires engineering-first architectural patterns like circuit breakers and model routing by margin. Dio Rettori (JP Morgan Chase) addresses when to own versus rent inference capacity. Aditya Mulik (Walmart) shows batch scheduling against off-peak hours for inventory accuracy, saving millions annually.
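The two patterns named above, circuit breakers and model routing by margin, can be sketched in a few lines. This is a minimal illustration rather than CloudZero's implementation: the class and function names, the budget figures, and the assumption that a more expensive model is the more capable one are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpendCircuitBreaker:
    """Blocks further model calls once spend in the current window hits a budget.

    Hypothetical sketch: names and thresholds are illustrative only.
    """
    budget_usd: float
    spent_usd: float = 0.0
    tripped: bool = False

    def record(self, cost_usd: float) -> None:
        # Accumulate spend and trip the breaker when the budget is reached.
        self.spent_usd += cost_usd
        if self.spent_usd >= self.budget_usd:
            self.tripped = True

    def allow(self) -> bool:
        return not self.tripped

    def reset(self) -> None:
        # Called at the start of a new budget window.
        self.spent_usd = 0.0
        self.tripped = False


def route_by_margin(
    revenue_usd: float, model_costs: dict, min_margin: float
) -> Optional[str]:
    """Pick the costliest model that still leaves the required margin.

    model_costs maps a model name to its estimated cost per request.
    Assumes (for illustration) that higher cost means higher capability.
    Returns None when no model satisfies the margin.
    """
    affordable = {
        name: cost
        for name, cost in model_costs.items()
        if revenue_usd - cost >= min_margin * revenue_usd
    }
    if not affordable:
        return None
    return max(affordable, key=affordable.get)
```

A stolen key or a runaway agent loop hits the breaker instead of the invoice, and the router keeps expensive models away from requests that cannot pay for them.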

Cost control is an architecture constraint, not a FinOps problem.

3. The Demo-to-Production Cliff

A demo that works 80% of the time is impressive. An agent that fails 20% of the time in production is a liability.

Gartner projects that over 40% of agentic AI projects will be cancelled by 2027. The gap is not about polish. Crash recovery, state persistence, concurrency, long-running execution, and error propagation are fundamental architecture concerns, and most agent frameworks treat them as afterthoughts.

Multiple teams at QCon AI Boston arrived at distinct architectural answers to this problem.

Deepak Chandramouli and Bhumik Thakkar at Apple built a framework-agnostic agent infrastructure on Ray, decoupling orchestration from execution to handle bursty workload patterns. Manju Rajashekhar (Mad Labs) started with a prompt-chain framework that demoed well, then watched it collapse under production workloads into "a pile of locks, ad-hoc state machines, and recovery code we couldn't trust." The rebuild used the actor model, with write-through state for crash recovery. Siddharth Kodwani and Swaroop Chitlur at DoorDash built a GenAI platform with shared infrastructure: an LLM Gateway, a Batch Inference platform, an Agentic Gateway, and ADK templates so teams stop reinventing orchestration.
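To make the actor-plus-write-through idea concrete, here is a minimal sketch. It is not the Mad Labs implementation: the AgentActor class, the JSON-file store, and the message format are assumptions chosen to keep the example self-contained. The point it illustrates is that an actor handles one message at a time and persists its state before moving to the next, so a restarted actor resumes from the last completed step.

```python
import json
import queue
from pathlib import Path


class AgentActor:
    """One mailbox, private state, state persisted after every message."""

    def __init__(self, actor_id: str, store_dir: Path):
        self.path = store_dir / f"{actor_id}.json"
        self.mailbox: queue.Queue = queue.Queue()
        # On start, or on restart after a crash, recover the last saved state.
        if self.path.exists():
            self.state = json.loads(self.path.read_text())
        else:
            self.state = {"steps_done": []}

    def send(self, message: str) -> None:
        self.mailbox.put(message)

    def _persist(self) -> None:
        # Write to a temp file and swap it in, so a crash mid-write
        # does not leave a half-written state file behind.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.state))
        tmp.replace(self.path)

    def run_until_empty(self) -> None:
        # Messages are handled strictly one at a time, so the state
        # needs no locks: only this loop ever touches it.
        while not self.mailbox.empty():
            message = self.mailbox.get()
            self.state["steps_done"].append(message)
            self._persist()  # write-through: durable before the next message
```

The sequential mailbox is what removes the "pile of locks", and the write-through step is what makes recovery code trustworthy: there is exactly one place where state becomes durable.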

Sudeep Das (DoorDash) presents the use case side of this shift, moving consumer experiences from static recommendation models to agentic AI where multi-step decision flows replace one-shot predictions. Zhou Yu (Arklex.ai) attacks the testing gap: multi-turn simulation for automated agent testing, because manual testing does not scale when 90% of enterprise agents are stuck in proof of concept.

Vinoth Govindarajan (OpenAI) presents the agent harness pattern: a control plane that owns sessions and state, concurrency invariants that prevent overlapping runs from corrupting behavior, and tool and approval boundaries that determine when chat becomes action. Bruna Pereira (DoorDash) tackles a different production concern with SafeChat, a layered AI moderation system that balances speed, cost, and accuracy across multiple ML models to protect users during real-time marketplace interactions.
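The three responsibilities attributed to the harness (session ownership, a concurrency invariant against overlapping runs, and an approval boundary around side-effecting tools) can be sketched as follows. This is an illustrative reading of the pattern, not the design presented in the talk; AgentHarness, RunInProgress, and ApprovalRequired are hypothetical names.

```python
import threading


class RunInProgress(Exception):
    """Raised when a session already has an active run."""


class ApprovalRequired(Exception):
    """Raised when a tool needs explicit approval before it may act."""


class AgentHarness:
    def __init__(self, tools: dict, needs_approval):
        self._tools = tools                      # tool name -> callable
        self._needs_approval = set(needs_approval)
        self._locks = {}                         # session id -> Lock
        self._guard = threading.Lock()           # protects the lock table
        self.sessions = {}                       # session id -> event list

    def _lock_for(self, session_id: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(session_id, threading.Lock())

    def run(self, session_id: str, tool_name: str, arg, approved: bool = False):
        lock = self._lock_for(session_id)
        # Invariant: at most one run per session. Refuse rather than queue.
        if not lock.acquire(blocking=False):
            raise RunInProgress(session_id)
        try:
            # Approval boundary: the point where chat would become action.
            if tool_name in self._needs_approval and not approved:
                raise ApprovalRequired(tool_name)
            result = self._tools[tool_name](arg)
            # The harness, not the agent, owns the session record.
            self.sessions.setdefault(session_id, []).append((tool_name, result))
            return result
        finally:
            lock.release()
```

Rejecting an overlapping run outright is one possible policy; a harness could equally queue it. What matters is that the rule lives in the control plane, where it is enforced for every agent.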

Two talks address the boundary between probabilistic and deterministic execution. Alex Porcelli (Aletyx) makes the case that some agent decisions must be deterministic, using formal decision models as agent skills for workflows where the output must be deterministic, explainable, and fully traceable. Francesca Lazzeri (Microsoft) presents a hybrid NL-to-SQL architecture that keeps aggregation outside the LLM, with intent screening and schema-aware validation ensuring data queries return numerically accurate results rather than hallucinated metrics.
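One way to picture "aggregation outside the LLM" is to have the model emit a structured intent and let deterministic code validate it and the database compute the number. The sketch below assumes that reading; it is not the architecture from the talk. The SCHEMA allowlist, the intent dictionary shape, and run_intent are hypothetical, and the intent-screening stage is omitted.

```python
import sqlite3

# Hypothetical allowlists: which tables, columns, and aggregates may be used.
SCHEMA = {"orders": {"amount", "region"}}
AGGREGATES = {"sum": "SUM", "avg": "AVG", "count": "COUNT"}


def run_intent(conn: sqlite3.Connection, intent: dict):
    """Validate a model-produced intent against the schema, then let the
    database (not the model) compute the aggregate."""
    table = intent["table"]
    column = intent["column"]
    aggregate = intent["aggregate"]
    filters = intent.get("filters", {})

    if table not in SCHEMA:
        raise ValueError(f"unknown table: {table}")
    for name in [column, *filters]:
        if name not in SCHEMA[table]:
            raise ValueError(f"unknown column: {name}")
    if aggregate not in AGGREGATES:
        raise ValueError(f"unsupported aggregate: {aggregate}")

    # Identifiers are interpolated only after the allowlist check above;
    # filter values are always passed as bound parameters.
    sql = f"SELECT {AGGREGATES[aggregate]}({column}) FROM {table}"
    if filters:
        sql += " WHERE " + " AND ".join(f"{name} = ?" for name in filters)
    return conn.execute(sql, list(filters.values())).fetchone()[0]
```

Because the model never writes the number itself, a wrong answer can only come from a wrong intent, which is something validation and review can catch.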

Ray, actors, platform gateways, and deterministic guardrails are different answers to the same problem. But they share one insight: agents are stateful, long-running processes. They need the infrastructure rigor we apply to databases and message brokers, not what we apply to web requests.

4. Evaluation Has No Standard

89% of organizations have observability for their AI agents. Only 52% run offline evaluations. Watching an agent is easy. Knowing whether it is ready to ship is a different problem entirely.

Benchmarks show a 37% gap between lab scores and real-world deployment performance. Worse, the 2026 International AI Safety Report documented frontier models distinguishing between evaluation and deployment contexts, performing better during testing than in production. The measurement tools themselves are becoming unreliable.

Several speakers at QCon AI Boston built evaluation systems independently, which suggests the pattern is real even if no standard exists yet.

Terran Melconian at Zillow presents an interesting framing: decision-driven evaluation. Evaluation should directly inform product and engineering decisions, not just produce what Melconian calls "metrics of convenience." The focus is on building confidence in when to ship, when to invest further, and when to pivot, not just improving scores. Susan Chang (Elastic) brings almost two years of production experience building reusable eval frameworks across multiple GenAI products. Pratik Rasam (Spotify) describes three-level evaluation for their AI advertising platform: tool trajectory scoring to verify agents called the right tools in the right order, deterministic response assertions that gate CI, and LLM-as-judge scoring that runs in nightly experiments for longitudinal quality tracking.
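The first two of those levels are deterministic and easy to sketch. The version below is an assumption about how such checks might look, not Spotify's code: it scores a trajectory as the share of expected tool calls that appear in the actual trace in the right relative order (a longest-common-subsequence measure), and the required response fields are hypothetical. The LLM-as-judge level is left out because it is not deterministic.

```python
def trajectory_score(expected: list, actual: list) -> float:
    """Share of expected tool calls found in `actual` in the same relative
    order, computed as longest common subsequence / len(expected)."""
    if not expected:
        return 1.0
    # Standard LCS dynamic program, keeping only the previous row.
    prev = [0] * (len(actual) + 1)
    for exp_tool in expected:
        curr = [0]
        for j, act_tool in enumerate(actual, start=1):
            if exp_tool == act_tool:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / len(expected)


def passes_ci_gate(
    response: dict, expected_tools: list, actual_tools: list,
    threshold: float = 1.0,
) -> bool:
    """Deterministic checks only, so the result is safe to block a merge on.
    The required field names are hypothetical."""
    has_required_fields = {"answer", "sources"}.issubset(response)
    on_track = trajectory_score(expected_tools, actual_tools) >= threshold
    return has_required_fields and on_track
```

Extra tool calls in the trace do not lower this score; only missing or reordered expected calls do. A stricter gate could also penalize unexpected calls.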

Hannes Hapke (Dataiku) brings a complementary perspective through mechanistic interpretability. The open-source framework Kiji Inspector uses sparse autoencoders to produce interpretable decision factors explaining why an agent chose a specific tool.

Deterministic checks for gating, semantic judgment for tracking. That layered pattern is converging across companies. But there is no standard framework, no shared vocabulary, no interoperability. AI evaluation is where software testing was in the early 2000s: everyone knows it matters, and nobody agrees on how to do it.

5. Agents Are an Insider Threat Pattern

Tool-calling agents are functionally insiders with API access, operating faster than any human can audit. Perimeter security does not help when the threat is already inside the trust boundary.

An autonomous AI agent from security startup CodeWall compromised McKinsey's internal AI platform and gained broad system access in under two hours. The OWASP Foundation published its Top 10 for Agentic Applications in December 2025, introducing "least agency" as the agentic equivalent of least privilege, recognizing that agents exercise judgment, not just access. Anthropic documented the first large-scale cyberattack predominantly executed by an AI agent.

Broadcom and Klaviyo each address agent security at QCon AI Boston.

Advait Patel (Broadcom) builds zero-trust agent systems that pass SOC 2 and ISO audits while still shipping: agents as first-class identities, with least privilege enforced at the tool layer, not the prompt layer. Adrianna Valle (Klaviyo) frames tool-calling agents as employees with database access and zero concept of consequences, then focuses on detecting compromised behavior by monitoring logic paths, data access patterns, and tool-use frequencies.
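Enforcing least privilege at the tool layer means the check runs in code the agent cannot talk its way around. A minimal sketch, assuming a gateway that sits between agents and tools (AgentIdentity, ToolGateway, and the scope strings are hypothetical names, not taken from either talk):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentIdentity:
    """An agent treated as a first-class identity with explicit scopes."""
    agent_id: str
    scopes: frozenset


class ToolGateway:
    def __init__(self):
        self._tools = {}      # tool name -> (required scope, callable)
        self.audit_log = []   # (agent id, tool name, allowed) tuples

    def register(self, name: str, required_scope: str, fn) -> None:
        self._tools[name] = (required_scope, fn)

    def call(self, identity: AgentIdentity, name: str, *args):
        required_scope, fn = self._tools[name]
        allowed = required_scope in identity.scopes
        # Every attempt is logged, allowed or not, so tool-use patterns
        # can be monitored for signs of compromise.
        self.audit_log.append((identity.agent_id, name, allowed))
        if not allowed:
            raise PermissionError(
                f"{identity.agent_id} lacks scope {required_scope!r}"
            )
        return fn(*args)
```

No instruction in the prompt can grant a scope the identity does not hold, and the audit log gives behavioral monitoring a record of what each agent tried to do.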

Security at the prompt layer is a design flaw. Enforcement belongs at the execution layer, and agents must be treated as first-class identities with scoped, continuously authorized access.

6. MCP Won the Protocol Battle. The Architecture Is Not Ready.

The Model Context Protocol hit 97 million installs by March 2026. OpenAI, Google, and Microsoft all adopted it within months of each other. The Linux Foundation's Agentic AI Foundation now governs both MCP and the Agent-to-Agent Protocol. The protocol question is settled.

The architecture question is wide open, and the QCon AI Boston program makes that visible.

Niko Matsakis (Amazon), co-lead of the Rust language design team, introduces agent mods, portable extensions built on the Agent Client Protocol that go beyond MCP to inject context, intercept messages, and transform tool output.

And then there is the scaling problem. LinkedIn's Ajay Prakash describes what happens when you have hundreds of internal MCP tools: injecting them all into every agent's context creates bloat that degrades performance. Their meta-tools layer for dynamic discovery is an architectural workaround for a protocol limitation that the spec does not yet address.
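The general shape of a meta-tools layer can be sketched without any MCP machinery: instead of exposing every tool to the agent, expose two tools, one that searches the catalog and one that invokes a tool by name. This is a guess at the pattern, not LinkedIn's design. ToolRegistry, search_tools, and call_tool are hypothetical names, and the keyword-overlap ranking stands in for whatever retrieval a real system would use.

```python
class ToolRegistry:
    """Holds the full tool catalog; the agent only ever sees two meta-tools."""

    def __init__(self):
        self._tools = {}  # tool name -> (description, callable)

    def register(self, name: str, description: str, fn) -> None:
        self._tools[name] = (description, fn)

    def search_tools(self, query: str, limit: int = 3) -> list:
        """Meta-tool 1: return the few tools whose descriptions best match.

        Ranking is plain word overlap, a placeholder for real retrieval.
        """
        terms = set(query.lower().split())
        scored = []
        for name, (description, _) in self._tools.items():
            overlap = len(terms & set(description.lower().split()))
            if overlap:
                scored.append((overlap, name, description))
        # Highest overlap first; ties broken by name for stable output.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [
            {"name": name, "description": description}
            for _, name, description in scored[:limit]
        ]

    def call_tool(self, name: str, **kwargs):
        """Meta-tool 2: invoke a tool that was found through search."""
        return self._tools[name][1](**kwargs)
```

The agent's context then carries two tool definitions plus a handful of search results, regardless of whether the catalog holds ten tools or hundreds.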

This feels like the early HTTP moment for AI infrastructure. The protocol won, but enterprise authentication, tool discovery at scale, and stateful session management remain unsolved. Organizations building on MCP today should expect the protocol's surface area to expand significantly within the next year. Architect for that.

The Organizational Layer

One thread cuts across talks that do not map neatly onto the six areas above. The bottleneck is not faster GPUs or better models. It is whether organizations can actually absorb what already exists.

Catherine Weeks (Red Hat) presents an agentic SDLC case study across a 500+ person organization: the team initially incentivized speed metrics, but engineers pushed back when those metrics began rewarding the wrong behavior. The fix was rethinking what the organization measured, not changing the tools. Andrew Swerdlow (Roblox) presents an autonomous coding system that achieves 60% PR acceptance rates through Exemplar Alignment, while addressing the Productivity Paradox: AI tools generate more code than ever, yet the time to ship safely to production remains stagnant.

Jatin Aneja (Zoox) frames AI adoption as a behavior change problem across 1,500 engineers, not a software rollout. Ben Maraney (Forter) asks what the minimum viable platform is that turns 200 people into agent creators in two weeks. Brian Turcotte (Kilo Code) brings 25 trillion tokens of developer AI usage data, identifying three trust-killing failure points in the adoption ladder from autocomplete to orchestration.

Matusov's keynote frames the meta-problem: organizations confuse tool adoption with capability building. Chasing end-to-end autonomy without investing in governance, validation, and data context is not ambitious. It is counterproductive.

These six areas are not independent. Context engineering feeds inference. Inference economics constrain what agents can do. Agents require evaluation to ship and security to operate. All of it needs platform standardization to scale beyond a single team. Treat them as separate initiatives (a context project here, a security review there) and you build fragmented capabilities that never compound.

The QCon AI Boston program reflects where practitioners who operate production AI systems are spending their effort right now.