AI Inference for Financial Services Belongs On-Premises

The most important question isn't which model you run. It's where.

Jeff Springborn

June 3, 2026

Fifteen years ago, the smartest firms in financial services moved their execution infrastructure out of their own data centers and into buildings physically adjacent to exchange matching engines. In markets, the advantage goes to whoever gets there first, and physics sets the ceiling on how fast you can get there.

That decision compounded into one of the most durable infrastructure advantages in modern finance.

AI inference is now entering those same decision systems, and most firms are repeating the mistake their predecessors made before colocation became the norm: treating infrastructure placement as an IT detail when it is a competitive one.

The Distance Between the Model and the Decision

When teams measure AI inference performance, they typically look at model execution time: how long the model takes to produce an output once it receives an input. Taken alone, that number usually looks fine.

What it misses is the full round trip: from the business event that triggers inference, across the network to wherever the model is running, back to the system that needs the answer, and into the decision it was supposed to inform.

For a fraud model running on a centralized inference service, model execution is only one component of the total response time. Data retrieval, network transit, feature computation, and system orchestration all consume part of the available decision budget. The data must travel from where the transaction occurs to where the model runs and back again before a decision can be made.

In payment authorization and fraud detection systems, end-to-end decisions often operate within latency budgets measured in hundreds of milliseconds, and many fraud workflows are designed to complete the risk assessment itself in 10 to 50 milliseconds. Every millisecond spent moving data between systems reduces the time available for feature computation, model execution, risk evaluation, and response.
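To make the budget arithmetic concrete, here is a minimal Python sketch. Every number in it is a hypothetical placeholder chosen for illustration, not a measurement from any real system, and the class and field names are invented for this example. It subtracts the non-model stages from a fixed decision budget to show how much time is left for the model itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    """A decision budget and the time each non-model stage consumes (ms)."""
    budget_ms: float
    network_round_trip_ms: float
    data_retrieval_ms: float
    feature_computation_ms: float
    orchestration_ms: float

    def overhead_ms(self) -> float:
        # Everything that must happen around the model call.
        return (
            self.network_round_trip_ms
            + self.data_retrieval_ms
            + self.feature_computation_ms
            + self.orchestration_ms
        )

    def remaining_for_model_ms(self) -> float:
        # Time left for model execution; negative means the budget is blown.
        return self.budget_ms - self.overhead_ms()


# Hypothetical figures: the same pipeline, differing only in network distance.
remote = LatencyBudget(
    budget_ms=50.0,
    network_round_trip_ms=30.0,
    data_retrieval_ms=5.0,
    feature_computation_ms=8.0,
    orchestration_ms=4.0,
)
nearby = LatencyBudget(
    budget_ms=50.0,
    network_round_trip_ms=1.0,
    data_retrieval_ms=5.0,
    feature_computation_ms=8.0,
    orchestration_ms=4.0,
)

if __name__ == "__main__":
    for name, case in [("remote", remote), ("nearby", nearby)]:
        print(f"{name}: {case.remaining_for_model_ms():.1f} ms left for the model")
```

With these assumed figures, the distant deployment leaves the model 3 milliseconds of a 50 millisecond budget, while the nearby one leaves 32. The model is identical in both cases; only the network round trip changed.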

In algorithmic and quantitative trading, AI signal generation is increasingly embedded directly in order routing and risk decisions. CME's matching engines sit in Aurora, Illinois. Servers co-located in the same facility connect to those engines in under a millisecond. Infrastructure located outside the exchange ecosystem introduces materially higher and less deterministic latency. For strategies where the signal edge exists in a narrow time window, that gap determines whether the firm captures the opportunity or observes it after it has closed.
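The physics behind that gap can be estimated. Light in optical fiber travels at roughly two-thirds of its vacuum speed, so distance alone sets a floor on round-trip time. The sketch below computes that floor in Python. The refractive index of 1.47 is an assumed typical value for single-mode fiber, the distances are hypothetical, and the function name is made up for this example. Real paths add switching, serialization, and indirect fiber routes on top of this floor.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in vacuum


def fiber_round_trip_ms(distance_km: float, refractive_index: float = 1.47) -> float:
    """Lower bound on round-trip propagation delay over fiber, in milliseconds.

    Assumes a straight fiber run and ignores all equipment delay.
    """
    if distance_km < 0:
        raise ValueError("distance_km must be non-negative")
    speed_in_fiber_km_per_s = SPEED_OF_LIGHT_KM_PER_S / refractive_index
    one_way_s = distance_km / speed_in_fiber_km_per_s
    return 2 * one_way_s * 1000.0


if __name__ == "__main__":
    # Hypothetical distances, for illustration only.
    cases = [
        ("same campus (assumed 1 km)", 1.0),
        ("distant region (assumed 1,000 km)", 1000.0),
    ]
    for label, km in cases:
        print(f"{label}: at least {fiber_round_trip_ms(km):.3f} ms round trip")
```

Under these assumptions, a 1,000 km path costs close to 10 milliseconds in propagation alone before any processing happens, which can consume a large share of a 10 to 50 millisecond fraud budget. A path of about a kilometer costs on the order of 10 microseconds.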

No amount of model optimization fixes a latency disadvantage that originates in the gap between where the model runs and where the decision happens.

AI Inference Is Now Part of the Decision

Consider where real-time AI is actually being deployed across financial services right now.

Trading and signal generation.

Firms are using AI to read earnings calls, parse order book dynamics, and calibrate risk positions in real time during market hours. These models are embedded in the decision loop, with their output sitting upstream of order routing. Latency in those models flows directly into latency in those decisions.

Fraud and transaction monitoring.

The most advanced fraud detection systems run AI against transaction patterns, behavioral signals, and network relationships simultaneously, in the moment a transaction is presented for authorization. A model that cannot return an answer before the authorization window closes has no meaningful role in the decision.

Real-time client decisioning.

A relationship manager on a call with a client, a digital underwriting engine evaluating a customer mid-funnel, an insurance pricing model during a live quote. All of these require inference to complete before a human moment ends or a customer drops off. The latency budget is measured in seconds, sometimes less.

Inference has moved inside the transaction. It has become part of the product. The location of the infrastructure running it connects directly to whether it works.

Unlike model training, which occurs periodically, inference becomes an ongoing operational workload. Every fraud check, risk assessment, customer interaction, compliance review, and AI-assisted workflow consumes compute resources. As AI adoption scales across financial services, the economics of inference increasingly become as important as model accuracy and performance. Infrastructure decisions that appear technical today may ultimately shape operating costs, scalability, and competitive advantage for years to come.

AI Infrastructure Is a Strategic Priority

JPMorgan is planning AI infrastructure capacity years in advance and continues to invest heavily in infrastructure it controls directly. Its internal AI platform, now used broadly across the organization, reflects a reality emerging across financial services: while cloud services remain important, certain AI workloads increasingly justify dedicated infrastructure where performance, governance, economics, and long-term capacity planning can be managed directly.

Jane Street recently revealed a purpose-built Texas AI data center housing more than 4,000 liquid-cooled GPUs used for machine learning research and model development, with internal teams competing for compute resources through a market-based allocation system. Hudson River Trading has partnered with Lambda to access dedicated NVIDIA AI infrastructure. Together, these approaches illustrate how leading financial firms are treating AI infrastructure as a strategic capability rather than a commodity resource.

These firms applied the same infrastructure discipline to AI that they already applied to execution: put the compute where the decision happens.

Why "We'll Use the Cloud for Now" Is a Longer Commitment Than It Sounds

Enterprise infrastructure commitments around network topology, security certifications, operational processes, and vendor integrations do not change quickly. A decision to run production AI inference on a hyperscaler endpoint creates downstream dependencies that accumulate over time: data pipelines, monitoring systems, compliance frameworks built around that vendor's audit tooling, and procurement relationships. Switching costs grow as the architecture deepens.

Once production AI inference becomes embedded in transaction systems, fraud platforms, risk workflows, compliance processes, and customer-facing applications, changing infrastructure architectures becomes increasingly difficult. The challenge is not simply migration cost. It is the accumulated dependency across data pipelines, governance frameworks, audit processes, operational procedures, application integrations, and organizational workflows.

Beyond switching costs, there is a capacity problem developing in parallel. AI-ready power and dense colocation space are increasingly constrained in major metros. Chicago is no exception. Firms that wait to secure infrastructure until the need is obvious will find lead times that do not match their deployment timelines. The firms that are moving now are locking in capacity, location, and hardware flexibility before the window tightens further.

For analytics, research, and non-latency-critical inference, hyperscaler infrastructure makes sense. The specific question is about workloads where latency and control matter, and whether the infrastructure decision being made today reflects that distinction.

What AI Inference Demands from Infrastructure

Facilities that can support latency-sensitive AI inference in financial services require a specific combination of characteristics. Most data centers, including many enterprise-grade ones, do not have all of them simultaneously.

Location that matches the decision.

The latency advantage is geographic: a distant facility cannot engineer its way around the distance. Colovore's Aurora campus sits less than 0.1 milliseconds from CME's matching engines and under 5 milliseconds from 350 East Cermak, Chicago's primary carrier hotel. For financial services firms running AI inside trading and risk systems, that proximity is not a secondary feature. It is the reason the facility exists where it does.

Density that supports next-generation hardware, today.

The current generation of inference-class GPU clusters requires liquid cooling as a design requirement, not an option. Colovore's Chicago facilities support rack densities from 5 kW to more than 600 kW, are HVDC-ready, and have liquid cooling built in from the ground up. A deployment can start at the scale it needs and grow to multi-megawatt scale without changing facilities, re-engineering cooling loops, or negotiating new terms. ORD01 in Aurora goes live in December 2026, with additional halls across ORD02, ORD03, and ORD04 coming online throughout 2028, totaling 54 MW of critical capacity across the campus.

Hardware flexibility that survives the silicon roadmap.

The chip that optimizes a fraud model today may not be the right choice for next-generation signal generation. Colovore's infrastructure is hardware-agnostic, supporting NVIDIA, AMD, and other architectures in the same facility. Firms are not locked to today's inference chip when better options arrive, and they do not need to rebuild or re-platform to adopt them.

A compliance posture built for regulated workloads.

Colovore's Chicago campus is Tier 3 certified with a 99.999% SLA. Colovore facilities are compliant and auditable by design. ISO/IEC 27001 and SOC 2 Type II certifications are targeted shortly after opening. The data centers are built to support customer HIPAA, PCI DSS, HITRUST, and FedRAMP requirements. For financial services firms where the execution environment has to survive regulatory examination, that compliance posture is part of the infrastructure decision, not a separate one.

The Question Worth Asking Before the Next Infrastructure Decision

Financial institutions have always treated infrastructure placement as a strategic decision when performance, control, and risk management matter. As AI becomes embedded in fraud detection, risk analysis, compliance workflows, and customer interactions, AI inference is becoming one of those decisions.

AI inference is inside your most latency-sensitive systems, or it will be soon. Does the facility where it runs reflect the same discipline you apply to every other system where performance and control matter?

The firms that moved early on colocation built advantages that lasted. AI inference presents a similar opportunity today. Capacity in the right locations is not unlimited, and the timelines to secure it are longer than most infrastructure teams assume.

This post is part of Colovore's ongoing series on the coming AI inference divide, the structural shift separating where AI is trained from where it runs in production at enterprise scale.

For the full analysis, including industry-specific use cases and the specialized silicon landscape, download the complete strategy paper.
