Why Financial Services AI Inference Belongs On-Premises

Why the most important question in financial services AI infrastructure isn't which model you run. It's where.

Jeff Springborn

Fifteen years ago, the smartest firms in financial services moved their execution infrastructure out of their own data centers and into buildings physically adjacent to exchange matching engines. In markets, the advantage goes to whoever gets there first, and physics sets the ceiling on how fast you can get there.

That decision compounded into one of the most durable infrastructure advantages in modern finance.

AI inference is now entering those same decision systems. Most firms are making the same category of mistake their predecessors made before co-location became the norm, treating infrastructure placement as an IT detail when it is a competitive one.

The Part Your Latency Dashboard Isn't Showing You

When teams measure AI inference performance, they look at model execution time: how long the model takes to produce an output once it receives an input. On its own, that number usually looks fine.

What it misses is the full round trip: from the business event that triggers inference, across the network to wherever the model is running, back to the system that needs the answer, and into the decision it was supposed to inform.

For a fraud model running on a major cloud provider's inference service, that round trip from Chicago to the nearest cloud region and back typically runs 20 to 40 milliseconds. The model execution itself might take 10. The rest is physics. The signal is traveling from where the decision happens to where the compute lives and back again.

In payment fraud scoring, the total decision window runs roughly 50 to 100 milliseconds end-to-end. A 20-to-40-millisecond round trip can consume half of that budget, or more at the tight end of the window, on network transit alone, regardless of how fast the model itself runs.
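The budget arithmetic above can be sketched in a few lines. This is an illustrative calculation, not a measurement: the `LatencyBudget` class and its field names are hypothetical, and it assumes the 20-to-40-millisecond figure is network transit counted separately from the roughly 10 milliseconds of model execution. The pairings of window and round trip are chosen only to show the two ends of the quoted ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    """A decision window and the pieces that consume it (all in milliseconds)."""
    window_ms: float        # total time allowed for the decision
    network_rtt_ms: float   # assumed round trip between decision system and model
    model_exec_ms: float    # assumed time the model spends producing an output

    def transit_share(self) -> float:
        """Fraction of the window consumed by network transit alone."""
        return self.network_rtt_ms / self.window_ms

    def remaining_ms(self) -> float:
        """Time left for everything else after transit and model execution."""
        return self.window_ms - self.network_rtt_ms - self.model_exec_ms


# Figures drawn from the ranges quoted in the article; the pairings are illustrative.
best_case = LatencyBudget(window_ms=100, network_rtt_ms=20, model_exec_ms=10)
worst_case = LatencyBudget(window_ms=50, network_rtt_ms=40, model_exec_ms=10)

for name, budget in (("best case", best_case), ("worst case", worst_case)):
    print(f"{name}: transit uses {budget.transit_share():.0%} of the window, "
          f"{budget.remaining_ms():.0f} ms left for everything else")
```

Under these assumptions the transit share runs from a fifth of the window to most of it, which is why the placement of the model matters more than shaving a few milliseconds off execution.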

In algorithmic and quantitative trading, AI signal generation is increasingly embedded directly in order routing and risk decisions. CME's matching engines sit in Aurora, Illinois. Servers co-located in the same facility connect to those engines in under a millisecond. The nearest cloud region connects in 20 to 40 milliseconds. For strategies where the signal edge exists in a narrow time window, that gap determines whether the firm captures the opportunity or observes it after it has closed.

No amount of model optimization fixes a latency disadvantage that originates in the gap between where the model runs and where the decision happens.

AI Inference Has Entered the Systems Where This Matters Most

Consider where real-time AI is actually being deployed across financial services right now.

Trading and signal generation.
Firms are using AI to read earnings calls, parse order book dynamics, and calibrate risk positions in real time during market hours. These models are embedded in the decision loop, with their output sitting upstream of order routing. Latency in those models flows directly into latency in those decisions.

Fraud and transaction monitoring.
The most advanced fraud detection systems run AI against transaction patterns, behavioral signals, and network relationships simultaneously, in the moment a transaction is presented for authorization. A model that cannot return an answer before the authorization window closes has no meaningful role in the decision.

Real-time client decisioning.
A relationship manager on a call with a client, a digital underwriting engine evaluating a customer mid-funnel, an insurance pricing model during a live quote. All of these require inference to complete before a human moment ends or a customer drops off. The latency budget is measured in seconds, sometimes less.

Inference has moved inside the transaction. It has become part of the product. The location of the infrastructure running it connects directly to whether it works.
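One common way to handle a model that may miss its window, such as the fraud authorization case above, is a hard deadline with a fallback decision. The following is a minimal sketch of that pattern using Python's standard library. `score_with_model`, `fallback_score`, the `simulated_latency_s` field, and the score values are all hypothetical stand-ins, not any particular vendor's API.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)


def score_with_model(txn: dict) -> float:
    """Hypothetical stand-in for a remote inference call; sleeps to simulate latency."""
    time.sleep(txn.get("simulated_latency_s", 0.0))
    return 0.9 if txn["amount"] > 1000 else 0.1


def fallback_score(txn: dict) -> float:
    """Hypothetical rule-based score used when the model misses the deadline."""
    return 0.5 if txn["amount"] > 1000 else 0.0


def score_within_deadline(txn: dict, deadline_s: float) -> tuple[float, str]:
    """Return (score, source); source is 'model' or 'fallback'."""
    future = _executor.submit(score_with_model, txn)
    try:
        return future.result(timeout=deadline_s), "model"
    except FutureTimeout:
        # Best effort only: a call that is already running is not interrupted.
        future.cancel()
        return fallback_score(txn), "fallback"


fast = {"amount": 5000, "simulated_latency_s": 0.0}
slow = {"amount": 5000, "simulated_latency_s": 0.3}
print(score_within_deadline(fast, deadline_s=0.5))
print(score_within_deadline(slow, deadline_s=0.05))
```

The second call shows the cost of a late answer: the decision still gets made, but by the simpler rule, because the model's output arrived after it was needed.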

The Firms That Thought About This First Have Already Moved

JPMorgan's infrastructure leadership has stated publicly that the firm works directly with hardware manufacturers and colocation operators, not hyperscaler managed services, for AI workloads where regulatory control and performance both matter. Its internal AI platform, used daily by a significant portion of its global workforce, runs on infrastructure the firm controls end to end, with employee use of external generative AI tools explicitly restricted.

Jane Street built a purpose-built data center in Texas for AI research and inference, housing thousands of liquid-cooled GPUs allocated internally by competitive bidding between research teams. Hudson River Trading partnered with Lambda for dedicated NVIDIA infrastructure rather than building its own facility, with the firm's leadership explicit that physical proximity of AI to execution is part of the strategy.

These firms applied the same infrastructure discipline to AI that they already applied to execution: put the compute where the decision happens.

Why "We'll Use the Cloud for Now" Is a Longer Commitment Than It Sounds

Enterprise infrastructure commitments around network topology, security certifications, operational processes, and vendor integrations do not change quickly. A decision to run production AI inference on a hyperscaler endpoint creates downstream dependencies that accumulate over time: data pipelines, monitoring systems, compliance frameworks built around that vendor's audit tooling, and procurement relationships. Switching costs grow as the architecture deepens.

Enterprise cloud re-platforming costs range from tens of thousands of dollars to well over half a million depending on workload complexity, before accounting for performance degradation and operational disruption during transition.

Beyond switching costs, there is a capacity problem developing in parallel. AI-ready power and dense colocation space are increasingly constrained in major metros. Chicago is no exception. Firms that wait to secure infrastructure until the need is obvious will find lead times that do not match their deployment timelines. The firms that are moving now are locking in capacity, location, and hardware flexibility before the window tightens further.

For analytics, research, and non-latency-critical inference, hyperscaler infrastructure makes sense. The specific question is about workloads where latency and control matter, and whether the infrastructure decision being made today reflects that distinction.

What the Right Infrastructure for This Actually Looks Like

Facilities that can support latency-sensitive AI inference in financial services require a specific combination of characteristics. Most data centers, including many enterprise-grade ones, do not have all of them simultaneously.

Location that matches the decision.

The latency advantage is geographic and cannot be engineered away from a distant facility. Colovore's Aurora campus sits less than 0.1 milliseconds from CME's matching engines and under 5 milliseconds from 350 East Cermak, Chicago's primary carrier hotel. For financial services firms running AI inside trading and risk systems, that proximity is not a secondary feature. It is the reason the facility exists where it does.
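To see why distance sets a hard floor on latency, consider propagation delay in optical fiber. The sketch below is a rough calculation only. It assumes a refractive index of about 1.47 for silica fiber and ignores switching, serialization, and indirect cable routes, so real paths are slower. The function names are illustrative, and the article does not say whether its figures are one-way or round trip.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # in vacuum, exact by definition
FIBER_REFRACTIVE_INDEX = 1.47          # assumption: typical silica fiber

SPEED_IN_FIBER_KM_PER_S = SPEED_OF_LIGHT_KM_PER_S / FIBER_REFRACTIVE_INDEX


def one_way_delay_ms(fiber_km: float) -> float:
    """Propagation delay over a fiber run, ignoring equipment delays."""
    return fiber_km / SPEED_IN_FIBER_KM_PER_S * 1000.0


def max_fiber_km(delay_ms: float) -> float:
    """Longest fiber run whose propagation delay stays within delay_ms."""
    return delay_ms / 1000.0 * SPEED_IN_FIBER_KM_PER_S


print(f"0.1 ms of propagation covers about {max_fiber_km(0.1):.1f} km of fiber")
print(f"1,000 km of fiber costs about {one_way_delay_ms(1000):.2f} ms one way")
```

Light in fiber covers roughly 200 kilometers per millisecond, so a sub-0.1-millisecond figure is only possible within a couple of tens of kilometers of fiber, however the rest of the stack is tuned.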

Density that supports next-generation hardware, today.

The current generation of inference-class GPU clusters requires liquid cooling as a design requirement, not an option. Colovore's Chicago facilities support 5 to 600-plus kilowatts per rack, HVDC-ready, with liquid cooling built in from the ground up. A deployment can start at the scale it needs and grow to multi-megawatt without changing facilities, re-engineering cooling loops, or negotiating new terms. ORD01 in Aurora goes live in December 2026, with additional halls across ORD02, ORD03, and ORD04 coming online through 2028, totaling 54 megawatts of critical capacity across the campus.

Hardware flexibility that survives the silicon roadmap.

The chip that optimizes a fraud model today may not be the right choice for next-generation signal generation. Colovore's infrastructure is hardware-agnostic, supporting NVIDIA, AMD, and other architectures in the same facility. Firms are not locked to today's inference chip when better options arrive, and they do not need to rebuild or re-platform to adopt them.

A compliance posture built for regulated workloads.

Colovore's Chicago campus is Tier 3 certified with a 99.999% SLA and a 100% uptime track record since day one. It is also ISO/IEC 27001 and SOC 2 Type II certified, with support for customer PCI DSS and FedRAMP requirements built into facility design, not retrofitted. For financial services firms where the execution environment has to survive regulatory examination, that compliance posture is part of the infrastructure decision, not a separate one.
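For a sense of scale, an availability percentage converts to a downtime allowance with simple arithmetic. The sketch below assumes a 365-day measurement period. The article does not state how the SLA is measured, so the period and the function name are assumptions for illustration.

```python
def allowed_downtime_minutes(availability_pct: float, days: float = 365.0) -> float:
    """Downtime permitted over the period at the given availability percentage."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * days * 24 * 60


for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% availability allows {allowed_downtime_minutes(pct):.2f} minutes per year")
```

At 99.999%, the allowance works out to a little over five minutes across a full year.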

The Question Worth Asking Before the Next Infrastructure Decision

AI inference is inside your most latency-sensitive systems, or it will be soon. Does the facility where it runs reflect the same discipline you apply to every other system where performance and control matter?

The firms that understood co-location as a competitive decision built advantages that lasted. The same logic applies now, for AI inference, for the firms willing to act on it before the moment passes. Capacity in the right locations is not unlimited, and the timelines to secure it are longer than most infrastructure teams assume.


This post is part of Colovore's ongoing series on the coming AI inference divide, the structural shift separating where AI is trained from where it runs in production at enterprise scale.

For the full analysis, including industry-specific use cases and the specialized silicon landscape, download the complete strategy paper.
