THE SILICON VALLEY CO-LOS KNOW WHAT’S REALLY GOING ON WITH AI
originally published in The Next Platform
If you want to get a sense of what companies are really doing with AI infrastructure, and the issues of processing and network capacity, power, and cooling that they are facing, what you need to do is talk to some co-location datacenter providers. And so, we are going to do that starting with Colovore, which is based in Santa Clara, in the heart of Silicon Valley and perhaps the worst place to run a datacenter.
Or, if you do it right, maybe the best place because the customer demand is through the freaking roof. And that is precisely how Ben Coughlin, co-founder, chairman, and chief financial officer of Colovore, sees it.
Colovore came to our attention because it is the place where Cerebras Systems is hosting its “Andromeda” cluster of sixteen CS-2 wafer-scale computing systems, which delivers over 1 exaflops of half-precision FP16 floating point math for training AI models. The cluster is shown in the feature image above, in the datacenter out on Space Park Drive near San Jose Mineta International Airport – and oddly enough across the street from the UNIXSurplus Computer Store and a stone’s throw away from datacenters run by Digital Realty, Equinix, Evocative, and Tata Communications.
Founded in 2012, just when the GPU-accelerated AI boom was starting, Colovore has raised $8 million in funding to date and has just one datacenter so far. The company’s SJC01 datacenter weighs in at 24,000 square feet, which is compact thanks to the liquid cooling, and has been operational since 2014. The SJC01 facility has been expanded incrementally, with a 2 megawatt expansion inside the facility that was done in February 2022, to reach closer to its full 9 megawatt load. The racks started out at 20 kilowatts of power and cooling, and have expanded to 35 kilowatts. Its SJC02 datacenter, set to open in Q2 2024, is going to be occupying that UNIXSurplus building, which it is leasing from Ellis Partners. (There is a metaphor if we ever saw one. . . . ) It has about 29,000 square feet of space, and like SJC01, will only offer liquid cooled racks and possibly some direct liquid cooling if customers request it. (And we think they will.) The racks scale to 50 kilowatts in the new datacenter from the get-go.
Colovore was co-founded by Sean Holzknecht, who was vice president of operations at Evocative and founded another datacenter operator called Emerytech Data Center after a stint running multiple central offices in San Francisco for Pacific Bell. Coughlin is the money person and was a partner at Spectrum Equity Investors, a private equity firm with $5 billion in capital that was focused on telecommunications and digital media. Peter Harrison, Colovore’s third co-founder, managed Google’s global datacenter footprint, its fiber to premises project and YouTube’s content delivery network. Harrison was operations director at eBay and also helped Netflix launch its streaming video service.
Coughlin reached out to us because he sees everyone wanting to get going with AI, but they haven’t quite got their heads wrapped around the cooling issues with these matrix math monsters they need to drive recommendation engines and large language models. Colovore is in the thick of it, running a 9 megawatt facility in the heart of the action, which is completely liquid cooled and ready to take on the densest compute its customers need to bring to bear. We aren’t talking the 100 kilowatts per rack that a massive exascale-class supercomputer with direct-attached, liquid-cooled cold plates might need these days, but it is getting close. And if you need that, Coughlin has the team and the facility that can push that envelope right in the heart of Silicon Valley.
Ben Coughlin: We obviously have been following your coverage of this industry for a while. And we’re at an interesting intersection at Colovore because we support a lot of the newer AI infrastructure here in Silicon Valley – in part because we offer liquid cooling. There’s a lot of discussion about the growth in AI and how they are innovating on the underlying server platforms, but there is very limited discussion about the datacenter. The vast majority of datacenters are not built to support those AI systems. If the datacenter can’t support it, Houston, we got a bit of a problem here.
Everybody generally looks at the datacenter as some building, a piece of real estate. Not very exciting, not that much fun to talk about, they all look and feel the same. And for the most part, that’s right. Except now as this type of AI infrastructure proliferates, things are going to have to change.
Timothy Prickett Morgan: OK, let’s talk about that. You have a datacenter in Santa Clara, which means you are serving some of the most compute and data intensive customers who have figured out that they don’t want to be running their own datacenter. You have got them right where you want them, and they have got you right where they want you.
So why the hell would you pay California pricing for real estate, for water, for electricity? That seems crazy on the face of it, but there is always that limit to the speed of light that forces certain things to be reasonably local.
Ben Coughlin: We serve startups to the Fortune 500. It’s like a whole range of customers, with some spending a few thousand bucks a month, others spending hundreds of thousands per month. And a number of our customers are in the Fortune 500 – big, publicly traded companies with huge market caps that are leading the AI revolution. But the truth is, they don’t have IT departments that can actually manage datacenters in remote locations. It’s shocking for companies of their scale and complexity, but when you peel back the IT onion a little bit at these companies and look at the technical ops people that can handle infrastructure, it’s not nearly as deep as you think. And that is one of the quiet reasons why not everybody just goes to Fargo, North Dakota, or gets whatever source of power is a lot cheaper in a place that is far easier to build in compared to Silicon Valley. And that is why there’s still plenty of local demand.
TPM: What percentage of the infrastructure that you currently have under management at SJC01 is AI stuff?
Ben Coughlin: If I ballpark my rack unit count across all of the servers in the datacenter, AI probably accounts for 80 percent. We have some fat systems with thousands of GPUs running here.
TPM: OK, that means I don’t have to end this call right now. Which is good.
Ben Coughlin: When we started the business ten years ago, we had all been running datacenters for a long time. And the thing that we were seeing years back was this. With blades and virtualized environments, the server platforms were kind of getting smaller and more powerful, and you could condense the footprint and do more with a smaller physical space. And we figured out that this was going to require more power in a cabinet and more cooling in a cabinet. Nobody saw this whole AI revolution coming, but because we started doing liquid cooling from day one, we were ready.
Here is the thing: At the end of the day, this is all really about cooling inside the datacenter. You can always deliver more power circuits to a location. And that is what we focused on.
TPM: Wait a second. I thought you guys in the Valley and in other places like Ashburn in Virginia were power limited, and also that it was harder and harder to get more power down into the racks even when you can get it delivered to the building?
Ben Coughlin: Not really. Silicon Valley Power, as a utility, does have some constraints – though not quite like what’s going on in Northern Virginia, where they literally cannot give out more power. If you want to pull more power to a location in the datacenter, generally you can do that. The problem is how do you deal with the heat.
TPM: I have read the specs on what power the SJC01 datacenter could deliver to racks – where you started and where you are at today. I still think 100 kilowatts is a lot for a rack to handle, both for cooling and power reasons. What are people actually doing?
Ben Coughlin: Let me give you the building blocks. Most run of the mill datacenters support 5 kilowatts in a cabinet.
TPM: That’s stupid. A CPU is pushing 400 watts and a GPU is pushing 800 watts.
Ben Coughlin: Hey, believe me, you’re singing our tune. But ten years ago, a typical server was maybe 250 watts, and a server CPU was maybe 75 watts, maybe sometimes 100 watts.
TPM: Yeah, I remember when people were freaking out that a CPU burned more juice than a freaking incandescent light bulb, and now, it’s like they are a hair dryer and we don’t even flinch.
Ben Coughlin: When we first opened the doors, we built every rack to handle 20 kilowatts. Then, a couple years later, when we expanded and brought our next phase online, we built at 35 kilowatts. Now we support 50 kilowatts. So, just in our evolution in the last decade, we’ve gone internally from 20 to 35 to 50 kilowatts. And we can deliver 250 kilowatts per cabinet. That’s really a function of those platforms and how they are being cooled. Those are direct liquid cooled systems, and we have a number of them in operation. Some drop 35 kilowatts or 50 kilowatts in a cabinet, but we are designing and deploying a customer right now that has north of 200 kilowatts per cabinet. And no, it is not cryptomining, which is a terrible customer base.
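To put those per-cabinet figures in perspective, here is a rough sketch of how many cabinets it takes to house the same IT load at each density. The 1 megawatt load is our own illustrative assumption, not a number from Colovore, and the calculation assumes every cabinet is filled to its rated limit.

```python
import math

def cabinets_needed(load_kw: int, per_cabinet_kw: int) -> int:
    """Cabinets required to house a load, rounding up to whole cabinets."""
    return math.ceil(load_kw / per_cabinet_kw)

# Hypothetical IT load chosen only for illustration.
ILLUSTRATIVE_LOAD_KW = 1000

# Per-cabinet densities mentioned in the interview, in kilowatts.
DENSITIES_KW = {
    "run-of-the-mill datacenter": 5,
    "SJC01 at opening": 20,
    "SJC01 after expansion": 35,
    "current design point": 50,
}

cabinet_counts = {
    label: cabinets_needed(ILLUSTRATIVE_LOAD_KW, kw)
    for label, kw in DENSITIES_KW.items()
}

for label, count in cabinet_counts.items():
    print(f"{label}: {count} cabinets")
```

Under these assumptions the same load that needs 200 conventional cabinets fits in 20 at the highest density, which is the footprint compression Coughlin keeps coming back to.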
TPM: Could not agree more. If you want to start a new currency, go with Elon Musk to Mars. I’ll help y’all freaking pack and drive you to the launch pad. . . .
Ben Coughlin: These are all real AI workloads from real companies.
TPM: You are only in the Valley. How come you are not in other locations?
Ben Coughlin: You know, one step at a time. We’re profitable, and we are growing. I’ve been in Silicon Valley for a long time, and I know the venture capital model of grow at all costs. That’s not our approach.
But to your point, because we are seeing AI move from prototyping to early trials and to some deployments, we’re seeing customers move to multiple cabinets. It’s all expanding pretty fast, which is why we’re building another location next door. Beyond that, I think our next move would be a little bit out of market, but still regional in nature. So maybe we go up to Reno, there’s an area there where the power is cheaper, but it’s still relatively local. The Pacific Northwest is a good location for us. But we’re not going to plant a flag in every NFL city and go crazy. One step at a time. . . .
TPM: I know a bunch of companies that believe this, and for edge computing, I would argue, as VaporIO does, that they should be in every NFL city because the permitting and construction hassles of building out an edge network are immense.
Different topic: How much of the datacenter market will go co-lo? I think it might be one-third in the cloud, one-third on premises, and one-third co-lo in the longest of runs.
Ben Coughlin: That’s a good question. I would say it’s bigger than you think, and here’s the part that you have to remember. Of the cloud footprint – and I don’t know exactly what the number is – but around 0 percent to 40 percent of their cloud datacenters are actually running in co-lo facilities that those big guys lease. They will build their own datacenters in markets where power and land are super cheap and they can backhaul traffic to it. But they’re leasing capacity from co-lo providers in the major metros because it doesn’t make sense for them to spend all that money and pay that premium on space and power.
My point for years was that the clouds were not the silver bullet for co-los. We’ve always said it’s actually a rising tide. Yes, there are some people who will make the decision to do just pure cloud. But again, a bunch of those cloud providers are using co-los. . . .
TPM: I was ignoring that phenomenon and really thinking about the Global 20000 that do not run their own clouds and service providers, and thinking about what they might do. No one is going to move from on premises to the cloud and then repatriate to on premises. They are going to half-back to a co-lo, I think, when the cloud expense gets to be too high.
Ben Coughlin: First of all, all of our customers are hybrid. They’re using the cloud for certain applications and co-lo for certain applications. It’s really kind of multi-platform. With AI in particular and these types of workloads, the cloud does have some limitations – and it is not just cost. Everybody knows that cloud is super-duper expensive. But that’s just one variable, even if it is very important.
TPM: How much cheaper can you do AI for your customers?
Ben Coughlin: On a monthly basis, most of our customers are saving 50 percent to 70 percent versus their monthly cloud bill. There is investment on the front end when they buy their gear, but that payback can be in just three to six months. So the economics are as clear as day that the ROI is huge.
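The payback arithmetic behind that claim can be sketched in a few lines. The dollar amounts below are hypothetical placeholders that we picked for illustration, not figures from Colovore or its customers; only the 50 percent to 70 percent savings range comes from the interview. We also assume the savings percentage is net of co-lo fees.

```python
def payback_months(upfront_cost: float, cloud_monthly_bill: float,
                   savings_fraction: float) -> float:
    """Months until cumulative monthly savings cover the upfront hardware spend."""
    monthly_savings = cloud_monthly_bill * savings_fraction
    return upfront_cost / monthly_savings

# Hypothetical inputs, chosen only to illustrate the calculation.
CLOUD_MONTHLY_BILL = 100_000.0   # what the workload costs per month in the cloud
UPFRONT_HARDWARE = 250_000.0     # one-time purchase of servers and gear

# Savings range quoted in the interview: 50 percent to 70 percent per month.
for fraction in (0.50, 0.70):
    months = payback_months(UPFRONT_HARDWARE, CLOUD_MONTHLY_BILL, fraction)
    print(f"Savings of {fraction:.0%}: payback in about {months:.1f} months")
```

With these made-up inputs the payback lands inside the three to six month window; a pricier hardware purchase or a smaller cloud bill would stretch it out proportionally.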
If you just looked at the financial aspects, the cloud does not make sense for these types of AI workloads. But again, there are also other variables: You have to have the skillset to run your infrastructure. The personnel at a lot of these cloud companies are kids that are 20 years old and they have never even touched a server, and they don’t even know how it works. Some people have the CapEx-OpEx thing. Latency is another, and for AI, we see latency as a big advantage for co-los. People talk about self-driving cars and ChatGPT, which is fine, but that is a very small portion of the AI workload. But for real time applications, it is not ideal to be using the cloud, to have that infrastructure residing in the middle of the country, and you gotta go back and forth. Latency does matter for some of these applications. So the cloud isn’t perfect for AI stuff for a number of different dimensions.
Here is the thing. Whatever you are doing, you need that density of compute engines in the metros, because that’s where the data is being generated. That’s where it needs to be analyzed and stored. And the best way to do that is to have those datacenters matching what’s happening with the server platform, making it smaller and more powerful. At the end of the day, what we’re doing is mimicking what’s going on in these servers. We’re just shrinking the datacenter down and we’re just making it more efficient overall. And we leverage water to do that. We don’t need to build these, you know, Cadillacs of hundreds and hundreds of thousands of square feet.
We have a perfect example right across the street from us at a Digital Realty facility, which is six stories and 150,000 square feet. We’re 25,000 square feet, we have the exact same amount of power they do. Which means they are, for the same amount of compute, 6X larger than we are.
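The 6X figure falls straight out of the floor areas. As a quick sketch, assuming both buildings deliver the same total power (we plug in the 9 megawatts cited for SJC01, since the neighboring facility's actual load is not given), the average power density works out like this:

```python
def watts_per_square_foot(total_watts: float, square_feet: float) -> float:
    """Average power density across the whole floor area."""
    return total_watts / square_feet

# Assumption: both sites are taken to deliver the same 9 MW total.
TOTAL_POWER_WATTS = 9_000_000
COLOVORE_SQFT = 25_000     # the rounded figure Coughlin uses in conversation
NEIGHBOR_SQFT = 150_000    # the six-story facility across the street

colovore_density = watts_per_square_foot(TOTAL_POWER_WATTS, COLOVORE_SQFT)
neighbor_density = watts_per_square_foot(TOTAL_POWER_WATTS, NEIGHBOR_SQFT)
footprint_ratio = NEIGHBOR_SQFT / COLOVORE_SQFT

print(f"Colovore: {colovore_density:.0f} W per square foot")
print(f"Neighbor: {neighbor_density:.0f} W per square foot")
print(f"Footprint ratio: {footprint_ratio:.0f}X")
```

These are whole-building averages, so they include aisles, mechanical rooms, and office space, but the ratio between the two sites does not depend on the power figure chosen as long as it is the same for both.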
TPM: What is your incremental cost, and what is the incremental cost passed on to the customer?
Ben Coughlin: It’s cheaper. There’s another bit of a fallacy. Because typically, when you build air-cooled datacenters, it’s sort of linear: If I have more capacity, it costs me more. But because water is such an efficient cooling medium and it has so much capacity, you don’t need to keep building more and more. There are economies of scale there. So when we look at our costs to deliver a megawatt of critical power that is being consumed by the customer, we are 30 percent cheaper than the industry because our footprint is smaller.
The other thing you have to remember is that in our datacenter industry, a lot of the giants are real estate professionals. They built buildings, and they know how to build their buildings and run their datacenters in a way that works for them. And when they’re building at that scale, they have an approach and this is how they stamp them out. They’re not the most nimble in terms of incorporating some of these new technologies like liquid in the datacenter. So what to you and me looks very logical and necessary – liquid cooling in the datacenter – gives them pause. We are starting to see some cracks, though. Digital Realty, on its most recent quarterly conference call, finally said this high density stuff is becoming important in our datacenters.
In the meantime, we will keep chugging along under the radar and keep building out incrementally and going in the right direction.
TPM: Last question: If I wanted to do direct liquid cooling into my systems, can you do it or not?
Ben Coughlin: We have got multiple megawatts running today with direct liquid cooled servers using different methodologies. There are lots of different ways of skinning that cat.
To date, what we’ve seen is that the server chassis themselves are liquid cooled, running their own heat exchangers internally, and so we’re delivering water to the chassis, and then it’s handling it on the inside. We are seeing more interest in cold plate stuff happening, getting the water distributed even deeper into the system. And it’s a little bit of a Wild West at the moment. To be honest, right now, there’s not been great standardization because it’s early days.
The important thing is, we’ve got the water and the pipes to be able to distribute it. If you come to our datacenter and look under the floor, we have three or four feet of piping down there.
But this is the trickiest part of all of this, which people don’t quite understand and I think might be interesting for you. There’s water in all datacenters. The air conditioning units are based on water. It’s not just getting water there – you have to do filtration on the water and add chemicals and make sure the water is pure so there’s no corrosion. But the biggest thing when you distribute the water is that you have to make a lot of decisions on how big your pipes are, what’s the flow rate of the water, what’s the temperature of the water, and that stuff directly impacts those direct liquid cooled platforms.
And so once you get into the really nitty gritty of managing water, there are a lot of decisions you have to make on those variables. And this goes back to the comment I made about standards. If you have one of these CDU providers saying it wants super-fast water in thin pipes, high pressure stuff at a really cold temperature, that requires one setup of infrastructure. If somebody else is saying just give me a big pipe lazy river, like slow flow at a more moderate temperature, that requires something else. If you have one or the other, it’s not so easy for the datacenter to switch approaches.
Fortunately our system is the bigger pipe lazy river kind, and what we’ve seen thus far is that most of the cooling platforms have been going for lower flow rate water inputs.
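The tradeoff among pipe size, flow rate, and water temperature follows from basic heat transfer: the heat a water loop carries away equals mass flow times specific heat times the temperature rise of the water. Here is a minimal sketch of that relationship. The 50 kilowatt cabinet load matches the density discussed above, but the temperature rises and pipe diameters are hypothetical values of our own choosing, not Colovore's design parameters, and the water properties are approximate.

```python
import math

# Approximate properties of water near room temperature.
WATER_DENSITY_KG_PER_M3 = 997.0
WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0

def required_flow_lpm(heat_watts: float, delta_t_kelvin: float) -> float:
    """Liters per minute of water needed to absorb a heat load at a given
    temperature rise, from heat = mass_flow * specific_heat * delta_t."""
    mass_flow_kg_s = heat_watts / (WATER_SPECIFIC_HEAT_J_PER_KG_K * delta_t_kelvin)
    volume_flow_m3_s = mass_flow_kg_s / WATER_DENSITY_KG_PER_M3
    return volume_flow_m3_s * 1000.0 * 60.0

def flow_velocity_m_s(flow_lpm: float, pipe_inner_diameter_m: float) -> float:
    """Average water velocity for a volumetric flow in a round pipe."""
    volume_flow_m3_s = flow_lpm / 1000.0 / 60.0
    area_m2 = math.pi * (pipe_inner_diameter_m / 2.0) ** 2
    return volume_flow_m3_s / area_m2

CABINET_HEAT_WATTS = 50_000.0

# Hypothetical temperature rises across the cabinet, in kelvin.
for delta_t in (5.0, 10.0):
    flow = required_flow_lpm(CABINET_HEAT_WATTS, delta_t)
    print(f"Temperature rise of {delta_t:.0f} K needs {flow:.1f} liters per minute")

# Hypothetical inner pipe diameters: a thin pipe versus a "lazy river" pipe.
flow_at_10k = required_flow_lpm(CABINET_HEAT_WATTS, 10.0)
for diameter_m in (0.025, 0.050):
    velocity = flow_velocity_m_s(flow_at_10k, diameter_m)
    print(f"Pipe of {diameter_m * 1000:.0f} mm: {velocity:.2f} meters per second")
```

Halving the allowed temperature rise doubles the water that has to move, and halving the pipe diameter quadruples the velocity for the same flow, which is why a facility plumbed for one regime cannot cheaply switch to the other.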