As the demand for AI products grows, developers are creating and tuning a wider variety of models. While adding new models to our growing catalog on Workers AI, we noticed that not all of them are used equally – leaving infrequently used models occupying valuable GPU space. Efficiency is a core value at Cloudflare, and with GPUs being the scarce commodity they are, we realized that we needed to build something to fully maximize our GPU usage.
Omni is an internal platform we’ve built for running and managing AI models on Cloudflare’s edge nodes. It does so by spawning and managing multiple models on a single machine and GPU using lightweight isolation. Omni makes it easy and efficient to run many small and/or low-volume models, combining multiple capabilities by:
- Spawning multiple models from a single control plane,
- Implementing lightweight process isolation, allowing models to spin up and down quickly,
- Isolating the file system between models to easily manage per-model dependencies, and
- Over-committing GPU memory to run more models on a single GPU.
Cloudflare aims to place GPUs as close as we possibly can to people and applications that are using them. With Omni in place, we’re now able to run more models on every node in our network, improving model availability, minimizing latency, and reducing power consumed by idle GPUs.
Here’s how.
Omni’s architecture – at a glance
At a high level, Omni is a platform for running AI models. When an inference request is made on Workers AI, we load the model’s configuration from Workers KV and our routing layer forwards the request to the closest Omni instance that has available capacity. For inferences using the Asynchronous Batch API, we route to an Omni instance that is idle, which is typically in a location where it’s night.
Omni runs a few checks on the inference request, runs model-specific pre- and post-processing, then hands the request over to the model.
Elastic scaling by spawning multiple models from a single control plane
If you’re developing an AI application, a typical setup is a container or a VM dedicated to running a single model with a GPU attached to it. This is simple, but it’s also heavy-handed: it requires managing the entire stack, from provisioning the VM and installing GPU drivers to downloading model weights and managing the Python environment. At scale, managing infrastructure this way is incredibly time consuming and often requires an entire team.
If you’re using Workers AI, we handle all of this for you. Omni uses a single control plane for running multiple models, called the scheduler, which automatically provisions models and spawns new instances as your traffic scales. When starting a new model instance, it downloads model weights, Python code, and any other dependencies. Omni’s scheduler provides fine-grained control and visibility over the model’s lifecycle: it receives incoming inference requests and routes them to the corresponding model processes, being sure to distribute the load between multiple GPUs. It then makes sure the model processes are running, rolls out new versions as they are released, and restarts itself when detecting errors or failure states. It also collects metrics for billing and emits logs.
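As a rough illustration of the control-plane idea (not Omni’s actual code), a single scheduler process can spawn and supervise one worker process per model. The ModelSpec fields and the "model_worker" entry point below are hypothetical:

from dataclasses import dataclass, field
import subprocess
import time

@dataclass
class ModelSpec:
    name: str
    venv_python: str                      # interpreter inside the model's virtual environment
    extra_args: list[str] = field(default_factory=list)

class Scheduler:
    """Minimal sketch of a control plane that spawns and restarts model workers."""

    def __init__(self):
        self.procs: dict[str, subprocess.Popen] = {}

    def spawn(self, spec: ModelSpec) -> None:
        # One process per model; stdout/stderr could be captured here for log collection.
        # "model_worker" is a hypothetical module name, not Omni's real entry point.
        self.procs[spec.name] = subprocess.Popen(
            [spec.venv_python, "-m", "model_worker", spec.name, *spec.extra_args]
        )

    def supervise(self, specs: list[ModelSpec]) -> None:
        # Restart any worker that has exited unexpectedly.
        while True:
            for spec in specs:
                proc = self.procs.get(spec.name)
                if proc is None or proc.poll() is not None:
                    self.spawn(spec)
            time.sleep(1)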
The inference itself is done by a per-model process, supervised by the scheduler. It receives the inference request and some metadata, then sends back a response. Depending on the model, the response can take various forms: a JSON object or an SSE stream for text generation, or binary data for image generation.
The scheduler and the child processes communicate by passing messages over Inter-Process Communication (IPC). Usually the inference request is buffered in the scheduler so that features like prompt templating or tool calling can be applied before it is passed to the child process. For potentially large binary requests, the scheduler hands the underlying TCP connection over to the child process so it can consume the request body directly.
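Handing a live TCP connection to a child process is typically done by passing the socket’s file descriptor over a Unix domain socket. Here is a small sketch of that mechanism using Python’s standard library (socket.send_fds/recv_fds, available since Python 3.9); it illustrates the OS-level technique, not Omni’s exact IPC protocol:

import socket

def send_connection(ipc: socket.socket, conn: socket.socket) -> None:
    # Parent (scheduler) side: ship the connection's fd plus a small header.
    socket.send_fds(ipc, [b"request-body-follows"], [conn.fileno()])

def receive_connection(ipc: socket.socket) -> socket.socket:
    # Child (model) side: rebuild a usable socket from the received fd and
    # read the request body directly, without copying it through the parent.
    msg, fds, _flags, _addr = socket.recv_fds(ipc, 1024, maxfds=1)
    return socket.socket(fileno=fds[0])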
Implementing lightweight process and Python isolation
Typically, deploying a model requires its own dedicated container, but we want to colocate more models in a single container to conserve memory and GPU capacity. To do so, we needed finer-grained control over CPU memory and the ability to isolate a model’s dependencies and environment. We deploy Omni in two configurations: a container running multiple models, or bare metal running a single model. In both cases, each model runs in its own namespaces and Python virtual environment and is constrained by cgroups, which lets us isolate models with different dependencies.
Python doesn’t take cgroup memory limits into account when allocating memory, which can lead to OOM errors. Many AI Python libraries rely on psutil for pre-allocating CPU memory, and psutil reads /proc/meminfo to determine how much memory is available. Since each model in Omni has its own configurable memory limits, we need psutil to reflect the current usage and limits of that model, not of the entire system.
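For example, psutil’s virtual_memory() surfaces exactly what /proc/meminfo reports, so inside a model’s mount namespace it picks up the per-model numbers rather than the machine’s:

import psutil

# On Linux, psutil derives these figures from /proc/meminfo, so a model
# running inside Omni's mount namespace sees its own limits here rather
# than the whole machine's memory.
vm = psutil.virtual_memory()
print(f"total={vm.total} available={vm.available}")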
Our solution was to create a virtual file system, using FUSE, and mount our own version of /proc/meminfo that reflects the model’s current usage and limits.
To illustrate this, here’s an Omni instance with a model process running as pid 8. If we enter its mount namespace and look at /proc/meminfo, it reflects the model’s configuration:
# Enter the mount (file system) namespace of a child process
$ nsenter -t 8 -m
$ mount
...
none /proc/meminfo fuse ...
$ cat /proc/meminfo
MemTotal: 7340032 kB
MemFree: 7316388 kB
MemAvailable: 7316388 kB
In this case the model has 7 GiB of memory available, while the entire container has 15 GiB. If the model tries to allocate more than 7 GiB, it will be OOM-killed and restarted by the scheduler’s process manager, without affecting the other models.
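As a rough sketch of the idea (using the third-party fusepy bindings, not necessarily what Omni uses internally), a tiny FUSE file system can serve a synthetic meminfo file whose numbers come from the model’s limits. The values below are hard-coded placeholders, and mounting the result over /proc/meminfo inside the model’s mount namespace is not shown:

import errno
import stat
from fuse import FUSE, FuseOSError, Operations  # third-party: fusepy

class MemInfoFS(Operations):
    """Serves a single read-only 'meminfo' file with per-model numbers."""

    def __init__(self, total_kb: int, free_kb: int):
        # In practice these would be derived from the model's cgroup limits
        # and current usage; here they are fixed placeholders.
        self.content = (
            f"MemTotal: {total_kb} kB\n"
            f"MemFree: {free_kb} kB\n"
            f"MemAvailable: {free_kb} kB\n"
        ).encode()

    def getattr(self, path, fh=None):
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o555, "st_nlink": 2}
        if path == "/meminfo":
            return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1,
                    "st_size": len(self.content)}
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", "..", "meminfo"]

    def read(self, path, size, offset, fh):
        return self.content[offset:offset + size]

if __name__ == "__main__":
    # Numbers chosen to match the shell session above (7 GiB total).
    # The mountpoint is illustrative and must already exist.
    FUSE(MemInfoFS(total_kb=7340032, free_kb=7316388),
         "/tmp/omni-meminfo", foreground=True)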
For isolating Python and some system dependencies, each model runs in a Python virtual environment, managed by uv. Dependencies are cached on the machine and, if possible, shared between models (uv uses symbolic links between its cache and virtual environments).
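Setting up such an environment can be scripted roughly as follows. The paths are made up, and the commands shown (uv venv, uv pip install) are uv’s standard CLI rather than Omni’s internal tooling:

import subprocess

def prepare_model_env(model_dir: str) -> str:
    """Create an isolated per-model venv with uv and install pinned deps.

    uv caches packages globally and links them into each venv, so models
    with overlapping dependencies share disk space.
    """
    venv = f"{model_dir}/.venv"
    subprocess.run(["uv", "venv", venv], check=True)
    subprocess.run(
        ["uv", "pip", "install", "-r", f"{model_dir}/requirements.txt",
         "--python", f"{venv}/bin/python"],
        check=True,
    )
    return f"{venv}/bin/python"   # interpreter the scheduler launches the model with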
Running each model in its own process also gives it a separate CUDA context and isolates failures, which simplifies error recovery.
Over-committing memory to run more models on a single GPU
Some models don’t receive enough traffic to fully utilize a GPU, so with Omni we can pack more models onto a single GPU, freeing up capacity for other workloads. When it comes to GPU memory management, Omni has two main jobs: safely over-commit GPU memory, so that more models than normal can share a single GPU, and enforce per-model memory limits, so that no single model can starve the others at runtime.
Over-committing memory means allocating more memory than is physically available on the device. For example, if a GPU has 10 GiB of memory, Omni would allow two models of 10 GiB each on that GPU.
Right now, Omni is configured to run 13 models on a single GPU, allocating about 400% of that GPU’s memory and saving around 4 GPUs. Omni does this by injecting a CUDA stub library that intercepts CUDA memory allocation calls (cuMemAlloc* and cudaMalloc*) and forces those allocations to be performed in unified memory mode.
In unified memory mode, CUDA shares a single memory address space between the GPU and the CPU:
CUDA’s unified memory mode
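To make the substitution concrete, here is a small ctypes sketch (not Omni’s stub, which is a native interposition library) of the two allocation paths: the ordinary cudaMalloc a framework would issue, and the cudaMallocManaged call that places the buffer in unified memory instead. It assumes the CUDA runtime (libcudart) and a GPU are present, and the library name may differ per system:

import ctypes

cudart = ctypes.CDLL("libcudart.so")   # path/name may vary per system
CUDA_MEM_ATTACH_GLOBAL = 1             # cudaMemAttachGlobal
size = 1 << 30                         # 1 GiB

# What a framework normally calls: memory lives only on the device.
dev_ptr = ctypes.c_void_p()
assert cudart.cudaMalloc(ctypes.byref(dev_ptr), ctypes.c_size_t(size)) == 0

# What the stub forces instead: one address space shared by CPU and GPU,
# so the driver can page data between them on demand.
uni_ptr = ctypes.c_void_p()
assert cudart.cudaMallocManaged(
    ctypes.byref(uni_ptr), ctypes.c_size_t(size), CUDA_MEM_ATTACH_GLOBAL
) == 0

cudart.cudaFree(dev_ptr)
cudart.cudaFree(uni_ptr)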
In practice this is what memory over-commitment looks like: imagine three models (A, B, and C). Models A+B together fit in the GPU’s memory, but C takes up the entire memory.
- Models A+B are loaded first and sit in GPU memory, while model C is in CPU memory.
- Omni receives a request for model C, so models A+B are swapped out and C is swapped in.
- Omni receives a request for model B, so model C is partly swapped out and model B is swapped back in.
- Omni receives a request for model A, so model A is swapped back in and model C is completely swapped out.
The trade-off is added latency: if an inference needs memory that currently sits in host memory, it must first be transferred to the GPU. For smaller models this latency is minimal, because PCIe 4.0, the physical bus between the GPU and the rest of the system, provides 32 GB/s of bandwidth. On the other hand, if a model needs to be “cold started” (i.e. it has been swapped out because it hasn’t been used in a while), the system may need to swap the entire model back in. A larger model, for example, might use 5 GiB of GPU memory for weights and caches and would take ~156 ms to be swapped back into the GPU. Naturally, over time, inactive models end up in CPU memory, while active models stay hot on the GPU.
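That estimate is simple back-of-the-envelope arithmetic, reading the figures above in decimal units:

# Swap-in latency for a cold-started model: ~5 GB of weights and caches
# over a ~32 GB/s PCIe 4.0 x16 link (decimal GB, matching the ~156 ms figure).
model_bytes = 5 * 10**9
pcie_bytes_per_s = 32 * 10**9
print(f"{model_bytes / pcie_bytes_per_s * 1000:.0f} ms")  # -> 156 ms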
AI frameworks tend to pre-allocate as much GPU memory as possible for performance reasons rather than allocating only what the model actually needs, which makes co-locating models more complicated. Omni allows us to control how much memory is actually exposed to any given model, preventing a greedy model from over-using the GPU allocated to it. We do this by overriding the CUDA runtime and driver APIs (cudaMemGetInfo and cuMemGetInfo): instead of exposing the entire GPU memory, we only expose a subset to each model.
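For reference, this is the query being overridden. A framework calls it to decide how much to pre-allocate, so whatever the stub reports here effectively becomes the model’s memory budget. This ctypes sketch just reads the values and assumes libcudart and a GPU are available:

import ctypes

cudart = ctypes.CDLL("libcudart.so")
# cudaMemGetInfo(size_t* free, size_t* total): frameworks size their
# pre-allocations from these numbers; Omni's stub clamps them to the
# per-model quota instead of the physical GPU totals.
free, total = ctypes.c_size_t(), ctypes.c_size_t()
assert cudart.cudaMemGetInfo(ctypes.byref(free), ctypes.byref(total)) == 0
print(f"free={free.value} bytes, total={total.value} bytes")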
How Omni runs multiple models for Workers AI
AI models can run in a variety of inference engines or backends: vLLM, Python, and now our very own inference engine, Infire. While models have different capabilities, each model needs to support Workers AI features, like batching and function calling. Omni acts as a unified layer for integrating these systems. It integrates into our internal routing and scheduling systems, and provides a Python API for our engineering team to add new models more easily. Let’s take a closer look at how Omni does this in practice:
from omni import Response
import cowsay

def handle_request(request, context):
    try:
        json = request.body.json
        text = json["text"]
    except Exception as err:
        return Response.error(...)
    return cowsay.get_output_string('cow', text)
Similar to how a JavaScript Worker works, Omni calls a request handler that runs the model’s logic and returns a response.
Omni installs Python dependencies at model startup. We run an internal Python registry and mirror the public registry. In either case we declare dependencies in requirements.txt:
cowsay==6.1
The handle_request function can be async and return different Python types, including pydantic objects. Omni will convert the return value into a Workers AI response for the eyeball.
Omni injects a Python package named omni, containing all the Python APIs for interacting with the request and the Workers AI systems, building Responses, handling errors, and so on. Internally, we publish it as a regular Python package so it can also be used standalone, for unit testing for instance:
from omni import Context, Request
from model import handle_request

def test_basic():
    ctx = Context.inactive()
    req = Request(json={"text": "my dog is cooler than you!"})
    out = handle_request(req, ctx)
    assert out == """ __________________________
| my dog is cooler than you! |
==========================
                            \\
                             \\
                               ^__^
                               (oo)\\_______
                               (__)\\       )\\/\\
                                   ||----w |
                                   ||     ||"""
Omni allows us to run models more efficiently by spawning them from a single control plane and implementing lightweight process isolation. This enables quick starting and stopping of models, isolated file systems for managing Python and system dependencies, and over-committing GPU memory to run more models on a single GPU. This improves the performance for our entire Workers AI stack, reduces the cost of running GPUs, and allows us to ship new models and features quickly and safely.
Right now, Omni is running in production on a handful of models in the Workers AI catalog, and we’re adding more every week. Check out Workers AI today to experience Omni’s performance benefits on your AI application.