Your GPUs. One Inference Layer.
Turn heterogeneous hardware into a unified AI compute platform. Run LLM inference across NVIDIA, AMD, Intel, and Apple Silicon — from a single API.
The Challenge
GPU Resources Are Scattered
Your organization has GPUs everywhere — workstations, servers, cloud instances. Different vendors, different capabilities, different machines. Today, each one is an island. What if you could use them all as one?
Fragmented Hardware
NVIDIA here, AMD there, Apple Silicon on laptops. No unified way to use them.
Idle Capacity
GPUs sit unused while teams wait for "the good machine" to free up.
Complex Orchestration
Load balancing, failover, model routing — building this yourself takes months.
The Solution
Cortex Unifies Your Compute
A lightweight coordinator that turns any GPU into part of your inference cluster. Deploy workers anywhere, route requests intelligently, get results reliably.
Capabilities
Built for Real Workloads
Hardware Agnostic
NVIDIA (CUDA), AMD (ROCm/Vulkan), Intel (SYCL), Apple Silicon (Metal). Mix and match freely.
Quorum Validation
2-of-3 consensus cross-checks responses across independent workers. Catch inconsistent or hallucinated answers before they reach users.
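To make the idea concrete, here is a minimal sketch of a 2-of-3 check in Python. It illustrates the voting logic only; the coordinator's actual Go implementation and its response-comparison rules are not shown here.

# Illustrative 2-of-3 quorum check (not the coordinator's actual code):
# accept an answer only when at least two of three workers agree on it,
# after light normalization.
from collections import Counter

def quorum(responses, required=2):
    normalized = [r.strip().lower() for r in responses]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer if votes >= required else None  # None -> no consensus

print(quorum(["Paris", "paris", "Lyon"]))  # -> "paris"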
Worker Reputation
Automatic quality tracking. Unreliable workers get deprioritized, good ones get more work.
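A rough sketch of how reputation-weighted routing can work; the moving-average formula and the 0.5 threshold below are illustrative assumptions, not Cortex's actual scoring rules.

# Sketch of the idea behind reputation tracking (formula and threshold
# are assumptions, not Cortex's real values).
class WorkerReputation:
    def __init__(self, alpha=0.1):
        self.score = 1.0      # start fully trusted
        self.alpha = alpha    # how strongly recent results count

    def record(self, success: bool):
        # Exponential moving average of recent outcomes.
        self.score = (1 - self.alpha) * self.score + self.alpha * (1.0 if success else 0.0)

    @property
    def deprioritized(self) -> bool:
        return self.score < 0.5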
Response Caching
SHA256-based caching for deterministic queries. Don't recompute what you've already answered.
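For illustration, one way a SHA256 cache key could be derived from a request; Cortex's exact key scheme is not documented here, so treat this as an assumption.

# Illustrative only: derive a SHA256 cache key from a chat-completion
# request by hashing its canonical JSON form.
import hashlib, json

def cache_key(model, messages, temperature=0.0):
    # Canonical JSON (sorted keys, no whitespace) so equivalent requests
    # hash to the same key.
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Only deterministic (temperature 0) queries are worth caching.
key = cache_key("llama-3", [{"role": "user", "content": "Hello"}])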
OpenAI-Compatible API
Drop-in replacement for /v1/chat/completions. Your existing code just works.
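For example, the official openai Python client can simply be pointed at the coordinator. The hostname and port follow the example further down; whether Cortex expects an API key is not specified, so the placeholder key is an assumption.

# Minimal sketch: point the OpenAI Python client at the Cortex coordinator
# instead of api.openai.com. Host, port, and the placeholder key are
# illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://coordinator:3000/v1", api_key="unused")

response = client.chat.completions.create(
    model="llama-3",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)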
Real-time Dashboard
Monitor workers, track throughput, analyze performance — all from a web UI.
How It Works
Up and Running in Minutes
Start the Coordinator
Single Go binary. No dependencies. Run it on any machine in your network.
./cortex --port 3000
Connect Workers
Point workers at the coordinator. Each worker registers its capabilities.
python worker.py --mothership coordinator:3000 --model llama-3
Send Requests
Use the OpenAI-compatible API. Cortex handles routing and consensus.
curl http://coordinator:3000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama-3", "messages": [{"role": "user", "content": "Hello"}]}'Use Cases
Who Uses Cortex?
ML Teams
“We have GPUs on every workstation but no way to share capacity across the team.”
- Pool team hardware into shared inference
- Stop waiting for "the fast machine"
- Utilize idle overnight/weekend capacity
Research Labs
“Our cluster has mixed hardware from different grant cycles.”
- Unified API across NVIDIA/AMD/Intel
- Automatic load balancing
- Reproducible results via quorum
On-Prem Enterprises
“We can't send data to cloud APIs but need reliable LLM inference.”
- 100% on-premises deployment
- No data leaves your network
- Enterprise-grade reliability
Specifications
Under the Hood
Coordinator
- Written in Go for performance and easy deployment
- Single binary, no runtime dependencies
- Embedded web dashboard
- gRPC + REST APIs
Workers
- Python with llama.cpp backend
- Automatic hardware detection
- Hot model loading/unloading
- Health monitoring and heartbeats
Protocols
- OpenAI-compatible REST API
- gRPC for worker communication
- HTTP polling for quorum (firewall-friendly)
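To show why polling is firewall-friendly, here is a hypothetical worker-side loop: the worker only makes outbound HTTP requests, so no inbound port needs to be opened on it. The /quorum/poll path is a placeholder for illustration, not a documented Cortex endpoint.

# Hypothetical polling loop (endpoint path is an assumed placeholder).
import time, requests

COORDINATOR = "http://coordinator:3000"

while True:
    task = requests.get(f"{COORDINATOR}/quorum/poll", timeout=10).json()
    if task:
        ...  # run inference locally, then report the result back outbound
    time.sleep(1)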
Requirements
- Coordinator: Any machine, minimal resources
- Workers: GPU with 8GB+ VRAM recommended
- Network: HTTP connectivity between nodes
Early Access
Join the Beta
Cortex v0.7.0 is available now for early adopters. We're looking for teams to help shape the roadmap.
What's Ready
- Core inference routing
- Quorum validation
- Worker reputation
- Response caching
- Web dashboard
- OpenAI-compatible API
Coming Soon
- CLI client
- Project-based worker pools
- Hardware-aware model selection
- Distributed project pools
Ready to Unify Your GPU Fleet?
Get started with Cortex today. Free and open source.