Your GPUs. One Inference Layer.
Turn heterogeneous hardware into a unified AI compute platform. Run LLM inference across NVIDIA, AMD, Intel, and Apple Silicon — from a single API.
The Challenge
GPU Resources Are Scattered
Your organization has GPUs everywhere — workstations, servers, cloud instances. Different vendors, different capabilities, different machines. Today, each one is an island. What if you could use them all as one?
Fragmented Hardware
NVIDIA here, AMD there, Apple Silicon on laptops. No unified way to use them.
Idle Capacity
GPUs sit unused while teams wait for "the good machine" to free up.
Complex Orchestration
Load balancing, failover, model routing — building this yourself takes months.
The Solution
Cortex Unifies Your Compute
A lightweight coordinator that turns any GPU into part of your inference cluster. Deploy workers anywhere, route requests intelligently, get results reliably.
Capabilities
Built for Real Workloads
Hardware Agnostic
NVIDIA (CUDA), AMD (ROCm/Vulkan), Intel (SYCL), Apple Silicon (Metal). Mix and match freely.
Quorum Validation
2-of-3 consensus cross-checks responses across independent workers. Catch inconsistent or hallucinated answers before they reach users.
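To make the idea concrete, here is a minimal sketch of a 2-of-3 check in Python. It illustrates the voting logic only; the coordinator's actual Go implementation and its response-comparison rules are not shown here.

# Illustrative 2-of-3 quorum check (not the coordinator's actual code):
# accept an answer only when at least two of three workers agree on it,
# after light normalization.
from collections import Counter

def quorum(responses, required=2):
    normalized = [r.strip().lower() for r in responses]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer if votes >= required else None  # None -> no consensus

print(quorum(["Paris", "paris", "Lyon"]))  # -> "paris"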
Worker Reputation
Automatic quality tracking. Unreliable workers get deprioritized, good ones get more work.
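A rough sketch of how reputation-weighted routing can work; the moving-average formula and the 0.5 threshold below are illustrative assumptions, not Cortex's actual scoring rules.

# Sketch of the idea behind reputation tracking (formula and threshold
# are assumptions, not Cortex's real values).
class WorkerReputation:
    def __init__(self, alpha=0.1):
        self.score = 1.0      # start fully trusted
        self.alpha = alpha    # how strongly recent results count

    def record(self, success: bool):
        # Exponential moving average of recent outcomes.
        self.score = (1 - self.alpha) * self.score + self.alpha * (1.0 if success else 0.0)

    @property
    def deprioritized(self) -> bool:
        return self.score < 0.5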
Response Caching
SHA256-based caching for deterministic queries. Don't recompute what you've already answered.
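For illustration, one way a SHA256 cache key could be derived from a request; Cortex's exact key scheme is not documented here, so treat this as an assumption.

# Illustrative only: derive a SHA256 cache key from a chat-completion
# request by hashing its canonical JSON form.
import hashlib, json

def cache_key(model, messages, temperature=0.0):
    # Canonical JSON (sorted keys, no whitespace) so equivalent requests
    # hash to the same key.
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Only deterministic (temperature 0) queries are worth caching.
key = cache_key("llama-3", [{"role": "user", "content": "Hello"}])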
OpenAI-Compatible API
Drop-in replacement for /v1/chat/completions. Your existing code just works.
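For example, the official openai Python client can simply be pointed at the coordinator. The hostname and port follow the example further down; whether Cortex expects an API key is not specified, so the placeholder key is an assumption.

# Minimal sketch: point the OpenAI Python client at the Cortex coordinator
# instead of api.openai.com. Host, port, and the placeholder key are
# illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://coordinator:3000/v1", api_key="unused")

response = client.chat.completions.create(
    model="llama-3",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)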
Real-time Dashboard
Monitor workers, track throughput, analyze performance — all from a web UI.
How It Works
Up and Running in Minutes
Start the Coordinator
Single Go binary. No dependencies. Run it on any machine in your network.
./cortex --port 3000
Connect Workers
Point workers at the coordinator. Each worker registers its capabilities.
python worker.py --mothership coordinator:3000 --model llama-3
Send Requests
Use the OpenAI-compatible API. Cortex handles routing and consensus.
curl http://coordinator:3000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama-3", "messages": [{"role": "user", "content": "Hello"}]}'Use Cases
Who Uses Cortex?
ML Teams
“We have GPUs on every workstation but no way to share capacity across the team.”
- Pool team hardware into shared inference
- Stop waiting for "the fast machine"
- Utilize idle overnight/weekend capacity
Research Labs
“Our cluster has mixed hardware from different grant cycles.”
- Unified API across NVIDIA/AMD/Intel
- Automatic load balancing
- Reproducible results via quorum
On-Prem Enterprises
“We can't send data to cloud APIs but need reliable LLM inference.”
- 100% on-premises deployment
- No data leaves your network
- Enterprise-grade reliability
Specifications
Under the Hood
Coordinator
- Written in Go for performance and easy deployment
- Single binary, no runtime dependencies
- Embedded web dashboard
- gRPC + REST APIs
Workers
- Python with llama.cpp backend
- Automatic hardware detection
- Hot model loading/unloading
- Health monitoring and heartbeats
Protocols
- OpenAI-compatible REST API
- gRPC for worker communication
- HTTP polling for quorum (firewall-friendly)
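To show why polling is firewall-friendly, here is a hypothetical worker-side loop: the worker only makes outbound HTTP requests, so no inbound port needs to be opened on it. The /quorum/poll path is a placeholder for illustration, not a documented Cortex endpoint.

# Hypothetical polling loop (endpoint path is an assumed placeholder).
import time, requests

COORDINATOR = "http://coordinator:3000"

while True:
    task = requests.get(f"{COORDINATOR}/quorum/poll", timeout=10).json()
    if task:
        ...  # run inference locally, then report the result back outbound
    time.sleep(1)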
Requirements
- Coordinator: Any machine, minimal resources
- Workers: GPU with 8GB+ VRAM recommended
- Network: HTTP connectivity between nodes
Early Access
Join the Beta
Cortex v0.7.0 is available now for early adopters. We're looking for teams to help shape the roadmap.
What's Ready
- Core inference routing
- Quorum validation
- Worker reputation
- Response caching
- Web dashboard
- OpenAI-compatible API
Coming Soon
- CLI client
- Project-based worker pools
- Hardware-aware model selection
- Distributed project pools
Ready to Unify Your GPU Fleet?
Get started with Cortex today. Free and open source.