When multiple users share a GPU node, there's no good way to enforce fairness under resource contention — heavy users can starve others, and there's no visibility into queue times or VRAM loading latency
A middleware layer that sits in front of vLLM or similar inference engines, providing per-user rate limiting, priority queuing, context window scheduling, worst-case latency estimates, and anti-abuse detection
Freemium — open-source core with paid enterprise features (analytics dashboard, SLA enforcement, usage-based billing integrations)
The pain signals are real and validated by practitioners (the source discussion had 43 upvotes). However, it's a 'hair on fire' problem only for teams already running shared GPU inference at scale — maybe a few thousand companies worldwide today. For most teams, over-provisioning GPUs is the current 'solution' (expensive but simple). The pain intensifies rapidly as GPU costs grow and multi-tenant usage increases.
The GPU scheduling middleware slice of the inference market is estimated at $1-3B TAM by 2028. Near-term addressable market is smaller — perhaps 2,000-5,000 companies running shared vLLM/Triton deployments that would pay for this. At $500-2,000/month per deployment, that's $12-120M in near-term addressable annual revenue. Market is growing fast but the immediate buyer pool is narrow and technical.
Companies spending $10K-100K+/month on GPU inference will gladly pay $500-2K/month for middleware that improves utilization by even 15-20%. The ROI math is straightforward: better scheduling = fewer GPUs needed = direct cost savings. Enterprise GPU cloud providers (CoreWeave customers, etc.) have budget and procurement processes for this. The freemium-to-enterprise path works because the open-source core proves value before the upsell.
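The ROI math above can be made concrete with illustrative numbers drawn from the ranges in the text (the $50K spend, 15% gain, and $2K middleware price are assumptions, not data):

```go
package main

import "fmt"

func main() {
	// Hypothetical mid-range buyer: $50K/month GPU spend, $2K/month
	// middleware, 15% utilization improvement (low end of the 15-20% claim).
	gpuSpend := 50000.0
	middlewareCost := 2000.0
	utilizationGain := 0.15

	savings := gpuSpend * utilizationGain // spend avoided via better packing
	net := savings - middlewareCost
	fmt.Printf("monthly savings: $%.0f, net after middleware: $%.0f\n", savings, net)
	// prints: monthly savings: $7500, net after middleware: $5500
}
```

Even at the low end of both ranges the middleware pays for itself several times over, which is why the pricing ceiling is set by the buyer's GPU bill, not by comparable SaaS tools.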
This is technically hard. A solo dev with deep systems/GPU expertise could build a basic fair-queue proxy in front of vLLM in 4-8 weeks, but it would be naive. The hard parts: (1) GPU-aware scheduling requires real-time VRAM/compute telemetry from the inference engine, which means deep integration with vLLM internals, not just proxying HTTP requests. (2) Worst-case latency estimation requires modeling batch scheduling behavior, context lengths, and current load — this is closer to research than engineering. (3) Anti-abuse detection needs ML on usage patterns. (4) Must not add meaningful latency to the inference path. A credible MVP requires strong systems programming skills (Rust/C++/Go) and intimate knowledge of inference engine internals.
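To calibrate what the "naive" 4-8 week version covers: the per-key token-per-minute limiting is the tractable piece. A minimal sketch of an in-memory, single-process token bucket (all names hypothetical; a real deployment needs shared state across proxy replicas):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Bucket is a per-API-key token bucket, denominated in LLM tokens
// rather than requests, so long prompts consume proportionally more budget.
type Bucket struct {
	mu       sync.Mutex
	capacity float64 // burst allowance, in LLM tokens
	tokens   float64 // current balance
	rate     float64 // refill, tokens per second
	last     time.Time
}

func NewBucket(tokensPerMinute float64) *Bucket {
	return &Bucket{
		capacity: tokensPerMinute,
		tokens:   tokensPerMinute,
		rate:     tokensPerMinute / 60,
		last:     time.Now(),
	}
}

// Allow reports whether a request estimated at n tokens may proceed now.
// On false, the proxy would queue the request or return HTTP 429.
func (b *Bucket) Allow(n float64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate // refill since last check
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens < n {
		return false
	}
	b.tokens -= n
	return true
}

func main() {
	b := NewBucket(10000) // 10K tokens/minute limit
	fmt.Println(b.Allow(8000)) // true: within the burst allowance
	fmt.Println(b.Allow(8000)) // false: bucket nearly drained
}
```

This is the easy 20%; none of the four hard parts listed above show up here, which is why a bare proxy like this would be naive.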
Genuine whitespace. No one owns this layer. Every existing tool either operates at the wrong abstraction (API gateway vs. GPU scheduler), doesn't understand tenants, or requires significant DIY work. Large AI companies (OpenAI, Anthropic, Together AI) all built proprietary fair-queuing but nothing is available as standalone middleware. The 'sit in front of vLLM' positioning is strong because vLLM is the de facto standard but explicitly punts on multi-tenancy.
Natural subscription: per-GPU-node/month or per-request-volume pricing. Once deployed in the inference stack, switching costs are high (rewriting scheduling logic, retraining ops team). Usage grows with inference volume. Enterprise features (SLA enforcement, analytics, billing integrations) create strong expansion revenue. Infrastructure middleware has among the best retention rates in SaaS.
- +Genuine whitespace — no existing product addresses GPU-level tenant-fair-queuing as middleware
- +Strong recurring revenue dynamics with high switching costs once embedded in the inference stack
- +Clear ROI story: better GPU utilization = direct cost savings that far exceed middleware cost
- +vLLM's dominance as standard engine creates a focused integration target with massive addressable base
- +Open-source core strategy aligns perfectly with infrastructure buyer expectations and builds trust
- !vLLM or NVIDIA could add native multi-tenant scheduling, collapsing the middleware layer overnight — this is the existential risk
- !Deep integration with inference engine internals creates fragility across version upgrades and a maintenance burden
- !Narrow founder profile required (systems + GPU + inference expertise) limits who can execute this
- !Small initial buyer pool — only teams running shared GPU inference today, which is a few thousand companies
- !Worst-case latency estimation and anti-abuse detection are closer to research problems than engineering — risk of under-delivering on key differentiators
Open-source LLM gateway providing unified OpenAI-compatible API across 100+ providers with per-user budgets, rate limiting, load balancing, and spend tracking
Dominant open-source LLM inference engine with PagedAttention, continuous batching, and basic per-request priority scheduling
NVIDIA's production inference platform supporting multiple frameworks with dynamic batching, concurrent model execution, and MIG-based GPU partitioning
Commercial platform behind Ray providing infrastructure for scaling AI workloads including model serving with autoscaling, request routing, and batching
Kubernetes-native model inference platform with standardized serving protocol, autoscaling, canary deployments, and ModelMesh for high-density model packing
A lightweight proxy (Go or Rust) that sits in front of vLLM's OpenAI-compatible API and provides: (1) per-API-key weighted fair queuing with configurable priorities, (2) per-key request rate and token-per-minute limits, (3) real-time queue depth and estimated wait time exposed via API and Prometheus metrics, (4) a simple YAML config for tenant definitions and weights. Skip anti-abuse detection and worst-case latency estimation for MVP — those are V2 features. Ship as a single Docker container with a Helm chart.
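The weighted-fair-queuing core of feature (1) can be sketched compactly. The sketch below is a simplified single-process virtual-time scheduler, with costs in estimated tokens; names (`Tenant`, `Scheduler`) are hypothetical, and a production version would clamp idle tenants' virtual time to a global clock so a long-idle key can't bank an unbounded burst:

```go
package main

import (
	"container/heap"
	"fmt"
)

// Tenant mirrors an entry from the YAML tenant config: an API key
// plus a weight controlling its share of GPU time.
type Tenant struct {
	Key    string
	Weight float64
}

type request struct {
	tenant  string
	cost    float64 // estimated work, e.g. prompt + max output tokens
	vfinish float64 // virtual finish time used for ordering
}

// pq is a min-heap ordered by virtual finish time.
type pq []*request

func (p pq) Len() int           { return len(p) }
func (p pq) Less(i, j int) bool { return p[i].vfinish < p[j].vfinish }
func (p pq) Swap(i, j int)      { p[i], p[j] = p[j], p[i] }
func (p *pq) Push(x any)        { *p = append(*p, x.(*request)) }
func (p *pq) Pop() any {
	old := *p
	r := old[len(old)-1]
	*p = old[:len(old)-1]
	return r
}

// Scheduler charges each tenant virtual time proportional to
// cost/weight; the request with the smallest virtual finish runs next.
type Scheduler struct {
	tenants map[string]*Tenant
	vtime   map[string]float64
	queue   pq
}

func NewScheduler(ts []Tenant) *Scheduler {
	s := &Scheduler{tenants: map[string]*Tenant{}, vtime: map[string]float64{}}
	for i := range ts {
		s.tenants[ts[i].Key] = &ts[i]
	}
	return s
}

func (s *Scheduler) Enqueue(tenant string, cost float64) {
	t := s.tenants[tenant] // unknown keys would be rejected in practice
	s.vtime[tenant] += cost / t.Weight
	heap.Push(&s.queue, &request{tenant: tenant, cost: cost, vfinish: s.vtime[tenant]})
}

func (s *Scheduler) Next() *request {
	if s.queue.Len() == 0 {
		return nil
	}
	return heap.Pop(&s.queue).(*request)
}

func main() {
	s := NewScheduler([]Tenant{{"free", 1}, {"pro", 3}})
	for i := 0; i < 3; i++ {
		s.Enqueue("free", 100) // free tier floods the queue
	}
	s.Enqueue("pro", 100) // pro arrives last but is lightly loaded
	for r := s.Next(); r != nil; r = s.Next() {
		fmt.Println(r.tenant) // pro, free, free, free
	}
}
```

Note what the sketch deliberately omits: it dispatches by virtual time only, with no awareness of actual VRAM or batch state — the GPU-aware telemetry integration is exactly the hard part flagged in the difficulty assessment.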
Open-source core (fair queuing, basic rate limiting, Prometheus metrics) → Paid Team tier at $500-1,500/month per cluster (analytics dashboard, historical usage reports, tenant billing data export, Slack/PagerDuty alerts) → Enterprise tier at $3K-10K/month (SLA enforcement engine, usage-based billing integrations with Stripe/Orb, priority support, anti-abuse detection, multi-cluster federation, SSO/RBAC)
8-12 weeks to MVP with basic fair queuing. 3-4 months to first paying design partner (find 2-3 teams running shared vLLM who will beta test). 6-9 months to repeatable revenue if the integration works reliably. The long pole is proving that GPU-aware scheduling actually improves outcomes vs. naive rate limiting — you need production telemetry to demonstrate the value gap.
- “I worry about fairness during resource contention”
- “I wouldn't want to eat up the whole system when other users need it”
- “What if I try to hog all resources of a node”
- “how many seconds should I expect my worst case wait time to take until I get my first token”