Models with tool access often ignore their tools and try to reason through problems they should compute, leading to confident but wrong outputs. There's no middleware to enforce or encourage tool use.
A middleware/proxy layer that sits between the user and LLM APIs, detects when a model is attempting to reason through something it has tools for (math, code execution, data lookup), and nudges or forces tool use. Provides analytics on tool utilization rates.
Subscription SaaS with usage-based pricing per API call routed through the layer
Real pain, validated by the Reddit signals — developers DO notice and complain when models ignore tools. However, it's often a 'sigh and retry' annoyance rather than a business-critical blocker. For enterprises building production agents, intensity jumps to 8-9, but the average developer building side projects tolerates it. The pain is real but unevenly distributed.
TAM is substantial: every developer using LLM APIs with tool calling is a potential customer. Estimated 2-5M developers actively building with LLM APIs by 2026. Even at $50/mo average, the addressable market is $1-3B. However, this is a middleware play which historically captures only a thin slice of the value chain.
This is the weakest link. Individual developers will expect this to be open-source or free. Enterprises will pay, but they'll want it bundled into their existing observability stack (LangSmith, Datadog, etc.) rather than adding another vendor. The value prop is clear, but whether it justifies a separate subscription is an open question. Usage-based pricing per API call also adds cost to already expensive LLM calls — friction.
A proxy layer that intercepts LLM API calls is straightforward to build. The HARD part is the detection logic: reliably determining when a model SHOULD have used a tool but didn't requires understanding intent, which is itself an LLM problem. A naive approach (regex for math expressions, code patterns) gets you 60% there. A sophisticated approach (using a secondary LLM to judge) adds latency and cost. MVP in 4-8 weeks is doable for the proxy + basic detection + dashboard, but the detection quality will be the long-term moat and challenge.
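The naive pattern-matching approach described above can be sketched in a few lines. This is an illustrative sketch only (the function name and tool labels are assumptions, not a real API), and it demonstrates exactly why regex-level detection tops out around 60%: it flags surface patterns, not intent.

```python
import re

# Hypothetical v1 detector: flags a response that inlines arithmetic the
# model should have delegated to an available calculator tool. Pattern
# matching catches surface forms only -- it cannot judge intent, which is
# why this approach plateaus well short of reliable detection.
MATH_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s*[+\-*/^]\s*\d+(?:\.\d+)?\b")

def should_have_used_tool(response_text: str, available_tools: set[str]) -> bool:
    """Return True if the response contains inline arithmetic while a
    calculator tool was available but went unused."""
    if "calculator" not in available_tools:
        return False  # no tool to enforce; reasoning in prose is fine
    return bool(MATH_PATTERN.search(response_text))
```

A false positive here (e.g. the model restating a number the user already gave) is exactly the trust-destroying failure mode flagged in the risks below, which is why detection quality, not the proxy plumbing, is the hard part.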
This is genuinely underserved. Existing tools are either observability (see what happened) or output validation (check the result). Nobody is doing real-time intervention at the tool-use decision layer. Guardrails AI is closest in spirit but focused on different problems. This is a clear gap — the question is whether it stays a gap or gets absorbed by existing platforms.
Strong recurring fit. Every API call routed through the layer generates ongoing usage. As developers add more tools to their agents, the value increases. Analytics and dashboards create stickiness. Once integrated into a production pipeline, switching costs are moderate. Natural expansion from 'nudge tool use' to 'full agent reliability platform.'

- +Clear, underserved gap — nobody is doing real-time tool-use enforcement today
- +Growing market tailwind as agentic AI adoption accelerates
- +Natural expansion path from narrow middleware to broad agent reliability platform
- +Developer-friendly concept — easy to explain, demo, and prove value
- +Pain signal is model-agnostic — affects OpenAI, Anthropic, Google, open-source equally
- !Platform risk: OpenAI/Anthropic/Google could improve native tool use to the point this becomes unnecessary, or add enforcement features themselves
- !Detection accuracy: The core value prop depends on correctly identifying 'should have used a tool' which is a fuzzy, hard problem — false positives (forcing tool use when reasoning was correct) would destroy trust
- !Latency tax: Any middleware layer adds latency to already-slow LLM calls; developers are latency-sensitive
- !Willingness to pay: Risk of being 'nice to have' rather than 'must have' — could end up as a popular open-source tool that can't monetize
- !Absorption risk: LangChain, Guardrails AI, or observability platforms (Datadog, Arize) could add this as a feature in weeks
Open-source framework for adding validation, structure, and type-checking to LLM outputs. Focuses on output quality via validators.
Toolkit for adding programmable guardrails to LLM-based conversational apps. Uses Colang to define conversation flows and safety rails.
Observability and evaluation platform for LLM applications. Traces agent runs, logs tool calls, provides debugging and testing workflows.
AI product evaluation and observability platform. Provides logging, evals, prompt playground, and dataset management for LLM apps.
Built-in tool/function calling capabilities in frontier model APIs. Models are trained to recognize when to use provided tools.
Proxy server (Python/Node) that sits between client and OpenAI/Anthropic APIs. V1 detection: pattern-match for math calculations, date/time questions, data lookup patterns, and code execution attempts in model responses when corresponding tools are available. When detected, either (a) auto-retry with a stronger system prompt nudge, or (b) flag in a simple dashboard. Ship with a CLI install (`npx agentlint` or `pip install agentlint`) and a web dashboard showing tool utilization rates per session. Target LangChain/LangGraph users first — they already have tool definitions wired up.
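The auto-retry path (option (a) above) can be sketched as a pure decision function the proxy would call after each upstream response. This is a sketch under assumptions: `NUDGE_PROMPT` and `plan_retry` are illustrative names, and the response shape assumed is the OpenAI-style chat completion dict where tool invocations appear under `choices[0].message.tool_calls`.

```python
import re
from typing import Any, Optional

# Illustrative nudge text; a real deployment would tune this per tool set.
NUDGE_PROMPT = (
    "You have tools available. For arithmetic, dates, data lookups, or code "
    "execution, you MUST call the matching tool instead of reasoning it out."
)

# Per-tool detection patterns (v1: crude pattern matching, as described above).
PATTERNS = {
    "calculator": re.compile(r"\d+\s*[+\-*/]\s*\d+"),
    "python": re.compile(r"```(?:python|py)"),  # inline code instead of executing it
}

def plan_retry(
    response: dict[str, Any],
    messages: list[dict[str, Any]],
    tool_names: set[str],
) -> Optional[list[dict[str, Any]]]:
    """If the model answered in prose where an available tool applied,
    return a new message list with the nudge prompt prepended for a retry;
    otherwise return None (pass the response through untouched)."""
    msg = response["choices"][0]["message"]
    if msg.get("tool_calls"):  # model used a tool -- nothing to intervene on
        return None
    text = msg.get("content") or ""
    for tool, pattern in PATTERNS.items():
        if tool in tool_names and pattern.search(text):
            return [{"role": "system", "content": NUDGE_PROMPT}] + messages
    return None
```

Keeping the decision logic separate from the HTTP plumbing like this also makes the detection layer swappable later, e.g. replacing `PATTERNS` with a secondary-LLM judge without touching the proxy itself.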
- Open-source core proxy + basic detection rules (free, builds community)
- Cloud dashboard with analytics, alerting, and historical trends ($29-99/mo per seat)
- Enterprise tier with custom detection rules, SSO, audit logs, and SLA ($500+/mo)
- Usage-based pricing for high-volume API routing ($0.001-0.01 per intercepted call)
8-12 weeks to MVP with free tier; 4-6 months to first paying customer. The open-source proxy could gain traction in 2-4 weeks if launched well on HN/Reddit. Converting to paid requires the analytics dashboard to be genuinely useful, which takes iteration. Enterprise deals: 6-9 months from launch.
- “Even though Gemini 3 Deepthink had tool access, it completely ignored it”
- “tried to solve the paradox purely through brute-force reasoning for 15 minutes straight”
- “Gemma 4 31B surprisingly utilized its tool access, constantly running multiple Python scripts”