Stop Exposing Your API Keys: How I Built a Five-Layer AI Proxy That Lets Users Call LLMs Without the Security Nightmare

Someone scraped an OpenAI API key from a public GitHub repo and ran up a fourteen-thousand-dollar bill in 72 hours.
That's not hypothetical. It happened to a developer I know. The key was in a dotenv example file that accidentally contained a real value. A bot found it within hours. By the time the billing alert fired, the damage was done.
Now imagine you're building a micro-SaaS where every user needs access to AI features — chat completions, image generation, text analysis. You can't give each user their own key. You can't embed your key in the frontend. And you can't just stand up a server endpoint that proxies requests with no rate limiting and no cost controls.
You need a proxy. A smart one. One that holds the keys, tracks the costs, enforces limits per user, and delivers a seamless AI experience without exposing any of the underlying infrastructure.
That's exactly what I built for vibe.oakoliver.com, and this article walks through every architectural decision and every lesson learned from running it in production.
I – The Threat Model: Why AI Keys Are Toxic Assets
When you build a product that calls LLMs, you have exactly three options for where the API key lives. Each option has a fundamentally different risk profile.
Option one: in the frontend. This is catastrophic. Anyone can open DevTools, inspect network requests, search the bundled JavaScript, or read error logs. The key is visible to every user, every browser extension, and every man-in-the-middle proxy. Don't do this. Ever.
Option two: in a simple server proxy with no controls. Better — the key is hidden from the browser. But if any authenticated user can call your AI endpoint unlimited times, a single abusive user, a compromised account, or a buggy frontend retry loop can drain your entire API budget in minutes. You've hidden the key but created an open wallet.
Option three: in a server proxy with per-user credit deduction, rate limiting, input validation, and provider abstraction. This is the answer. This is what Vibe does.
The frontend never sees an API key. It never even knows which AI provider is being used. It posts a prompt to a single endpoint and receives a completion. The proxy handles authentication, validation, credit checks, rate limiting, provider routing, error sanitization, cost tracking, and credit deduction. Every concern lives server-side, behind a clean API boundary.
II – The Five Security Layers
The AI proxy has five security layers, and each one catches a different class of threat. Skip any layer and you leave a specific attack vector open.
Layer one is authentication. Every AI request requires a valid session token. Anonymous access is impossible. This is the most basic layer, but it's also the foundation — without it, none of the other layers matter because you can't identify who's making the request.
Layer two is input validation. The framework's schema validation rejects malformed requests before they reach the handler. Message content is capped at ten thousand characters to prevent prompt injection via enormous payloads. Maximum token output is bounded to prevent runaway costs. And the model selection is restricted to an allowlist — users can only request models you've explicitly approved. This isn't just type safety. It's cost safety. A request for an unrestricted model with unlimited tokens could generate a single response costing hundreds of dollars.
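To make that concrete, here's a minimal sketch of what such a schema could look like, assuming a Zod-style validator. The model names and exact limits are illustrative, not Vibe's actual configuration.

```typescript
import { z } from "zod";

// Models the proxy is willing to route. Anything else is rejected before
// the handler runs. Illustrative names, not the real allowlist.
const ALLOWED_MODELS = ["gemini-2.0-flash", "gpt-4o-mini"] as const;

export const chatRequestSchema = z.object({
  messages: z
    .array(
      z.object({
        role: z.enum(["user", "assistant"]),
        // Cap content length to block enormous payloads.
        content: z.string().min(1).max(10_000),
      })
    )
    .min(1)
    .max(50),
  // Bound output tokens so a single request can't generate a runaway bill.
  maxTokens: z.number().int().min(1).max(2048).default(1024),
  temperature: z.number().min(0).max(2).default(0.7),
  model: z.enum(ALLOWED_MODELS).default("gemini-2.0-flash"),
  systemPrompt: z.string().max(2_000).optional(),
});

export type ChatRequest = z.infer<typeof chatRequestSchema>;

// safeParse returns { success: false, error } instead of throwing, so the
// handler can map validation failures to a 400 response.
export function validateChatRequest(body: unknown) {
  return chatRequestSchema.safeParse(body);
}
```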
Layer three is the credit check. Even authenticated users with perfectly valid input can't make calls without sufficient credits in their account. Before any AI provider is contacted, the proxy checks the user's credit balance against the estimated cost of the request. If the balance is too low, the response comes back as HTTP 402 — Payment Required — with the required amount, current balance, and a link to the credits page. 402 is the most underused HTTP status code in existence, and it's perfect for this exact situation.
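A sketch of what that 402 payload might look like. The field names are assumptions, not Vibe's actual contract.

```typescript
// Shape of the 402 Payment Required body the proxy might return.
interface InsufficientCreditsResponse {
  error: "insufficient_credits";
  required: number;   // estimated cost of this request, in credits
  balance: number;    // the user's current balance
  creditsUrl: string; // where to buy more
}

function buildInsufficientCreditsResponse(
  required: number,
  balance: number
): { status: number; body: InsufficientCreditsResponse } {
  return {
    status: 402, // Payment Required
    body: {
      error: "insufficient_credits",
      required,
      balance,
      creditsUrl: "/credits",
    },
  };
}
```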
Layer four is rate limiting. Even users with ample credits can't exceed twenty requests per minute. This catches three scenarios: a compromised account being used for bulk extraction, a buggy frontend retrying in a tight loop, and a legitimate user trying to game the system by scripting against the API. The rate limit headers follow the standard convention, so any well-behaved client can display remaining quota to the user.
Layer five is error sanitization. When an AI provider rejects a request — content policy violation, model overload, internal error — the raw provider response is caught, sanitized, and wrapped in a normalized error format. The user sees "Your request could not be processed. Please revise your prompt." They never see the provider's internal error type, parameter names, or infrastructure details. Provider errors contain information about your backend. Never expose them.
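Here's one way to sketch that sanitization step. The retryable heuristic and the logging are illustrative, not the production classification logic.

```typescript
// Normalized error the client is allowed to see. The raw provider error
// never leaves the server; it only goes to the server log.
interface SanitizedAIError {
  error: "ai_provider_error";
  message: string;
  retryable: boolean;
}

function sanitizeProviderError(err: unknown): SanitizedAIError {
  // Keep the full provider error server-side for debugging.
  console.error("AI provider error:", err);

  // Rough heuristic: treat timeouts and overloads as retryable.
  const raw = err instanceof Error ? err.message : String(err);
  const retryable = /timeout|overloaded|429|503/i.test(raw);

  return {
    error: "ai_provider_error",
    message: "Your request could not be processed. Please revise your prompt.",
    retryable,
  };
}
```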
Each layer is independent. Each catches a distinct failure mode. Together, they form a defense-in-depth architecture where no single layer's failure compromises the system.
III – The Provider Router: One Interface, Multiple Brains
The provider router is the abstraction layer that decides which LLM to call and normalizes the response format regardless of provider.
The concept is straightforward. A model-to-provider mapping determines routing. Models starting with "gemini" route to Google's API. Models starting with "gpt" route to OpenAI. The router accepts a unified request format — messages, max tokens, temperature, optional system prompt — and returns a unified response format — content, model used, provider name, and token usage.
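A minimal sketch of that router, with the provider calls stubbed out. The request and response shapes are illustrative.

```typescript
// Unified request/response shapes: the only formats the handler and the
// frontend ever see.
interface AIRequest {
  model: string;
  messages: { role: "user" | "assistant"; content: string }[];
  maxTokens: number;
  temperature: number;
  systemPrompt?: string;
}

interface AIResponse {
  content: string;
  model: string;
  provider: "google" | "openai";
  usage: { inputTokens: number; outputTokens: number };
}

// Provider-specific implementations live elsewhere; these stubs stand in
// for them so the routing logic is visible on its own.
async function callGemini(req: AIRequest): Promise<AIResponse> {
  throw new Error("replace with a call to Google's API");
}
async function callOpenAI(req: AIRequest): Promise<AIResponse> {
  throw new Error("replace with a call to OpenAI's API");
}

// Prefix-based routing: the model name alone decides the provider.
export async function routeCompletion(req: AIRequest): Promise<AIResponse> {
  if (req.model.startsWith("gemini")) return callGemini(req);
  if (req.model.startsWith("gpt")) return callOpenAI(req);
  throw new Error(`Unsupported model: ${req.model}`);
}
```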
The abstraction pays off in three ways that matter in production.
First, switching providers is a configuration change, not a code change. When Gemini's pricing shifted, I moved default traffic to a different model by updating the default parameter. No frontend deployment. No API contract change. No user-visible difference.
Second, the frontend doesn't care. The response format is identical regardless of which provider handled the request. React components consume content, usage stats, and remaining credits. Nothing provider-specific leaks through the API boundary. If I added a third provider tomorrow — Anthropic, Mistral, a local model — the frontend would never know.
Third, fallback becomes trivial. If Gemini returns an empty response or times out (which happens more often than you'd expect from a major provider), the router can retry with OpenAI. If OpenAI is slow, prefer Gemini. This resilience layer is invisible to the user. They send a prompt and get a completion. The routing decisions are entirely server-side.
The provider-specific functions handle the differences in request and response format. Gemini uses "model" and "user" roles while OpenAI uses "assistant" and "user." Gemini's system prompt goes in a separate field while OpenAI prepends it as a system message. Gemini returns token counts in a different structure than OpenAI. All of this variation is absorbed by the router so that neither the handler above it nor the frontend beyond it ever deals with provider-specific logic.
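For reference, here's roughly what those payload conversions look like. The shapes below approximate the two REST APIs; check the provider docs for the authoritative formats.

```typescript
type Message = { role: "user" | "assistant"; content: string };

// Gemini-style payload: "assistant" becomes "model", content goes into
// parts, and the system prompt lives in a separate systemInstruction field.
function toGeminiPayload(messages: Message[], systemPrompt?: string) {
  return {
    contents: messages.map((m) => ({
      role: m.role === "assistant" ? "model" : "user",
      parts: [{ text: m.content }],
    })),
    ...(systemPrompt
      ? { systemInstruction: { parts: [{ text: systemPrompt }] } }
      : {}),
  };
}

// OpenAI-style payload: roles pass through, and the system prompt is
// prepended as a leading system message.
function toOpenAIPayload(messages: Message[], systemPrompt?: string) {
  return {
    messages: [
      ...(systemPrompt ? [{ role: "system", content: systemPrompt }] : []),
      ...messages.map((m) => ({ role: m.role, content: m.content })),
    ],
  };
}
```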
For image generation, the routing is simpler — currently only OpenAI's image API is used — but the interface is the same. When Gemini or another provider offers competitive image generation, adding it to the router is a single function and a routing rule. No API changes required.
IV – Credit Deduction: The Most Important Design Decision
Credits are Vibe's core monetization mechanism. Every AI call costs credits. Credits cost money. This is how a micro-SaaS sustains itself without requiring users to bring their own API keys.
The most important design decision in the entire proxy is this: deduct credits AFTER the API call succeeds, not before.
Here's why. If you deduct before the call and the LLM provider fails — timeout, content policy rejection, internal server error — the user loses credits for nothing. They'll submit a support ticket. They'll feel cheated. They'll lose trust. And you'll spend time issuing manual refunds.
Instead, the proxy performs a pre-check that verifies sufficient balance before making the API call. This pre-check is an estimate based on the maximum possible cost — the model's per-token rate multiplied by the requested maximum tokens. It's intentionally conservative.
The actual deduction happens only after a successful response arrives. And the deduction uses the real token count from the provider's response, not the estimated maximum. This means users are sometimes pleasantly surprised — the estimated cost was 5 credits but the actual usage was 3. That's the right direction to err in.
The database operation for deduction uses an atomic SQL expression that prevents negative balances at the database level. Even if a race condition allowed two concurrent requests to pass the pre-check simultaneously, the deduction operation itself guarantees the balance never goes below zero. Defense-in-depth means anticipating race conditions even when you think your application logic prevents them.
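A sketch of that deduction, assuming Postgres via the pg driver. The table and column names are assumptions.

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from environment variables

// Deduct credits only after a successful AI response. The WHERE clause makes
// the update atomic: if two concurrent requests race past the pre-check, the
// second one simply matches zero rows instead of driving the balance negative.
export async function deductCredits(
  userId: string,
  cost: number
): Promise<boolean> {
  const result = await pool.query(
    `UPDATE users
        SET credits = credits - $1
      WHERE id = $2
        AND credits >= $1`,
    [cost, userId]
  );
  // rowCount of 1 means the deduction applied; 0 means the balance was too
  // low at commit time and the caller should handle it.
  return result.rowCount === 1;
}
```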
Different models have different per-token costs. Flash models cost roughly one credit per thousand tokens. Pro models cost five to ten. Image generation uses flat pricing — five credits for standard quality, ten for HD. These costs are transparent to users and displayed before they make a request.
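As a sketch, the pricing can be a plain lookup keyed by model, with a conservative estimator for the pre-check. The numbers and model names below are illustrative.

```typescript
// Illustrative pricing in credits; the real rates live in configuration and
// track provider pricing.
const CHAT_COST_PER_1K_TOKENS: Record<string, number> = {
  "gemini-2.0-flash": 1, // flash-class models: ~1 credit per 1k tokens
  "gemini-1.5-pro": 7,   // pro-class models: 5 to 10 credits per 1k tokens
};

const IMAGE_COST = { standard: 5, hd: 10 };

// Conservative pre-check: assume the response uses every requested token.
function estimateChatCost(model: string, maxTokens: number): number {
  const rate = CHAT_COST_PER_1K_TOKENS[model] ?? 10; // price unknown models high
  return Math.ceil((maxTokens / 1000) * rate);
}
```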
The transaction log records every deduction with metadata: which model was used, how many tokens were consumed, which provider handled it. This serves three purposes — user-facing transaction history, internal cost tracking, and dispute resolution if a user questions a charge.
V – Rate Limiting: Protecting Your Budget and Your Users' Accounts
Rate limiting on AI endpoints serves dual purposes that are often conflated. It protects your API budget from runaway costs, and it protects your users' accounts from compromise.
A compromised account without rate limiting is an open wallet for an attacker. Credits drain in seconds. With a rate limit of twenty requests per minute, a compromised account burns credits slowly enough for the legitimate user to notice and respond.
The implementation is an in-memory sliding window counter keyed on user ID. For authenticated requests, the key is the user's database ID. For unauthenticated requests that somehow reach the rate limiter (they shouldn't, given the auth layer, but defense-in-depth), the key falls back to the client IP.
For a single-server deployment on a Hetzner VPS, in-memory rate limiting is perfectly adequate. The map is cleaned up every five minutes to prevent memory growth. If I ever scaled horizontally — multiple server instances behind a load balancer — I'd swap the in-memory store for Redis. But on a single server behind Traefik, adding Redis for rate limiting would be solving a problem I don't have. Resist the urge to over-engineer for scale you haven't reached.
The rate limit response includes standard headers — limit, remaining, and reset timestamp — so the frontend can display remaining quota. The 429 response includes a Retry-After value in seconds, allowing well-behaved clients to back off automatically.
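Here's a compact sketch of that limiter. The header names follow the draft RateLimit-* convention; some clients expect the older X-RateLimit-* variants instead.

```typescript
// In-memory sliding window: per key, keep the timestamps of requests made
// inside the window and count them. Good enough for a single server; swap
// for Redis if you ever run multiple instances.
const WINDOW_MS = 60_000;
const MAX_REQUESTS = 20;

const hits = new Map<string, number[]>();

export function checkRateLimit(key: string) {
  const now = Date.now();
  const recent = (hits.get(key) ?? []).filter((t) => now - t < WINDOW_MS);

  if (recent.length >= MAX_REQUESTS) {
    // Oldest timestamp decides when a slot frees up again.
    const retryAfterSec = Math.ceil((recent[0] + WINDOW_MS - now) / 1000);
    return {
      allowed: false,
      headers: {
        "RateLimit-Limit": String(MAX_REQUESTS),
        "RateLimit-Remaining": "0",
        "RateLimit-Reset": String(retryAfterSec),
        "Retry-After": String(retryAfterSec),
      },
    };
  }

  recent.push(now);
  hits.set(key, recent);
  return {
    allowed: true,
    headers: {
      "RateLimit-Limit": String(MAX_REQUESTS),
      "RateLimit-Remaining": String(MAX_REQUESTS - recent.length),
    },
  };
}

// Periodic cleanup so the map doesn't grow without bound.
setInterval(() => {
  const now = Date.now();
  for (const [key, timestamps] of hits) {
    const recent = timestamps.filter((t) => now - t < WINDOW_MS);
    if (recent.length === 0) hits.delete(key);
    else hits.set(key, recent);
  }
}, 5 * 60_000);
```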
This rate limiter caught a real incident in production. A user's frontend had a bug in its error handling that retried failed AI requests in a tight loop. Without the rate limit, it would have burned through their entire credit balance in seconds. The rate limiter caught the loop, returned 429 responses, the user saw the "too many requests" error, and they fixed their code. The system worked exactly as designed, protecting the user from their own bug.
VI – The Frontend Knows Nothing
From the React side, calling the AI proxy is indistinguishable from calling any other API endpoint. A mutation hook posts a messages array to the chat endpoint, receives content and metadata back, and updates the credit balance in the query cache.
The frontend knows nothing about Gemini. Nothing about OpenAI. Nothing about API keys. Nothing about token pricing. It knows three things: send messages, display responses, and handle three specific error states.
HTTP 402 means insufficient credits — show the user their balance, how many credits they need, and a link to purchase more. HTTP 429 means rate limited — show a "please wait" message with the retry countdown. HTTP 502 means the AI provider failed — show a generic error with a retry button if the error is marked as retryable.
The mutation's success callback updates the credit balance in React Query's cache using the remaining credits value from the response. This means the credit display in the navigation bar updates instantly after every AI call — no separate API request, no polling, no delay. This kind of UX polish comes naturally when you design the proxy response format with the frontend in mind from the start.
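A sketch of that hook using TanStack Query. The endpoint path, response fields, and query key are assumptions for illustration.

```typescript
import { useMutation, useQueryClient } from "@tanstack/react-query";

// Shape of the proxy response as the sketch assumes it.
interface ChatResponse {
  content: string;
  usage: { inputTokens: number; outputTokens: number };
  creditsRemaining: number;
}

interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

async function postChat(messages: ChatMessage[]): Promise<ChatResponse> {
  const res = await fetch("/api/ai/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
  });
  if (!res.ok) throw new Error(`AI request failed with status ${res.status}`);
  return res.json();
}

export function useChatCompletion() {
  const queryClient = useQueryClient();

  return useMutation({
    mutationFn: postChat,
    onSuccess: (data) => {
      // Update the cached credit balance so the nav bar refreshes instantly,
      // without a separate request or polling.
      queryClient.setQueryData(["credits"], data.creditsRemaining);
    },
  });
}
```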
Custom error classes on the frontend — InsufficientCreditsError, RateLimitError, AIError — enable typed error handling in the UI layer. The component rendering the chat interface can catch an InsufficientCreditsError specifically and render an upgrade prompt, while catching an AIError and rendering a retry option. Structured errors on the backend enable structured error handling on the frontend. The investment pays forward.
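A sketch of those error classes and the status-to-error mapping a fetch wrapper might perform. The class names mirror the ones above; the payload fields are assumptions.

```typescript
// Typed error classes so UI components can branch on the failure mode.
export class AIError extends Error {
  constructor(message: string, public retryable = false) {
    super(message);
    this.name = "AIError";
  }
}

export class InsufficientCreditsError extends AIError {
  constructor(public required: number, public balance: number) {
    super("Insufficient credits");
    this.name = "InsufficientCreditsError";
  }
}

export class RateLimitError extends AIError {
  constructor(public retryAfterSeconds: number) {
    super("Too many requests", true);
    this.name = "RateLimitError";
  }
}

// Map proxy status codes to typed errors before they reach the UI.
export async function throwForStatus(res: Response): Promise<void> {
  if (res.ok) return;
  const body = await res.json().catch(() => ({}));
  if (res.status === 402) {
    throw new InsufficientCreditsError(body.required ?? 0, body.balance ?? 0);
  }
  if (res.status === 429) {
    throw new RateLimitError(Number(res.headers.get("Retry-After") ?? 0));
  }
  throw new AIError("AI request failed", body.retryable ?? false);
}
```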
VII – Monitoring: You Can't Manage What You Don't Measure
Running an AI proxy without monitoring is like driving without a dashboard. You need to know costs, latency, error rates, and usage patterns — not just for debugging, but for business decisions.
Every AI request generates a metric record: timestamp, user ID, provider, model, request type, latency in milliseconds, input and output token counts, credits charged, success or failure, and error type if applicable.
The buffered write pattern is important for performance. Individual database inserts for every AI request would hammer the database needlessly. Instead, metrics accumulate in an in-memory buffer and flush to the database every thirty seconds or when the buffer reaches one hundred entries, whichever comes first.
On a single-server deployment, this is reliable enough. Metrics survive normal operation and only risk loss on a hard process crash — which is an acceptable trade-off for the performance gain. If the flush fails, the metrics go back in the buffer for the next attempt. Eventual consistency for metrics is fine. Eventual consistency for credit deductions is not. Know which data requires which guarantee.
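Here's a minimal sketch of that buffered writer. Persistence is left as a stub; the real implementation performs a batched database insert.

```typescript
interface AIMetric {
  timestamp: number;
  userId: string;
  provider: string;
  model: string;
  latencyMs: number;
  inputTokens: number;
  outputTokens: number;
  creditsCharged: number;
  success: boolean;
  errorType?: string;
}

// Stub: the real implementation writes the batch as a single multi-row insert.
async function insertMetricsBatch(batch: AIMetric[]): Promise<void> {}

const FLUSH_INTERVAL_MS = 30_000;
const FLUSH_THRESHOLD = 100;

let buffer: AIMetric[] = [];

export function recordMetric(metric: AIMetric): void {
  buffer.push(metric);
  if (buffer.length >= FLUSH_THRESHOLD) void flush();
}

async function flush(): Promise<void> {
  if (buffer.length === 0) return;
  const batch = buffer;
  buffer = [];
  try {
    await insertMetricsBatch(batch);
  } catch {
    // On failure, put the batch back so the next flush retries it.
    buffer = batch.concat(buffer);
  }
}

// Time-based flush alongside the size-based one above.
setInterval(() => void flush(), FLUSH_INTERVAL_MS);
```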
The metrics enable query-based dashboards: daily cost breakdown by provider and model, average latency trends, error rate per provider, per-user consumption patterns. When Gemini's error rate spiked one afternoon — empty responses that shouldn't have been empty — the metrics showed it within the next flush cycle. I shifted default routing to OpenAI within minutes.
VIII – Lessons from Four Months in Production
After running this AI proxy through four months of real user traffic, these are the lessons that survived contact with reality.
Estimate conservatively, deduct accurately. The pre-check uses maximum possible cost. The deduction uses actual cost. Users are never overcharged, and the pre-check occasionally blocks requests that would have been cheaper than estimated. That's the right direction to err — slightly too cautious beats slightly too expensive every time.
Gemini is cheaper, OpenAI is more reliable. Gemini Flash is significantly cheaper per token, but it occasionally returns empty responses or times out under load. OpenAI is more expensive but rarely fails. The router currently defaults to Gemini for cost efficiency with automatic fallback to OpenAI on failure. This isn't a permanent architectural decision — it's a knob I adjust based on the metrics.
Image generation is where the money goes. Chat completions are cheap — usually one to three credits. Image generation costs five to ten. About 70% of total credit spend across all users is on images. This influenced the pricing tiers and the decision to show estimated cost before the user clicks Generate.
Keep the abstraction thin. The provider router started as a simple conditional and it's still a simple conditional. I resisted the urge to add retry logic, circuit breakers, request queuing, and load balancing across providers. On a single VPS, that complexity would be solving problems I don't have. The best code is the code you don't write until you need it.
HTTP 402 is your friend. Returning a structured 402 response with the required amount, current balance, and upgrade URL means the frontend can render a rich "insufficient credits" experience without any special-case logic. The status code carries semantic meaning that generic error handling can't match.
Build Your Own in an Afternoon
If you're building a product with AI features, here's the minimum viable proxy. Five steps.
One endpoint that accepts messages and returns completions. One auth check to verify the user is logged in. One credit check to verify they can afford the request. One provider call to get the completion. One deduction to subtract credits after success.
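Here's what those five steps look like as a single sketch, with the auth, billing, and provider pieces stubbed out. Every name below is a placeholder for your own code, not a real library.

```typescript
type ChatBody = { messages: { role: string; content: string }[]; maxTokens?: number };

interface User { id: string; credits: number }

export async function handleChat(user: User | null, body: ChatBody) {
  // 1. Auth check
  if (!user) return { status: 401, body: { error: "unauthorized" } };

  // 2. Credit pre-check against the maximum possible cost
  const maxTokens = Math.min(body.maxTokens ?? 1024, 2048);
  const estimated = Math.ceil(maxTokens / 1000); // ~1 credit per 1k tokens
  if (user.credits < estimated) {
    return {
      status: 402,
      body: { error: "insufficient_credits", required: estimated, balance: user.credits },
    };
  }

  // 3. Provider call: the key never leaves the server
  const completion = await callProvider(body.messages, maxTokens);

  // 4. Deduct credits only after success, using actual token usage
  const actualCost = Math.ceil(completion.tokensUsed / 1000);
  await deductCredits(user.id, actualCost);

  return {
    status: 200,
    body: { content: completion.content, creditsRemaining: user.credits - actualCost },
  };
}

// Placeholder implementations so the sketch compiles; replace with real code.
async function callProvider(messages: ChatBody["messages"], maxTokens: number) {
  return { content: "", tokensUsed: 0 };
}
async function deductCredits(userId: string, cost: number): Promise<void> {}
```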
That's an afternoon of work. Everything else — multiple providers, rate limiting, metrics, fallback routing, buffered writes — is iteration. Each layer addresses a specific production concern, and you add them as those concerns become real.
The core principle never changes: your API keys live on the server, your users talk to your proxy, and every request has a measurable cost. Start there. Scale from there.
Want to Get Your AI Architecture Right on the First Try?
I've helped developers design AI proxy architectures that handle everything from simple chat wrappers to multi-provider systems with credit billing and usage analytics.
Whether you're integrating Gemini, OpenAI, Anthropic, or running local models, the architectural patterns are the same. Server-side keys. Per-user cost tracking. Provider abstraction. Defense in depth.
Book a session at mentoring.oakoliver.com and let's design your AI integration together — from proxy architecture to credit system to frontend error handling.
IX – The Key You Never Expose Is the Key That Never Leaks
Security isn't a feature you add at the end. It's an architectural constraint you design around from the beginning.
Every decision in this proxy — the five security layers, the provider abstraction, the credit system, the error sanitization — exists because the alternative is trusting that nothing will go wrong. And in production, things always go wrong.
The developer who lost fourteen thousand dollars to a scraped API key didn't make a careless mistake. They made a reasonable one — they put a key in a file that seemed safe, and they were wrong. The architecture that would have saved them is the same one described in this article: keys on the server, behind authentication, behind credit limits, behind rate limits, behind error boundaries.
Your users don't need to know where the intelligence comes from. They need to know that when they ask for a completion, they get one — quickly, reliably, and at a fair price.
The proxy makes that promise possible. And the five layers make sure you can keep it.
What would it cost your product — in money, in trust, in reputation — if your AI keys leaked tomorrow?
– Antonio