BDD Is Not Dead: How Gherkin Scenarios Turned My Micro-SaaS Test Suite Into a Living Contract

Here's a question that reveals something about how you build software.

If your test suite vanished tomorrow — every file, every assertion, every carefully crafted mock — could a new developer join your project and figure out what the product actually does?

Not what the code does. What the product does. Which user journeys matter. What happens when a payment goes through. What happens when it doesn't. What "logged in" really means in your system.

If the answer is no, you have tests that verify code but don't describe behavior. And there's a gap between those two things that will cost you eventually.

I – Why a Solo Dev Writes Gherkin Scenarios

I know the pushback before you even type it. "BDD is for teams. You're one person. Who are you writing human-readable scenarios for?"

I'm writing them for the version of me that exists three months from now. The one who needs to add a new payment tier and can't remember how credit purchases interact with Stripe redirects. The one who needs to refactor the authentication flow and has forgotten which edge cases are covered.

The feature file is the answer. Not a comment buried in code. Not a Notion page that's six weeks out of date. An executable specification that passes or fails every time CI runs.

For Vibe — the micro-SaaS at vibe.oakoliver.com — the feature files cover every user journey that involves money, authentication, or AI credit consumption. These are the flows where a silent regression doesn't just break the UI. It breaks trust.

The choice was deliberate. BDD isn't something I apply to every component. It's the layer I reserve for the promises that matter most.

II – The Architecture: Three Layers That Talk to Each Other

The BDD setup for Vibe rests on three interlocking layers, each with a distinct responsibility.

Gherkin feature files are the top layer. They live in a features directory organized by domain — auth, vibes, credits, AI. Each file reads like a product specification. A non-developer can open one and understand exactly what the system is supposed to do. The language is user-centric: "Given I am logged in," "When I select the Pro package," "Then my credit balance should show 200."

Step definitions are the translation layer. Each Gherkin step maps to a function that drives Playwright — clicking buttons, filling forms, navigating pages, asserting on visible text. The step definitions are where human language becomes browser automation. And critically, they're reusable. Once you define "When I click {string}," that step works in every scenario across every feature file.

The World object is the glue. It's Cucumber's mechanism for sharing state across steps within a single scenario. In Vibe's case, the World holds the Playwright browser instance, the current user's session data, any vibe IDs created during the scenario, and payment session references. Each scenario gets a fresh World — meaning each scenario gets a fresh browser context with zero leaked state.

This three-layer architecture means the Gherkin stays clean and business-focused, the step definitions handle the mechanical translation, and the World manages lifecycle and shared context. Change the UI, and you update step definitions. Change the product behavior, and you update the feature file. Nothing touches more than one layer at a time.

III – The World Object: Why It's the Most Important Architectural Decision

The World object is the single most underestimated concept in Cucumber.

Most tutorials treat it as a minor implementation detail. In practice, it's the architectural decision that determines whether your BDD suite scales or collapses under its own weight.

Here's what lives in Vibe's custom World class. The Playwright browser, browser context, and page — managed as instance properties so that every step in a scenario operates on the same browser session. The current user's email, auth token, and credit balance — populated by authentication steps and consumed by assertion steps. A slot for the most recently created vibe ID, because scenarios often create something and then verify it. A reference to the last payment session, because payment flows span multiple pages and redirects.

The constructor takes world parameters — primarily the base URL, which defaults to localhost for local runs and can be overridden for staging environments. Two lifecycle methods handle browser creation and teardown.

Why does this matter architecturally? Because Cucumber creates a new World instance for every scenario. That means every scenario starts with a fresh browser, a clean user state, and no residual data from previous tests. You get isolation for free — not through careful cleanup, but through construction.
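To make the "isolation through construction" idea concrete, here is a minimal, dependency-free sketch. It is not Vibe's actual class. In a real suite the class would extend Cucumber's `World`, be registered with `setWorldConstructor`, and hold Playwright's browser, context, and page instead of the `BrowserSession` stand-in. The field names, the `runScenario` helper, and the port in the default URL are all assumptions for illustration.

```typescript
// Stand-in for Playwright's browser, context and page (hypothetical shape).
interface BrowserSession {
  open: boolean;
}

interface WorldParameters {
  baseUrl?: string;
}

class VibeWorld {
  readonly baseUrl: string;
  session: BrowserSession | null = null;
  userEmail: string | null = null;
  authToken: string | null = null;
  creditBalance = 0;
  lastVibeId: string | null = null;
  lastPaymentSessionId: string | null = null;

  constructor(parameters: WorldParameters = {}) {
    // Assumed default: localhost is from the text, the port is a placeholder.
    this.baseUrl = parameters.baseUrl ?? "http://localhost:3000";
  }

  // Lifecycle method 1: a Before hook would call this in a real suite.
  openBrowser(): void {
    this.session = { open: true };
  }

  // Lifecycle method 2: an After hook would call this in a real suite.
  closeBrowser(): void {
    if (this.session) this.session.open = false;
    this.session = null;
  }
}

// Models what the runner does: one new World per scenario, so nothing carries over.
function runScenario(
  steps: (world: VibeWorld) => void,
  parameters: WorldParameters = {},
): VibeWorld {
  const world = new VibeWorld(parameters);
  world.openBrowser();
  try {
    steps(world);
  } finally {
    world.closeBrowser();
  }
  return world;
}

const first = runScenario((w) => {
  w.lastVibeId = "vibe-1";
  w.creditBalance = 50;
});
const second = runScenario(() => {});
```

The second scenario never sees `vibe-1` or the 50 credits. Nothing was cleaned up; the state simply belonged to an object that no longer exists.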

The authentication method is worth special attention. Instead of walking through the UI login flow for every scenario, it calls a test-only API endpoint that creates a session directly and injects the auth cookie into the browser context. This cuts thirty seconds of UI interaction per scenario down to a single HTTP call. Multiply that by dozens of scenarios and the time savings are enormous.

This is a pattern worth remembering: authenticate through the API, test through the UI. Don't waste browser time on setup that isn't the thing being tested.
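A rough sketch of that pattern, under stated assumptions: the endpoint path, the cookie name, and the response shape are placeholders, not Vibe's real ones. The `CookieTarget` interface mirrors the one method this needs from a Playwright browser context, `addCookies`, which accepts cookies given as a name, a value, and a URL. The HTTP call is passed in as a function so the sketch runs without a server.

```typescript
// Minimal structural types so the sketch runs without a browser or server.
interface TestCookie {
  name: string;
  value: string;
  url: string;
}

interface CookieTarget {
  addCookies(cookies: TestCookie[]): Promise<void>;
}

type PostJson = (url: string, body: unknown) => Promise<{ token: string }>;

// Hypothetical endpoint path and cookie name.
const TEST_SESSION_PATH = "/__test__/session";
const SESSION_COOKIE_NAME = "session";

function sessionCookie(baseUrl: string, token: string): TestCookie {
  return { name: SESSION_COOKIE_NAME, value: token, url: baseUrl };
}

// One HTTP call creates the session; one call injects the cookie. No UI involved.
async function authenticate(
  baseUrl: string,
  email: string,
  postJson: PostJson,
  context: CookieTarget,
): Promise<string> {
  const { token } = await postJson(`${baseUrl}${TEST_SESSION_PATH}`, { email });
  await context.addCookies([sessionCookie(baseUrl, token)]);
  return token;
}
```

In a World class, `authenticate` would be a method that stores the returned token and passes the scenario's own browser context as the `CookieTarget`.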

IV – Tag-Based Hooks: The Feature That Changed Everything

Cucumber supports tags — annotations you can place on scenarios to categorize, filter, and trigger conditional behavior. This sounds like a minor organizational feature. In practice, it's a game-changer for test setup.

Consider how many of Vibe's scenarios require an authenticated user. Almost all of them. And a significant subset also require the user to have a credit balance. Without tags, every scenario would need to repeat "Given I am logged in" and "Given I have 100 credits" as boilerplate steps.

With tag-based hooks, you annotate the scenario with @authenticated or @with-credits, and the hooks handle the setup transparently. The Gherkin stays focused on the behavior being tested, not the preconditions.

The @authenticated hook calls the World's authentication method, which creates a session via the test API and injects the cookie. The scenario's Given/When/Then steps can assume the user is already logged in.

The @with-credits hook first ensures the user is authenticated (calling the auth method if the @authenticated hook hasn't already run), then seeds 100 credits via another test API endpoint. The scenario can jump straight to the behavior that consumes credits.

You can stack tags. A scenario tagged @authenticated @with-credits gets both hooks. A scenario tagged @smoke can be filtered for quick CI runs. A scenario tagged @wip gets skipped in CI but still runs locally.

This compositional approach to test setup is what makes BDD suites maintainable at scale. Without it, you end up duplicating setup logic across hundreds of scenarios, and a single change to the auth flow requires updating every file. With tags, you change one hook and every tagged scenario inherits the fix.
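The composition can be modeled in a few lines. This is a simplified stand-in, not Cucumber's API: in a real suite each entry would be a tagged hook such as `Before({ tags: "@authenticated" }, ...)`. The `HookWorld` shape and the `authCalls` counter are assumptions added to show that stacking tags does not authenticate twice.

```typescript
interface HookWorld {
  authenticated: boolean;
  credits: number;
  authCalls: number; // counts how often authentication actually ran
}

type TagHook = { tag: string; run: (world: HookWorld) => void };

// Idempotent, so it is safe to call from more than one hook.
function ensureAuthenticated(world: HookWorld): void {
  if (world.authenticated) return;
  world.authenticated = true;
  world.authCalls += 1;
}

const tagHooks: TagHook[] = [
  { tag: "@authenticated", run: ensureAuthenticated },
  {
    tag: "@with-credits",
    run: (world) => {
      ensureAuthenticated(world); // covers scenarios tagged only @with-credits
      world.credits = 100;
    },
  },
];

// Runs every hook whose tag is present on the scenario, in registration order.
function applyTagHooks(scenarioTags: string[], world: HookWorld): HookWorld {
  for (const hook of tagHooks) {
    if (scenarioTags.indexOf(hook.tag) !== -1) hook.run(world);
  }
  return world;
}

const newHookWorld = (): HookWorld => ({ authenticated: false, credits: 0, authCalls: 0 });

const stacked = applyTagHooks(["@authenticated", "@with-credits"], newHookWorld());
const creditsOnly = applyTagHooks(["@with-credits"], newHookWorld());
const smokeOnly = applyTagHooks(["@smoke"], newHookWorld());
```

Changing how authentication works means editing `ensureAuthenticated` once. Every tagged scenario picks up the change without touching a feature file.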

V – Step Definitions: The Translation Layer That Compounds

Step definitions are where Gherkin meets Playwright. Each step maps a natural-language pattern to a browser automation function. And the compounding effect of reusable steps is what makes BDD worth the initial investment.

Take the most universal step: "When I click {string}." That single step definition uses Playwright's role-based locator to find a button with the given text and clicks it. It works for "Purchase," "Generate Image," "Sign Out," and every other button in the application. One definition, unlimited scenarios.

"Then I should see {string}" uses Playwright's text-content assertion to verify visible text on the page. "When I navigate to {string}" maps a human-readable page name to a URL path. "Then I should be redirected to {url-pattern}" waits for the URL to match a pattern.

After the initial investment of building 15-20 core step definitions, writing a new scenario becomes a matter of composing existing steps. In Vibe's suite, about 65% of steps in any given scenario are reused from other feature files. New feature files write themselves because the vocabulary already exists.
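Why one definition serves many scenarios is easier to see with a toy matcher. The sketch below is a deliberately simplified model of how a step pattern with a `{string}` parameter is matched against step text. It is not Cucumber's implementation and supports only double-quoted `{string}` arguments. The handlers return a description instead of driving a browser; in a real suite the click handler would do something like `page.getByRole("button", { name: label }).click()`.

```typescript
type StepHandler = (...args: string[]) => string;

interface RegisteredStep {
  pattern: RegExp;
  handler: StepHandler;
}

const stepRegistry: RegisteredStep[] = [];

// Turns 'I click {string}' into a RegExp that captures the quoted argument.
function compileExpression(expression: string): RegExp {
  // Escape regex metacharacters, but leave { and } for the placeholder swap.
  const escaped = expression.replace(/[.*+?^$()|[\]\\]/g, "\\$&");
  const source = escaped.replace(/\{string\}/g, '"([^"]*)"');
  return new RegExp(`^${source}$`);
}

function defineStep(expression: string, handler: StepHandler): void {
  stepRegistry.push({ pattern: compileExpression(expression), handler });
}

// Finds the first definition matching the step text and calls it with the captures.
function runStep(text: string): string {
  for (const step of stepRegistry) {
    const match = step.pattern.exec(text);
    if (match) return step.handler(...match.slice(1));
  }
  throw new Error(`Undefined step: ${text}`);
}

defineStep("I click {string}", (label) => `click button "${label}"`);
defineStep("I should see {string}", (text) => `expect text "${text}"`);

const purchaseAction = runStep('I click "Purchase"');
const signOutAction = runStep('I click "Sign Out"');
```

Two definitions already cover every button label and every visible message in the suite, which is where the reuse figure comes from.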

The domain-specific steps are where the interesting work happens. "Given the following credit packages exist" takes a Gherkin data table — rows of package name, credit amount, and price — and seeds them through the test API. "When the payment is approved" waits for the browser to reach the payment page, clicks an "Approve Payment" button on a mock checkout, and waits for the redirect back to the app. "Then the credit transaction should show {string}" locates the transaction list component and asserts on its contents.

Each domain-specific step encodes one piece of product knowledge. When the payment flow changes, you update one step definition. When the credit display moves to a different component, you update one locator. The Gherkin scenarios — the living documentation — don't change unless the behavior changes.

VI – The Mock Payment Page: Testing Money Without Spending It

Payment flows are the single most important thing to test in any product that charges users. They're also the hardest to test end-to-end because they involve third-party redirects, webhooks, and real financial consequences.

Vibe solves this with a mock payment page that only exists in the test environment. The key phrase is "only exists." This isn't a feature flag. The mock routes aren't loaded, aren't imported, and don't appear in the application's route table unless the environment variable is set to test mode.

The mock checkout page is deliberately simple. It shows the amount, provides an "Approve Payment" button and a "Decline Payment" button, and redirects to the appropriate success or cancel URL. It simulates the Stripe redirect flow without touching Stripe.

The step definitions for "When the payment is approved" and "When the payment is declined" handle the interaction with this mock page. The browser waits for the checkout URL pattern, clicks the appropriate button, and waits for the redirect back.

This pattern gives you three things. First, deterministic payment tests — no network calls, no webhook timing issues, no Stripe rate limits. Second, zero attack surface in production — the mock endpoints simply don't exist outside the test environment. Third, speed — a mock redirect completes in milliseconds, while a real Stripe checkout adds seconds of network latency.

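The conditional loading is critical and deserves emphasis. In the server's main entry point, the test helper routes are dynamically imported only when the environment is test. Because the check runs at startup, a production process never loads or registers the mock routes. Not as dead routes. Not behind a flag. Not at all. One caveat: a runtime check alone does not strip the file from the build output, so if you also need the test helpers absent from the deployed artifact, make sure your bundler or deploy step excludes them.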
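The runtime gate can be sketched like this. It is a simplified, synchronous model: `loadTestRoutes` stands in for the dynamic import of the test helper module, and the environment value, route paths, and function names are placeholders rather than Vibe's real ones.

```typescript
interface RouteEntry {
  method: string;
  path: string;
}

// Stand-in for a dynamic import of the test helper module (hypothetical).
type RouteLoader = () => RouteEntry[];

const coreRoutes: RouteEntry[] = [
  { method: "GET", path: "/credits" },
  { method: "POST", path: "/credits/purchase" },
];

function buildRouteTable(appEnv: string | undefined, loadTestRoutes: RouteLoader): RouteEntry[] {
  const table = coreRoutes.slice();
  // Runtime gate: the loader is never invoked outside test mode.
  if (appEnv === "test") {
    for (const route of loadTestRoutes()) table.push(route);
  }
  return table;
}

let loaderCalls = 0;
const mockCheckoutRoutes: RouteLoader = () => {
  loaderCalls += 1;
  return [{ method: "GET", path: "/mock-checkout" }];
};

const productionTable = buildRouteTable("production", mockCheckoutRoutes);
const testTable = buildRouteTable("test", mockCheckoutRoutes);
```

The point of passing a loader rather than a route list is that the production path never even evaluates the test module, so a mistake inside it cannot affect startup.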

VII – Data Tables and Scenario Outlines: Structured Test Input

Gherkin provides two features for structured test data that eliminate the verbosity of writing individual scenarios for every variation.

Data tables let you pass tabular data into a step. "Given the following vibes exist" followed by a table of creator, title, mood, and visibility flags seeds the test database with multiple records in a single step. The step definition receives the table as an array of objects and iterates through them. This is far cleaner than writing five separate "Given a vibe exists" steps.
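The conversion from table to objects is simple enough to show in full. Cucumber's `DataTable` does this for you through its `hashes()` method; the function below is a hand-rolled equivalent so the idea is visible, and the sample rows are invented for illustration.

```typescript
type TableRow = { [column: string]: string };

// First row is the header; each later row becomes an object keyed by header.
function tableToObjects(raw: string[][]): TableRow[] {
  const [header, ...body] = raw;
  if (!header) return [];
  return body.map((cells) => {
    const row: TableRow = {};
    header.forEach((column, index) => {
      row[column] = cells[index] ?? ""; // missing cells become empty strings
    });
    return row;
  });
}

// Illustrative data only; the column names follow the description above.
const vibesTable: string[][] = [
  ["creator", "title", "mood", "public"],
  ["alice@example.test", "Morning focus", "calm", "true"],
  ["bob@example.test", "Late deploy", "tense", "false"],
];

const seededVibes = tableToObjects(vibesTable);
```

Every value arrives as a string, so a step definition that seeds the database still has to convert flags like `public` into booleans itself.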

Data tables are particularly powerful for setup. You can describe an entire starting state — credit packages, existing users, availability windows — in a compact, readable format at the top of a scenario. Anyone reading the feature file sees the full context before the action begins.

Scenario outlines let you parameterize an entire scenario and run it with different inputs from an Examples table. "Given I have <initial> credits, When I use the <feature> feature, Then my balance should show <remaining>," paired with an Examples table listing AI Chat at 1 credit, AI Image at 5, and AI Image HD at 10. One scenario template, three test executions.

Scenario outlines are the BDD answer to parameterized tests. They're more readable than looping in code, and the Examples table serves as documentation of the expected costs for each feature. When the pricing changes, you update the Examples table and the test becomes the source of truth for the new pricing.
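What the runner does with an outline can be modeled as plain string substitution: one concrete scenario per Examples row. This is a sketch of the expansion, not Cucumber's code. The costs are the ones listed above; the starting balance of 100 is an assumption chosen to make the arithmetic visible.

```typescript
type ExampleRow = { [placeholder: string]: string };

// Replaces every <placeholder> in the template steps with the row's value.
function expandOutline(templateSteps: string[], examples: ExampleRow[]): string[][] {
  return examples.map((row) =>
    templateSteps.map((step) =>
      step.replace(/<([^>]+)>/g, (whole: string, key: string) => {
        const value = row[key];
        return value === undefined ? whole : value; // unknown placeholders stay as-is
      }),
    ),
  );
}

const outlineSteps = [
  "Given I have <initial> credits",
  "When I use the <feature> feature",
  "Then my balance should show <remaining>",
];

// Assumed starting balance of 100; remaining = 100 minus the feature's cost.
const costExamples: ExampleRow[] = [
  { initial: "100", feature: "AI Chat", remaining: "99" },
  { initial: "100", feature: "AI Image", remaining: "95" },
  { initial: "100", feature: "AI Image HD", remaining: "90" },
];

const expandedScenarios = expandOutline(outlineSteps, costExamples);
```

Each inner array is a complete scenario whose steps match the same step definitions as any hand-written one, so a pricing change is a one-cell edit in the table.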

The combination of data tables for setup and scenario outlines for variations covers most of the structured test input needs you'll encounter. They keep the Gherkin concise without sacrificing coverage.

VIII – The Payment Feature File: BDD at Its Most Valuable

The credit purchase feature file is the crown jewel of Vibe's BDD suite. It demonstrates every concept working together — background setup, tag-based hooks, data tables, step reuse, and domain-specific assertions.

The background section seeds three credit packages using a data table: Starter at 50 credits for 4.99, Pro at 200 for 14.99, and Mega at 1000 for 49.99. Every scenario in the file inherits this setup.

The first scenario verifies that an authenticated user sees all three packages with correct pricing on the credits page. Simple but important — it catches rendering bugs and data-binding issues.

The second scenario walks through a successful purchase: navigate to credits, select the Pro package, click Purchase, get redirected to the mock payment page, approve the payment, get redirected back, see the confirmation message, and verify the credit balance updated to 200. One scenario that verifies the full payment lifecycle end-to-end.

The third scenario tests a failed payment — same flow, but the user clicks "Decline" on the mock checkout, and the test verifies they return to the credits page with a "Payment was not completed" message and zero credits.

The fourth scenario is tagged @with-credits and tests credit deduction: the user has 100 credits, generates an AI image, and the balance should show 95 with a transaction entry showing "AI Image Generation: -5 credits."

The fifth scenario tests the insufficient credits path: the user has 2 credits, tries to generate an image that costs 5, and sees a "Not enough credits" message with a link to the purchase page.

Five scenarios. One feature file. They collectively describe and verify every state in Vibe's payment and credit system. If any of these scenarios fails, you know exactly which promise to users is broken. Not which function threw. Which promise.

IX – When BDD Adds Weight Instead of Value

I won't pretend BDD is the right tool for everything. The initial setup took two full days — building the World, writing the hooks, creating the step definitions, implementing the mock payment page. That's a real investment.

BDD adds overhead that isn't justified for pure utility functions. If you're testing a date parser or a string formatter, write a unit test. Ten lines, no browser, instant feedback.

API-only endpoints with no UI are better served by integration tests. If your endpoint receives JSON and returns JSON, testing it through a browser adds latency and complexity with no benefit. Use the application's handle method directly, as I covered in the previous article on Elysia.js integration testing.

Rapidly prototyped features are a poor fit because the UI changes daily. Writing Gherkin scenarios for something that will be completely redesigned next week is wasted effort. Wait until the flow stabilizes, then document it with BDD.

Performance-sensitive tests don't belong in a BDD suite. Browser startup adds overhead. Playwright interactions add latency. If you're benchmarking response times, use direct HTTP calls.

Where BDD earns its keep is multi-step user journeys involving decisions, state changes, and consequences. Authentication. Payments. Content creation. Anything where the behavior involves multiple screens, multiple actors, or multiple possible outcomes. The more complex the journey, the more BDD's clarity advantage compounds.

For Vibe, the split is roughly 70% BDD for user journeys and 30% integration plus unit tests for backend logic. That ratio reflects a product where the user experience is the value proposition.

Want to Write Your First Feature File This Afternoon?

The minimum viable BDD setup is smaller than you think. One Cucumber configuration file. One World class with browser lifecycle methods. One feature file with three scenarios. A handful of step definitions.

You don't need to cover your entire product on day one. Start with the most critical user journey — the one that, if it broke, would cost you the most. Write the Gherkin first. Then make it pass.

Book a session at mentoring.oakoliver.com and we'll write your first feature file together — from Gherkin scenario to passing green in a single session.

Or explore what Vibe does with BDD-tested AI features at vibe.oakoliver.com.

X – The ROI That Doesn't Show Up in Coverage Reports

The return on BDD investment shows up in five places that don't appear in any coverage metric.

Regression protection that reads like English. When I broke the credit deduction flow during a refactor, the failing scenario told me "Scenario: Credit deduction on AI image generation — FAILED at 'Then my credit balance should show 95.'" Not a stack trace. Not a function name. A product behavior that stopped working. The diagnosis time dropped from minutes to seconds.

Living documentation that never drifts. Feature files are the source of truth for product behavior because they're executed on every CI run. A wiki page goes stale the moment someone forgets to update it. A feature file goes red.

Payment confidence. Payments are the one thing you absolutely cannot break silently. The BDD scenarios for credit purchase, deduction, and insufficient balance have caught three bugs that unit tests missed — all related to redirect timing and state transitions across pages.

Compounding step reuse. After the initial setup, new scenarios write quickly because most steps already exist. "When I click {string}" is universal. "Then I should see {string}" is universal. Each new feature file reuses 60-70% of existing step definitions. The marginal cost of a new scenario approaches the time it takes to write the Gherkin itself.

A forcing function for accessible UI. Knowing that Playwright needs to find elements by role, by label, by test ID pushes you toward semantic HTML and proper accessibility attributes. The BDD layer makes the app more accessible as a side effect. Not out of virtue. Out of practical necessity.

XI – BDD Is a Thinking Technique Disguised as a Testing Technique

The deepest value of BDD isn't the automated browser testing. It's the cognitive discipline of writing behavior before implementation.

When you write the Gherkin scenario first, you're forced to think in user language, not developer language. You're forced to define what "done" looks like before you write a single line of production code. You're forced to articulate the preconditions, the actions, and the expected outcomes in terms that anyone could read.

That's not overhead. That's engineering.

Your micro-SaaS isn't a codebase. It's a set of promises. "If you buy credits, they appear in your balance." "If the payment fails, you aren't charged." "If you generate an image, the right number of credits are deducted."

BDD makes those promises explicit, testable, and trackable. When every promise is a passing scenario, you know exactly where you stand.

And when a scenario fails? You know exactly which promise you broke — and for which user it matters.

When was the last time a failing test told you not just what broke, but who it affects?

– Antonio