I Processed 1.2 Billion Map Records on a $7/Month Server. The Lead Database Industry Doesn't Want You to Know This.

There are eight million businesses mapped in OpenStreetMap across France, Germany, Spain, and Portugal.
They have names. Addresses. Phone numbers. Websites. Opening hours. Dozens of other tags. All freely available. Updated by a community of two million mappers. Downloadable as a single file.
And almost nobody in the SaaS world is using this data for lead generation.
I did. I built a pipeline called osm-vibe-all that downloads PBF extracts from four European countries, parses nearly 11 gigabytes of compressed map data, filters for businesses with web presences, enriches them with France's official SIRENE business registry, crawls their websites to assess technology stacks and quality, scores each lead on a zero-to-hundred scale, and outputs over one million qualified business leads.
The whole thing runs on a single Hetzner VPS with four CPUs and eight gigs of RAM. Total processing time: under six hours. Compute cost: roughly thirty-five euro cents per run.
No Spark. No Hadoop. No Kubernetes. Just Bun, SQLite, and a lot of streaming.
I – Why the Lead Database Industry Is a Racket
Commercial lead databases charge ten thousand to one hundred thousand dollars per year. ZoomInfo. Apollo. Clearbit. They are optimized for enterprise sales — big companies, C-suite contacts, firmographic data.
But the platform I built — vibe.oakoliver.com — targets small and medium businesses. Restaurants. Shops. Agencies. Clinics. Freelancers. The businesses that need affordable web tools but are almost invisible to enterprise lead databases. Many do not even have LinkedIn company pages.
Yet they do have a physical location mapped in OpenStreetMap. They do have a website, often tagged right in the map data. They do have a phone number. And they do have opening hours, which suggests they are actively operating.
OpenStreetMap is the most comprehensive free database of small businesses in Europe. For France specifically, OSM has better coverage of small businesses than Google Maps in many areas, thanks to the active French mapping community and imports from official government datasets.
The data is licensed under the Open Database License, which allows commercial use with attribution. For a pipeline where the output is your own derived scoring — not a redistribution of raw OSM data — this is perfectly fine.
The entire commercial lead database industry charges a premium for data that is largely derived from the same public sources anyone can access. The only barrier is engineering effort.
So I built the engineering.
II – The Five-Stage Pipeline
The pipeline has five stages, each implemented as a separate script that reads from and writes to a shared SQLite database.
Stage one downloads the data. PBF extracts from Geofabrik — one file per country. France is about 4.5 gigabytes. Germany is about 4.2. Spain is 1.1. Portugal is 0.8. Total: roughly 10.6 gigabytes. On a Hetzner VPS with gigabit bandwidth, this takes about fifteen minutes. Downloads run sequentially to respect Geofabrik's rate limits — they are a community service, not a CDN.
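
For reference, here is what that sequential download step can look like in Bun. This is a minimal sketch: the URL pattern is Geofabrik's standard one, but the output paths are illustrative.

```ts
// Sequential download of Geofabrik extracts. One at a time, on purpose:
// Geofabrik is a community service, not a CDN.
const extracts = ["france", "germany", "spain", "portugal"];

for (const region of extracts) {
  const url = `https://download.geofabrik.de/europe/${region}-latest.osm.pbf`;
  console.log(`Downloading ${url} ...`);
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Download failed: ${res.status} ${url}`);
  // Bun.write streams the response body to disk without buffering it all.
  await Bun.write(`./data/${region}-latest.osm.pbf`, res);
}
```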
Stage two parses the PBF files and filters for businesses. This is the most technically interesting stage.
Stage three enriches French businesses with data from SIRENE, France's official business registry maintained by INSEE.
Stage four crawls websites with headless browsers and scores each lead. The most time-consuming stage.
Stage five exports qualified leads. Filtered, scored, ready for the vibe platform.
SQLite is the glue between stages. Each stage reads its input, processes it, and writes results back. You can re-run any individual stage without re-running the entire pipeline. If the crawling stage crashes at sixty percent, you restart it and it picks up where it left off because it checks which businesses have already been crawled.
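
Here is a minimal sketch of that resume check using Bun's built-in SQLite. The table and column names are illustrative, not the pipeline's actual schema.

```ts
import { Database } from "bun:sqlite";

const db = new Database("./pipeline.db");

// Pick only businesses that have not been crawled yet, so a restart
// picks up exactly where the previous run left off.
const pending = db
  .prepare(
    "SELECT id, website FROM businesses WHERE crawled_at IS NULL AND website IS NOT NULL LIMIT ?"
  )
  .all(1000) as { id: number; website: string }[];

// After each site is processed, mark the row so it is never re-crawled.
const markDone = db.prepare("UPDATE businesses SET crawled_at = unixepoch() WHERE id = ?");
for (const b of pending) {
  // ...crawl b.website, store derived signals...
  markDone.run(b.id);
}
```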
III – Why SQLite and Not PostgreSQL
This surprises people.
The pipeline runs on the same VPS that hosts vibe.oakoliver.com, which uses per-user SQLite databases. Using SQLite for the pipeline means no additional database server to manage. A single-file database that can be copied, backed up, or inspected trivially. WAL mode handles the single-writer, multi-reader pattern perfectly.
And the killer advantage: Bun's built-in SQLite module uses the C SQLite library directly via FFI. No N-API overhead. No JavaScript-to-C bridge cost per call. Combined with prepared statements and batched transactions, insertion rates are extraordinary.
For a pipeline that processes data in bulk stages — not concurrent API requests — SQLite's single-writer limitation is irrelevant. You are always doing one thing at a time, and doing it fast.
IV – Parsing a Billion Elements Without Running Out of Memory
PBF files are compressed Protocol Buffer streams containing every node, way, and relation in the OpenStreetMap dataset. A node is a point — latitude and longitude with tags. A way is an ordered list of nodes — buildings, roads. A relation is a grouping of nodes and ways that represents complex structures such as administrative boundaries.
We only care about nodes and ways that represent businesses. Things tagged as amenity, shop, office, tourism, leisure, craft, healthcare, or club. Restaurants. Bakeries. Hotels. Clinics. Law offices.
The France extract alone is 4.5 gigabytes compressed, which expands to roughly sixty gigabytes of raw XML. You cannot load this into memory. You must stream it.
The parser reads PBF blocks sequentially. For each element, it checks for business-related tags. If the element has a recognized category and a name, it gets extracted. Unnamed businesses are skipped — a bakery with no name is not a useful lead.
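
A sketch of that filter logic. The tag keys mirror the categories listed above; the returned shape is illustrative.

```ts
// Keep only named elements carrying one of the recognized business tag keys.
const BUSINESS_KEYS = [
  "amenity", "shop", "office", "tourism",
  "leisure", "craft", "healthcare", "club",
] as const;

function extractBusiness(tags: Record<string, string>) {
  const name = tags["name"];
  if (!name) return null; // unnamed businesses are not useful leads

  const key = BUSINESS_KEYS.find((k) => k in tags);
  if (!key) return null; // not a business-tagged element

  return {
    name,
    category: `${key}=${tags[key]}`, // e.g. "shop=bakery"
    website: tags["website"] ?? tags["contact:website"] ?? null,
    phone: tags["phone"] ?? tags["contact:phone"] ?? null,
  };
}
```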
Extracted businesses are batched into groups of ten thousand and inserted into SQLite inside a single transaction. This batching is critical. SQLite commits per transaction, so reducing transaction count from 7.4 million individual inserts to 740 batched transactions dramatically reduces filesystem sync overhead.
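
In bun:sqlite, the batching looks roughly like this (a sketch with an illustrative schema): `db.transaction` wraps the callback in a single BEGIN/COMMIT, so ten thousand rows cost one filesystem sync instead of ten thousand.

```ts
import { Database } from "bun:sqlite";

type Biz = { osmId: number; name: string; category: string; website: string | null };

const db = new Database("./pipeline.db");
const insert = db.prepare(
  "INSERT OR IGNORE INTO businesses (osm_id, name, category, website) VALUES (?, ?, ?, ?)"
);

// One transaction per ten-thousand-row batch.
const insertBatch = db.transaction((rows: Biz[]) => {
  for (const r of rows) insert.run(r.osmId, r.name, r.category, r.website);
});

let batch: Biz[] = [];
function onBusiness(b: Biz) {
  batch.push(b);
  if (batch.length >= 10_000) {
    insertBatch(batch);
    batch = [];
  }
}
```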
The results are staggering. The pipeline processed 1.22 billion OSM elements and extracted 7.4 million businesses in under two hours.
- France: 485 million elements, 2.1 million businesses, 42 minutes
- Germany: 520 million elements, 2.8 million businesses, 48 minutes
- Spain: 142 million elements, 1.6 million businesses, 14 minutes
- Portugal: 78 million elements, 0.9 million businesses, 8 minutes
Average processing rate: nearly 11 million elements per minute. The bottleneck is PBF decompression — it uses zlib internally — not SQLite insertion.
V – The Four SQLite Optimizations That Made It Possible
Without these four settings, the pipeline would have taken ten times longer.
First: WAL journal mode. Write-Ahead Logging allows readers and writers to operate concurrently. This lets you query the database for monitoring and debugging while the import is running.
Second: NORMAL synchronous mode. This trades some crash safety for roughly five times faster writes. Acceptable for a pipeline where you can always re-run from the PBF source files.
Third: a 512-megabyte page cache. The database file reaches about two gigabytes. Keeping twenty-five percent in memory avoids constant disk reads during index updates.
Fourth: batched transactions. Ten thousand inserts per transaction. This single optimization is probably worth more than the other three combined. The difference between 7.4 million transactions and 740 is the difference between hours and minutes.
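
Assuming a setup step like the one below, the first three optimizations are single PRAGMA statements; the fourth, batched transactions, was sketched in the import section above.

```ts
import { Database } from "bun:sqlite";

const db = new Database("./pipeline.db");

db.exec("PRAGMA journal_mode = WAL");   // 1: readers and the writer coexist
db.exec("PRAGMA synchronous = NORMAL"); // 2: fewer fsyncs; fine for a re-runnable pipeline
db.exec("PRAGMA cache_size = -524288"); // 3: negative means KiB, so this is 512 MB of page cache
// 4: batch inserts inside db.transaction(), as shown earlier.
```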
VI – Enriching With France's Official Business Registry
For French businesses, we can do something special. The SIRENE registry — maintained by INSEE, France's national statistics bureau — is the country's official business database. It contains the SIREN number, legal form, industry classification, creation date, employee count, and sometimes revenue brackets.
SIRENE data is freely available as a seven-gigabyte CSV download.
The enrichment works by fuzzy-matching OSM businesses to SIRENE establishments. Normalize the business name — lowercase, strip diacritics, remove legal form abbreviations like SARL and SAS, collapse whitespace. Then match on normalized name plus postal code.
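
A minimal sketch of that normalization. The legal-form list here is a plausible subset, not exhaustive.

```ts
// Lowercase, strip diacritics, drop French legal-form abbreviations,
// collapse whitespace. Longer forms listed first so "sas" wins over "sa".
const LEGAL_FORMS = /\b(sarl|sasu|sas|eurl|sci|snc|sa)\b/g;

function normalizeName(name: string): string {
  return name
    .toLowerCase()
    .normalize("NFD")                // decompose é -> e + combining accent
    .replace(/[\u0300-\u036f]/g, "") // strip the combining accents
    .replace(LEGAL_FORMS, "")
    .replace(/[^a-z0-9 ]/g, " ")     // punctuation to spaces
    .replace(/\s+/g, " ")
    .trim();
}

// Match key: normalized name plus postal code, e.g. "boulangerie dupont|75011"
const matchKey = (name: string, postcode: string) =>
  `${normalizeName(name)}|${postcode}`;
```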
Match rate: approximately 34%. This sounds low but is expected. Many small businesses in OSM do not have accurate postcode tags, and name normalization cannot handle all variations — especially abbreviations and informal names.
But the 34% that do match become significantly stronger leads. You can verify they are active, registered businesses. You know their legal structure. You know when they were founded. You sometimes know their approximate size.
That kind of enrichment turns a map pin into actionable business intelligence.
VII – Crawling Two Million Websites Without Getting Banned
For every business with a website URL, the pipeline launches a headless browser, navigates to the site, and analyzes what it finds.
Eight concurrent browser instances run in parallel, pulling uncrawled businesses from the SQLite database. Each page gets fifteen seconds to load. A clear, honest user agent identifies the bot and links to a page explaining the research purpose.
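
The article does not name the browser library, but assuming Playwright, the worker pool can be sketched like this. The bot-info URL is illustrative.

```ts
import { chromium } from "playwright"; // assumption: browser library not named in the article

const CONCURRENCY = 8;
const PAGE_TIMEOUT_MS = 15_000;
const USER_AGENT = "osm-vibe-bot/1.0 (+https://vibe.oakoliver.com/bot)"; // illustrative bot page

async function crawlWorker(nextUrl: () => string | undefined) {
  const browser = await chromium.launch();
  const context = await browser.newContext({ userAgent: USER_AGENT });
  const page = await context.newPage();

  for (let url = nextUrl(); url; url = nextUrl()) {
    try {
      const res = await page.goto(url, { timeout: PAGE_TIMEOUT_MS });
      // ...derive signals from `page` and `res`; store signals, never content...
    } catch {
      // timeouts and dead domains are expected; record the failure and move on
    }
  }
  await browser.close();
}

// Eight workers drain a shared queue of uncrawled URLs.
const queue: string[] = [/* ...website URLs pulled from SQLite... */];
const next = () => queue.shift();
await Promise.all(Array.from({ length: CONCURRENCY }, () => crawlWorker(next)));
```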
The analysis does not store page content. It derives signals.
Does the site use SSL? Is there a viewport meta tag for mobile responsiveness? How fast did it load? What technology stack is it built on — WordPress, Shopify, Wix, Squarespace? Is there a contact form? Are there social media links? When was it last modified?
Each signal contributes to a lead score between zero and one hundred. A reachable HTTPS site with a fast load time, mobile responsiveness, a contact form, social links, and a recognized platform scores high. An unreachable domain scores zero.
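
A sketch of that additive scheme. The weights below are illustrative, not the pipeline's actual values.

```ts
interface Signals {
  reachable: boolean;
  https: boolean;
  hasViewportMeta: boolean;
  loadMs: number;
  platform: string | null; // "wordpress", "shopify", ...
  hasContactForm: boolean;
  socialLinks: number;
}

function scoreLead(s: Signals): number {
  if (!s.reachable) return 0;  // unreachable domain scores zero
  let score = 20;              // base score for a live site
  if (s.https) score += 15;
  if (s.hasViewportMeta) score += 15;
  if (s.loadMs < 3000) score += 15;
  if (s.platform) score += 15;
  if (s.hasContactForm) score += 10;
  score += Math.min(s.socialLinks, 2) * 5;
  return Math.min(score, 100);
}
```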
Of 2.34 million businesses with websites, about 1.87 million were successfully crawled. The twelve percent timeout rate is mostly dead domains — valid when mapped in OSM but since expired. The eight percent error rate splits between DNS failures, SSL certificate issues, and servers that reject the bot.
The total crawl time was about three and a half hours. Average rate: roughly 150 pages per minute across eight concurrent browsers.
VIII – One Million Qualified Leads From Free Data
The final stage filters the scored database down to businesses with a score of forty or above, a verified website, and a name (a sketch of the export query follows the results below).
- France: 285,000 qualified leads, average score 47.2
- Germany: 412,000 qualified leads, average score 49.1
- Spain: 198,000 qualified leads, average score 44.8
- Portugal: 127,000 qualified leads, average score 43.6
- Total: over 1,022,000 qualified business leads
Each one has a verified website, a quality score, geographic coordinates, category classification, and contact information where available. French leads are further enriched with official business registry data.
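
The export query itself can be sketched as follows. Column names are illustrative; the thresholds are the ones described above.

```ts
import { Database } from "bun:sqlite";

const db = new Database("./pipeline.db");

// Score of 40+, a verified website, and a name.
const leads = db
  .prepare(`
    SELECT name, website, score, lat, lon, category, phone, country
    FROM businesses
    WHERE score >= 40
      AND website_verified = 1
      AND name IS NOT NULL
    ORDER BY score DESC
  `)
  .all();

await Bun.write("./qualified-leads.json", JSON.stringify(leads));
```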
Over one million qualified leads from freely available data. Generated on a server that costs less than a cup of coffee to run.
If data engineering, scrappy infrastructure, or building products on open data interests you — I have been pushing these patterns to their limits. I offer hands-on technical mentoring at mentoring.oakoliver.com and build production micro-SaaS tooling at vibe.oakoliver.com. Come see what is possible when you combine Bun's performance with creative engineering.
IX – What I Learned Building This
Bun's SQLite is a superpower for data pipelines. Ten million insertions per minute from a single process, with zero database server overhead. You do not need Spark for datasets that fit on a single machine.
PBF parsing is CPU-bound, not I/O-bound. Decompression is the bottleneck. If you need to go faster, look at parallel decompression of PBF blocks, not faster storage.
Crawling ethically matters. A clear user agent. Respect for robots.txt. Per-domain rate limits. No storage of full page content — only derived signals. The distinction between lead research and scraping matters for legal compliance, especially under GDPR, and for keeping your IP off blocklists.
Incremental processing is essential. Every stage is idempotent. Re-running import skips existing businesses. Re-running crawl checks status flags. Re-running export regenerates the output file. A crash at any point costs at most a few minutes of reprocessing.
SQLite WAL mode is non-negotiable. Without it, the database locks during writes and you cannot monitor anything. With it, readers and writers coexist peacefully.
X – The Total Picture
The entire pipeline — download, parse, enrich, crawl, export — processes nearly 11 gigabytes of raw map data into over one million qualified leads in about six hours. On a single four-CPU, eight-gigabyte machine. For roughly thirty-five cents in compute.
The lead database industry charges tens of thousands per year for data largely derived from these same public sources.
The barrier was never access to data. It was always the engineering to process it at scale, enrich it intelligently, and score it for relevance.
That barrier is lower than it has ever been.
What is the largest dataset you have processed with JavaScript — and what would you build if the data was free?
– Antonio