
I Mined 47,000 Businesses from OpenStreetMap and Scored the Ones That Need a Website: Building an Automated Lead Gen Pipeline


There are millions of businesses listed on OpenStreetMap across Europe. Most of them have a name, an address, and a phone number. About 60% have a website listed. And roughly a third of those websites are broken, look like they were designed in 2005, or redirect to a dead Facebook page.

That's not a guess. That's data from a pipeline I built.

The pipeline is called osm-vibe-all. It parses OpenStreetMap's raw geographic data files — 13 GB covering four European countries — extracts businesses with web presence metadata, crawls their listed websites with Puppeteer, scores the quality of what it finds, and outputs a prioritized list of businesses that most need professional web development.

This is lead generation for a web consultancy built entirely on open-source data and open-source tools. No API subscriptions. No purchased lists. No cold email vendors. Just geography, code, and patience.


I – Why OpenStreetMap Is the Best Lead Source Nobody Uses

OpenStreetMap is the Wikipedia of maps. It's a community-maintained dataset of geographic information covering the entire planet. Unlike Google Maps, the data is free to download, free to process, and free to use commercially under the Open Database License.

Most developers know OSM as the thing behind Mapbox tiles and open-source navigation apps. Almost nobody thinks of it as a business intelligence database. But that's exactly what it is.

Every restaurant, dentist's office, car repair shop, and law firm that a community member has mapped includes structured metadata — name, address, phone, website, email, opening hours, business category. The coverage is remarkably dense in European countries where the OSM community is active. Germany, France, the Netherlands, and Portugal — the four countries in my pipeline — have millions of mapped business entities.

The key insight is that this data includes website URLs. Not just existence-or-absence information, but the actual URLs that businesses have listed. This means you can download a country's geographic data, extract every business with a web presence, and then evaluate that web presence programmatically.

Businesses without websites are leads. Businesses with bad websites are better leads. And the data to identify both is sitting on a public server, updated daily, free to download.


II – Parsing 800 Million Nodes Without Running Out of Memory

The raw OSM data comes in PBF format — Protocolbuffer Binary Format. These are compressed binary files containing three primitive types: nodes (points with coordinates and tags), ways (ordered sequences of nodes forming lines or polygons), and relations (groups of nodes and ways with semantic meaning).
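
In code, those three primitives boil down to a handful of shapes. Here's an illustrative TypeScript sketch of the decoded entities the rest of the pipeline consumes; the field names are mine, not the exact output of any particular PBF library.

```typescript
// Illustrative shapes for decoded OSM entities. Field names are approximate;
// the exact output depends on which PBF parsing library you use.
interface OsmNode {
  type: "node";
  id: number;
  lat: number;
  lon: number;
  tags: Record<string, string>; // e.g. { amenity: "restaurant", name: "..." }
}

interface OsmWay {
  type: "way";
  id: number;
  refs: number[]; // ordered node IDs forming a line or polygon
  tags: Record<string, string>;
}

interface OsmRelation {
  type: "relation";
  id: number;
  members: { type: "node" | "way" | "relation"; ref: number; role: string }[];
  tags: Record<string, string>;
}

type OsmEntity = OsmNode | OsmWay | OsmRelation;
```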

For lead generation, the relevant entities are nodes and ways tagged with business categories — amenity=restaurant, shop=clothes, office=insurance, healthcare=dentist, and dozens more.

The challenge is scale. A single country extract can contain hundreds of millions of nodes. The four-country dataset I process has over 800 million. You cannot load this into memory. You cannot even load it into a database without first filtering it down.

The solution is async generators. The PBF parser reads the binary file as a stream, yielding chunks of decoded entities. An async generator function receives those chunks, examines each entity's tags, and yields only the ones that represent named businesses. The generator pattern means memory usage is constant regardless of file size. You process one entity at a time, yielding matches and discarding everything else.
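
Here's a minimal sketch of that pattern. The decodedEntities iterable stands in for whatever the PBF parser actually yields, and isNamedBusiness is the filter described next.

```typescript
// Minimal shape of the streaming filter. `decodedEntities` stands in for
// whatever async iterable the PBF parser exposes; memory stays constant
// because only one entity is in flight at a time.
type OsmEntity = { type: "node" | "way"; id: number; tags: Record<string, string> };

async function* extractBusinesses(
  decodedEntities: AsyncIterable<OsmEntity>,
  isNamedBusiness: (entity: OsmEntity) => boolean,
): AsyncGenerator<OsmEntity> {
  for await (const entity of decodedEntities) {
    if (isNamedBusiness(entity)) {
      yield entity; // keep the match
    }
    // everything else is discarded immediately and garbage-collected
  }
}

// Downstream stages consume the generator the same way, so the whole
// extraction stays streaming end to end:
// for await (const business of extractBusinesses(parserStream, isNamedBusiness)) { ... }
```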

The filtering is aggressive and intentional. An entity must have a business-indicating tag — amenity, shop, office, craft, tourism, leisure, or healthcare. It must not be infrastructure masquerading as a business — benches, waste baskets, parking lots, bicycle racks, post boxes, fire hydrants. And critically, it must have a name. An unnamed entity is useless for outreach.
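
Expressed as a predicate, the filter looks roughly like this. The tag lists here are illustrative and shorter than the real configuration.

```typescript
// Illustrative filter predicate; the pipeline's real tag lists are longer.
const BUSINESS_KEYS = ["amenity", "shop", "office", "craft", "tourism", "leisure", "healthcare"];

// Values that carry business-like keys but are really infrastructure.
const INFRASTRUCTURE_VALUES = new Set([
  "bench", "waste_basket", "parking", "bicycle_parking", "post_box", "fire_hydrant",
]);

function isNamedBusiness(entity: { tags: Record<string, string> }): boolean {
  const { tags } = entity;
  if (!tags.name) return false; // unnamed entities are useless for outreach

  const businessKey = BUSINESS_KEYS.find((key) => key in tags);
  if (!businessKey) return false; // no business-indicating tag at all

  return !INFRASTRUCTURE_VALUES.has(tags[businessKey]);
}
```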

From 800 million nodes across four countries, this filter typically produces 45,000 to 60,000 named business entities per pipeline run. That's a compression ratio of roughly 15,000 to 1. The generator pattern makes this practical even on a modest machine with 4 GB of RAM.

The parser also extracts every piece of contact metadata available: the website URL (checking three common tag variations), phone number, email, and structured address components. The more contact data a business has, the more valuable it is as a lead — because reachability directly affects conversion.
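
A sketch of that extraction, using common OSM tagging conventions (website, contact:website, and url for the site, plus the contact: and addr: variants for the rest); the exact set of tags the pipeline checks may differ slightly.

```typescript
interface ContactInfo {
  website?: string;
  phone?: string;
  email?: string;
  street?: string;
  housenumber?: string;
  city?: string;
  postcode?: string;
}

// Pull contact metadata out of an entity's tags, checking the common OSM
// variants for each field. The tag names follow widespread OSM conventions;
// the pipeline's actual list may be longer.
function extractContact(tags: Record<string, string>): ContactInfo {
  const firstOf = (...keys: string[]) =>
    keys.map((k) => tags[k]).find((v) => v !== undefined);

  return {
    website: firstOf("website", "contact:website", "url"),
    phone: firstOf("phone", "contact:phone"),
    email: firstOf("email", "contact:email"),
    street: tags["addr:street"],
    housenumber: tags["addr:housenumber"],
    city: tags["addr:city"],
    postcode: tags["addr:postcode"],
  };
}
```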


III – Categorization and Priority Scoring: Not All Businesses Are Equal

Before any website gets crawled, every extracted business receives a category and an initial priority score. This is where the pipeline becomes opinionated, and intentionally so.

A dentist's office is a more valuable web development client than a corner café. Not because cafés don't need websites — they do — but because a dentist's patient lifetime value is higher, their budget for professional services is larger, and their competitive landscape rewards good web presence more heavily. A lawyer, an accountant, a real estate agency — these are businesses where the website often is the first impression, and a bad first impression costs real revenue.

The categorization maps OSM tag values to business categories and assigns a base priority from one to ten. Healthcare professionals score nine. Professional services — lawyers, accountants, insurance, real estate — score nine. Automotive businesses score eight. Restaurants and cafés score seven or eight. Generic retail scores lower.

The priority score is then adjusted by contact metadata. A business without a website gets a bonus — they're a primary target for "you need a website" outreach. A business with a phone number gets a boost — they're reachable via a call if email fails. A business with a confirmed street address gets a boost — they're a verified physical location, not a digital-only listing that might be stale.

The result is a priority score from 0 to 100 for every business, computed before any website is crawled. This pre-crawl scoring serves two purposes: it determines which businesses are worth the crawling investment (no point crawling a bench's non-existent website), and it provides a business-value component for the final lead score.
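
Sketched in code, the pre-crawl scoring might look like this. The category bases mirror the numbers above, but the bonus values and the scaling from the 1-to-10 base onto the 0-to-100 range are illustrative rather than the pipeline's literal constants.

```typescript
// Illustrative pre-crawl priority scoring. Base priorities follow the
// categories described above; bonuses and scaling are assumptions.
const BASE_PRIORITY: Record<string, number> = {
  healthcare: 9,
  professional_services: 9, // lawyers, accountants, insurance, real estate
  automotive: 8,
  restaurant: 8,
  cafe: 7,
  retail: 5,
};

interface ExtractedBusiness {
  category: string;
  website?: string;
  phone?: string;
  hasAddress: boolean;
}

function priorityScore(business: ExtractedBusiness): number {
  // Scale the 1-10 category base onto the 0-100 range.
  let score = (BASE_PRIORITY[business.category] ?? 4) * 10;

  if (!business.website) score += 15; // primary "you need a website" target
  if (business.phone) score += 10;    // reachable by phone if email fails
  if (business.hasAddress) score += 5; // verified physical location

  return Math.min(100, score);
}
```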


IV – Polite Crawling: Five Pages at a Time, Two Seconds Between Batches

For every business that has a listed website, the pipeline launches Puppeteer and evaluates the site's quality. This is where data engineering meets web scraping, and where ethics matter.

The crawling is deliberately polite. Five concurrent browser pages maximum. A two-second delay between batches. A custom user agent that identifies the bot by name and provides a URL where site owners can learn about it. A fifteen-second timeout per page — generous enough for slow servers, strict enough to prevent the pipeline from hanging on unresponsive hosts.
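
Those politeness settings translate into a small amount of code. Here's a sketch with Puppeteer, where the user agent string and the result shape are illustrative.

```typescript
import puppeteer, { Browser } from "puppeteer";

const CONCURRENCY = 5;          // max simultaneous pages
const BATCH_DELAY_MS = 2_000;   // pause between batches
const PAGE_TIMEOUT_MS = 15_000; // give up on unresponsive hosts

// Illustrative bot identity; point the URL at a page that explains the crawl.
const USER_AGENT = "osm-vibe-bot/1.0 (+https://example.com/bot)";

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

interface CrawlResult {
  url: string;
  status: number;
  ok: boolean;
  error?: string;
}

async function crawlOne(browser: Browser, url: string): Promise<CrawlResult> {
  const page = await browser.newPage();
  try {
    await page.setUserAgent(USER_AGENT);
    const response = await page.goto(url, {
      waitUntil: "domcontentloaded",
      timeout: PAGE_TIMEOUT_MS,
    });
    return { url, status: response?.status() ?? 0, ok: response?.ok() ?? false };
  } catch (err) {
    // DNS failures, timeouts, and TLS errors all land here.
    return { url, status: 0, ok: false, error: String(err) };
  } finally {
    await page.close();
  }
}

async function crawlAll(urls: string[]): Promise<CrawlResult[]> {
  const browser = await puppeteer.launch({ headless: true });
  const results: CrawlResult[] = [];
  for (let i = 0; i < urls.length; i += CONCURRENCY) {
    const batch = urls.slice(i, i + CONCURRENCY);
    results.push(...(await Promise.all(batch.map((u) => crawlOne(browser, u)))));
    if (i + CONCURRENCY < urls.length) await sleep(BATCH_DELAY_MS); // politeness delay
  }
  await browser.close();
  return results;
}
```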

Each crawl produces a structured result with multiple quality signals.

HTTP status. Did the page load at all? A 404 or 500 response tells you the website is actively broken. A DNS resolution failure tells you the domain is dead. Either way, the business thinks they have a web presence and they're wrong — which makes them an excellent lead.

SSL status. Does the site load over HTTPS? In 2026, browsers mark HTTP-only sites as "Not Secure." That warning is devastating for trust, especially for healthcare providers and financial services. Fixing SSL is a quick win for any developer, and pointing it out in outreach demonstrates immediate, tangible value.

Mobile responsiveness. The crawler checks for the viewport meta tag and then actually resizes the browser to mobile dimensions to see if the layout adapts. Over 60% of local business searches happen on mobile. A non-responsive site is actively losing customers every day.

Technology detection. The crawler inspects the DOM for signatures of common frameworks and platforms — React, Vue, Next.js, WordPress, Squarespace, Wix, jQuery, Bootstrap. A business running a well-maintained WordPress site with current plugins doesn't need your help. A business running a jQuery site from 2014 with broken images probably does.

Content freshness. The crawler scans for copyright year in the page footer. A copyright notice reading 2019 or 2020 is a strong signal that the site hasn't been touched in years. This single data point correlates more strongly with "needs a new website" than almost any other signal.

Broken assets. The crawler counts images that fail to load, checks for a horizontal scrollbar (a sign of layout breakage), and looks for deprecated technologies like Flash or plugin embeds.
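
Taken together, these checks are mostly DOM queries inside the loaded page. Here's a sketch of a representative subset in a Puppeteer page context; it is not the full detector.

```typescript
import type { Page } from "puppeteer";

interface QualitySignals {
  https: boolean;
  hasViewportMeta: boolean;
  overflowsOnMobile: boolean;
  brokenImages: number;
  copyrightYear: number | null;
  looksLikeWordPress: boolean;
}

// Sketch of a post-load quality check; assumes the page was already loaded
// by the crawler. Only a few representative signals are shown.
async function evaluateQuality(page: Page): Promise<QualitySignals> {
  const https = page.url().startsWith("https://");

  // Resize to a phone-sized viewport before checking for layout overflow.
  await page.setViewport({ width: 375, height: 812 });

  const domSignals = await page.evaluate(() => {
    const hasViewportMeta = !!document.querySelector('meta[name="viewport"]');
    const overflowsOnMobile =
      document.documentElement.scrollWidth > window.innerWidth;
    const brokenImages = Array.from(document.images).filter(
      (img) => img.complete && img.naturalWidth === 0,
    ).length;
    // Last 4-digit year next to a copyright mark, if any.
    const match = document.body.innerText.match(/(?:©|copyright)\s*(\d{4})/i);
    const copyrightYear = match ? Number(match[1]) : null;
    // Crude technology fingerprint; real detection checks many more signatures.
    const looksLikeWordPress =
      !!document.querySelector('meta[name="generator"][content*="WordPress" i]') ||
      !!document.querySelector('link[href*="wp-content"]');
    return { hasViewportMeta, overflowsOnMobile, brokenImages, copyrightYear, looksLikeWordPress };
  });

  return { https, ...domSignals };
}
```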

Every crawl result is structured data. No human judgment is needed until the final lead list is reviewed. The pipeline quantifies web presence quality the way a code linter quantifies code quality — by checking observable, measurable properties.


V – The Three-Factor Lead Score: Balancing Need, Value, and Reachability

After parsing and crawling, every business receives a final lead score that balances three factors. This scoring formula is the heart of the entire pipeline.

Factor one: website need (50% weight). This is inversely proportional to website quality. A business with no website scores maximum need. A business with a dead domain scores nearly as high. A business with a non-responsive, non-SSL site with broken images scores high. A business with a polished, modern, mobile-responsive site scores low — they don't need you. The lead score rewards businesses with the worst web presence, because those are the businesses with the most to gain from your services.

Factor two: business value (30% weight). This comes from the pre-crawl priority scoring based on business category and metadata completeness. Dentists, lawyers, and real estate agents carry higher business value than generic retail, because the revenue potential of a web development engagement is higher.

Factor three: reachability (20% weight). Having an email address contributes the most to reachability. A phone number is the next best signal. A confirmed physical address adds a smaller boost. A listed website URL provides a minimal boost (you can at least find a contact form). A perfect lead with no way to reach them isn't a lead — it's data.

The formula produces a score from 0 to 100. Leads scoring 60 or above are exported. In a typical pipeline run, this produces 3,000 to 8,000 actionable leads from the initial 45,000+ businesses. The pipeline reduces noise by roughly 85%, delivering a focused list where every entry has a quantifiable reason for being there.
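
In code, the formula is a weighted sum. The component scores below are assumed to already be normalized to 0-100 by their own calculations; only the weighting and the export threshold are shown.

```typescript
// Weighted three-factor lead score. Each component is assumed to be
// normalized to 0-100 elsewhere; only the weights and threshold live here.
const WEIGHTS = { need: 0.5, value: 0.3, reachability: 0.2 };
const EXPORT_THRESHOLD = 60;

interface ScoreInputs {
  websiteNeed: number;   // 100 = no website or dead domain, 0 = polished modern site
  businessValue: number; // pre-crawl priority score (category + metadata)
  reachability: number;  // email > phone > address > website contact form
}

function leadScore({ websiteNeed, businessValue, reachability }: ScoreInputs): number {
  return Math.round(
    websiteNeed * WEIGHTS.need +
      businessValue * WEIGHTS.value +
      reachability * WEIGHTS.reachability,
  );
}

const shouldExport = (inputs: ScoreInputs) => leadScore(inputs) >= EXPORT_THRESHOLD;
```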

Each lead comes with human-readable reasons for its score: "No website detected," "Not mobile responsive," "Copyright year is 2019," "Missing SSL certificate," "Email available for outreach." These reasons aren't just diagnostic — they're the opening lines of a personalized outreach message.


VI – The Orchestrator: Four Stages, Under an Hour

The pipeline orchestrator ties the four stages — parsing, crawling, scoring, exporting — into a single executable run.

Stage one: PBF parsing. The orchestrator streams the PBF file through the parser and categorizer, collecting all businesses that meet the minimum priority threshold. For a single country extract of about 3 GB, this takes roughly three minutes. For the full four-country dataset at 13 GB, it takes about twelve.

Stage two: website crawling. The orchestrator filters for businesses with listed websites, validates the URLs, and passes them to the crawler in polite batches. This is the longest stage by far — crawling 5,000 sites at five concurrent pages with two-second delays takes approximately 45 minutes. The orchestrator logs progress every 50 sites.

Stage three: scoring. Every business — with or without a crawl result — receives its three-factor lead score. Businesses without websites get maximum website-need scores automatically. Businesses with crawl results get scored based on the quality signals. This stage is near-instant since it's pure computation in memory.

Stage four: export. Top leads are exported to CSV for immediate use and to JSON for archival. The orchestrator also upserts leads into a database for the outreach tool, so that re-running the pipeline updates existing lead scores rather than creating duplicates. This means you can re-run the pipeline monthly and track which businesses have improved their web presence — and which haven't.
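
Put together, the orchestrator is little more than sequential composition. This sketch shows the shape of a run with the stage implementations injected; every name here is a placeholder for the components described above.

```typescript
// High-level shape of a pipeline run. Stage implementations are injected,
// so this sketch only shows how the four stages compose.
interface Business { name: string; website?: string; }
interface ScoredLead extends Business { score: number; reasons: string[]; }

interface Stages {
  parseBusinesses(pbfPath: string): Promise<Business[]>;               // stage 1: stream the PBF
  crawlWebsites(urls: string[]): Promise<unknown[]>;                    // stage 2: polite Puppeteer crawl
  scoreLeads(businesses: Business[], crawls: unknown[]): ScoredLead[];  // stage 3: in-memory scoring
  exportLeads(leads: ScoredLead[]): Promise<void>;                      // stage 4a: CSV + JSON export
  upsertLeads(leads: ScoredLead[]): Promise<void>;                      // stage 4b: idempotent database upsert
}

async function runPipeline(pbfPath: string, stages: Stages): Promise<ScoredLead[]> {
  const businesses = await stages.parseBusinesses(pbfPath);

  // Only businesses that actually list a website get crawled.
  const urls = businesses.flatMap((b) => (b.website ? [b.website] : []));
  const crawlResults = await stages.crawlWebsites(urls);

  // Every business is scored, crawled or not.
  const scored = stages.scoreLeads(businesses, crawlResults);

  // Export and upsert only the leads above the threshold; re-runs update
  // existing rows instead of creating duplicates.
  const topLeads = scored.filter((lead) => lead.score >= 60);
  await stages.exportLeads(topLeads);
  await stages.upsertLeads(topLeads);

  return topLeads;
}
```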

The pipeline prints a summary at the end: total businesses parsed, websites crawled, dead websites found, non-responsive sites, missing SSL, and the top ten leads with their scores and reasons. A typical four-country run completes in under an hour.


VII – What the Data Actually Shows

After running the pipeline across four European countries, consistent patterns emerge.

About 31% of listed websites are broken. They return 404, 500, or the domain doesn't resolve at all. These are businesses that believe they have a web presence but actually don't. They're still printing the URL on business cards. They're still listing it on Google Business Profile. And every person who clicks that link bounces to an error page. These are the highest-confidence leads in the dataset because the problem is objectively verifiable.

44% of working websites are not mobile responsive. They load, they have content, but they render as tiny desktop layouts on a phone screen. In a world where more than half of local searches happen on mobile devices, this is a critical failure that the business owner likely doesn't know about — because they only check their site on a desktop.

22% lack SSL certificates. Chrome, Firefox, and Safari all mark these sites with a "Not Secure" warning. For a dentist's office or a law firm, that warning is devastating. It tells potential patients and clients that this business can't be trusted with their information. Fixing SSL is often a one-afternoon project that delivers immediate, visible improvement — making it the perfect foot-in-the-door service.

The strongest leads are professional services without any web presence. Dentists, lawyers, accountants, and real estate agents who appear in OSM with phone numbers and addresses but no website. These businesses have high customer lifetime value, established revenue, and the budget for professional web development. They just haven't prioritized it. A single well-crafted outreach email explaining what their competitors' websites look like can change that.


Want to Build Data Pipelines That Generate Revenue?

Whether you're building lead generation systems, processing geographic data, or automating web quality analysis, the architectural patterns are transferable. Streaming parsers for large datasets. Polite crawlers that respect rate limits. Multi-factor scoring systems that balance competing signals.

Book a session at mentoring.oakoliver.com and let's design your data pipeline together — from data source to actionable output.

Or explore the kind of AI-powered micro-apps this lead data feeds into at vibe.oakoliver.com.


VIII – Ethical Data Collection: Why This Isn't Scraping

I want to be explicit about the ethics of this pipeline, because "scraping businesses from a map database" can sound predatory if you don't understand the data source.

The data is open and explicitly licensed for commercial use. OpenStreetMap data is published under the Open Database License, which permits downloading, processing, and using the data for any purpose — including commercial lead generation — as long as you attribute the source and share any improvements to the data itself (not your derivative works). The pipeline reads the data. It doesn't modify or republish it.

The crawling is polite and transparent. The bot identifies itself with a descriptive user agent string and a URL where site owners can learn about the crawl. Five concurrent connections with delays between batches means the pipeline never overwhelms any individual server. The fifteen-second timeout means the pipeline gives up gracefully on slow hosts rather than retrying aggressively.

The outreach is honest and specific. Each lead comes with concrete, verifiable reasons for being contacted: "Your website doesn't load," "Your site isn't mobile responsive," "Your SSL certificate is missing." The outreach references their specific situation, not a generic pitch. This is the opposite of spam — it's personalized, relevant, and backed by data they can verify themselves.

This pipeline replaces purchased lead lists. Instead of buying scraped data of dubious origin from a vendor who may have obtained it unethically, you generate leads from the world's largest open geographic database. The provenance is clear. The license is explicit. The methodology is transparent.


IX – What I'd Build Next

The current pipeline is batch-oriented — run it, get a CSV, do outreach. It works. But there's natural evolution.

Scheduled re-crawls would track which businesses have updated their websites since the last run and which haven't. A business that fixed their SSL since last month drops in priority. A business whose site went from working to dead rises. Longitudinal data makes the scoring more accurate over time.

Automated outreach personalization would generate email copy referencing each lead's specific issues. "Hi Dr. Mueller, I noticed your website at mueller-dental.de returns a 404 error and isn't appearing in mobile search results." The pipeline already produces the data. Generating the message is a natural extension.

Geographic clustering would identify neighborhoods or postal codes with high concentrations of poor web presence. If twelve businesses on the same street all have broken websites, that's a community-level opportunity — potentially enough work to justify a local marketing push.

Competitor benchmarking would, for each lead, find similar businesses in the same area with good websites and include them as examples. "Three other dental practices within 5 km have modern, mobile-responsive websites with online booking. Here's what they look like."

But the current pipeline already produces more qualified leads than I can follow up on. The bottleneck isn't data. It's time. Which is, ultimately, the sign of a pipeline that works.


X – The Map Is Not the Territory, But It's Close Enough

OpenStreetMap is the most underutilized business intelligence resource available to developers. Millions of businesses, structured metadata, geographic coordinates, contact information — all free, all open, all waiting to be processed by anyone with the skills to build a parser and a crawler.

The pipeline described in this article is conceptually simple. Parse structured data. Filter for relevance. Crawl for quality signals. Score on multiple factors. Export the results. Each stage is straightforward. The value isn't in any single stage — it's in the composition of all four into a repeatable, automated workflow.

The businesses on that map didn't list themselves. Community members mapped them, tagged them, and maintained their data over years. The pipeline honors that effort by using the data for something productive — connecting businesses that need better web presence with professionals who can provide it.

Eight hundred million nodes. Forty-seven thousand businesses. Three thousand leads. One pipeline.

What would you build if you had access to the geographic and business metadata of an entire continent?

– Antonio

"Simplicity is the ultimate sophistication."