haynajjar


I Built a Free Dataset of 51M+ Companies and Their Tech Stacks (Daily Updates!)

Ever wondered which companies use Shopify? Or how many businesses have adopted Stripe? What about finding every site running on Next.js?

I've spent the last few months crawling the web to answer these questions, and today I'm open-sourcing a sample of the dataset for every technology I track.

What I Built

tech-stack-datasets is a free, open-source collection of company data grouped by the technologies they use. The full database contains 51.3 million companies across 403 technologies, with 500 sample records per technology available as open data.

Think of it as getting a solid sample of the technographic landscape without any barriers.

The data covers 403 different technologies across categories like:

  • E-commerce platforms (Shopify, WooCommerce, Magento)
  • Payment processors (Stripe, PayPal, Adyen)
  • CMS systems (WordPress, Webflow, Contentful)
  • Analytics tools (Google Analytics, Mixpanel, Amplitude)
  • Cloud providers (AWS, Google Cloud, Azure)
  • And hundreds more...

Why This Exists

As a developer who's worked on several SaaS products, I kept running into the same problem: finding potential customers is hard.

If you're building a Shopify app, wouldn't it be helpful to know which companies use Shopify? If you're selling a WordPress plugin, where do you even start looking for leads?

Traditional sales tools charge $99+/month for this data. I thought developers deserved better.

Real-World Use Cases

🎯 For Sales & Marketing

# Each dataset contains 500 sample companies (skip the header row when counting)
$ tail -n +2 companies-using-stripe.csv | wc -l
500

That's 500 verified examples to start your research. The full Stripe dataset (68,072 companies) is available through the pro version.

📊 For Competitive Analysis

Want to know who your competitors' customers are? Cross-reference tech stacks:

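# Companies using both Shopify AND Klaviyo: compare the domain column only,
# because whole rows differ in the technology field and would never match.
# Assumes company names contain no commas (cut does not parse CSV quoting).
comm -12 <(tail -n +2 companies-using-shopify.csv | cut -d, -f2 | sort) \
         <(tail -n +2 companies-using-klaviyo.csv | cut -d, -f2 | sort)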
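If company names in your files contain quoted commas, a real CSV parser is safer than splitting on commas. Here is a minimal Python sketch of the same cross-reference. It assumes the column is called `domain`, as in the sample header shown later in this post; `load_domains` and `shared_domains` are hypothetical helper names, and the rows below are made up for illustration.

```python
import csv
import io
from typing import Set, TextIO


def load_domains(handle: TextIO, column: str = "domain") -> Set[str]:
    """Return the set of lowercased domains found in one CSV dataset."""
    reader = csv.DictReader(handle)
    return {row[column].strip().lower() for row in reader if row.get(column)}


def shared_domains(first: TextIO, second: TextIO) -> Set[str]:
    """Domains that appear in both datasets."""
    return load_domains(first) & load_domains(second)


# Tiny in-memory stand-ins for two dataset files (made-up rows).
shopify_csv = io.StringIO(
    "company_name,domain,country,technology,last_verified\n"
    '"Acme, Inc.",acme.com,United States,Shopify,2026-02-08\n'
    "Globex,globex.com,Canada,Shopify,2026-02-08\n"
)
klaviyo_csv = io.StringIO(
    "company_name,domain,country,technology,last_verified\n"
    '"Acme, Inc.",ACME.com,United States,Klaviyo,2026-02-08\n'
)

print(sorted(shared_domains(shopify_csv, klaviyo_csv)))
```

With real files, pass `open("companies-using-shopify.csv", newline="")` in place of the `io.StringIO` objects.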

🔬 For Market Research

Track adoption trends over time. The data updates daily, so you can monitor:

  • How fast is Next.js growing vs. traditional frameworks?
  • Which e-commerce platform is gaining market share?
  • What's the average tech stack for a successful SaaS company?

🤖 For Data Scientists

Train ML models on real tech adoption patterns:

import pandas as pd

# Load multiple datasets
shopify = pd.read_csv('companies-using-shopify.csv')
stripe = pd.read_csv('companies-using-stripe.csv')

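# Measure how many Shopify sample companies also appear in the Stripe sample
# (on 500-row samples this is not an estimate of the true overlap)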
merged = pd.merge(shopify, stripe, on='domain', how='inner')
print(f"Overlap: {len(merged) / len(shopify) * 100:.1f}%")

What's In The Data?

Each dataset provides 500 sample records from the complete database. Each company record includes:

  • Company name and domain
  • Geographic location (country/state)
  • Technology stack detected
  • Service type (B2B, B2C, etc.)
  • Quality scores for data verification
  • Last verified date

Here's a sample:

company_name,domain,country,technology,last_verified
Acme Corp,acme.com,United States,Shopify,2026-02-08
TechStart,techstart.io,United Kingdom,Next.js,2026-02-08
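To work with these rows as typed objects, something like the following sketch could be a starting point. It only covers the five columns visible in the sample header above; the other fields in the list (service type, quality scores) are left out because their column names aren't shown here. `CompanyRecord` and `read_records` are hypothetical names.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date
from typing import Iterator, TextIO


@dataclass(frozen=True)
class CompanyRecord:
    company_name: str
    domain: str
    country: str
    technology: str
    last_verified: date


def read_records(handle: TextIO) -> Iterator[CompanyRecord]:
    """Yield one CompanyRecord per CSV row, parsing the date column."""
    for row in csv.DictReader(handle):
        yield CompanyRecord(
            company_name=row["company_name"],
            domain=row["domain"].lower(),
            country=row["country"],
            technology=row["technology"],
            # Dates in the sample are ISO formatted (YYYY-MM-DD).
            last_verified=date.fromisoformat(row["last_verified"]),
        )


# The two sample rows from above, as an in-memory file.
sample = io.StringIO(
    "company_name,domain,country,technology,last_verified\n"
    "Acme Corp,acme.com,United States,Shopify,2026-02-08\n"
    "TechStart,techstart.io,United Kingdom,Next.js,2026-02-08\n"
)
records = list(read_records(sample))
print(records[0].domain, records[0].last_verified.year)
```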

Why 500 records? It's enough to:

  • Test your analysis workflows
  • Understand market patterns
  • Build proof-of-concepts
  • Validate your targeting strategy

Need the full dataset? That's what Leadita Pro is for.

How I Built This

The crawler runs on a distributed system that:

  1. Fetches website HTML and JavaScript bundles
  2. Identifies technology fingerprints (CDN URLs, meta tags, script signatures)
  3. Validates findings with multiple detection methods
  4. Stores results in normalized CSV/JSON formats
  5. Re-crawls daily to keep data fresh
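To make steps 2 and 3 more concrete, here is a minimal sketch of fingerprint matching on a page's HTML. This is not the crawler's actual code: the patterns in `FINGERPRINTS`, the `detect_technologies` helper, and the `min_hits` threshold are illustrative assumptions, and a production detector would need far more signals than a handful of regular expressions.

```python
import re
from typing import Dict, List, Pattern

# Illustrative fingerprints only; NOT the project's real detection rules.
FINGERPRINTS: Dict[str, List[Pattern[str]]] = {
    "Shopify": [re.compile(r"cdn\.shopify\.com", re.I)],
    "WordPress": [
        re.compile(r"/wp-content/", re.I),
        re.compile(
            r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']WordPress',
            re.I,
        ),
    ],
    "Stripe": [re.compile(r"js\.stripe\.com", re.I)],
    "Next.js": [
        re.compile(r"/_next/static/", re.I),
        re.compile(r'id=["\']__NEXT_DATA__["\']'),
    ],
}


def detect_technologies(html: str, min_hits: int = 1) -> Dict[str, int]:
    """Return {technology: number of matching fingerprints} for one page.

    Raising min_hits requires agreement between several independent
    signals before a technology is reported.
    """
    results: Dict[str, int] = {}
    for tech, patterns in FINGERPRINTS.items():
        hits = sum(1 for pattern in patterns if pattern.search(html))
        if hits >= min_hits:
            results[tech] = hits
    return results


page = """<html><head>
<meta name="generator" content="WordPress 6.4">
<script src="https://js.stripe.com/v3/"></script>
</head><body><img src="/wp-content/uploads/logo.png"></body></html>"""

print(detect_technologies(page))
print(detect_technologies(page, min_hits=2))
```

Requiring two or more hits trades coverage for precision: technologies with only one fingerprint in the table can never pass the stricter threshold.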

Detection accuracy: ~96% based on manual spot-checks against 1,000 random samples.
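A spot-check on 1,000 samples carries some statistical uncertainty. As a rough sketch, assuming "~96%" means 960 correct detections out of 1,000 randomly drawn samples, a Wilson score interval gives a plausible range for the true accuracy. The `wilson_interval` helper is a hypothetical name, and `z = 1.96` corresponds to a 95% confidence level.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumption: 960 of 1,000 spot-checked detections were correct.
low, high = wilson_interval(960, 1000)
print(f"95% interval: {low:.1%} to {high:.1%}")
```

Under those assumptions the interval runs from about 94.6% to 97.0%, so the headline figure is best read as "mid-90s" rather than an exact number.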

Browse by Technology

Each technology has 500 sample companies available. Here are some popular ones:

JavaScript Frameworks:

Backend Stacks:

SaaS Tools:

Getting Started

Download Individual Datasets

# Clone the repo
git clone https://github.com/leadita/tech-stack-datasets.git

# Navigate to a specific technology
cd tech-stack-datasets/leads/companies-using-shopify

# The data is available in both CSV and JSON
ls
# shopify-companies.csv
# shopify-companies.json

Quick Analysis with jq

# Count companies by country
jq -r '.[].country' companies-using-stripe.json | \
  sort | uniq -c | sort -rn | head -10
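If jq isn't installed, the same count can be sketched in Python with the standard library. This assumes, as the jq filter above does, that the JSON file is an array of objects with a `country` key; `top_countries` is a hypothetical helper and the records below are made up.

```python
import json
from collections import Counter
from typing import List, Tuple


def top_countries(json_text: str, limit: int = 10) -> List[Tuple[str, int]]:
    """Return the most common countries as (country, count) pairs."""
    records = json.loads(json_text)
    # Records with a missing or empty country are grouped under "Unknown".
    counts = Counter(record.get("country") or "Unknown" for record in records)
    return counts.most_common(limit)


# Made-up records standing in for companies-using-stripe.json.
sample_json = json.dumps([
    {"domain": "acme.com", "country": "United States"},
    {"domain": "techstart.io", "country": "United Kingdom"},
    {"domain": "globex.com", "country": "United States"},
    {"domain": "initech.example"},
])

for country, total in top_countries(sample_json):
    print(f"{total:>5}  {country}")
```

For a real file, read it with `Path("companies-using-stripe.json").read_text()` from `pathlib` and pass the result to `top_countries`.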

Load into PostgreSQL

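CREATE TABLE companies (
  name VARCHAR(255),
  domain VARCHAR(255),
  country VARCHAR(100),
  technology VARCHAR(100),
  verified_date DATE,
  -- The same domain can appear in several technology files,
  -- so the key is the (domain, technology) pair
  PRIMARY KEY (domain, technology)
);

-- COPY reads a file on the database server; in psql, \copy reads one
-- from the client machine instead
COPY companies FROM '/path/to/companies-using-shopify.csv'
DELIMITER ',' CSV HEADER;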

-- Query insights
SELECT country, COUNT(*) as total
FROM companies
WHERE technology = 'Shopify'
GROUP BY country
ORDER BY total DESC
LIMIT 10;

What's Free vs. What's Pro

Free (Open Source):

  • ✅ 500 sample records per technology
  • ✅ 403 technologies covered
  • ✅ Daily updated data
  • ✅ CSV & JSON formats
  • ✅ No registration required

Pro Version:

  • 🔒 Full datasets (millions of records)
  • 🔒 Verified email addresses & phone numbers
  • 🔒 API access
  • 🔒 Historical data & trends
  • 🔒 Custom filtering & exports

The sample data is genuinely useful for:

  • Exploratory analysis - Test your hypotheses before committing
  • Proof of concepts - Validate your product-market fit
  • Learning & research - Study tech adoption patterns
  • Portfolio projects - Build data apps with real data

For sales teams needing thousands of leads with contact info, check out Leadita Pro.

Current Stats

In the Open Repo:

  • 200,000+ sample records (500 per tech × 403 technologies)
  • 403 technologies tracked
  • Daily updates for data freshness
  • 100% open source (MIT license)

Total Database:

  • 51.3M+ companies indexed
  • Billions of data points
  • Daily crawls across the web

Roadmap

Some features I'm working on:

  • [ ] Historical data (track tech adoption over time)
  • [ ] API access for programmatic queries
  • [ ] More granular tech detection (framework versions, specific libraries)
  • [ ] Company employee count estimates
  • [ ] Funding/revenue data integration

Contributing

Found a bug? Have suggestions? Want to add new technologies to track?

  • Issues: GitHub Issues
  • Discussions: Share your use cases and ideas
  • PRs: Always welcome!

Try It Yourself

Pick a technology you're interested in and explore who's using it:

  1. Browse the full list of technologies
  2. Download the CSV/JSON for any tech
  3. Run your own analysis
  4. Share what you find!

Links:

If you find this useful, please ⭐ the repo on GitHub. It helps others discover the project!

Building tools that should be free. Follow along for more open data projects.
