haynajjar

Posted on Feb 10

I Built a Free Dataset of 51M+ Companies and Their Tech Stacks (Daily Updates!)

#showdev #opensource #data #api

Ever wondered which companies use Shopify? Or how many businesses have adopted Stripe? What about finding every site running on Next.js?

I've spent the last few months crawling the web to answer these questions, and today I'm open-sourcing the entire dataset.

What I Built

tech-stack-datasets is a free, open-source collection of company data grouped by the technologies they use. The full database contains 51.3 million companies across 403 technologies, with 500 sample records per technology available as open data.

Think of it as getting a solid sample of the technographic landscape without any barriers.

Those 403 technologies span categories like:

E-commerce platforms (Shopify, WooCommerce, Magento)
Payment processors (Stripe, PayPal, Adyen)
CMS systems (WordPress, Webflow, Contentful)
Analytics tools (Google Analytics, Mixpanel, Amplitude)
Cloud providers (AWS, Google Cloud, Azure)
And hundreds more...

Why This Exists

As a developer who's worked on several SaaS products, I kept running into the same problem: finding potential customers is hard.

If you're building a Shopify app, wouldn't it be helpful to know which companies use Shopify? If you're selling a WordPress plugin, where do you even start looking for leads?

Traditional sales tools charge $99+/month for this data. I thought developers deserved better.

Real-World Use Cases

🎯 For Sales & Marketing

# Each dataset contains 500 sample companies (skip the CSV header row when counting)
$ tail -n +2 companies-using-stripe.csv | wc -l
500

That's 500 verified examples to start your research. The full Stripe dataset (68,072 companies) is available through the Pro version.

📊 For Competitive Analysis

Want to know who your competitors' customers are? Cross-reference tech stacks:

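# Companies using both Shopify AND Klaviyo (compare on the domain column, field 2)
# Note: cut splits on every comma, so this assumes no quoted commas before the domain
comm -12 <(tail -n +2 companies-using-shopify.csv | cut -d, -f2 | sort) \
         <(tail -n +2 companies-using-klaviyo.csv | cut -d, -f2 | sort)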

🔬 For Market Research

Track adoption trends over time. The data updates daily, so you can monitor:

How fast is Next.js growing vs. traditional frameworks?
Which e-commerce platform is gaining market share?
What's the average tech stack for a successful SaaS company?
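
One way to start on these questions is to compare two snapshots of the same file, for example from two different commits of the repo. This is a minimal sketch, not the project's own tooling: the `domain` column name comes from the sample shown in this post, while `snapshot_diff` and the two inline snapshots are made up for illustration. Keep in mind that the open files are 500-record samples, so a diff may reflect changes in the sample rather than in the market.

```python
import csv
import io


def domains_in(csv_text: str) -> set[str]:
    """Return the set of domains in one CSV snapshot (expects a 'domain' column)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["domain"].strip().lower() for row in reader if row.get("domain")}


def snapshot_diff(old_csv: str, new_csv: str) -> dict[str, set[str]]:
    """Compare two snapshots and report which domains were added, dropped, or kept."""
    old, new = domains_in(old_csv), domains_in(new_csv)
    return {"added": new - old, "dropped": old - new, "kept": old & new}


# Hypothetical snapshots standing in for the same file on two different days.
yesterday = "company_name,domain\nAcme Corp,acme.com\nTechStart,techstart.io\n"
today = "company_name,domain\nAcme Corp,acme.com\nNewCo,newco.dev\n"

diff = snapshot_diff(yesterday, today)
print("added:", sorted(diff["added"]))
print("dropped:", sorted(diff["dropped"]))
```

Counting the `added` and `dropped` sets per day gives a rough churn signal you can chart over time.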

🤖 For Data Scientists

Train ML models on real tech adoption patterns:

import pandas as pd

# Load multiple datasets
shopify = pd.read_csv('companies-using-shopify.csv')
stripe = pd.read_csv('companies-using-stripe.csv')

# Measure overlap: share of Shopify sample domains that also appear in the Stripe sample
merged = pd.merge(shopify, stripe, on='domain', how='inner')
print(f"Overlap: {len(merged) / len(shopify) * 100:.1f}%")

What's In The Data?

Each dataset provides 500 sample records from the complete database. Each company record includes:

Company name and domain
Geographic location (country/state)
Technology stack detected
Service type (B2B, B2C, etc.)
Quality scores for data verification
Last verified date

Here's a simplified sample:

company_name,domain,country,technology,last_verified
Acme Corp,acme.com,United States,Shopify,2026-02-08
TechStart,techstart.io,United Kingdom,Next.js,2026-02-08
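
If you'd rather work with typed records than raw strings, something like the following could work. It's a sketch that assumes the five columns in the sample above; the `Company` class and `parse_companies` helper are illustrative names, not part of the repo, and real files may carry more columns than shown here.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Company:
    company_name: str
    domain: str
    country: str
    technology: str
    last_verified: date


def parse_companies(csv_text: str) -> list[Company]:
    """Parse CSV text with the sample's header into typed Company records."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        Company(
            company_name=r["company_name"],
            domain=r["domain"].lower(),
            country=r["country"],
            technology=r["technology"],
            # Dates in the sample are ISO formatted (YYYY-MM-DD).
            last_verified=date.fromisoformat(r["last_verified"]),
        )
        for r in rows
    ]


SAMPLE = """company_name,domain,country,technology,last_verified
Acme Corp,acme.com,United States,Shopify,2026-02-08
TechStart,techstart.io,United Kingdom,Next.js,2026-02-08
"""

companies = parse_companies(SAMPLE)
print(companies[0].domain, companies[0].last_verified.year)
```

Parsing the date up front makes it easy to filter out records that haven't been verified recently.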

Why 500 records? It's enough to:

Test your analysis workflows
Understand market patterns
Build proof-of-concepts
Validate your targeting strategy

Need the full dataset? That's what Leadita Pro is for.

How I Built This

The crawler runs on a distributed system that:

Fetches website HTML and JavaScript bundles
Identifies technology fingerprints (CDN URLs, meta tags, script signatures)
Validates findings with multiple detection methods
Stores results in normalized CSV/JSON formats
Re-crawls daily to keep data fresh
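
To make the fingerprinting step more concrete, here is a toy version of the idea. It is only a sketch: the patterns below are a handful of publicly known markers I picked for illustration, not the crawler's actual rules, and `detect_technologies` is a hypothetical helper. The `min_hits` parameter loosely mirrors the idea of requiring more than one detection method to agree.

```python
import re

# Illustrative fingerprints only. These patterns are assumptions for the sketch,
# not the rules used by the real crawler.
FINGERPRINTS = {
    "Shopify": [re.compile(r"cdn\.shopify\.com", re.I)],
    "WordPress": [
        re.compile(r"/wp-content/", re.I),
        re.compile(r"""<meta[^>]+name=["']generator["'][^>]+content=["']WordPress""", re.I),
    ],
    "Next.js": [
        re.compile(r"/_next/static/", re.I),
        re.compile(r"""id=["']__NEXT_DATA__["']""", re.I),
    ],
    "Stripe": [re.compile(r"js\.stripe\.com", re.I)],
}


def detect_technologies(html: str, min_hits: int = 1) -> dict[str, int]:
    """Map each detected technology to the number of its patterns that matched."""
    hits = {}
    for tech, patterns in FINGERPRINTS.items():
        count = sum(1 for pattern in patterns if pattern.search(html))
        if count >= min_hits:
            hits[tech] = count
    return hits


html = """<html><head>
<meta name="generator" content="WordPress 6.4">
<script src="https://js.stripe.com/v3/"></script>
</head><body><img src="/wp-content/uploads/logo.png"></body></html>"""

print(detect_technologies(html))              # {'WordPress': 2, 'Stripe': 1}
print(detect_technologies(html, min_hits=2))  # {'WordPress': 2}
```

Raising `min_hits` trades recall for precision: fewer technologies are reported, but each one is backed by more than one signal.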

Detection accuracy: ~96% based on manual spot-checks against 1,000 random samples.
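
For a feel of how much uncertainty a 1,000-record spot-check carries, you can put a rough confidence interval around that figure. This sketch assumes the checks were a simple random sample and uses the normal approximation for a proportion; `proportion_ci` is an illustrative helper, not something from the repo.

```python
import math


def proportion_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# 96% observed accuracy on 1,000 checked records, 95% confidence (z = 1.96).
low, high = proportion_ci(0.96, 1000)
print(f"{low:.3f} to {high:.3f}")  # 0.948 to 0.972
```

Under those assumptions, the spot-check is consistent with a true accuracy of roughly 95% to 97%.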

Browse by Technology

Each technology has 500 sample companies available. Here are some popular ones:

JavaScript Frameworks:

Companies using React (500 samples from 12,411 total)
Companies using Next.js (500 samples from 340,205 total)
Companies using Vue.js (500 samples)

Backend Stacks:

Companies using Laravel (500 samples from 5,482 total)
Companies using Django (500 samples from 597 total)
Companies using Ruby on Rails (500 samples from 8,222 total)

SaaS Tools:

Companies using HubSpot (500 samples from 86,132 total)
Companies using Intercom (500 samples from 21,293 total)
Companies using Segment (500 samples from 8,573 total)

Getting Started

Download Individual Datasets

# Clone the repo
git clone https://github.com/leadita/tech-stack-datasets.git

# Navigate to a specific technology
cd tech-stack-datasets/leads/companies-using-shopify

# The data is available in both CSV and JSON
ls
# shopify-companies.csv
# shopify-companies.json
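
Once the repo is cloned, you may want every sample in one place. Here is a minimal sketch that walks the `leads/<dataset>/` layout shown above and reads each CSV with the standard library. `load_all_samples` and the `source_dir` field are names I made up, and the demo builds a throwaway directory that mimics the layout so it runs without a clone.

```python
import csv
import tempfile
from pathlib import Path


def load_all_samples(repo_root):
    """Read every CSV under <repo_root>/leads/<dataset-dir>/ into a list of dicts."""
    records = []
    for csv_path in sorted(Path(repo_root).glob("leads/*/*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh):
                row["source_dir"] = csv_path.parent.name  # which dataset the row came from
                records.append(row)
    return records


# Demo against a temporary directory that imitates the repo layout.
with tempfile.TemporaryDirectory() as root:
    for tech, domain in [("shopify", "acme.com"), ("stripe", "techstart.io")]:
        folder = Path(root) / "leads" / f"companies-using-{tech}"
        folder.mkdir(parents=True)
        (folder / f"{tech}-companies.csv").write_text(
            f"company_name,domain\nExample,{domain}\n", encoding="utf-8"
        )
    records = load_all_samples(root)

print(len(records), sorted({r["source_dir"] for r in records}))
```

With a real clone, `load_all_samples("tech-stack-datasets")` would be the call to try.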

Quick Analysis with jq

# Count companies by country
jq -r '.[].country' companies-using-stripe.json | \
  sort | uniq -c | sort -rn | head -10
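
If jq isn't installed, the same count is a few lines of Python. This sketch assumes, like the jq command, that the JSON file is a top-level array of objects with a `country` key; `top_countries` and the inline records are illustrative.

```python
import json
from collections import Counter


def top_countries(json_text: str, n: int = 10) -> list[tuple[str, int]]:
    """Count records per country and return the n most common."""
    records = json.loads(json_text)
    # Group missing or null countries under "Unknown" instead of dropping them.
    counts = Counter(r.get("country") or "Unknown" for r in records)
    return counts.most_common(n)


# Hypothetical records standing in for the contents of a dataset's JSON file.
sample = json.dumps([
    {"domain": "acme.com", "country": "United States"},
    {"domain": "techstart.io", "country": "United Kingdom"},
    {"domain": "example.com", "country": "United States"},
    {"domain": "nocountry.dev", "country": None},
])

print(top_countries(sample, n=1))
```

For a real file, pass `Path("companies-using-stripe.json").read_text()` (with `from pathlib import Path`) instead of the inline sample.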

Load into PostgreSQL

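CREATE TABLE companies (
  name VARCHAR(255),
  domain VARCHAR(255),
  country VARCHAR(100),
  technology VARCHAR(100),
  verified_date DATE,
  -- One row per (domain, technology), so several technology files can share the table
  PRIMARY KEY (domain, technology)
);

-- COPY reads the file on the database server; from psql, \copy reads a local file instead
COPY companies FROM '/path/to/companies-using-shopify.csv'
DELIMITER ',' CSV HEADER;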

-- Query insights
SELECT country, COUNT(*) as total
FROM companies
WHERE technology = 'Shopify'
GROUP BY country
ORDER BY total DESC
LIMIT 10;
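
No PostgreSQL server handy? The same idea works with SQLite from Python's standard library. Treat this as a sketch under assumptions: it expects the five columns from the sample in this post, keys rows on domain plus technology so several files can share one table, and uses helper names (`load_into_sqlite`, `countries_for`) that are mine rather than the project's.

```python
import csv
import io
import sqlite3


def load_into_sqlite(conn: sqlite3.Connection, csv_text: str) -> int:
    """Create the table if needed and load CSV rows; returns the number of rows read."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS companies (
               name TEXT, domain TEXT, country TEXT,
               technology TEXT, verified_date TEXT,
               PRIMARY KEY (domain, technology))"""
    )
    rows = [
        (r["company_name"], r["domain"], r["country"], r["technology"], r["last_verified"])
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    # INSERT OR REPLACE keeps re-imports of the same file from failing on the key.
    conn.executemany("INSERT OR REPLACE INTO companies VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def countries_for(conn: sqlite3.Connection, technology: str, limit: int = 10):
    """Top countries for one technology, mirroring the SQL query above."""
    return conn.execute(
        "SELECT country, COUNT(*) AS total FROM companies "
        "WHERE technology = ? GROUP BY country ORDER BY total DESC LIMIT ?",
        (technology, limit),
    ).fetchall()


SAMPLE = """company_name,domain,country,technology,last_verified
Acme Corp,acme.com,United States,Shopify,2026-02-08
TechStart,techstart.io,United Kingdom,Next.js,2026-02-08
"""

conn = sqlite3.connect(":memory:")
loaded = load_into_sqlite(conn, SAMPLE)
print(loaded, countries_for(conn, "Shopify"))  # 2 [('United States', 1)]
```

Swap `":memory:"` for a file path if you want the database to persist between runs.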

What's Free vs. What's Pro

Free (Open Source):

✅ 500 sample records per technology
✅ 403 technologies covered
✅ Daily updated data
✅ CSV & JSON formats
✅ No registration required

Pro Version:

🔒 Full datasets (millions of records)
🔒 Verified email addresses & phone numbers
🔒 API access
🔒 Historical data & trends
🔒 Custom filtering & exports

The sample data is genuinely useful for:

Exploratory analysis - Test your hypotheses before committing
Proof of concepts - Validate your product-market fit
Learning & research - Study tech adoption patterns
Portfolio projects - Build data apps with real data

For sales teams needing thousands of leads with contact info, check out Leadita Pro.

Current Stats

In the Open Repo:

200,000+ sample records (500 per tech × 403 technologies)
403 technologies tracked
Daily updates for data freshness
100% open source (MIT license)

Total Database:

51.3M+ companies indexed
Billions of data points
Daily crawls across the web

Roadmap

Some features I'm working on:

[ ] Historical data (track tech adoption over time)
[ ] API access for programmatic queries
[ ] More granular tech detection (framework versions, specific libraries)
[ ] Company employee count estimates
[ ] Funding/revenue data integration

Contributing

Found a bug? Have suggestions? Want to add new technologies to track?

Issues: GitHub Issues
Discussions: Share your use cases and ideas
PRs: Always welcome!

Try It Yourself

Pick a technology you're interested in and explore who's using it:

Browse the full list of technologies
Download the CSV/JSON for any tech
Run your own analysis
Share what you find!

Links:

GitHub Repo: leadita/tech-stack-datasets
Website: leadita.com
Questions? Drop a comment below 👇

If you find this useful, please ⭐ the repo on GitHub. It helps others discover the project!

Building tools that should be free. Follow along for more open data projects.

DEV Community