- 7B nodes, 11B edges in Airbnb's identity graph
- 5M new edges ingested per day
- P99 read latency: 5.0s → 2.5s (-49% improvement)
- P95 write latency: 353ms → 156ms (-56% improvement)
- 10× write QPS ceiling vs previous vendor maximum
- Zero manual reboots required post-migration
Airbnb's identity graph connects every user, every device, every listing, and every relationship that might reveal a fraudster trying to create a duplicate account or collude on a fake transaction. In 2024, this graph held 7 billion nodes and 11 billion edges — growing by 5 million new edges every day. The third-party vendor powering it required periodic manual reboots to stay stable, and 8-hop graph traversal queries were hitting 5-second P99 latencies. A small team rebuilt the entire thing internally. The results were not incremental.
The Story
The stakes of Airbnb's identity graph are not abstract. When a fraudster creates a second account after being banned, tries to rent a listing to damage it, or coordinates with other accounts to inflate reviews, the first system that needs to detect the connection is the identity graph. It holds the relationships between every user, every device, every verified identity, every behavioral signal that Airbnb's Trust and Safety team uses to determine whether a new account is truly new or a known bad actor resurfacing.
The identity graph's architecture progressed through three distinct generations, each solving the previous generation's limit while introducing new constraints. The first generation used a relational database for user and entity data paired with a key-value store holding JSON-encoded edge lists. This worked at low graph density. As individual users accumulated hundreds or thousands of edges, the JSON edge lists became expensive to read and update — relational databases (database systems built around tables, rows, and SQL joins — optimal for normalised structured data but increasingly expensive as relationship traversal depth grows, because each hop requires an additional join) are not optimised for multi-hop traversal at graph scale.
Problem
Generation 1 → 2: Relational DB + KV Store Couldn't Scale Graph Density
The first-generation architecture used a relational database for entity data and a KV store holding JSON-encoded edge lists. As graph density grew — individual users accumulating hundreds of edges — querying became expensive. JSON deserialisation and cross-table joins are not optimised for the multi-hop traversal patterns that fraud detection requires.
Cause
Generation 2 → 3: SaaS Vendor — Better Scale, Worse Reliability
The 2021 migration to a third-party SaaS graph database improved horizontal scalability but introduced new problems: P99 read latency reaching 5 seconds on 8-hop queries, operational instability requiring periodic manual reboots, no ability to tune performance for Airbnb's specific query patterns, and no fine-grained access controls. The vendor was a black box the team couldn't debug.
Solution
Generation 3: JanusGraph + DynamoDB, Internally Managed
In 2024, Airbnb built an internal graph infrastructure on JanusGraph (open-source, Apache TinkerPop stack, Gremlin query language) with DynamoDB as the storage backend and OpenSearch for indexing. The pluggable storage architecture let Airbnb leverage DynamoDB's operational reliability without reinventing distributed storage — while maintaining full control over the graph logic layer. They forked JanusGraph internally to add custom optimisations.
Result
49% P99 Latency Reduction, 10× Write QPS, Zero Manual Reboots
P99 read end-to-end latency dropped from 5.0s to 2.5s (-49%). P95 from 2.1s to 1.0s (-51%). Write P95 from 353ms to 156ms (-56%). Write QPS during load testing reached 10× the previous vendor's maximum. Manual reboots eliminated entirely. Auto-scaling enabled for the first time.
The Fix
Three JanusGraph Engine Optimisations That Closed the Latency Gap
Deploying stock JanusGraph with DynamoDB would not have been sufficient. Airbnb's query patterns — particularly high-fanout traversals that caused the worst P99 spikes — required modifications to the JanusGraph engine itself. The team forked JanusGraph internally and made three targeted optimisations:
- -49% — P99 read latency: 5.0s → 2.5s, directly improving fraud detection response time
- -56% — P95 write latency: 353ms → 156ms, enabling faster ingestion of 5M daily new edges
- 10× — Write QPS ceiling during load testing vs the vendor maximum
- 0 — Manual reboots required post-migration; the internal solution auto-scales
The choice of Gremlin (a graph traversal language developed as part of the Apache TinkerPop framework — reads like a path through the graph: g.V(userId).out('booked').in('listed') means "find all users who listed properties that this user has booked") as the query language was a deliberate migration enabler. Both the outgoing vendor system and the incoming JanusGraph support Gremlin, which meant Airbnb could run the same queries against both systems simultaneously during migration — direct performance benchmarking under real production load before any cutover.
# Three JanusGraph engine optimisations that reduced long-tail latency
# OPTIMISATION 1: DynamoDB conditional writes replace distributed locking
# Old: explicit distributed lock before write = round-trip overhead
def write_edge_default(tx, src_vertex, dst_vertex, edge_label):
lock = acquire_distributed_lock(src_vertex, edge_label) # expensive
try:
tx.add_edge(src_vertex, dst_vertex, edge_label)
tx.commit()
finally:
release_lock(lock)
# New: DynamoDB evaluates condition atomically server-side — no lock round-trip
def write_edge_optimized(tx, src_vertex, dst_vertex, edge_label):
tx.add_edge_with_condition(
src_vertex, dst_vertex, edge_label,
condition="attribute_not_exists(edge_key)"
)
# OPTIMISATION 2: Parallel getMultiSlices for high-fanout nodes
# Before: N sequential DynamoDB calls for a user with 1000+ edges
def get_edges_serial(vertex_id, num_slices=50):
results = []
for slice_key in compute_slice_keys(vertex_id, num_slices):
results.append(dynamo.get_item(slice_key))
return merge(results)
# After: single BatchGetItem — critical for high-fanout nodes
def get_edges_parallel(vertex_id, num_slices=50):
slice_keys = compute_slice_keys(vertex_id, num_slices)
results = dynamo.batch_get_items(slice_keys) # 1 call instead of N
return merge(results)
# OPTIMISATION 3: Distributed tracing in the internal fork
# OSS JanusGraph: no tracing — impossible to profile slow queries
# Internal fork: Airbnb trace context propagated through every graph op
def execute_gremlin_traversal(query, trace_context):
with airbnb_tracer.start_span('janusgraph.traversal', parent=trace_context) as span:
span.set_tag('query.hops', count_hops(query))
span.set_tag('query.fanout', estimated_fanout(query))
result = janusgraph.execute(query)
span.set_tag('result.edges_traversed', result.edge_count)
return result
The shadow traffic migration strategy
Migrating 7 billion nodes and 11 billion edges without downtime required running both the vendor system and the internal JanusGraph system in parallel, routing the same production queries to both and comparing results. Because both systems use Gremlin, the same queries ran unchanged on both simultaneously. This shadow traffic phase provided a performance benchmark under real load (not synthetic tests) and correctness validation before any cutover. Only after shadow traffic validated both was production traffic cut over and the vendor deprecated.
Architecture
Airbnb's new graph infrastructure has three conceptual layers. The storage layer is DynamoDB for graph data persistence and OpenSearch for secondary indexes — both managed AWS services that auto-scale. The graph engine layer is Airbnb's internal JanusGraph fork — the Gremlin server that executes traversal queries, with custom optimisations for Airbnb's access patterns. The management layer is the Graph Management Service — schema enforcement, index management, multi-tenant namespace isolation, and the Thrift API surface that client services call.
Before: Vendor Graph DB — Black Box, Manual Reboots, P99 at 5 Seconds
View interactive diagram on TechLogStack →
Interactive diagram available on TechLogStack (link above).
After: Airbnb Internal Graph Infrastructure — JanusGraph + DynamoDB
View interactive diagram on TechLogStack →
Interactive diagram available on TechLogStack (link above).
Why High-Fanout Nodes Cause Long-Tail Latency
View interactive diagram on TechLogStack →
Interactive diagram available on TechLogStack (link above).
Performance comparison across query types:
| Query Type | Vendor P95 | Internal P95 | Improvement | Vendor P99 | Internal P99 |
|---|---|---|---|---|---|
| 1-hop query | ~180ms | ~65ms | -64% | ~420ms | ~150ms |
| 2-hop query | ~350ms | ~130ms | -63% | ~900ms | ~280ms |
| 2-hop (high fanout) | ~620ms | ~200ms | -68% | ~1,800ms | ~450ms |
| 4-hop query | ~900ms | ~380ms | -58% | ~2,500ms | ~850ms |
| 8-hop query (max depth) | ~2,100ms | ~1,000ms | -52% | ~5,000ms | ~2,500ms |
| Write (edge creation) | ~353ms | ~156ms | -56% | ~800ms | ~360ms |
Lessons
Know the signals that a vendor relationship has passed its usefulness. Recurring manual operational interventions, inability to instrument the system's internals, no path to tune performance for your access patterns, and P99 latency an order of magnitude worse than P50 — each individually might be tolerable, but all four together mean the vendor is costing more than an internal solution would cost to build.
Pluggable storage backends are what make graph databases practical at scale. JanusGraph's DynamoDB backend let Airbnb separate concerns cleanly: Airbnb owns the graph logic layer, AWS owns the distributed storage operations. Build where you have competitive advantage; buy where you don't.
Shadow traffic is the only honest migration validation strategy for a stateful system. You cannot reproduce 7 billion nodes and 11 billion edges in staging. Running both old and new systems against the same production queries, comparing outputs and latencies, closes the validation gap. Gremlin compatibility between vendor and JanusGraph made shadow traffic feasible here — evaluate migration options partly on query language compatibility.
High-fanout nodes (vertices with an unusually large number of edges — sometimes called supernodes) are the specific failure mode of graph databases at scale. They don't appear until the graph is large and dense. Design your query architecture around the assumption that some nodes will have orders of magnitude more edges than the average — parallel fetching, fanout budgets, and explicit query limits are the tools that prevent P99 from diverging from P50.
Fork open-source infrastructure when you have specific, documented performance requirements the upstream project doesn't address — and when you intend to maintain the fork. The fork is a commitment that creates a maintenance obligation and diverges from upstream. Make that decision with eyes open, but don't avoid it when the production requirements are clear.
Engineering Glossary
DynamoDB — Amazon's fully managed NoSQL key-value and document database, used by Airbnb as JanusGraph's storage backend. Provides auto-scaling, multi-region replication, and conditional write operations used in Airbnb's optimised transaction strategy.
Gremlin — a graph traversal language developed as part of the Apache TinkerPop framework. Reads like a path through the graph: g.V(userId).out('booked').in('listed') means "find all users who listed properties that this user has booked."
High-fanout node — a vertex in a graph database with an unusually large number of edges, sometimes called a supernode. Causes disproportionate latency on traversal queries because a single hop can require fetching thousands of edges.
JanusGraph — an open-source distributed graph database built on Apache TinkerPop, with a pluggable storage backend that can use Cassandra, DynamoDB, or HBase as the underlying data store.
Long-tail latency — the phenomenon where the slowest requests in a system (P95, P99) are dramatically slower than the median. Particularly damaging for real-time applications where even a small fraction of slow responses degrades user experience.
P99 latency — the response time that 99% of requests complete within. A P99 of 5.0s means 1 in 100 requests takes 5 seconds or longer — directly visible to users at scale.
Pluggable storage backend — an architectural pattern where the database query engine and the distributed storage layer are decoupled through a defined interface, allowing different storage systems to be swapped without changing the query layer.
Shadow traffic — a migration validation strategy where the same production queries are routed to both the old and new systems simultaneously, comparing outputs and latencies before committing to a cutover.
This case is a plain-English retelling of publicly available engineering material.
Read the full case on TechLogStack →
(Interactive diagrams, source links, and the full reader experience)
TechLogStack — built at scale, broken in public, rebuilt by engineers.
Top comments (0)