AlgoMaster Logo

Graph Databases

Last Updated: January 12, 2026

Ashish

Ashish Pratap Singh

Every database type we have explored so far treats relationships as secondary.

Relational databases store relationships in foreign keys that must be joined at query time. Document databases embed related data or store references that require multiple queries. Wide-column databases denormalize relationships into the data model itself.

Graph databases flip this paradigm. Relationships become first-class citizens, stored and indexed just like the data itself.

Consider LinkedIn's "People You May Know" feature. To find potential connections, you need to traverse the social network: find friends of friends who are not already your friends, filter by shared interests or employers, and rank by connection strength.

In a relational database, each "hop" in the graph requires a JOIN operation. For a query that traverses 3-4 hops across millions of users, you are looking at joining billions of row pairs. Performance degrades exponentially with depth.

In a graph database, these traversals are what the system is optimized for. Relationships are stored as direct pointers between nodes. Traversing from one node to its neighbors is a pointer lookup, not a table scan.

The query "friends of friends of Alice" executes in milliseconds, regardless of how many total users exist in the system.

The Property Graph Model

The property graph model is the most common way to represent data in graph databases. It consists of three elements: nodes, relationships, and properties.

Nodes

Nodes represent entities in your domain. They are similar to rows in a relational table or documents in a document database. Each node:

  • Has a unique identifier
  • Can have one or more labels (like types or categories)
  • Can have properties (key-value pairs)

Labels like "Person," "Company," and "Skill" categorize nodes. A node can have multiple labels, allowing flexible categorization.

Relationships

Relationships connect nodes. Unlike foreign keys in relational databases, relationships are stored as explicit connections between nodes. Each relationship:

  • Has a unique identifier
  • Has a type (like "WORKS_AT" or "KNOWS")
  • Has a direction (from one node to another)
  • Can have properties (key-value pairs)

Note that relationships have their own properties. The FRIENDS_WITH relationship has a "since" property recording when the friendship started. The KNOWS relationship has a "level" property indicating proficiency.

Properties

Both nodes and relationships can have properties, which are key-value pairs storing attributes:

ElementProperty KeyProperty Value
Person nodename"Alice"
Person nodeage28
WORKS_AT relationshipsince2022
WORKS_AT relationshiprole"Engineer"

Properties can be strings, numbers, booleans, dates, or arrays of these types.

Why This Model Matters

The property graph model closely matches how we naturally think about many domains:

  • Social networks: people connected by friendships, follows, messages
  • Organization hierarchies: employees reporting to managers
  • Product catalogs: products in categories, with related products
  • Fraud networks: accounts linked by shared devices, addresses, transactions

When your domain is naturally a graph, modeling it as a graph is intuitive. You draw nodes and edges on a whiteboard, and that becomes your database schema.

Graph Database Architecture

Graph databases use specialized storage and indexing structures to make relationship traversal fast.

Index-Free Adjacency

The key architectural principle is index-free adjacency: each node directly stores pointers to its adjacent nodes. No index lookup is required to find neighbors.

Compare this to relational databases:

OperationRelational (with index)Graph
Find Alice's friendsScan index, retrieve rows, joinFollow pointers from Alice
ComplexityO(log n) per hopO(1) per hop
3-hop traversalO(log n)³ minimumO(neighbors at each hop)

The difference becomes dramatic for deep traversals. In a relational database, each hop requires at least an index lookup. In a graph database, each hop is a pointer dereference.

Relationship Storage

Relationships are stored as first-class objects with their own storage:

Each relationship record contains:

  • Start node pointer
  • End node pointer
  • Relationship type
  • Pointer to next relationship for the start node
  • Pointer to next relationship for the end node
  • Properties

This linked-list structure allows traversing all relationships of a node without indexes.

Traversal Performance

Graph database performance is measured differently than relational databases:

MetricMeaning
Traversals per secondHow many hops per second
Latency per hopTime to move from node to neighbor
Total query timeDepends on graph structure, not data size

The crucial insight: query time depends on the portion of the graph touched, not the total size. Finding friends of friends for a user with 100 friends touches ~10,000 nodes regardless of whether the total graph has 1 million or 1 billion nodes.

Query Languages

Graph databases use specialized query languages designed for expressing graph patterns and traversals.

Cypher (Neo4j)

Cypher is a declarative pattern-matching language. You describe the pattern you want to find, and the database figures out how to find it.

Basic pattern matching:

The ASCII-art syntax is intentional: (node)-[:RELATIONSHIP]->(node) visually represents the graph pattern.

Creating data:

Aggregations:

Variable-length paths:

Gremlin (Apache TinkerPop)

Gremlin is an imperative traversal language. You describe step-by-step how to traverse the graph.

Gremlin reads like a pipeline: start with vertices (V), filter by property, traverse out, deduplicate, extract values.

SPARQL (RDF)

SPARQL is used for RDF (Resource Description Framework) graphs, common in semantic web and knowledge graphs.

Language Comparison

AspectCypherGremlinSPARQL
StyleDeclarative, pattern-basedImperative, step-basedDeclarative, triple-based
Learning curveModerateSteepModerate
Graph modelProperty graphProperty graphRDF triples
Primary databaseNeo4jJanusGraph, NeptuneGraphDB, Virtuoso
ReadabilityHigh (ASCII art patterns)ModerateModerate

Use Cases

Graph databases excel in domains where relationships are the primary focus of queries.

Social Networks

The classic graph use case. Users are nodes, friendships are relationships.

Typical queries:

Recommendation Engines

Find items similar to what a user has liked, or items that similar users have liked.

Fraud Detection

Fraud often involves networks of connected accounts sharing devices, addresses, or payment methods.

Knowledge Graphs

Model entities and their relationships for semantic search and question answering.

Access Control

Model permissions as a graph to answer "Can user X access resource Y?"

Popular Graph Databases

Neo4j

The most popular graph database, with a mature ecosystem and the Cypher query language.

  • Deployment: Self-hosted or Neo4j Aura (managed cloud)
  • Query language: Cypher
  • ACID transactions: Yes, full ACID support
  • Scalability: Enterprise edition supports clustering
  • Strengths: Developer experience, tooling, community

Amazon Neptune

AWS's managed graph database service supporting both property graphs and RDF.

  • Deployment: AWS managed service
  • Query languages: Gremlin (property graph) and SPARQL (RDF)
  • Integration: VPC, IAM, CloudWatch, S3 import/export
  • Scalability: Read replicas, storage auto-scaling

JanusGraph

Open-source, distributed graph database designed for large-scale graphs.

  • Deployment: Self-hosted, runs on top of pluggable storage (Cassandra, HBase)
  • Query language: Gremlin
  • Scalability: Distributed, scales horizontally with storage backend
  • Strengths: Can leverage existing Cassandra/HBase infrastructure

ArangoDB

Multi-model database supporting graphs, documents, and key-value in one system.

  • Deployment: Self-hosted or ArangoDB Cloud
  • Query language: AQL (Arango Query Language)
  • Multi-model: Graph, document, and key-value in one database
  • Strengths: Flexibility to mix models without multiple databases

Comparison

FeatureNeo4jNeptuneJanusGraphArangoDB
Query languageCypherGremlin/SPARQLGremlinAQL
ACIDFullFullLimitedFull
Managed optionAuraYes (AWS)NoArangoDB Cloud
Horizontal scaleEnterpriseYesYesYes
Multi-modelNoNoNoYes

Performance Considerations

Traversal Depth

Query performance depends heavily on traversal depth and the number of relationships per node:

DepthNodes Touched (avg 100 friends)Relational (worst case)Graph
1100100 rows100 hops
210,00010,000+ rows, joins10,000 hops
31,000,000Millions of rows1M hops
4100,000,000Usually impractical100M hops

Graph databases handle depth 2-3 traversals easily. Depth 4+ requires careful consideration of filtering and limits.

Supernodes

Supernodes are nodes with an unusually high number of relationships (e.g., a celebrity with millions of followers). They can cause performance problems:

Mitigation strategies:

  • Limit traversal: Use LIMIT to bound results
  • Filter early: Apply filters before traversing from supernodes
  • Denormalize: Store aggregated data to avoid traversing all relationships
  • Separate processing: Handle supernodes differently in application logic

Indexing

While relationships are traversed without indexes, finding starting nodes requires indexes:

Always index properties used in initial node lookups (the starting point of traversals).

Scaling Graph Databases

Scaling graph databases is challenging because graphs are inherently connected. Partitioning a graph means cutting some relationships, which hurts performance for queries that cross partition boundaries.

Vertical Scaling

The simplest approach: use a bigger server. Neo4j can handle graphs with billions of nodes on a single large machine.

MetricSingle Large Server
NodesBillions
RelationshipsTens of billions
RAMUp to several TB
StorageUp to PB with SSDs

Read Replicas

Scale read throughput by replicating the graph:

Sharding (Challenging)

Partitioning a graph is hard because:

  • Relationships cross partition boundaries
  • Cross-partition traversals require network hops
  • Optimal partitioning depends on query patterns

Approaches:

  1. Random partitioning: Simple but many cross-partition queries
  2. Application-level sharding: Partition by tenant or region if natural boundaries exist
  3. Graph partitioning algorithms: Minimize edge cuts (complex, often imperfect)

When sharding is acceptable:

  • Multi-tenant applications (each tenant is a separate graph)
  • Geographic partitioning (users in US rarely connect to users in EU)
  • Queries that naturally stay within partitions

When to Choose Graph Databases

Graph databases are the right choice when:

  • Relationships are the primary focus. Your queries are about connections, paths, and patterns rather than aggregations over entities.
  • Traversal depth varies. You need queries like "find all paths up to 5 hops" where the depth is not fixed.
  • Schema is fluid. Nodes and relationships of different types coexist naturally. Adding new relationship types does not require schema migrations.
  • Query patterns involve pattern matching. Finding subgraph patterns, cycles, or specific relationship structures.

When to Consider Alternatives

Graph databases may not fit when:

  • Simple lookups dominate. If you mostly retrieve single entities by ID, key-value or document databases are simpler.
  • Aggregations are common. "Total sales by region" is better served by relational or analytical databases.
  • Relationships are not queried. If you store relationships but rarely traverse them, the graph model adds unnecessary complexity.
  • Scale exceeds single-server capacity. Distributed graph processing is complex. Consider pre-computing results or using batch processing.

Summary

Graph databases treat relationships as first-class citizens:

AspectGraph Database Approach
Data modelNodes, relationships, and properties
StorageIndex-free adjacency, relationships stored as pointers
QueriesPattern matching and traversal (Cypher, Gremlin)
StrengthMulti-hop relationship queries
ScalingChallenging due to graph connectivity

Key concepts:

  • Property graph model: Nodes and relationships both have types and properties
  • Index-free adjacency: O(1) traversal from node to neighbors
  • Query languages: Cypher (Neo4j), Gremlin (TinkerPop), SPARQL (RDF)
  • Supernodes: High-degree nodes that can impact performance

The next chapter explores time-series databases, which are optimized for a completely different access pattern: data that is indexed by time and queried primarily with time-range filters and aggregations.