Last Updated: January 12, 2026
In the previous chapter, we explored relational databases and their powerful guarantees around data integrity and transactions.
But that power comes with a cost: rigidity. Every row in a table must conform to the same schema. Every change to that schema requires a migration. And when your data is naturally hierarchical, you must flatten it into multiple tables and reassemble it with joins on every read.
Document databases emerged as a response to these constraints. Instead of storing data in rows and columns, they store it in documents, typically JSON or a binary equivalent like BSON.
Each document is a self-contained unit that can have a different structure from other documents in the same collection. Joins are often unnecessary because related data can be embedded directly within the document.
This flexibility fundamentally changes how applications are built.
A document database stores data as documents. A document is a self-describing data structure, typically JSON, that can contain nested objects, arrays, and scalar values.
Unlike relational rows, which must all have the same columns, documents in the same collection can have completely different structures.
Here is a typical document representing a blog post:
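The example document is not reproduced in this copy of the chapter, so the following is a hypothetical stand-in written as a Python dictionary, which maps directly onto JSON. Only the field names author, stats, tags, and comments are taken from the text; every value is invented for illustration.

```python
import json

# Hypothetical blog post; all values are made up for illustration.
post = {
    "_id": "post-001",
    "title": "Getting Started with Document Databases",
    "author": {"name": "Jane Doe", "email": "jane@example.com"},  # nested object
    "tags": ["databases", "nosql"],                                # array of strings
    "stats": {"views": 1520, "likes": 47},                         # nested object
    "comments": [                                                  # array of objects
        {"user": "sam", "text": "Very helpful.", "likes": 3},
        {"user": "lee", "text": "Thanks!", "likes": 1},
    ],
}

# The whole post serializes to a single JSON document.
print(json.dumps(post, indent=2))
```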
Several things to notice about this document:
- The author and stats fields contain objects with their own properties.
- The tags field is an array of strings. The comments field is an array of objects.
- Other documents in the same collection could have additional fields like featured_image or series, or lack fields like comments.

Documents are grouped into collections, which are roughly analogous to tables in relational databases. However, there is a key difference: collections do not enforce a schema. You can store documents with completely different structures in the same collection.
In practice, documents in a collection typically share a similar structure because they represent the same type of entity. But the flexibility to have variations is valuable during development and for handling edge cases.
Every document has a unique identifier. In MongoDB, this is the _id field, which is automatically generated if not provided. The default ID is an ObjectId, a 12-byte value that encodes a 4-byte creation timestamp, a 5-byte random value, and a 3-byte incrementing counter.
This structure means ObjectIds are roughly chronologically sorted, which can be useful for range queries on creation time. However, you can also use any unique value as the ID: UUIDs, natural keys like email addresses, or application-generated identifiers.
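To see why ObjectIds sort roughly by creation time, it can help to build one by hand. The sketch below assembles a 12-byte value with the ObjectId layout (a 4-byte timestamp, a 5-byte random value, and a 3-byte counter). The function names are hypothetical and this is not the driver's implementation, only an illustration of the byte layout.

```python
import os
import struct
import time
from typing import Optional

# Real drivers choose the random part once per process; this sketch does the same.
_PROCESS_RANDOM = os.urandom(5)

def make_object_id(counter: int, now: Optional[int] = None) -> bytes:
    """Build a 12-byte value laid out like an ObjectId (illustrative only)."""
    timestamp = int(time.time()) if now is None else now
    return (
        struct.pack(">I", timestamp)             # 4 bytes: seconds since the Unix epoch
        + _PROCESS_RANDOM                        # 5 bytes: random value
        + (counter % 2**24).to_bytes(3, "big")   # 3 bytes: incrementing counter
    )

def creation_time(object_id: bytes) -> int:
    """Recover the creation timestamp from the first 4 bytes."""
    return struct.unpack(">I", object_id[:4])[0]

older = make_object_id(counter=1, now=1_700_000_000)
newer = make_object_id(counter=2, now=1_700_000_060)
print(older.hex(), creation_time(older))
```

Because the timestamp occupies the leading bytes in big-endian order, comparing two IDs byte by byte compares their creation times first.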
The most important decision in document database design is choosing between embedding related data and referencing it. This is fundamentally different from relational databases, where the default is to normalize the data and join it back together at query time.
Embedding stores related data directly within the document:
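As a minimal sketch, an order with its customer and line items embedded might look like the following. The field names and values are hypothetical.

```python
# Hypothetical order with the customer and line items embedded.
embedded_order = {
    "_id": "order-1001",
    "customer": {"name": "Ada Park", "email": "ada@example.com"},
    "items": [
        {"product": "Keyboard", "price": 49.0, "quantity": 1},
        {"product": "USB cable", "price": 9.5, "quantity": 2},
    ],
}

# Everything needed to render the order is already in the one document.
total = sum(item["price"] * item["quantity"] for item in embedded_order["items"])
print(embedded_order["customer"]["name"], total)
```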
Referencing stores only the ID of related data:
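A referenced version of an order could look like this sketch, where the hypothetical customer_id and product_id fields hold identifiers of documents stored in other collections.

```python
# Hypothetical order that stores only identifiers of related documents.
referenced_order = {
    "_id": "order-1001",
    "customer_id": "cust-42",  # points at a document in a customers collection
    "items": [
        {"product_id": "prod-7", "quantity": 1},  # points at a products collection
        {"product_id": "prod-9", "quantity": 2},
    ],
}

# The order alone no longer contains names or prices.
print(sorted(referenced_order.keys()))
```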
To display this order, the application must make additional queries to fetch the customer and product details:
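The sketch below shows the shape of such an application-level join. Plain dictionaries stand in for the customers and products collections, and each dictionary lookup marks a point where a real application would issue a query. All names and data are hypothetical.

```python
# In-memory stand-ins for the customers and products collections (hypothetical data).
customers = {"cust-42": {"name": "Ada Park", "email": "ada@example.com"}}
products = {
    "prod-7": {"name": "Keyboard", "price": 49.0},
    "prod-9": {"name": "USB cable", "price": 9.5},
}
order = {
    "_id": "order-1001",
    "customer_id": "cust-42",
    "items": [
        {"product_id": "prod-7", "quantity": 1},
        {"product_id": "prod-9", "quantity": 2},
    ],
}

def build_order_view(order: dict) -> dict:
    """Resolve the references in an order; each lookup stands for one query."""
    customer = customers[order["customer_id"]]           # query 1: fetch the customer
    wanted = {item["product_id"] for item in order["items"]}
    fetched = {pid: products[pid] for pid in wanted}      # query 2: fetch all products at once
    lines = [
        {
            "name": fetched[item["product_id"]]["name"],
            "subtotal": fetched[item["product_id"]]["price"] * item["quantity"],
        }
        for item in order["items"]
    ]
    return {
        "customer": customer["name"],
        "lines": lines,
        "total": sum(line["subtotal"] for line in lines),
    }

view = build_order_view(order)
print(view["customer"], view["total"])
```

Fetching all referenced products in one batch keeps the number of round trips fixed at two, regardless of how many items the order contains.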
| Factor | Favor Embedding | Favor Referencing |
|---|---|---|
| Access pattern | Always accessed together | Accessed independently |
| Relationship | One-to-one or one-to-few | One-to-many or many-to-many |
| Update frequency | Rarely updated | Frequently updated |
| Data size | Small, bounded | Large or unbounded |
| Consistency needs | Can tolerate duplication | Must be authoritative |
Document databases offer rich query capabilities, though they differ from SQL. Instead of declarative SQL, you typically use a query API or a query language specific to the database.
In MongoDB, queries use a JSON-like syntax:
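A filter is itself a document. The sketch below writes one as a Python dictionary, in the shape a MongoDB driver's find() call accepts, and evaluates it with a toy matcher so the semantics are visible without a running server. The matcher supports only equality and $gt, and the sample posts are hypothetical.

```python
from typing import Any

def get_path(doc: dict, path: str) -> Any:
    """Follow a dotted path such as 'stats.views' into nested dicts."""
    value: Any = doc
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

def matches(doc: dict, query: dict) -> bool:
    """Toy matcher: equality plus the $gt operator only."""
    for path, condition in query.items():
        value = get_path(doc, path)
        if isinstance(condition, dict) and "$gt" in condition:
            if value is None or not value > condition["$gt"]:
                return False
        elif value != condition:
            return False
    return True

posts = [
    {"title": "Intro", "status": "published", "stats": {"views": 1520}},
    {"title": "Draft notes", "status": "draft", "stats": {"views": 12}},
    {"title": "Quiet post", "status": "published", "stats": {"views": 80}},
]

# Published posts with more than 1000 views.
query = {"status": "published", "stats.views": {"$gt": 1000}}
print([p["title"] for p in posts if matches(p, query)])  # prints ['Intro']
```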
You can retrieve only specific fields:
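A projection is also a document, listing the fields to include. The following toy function sketches inclusion-style projection on top-level fields, including the MongoDB convention that _id is returned unless explicitly set to 0. The function name and sample post are hypothetical.

```python
def project(doc: dict, projection: dict) -> dict:
    """Toy inclusion projection for top-level fields; _id is kept unless set to 0."""
    included = {field for field, flag in projection.items() if flag and field != "_id"}
    result = {key: value for key, value in doc.items() if key in included}
    if projection.get("_id", 1) and "_id" in doc:
        result = {"_id": doc["_id"], **result}
    return result

post = {"_id": "post-001", "title": "Intro", "status": "published", "content": "Long body text"}

print(project(post, {"title": 1, "status": 1, "_id": 0}))
# prints {'title': 'Intro', 'status': 'published'}
```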
Document databases excel at querying within arrays:
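The sketch below spells out, in plain Python, what three common MongoDB array filters mean. The filter documents appear in the comments; the helper functions and sample data are hypothetical.

```python
posts = [
    {
        "title": "Intro",
        "tags": ["databases", "nosql"],
        "comments": [{"user": "sam", "likes": 3}],
    },
    {
        "title": "Indexing",
        "tags": ["databases", "performance"],
        "comments": [{"user": "lee", "likes": 12}, {"user": "sam", "likes": 0}],
    },
]

def has_tag(doc: dict, tag: str) -> bool:
    """Semantics of {"tags": tag}: the array contains the value."""
    return tag in doc.get("tags", [])

def has_all_tags(doc: dict, tags: list) -> bool:
    """Semantics of {"tags": {"$all": tags}}: the array contains every value."""
    return set(tags) <= set(doc.get("tags", []))

def has_comment(doc: dict, user: str, min_likes: int) -> bool:
    """Semantics of {"comments": {"$elemMatch": {"user": user, "likes": {"$gte": min_likes}}}}:
    a single array element must satisfy both conditions."""
    return any(
        c["user"] == user and c["likes"] >= min_likes for c in doc.get("comments", [])
    )

print([p["title"] for p in posts if has_tag(p, "nosql")])
print([p["title"] for p in posts if has_all_tags(p, ["databases", "performance"])])
print([p["title"] for p in posts if has_comment(p, "sam", 1)])
```

The last query does not match the second post: it has a comment by sam and a comment with at least one like, but no single comment that satisfies both conditions.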
For complex data transformations, MongoDB provides an aggregation pipeline:
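A pipeline is an ordered list of stages, each transforming the output of the previous one. The sketch below defines a three-stage pipeline as Python data and then computes the same result in plain Python, so the stages can be read side by side with their effect. The collection contents and field names are hypothetical.

```python
from collections import defaultdict

# The pipeline as it would be handed to an aggregate() call in a MongoDB driver.
pipeline = [
    {"$match": {"status": "published"}},                                # keep published posts
    {"$group": {"_id": "$author", "total_views": {"$sum": "$views"}}},  # sum views per author
    {"$sort": {"total_views": -1}},                                     # highest total first
]

posts = [
    {"author": "jane", "status": "published", "views": 1500},
    {"author": "omar", "status": "published", "views": 900},
    {"author": "jane", "status": "draft", "views": 40},
    {"author": "omar", "status": "published", "views": 700},
]

def run_equivalent(docs: list) -> list:
    """Plain-Python equivalent of the three stages above."""
    totals = defaultdict(int)
    for doc in docs:
        if doc["status"] == "published":            # $match
            totals[doc["author"]] += doc["views"]   # $group with $sum
    rows = [{"_id": author, "total_views": views} for author, views in totals.items()]
    return sorted(rows, key=lambda row: row["total_views"], reverse=True)  # $sort

print(run_equivalent(posts))
```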
The aggregation pipeline is surprisingly powerful, capable of joining collections (using $lookup), reshaping documents, performing statistical operations, and more.
Despite rich query capabilities, document databases have limitations compared to SQL:
| Operation | SQL | Document DB |
|---|---|---|
| Join tables | Native, efficient | $lookup (less efficient) |
| Subquery | Native | Limited or requires application logic |
| Cross-collection transactions | Native | Supported but with overhead |
| Aggregations | Native | Aggregation pipeline |
| Window functions | Native | Limited support |
The fundamental limitation is that document databases are optimized for operations within a single document. Cross-document operations are possible but less efficient.
Without indexes, every query would scan every document in a collection. Indexes make queries fast by creating sorted data structures that point to documents.
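To make the idea concrete, the sketch below builds a tiny index over an email field: a sorted list of keys, each paired with the position of its document. A binary search over the keys then replaces a full scan. MongoDB's indexes are B-trees rather than sorted lists, so treat this as an illustration of the principle only; the data and function name are hypothetical.

```python
import bisect

docs = [
    {"_id": 1, "email": "mia@example.com"},
    {"_id": 2, "email": "ben@example.com"},
    {"_id": 3, "email": "zoe@example.com"},
    {"_id": 4, "email": "ada@example.com"},
]

# The "index": (key, document position) pairs kept in key order.
email_index = sorted((doc["email"], pos) for pos, doc in enumerate(docs))
keys = [key for key, _ in email_index]

def find_by_email(email: str):
    """Binary-search the index instead of scanning every document."""
    i = bisect.bisect_left(keys, email)
    if i < len(keys) and keys[i] == email:
        return docs[email_index[i][1]]
    return None

print(find_by_email("zoe@example.com"))
```

The lookup touches a logarithmic number of index entries, at the cost of keeping the index up to date on every write.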
Document databases support various index types:
| Index Type | Use Case | Example |
|---|---|---|
| Single field | Queries on one field | Index on email |
| Compound | Queries on multiple fields | Index on {status, created_at} |
| Multikey | Queries into arrays | Index on tags field |
| Text | Full-text search | Search within content field |
| Geospatial | Location-based queries | Find nearby restaurants |
| Hashed | Sharding distribution | Shard key index |
While document databases are often called "schemaless," modern databases like MongoDB support optional schema validation. This gives you flexibility with guardrails.
| Validation Setting | Behavior |
|---|---|
| validationLevel: "strict" | Validate all inserts and updates |
| validationLevel: "moderate" | Only validate inserts and updates to documents that already match |
| validationAction: "error" | Reject invalid documents |
| validationAction: "warn" | Allow invalid documents but log a warning |
Schema validation provides a middle ground: flexibility for development and iteration, with constraints for production stability.
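In MongoDB, validation rules are expressed with a $jsonSchema document attached to the collection. The sketch below shows such a validator as Python data, together with a toy checker that mimics the "error" and "warn" actions in memory. The checker handles only required fields and two types; the field names and the insert helper are hypothetical, and a real server performs this check itself.

```python
# A validator document in the $jsonSchema form MongoDB accepts.
validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["email", "age"],
        "properties": {
            "email": {"bsonType": "string"},
            "age": {"bsonType": "int"},
        },
    }
}

# Toy mapping from the two BSON type names used above to Python types.
PY_TYPES = {"string": str, "int": int}

def violations(doc: dict, schema: dict) -> list:
    """Return a list of human-readable problems; empty means the document is valid."""
    problems = [f"missing {field}" for field in schema.get("required", []) if field not in doc]
    for field, rule in schema.get("properties", {}).items():
        if field in doc and not isinstance(doc[field], PY_TYPES[rule["bsonType"]]):
            problems.append(f"{field} is not {rule['bsonType']}")
    return problems

def insert(collection: list, doc: dict, action: str = "error") -> None:
    """Mimic validationAction: 'error' rejects, 'warn' logs and stores anyway."""
    problems = violations(doc, validator["$jsonSchema"])
    if problems and action == "error":
        raise ValueError("; ".join(problems))
    if problems:
        print("warning:", "; ".join(problems))
    collection.append(doc)

users = []
insert(users, {"email": "ada@example.com", "age": 36})
insert(users, {"email": "ben@example.com"}, action="warn")  # stored, with a warning
print(len(users))
```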
Historically, document databases only provided atomicity at the document level. Multi-document operations were not atomic, which was a significant limitation for certain use cases.
Modern document databases now support multi-document transactions:
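With the PyMongo driver, a multi-document transaction is typically wrapped in a session, roughly as sketched below. The database name, collection name, and account fields are hypothetical, and the client argument is assumed to be a MongoClient connected to a replica set or sharded cluster, since MongoDB transactions are not available on a standalone server.

```python
def transfer(client, from_id, to_id, amount):
    """Move `amount` between two account documents in one transaction."""
    accounts = client["bank"]["accounts"]  # hypothetical database and collection
    with client.start_session() as session:
        # The transaction commits when the block exits normally
        # and aborts if an exception escapes it.
        with session.start_transaction():
            accounts.update_one(
                {"_id": from_id}, {"$inc": {"balance": -amount}}, session=session
            )
            accounts.update_one(
                {"_id": to_id}, {"$inc": {"balance": amount}}, session=session
            )
```

Passing the session to every operation is what ties the two updates together; an update issued without it would run outside the transaction.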
However, there are important caveats:
| Aspect | Single-Document | Multi-Document Transaction |
|---|---|---|
| Atomicity | Guaranteed | Guaranteed |
| Performance | Optimal | Overhead (locks, coordination) |
| Complexity | Simple | More complex error handling |
| Best practice | Prefer this | Use when necessary |
The recommendation is to design your data model so that operations affecting related data can be performed on a single document. Use multi-document transactions when the data model cannot accommodate this or when you are migrating from a relational database.
Document databases are designed for horizontal scaling from the ground up. Sharding distributes data across multiple servers.
The shard key determines how data is distributed. This is the most important decision in sharding:
| Shard Key Property | Good | Bad |
|---|---|---|
| Cardinality | High (many unique values) | Low (few values like status) |
| Frequency | Even distribution | Skewed (one value dominates) |
| Monotonicity | Random or hashed | Sequential (like timestamps) |
| Query pattern | Matches common queries | Doesn't match access patterns |
Good shard key examples:
- user_id for a multi-tenant application
- Hashed _id for even distribution
- {tenant_id, created_at} for time-series per tenant

Bad shard key examples:
- status (low cardinality, uneven distribution)
- created_at alone (monotonically increasing, all writes go to one shard)

When the query includes the shard key, the router can send the query directly to the relevant shard (targeted query). When it does not, the router must query all shards and merge results (scatter-gather).
Targeted queries are efficient. Scatter-gather queries are expensive, especially as the cluster grows. Design your shard key and queries to maximize targeted operations.
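The difference between the two query paths can be sketched with a toy router. The example below assumes a hashed shard key on a hypothetical user_id field and four in-memory shards; it reports how many shards each query has to contact. It illustrates the routing decision only, not how any particular database implements it.

```python
import hashlib

NUM_SHARDS = 4
shards = [[] for _ in range(NUM_SHARDS)]

def shard_for(key_value: str) -> int:
    """Hash the shard key value to pick a shard."""
    digest = hashlib.md5(key_value.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def insert(doc: dict) -> None:
    shards[shard_for(doc["user_id"])].append(doc)

def find(query: dict):
    """Return (matching documents, number of shards contacted)."""
    if "user_id" in query:
        targets = [shard_for(query["user_id"])]  # targeted: one shard
    else:
        targets = list(range(NUM_SHARDS))        # scatter-gather: every shard
    hits = [
        doc
        for shard_id in targets
        for doc in shards[shard_id]
        if all(doc.get(field) == value for field, value in query.items())
    ]
    return hits, len(targets)

for n in range(20):
    insert({"user_id": f"u{n}", "status": "active" if n % 2 == 0 else "inactive"})

print(find({"user_id": "u7"})[1])      # shards contacted for a targeted query
print(find({"status": "active"})[1])   # shards contacted for a scatter-gather query
```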
MongoDB is the most widely used document database. It has become nearly synonymous with the document model.
Amazon DocumentDB is AWS's managed document database service with MongoDB API compatibility.
Firestore is Google's serverless document database, part of Firebase.
CouchDB is Apache's document database, distinguished by native offline sync and MapReduce-based aggregation.
| Feature | MongoDB | DocumentDB | Firestore | CouchDB |
|---|---|---|---|---|
| Hosting | Self/Cloud | AWS only | GCP only | Self-hosted |
| Real-time | Change streams | - | Native | - |
| Offline sync | - | - | Native | Native |
| Transactions | Multi-doc | Multi-doc | Multi-doc | Single-doc |
| Aggregation | Powerful | Compatible | Limited | MapReduce |
Document databases are the right choice when:
Document databases may not be the best fit when:
Document databases offer a fundamentally different approach to data storage compared to relational databases:
| Aspect | Document DB Approach |
|---|---|
| Data model | JSON-like documents with nested objects and arrays |
| Schema | Flexible by default, optional validation available |
| Relationships | Embedding (denormalization) or referencing (application-level joins) |
| Transactions | Single-document atomic, multi-document with overhead |
| Scaling | Designed for horizontal scaling via sharding |
| Query language | Database-specific API or query language, not SQL |
The next chapter explores key-value stores, which take simplicity to the extreme. Where document databases offer flexible structure, key-value stores offer almost no structure at all, trading it for raw speed and simplicity.