NoSQL Basics
1. What is a NoSQL database?
A NoSQL database is a class of data storage system designed to handle large volumes of structured, semi-structured, and unstructured data without requiring a fixed relational schema. The term originally meant "non-SQL" but is now commonly interpreted as "Not Only SQL," reflecting that many such systems can coexist with relational databases. NoSQL systems prioritize horizontal scalability, flexible data models, and high availability over the rigid table-based structure of traditional relational databases.
2. How does NoSQL differ from a relational (SQL) database?
Relational databases store data in tables with predefined schemas, enforce relationships through foreign keys, and rely on JOIN operations and SQL for querying. NoSQL databases use flexible, schema-less or schema-on-read models such as documents, key-value pairs, wide columns, or graphs, and they typically scale horizontally across commodity servers. While SQL databases emphasize strong consistency and ACID transactions, many NoSQL systems trade some consistency for availability and partition tolerance to achieve scale.
3. Why did NoSQL databases emerge?
NoSQL databases gained popularity in the late 2000s as web-scale companies faced data volumes, velocities, and varieties that relational databases struggled to handle cost-effectively. The need to scale horizontally across many cheap machines, store rapidly evolving and semi-structured data, and serve high-traffic applications with low latency drove their adoption. The rise of big data, cloud computing, and agile development cycles further accelerated the move toward flexible, distributed data stores.
4. What does "schema-less" or "schema-on-read" mean?
A schema-less database does not enforce a fixed structure on the data at write time, so records in the same collection can have different fields and shapes. This is often called "schema-on-read" because the application interprets the structure when it reads the data, rather than the database validating it on write. This flexibility speeds up development and accommodates evolving requirements, but it shifts the responsibility for data consistency and validation onto the application layer.
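As a rough illustration of schema-on-read, the sketch below keeps two differently shaped records side by side and lets the reading code decide how to interpret them. The record fields and the contact_for helper are hypothetical, invented only for this example.

```python
# Two records in the same hypothetical "users" collection with different shapes.
users = [
    {"name": "Asha", "email": "asha@example.com"},
    {"name": "Ben", "phones": ["555-0100"], "email": None},
]

def contact_for(user: dict) -> str:
    """Interpret the record's shape at read time (schema-on-read)."""
    if user.get("email"):
        return user["email"]
    phones = user.get("phones") or []
    return phones[0] if phones else "unknown"

contacts = [contact_for(u) for u in users]
print(contacts)
```

Nothing stopped the second record from being written without a usable email; the fallback logic lives entirely in the application, which is the trade-off described above.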
5. What are the main advantages of NoSQL databases?
NoSQL databases offer flexible data models that adapt easily to changing requirements, horizontal scalability that distributes load across many servers, and high performance for specific access patterns. They handle large volumes of unstructured or semi-structured data well and are often more cost-effective at scale because they run on commodity hardware. Many also provide built-in replication and high availability, making them well suited to distributed, cloud-native applications.
6. What are the main disadvantages or trade-offs of NoSQL databases?
NoSQL databases often relax strong consistency guarantees, which can lead to temporary discrepancies under the eventual-consistency model. They frequently lack standardized query languages, mature JOIN support, and the rich transactional guarantees of relational systems, which can complicate complex analytical queries. The flexible schema also pushes data validation into the application, and the relative youth of some tools can mean a steeper learning curve and less tooling than the SQL ecosystem.
7. When should you choose NoSQL over a relational database?
NoSQL is a strong choice when you need to scale horizontally to handle massive read or write throughput, when your data is unstructured or evolving rapidly, or when your access patterns are well known and query-driven. It fits use cases such as real-time analytics, content management, IoT data ingestion, session storage, and applications requiring flexible, denormalized documents. Conversely, if you need complex multi-table transactions, strong consistency, and ad-hoc querying, a relational database is usually the better fit.
8. What is the BASE model and how does it relate to ACID?
BASE stands for Basically Available, Soft state, and Eventual consistency, and it describes the design philosophy of many distributed NoSQL systems. It contrasts with ACID (Atomicity, Consistency, Isolation, Durability), which guarantees that transactions are reliable and immediately consistent in relational databases. BASE accepts that the system may be temporarily inconsistent in exchange for higher availability and partition tolerance, with data converging to a consistent state over time.
9. What does "horizontal scaling" mean in the context of NoSQL?
Horizontal scaling, or scaling out, means adding more servers (nodes) to a cluster to distribute data and workload, rather than upgrading a single machine. NoSQL databases are designed to partition data across these nodes automatically, allowing capacity and throughput to grow nearly linearly as machines are added. This contrasts with vertical scaling (scaling up), which increases the resources of one machine and eventually hits hardware and cost limits.
10. What is denormalization and why is it common in NoSQL?
Denormalization is the practice of duplicating data across records or embedding related data together to optimize read performance. In NoSQL systems, which often lack efficient JOIN operations, denormalization lets an application retrieve everything it needs in a single read rather than assembling data from multiple tables. The trade-off is increased storage usage and the need to update duplicated data in multiple places when it changes.
Types of NoSQL Databases
11. What are the four main types of NoSQL databases?
The four primary categories are document stores, key-value stores, column-family (wide-column) stores, and graph databases. Document stores keep data as JSON-like documents, key-value stores map unique keys to opaque values, column-family stores organize data into rows with dynamic columns grouped into families, and graph databases model data as nodes and relationships. Each type is optimized for a particular data shape and set of access patterns.
12. What is a key-value store and when is it useful?
A key-value store is the simplest NoSQL model, mapping a unique key to a value that the database treats as an opaque blob. Lookups by key are extremely fast, making this model ideal for caching, session management, user preferences, and shopping carts. Popular examples include Redis and Amazon DynamoDB; the main limitation is that you generally cannot query by the contents of the value, only by the key.
13. What is a document database?
A document database stores data as self-describing documents, typically in JSON, BSON, or XML format, where each document is identified by a unique key. Documents can contain nested fields, arrays, and varying structures, making this model intuitive for representing application objects. MongoDB and Couchbase are well-known examples, and they support querying on fields within documents, which gives them more expressive power than pure key-value stores.
14. What is a column-family (wide-column) store?
A column-family store organizes data into rows identified by a key, where each row can have a large and varying set of columns grouped into column families. This model is optimized for storing and querying massive datasets with sparse data and high write throughput, because the columns within a family are stored together on disk and can be retrieved efficiently. Apache Cassandra and HBase are leading examples, frequently used for time-series data, logging, and analytics at scale.
15. What is a graph database?
A graph database stores data as nodes (entities) and edges (relationships between entities), with properties attached to both. This structure makes it highly efficient to traverse and query relationships, such as finding connections in a social network, detecting fraud rings, or powering recommendation engines. Neo4j and Amazon Neptune are common examples, and they typically use specialized query languages like Cypher or Gremlin to express graph traversals.
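To make relationship traversal concrete, here is a minimal sketch of a breadth-first walk over an in-memory adjacency map. It is not how Neo4j or Neptune are implemented; the FRIENDS data and the within_hops helper are assumptions made up for illustration.

```python
from collections import deque

# Hypothetical social graph: node -> set of directly connected nodes (undirected edges).
FRIENDS = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave": {"bob", "carol", "erin"},
    "erin": {"dave"},
}

def within_hops(graph, start, max_hops):
    """Breadth-first traversal returning {node: hop count} for nodes within max_hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]
    return distance

# Friends of friends: nodes exactly two hops from alice.
friends_of_friends = sorted(n for n, d in within_hops(FRIENDS, "alice", 2).items() if d == 2)
print(friends_of_friends)  # ['dave']
```

A graph query language expresses the same idea declaratively; the point of the sketch is only that the cost follows the edges actually visited rather than the size of the whole dataset.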
16. How do you decide which type of NoSQL database to use?
The choice depends primarily on your data structure and dominant access patterns. Use a key-value store for simple, fast lookups by identifier; a document store when your data is naturally hierarchical and queried by field; a column-family store for write-heavy, large-scale analytical workloads; and a graph database when relationships between entities are central to your queries. Evaluating read/write ratios, consistency needs, and scalability requirements alongside data shape leads to the best fit.
17. What is Redis and what are its typical use cases?
Redis is an in-memory key-value store known for extremely low latency, often used as a cache, session store, message broker, and real-time leaderboard backend. It supports rich data structures beyond simple strings, including lists, sets, sorted sets, hashes, and streams, which enable many specialized patterns. Because it primarily operates in memory, Redis offers optional persistence to disk through snapshotting (RDB) and append-only logging (AOF) to balance speed and durability.
18. What is Apache Cassandra known for?
Cassandra is a distributed column-family database designed for high write throughput, linear scalability, and no single point of failure. It uses a peer-to-peer "masterless" architecture in which every node is equal, and it relies on tunable consistency and a partitioning scheme based on consistent hashing. These properties make it ideal for use cases like time-series data, event logging, and globally distributed applications that demand high availability.
19. What is a multi-model database?
A multi-model database supports more than one data model within a single, integrated backend, such as combining document, graph, and key-value capabilities. This lets developers use the most appropriate model for each part of an application without operating multiple separate database systems. Examples include ArangoDB, Couchbase, and Amazon DynamoDB to varying degrees, reducing operational complexity at the cost of potentially being less specialized than a single-purpose engine.
20. What is the difference between a column-family store and a column-oriented relational database?
A column-family store (like Cassandra) is a NoSQL model that groups columns into families per row and allows each row to have a different, sparse set of columns, optimizing for distributed writes and flexible schemas. A column-oriented relational database (like a columnar analytics warehouse) stores each column of a table contiguously on disk to accelerate analytical scans and aggregations, while still maintaining a fixed relational schema. The two share the word "column" but solve different problems: flexible distributed storage versus efficient analytical querying.
MongoDB & Document Stores
21. What is MongoDB?
MongoDB is a popular document-oriented NoSQL database (source-available under the Server Side Public License since 2018) that stores data as flexible, JSON-like documents in a binary format called BSON. It organizes documents into collections, which are roughly analogous to tables, but without enforcing a fixed schema across documents. MongoDB offers rich querying, secondary indexes, aggregation pipelines, horizontal scaling through sharding, and high availability through replica sets.
22. What is BSON and how does it differ from JSON?
BSON (Binary JSON) is the binary-encoded serialization format MongoDB uses to store documents and transmit data. While it is conceptually similar to JSON, BSON supports additional data types such as Date, ObjectId, binary data, and distinct integer and floating-point types, and it is designed to be efficiently traversable and parseable. BSON also includes length prefixes, so a reader can skip over a field or an embedded document without parsing its contents, which is faster than scanning plain-text JSON.
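The length prefixes can be seen in a hand-rolled encoder. The sketch below handles only flat documents with string and 32-bit integer values, and it is an illustration of the byte layout rather than a replacement for a real BSON library; encode_bson is a name made up for this example.

```python
import struct

def encode_bson(doc: dict) -> bytes:
    """Encode a flat dict of str and 32-bit int values as a BSON document."""
    body = b""
    for key, value in doc.items():
        name = key.encode("utf-8") + b"\x00"  # field name as a NUL-terminated string
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            # type 0x02 (string): int32 byte length (including the NUL), then the bytes
            body += b"\x02" + name + struct.pack("<i", len(data)) + data
        elif isinstance(value, int) and not isinstance(value, bool):
            # type 0x10: 32-bit little-endian integer
            body += b"\x10" + name + struct.pack("<i", value)
        else:
            raise TypeError(f"unsupported type for this sketch: {type(value).__name__}")
    # The total-length prefix counts itself (4 bytes) and the trailing NUL terminator.
    return struct.pack("<i", len(body) + 5) + body + b"\x00"

encoded = encode_bson({"hello": "world"})
print(encoded.hex())
```

The first four bytes give the size of the whole document and each string carries its own length, which is what lets a parser jump past values it does not need.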
23. What is a collection and a document in MongoDB?
In MongoDB, a document is a single record stored as a set of field-and-value pairs in BSON, similar to a row in a relational table but with a flexible structure. A collection is a grouping of documents and is analogous to a table, though documents within a collection need not share the same fields. This structure allows related but differently shaped records to live together while still being queried and indexed.
24. What is the _id field in MongoDB?
Every MongoDB document has a unique _id field that serves as its primary key within a collection. If you do not supply one, MongoDB automatically generates an ObjectId, a 12-byte value made up of a 4-byte timestamp, a 5-byte random value generated once per process, and a 3-byte incrementing counter (older versions used a machine identifier and a process identifier in place of the random value). The _id field is automatically indexed, guaranteeing fast lookups and preventing duplicate primary keys.
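The byte layout can be inspected without a driver. The sketch below decodes only the leading 4-byte timestamp and the trailing 3-byte counter and treats the 5 bytes in between as opaque; parse_object_id is a hypothetical helper and the hex string is just an illustrative value.

```python
from datetime import datetime, timezone

def parse_object_id(hex_id: str) -> dict:
    """Split a 24-character hex ObjectId into its byte ranges."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes (24 hex characters)")
    seconds = int.from_bytes(raw[0:4], "big")  # creation time, seconds since the Unix epoch
    return {
        "created": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "middle": raw[4:9].hex(),  # treated as opaque in this sketch
        "counter": int.from_bytes(raw[9:12], "big"),
    }

parts = parse_object_id("507f1f77bcf86cd799439011")  # illustrative value
print(parts["created"].isoformat(), parts["counter"])
```

Because the timestamp comes first, ObjectIds generated later sort after earlier ones at one-second granularity, which is worth remembering when _id is considered as a shard key.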
25. What is the aggregation pipeline in MongoDB?
The aggregation pipeline is a framework for processing and transforming documents through a sequence of stages, where the output of one stage becomes the input to the next. Common stages include $match for filtering, $group for aggregating, $sort for ordering, $project for reshaping, and $lookup for joining collections. This pipeline enables complex data analysis and reporting directly within the database, similar to GROUP BY and aggregate functions in SQL.
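The stage-by-stage flow can be mimicked in plain Python. This is only a sketch of the idea, not MongoDB's implementation; the orders data and the match, group_sum, and sort_desc helpers are hypothetical stand-ins for $match, $group, and $sort.

```python
from collections import defaultdict

orders = [  # hypothetical documents
    {"customer": "a", "status": "paid", "total": 30},
    {"customer": "b", "status": "paid", "total": 50},
    {"customer": "a", "status": "paid", "total": 45},
    {"customer": "b", "status": "cancelled", "total": 20},
]

# The pipeline this sketch imitates, written as it would be passed to a driver:
PIPELINE = [
    {"$match": {"status": "paid"}},
    {"$group": {"_id": "$customer", "total": {"$sum": "$total"}}},
    {"$sort": {"total": -1}},
]

def match(docs, predicate):
    """Rough analogue of $match: keep documents satisfying the predicate."""
    return [d for d in docs if predicate(d)]

def group_sum(docs, key, field):
    """Rough analogue of $group with a $sum accumulator."""
    totals = defaultdict(int)
    for d in docs:
        totals[d[key]] += d[field]
    return [{"_id": k, "total": v} for k, v in totals.items()]

def sort_desc(docs, field):
    """Rough analogue of $sort in descending order."""
    return sorted(docs, key=lambda d: d[field], reverse=True)

# Each stage consumes the previous stage's output.
paid = match(orders, lambda d: d["status"] == "paid")
grouped = group_sum(paid, "customer", "total")
result = sort_desc(grouped, "total")
print(result)
```

Filtering first keeps the later stages small, which mirrors the usual advice to place $match as early in a pipeline as possible.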
26. How does MongoDB support transactions?
Since version 4.0, MongoDB supports multi-document ACID transactions, allowing multiple read and write operations across one or more documents to either all succeed or all roll back. Initially limited to replica sets, transaction support was extended to sharded clusters in version 4.2. While powerful, transactions add overhead and contention, so MongoDB's data-modeling guidance still favors embedding related data in a single document to avoid needing them where possible.
27. What is the difference between find() and findOne() in MongoDB?
The find() method returns a cursor over all documents that match a given query filter, which the application can then iterate. The findOne() method returns a single document, the first one that matches the filter, or null if no match is found. Using findOne() is more convenient and efficient when you expect or need just one result, such as looking up a record by a unique field.
28. What is a capped collection in MongoDB?
A capped collection is a fixed-size collection that maintains insertion order and automatically overwrites its oldest documents when it reaches its size limit, behaving like a circular buffer. Because it preserves insertion order and supports high-throughput inserts, it is well suited to logging, caching recent events, and similar scenarios. You cannot delete individual documents from a capped collection or grow it beyond its preallocated size.
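The circular-buffer behavior is easy to picture with a bounded deque. This is an analogy only: a capped collection is limited by size in bytes (optionally also by document count), whereas the sketch below caps by entry count, and the cap of 3 is an arbitrary choice.

```python
from collections import deque

recent_events = deque(maxlen=3)  # hypothetical cap of 3 entries
for event_id in range(1, 6):
    recent_events.append({"event": event_id})  # the oldest entry is dropped once full

print(list(recent_events))
```

After five inserts only the three newest entries remain, still in insertion order.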
29. What are some common ways to query nested fields in a document?
In MongoDB you query nested or embedded fields using dot notation, for example db.users.find({ "address.city": "Mumbai" }) to match a city inside an embedded address object. For arrays of subdocuments, operators like $elemMatch let you match documents where a single array element satisfies multiple conditions. This expressive querying on nested structures is a key advantage of document databases over simpler key-value stores.
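The semantics of dot notation and of $elemMatch can be sketched over plain dictionaries. The get_path and elem_match helpers below are hypothetical and deliberately simplified; in particular, real dot notation also reaches into arrays, which this version does not attempt.

```python
def get_path(doc, dotted):
    """Resolve a dotted path such as "address.city" in nested dicts; None if absent."""
    current = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

def elem_match(items, **conditions):
    """True if one single array element satisfies every condition (the idea behind $elemMatch)."""
    return any(all(item.get(k) == v for k, v in conditions.items()) for item in items)

user = {
    "name": "Priya",
    "address": {"city": "Mumbai"},
    "orders": [{"sku": "A1", "status": "shipped"}, {"sku": "B2", "status": "pending"}],
}
print(get_path(user, "address.city"))
print(elem_match(user["orders"], sku="A1", status="pending"))
print(elem_match(user["orders"], sku="B2", status="pending"))
```

The second call is false even though "A1" and "pending" both appear somewhere in the array, because no single element has both; that distinction is the reason $elemMatch exists.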
30. What is the difference between embedding and referencing in MongoDB?
Embedding stores related data directly within a parent document as nested fields or arrays, enabling fast single-read retrieval of all related information. Referencing stores the related data in a separate document and links to it via an identifier, similar to a foreign key, which keeps documents smaller and avoids duplication. Embedding suits tightly coupled, frequently co-accessed data, while referencing is preferable for large, independently changing, or many-to-many related data.
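The two shapes, and what they cost to read, might look like this in miniature. The post and comment documents and both loader functions are hypothetical, and the read counts assume one lookup per referenced document.

```python
# Embedded: the comments live inside the post document.
post_embedded = {
    "_id": "post1",
    "title": "Hello",
    "comments": [{"author": "ann", "text": "Nice"}, {"author": "raj", "text": "+1"}],
}

# Referenced: the post stores identifiers; comments are separate documents.
post_referenced = {"_id": "post1", "title": "Hello", "comment_ids": ["c1", "c2"]}
comments_by_id = {
    "c1": {"_id": "c1", "author": "ann", "text": "Nice"},
    "c2": {"_id": "c2", "author": "raj", "text": "+1"},
}

def load_comments_embedded(post):
    """One read: everything is already in the parent document."""
    return post["comments"], 1

def load_comments_referenced(post, store):
    """One read for the post plus one lookup per referenced comment."""
    found = [store[cid] for cid in post["comment_ids"]]
    return found, 1 + len(found)

_, reads_embedded = load_comments_embedded(post_embedded)
_, reads_referenced = load_comments_referenced(post_referenced, comments_by_id)
print(reads_embedded, reads_referenced)
```

In practice the referenced lookups would usually be batched into a single second query, but the embedded form still avoids that extra round trip.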
Data Modeling
31. How does data modeling in NoSQL differ from relational modeling?
Relational modeling starts from the data and normalizes it into tables to eliminate redundancy, then relies on JOIN operations to recombine it at query time. NoSQL modeling starts from the application's access patterns and structures data to make the most common queries fast, often through denormalization and embedding. The guiding principle is "design for your queries" rather than "design for the data," accepting some redundancy in exchange for read performance and scalability.
32. What is normalization and why is it less emphasized in NoSQL?
Normalization is the process of organizing data to reduce redundancy by splitting it into related tables connected through keys, which keeps each fact stored in exactly one place. It is less emphasized in NoSQL because many NoSQL systems lack efficient joins, so reassembling normalized data across documents or partitions is expensive. Instead, NoSQL favors denormalization to keep related data together, trading storage and update complexity for faster reads.
33. What are the trade-offs between embedding and referencing data?
Embedding data improves read performance because related information lives in one document, and it simplifies atomicity because a write to a single document is atomic. The cost is larger documents, duplicated data, and update anomalies when shared data changes. Referencing keeps documents smaller and avoids duplication, but it requires additional queries or $lookup operations to assemble related data, increasing read latency. The right choice depends on how often the data is read together, how large it grows, and how frequently it changes.
34. How do you model a one-to-many relationship in a document database?
A one-to-many relationship can be modeled by embedding the "many" items as an array within the "one" parent document, which is efficient when the related items are bounded in number and usually read together. Alternatively, you can reference the children by storing their identifiers in the parent or storing the parent's identifier in each child, which is better when the "many" side is large or grows unbounded. The decision hinges on the cardinality and access patterns of the relationship.
35. How do you model a many-to-many relationship in NoSQL?
Many-to-many relationships are commonly modeled by storing arrays of references on one or both sides of the relationship, or by using a separate linking collection that holds pairs of identifiers. In document databases you might embed a small list of references in each document, while in graph databases the relationship is modeled natively as edges between nodes. Because many-to-many relationships can grow large, referencing is usually preferred over embedding to avoid unbounded document growth.
36. What is a schema validation feature and why might you use it?
Many NoSQL databases, including MongoDB, offer optional schema validation that enforces rules on document structure, data types, and required fields at write time. You might use it to add a safety net that prevents malformed or unexpected data from entering a collection while still retaining flexibility. This provides a middle ground between fully schema-less freedom and the rigid enforcement of a relational schema, improving data quality without sacrificing all agility.
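The kind of check such a feature performs can be sketched in a few lines. This is a hand-written illustration, not MongoDB's validator; the USER_RULES table and the validate function are assumptions invented for the example.

```python
# Hypothetical rules: field name -> (expected type, required?)
USER_RULES = {
    "name": (str, True),
    "age": (int, False),
}

def validate(doc: dict, rules: dict) -> list:
    """Return a list of problems; an empty list means the document passes."""
    problems = []
    for field, (expected, required) in rules.items():
        if field not in doc:
            if required:
                problems.append(f"missing required field: {field}")
        elif not isinstance(doc[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems

print(validate({"name": "Lee", "nickname": "L"}, USER_RULES))
print(validate({"age": "ten"}, USER_RULES))
```

The first document passes even though it carries an extra field, which reflects the middle ground described above: the rules constrain what matters and leave the rest flexible.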
37. What is the "design for queries" principle in NoSQL data modeling?
The "design for queries" principle means structuring your data model around the specific read and write patterns your application performs most often, rather than around an idealized normalized form. This often involves duplicating data, pre-aggregating results, or creating purpose-built collections so that common queries can be served by a single, efficient lookup. The approach maximizes performance and scalability at the cost of more complex writes and potential redundancy.
38. How do you handle data that changes frequently across many documents?
When data is duplicated through denormalization and changes frequently, you must update every copy, which can be expensive and error-prone. Strategies include referencing the volatile data in a single canonical document instead of embedding it, batching or asynchronously propagating updates, or accepting eventual consistency where slight staleness is tolerable. The key is to weigh read performance gained from duplication against the cost and complexity of keeping copies synchronized.
39. What is a polymorphic schema and how is it handled in document stores?
A polymorphic schema is one where documents within the same collection have different structures or fields, often because they represent variations of a similar concept, such as different product types. Document stores handle this naturally because they do not enforce a uniform schema, allowing each document to carry only the fields relevant to its type. A common pattern is to include a type discriminator field so the application can interpret each document's shape correctly.
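A type discriminator typically drives a small dispatch table in the application. The sketch below assumes a field named type and two made-up product shapes; none of these names come from a particular database.

```python
products = [  # hypothetical documents sharing one collection
    {"type": "book", "title": "Dune", "pages": 612},
    {"type": "shirt", "title": "Tee", "size": "M"},
]

def describe_book(doc):
    return f'{doc["title"]} ({doc["pages"]} pages)'

def describe_shirt(doc):
    return f'{doc["title"]} (size {doc["size"]})'

# The discriminator value selects the function that understands that shape.
HANDLERS = {"book": describe_book, "shirt": describe_shirt}

def describe(doc):
    handler = HANDLERS.get(doc.get("type"))
    if handler is None:
        raise ValueError(f"unknown product type: {doc.get('type')!r}")
    return handler(doc)

descriptions = [describe(p) for p in products]
print(descriptions)
```

Failing loudly on an unknown type is a design choice here; it surfaces documents written by a newer version of the application instead of silently misreading them.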
40. How does indexing influence your data model in NoSQL?
Indexing decisions are tightly coupled to the data model because indexes accelerate the queries your model is designed to serve, and choosing them well is essential for performance. You should index the fields used in frequent query filters, sorts, and lookups, while avoiding excessive indexes that slow down writes and consume storage. The data model and indexing strategy must be planned together so that the fields you query are efficiently searchable.
Scaling, Sharding & Replication
41. What is sharding and why is it used?
Sharding is the technique of horizontally partitioning a dataset across multiple servers, called shards, so that each shard holds a subset of the data. It is used to scale a database beyond the capacity, throughput, or storage limits of a single machine, enabling near-linear growth as shards are added. By distributing both data and load, sharding supports very large datasets and high request volumes that a single node could not handle.
42. What is a shard key and why is it important?
A shard key is the field or set of fields used to determine how documents are distributed across shards in a cluster. Choosing a good shard key is critical because it directly affects how evenly data and load are balanced; a poor choice can create "hotspots" where one shard receives disproportionate traffic. An ideal shard key has high cardinality, distributes values evenly, and aligns with the most common query patterns so that queries can be routed to a single shard.
43. What is replication and how does it improve availability?
Replication is the process of maintaining multiple copies of data across different nodes so that the system can continue operating if a node fails. It improves availability and durability by providing redundancy, and it can improve read throughput by allowing reads to be served from secondary copies. In MongoDB this is implemented through replica sets, where one primary handles writes and secondaries replicate its data and can be promoted if the primary becomes unavailable.
44. What is a MongoDB replica set?
A MongoDB replica set is a group of mongod instances that maintain the same dataset, consisting of one primary node and one or more secondary nodes. The primary receives all write operations and records them in an operation log (the oplog), which secondaries asynchronously apply to stay synchronized. If the primary fails, the remaining members hold an election to automatically promote a secondary, providing automatic failover and high availability.
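The oplog mechanism can be modeled with two small classes. This is a toy model under strong assumptions (a single key-value map, no elections, no networking); the Primary and Secondary classes are invented for illustration.

```python
class Primary:
    def __init__(self):
        self.data, self.oplog = {}, []

    def write(self, key, value):
        self.data[key] = value
        self.oplog.append((key, value))  # record the operation for secondaries

class Secondary:
    def __init__(self):
        self.data, self.applied = {}, 0  # applied = oplog entries replayed so far

    def sync(self, primary):
        """Replay any oplog entries this secondary has not yet applied."""
        for key, value in primary.oplog[self.applied:]:
            self.data[key] = value
        self.applied = len(primary.oplog)

primary, secondary = Primary(), Secondary()
primary.write("x", 1)
secondary.sync(primary)
primary.write("x", 2)
stale = secondary.data["x"]  # the secondary has not replayed the second write yet
secondary.sync(primary)
fresh = secondary.data["x"]
print(stale, fresh)
```

The window between the second write and the second sync is replication lag: a read served by the secondary in that window returns the older value.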
45. What is the difference between sharding and replication?
Sharding partitions data so that different nodes hold different subsets of the data, which scales capacity and write throughput. Replication copies the same data to multiple nodes, which provides redundancy, high availability, and additional read capacity. The two are complementary and often combined: a sharded cluster typically replicates each shard so that the system is both scalable and fault-tolerant.
46. What strategies exist for partitioning data across nodes?
Common partitioning strategies include range-based partitioning, which assigns contiguous ranges of the key to each node, and hash-based partitioning, which applies a hash function to the key to distribute data evenly. Consistent hashing is a refinement used by systems like Cassandra and DynamoDB that minimizes data movement when nodes are added or removed. Each approach balances even distribution against the ability to perform efficient range queries.
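A minimal consistent-hash ring shows why adding a node moves so little data. The sketch below is a simplification with assumed parameters (SHA-256 as the hash, 100 virtual nodes per server, made-up node names) and is not the exact scheme used by Cassandra or DynamoDB.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Stable hash onto the ring (SHA-256 chosen only for determinism)."""
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=100):
        # Each node owns several points ("virtual nodes") to smooth the distribution.
        self.points = sorted((ring_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.hashes = [p for p, _ in self.points]

    def owner(self, key: str) -> str:
        """First ring point clockwise from the key's hash, wrapping at the end."""
        i = bisect.bisect_right(self.hashes, ring_hash(key)) % len(self.points)
        return self.points[i][1]

keys = [f"user:{i}" for i in range(1000)]
before = Ring(["n1", "n2", "n3"])
after = Ring(["n1", "n2", "n3", "n4"])
moved = [k for k in keys if before.owner(k) != after.owner(k)]

# Every key that moved should now belong to the new node, and most keys stay put.
print(len(moved), all(after.owner(k) == "n4" for k in moved))
```

With naive modulo hashing (hash mod node count), going from three to four nodes would reassign most keys; on the ring, only the keys that fall just before one of the new node's points change owner.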
47. What are read and write concerns or consistency levels in distributed NoSQL?
Many distributed NoSQL systems let you tune how many replicas must acknowledge an operation before it is considered successful, balancing consistency, latency, and availability. In MongoDB, a write concern of majority requires acknowledgment from a majority of the replica set's data-bearing voting members before the write is considered durable, while a read concern controls the consistency of the data returned. In Cassandra, tunable consistency levels such as ONE, QUORUM, and ALL let you choose how many nodes must respond per operation.
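A common rule of thumb for quorum-based systems is that reads observe the latest acknowledged write when the read and write replica counts overlap, that is, when R + W > N. The sketch below just encodes that arithmetic; the function names and the replication factor of 3 are assumptions for illustration.

```python
def quorum(n: int) -> int:
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1

def reads_see_latest_write(n: int, w: int, r: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n

N = 3  # hypothetical replication factor
print(quorum(N))                            # 2
print(reads_see_latest_write(N, w=2, r=2))  # True: quorum writes plus quorum reads
print(reads_see_latest_write(N, w=1, r=1))  # False: a read may hit a replica the write missed
```

Lower values of R and W reduce latency and tolerate more unavailable nodes, which is the trade the tunable levels expose.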
48. What is a hotspot in a sharded cluster and how do you avoid it?
A hotspot occurs when a disproportionate amount of read or write traffic is directed to a single shard, often because the shard key has low cardinality or monotonically increasing values such as timestamps. Hotspots undermine the benefits of sharding by overloading one node while others remain idle. You can avoid them by choosing a high-cardinality shard key, using a hashed shard key to spread sequential values, or designing a compound key that distributes load evenly.
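The effect of a monotonically increasing key is easy to reproduce. The sketch below routes 1,000 consecutive timestamp-like keys to 4 shards under two simplified schemes; the shard count, the range width, and both routing functions are assumptions chosen for the demonstration.

```python
import hashlib
from collections import Counter

SHARDS = 4
# Hypothetical monotonically increasing keys, e.g. timestamps of the newest 1000 writes.
keys = list(range(1_700_000_000, 1_700_001_000))

def range_shard(key: int) -> int:
    """Range partitioning: each shard owns a contiguous block of 1,000,000 key values."""
    return (key // 1_000_000) % SHARDS

def hashed_shard(key: int) -> int:
    """Hash partitioning: hash the key first, then map the hash to a shard."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % SHARDS

by_range = Counter(range_shard(k) for k in keys)
by_hash = Counter(hashed_shard(k) for k in keys)
print(dict(by_range))  # all 1000 recent writes land on one shard
print(sorted(by_hash.items()))
```

Hashing spreads the writes across all shards, at the cost of turning range queries on the original key into scatter-gather operations.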
CAP Theorem & Consistency
49. What is the CAP theorem?
The CAP theorem, formulated by Eric Brewer, states that a distributed data store can simultaneously guarantee at most two of three properties: Consistency, Availability, and Partition tolerance. Consistency means every read receives the most recent write or an error; availability means every request to a non-failing node receives a non-error response, though not necessarily one that reflects the most recent write; and partition tolerance means the system continues operating despite network partitions between nodes. Because network partitions are unavoidable in distributed systems, the practical choice is between consistency and availability while a partition is in effect.
50. What is the difference between strong consistency and eventual consistency?
Strong consistency guarantees that once a write is acknowledged, all subsequent reads return that updated value, providing the most up-to-date view at the cost of higher latency or reduced availability during partitions. Eventual consistency allows replicas to temporarily diverge after a write, with the guarantee that, in the absence of new updates, all replicas will converge to the same value over time. Many NoSQL systems offer tunable consistency so applications can choose strong consistency for critical operations and eventual consistency where higher availability and lower latency matter more.