Main Uses for Couchbase Primary Index
Find out when to use primary indexes in Couchbase.
Join the DZone community and get the full member experience.
Join For Free FROM customer USE KEYS ["cx:123"]
Couchbase is a distributed database. It supports a flexible data model using JSON. Each document in a bucket will have a user-generated unique document key. This uniqueness is enforced during the insertion or updation of the data. Here’s an example document.
select meta().id, travel
from `travel-sample` travel
where type = 'airline' limit 1;
[
{
"id": "airline_10",
"travel": {
"callsign": "MILE-AIR",
"country": "United States",
"iata": "Q5",
"icao": "MLA",
"id": 10,
"name": "40-Mile Air",
"type": "airline"
}
}
]
INSERT INTO customer (key, value) VALUES(“cx:123”, {“name”:”joe”, “zip”: 94040, “state”:ca});
SELECT META().id FROM customer;
cx:123
Each Couchbase bucket can store data in multiple types: customer, order, catalog, etc. When you load the “travel-sample” dataset, you load five distinct types of documents: airline, airport, hotel, route, and landmarks.
But, by default, Couchbase does not have the equivalent of “full table scan” to scan all the documents from start to finish. The primary index scan provides you the “full table scan” equivalent.
You may also like: Everything You Need To Know About Couchbase Architecture.
CREATE PRIMARY INDEX ix_customer_primary ON customer;
The primary index is:
- A sorted list of all the document keys of every document type within the bucket customer.
- Maintained asynchronously, just like other secondary indexes.
- Keeps just the document key and nothing else.
- Supports all the scan consistencies:
- Unbounded.
- AT_PLUS.
- REQUEST_PLUS.
The primary index allows the query engine to access all the documents, then do the filtering, joining, aggregation, etc operations on them.
EXPLAIN SELECT * from customer WHERE zip = 94040 name name = “joe” and type =
“cx”;
This is slow. Very slow. Unnecessary document fetches. Unnecessary filtering. Wasted memory and CPU. Primary scans will retrieve ALL the documents of all types in the bucket whether or not your query eventually returns them to the user. While I said primary scan is like a table scan, it’s much slower than your table scan, since it has to scan all of the documents of all types.
You should not use primary indexes. Do not use them. Especially in production.
Then why do we have primary indexes to begin with?
- When you’re starting to play with new sample data, you can run most queries without worrying about create specific indexes. At this point, your primary concern is to understand the data than tuning for throughput.
- When you know the range of primary keys you want to scan.
-
WHERE META().id between "cx:123" and "cx:458"
-
- When you know the trailing META().id pattern like below
-
WHERE META().id LIKE "cx:1%"
- DO NOT use:
LIKE “%:123”
. This will result in full scan
-
- When you do know the full META().id or list of META().id, you can use the USE KEYS to directly fetch the document without consulting the primary index
-
FROM customer USE KEYS ["cx:123"]
-
FROM customer USE KEYS ["cx:123", "cx:359", "cx:948"]
FROM customer USE KEYS(SELECT raw docid FROM mylist WHERE zip=94501)
-
Primary Index
Create the primary index on ‘travel-sample’
.
The primary index is simply the index on the document key on the whole bucket. The Couchbase data layer enforces the uniqueness constraint on the document key. The primary index, like every other index in Couchbase, is maintained asynchronously. You set the recency of the data by setting the consistency level for your query.
Here is the metadata for this index:
select * from system:indexes where name = ‘#primary’;
"indexes": {
"datastore_id": "http://127.0.0.1:8091",
"id": "f6e3c75d6f396e7d",
"index_key": [],
"is_primary": true,
"keyspace_id": "travel-sample",
"name": "#primary",
"namespace_id": "default",
"state": "online",
"using": "gsi"
}
The metadata gives you additional information on the index: Where the index resides (datastore_id), its state (state) and the indexing method (using).
The primary index is used for full bucket scans (primary scans) when the query does not have any filters (predicates) or no other index or access path can be used. In Couchbase, you store multiple keyspaces (documents of a different type, customer, orders, inventory, etc) in a single bucket.
So, when you do the primary scan, the query will use the index to get the document-keys and fetch all the documents in the bucket and then apply the filter. So, this is VERY EXPENSIVE.
The document key design is somewhat like primary key design with multiple parts.
Lastname:firstname:customerid
Example: smith:john:X1A1849
In Couchbase, it’s a best practice to prefix the key with the type of the document. Since this is a customer document, let’s prefix with CX. Now, the key becomes:
CX:smith:john:X1A1849
So, in the same bucket, there will be other types of documents.
ORDER type: OD:US:CA:294
ITEMS type:IT:KD92823
These are simply best practices. There is no restriction on the format or structure of the document key in Couchbase, except they have to be unique within a bucket.
Now, if you have documents with various keys and have a primary index, you can use following queries to efficiently.
Example 1: Looking for a specific document key.
SELECT * FROM sales WHERE META().id = “CX:smith:john:X1A1849”;
{
"#operator": "IndexScan2",
"index": "#primary",
"index_id": "4c92ab0bcca9690a",
"keyspace": "sales",
"namespace": "default",
"spans": [
{
"exact": true,
"range": [
{
"high": "\"CX:smith:john:X1A1849\"",
"inclusion": 3,
"low": "\"CX:smith:john:X1A1849\""
}
]
}
],
If you do know the full document key, you can use the following statement and avoid the index access altogether.
SELECT * FROM sales USE KEYS ["CX:smith:john:X1A1849"]
You can get more than one document in a statement.
SELECT * FROM sales USE KEYS ["CX:smith:john:X1A1849", "CX:smithjr:john:X1A18492"]
Example 2: Look for a pattern. Get ALL the customer documents.
SELECT * FROM sales WHERE META().id LIKE “CX:%”;
{
"#operator": "IndexScan2",
"index": "#primary",
"index_id": "4c92ab0bcca9690a",
"keyspace": "sales",
"namespace": "default",
"spans": [
{
"exact": true,
"range": [
{
"high": "\"CX;\"",
"inclusion": 1,
"low": "\"CX:\""
}
]
}
],
Example 3: Get all the customers with smith as their last name.
The following query uses the primary index efficiently, only fetching the customers with a particular range. Note: This scan is case sensitive. To do a case insensitive scan, you’ve create a secondary index with UPPER() or LOWER() of the document key.
SELECT * FROM sales WHERE META().id LIKE "CX:smith%";
{
"#operator": "IndexScan2",
"index": "#primary",
"index_id": "4c92ab0bcca9690a",
"keyspace": "sales",
"namespace": "default",
"spans": [
{
"exact": true,
"range": [
{
"high": "\"CX:smiti\"",
"inclusion": 1,
"low": "\"CX:smith\""
}
]
}
],
Example 4: It’s common for some applications to use use an email address as part of the document key since they’re unique. In that case, you need to find out all of customers with gmail.com. If this is a typical requirement, then, store the REVERSE of of the email address as the key and simply do the scan of leading string pattern.
Email:johnsmith@gmail.com key: reverse("johnsmith@gmail.com") =>
moc.liamg@htimsnhoj
Email: janesnow@yahoo.com key: reverse("janesnow@yahoo.com") =>
moc.oohay@wonsenaj
SELECT *
FROM sales
WHERE meta().id LIKE (reverse("@yahoo.com") || "%");
"#operator": "IndexScan2",
"index": "#primary",
"index_id": "4c92ab0bcca9690a",
"keyspace": "sales",
"namespace": "default",
"spans": [
{
"range": [
{
"high": "\"moc.oohayA\"",
"inclusion": 1,
"low": "\"moc.oohay@\""
}
]
}
],
Named Primary Index
In Couchbase 5.0, you can create multiple replica of any index with a simple parameter to CREATE INDEX. Following will create 3 copies of the index and there has to be minimum of 3 index nodes in the cluster.
CRETE PRIMATY INDEX ON 'travel-simple' WITH{"num_replica":2}; CREATE PRIMARY
INDEX 'def_primary' ON 'travel-sample';
You can also name the primary index. The rest of the features of the primary index are the same, except the index is named. A good side effect of this is that you can have multiple primary indices in Couchbase versions before 5.0 using different names. Duplicate indices help with high availability as well as query load distribution throughout them. This is true for both primary indices and secondary indices.
select meta().id as documentkey, `travel-sample` airline
from `travel-sample`
where type = 'airline' limit 1;
{
"airline": {
"callsign": "MILE-AIR",
"country": "United States",
"iata": "Q5",
"icao": "MLA",
"id": 10,
"name": "40-Mile Air",
"type": "airline"
},
"documentkey": "airline_10"
}
Finally, in Couchbase 6.5, we’ve introduced the Index advisor. It can analyze a single N1QL statement or workload. Read the details at:
- N1QL Index Advisor: Improve Query Performance and Productivity
- Index Advisor for N1QL Query Statement
- Index Advisor For Query Workload
This index advisor only advises suitable secondary indexes and never a primary index. If you’ve read the article so far, you know why!
Opinions expressed by DZone contributors are their own.
Comments