Approaches to Query Optimization in NoSQL
Explore some different approaches to query optimization in NoSQL.
Join the DZone community and get the full member experience.
Join For FreeA man returned home after walking around the globe for eleven years. The next day, when he told his wife he's going to the corner store, she asked him, "Are you taking the short route or the long one?"
Queries can be executed in many different ways. All paths lead to the same query result. The Query optimizer evaluates the possibilities and selects the efficient plan. Efficiency is measured in latency and throughput, depending on the workload. The cost of Memory, CPU, disk usage is added to the cost of a plan in a cost-based optimizer.
Now, most NoSQL databases have SQL-like query language support. So, a good optimizer is mandatory. When you don't have a good optimizer, developers have to live with feature restrictions and DBAs have to live with performance issues.
Database Optimizer
A query optimizer chooses an optimal index and access paths to execute the query. At a very high level, SQL optimizers decide the following before creating the execution tree:
- Query rewrite based on heuristics, cost or both.
- Index selection.
- Selecting the optimal index(es) for each of the table (keyspaces in Couchbase N1QL, collection in case of MongoDB)
- Depending on the index selected, choose the predicates to push down, see the query is covered or not, decide on sort and pagination strategy.
- Join reordering
- Join type
Consider the case of MongoDB restriction." A collection can have at most one text index." https://docs.mongodb.com/manual/core/index-text/#restrictions It documents a few other restrictions along with this. For this article, explaining this one restriction will suffice.
Why Should You Care About This Restriction?
- MongoDB and other NoSQL databases encourage you to denormalize (aggregate) your schema so you create a single large document representing an object: a customer, a partner, etc. so the majority of your operations happen on a single (JSON) document. So, a single customer document can contain customer information, customers orders, customer shipping information, and customer billing information. Having a single search index means you need to create a single VERY LARGE index combining all the fields you'd ever want to search. Here's the problem: when you search for customer address, you don't want to see the shipping address. When you search for the shipping order ID, you don't want to see the returned order ID.
- You can create multiple indexes on scalars in MongoDB. Why the restriction on the text index?
Why Does MongoDB Text Index Is Restricted to One Index per Collection?
- Is it the number of the text indexes? Search indexes are typically built with an inverted-tree data structure, but MongoDB has chosen to build it with the B-Tree index. This is unlikely to be the issue.
- Is it the size of the text indexes? Text indexes generate an array of tokens on text and index them. So, it's an array index. Its size can grow exponentially when you use an array index. The size of the index increases linearly to the number of word indexes and not the number of documents. They may cause issues.
- Is it a problem with the optimizer? When you have multiple indexes, the optimizer will have to choose the right index for the query. If you restrict to one text index, the choice is easy. This is a symptom of a larger problem in the MongoDB optimizer; it makes ad-hoc decisions resulting in restrictions like this.
MongoDB's query language is simplistic even if it's trying to mimic the SQL operations. Let's see how MongoDB's optimizer handles these.
- Query rewrite: Unsupported. MongoDB's queries are simplistic in the find(), save(), remove(), and update() methods. The aggregation pipeline is procedural and verbose. While it's theoretically possible to rewrite, there's nothing in the documentation or plan to indicate any query rewrites.
- Index selection: Supported. MongoDB's optimizer tries to pick up a suitable index for each portion of the query and index can/should be used. More on this below.
- Join reordering: Unsupported. MongoDB's $lookup is part of the convoluted aggregation framework where the query is written like Unix pipeline, a procedural approach.
- Join type selection: Unsupported since there's only one type join in MongoDB. MongoDB has a constrained left outer join support via $lookup operator — arrays are unsupported in the join condition. If you do use $lookup, the optimizer automatically uses the default join algorithm. There's no mention of the type of join done.
Essentially, MongoDB's query optimizer only does index selection before creating the execution plan, but it seems to select the indexes in an odd fashion — neither by rule not by statistics.
- Pick a random index on one or more qualified index.
- Use that plan if a subsequent query matches the query predicates, even if the constants, selectivities, and cardinalities are different.
- Then, at runtime, if the index scan returns more than 100 keys (!), run each of the alternative plans to see which one returns the keys first. At some point, it aborts the parallel execution and picks up one of them. It also replaces the plan in its plan cache.
Collection t1, with 3000 documenbts.
Create the following indexes: Appendix 1 for the definition:
MongoDB Enterprise > db.t1.createIndex({x:1})
MongoDB Enterprise > db.t1.createIndex({y:1})
MongoDB Enterprise > db.t1.createIndex({x:1, y:1})
MongoDB Enterprise > db.t1.createIndex({y:1, x:1})
This is a single collection with 4 indexes on (x), (y), (x, y) and (y, x). Now, see this:
MongoDB Enterprise > db.t1.find({x:{$gt:0}, y:99}).explain()
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.t1",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"y" : {
"$eq" : 99
}
},
{
"x" : {
"$gt" : 0
}
}
]
},
"winningPlan" : {
"stage" : "FETCH",
"filter" : {
"x" : {
"$gt" : 0
}
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"y" : 1
},
"indexName" : "y_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"y" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"y" : [
"[99.0, 99.0]"
]
}
}
},
Even on this simple document structure, MongoDB selects the index on (y) even though there the query has filters on x and y: ({x:{$gt:0}, y:99}).
To manage all of these uncertainties and the performance issues it will lead to, MongoDB provides a number of APIs to manage the query plan cache: flush specific cache entry, and flush the whole plan cache. Instead of developing applications, MongoDB developers and DBAs need to manage the plan cache. Developers and DBAs don't need to manage the plan cache in other enterprise databases.
Back to the original question: Why you can't create multiple text indexes on MongoDB?
Building multiple indexes shouldn't be an issue if they simply allowed it. The real problem is that when you give a text predicate in your query, the MongoDB optimizer unable to choose the right index. It cannot validate these text indexes against the text predicate. MongoDB optimizer doesn't follow a natural logic or a logical framework. Hence the restriction.
And, it could even hurt you!
Couchbase N1QL has added text index to N1QL for the upcoming release. See the details here. Users can create any number of text indexes and the optimizer will choose a qualified (sargable) index and use it. It also supports searching during joins, post index scans, etc. because the optimizer understands the search predicate and layers into its decision logic. There's no new API or new plan to manage. That's the power of having Couchbase!
Resources
Published at DZone with permission of Keshav Murthy, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments