CData Elasticsearch Driver Features and Differentiators
In this article, we explore how CData drivers grant access to all of Elasticsearch, enable full SQL querying of Elasticsearch, and more.
Join the DZone community and get the full member experience.
Join For FreeElasticsearch is a distributed, RESTful, full-text search engine designed to store, index, retrieve, and manage document-oriented or semi-structured data. Common uses for Elasticsearch range from building a simple search engine for a web site or document collection, to supporting auto-completion, analytics, AI, and cognitive computing workloads.
Because of the schemaless / NoSQL nature of Elasticsearch, data management is often a challenge. Through REST APIs users can access the search capabilities and features of Elasticsearch, but that requires custom development to connect it with other applications and services. The other option of using the SQL API and vendor-supplied libraries offers limited read-only SQL query capability and still only offers a small subset of tool integrations.
The CData Elasticsearch drivers offer the best of both worlds. The CData Drivers expand on the capabilities of the REST and SQL APIs, providing standards-compliant SQL-92 interfaces for data management, and the support for standards like ODBC / JDBC / ADO.NET provides near-universal seamless tool integration. The effectiveness of this methodology is best represented by our knowledge base of Elasticsearch integrations, which is only a small sample of what is possible.
In this article, we explore how CData drivers:
- Grant access to all of Elasticsearch
- Enable full SQL querying of Elasticsearch
- Support JSON structures in Elasticsearch
- Provide connectivity across major tools, platforms, and applications
- Outperform the native drivers when querying Elasticsearch
Access to All of Elasticsearch
CData Drivers for Elasticsearch provide unmatched access to Elasticsearch with many features beyond those available from the native drivers. The distinctive features for enabling access to Elasticsearch are outlined below:
- Connects to the SQL endpoint and REST endpoint (as opposed to only the SQL endpoint)
- Enables SQL access to Elasticsearch v2.2 and above (compared to v6.3 and above)
- Compatible with Open Source Elasticsearch subscriptions in addition to other commercially available subscriptions
- Returns the relevance score as a _score column for queries without explicitly requiring the SCORE() function
- Allows querying multiple indices (details below)
Querying Multiple Indices
Elasticsearch SQL supports querying multiple indices (or tables) through pattern matching. The CData Drivers support querying multiple indices, but for both the REST and SQL endpoints (as opposed to only the SQL endpoint). Multiple indices can be queried using any of the following formats:
Result Source | SQL Statement | REST API Endpoint |
---|---|---|
All available indices | SELECT * FROM [_all] | /_all/_search |
A list of specific indices | SELECT * FROM [index1, index2, index3] | /index1%2Cindex2%2Cindex3/_search |
Indices matching a wildcard pattern | SELECT * FROM [index*] | /index*/_search |
Indices matching a pattern (with exclusions) | SELECT * FROM [index*, -index3] | /index*%2C-index/_search |
SQL Support
The CData Drivers for Elasticsearch fully support standard SQL queries, whether crafted manually for custom applications and visualizations or generated automatically by a BI, reporting, or ETL tool.
Full CRUD Support
Since the CData Drivers include support for both the SQL and REST API endpoints, users can submit SELECT, INSERT, UPDATE, and DELETE statements to easily read from and write to Elasticsearch with CData Drivers.
Operation | SQL Statement | Endpoint(s) Used |
---|---|---|
Query all documents in an index | SELECT * FROM [myIndex]; | REST, SQL |
Create a new document in an index | INSERT INTO [myIndex] (myField1, myField2, myField3, ...) VALUES ('myValue1', 'myValue2', 'myValue3', ...); |
REST |
Update an existing document in an index | UPDATE [myIndex] SET myField1 = 'myNewValue1' WHERE myId = '1'; |
REST |
DELETE an existing document from an index | DELETE FROM [myIndex] WHERE myId = '1'; |
REST |
Filtering
CData Drivers allow for filtering using LIKE and = operators, as opposed to only supporting LIKE operators, offering more precise querying. This functionality is useful, but can still lead to imprecise results. For example, Elasticsearch may return records that contain the same words but in a different order: a search performed using the value "blue sky" will return a record with "sky blue." The Query Type logic built-in to the CData drivers identifies columns as Analyzed or Non-Analyzed and the drivers issue the best possible Elasticsearch query based on the specific operator and the search value.
WHERE Clause | Column Type | Elasticsearch Query Type |
---|---|---|
column = 'value' | Analyzed | Query String Query |
column = 'value with spaces' | Analyzed | Match Phrase Query |
column LIKE 'v_lu%' | Analyzed | Query String with wildcards |
column = 'value' | Non-Analyzed | Query String Query* |
column = 'value with spaces' | Non-Analyzed | Wildcard Query |
column LIKE 'v_lu%' | Non-Analyzed | Wildcard Query with wildcards |
* Non-analyzed columns can be matched case-insensitive, so four popular cases are checked: 'myValue' OR 'MYVALUE' OR 'myvalue' OR 'Myvalue'
JOIN Support
In Elasticsearch, the join datatype creates a parent/child relation within documents of the same index. CData Drivers can leverage the join datatypes to split related tables and enable SQL JOIN queries across those parent/child relationships. If you need to JOIN across indexes, the built-in SQL engine can perform a client-side, in-memory JOIN to eliminate restrictions on how you work with your data. Using the Elasticsearch example, we can build related tables using the join datatype.
For example, if we have the following documents in the sample index, we get different results from a query, depending on whether we use the native drivers or the CData drivers.
Example Index: sample
x
{
"_index": "sample",
"_type": "doc",
"_id": "2",
"_score": 1,
"_source": {
"text": "What is your favorite motorcycle?",
"question_id": {
"name": "question"
}
}
},
{
"_index": "sample",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"text": "What is your favorite company?",
"question_id": {
"name": "question"
}
}
},
{
"_index": "sample",
"_type": "doc",
"_id": "3",
"_score": 1,
"_routing": "1",
"_source": {
"text": "CData Software",
"question_id": {
"name": "answer",
"parent": "1"
}
}
},
{
"_index": "sample",
"_type": "doc",
"_id": "4",
"_score": 1,
"_routing": "1",
"_source": {
"text": "Elastic",
"question_id": {
"name": "answer",
"parent": "1"
}
}
},
{
"_index": "sample",
"_type": "doc",
"_id": "5",
"_score": 1,
"_routing": "1",
"_source": {
"text": "Suzuki V-Strom 650",
"question_id": {
"name": "answer",
"parent": "2"
}
}
},
{
"_index": "sample",
"_type": "doc",
"_id": "6",
"_score": 1,
"_routing": "1",
"_source": {
"text": "Yamaha FZ6",
"question_id": {
"name": "answer",
"parent": "2"
}
}
}
The native drivers expose a single table, sample, with a single column, text, but with no preservation of the relationship between the documents. The CData Drivers expose two tables, sample_question and sample_answer based on the relationship between the documents in the index.
Native Drivers
When querying the data, the native Elasticsearch driver returns a single table with one column, 'text':
Query Result (sample)
text |
---|
What is your favorite company? |
What is your favorite motorcycle? |
CData Software |
Elastic |
Suzuki V-Strom 650 |
Yamaha FZ6 |
CData Drivers
In contrast, CData Drivers expose two separate related tables, with several columns, including foreign keys to define the relationships:
Parent (sample_question)
_id | _score | text |
---|---|---|
1 | 0.0 | What is your favorite company? |
2 | 0.0 | What is your favorite motorcycle? |
Child (sample_answer)
_id | sample_question_id | _score | text |
---|---|---|---|
3 | 1 | 0.0 | CData Software |
4 | 1 | 0.0 | Elastic |
5 | 2 | 0.0 | Suzuki V-Strom 650 |
6 | 2 | 0.0 | Yamaha FZ6 |
Support for JSON Structures
CData Drivers provide support for querying JSON structures, like arrays and nested JSON objects, which can often be found in Elasticsearch records. With CData, users will be able to get exactly the data they want from Elasticsearch, thanks to built-in schema discovery and JSON structure flattening.
Array Support
Native Elasticsearch drivers do not (at the time of writing) support querying data found in arrays. CData Drivers not only support accessing the arrays (as raw JSON data), but also support flattening the arrays to drill down into the relevant data. Consider the following JSON array:
x
{
...
"coords" : [35.91,-79.06],
...
}
CData Elasticsearch Drivers expose the columns coords coords.0 and coords.1 (sample results below).
coords | coords.0 | coords.1 |
---|---|---|
[35.92,-79.06] | 35.91 | -79.06 |
Nested Type Support
Data stored for use with Elasticsearch often contains nested JSON objects. Without support for these objects, native drivers will miss a lot of available data in their queries. CData Drivers support searching nested objects and include functionality to flatten the nested objects to atomize the data. Consider the following JSON object:
xxxxxxxxxx
{
"manager": {
"name": "Ms. Manager",
"age": 35,
"location": "Flagship Store",
}
}
CData Elasticsearch Drivers expose manager as a column, along with a separate column for each nested element: manager.name, manager.age, manager.location (sample results below):
manager | manager.name | manager.age | manager.location |
---|---|---|---|
{ "name": "Ms. Manager", "age": 35, "location": "Flagship Store", ... } | Ms. Manager | 35 | Flagship Store |
Platform Support
Outside of the version and subscription restrictions, one of the largest limitations of the native Elasticsearch drivers is that they only support the JDBC and ODBC standards. In comparison, CData offers connectivity across a variety of data-centric standards like ODBC (Windows, Linux, Mac), JDBC, and ADO.NET and directly within tools like Excel, SSIS, BizTalk, and PowerShell. Whether you are using tools and applications for BI, analytics, reporting, and ETL or building custom applications, CData drivers provide live access to Elasticsearch data wherever you need it.
Performance Comparison
The CData JDBC Driver is able to read data more than twice as fast as the native driver. The CData engineers built a driver that makes better use of client-side resources to work with large datasets quickly.
Rows Queried | CData Software | Native |
---|---|---|
JDBC Query Times by Company (in milliseconds) | ||
25,000 | 1,016.9 (+35%) | 1,375.4 |
~10,000,000 | 199,577 (+155%) | 508,338 |
For more information, refer to our performance comparison article.
Other Features
Beyond these major features and key differentiators, CData Drivers for Elasticsearch come with additional features, that are included across all of the full range of data sources that CData supports.
- Collaborative query processing - a powerful, built-in SQL engine pushes supported complex queries (filtering, aggregations, JOINS, etc.) down to Elasticsearch for server-side processing and features client-side processing for unsupported queries
- Active development & maintenance — changes to the Elasticsearch API are implemented and supported as soon as possible, keeping up with the latest versions and updates
- World-class support — our Support Team is available to answer your questions via email, live chat, and phone.
Free Trials and More Information
You will find the latest version of our Elasticsearch drivers here and all of our drivers include a free, 30-day trial. If you want to learn more about using our drivers in your existing tools or applications, you can read more in our Knowledge Base.
Published at DZone with permission of Jerod Johnson, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments