RDBMS & Graphs: SQL vs. Cypher Query Languages
If SQL is like a poem then Cypher is like a Haiku: minimal and deep. Read on and you'll understand the analogy.
When it comes to a database query language, linguistic efficiency matters.
Querying relational databases is easy with SQL. As a declarative query language, SQL allows both for easy ad hoc querying in a database tool and for specifying use-case-related queries in your code. Even object-relational mappers use SQL under the hood to talk to the database.
But SQL runs up against major performance challenges when it tries to navigate connected data. For data-relationship questions, a single query in SQL can be many lines longer than the same query in a graph database query language like Cypher (more on Cypher below).
Lengthy, join-heavy SQL queries not only tend to take more time to run, but they are also more likely to include human coding mistakes because of their complexity. Shorter queries, by contrast, are easier for your team of developers to understand and maintain.
For example, imagine an outside developer having to pick through a complicated SQL query and work out the intent of the original developer: trouble would certainly ensue.
But what level of efficiency gains are we talking about between SQL queries and graph queries? How much more efficient is one than the other?
The answer: enough to make a significant difference to your organization.
The efficiency of graph queries means they run in real time, and in an economy that runs at the speed of a single tweet, that’s a bottom-line difference you can’t afford to ignore.
In this RDBMS & Graphs blog series, we’ll explore how relational databases compare to their graph counterparts, including data models, query languages, deployment paradigms, and more. In previous weeks, we explored why RDBMS aren’t always enough, graph basics for the relational developer, and relational vs. graph data modeling.
This week, we’ll compare query languages for both relational and graph databases, specifically examining SQL (the de facto query language of RDBMS) and Cypher (the most widely used graph query language).
The Critical Relationship Between Query Languages and Data Modeling
It’s worth noting that a query language isn’t just about asking (a.k.a. querying) the database for a particular set of results; it’s also about modeling that data in the first place.
We know from last week that data modeling for a graph database is as easy as connecting circles and lines on a whiteboard. What you sketch on the whiteboard is what you store in the database.
On its own, this ease of modeling has many business benefits, the most obvious of which is that you can understand what your database developers are actually creating. But there’s more to it: an intuitive model built with the right database query language ensures there’s no mismatch between how you built the data and how you analyze it.
A query language represents its model closely. That’s why SQL is all about tables and joins, while Cypher is about relationships between entities. Just as the graph model is more natural to work with, so is Cypher, because it borrows from the pictorial representation of circles connected with arrows, which any stakeholder (technical or non-technical) can understand.
In a relational database, the data modeling process is so far abstracted from actual day-to-day SQL queries that there’s a major disparity between analysis and implementation. In other words, the process of building a relational database model isn’t fit for asking (and answering) questions efficiently from that same model.
Graph database models, on the other hand, not only communicate how your data is related, but they also help you clearly communicate the kinds of questions you want to ask of your data model. Graph models and graph queries are just two sides of the same coin.
The right database query language helps us traverse both sides.
An Introduction to Cypher, the Graph Query Language
Just as SQL is the standard query language for relational databases, Cypher is an open, multi-vendor query language for graph technologies. The advent of the openCypher project has expanded the reach of Cypher well beyond just Neo4j, its original sponsor.
Cypher, also a declarative query language, is built on the basic concepts and clauses of SQL but with added graph-specific functionality, making it simple to work with a rich graph model without being overly verbose.
(Note: This introduction isn’t a reference document for Cypher but merely a high-level overview.)
Cypher is designed to be easily read and understood by developers, database professionals, and business stakeholders alike. It’s easy to use because it matches the way we intuitively describe graphs using diagrams.
If you have ever tried to write a SQL statement with a large number of joins, you know that you quickly lose sight of what the query actually does, due to all the technical noise. In contrast, Cypher syntax stays clean and focused on domain concepts since queries are expressed visually.
The basic notion of Cypher is that it allows you to ask the database to find data that matches a specific pattern. Colloquially, we might ask the database to “find things like this,” and the way we describe what “things like this” look like is to draw them using ASCII art.
Consider the social graph below describing three mutual friends:
Figure: A social graph describing the relationship between three friends.
If we want to express the pattern of this basic graph in Cypher, we would write:
(emil)<-[:knows]-(jim)-[:knows]->(ian)-[:knows]->(emil)
This Cypher statement describes a path which forms a triangle that connects a node we call jim to the two nodes we call ian and emil, and which also connects the ian node to the emil node. As you can see, Cypher naturally follows the way we draw graphs on the whiteboard.
Now, while this Cypher pattern describes a simple graph structure, it doesn’t yet refer to any particular data in a graph database. To bind the pattern to specific nodes and relationships in an existing dataset, we first need to specify some property values and node labels that help locate the relevant elements in the dataset.
Here’s our more fleshed-out Cypher pattern:
(emil:person {name:'emil'})
<-[:knows]-(jim:person {name:'jim'})
-[:knows]->(ian:person {name:'ian'})
-[:knows]->(emil)
Here we’ve bound each node to its identifier using its name property and person label. The emil identifier, for example, is bound to a node in the dataset with a label person and a name property whose value is emil. Anchoring parts of the pattern to real data in this way is normal Cypher practice.
The RDBMS Developer’s Guide to Cypher Clauses
Like most query languages, Cypher is composed of clauses.
The simplest queries consist of a MATCH clause followed by a RETURN clause. Here’s an example of a Cypher query that uses these two clauses to find the mutual friends of a user named jim (from our social graph pictured above):
MATCH (a:person {name:'jim'})-[:knows]->(b:person)-[:knows]->(c:person), (a)-[:knows]->(c)
RETURN b, c
Let’s look at each clause in further detail:
MATCH
The MATCH clause is at the heart of most Cypher queries.
Using ASCII characters to represent nodes and relationships, we draw the data we’re interested in. We draw nodes with parentheses, just like in these examples from the query above:
(a:person {name:'jim'})
(b:person)
(c:person)
(a)
We draw relationships using pairs of dashes with greater-than or less-than signs (--> and <--), where the < and > signs indicate relationship direction. Between the dashes, relationship names are enclosed by square brackets and prefixed by a colon, like in this example from the query above:
-[:knows]->
Node labels are also prefixed by a colon. As you see in the first node of the query, person is the applicable label.
(a:person … )
Node (and relationship) property key-value pairs are then specified within curly braces, like in this example:
( … {name:'jim'})
In our original example query, we’re looking for a node labeled person with a name property whose value is jim. The return value from this lookup is bound to the identifier a. This identifier allows us to refer to the node that represents Jim throughout the rest of the query.
It’s worth noting that this pattern:
(a)-[:knows]->(b)-[:knows]->(c), (a)-[:knows]->(c)
could, in theory, occur many times throughout our graph, especially in a large user dataset.
To confine the query, we need to anchor some part of it to one or more places in the graph. In specifying that we’re looking for a node labeled person whose name property value is jim, we’ve bound the pattern to a specific node in the graph: the node representing Jim.
Cypher then matches the remainder of the pattern to the graph immediately surrounding this anchor point based on the provided information on relationships and neighboring nodes. As it does so, it discovers nodes to bind to the other identifiers. While a will always be anchored to Jim, b and c will be bound to a sequence of nodes as the query executes.
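To make the anchoring idea concrete, here is a minimal Python sketch that mimics the match against a toy in-memory version of our social graph. It is only an illustration of the pattern semantics, not how a Cypher engine is implemented; the KNOWS dictionary and the match_triangle function are hypothetical names introduced for this example.

```python
# Toy directed "knows" graph mirroring the article's social graph:
# jim knows ian and emil; ian knows emil. (Illustrative data only.)
KNOWS = {
    "jim": {"ian", "emil"},
    "ian": {"emil"},
    "emil": set(),
}

def match_triangle(knows, anchor):
    """Return (b, c) pairs satisfying
    (a)-[:knows]->(b)-[:knows]->(c), (a)-[:knows]->(c)
    with a fixed to the anchor node."""
    results = []
    for b in sorted(knows.get(anchor, ())):       # candidates for b: whoever a knows
        for c in sorted(knows.get(b, ())):        # candidates for c: whoever b knows
            if c in knows.get(anchor, ()):        # second pattern part: a must also know c
                results.append((b, c))
    return results

print(match_triangle(KNOWS, "jim"))   # [('ian', 'emil')]
```

Because a is pinned to one node, the search only explores that node’s neighborhood instead of scanning every possible triangle in the graph.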
RETURN
This clause specifies which expressions, relationships, and properties in the matched data should be returned to the client. In our example query, we’re interested in returning the nodes bound to the b and c identifiers.
Other Cypher Clauses
Other clauses you can use in a Cypher query include:
WHERE: Provides criteria for filtering pattern-matching results.
CREATE and CREATE UNIQUE: Create nodes and relationships.
MERGE: Ensures that the supplied pattern exists in the graph, either by reusing existing nodes and relationships that match the supplied predicates or by creating new nodes and relationships.
DELETE/REMOVE: Removes nodes, relationships, and properties.
SET: Sets property values and labels.
ORDER BY: Sorts results as part of a RETURN.
SKIP and LIMIT: Skip results at the top and limit the number of results.
FOREACH: Performs an updating action for each element in a list.
UNION: Merges results from two or more queries.
WITH: Chains subsequent query parts and forwards results from one to the next, similar to piping commands in Unix.
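Of these clauses, MERGE is probably the least familiar to SQL developers, so a rough analogy may help. The Python sketch below shows the "reuse if it exists, otherwise create" behavior for a single node in a hypothetical in-memory node list. The merge_node function and the dictionary layout are assumptions made for this illustration and say nothing about how a graph database stores or matches data.

```python
def merge_node(nodes, label, **props):
    """Return (node, created): reuse a node with this label whose properties
    include all the given key-value pairs, or create one if none exists."""
    for node in nodes:
        same_label = node["label"] == label
        same_props = all(node["props"].get(k) == v for k, v in props.items())
        if same_label and same_props:
            return node, False            # existing node reused
    node = {"label": label, "props": dict(props)}
    nodes.append(node)
    return node, True                     # new node created

nodes = []
first, created_first = merge_node(nodes, "person", name="jim")
second, created_second = merge_node(nodes, "person", name="jim")
print(created_first, created_second, len(nodes))   # True False 1
```

Running the same merge twice leaves a single node in the list, which is the property that makes this style of operation safe to repeat.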
If these clauses look familiar, especially if you’re an RDBMS developer, that’s great! Cypher is intended to be easy to learn for SQL veterans while also being simple enough for beginners. (See the most up-to-date Cypher refcard for a deeper dive into the Cypher query language.)
SQL vs. Cypher Query Examples: The Good, the Bad, and the Ugly
Now that you have a basic understanding of Cypher, it’s time to compare it side by side with SQL to realize the linguistic efficiency of the former (and the inefficiency of the latter) when it comes to queries around connected data.
Our first example uses the organizational domain (from last week) pictured below as a relational data model:
Figure: A relational data model of an organizational domain.
In the organizational domain depicted in the model above, what would a SQL statement that lists the employees in the “IT department” look like? And how does that statement compare to a Cypher statement?
SQL statement:
SELECT person.name FROM person
LEFT JOIN person_department
ON person.id = person_department.personid
LEFT JOIN department
ON department.id = person_department.departmentid
WHERE department.name = 'it department'
Cypher statement:
MATCH (p:person)<-[:employee]-(d:department)
WHERE d.name = "it department"
RETURN p.name
In this example, the Cypher query is half the length of the SQL statement and significantly simpler. Not only would this Cypher query be faster to create and run, but it also reduces the chances for error.
Now for another example, this one more extreme. We’ll start with the Cypher query:
Cypher statement:
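If you’d like to try the relational side yourself, the following Python sketch runs a cleaned-up version of the SQL statement against an in-memory SQLite database. The table and column names follow the statement above; the sample rows (alice, bob, carol, and a sales department) are made up for illustration. Inner joins are used here because the WHERE filter on the department name discards unmatched rows anyway.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE person_department (personid INTEGER, departmentid INTEGER);
    -- Illustrative sample data, not from the article
    INSERT INTO person VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO department VALUES (10, 'it department'), (20, 'sales');
    INSERT INTO person_department VALUES (1, 10), (2, 20), (3, 10);
""")

# Two joins through the mapping table are needed to get from person to department.
rows = conn.execute("""
    SELECT person.name
    FROM person
    JOIN person_department ON person.id = person_department.personid
    JOIN department ON department.id = person_department.departmentid
    WHERE department.name = 'it department'
    ORDER BY person.name
""").fetchall()

it_staff = [name for (name,) in rows]
print(it_staff)   # ['alice', 'carol']
```

Note that person.name has to be qualified with its table, since both person and department have a name column.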
MATCH (u:customer {customer_id:'customer-one'})-[:bought]->(p:product)<-[:bought]-(peer:customer)-[:bought]->(reco:product)
WHERE NOT (u)-[:bought]->(reco)
RETURN reco AS recommendation, count(*) AS frequency
ORDER BY frequency DESC LIMIT 5;
This Cypher query says that for each product the customer bought, look at the products that peer customers have purchased and add them as recommendations. The WHERE clause removes products that the customer has already bought. Each of the arrows in the MATCH clause of the Cypher query represents a relationship that would be modeled as a join table in a relational schema.
Here’s the equivalent query in SQL:
SQL statement:
SELECT product.product_name AS recommendation, COUNT(1) AS frequency
FROM product, customer_product_mapping, (
    -- products bought by peers who share at least one purchase with customer-one
    SELECT cpm3.product_id, cpm3.customer_id
    FROM customer_product_mapping cpm, customer_product_mapping cpm2, customer_product_mapping cpm3
    WHERE cpm.customer_id = 'customer-one'
    AND cpm.product_id = cpm2.product_id
    AND cpm2.customer_id != 'customer-one'
    AND cpm3.customer_id = cpm2.customer_id
    -- exclude anything customer-one has already bought
    AND cpm3.product_id NOT IN (SELECT DISTINCT product_id
        FROM customer_product_mapping cpm
        WHERE cpm.customer_id = 'customer-one')
) recommended_products
WHERE customer_product_mapping.product_id = product.product_id
AND customer_product_mapping.product_id = recommended_products.product_id
AND customer_product_mapping.customer_id = recommended_products.customer_id
GROUP BY product.product_name
ORDER BY frequency DESC
LIMIT 5
This SQL statement is more than three times as long as the equivalent Cypher query. Not only will it suffer from performance issues due to the join complexity, but it will also degrade in performance as the dataset grows.
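For readers who want to experiment, here is a runnable Python sketch of the same recommendation idea using SQLite. It is not the statement above but a shorter reformulation with explicit self-joins, written under the assumption that customer_product_mapping holds one row per purchase. The sample customers and products are invented for illustration, and the aliases mine, peer, and theirs are hypothetical names chosen for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (product_id TEXT PRIMARY KEY, product_name TEXT);
    CREATE TABLE customer_product_mapping (customer_id TEXT, product_id TEXT);
    -- Illustrative sample data, not from the article
    INSERT INTO product VALUES
        ('p1', 'laptop'), ('p2', 'mouse'), ('p3', 'keyboard'),
        ('p4', 'monitor'), ('p5', 'webcam');
    INSERT INTO customer_product_mapping VALUES
        ('customer-one', 'p1'), ('customer-one', 'p2'),
        ('customer-two', 'p1'), ('customer-two', 'p3'), ('customer-two', 'p4'),
        ('customer-three', 'p2'), ('customer-three', 'p3'),
        ('customer-four', 'p5');
""")

# Each self-join of the mapping table stands in for one "bought" arrow
# in the Cypher pattern: mine -> shared product <- peer -> theirs.
RECOMMEND = """
    SELECT product.product_name AS recommendation, COUNT(*) AS frequency
    FROM customer_product_mapping AS mine
    JOIN customer_product_mapping AS peer
      ON peer.product_id = mine.product_id
     AND peer.customer_id <> mine.customer_id
    JOIN customer_product_mapping AS theirs
      ON theirs.customer_id = peer.customer_id
    JOIN product
      ON product.product_id = theirs.product_id
    WHERE mine.customer_id = :customer
      AND theirs.product_id NOT IN (
          SELECT product_id FROM customer_product_mapping
          WHERE customer_id = :customer)
    GROUP BY product.product_name
    ORDER BY frequency DESC, recommendation
    LIMIT 5
"""

recs = conn.execute(RECOMMEND, {"customer": "customer-one"}).fetchall()
print(recs)   # [('keyboard', 2), ('monitor', 1)]
```

Even in this tidier form, the mapping table has to be joined to itself three times, which is the structural cost the Cypher pattern hides behind its three arrows.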
Conclusion
When it comes to application performance, your database query language matters.
SQL is well-optimized for relational database models, but once it has to handle complex, relationship-oriented queries, its performance quickly degrades. In these instances, the root problem doesn’t lie with SQL but with the relational model itself, which isn’t designed to handle connected data.
For domains with highly connected data, the graph model is a must, and as a result, so is a graph query language like Cypher. If your development team comes from a SQL background, then Cypher will be easy to learn and even easier to execute.
When it comes to your next graph-powered, enterprise-level application, you’ll be glad that the query language underpinning it all is built for speed and efficiency.
next week, we'll explore different deployment paradigms for relational and graph databases, including polyglot persistence.
Published at DZone with permission of , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.