Accumulators vs SQL GROUP BY Aggregation
In this article, take a look at accumulators vs SQL GROUP BY aggregation.
Join the DZone community and get the full member experience.
Join For FreeGraph query language has always been among the top considerations when users choose a graph database for serious production use. Some considerations include but not limited to ease-of-use, expressiveness, and conformance to ISO standard. When it comes to putting graph databases into production, our experience shows that sufficient expressive power comes first.
In our previous blog, we anatomize the basic semantics and usage pattern of accumulators. We got a lot of feedback. One of the most frequently asked questions is that Can you do all accumulator-based aggregation in SQL GroupBy style aggregation?
The answer is that for some aggregation queries, it is not easy if it is possible at all! Accumulators can express certain classes of queries in a concise and clear form while admitting high-performance implementation. For this class of queries, SQL group by style aggregation appears to be cumbersome and inducing wasteful computation.
In this article, we continue our discussion on accumulators by using it to simulate different SQL groupby aggregation syntaxes. This exercise reveals the benefits of accumulators over the traditional SQL groupby aggregation.
Simulating SQL-Style GroupBy Aggregation
The ACCUM clause in a SELECT-FROM-WHERE query block is the main place where the accumulator accumulates its value. It turns out that this query block structure can express all conventional SQL-style Groupby aggregation.
For each built-in aggregation function of SQL (MIN/MAX/SUM/COUNT/AVG), GSQL features an accumulator type (MinAccum/MaxAccum/SumAccum/SumAccum/AvgAccum). Here we show a simple example which uses GSQL built-in accumulator to do SQL-style groupby aggregation.
Example 1
Given an Employee table, where each row stores the employee’s name, gender, and current salary, we want to find out for each gender, what are the minimum, maximum, total and average salary, and what is the total number of employees in each gender? In SQL, this query can be expressed by a single query block as below.
SELECT gender,
MIN(salary) AS min_salary,
MAX(salary) AS max_salary,
AVG(salary) AS avg_salary,
SUM(salary) AS tot_salary,
COUNT(*) AS tot_count
FROM Employee
GROUP BY gender
In GSQL, we can use a group by accumulator to achieve the same result as below.
xxxxxxxxxx
GroupByAccum<string gender,
MinAccum<double> min_salary,
MaxAccum<double> max_salary,
AvgAccum<double> avg_salary,
SumAccum<double> tot_salary,
SumAccum<int> tot_count> @@gender_aggs;
R = SELECT e
FROM Employee:e
ACCUM @@gender_aggs += (e.gender->e.salary, e.salary, e.salary, e.salary, 1);
In the GSQL query above, we declare a GroupByAccum named @@gender_aggs. It uses gender as the group by key, and the remaining 5 accumulators as the aggregators to aggregate values for each group. For the Employe vertex type, we bind a variable “e”, the ACCUM clause will parallelly send the key e.gender, and value (e.salary, e.salary, e.salary, e.salary, 1) to the accumulator @@gender_aggs.
Both the SQL and GSQL queries can be implemented by a single-pass algorithm, where we go through the employee table once and calculate the aggregation results for each aggregator.
In general, the conventional SQL group-by aggregation can be specified as
xxxxxxxxxx
SELECT K1,K2,...,Kn,agg1(A1),agg2(A2),agg3(A3)..., aggm(Am)
FROM ...
GROUP BY K1,K2,...,Kn
with Ki as group by key and Aj as aggregate value. In GSQL, this is achieved by the ACCUM clause
xxxxxxxxxx
ACCUM A += (K1,K2,...,Kn → A1,A2,...,Am)
where A is declared as an accumulator of the type GroupByAccum <K1, K2, …, Kn, Acc1, Acc2,...Acc_m>.
Simulating Advance SQL-Style Aggregation
Multiple group-by aggregations, such as the CUBE, ROLLUP and GROUPING SET extensions of the SQL GROUP BY clause are also eminently expressible using accumulators.
They each compute the aggregation for several subsets of the grouping attributes, outer unioning the results of each grouping. The GROUPING SETS extension is the most flexible one, allowing targeted selection of grouping attribute subsets.
Example 2
For instance, replacing a three-key (k1, k2, k3) GROUP BY clause with GROUP BY GROUPING SETS ((k1,k2), (k3)) can be simulated in GSQL by
xxxxxxxxxx
ACCUM A += (k1,k2,null → a1,a2,a3),
A += (null,null,k3 → a1,a2,a3)
Similarly, the CUBE (k1,k2,k3) extension can be simulated with 8 accumulator assignments (one for each subset of {k1,k2,k3}), and the ROLLUP (k1,k2,k3) extension with 4 accumulator assignments, (one for each of {k1,k2,k3}, {k1,k2}, {k1}, {}).
TigerGraph is an active participant and contributor to the current current SQL/PGQ and GQL standard drafts, and we have surveyed other popular graph query languages on the market, none of the other existing graph query languages support the extended Group By syntax (Cube/Roll up/Group set). In contrast, as illustrated by the above example, GSQL’s accumulators facilitate the straightforward addition of the CUBE, ROLLUP and GROUPING SET keywords as syntactic sugar that preserves the intended single-pass execution.
Deficiency of CUBE/ROLLUP/GROUPING SET in SQL-Style Aggregation
The CUBE, the ROLLUP, and the GROUPING SET essentially are convenient syntax sugar to specify subsets of the group by attributes onto which we can compute all the aggregation results. One deficiency of these group-by syntax extensions is that they do not allow SQL users to designate different aggregations from the SELECT clause to different grouping sets. SQL adopts a take-all approach. That is, any aggregation functions appearing on the SELECT clause will be computed for each of the grouping sets induced by the advanced group by syntax sugars.
This will cause wasteful computation when users want to bind specific aggregators to different grouping sets.
Example 3 Wasteful Aggregation per Grouping Set
Assume we wish to compute for grouping sets (K1), (K2), (K3) a sum, min and avg aggregate, respectively, the GROUPING SET semantics would force the computation of all three aggregates (of which two are unwanted) per grouping set. In GSQL form, it’s expressed as
xxxxxxxxxx
ACCUM A1 += (k1 → a1, a2, a3), A2 += (k2 → a1, a2, a3),A3 += (k3 → a1, a2,a3)
In contrast, in GSQL we can dedicate a separate accumulator Ai to each grouping set Ki, each Ai performing only the desired aggregation:
xxxxxxxxxx
ACCUM A1 +=(k1 → a1), A2 += (k2 → a2), A3 += (k3 → a3)
It is clear that the finer binding of aggregators avoids computing a2, a3 for k1, a1, a3 for k2, a1,a2 for k3, which are wasteful efforts.
Quantifying Wasteful Aggregation
We report an experiment that quantifies the savings of wasteful aggregation due to GSQL style accumulator-based aggregation.
Over the course of the graph traversal, the overhead wasted in unwanted aggregate computation can add up to a measurable performance penalty. We illustrate this by an experiment that compares the performance of two styles of multi-aggregation:
Accumulator-based.
SQL GROUPING SET (this is the most efficient among the SQL options as it avoids computing undesirable grouping sets – though, like all others, it cannot avoid computing unwanted aggregates per grouping set).
The Data
Towards working with a scalable graph, we adopted LDBC SNB benchmark, using graphs ranging from size 1GB to 1TB.
The Queries
We adapted the SNB-provided queries to compute multiple aggregates. We report here on one such query (its measured behavior is representative for the others). The query navigates from persons to the city they live in and to the comments they liked, as long as published between 2010 and 2012 (persons, cities and comments are modeled as vertices, the relationships between them as edges). It computes three grouping sets, each with its own aggregations:
Per (comment publication year), it computes
• the 20 most recent comments, favoring the longest ones in tie breaks,
• the 20 earliest comments, favoring the longest,
• the 20 longest comments, favoring the most recent,
• the 20 shortest comments, favoring the most recent,
• the top 10 comments by oldest authors, favoring the longest,
• the top 10 comments by youngest authors, favoring the longest.
Per (author’s city, browser type, publication year, month, comment length), it counts the comments.
Per (author’s city, gender, browser type, publication year and month), it averages comment length.
We expressed the query in GSQL, conforming to two styles:
Q_gs mimics SQL GROUPING SET aggregation using accumulators as in Example 3, conforming to GROUPING SET semantics, Q_gs computes all 8 aggregates for each of the three grouping sets. The query is published on GitHub. (Note this query requires TigerGraph 3.0 feature, where HeapAccum can be used within GroupByAccum).
Q_acc computes for each grouping set only the desired aggregates, using appropriate accumulators as shown in Example 3. The query is published on GitHub.
What We Measured
We measured the running times of Q_gs and Q_acc on graphs generated by SNB’s generator, at scale factors SF-1 (1GB), SF-10 (10GB), SF-100 (10GB) and SF-1000 (1TB). For each graph, we ran each query 5 times, computing the median running time.
The Results
We observed the following running times for the queries (all expressed in seconds), showing that, on graph sizes ranging from 1GB to 1TB, accumulators speed up aggregation by a factor of up to 3x when compared to SQL-style aggregation.
Scale Factor |
Q_gs median time |
Q_acc median time |
Speedup |
1 10 100 1000 |
4.84 59.87 440.39 2972.68 |
1.95 22.15 167.42 973.54 |
2.48 2.70 2.63 3.05 |
The Platform. We loaded the graphs (and ran the queries) at scale factors 1, 10 and 100 on an Amazon EC2 instance of type r4.8xlarge. For scale factor 1000, we used a Microsoft Azure 3-node cluster, with each node of type E64a v4 (both graph storage and query execution were distributed transparently by TigerGraph’s engine, in the sense that the user query did not need to be edited).
Final Remarks
We have detailed the comparison of Accumulator and the SQL-style aggregation. Accumulator subsumes the expressive power of SQL-style aggregation in terms of conciseness and avoiding wasteful aggregation.
We have designed accumulators with the intention to make graph analytic queries easier to write. As a side result, it provides conciseness and flexibility in writing single-pass aggregation queries over certain class of aggregation queries while SQL group-by aggregation syntax cannot easily do. As illustrated in the study, SQL users cannot bind individual aggregations from the SELECT clause to a grouping set, while accumulator can.
In GSQL early day’s design (2015 summer), what made us focused and excited is the composition effect and the reduction of complexity in writing analytic graph queries brought by decorating graph topology with runtime attributes (local accumulators) and global state variables (global accumulators). Accumulators are meant as a means to achieve powerful and efficient composition results by adorning vertices and graphs with computed state shared between query blocks. With these decorations, graph analytics becomes easy and fun. Do you agree? Send us your feedback at https://community.tigergraph.com/
Opinions expressed by DZone contributors are their own.
Comments