More than half of professional developers work with SQL. And for allRead More
How do the different SQL on Hadoop tools fare in the 99 TPC-DS test queries?
Over the past year, the Kognitio Analytics Team has been testing different open source SQL on Hadoop solutions against our own free-to-use software.
Due to increasing demands for performance comparisons of different SQL on Hadoop solutions, the TPC-DS query set has become the de-facto industry standard benchmark for Big Data systems. We have used these 99 queries to conduct our own performance comparison.
It should be noted that we always optimized the platform for each technology, aiming to get the best out of whichever SQL software we were testing. We have been very transparent about how we set up each test.
Enterprise level performance
I feel it is essential to be clear that the way we run the TPC-DS tests means that the queries keep the randomization that is inherent to the original TPC-DS benchmark. Random behavior manifests itself in 2 distinct ways in TPC-DS testing:
- The ordering of the 99 queries within each generated query stream is completely automated and randomized within each stream. Each query is then submitted directly after the previous query in the stream is completed
- Each individual query has a number of parameters built in that are assigned values randomly from a specified range when each query stream is generated
This means, query 72, for example, is the first query submitted in stream 1. It will have the same general SQL structure as query 72 in streams 2 to 10, but will differ slightly in each stream due to the random parameters utilized. Query submission order is randomized in each stream so query 72 is the 39th query submitted in stream 2, the 60th query submitted in stream 3, etc.
In my opinion this variability in the queries submitted, is the single most important feature of the TPC-DS benchmark when assessing SQL tools for enterprise level performance. (More on this in a future post.)
What did we test?
For each SQL platform we assessed the following:
1. SQL support: How many of the 99 TPC-DS queries can the platform run?
Here we categorize the queries based on:
- Runs ‘out of the box’ (no changes needed)
- Minor syntax changes – such as removing reserved words and editing column aliases
- No support – syntax not currently supported
- Long-running (>1 hour at 1GB scale)
2. Performance at scale: executing a single query at a time over 1TB of data.
Running a single query stream at a 1TB scale is a starting point for assessing platform performance
3. Performance of mixed workload: how does it perform when 10 concurrent random queries are run simultaneously on the platform?
To support enterprise level “self-serve” BI it is essential that SQL on Hadoop platforms can perform under mixed concurrent workloads.
(Note: At 1TB scales, more queries were classified as long running (>10mins for a single stream or >1hr for 10 streams) and were removed from the 10 stream benchmark as they severely skewed all concurrent performance.)
These tests have been run over the last year. During this time we have been migrating our lab into AWS. Therefore 3 different sets of similar hardware (but not identical) were used so direct comparisons between all products are not currently possible. However we intend to re-run the TPC-DS queries with each SQL engine in the coming weeks – all on the same AWS infrastructure.
|Kognitio – Version 8.1.50
Apache Spark – Version 2.0 (beta)
Apache Impala Version 2.6.0 (in CDH 5.8.2)
|Each platform utilized the same 12 node infrastructure running Cloudera CDH 5.8.2.|
|Kognitio – Version 8.2.0
Apache Hive 2 LLAP (Hive 1.2.1)
|Each platform utilized the same 9 node AWS infrastructure: 1 master, 8 workers. We deployed this using the Hortonworks Data Cloud (HDC) available in Amazon Marketplace|
|Kognitio – Version 8.2.0
Presto Version 0.170
|9 node AWS system running Amazon EMR|
Could the platforms run the SQL?
In testing the functional SQL support we submitted all 99 TPC-DS queries without alteration on a 1GB TPC-DS data set to test which queries run “out of the box”.
For queries that errored we allowed only minor changes such as
- Removal of reserved words
- Changes to column aliases and removal of quoted aliases if not supported
- Date field formatting changes for use of intervals and date differences
We did not allow the SQL to be rewritten to obtain the same results, such as replacing OLAP functions, ROLLUP or CUBE with a series of UNION ALL statements or restructuring sub- queries. In these cases the SQL was flagged as “no support”.
Results of the functional SQL support tests are shown in the graph below.
SparkSQL, Presto and Kognitio run all 99 queries either “out of the box” or with minor changes. Even at this tiny 1GB scale, Presto had 5 queries that were long-running and did not complete in over an hour. Further work on the Presto configuration is on going. Hive LLAP ran all but 5 queries, the majority of these were due to errors in the GROUPING function associated with OLAP functions. Impala had the worst SQL support running just 73 out of the 99 queries successfully. As well as 2 long running queries, other errors included unsupported OLAP or SET functions and complex sub-querying.
SQL on Hadoop – comparisons with Kognitio
The performance graphs that we use from this point onwards, show horizontal bars depicting the relative speed between Kognitio and one other platform.
How to interpret the following comparison graphs for each evaluation
In all graphs each of the 99 TPC-DS queries is represented by a horizontal block. If there is a more dominant color on the chart this indicates that platform has performed better. For example, a block where 75% is colored blue and 25% green, indicates Kognitio is 3x faster than Impala (75/25=3).
In each figure the left hand graph shows the 99 queries in order of largest Kognitio speed improvement. This allows for an easier performance comparison by colour. In the right hand graph the queries are ordered by TPC-DS query number. If you want to review the actual execution times then please see the full benchmarking papers for Impala, Spark versus Kognitio 8.1 and Hive LLAP versus Kognitio 8.2.0.
- Skip ahead to Apache Impala results
- Skip ahead to Apache SparkSQL results
- Skip ahead to Hive LLAP results
- Skip ahead to Presto results
Apache Impala v Kognitio
Quick summary of performance
Kognitio outperforms Impala in both the single stream and ten stream tests as can be seen above. In the ten stream tests there were 7 long running queries (>1 hour) that were removed from the test runs. These are represented by the grey chevrons in the graphic.
How does Impala perform at scale?
For a single query stream, Kognitio outperforms Impala. In the left-hand graph of Figure 2 below, the solid blue block at the top represents the 23 queries that Kognitio can run but Impala does not support plus the further 3 queries that were long-running on Impala. This leaves 73 queries for direct comparison and Kognitio runs 67 out the remaining 73 queries faster than Impala with 7 of these over 10x times faster. Impala runs 6 faster.
How does Impala perform under mixed concurrent workload?
Figure 3 below shows the speed comparisons between Kognitio and Impala when running 10 random concurrent query streams. The 7 long-running queries that were completely removed from the test are represented by grey blocks in both graphs. This leaves 92 queries – 24 of which Impala cannot run (again illustrated by blue block at top of left-hand graph). In the remaining 68 queries Kognitio is faster in 56 queries and Impala is faster than Kognitio in 12 queries.
Apache Spark v Kognitio
Quick summary of performance
Comparing Spark to Kognitio at the 1 TB scale shows that Kognitio is faster than Spark SQl in all but 1 query in a single stream and faster in all 92 queries for 10 concurrent query streams. In the 10 stream test the 7 long running queries (>1 hour) are represented by grey chevrons.
How does Spark perform at scale?
When we tested SparkSQL at 1TB scale it required configuration changes in order for many of the queries to run. Despite these changes there were still 10 queries that Spark could not complete at the 1TB scale. These are represented by solid blue bars at the top of the left hand graph on figure 4 as Kognitio can run all 99 queries at 1TB.
From the comparison we see Kognitio is faster than Spark for 88 of the remaining 89 queries in the benchmark. SparkSQL was only faster for query 67. The graphs are dominated by blue indicating Kognitio’s superior speed. There are 61 queries where Kognitio is over 10x faster than Spark and 5 queries where Kognitio is over 100x faster.
How does SparkSQL perform under mixed concurrent workload?
When tested with 10 random concurrent query streams running on the platform, there were 7 queries which neither platform could run or were long-running (indicated as grey bars in figure 5). SparkSQL also required a further 3 queries to be removed from this test (compared to a single stream) as these queries became long running (over 1 hour) as system resources had to be shared as concurrency increased. Thererefore there is a total of 13 queries that Kognitio can run but are long running in SparkSQL, represented by the solid blue bars at the top of the left hand graph in figure 5.
In the SparkSQL benchmark, the available YARN resources limited the number of concurrent Spark jobs. Clearly this impacts query times as concurrency increases and queries have to wait to be executed. On our Cloudera test system, typical Spark concurrency was between 7 and 8 jobs.
These issues left 79 queries where Spark could be compared directly to Kognitio. Spark was slower than Kognitio for all of these comparable queries. Kognitio was over 10x faster in 18 queries.
SparkSQL requires some configuration depending on data sizes and workload and is still some way off the ‘deploy and go’ experience you see with Kognitio. Kognitio needs no configuration other than resource allocation via YARN and a one-off creation of memory images of the data.
Hive LLAP v Kognitio
Quick summary performance
When we first conducted the Impala and SparkSQL TPC-DS tests we were going to include standard Hive MapReduce in the comparison. But a lack of SQL support (with Hive MapReduce running just 40 queries out of the box and a further 21 with minor changes), along with poor single thread performance meant it was removed from testing.
After Hive LLAP was released, we ran the full TPC-DS query tests against Hive LLAP, utilizing the Hortonworks Data Cloud (HDC) available in the Amazon Marketplace. It should be noted that these tests compare Apache Hive LLAP with a newer version of Kognitio (8.2.0). The Impala and SparkSQL tests were on a previous version of Kognitio (8.1.50). As indicated earlier in this blog, we intend to re-run the TPC-DS query set against the new versions of Impala and SparkSQL on the same AWS infrastructure in the coming weeks.
How does Hive LLAP perform at scale?
At 1TB scale there were 2 queries that failed to complete in Hive LLAP. Logs suggest that this was due to a lack of memory resource available to complete the query. Kognitio runs all 99 queries without issue at this scale and is faster than Hive LLAP for 88 out of 99 queries with 12 over 10x times faster.
How does Hive LLAP perform under mixed concurrent workload?
When concurrency was increased to 10 streams, the memory resource on the edge node was insufficient to support the 10 instances of beeline (Hive 2 command-line tool) used to submit the Hive LLAP query scripts. The size of the edge node was increased to work around this. More investigation is required to resolve the utilization of memory on the edge node during the Hive LLAP benchmark.
The longest running queries (>10 mins for a single stream or >1hr for 10 streams) were again removed from the 10 stream benchmark as they severely skewed all the concurrent performance. For Kognitio Version 8.2.0 this was just a single query leaving 98 in the benchmark compared to 7 that needed removing in Kognitio version 8.1.50. For Hive LLAP, as well as the 5 unsupported queries and the 2 erroring due to memory resource, a further 3 long-running queries were removed leaving 89 queries for direct comparison.
From the comparison in figure 7, it is clear that Kognitio is more performant than Hive LLAP. As well as the 9 queries only Kognitio can run, Kognitio was also faster in 74 out of the remaining 89 queries.
Presto v Kognitio
Quick summary of Presto v Kognitio performance
We haven’t fully integrated Presto in the test suite yet so don’t currently have mixed workload results for this tool. We have completed an initial review on whether Presto could run the SQL in the 99 TPC-DS queries. Presto runs all but 5 of the 99 queries: 73 run “out of the box” and a further 21 requiring minor changes. We tested a single stream at 1TB scale on the same AWS 9-node infrastructure as used for Hive LLAP, although in this case we used Amazon EMR for ease of deployment.
How does Presto perform at scale?
For Presto there were 8 queries that were either long running or errored at the 1TB scale and Kognitio completes all 99 TPC-DS queries faster than Presto. The longest query (Q67) took Kognitio 12m 54s to run, but 90 out of the 99 queries took less than 30s and 64 queries took less than 10s.
For Presto there were only 7 queries that completed in under 30 seconds. There were 67 queries where Kognitio was over 10x faster than Presto and 11 of these were over 100x faster. For further details see our Initial Presto results blog.
In conclusion, tests show that the years of experience parallelizing SQL gives Kognitio a big advantage over other SQL engines, shown through the superior performance against TPC-DS query tests.
Kognitio is completely free to use – with no restrictions on time, scale or functionality. So if you’re looking to improve the interactive querying performance of data in Hadoop – why not download Kognitio and get started?
Want more detail?
For more detail about any of the comparisons mentioned above, you can refer to the individual results:
- Full details of TPC-DS queries run with Impala, SparkSQL & Kognitio
- Full details of TPC-DS queries run with Hive, Hive LLAP & Kognitio
- Initial Presto results blog
Want to stay informed about future testing?
Then click the button below to subscribe to our SQL on Hadoop emails