Configuring RaptorX Multi-Level Caching With Presto
In this tutorial, we'll show you how to configure multi-level caching for Presto with RaptorX, Meta's caching project built for performance.
RaptorX and Presto: Background and Context
Meta introduced a multi-level cache at PrestoCon 2021, the open-source community conference for Presto. Code-named the “RaptorX Project,” it aims to make Presto 10x faster on Meta-scale petabyte workloads. This feature is unique to PrestoDB and is not available in other versions or forks of the Presto project.
Presto is the open-source SQL query engine for data analytics and the data lakehouse. Its disaggregated architecture lets you scale storage and compute independently and reduce costs. However, storage-compute disaggregation also brings new challenges for query latency, because scanning huge amounts of data between the storage tier and the compute tier is I/O-bound over the network. As with any database, optimized I/O is a critical concern for Presto. When possible, the priority is to not perform any I/O at all, which makes memory utilization and caching structures of utmost importance.
Let’s walk through the normal workflow of the Presto-Hive connector:
- During a read operation, the planner sends a request to the metastore for metadata (partition info)
- The scheduler sends requests to remote storage to get a list of files and does the scheduling
- Each worker node receives its list of files from the scheduler and sends requests to remote storage to open the files and read the file footers
- Based on the footers, Presto determines which data blocks or chunks need to be read from remote storage
- Once the workers have read that data, Presto performs the join or aggregation computation on the leaf worker nodes and shuffles the results back to the client
That is a lot of RPC calls: not only to the Hive Metastore for partition information, but also to remote storage to list files, schedule them, open them, and then read the data itself. Each of these I/O paths is a bottleneck on query performance for the Hive connector, which is why RaptorX builds a multi-layer cache, so you can maximize the cache hit rate and boost query performance.
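For example, even the simple aggregation used later in this article’s verification queries exercises every one of these paths on a first, uncached run (the ahana_hive catalog and customer_orc table are just the examples used throughout this article):
USE ahana_hive.default;
-- On a cold run, this scan incurs metastore calls for partition info, file listing
-- on remote storage, file footer reads, and finally the data reads themselves.
SELECT count(*) from customer_orc group by nationkey;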
RaptorX introduces a total of five types of caches and a scheduler. This cache system is only applicable to Hive connectors.
| Multi-layer Cache | Type | Affinity Scheduling | Benefits |
| --- | --- | --- | --- |
| Data IO | Local disk | Required | Reduced query latency |
| Intermediate Result Set | Local disk | Required | Reduced query latency and CPU utilization for aggregation queries |
| File Metadata | In-memory | Required | Reduced CPU utilization and query latency |
| Metastore | In-memory | N/A | Reduced query latency |
| File List | In-memory | N/A | Reduced query latency |
We'll explain how you can configure and test various layers of RaptorX cache in your Presto cluster.
#1 Data(IO) Cache
This cache uses Alluxio’s LocalCacheFileSystem, an implementation of the HDFS interface. The Alluxio data cache is a local disk cache on each worker node that stores data read from files (ORC, Parquet, etc.) on remote storage. The default page size on disk is 1MB, and evictions follow an LRU policy. Local disks are required to enable this cache.
To enable this cache, update the worker configuration with the properties below at
etc/catalog/<catalog-name>.properties
cache.enabled=true
cache.type=ALLUXIO
# Set the maximum cache size based on your requirements
cache.alluxio.max-cache-size=150GB
cache.base-directory=file:///mnt/disk1/cache
Also, add the Alluxio property below to both the coordinator and worker etc/jvm.config to emit all metrics related to the Alluxio cache:
-Dalluxio.user.app.id=presto
#2 Fragment Result Set Cache
This is an intermediate result set cache that lets you cache partially computed result sets on the worker’s local SSD. It prevents duplicated computation across multiple queries, which improves query performance and decreases CPU usage.
Add the following properties to etc/config.properties:
fragment-result-cache.enabled=true
fragment-result-cache.max-cached-entries=1000000
fragment-result-cache.base-directory=file:///data/presto-cache/2/fragmentcache
fragment-result-cache.cache-ttl=24h
#3 Metastore Cache
The Presto coordinator caches table metadata (schema, partition list, and partition info) to avoid long getPartitions calls to the metastore. This cache is versioned to confirm the validity of cached metadata.
To enable the metastore cache, set the properties below in etc/catalog/<catalog-name>.properties
hive.metastore-cache-scope=PARTITION
hive.metastore-cache-ttl=2d
hive.metastore-refresh-interval=3d
hive.metastore-cache-maximum-size=10000000
#4 File List Cache
The Presto coordinator caches file lists for remote storage partition directories to avoid long listFile calls to remote storage. This is a coordinator in-memory cache.
Enable the file list cache by setting the properties below in
etc/catalog/<catalog-name>.properties
# List file cache
hive.file-status-cache-expire-time=24h
hive.file-status-cache-size=100000000
hive.file-status-cache-tables=*
#5 File Metadata Cache
This cache keeps open file descriptors and stripe/file footer information in worker memory. These pieces of data are the most frequently accessed when reading files. The cache is useful not only for decreasing query latency but also for reducing CPU utilization.
It is an in-memory cache and is available for the ORC and Parquet file formats.
For ORC, it includes file tail (postscript, file footer, file metadata), stripe footer, and stripe stream (row indexes/bloom filters).
For Parquet, it caches the file and block-level metadata.
To enable the file metadata cache, set the properties below in etc/catalog/<catalog-name>.properties
# For ORC metadata cache:
<catalog-name>.orc.file-tail-cache-enabled=true
<catalog-name>.orc.file-tail-cache-size=100MB
<catalog-name>.orc.file-tail-cache-ttl-since-last-access=6h
<catalog-name>.orc.stripe-metadata-cache-enabled=true
<catalog-name>.orc.stripe-footer-cache-size=100MB
<catalog-name>.orc.stripe-footer-cache-ttl-since-last-access=6h
<catalog-name>.orc.stripe-stream-cache-size=300MB
<catalog-name>.orc.stripe-stream-cache-ttl-since-last-access=6h
# For Parquet metadata cache:
<catalog-name>.parquet.metadata-cache-enabled=true
<catalog-name>.parquet.metadata-cache-size=100MB
<catalog-name>.parquet.metadata-cache-ttl-since-last-access=6h
The <catalog-name> prefix in the configuration above should be replaced with the name of the catalog you are setting these properties in. For example, if the catalog properties file is named ahana_hive.properties, the prefix should be ahana_hive.
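Continuing with that ahana_hive example, the enable flags from above would then read:
# Example: ahana_hive.properties
ahana_hive.orc.file-tail-cache-enabled=true
ahana_hive.orc.stripe-metadata-cache-enabled=true
ahana_hive.parquet.metadata-cache-enabled=true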
#6 Affinity Scheduler
With affinity scheduling, the Presto Coordinator schedules requests that process certain data/files to the same Presto worker node to maximize the cache hits. Sending requests for the same data consistently to the same worker node means fewer remote calls to retrieve data.
Data caching is not supported with random node scheduling, so this property must be enabled for the RaptorX data IO, fragment result, and file metadata caches to work.
To enable affinity scheduling, set the property below in etc/catalog/<catalog-name>.properties
hive.node-selection-strategy=SOFT_AFFINITY
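For reference, here is a sketch of how the catalog-level settings from the sections above could sit together in one file such as etc/catalog/ahana_hive.properties. The connector.name and hive.metastore.uri lines are only the usual Hive connector basics shown for context, and the metastore host is a placeholder; the fragment result cache settings from section #2 live in config.properties rather than in this file. The data IO and file metadata caches take effect on the workers, while the metastore and file list caches live in coordinator memory.
connector.name=hive-hadoop2
hive.metastore.uri=thrift://<metastore-host>:9083
# Data IO cache (worker local disk)
cache.enabled=true
cache.type=ALLUXIO
cache.alluxio.max-cache-size=150GB
cache.base-directory=file:///mnt/disk1/cache
# Affinity scheduling (required by the data IO, fragment result, and file metadata caches)
hive.node-selection-strategy=SOFT_AFFINITY
# Metastore cache (coordinator in-memory)
hive.metastore-cache-scope=PARTITION
hive.metastore-cache-ttl=2d
hive.metastore-refresh-interval=3d
hive.metastore-cache-maximum-size=10000000
# File list cache (coordinator in-memory)
hive.file-status-cache-expire-time=24h
hive.file-status-cache-size=100000000
hive.file-status-cache-tables=*
# File metadata cache (worker in-memory)
ahana_hive.orc.file-tail-cache-enabled=true
ahana_hive.orc.stripe-metadata-cache-enabled=true
ahana_hive.parquet.metadata-cache-enabled=true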
How Can You Test or Debug Your RaptorX Cache Setup With JMX Metrics?
Each section below lists the queries to run, followed by JMX queries that verify cache usage.
Note: If your catalog is not named ahana_hive, substitute your catalog name for ahana_hive in the queries below.
Data IO Cache
Queries to trigger Data IO cache usage
USE ahana_hive.default;
SELECT count(*) from customer_orc group by nationkey;
SELECT count(*) from customer_orc group by nationkey;
Queries To Verify Data IO Data Cache Usage
-- Cache hit rate
SELECT * FROM jmx.current."com.facebook.alluxio:name=client.cachehitrate.presto,type=gauges";
-- Bytes read from the cache
SELECT * FROM jmx.current."com.facebook.alluxio:name=client.cachebytesreadcache.presto,type=meters";
-- Bytes requested from the cache
SELECT * FROM jmx.current."com.facebook.alluxio:name=client.cachebytesrequestedexternal.presto,type=meters";
-- Bytes written to the cache on each node
SELECT * FROM jmx.current."com.facebook.alluxio:name=Client.CacheBytesWrittenCache.presto,type=meters";
-- The number of cache pages (of size 1MB) currently on disk
SELECT * FROM jmx.current."com.facebook.alluxio:name=Client.CachePages.presto,type=counters";
-- The amount of cache space available
SELECT * FROM jmx.current."com.facebook.alluxio:name=Client.CacheSpaceAvailable.presto,type=gauges";
-- There are many other metrics tables that you can view using the command below
SHOW TABLES FROM jmx.current LIKE '%alluxio%';
Fragment Result Cache
An example of the query plan fragment that is eligible for having its results cached is shown below.
Fragment 1 [SOURCE]
    Output layout: [count_3]
    Output partitioning: SINGLE []
    Stage Execution Strategy: UNGROUPED_EXECUTION
    - Aggregate(PARTIAL) => [count_3:bigint]
            count_3 := "presto.default.count"(*)
        - TableScan[TableHandle {connectorId='hive',
              connectorHandle='HiveTableHandle{schemaName=default, tableName=customer_orc, analyzePartitionValues=Optional.empty}',
              layout='Optional[default.customer_orc{}]'}, gr
              Estimates: {rows: 150000 (0B), cpu: 0.00, memory: 0.00, network: 0.00}
              LAYOUT: default.customer_orc{}
Queries To Trigger Fragment Result Cache Usage
SELECT count(*) from customer_orc;
SELECT count(*) from customer_orc;
Query Fragment Set Result Cache JMX Metrics
-- All Fragment result set cache metrics like cachehit, cache entries, size, etc
SELECT * FROM
jmx.current."com.facebook.presto.operator:name=fragmentcachestats";
ORC Metadata Cache
Queries to trigger ORC cache usage:
SELECT count(*) from customer_orc;
SELECT count(*) from customer_orc;
Query ORC Metadata cache JMX metrics:
-- File tail cache metrics
SELECT * FROM
jmx.current."com.facebook.presto.hive:name=ahana_hive_orcfiletail,type=cachestatsmbean";
-- Stripe footer cache metrics
SELECT * FROM
jmx.current."com.facebook.presto.hive:name=ahana_hive_stripefooter,type=cachestatsmbean";
-- Stripe stream(Row index) cache metrics
SELECT * FROM
jmx.current."com.facebook.presto.hive:name=ahana_hive_stripestream,type=cachestatsmbean";
Parquet Metadata Cache
Queries to trigger Parquet metadata cache:
SELECT count(*) from customer_parquet;
SELECT count(*) from customer_parquet;
Query Parquet Metadata cache JMX metrics:
-- Verify cache usage
SELECT * FROM
jmx.current."com.facebook.presto.hive:name=ahana_hive_parquetmetadata,type=cachestatsmbean";
File List Cache
Query File List cache JMX metrics:
-- Verify cache usage
SELECT * FROM
jmx.current."com.facebook.presto.hive:name=ahana_hive,type=cachingdirectorylister";
This should help you get started with RaptorX multi-level caching with Presto. If you have specific questions about your Presto deployment, head over to the Presto open-source community Slack, where you can get help.