Data Store Options for Operational Analytics/Data Engineering
Using Managed SQL server instance in Azure with Column-store indexes for better performing Data Analytics and Transaction Processing.
Join the DZone community and get the full member experience.
Join For FreeIn this article, we will delve into essential concepts within the domain of analytics databases, conducting a comparative analysis of the available offerings for Azure SQL databases based on these foundational principles. Let’s look at some key concepts before we delve into the data storage options in Azure.
Clustered Index
- A B-Tree clustered index organizes rows physically in memory in sorted order, automatically created when the primary key is established.
- The key advantage of a clustered index lies in the swift searching of a range of values. Internally utilizing a B-Tree data structure, the leaf node of the B-Tree clustered index contains the actual table data.
- It is important to note that only one clustered index can be created for a table.
Non-Clustered Index
- A non-clustered index also employs a B-Tree data structure, with the distinction that the leaf node of the B-Tree or non-clustered index contains pointers to the pages containing the actual table data.
- Unlike a clustered index, a non-clustered index does not organize rows physically in memory in a sorted order.
- Importantly, it is permissible to create more than one non-clustered index for a table.
Clustered Column Store
Clustered column-store storage involves organizing all data in a table in a columnar format, significantly compressing the data and facilitating rapid execution of analytical queries and reports. Depending on the data characteristics, data size may be reduced by a factor of 10x to 100x. The clustered column-store model excels in the quick ingestion of substantial data volumes (bulk-load) as large batches exceeding 100,000 rows undergo compression before storage on disk. This model is particularly well-suited for classic data warehouse scenarios.
Non-Clustered Column-Store
In the non-clustered column-store model, data is stored in the traditional row-store table, supplemented by an index in column-store format specifically designed for analytical queries. This configuration supports Hybrid Transactional-Analytic Processing (HTAP), allowing the seamless execution of real-time analytics alongside transactional workloads. Operational queries (OLTP) are processed on the row-store table, optimized for accessing a limited set of rows, while analytical queries (OLAP) leverage the column-store index, preferred for scans and complex analytics. The query optimizer dynamically selects between row-store and column-store formats based on the nature of the query. Importantly, non-clustered column-store indexes maintain the original data size, preserving the dataset in the original row-store table unaltered. However, the size of the additional column-store index is typically orders of magnitude smaller than an equivalent B-tree index.
When and Where Should We Use Clustered Column-Store Index?
The Clustered Column-store index primarily targets analytics workloads. The table below shows the common scenarios that can be achieved using this technology.
Column-store Option |
Workload |
Compression |
Clustered Column-store Index |
|
10x on average |
CCI/NCI (with one or more nonclustered indexes) |
|
10X on average + additional storage for NCIs |
The selection of databases in Azure depends on the nature of the workload, the specific features offered by each Azure database service, and the performance characteristics required for analytics engineering tasks. Azure provides a range of SQL database services with different capabilities to cater to diverse use cases in the analytics domain.
SQL Virtual Machines (IAAS)
SQL virtual machines provide comprehensive administrative control over both the SQL Server instance and the underlying operating system during the migration to Azure. Key capabilities include:
- Full access to SQL Server and the operating system.
- Extensive support for various versions of SQL Server and the operating system.
- Automated manageability features tailored for SQL Server.
- This offering allows for the flexibility to choose the desired SQL Server version, compute resources, and storage options based on specific requirements.
Azure SQL Managed Instances: (PAAS)
SQL Managed Instance offers SQL Server access and feature compatibility, providing a seamless option for deploying on the infrastructure of your preference and facilitating the migration of SQL from on-premises to the cloud. Key capabilities include:
- Deployment flexibility on the infrastructure of your choice.
- Built-in management features for enhanced control.
- Native support for virtual networks.
- Fully managed as a service, reducing operational overheads.
Azure SQL Database (Platform as a Service or Database as a Service)
This flavor of database services is ideally suited for modern, cloud-native applications that demand a fully managed database with consistent performance. Key capabilities include:
- Hyperscale storage, supporting up to 100TB of data.
- Serverless computing, offering flexibility and cost savings.
- Simplified management for increased operational efficiency.
Deployment Options:
- Single Database: Enables swift deployment for individual database needs.
- Elastic Pools: Facilitates resource sharing among a group of databases, reducing costs and streamlining management for specific applications.
- Hyperscale: Allows databases to scale beyond the 4 TB limit of a single database, ideal for large-scale data requirements.
- Serverless: Cost-effective option with auto-pause functionality for non-production workloads that do not require continuous database access.
Query Performance Benchmark (Provided by Microsoft)
Key point to note below is that with clustered columnstore index, the example query runs 5x faster on P1 compared to the same query running on P15 with rowstore with no tuning. This can significantly lower the cost you need to pay to meet your workload requirements.
Pricing Tier |
With Rowstore |
With Columnstore |
Performance Gains |
P1 |
30.6 secs |
4.2 secs |
14x |
P15 |
19.5 secs |
0.319 secs |
60x |
Storage Size: The storage savings with column-store compared to PAGE or NONE compressed tables are shown below. While the cost of storage is already included with AzureDB, but lower storage can enable you to choose a lower tier. Note that this is generated test data, so the compression is lower than one would get for customer workloads.
Number of Rows |
Size Rowstore (MB) |
|
Size columnstore (MB) |
Savings |
3626191 |
212 (PAGE compression) |
|
120 |
1.8x |
3626191 |
756 (NONE compression) |
|
120MB |
6.2x |
The best part of columnstore index technology is that it does not require any changes to your application. All you need to do is to either create or replace an existing index with columnstore index on your table(s).
In conclusion, Azure SQL Database stands out as the optimal choice for operational analytics, offering a tailored solution that leverages columnar storage indices with robust transaction support. This strategic combination not only ensures efficient storage and retrieval of data but also provides a foundation for seamless transactional operations, making it the ideal platform for organizations seeking enhanced performance and analytical capabilities.
Opinions expressed by DZone contributors are their own.
Comments