Understanding PolyBase and External Stages: Making Informed Decisions for Data Querying
Explore the differences and strengths of PolyBase and Snowflake external tables to optimize data querying strategies and achieve efficient data integration.
Join the DZone community and get the full member experience.
Join For FreeIn the realm of Big Data and AI, data lakes and lake houses play a vital role in bringing insights and advanced analytics. The ability of the lake house is measured effectively when it can query data from external sources like cloud storage.
We have two popular methodologies for querying data from external sources:
- PolyBase
- External tables
PolyBase is associated with the Microsoft technologies of SQL Server and Synapse Analytics, which offers the ability to query data from external sources. A similar provision is available in the Snowflake warehouse through external tables. This comprehension helps you make decisions about which approach to use.
What Is PolyBase?
PolyBase is a data virtualization technology used in Microsoft SQL Server and Azure Synapse Analytics. It allows users to query data from external sources, such as Hadoop, Azure Blob Storage, and other relational databases as if they were part of the local database. PolyBase abstracts the complexities of accessing external data, enabling seamless integration and query execution without moving data to a local database, reducing redundancies, and optimizing storage across applications. It leverages the parallel processing capabilities of SQL Server and Azure Synapse Analytics to execute queries efficiently, even on large external datasets.
What Are External Tables in Snowflake?
External tables allow us to query data stored in external locations, such as Amazon S3, Azure Blob Storage, and Google Cloud Storage without loading it into data warehouse, reducing the need for data ingestion and duplication. It provides a mechanism to access and query data in its original format by leveraging Snowflake's scalable architecture ensuring efficient performance when querying large external datasets. It supports various formats of Parquet, ORC, Avro, JSON, and CSV, providing flexibility in accessing different types of data.
PolyBase To Read Data Lake in Azure SQL DW
Snowflake External Table
How To Create a PolyBase for Azure Synapse
We have data stored in Azure Blob Storage. Let’s see the PolyBase creation on Azure Synapse Analytics to query and analyze this data directly from its external location.
Step 1
- Create an external data source and external file format For Azure Blob Storage:
-- Create an external data source for Azure Blob Storage
CREATE EXTERNAL DATA SOURCE BlobStorage
WITH (
TYPE = HADOOP,
LOCATION = 'wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/',
CREDENTIAL = BlobStorageCredential -- Credential object must be created separately
);
-- Create external file format for CSV files
CREATE EXTERNAL FILE FORMAT CSVFormat
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
STRING_DELIMITER = '"',
FIRST_ROW = 2 -- Skip header row if present
)
);
----------------------------------------------
Create External Table:
-- Create external table to query sales data
CREATE EXTERNAL TABLE SalesData
(
OrderID INT,
ProductID INT,
Quantity INT,
Price DECIMAL(10, 2),
OrderDate DATE
)
WITH (
LOCATION = '/sales/', -- Path within Azure Blob Storage container
DATA_SOURCE = BlobStorage,
FILE_FORMAT = CSVFormat
);
----------------------------------
Querying External Table:
-- Query external table to analyze sales data
SELECT
OrderDate,
SUM(Quantity * Price) AS TotalSales
FROM
SalesData
WHERE
OrderDate BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY
OrderDate
ORDER BY
OrderDate;
PolyBase With SQL Server
- Set up external data source in SQL Server:
-- Create external data source for Hadoop
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (
TYPE = HADOOP,
LOCATION = 'hdfs://hadoop-cluster-name:8020/',
CREDENTIAL = HadoopCredential -- Credential object must be created separately
);
----------------------------------
Create External Table in SQL Server:
-- Create external table to query customer data from Hadoop
CREATE EXTERNAL TABLE CustomerData
(
CustomerID INT,
FirstName VARCHAR(50),
LastName VARCHAR(50),
Email VARCHAR(100),
RegistrationDate DATE
)
WITH (
LOCATION = '/path/to/data/', -- Path within Hadoop cluster
DATA_SOURCE = HadoopCluster,
FILE_FORMAT = TEXTFILE -- Specify file format if necessary
);
----------------------------------
- Query external table in SQL Server:
-- Query external table to analyze customer data
SELECT
FirstName,
LastName,
COUNT(*) AS TotalOrders
FROM
CustomerData cd
INNER JOIN Orders o ON cd.CustomerID = o.CustomerID
GROUP BY
FirstName,
LastName
ORDER BY
TotalOrders DESC;
PolyBase With Azure Synapse Analytics
- Set up external data source in Azure Synapse Analytics:
-- Create external data source for on-premises SQL Server
CREATE EXTERNAL DATA SOURCE OnPremSQLServer
WITH (
TYPE = RDBMS,
LOCATION = 'your-sql-server.database.windows.net',
DATABASE_NAME = 'YourDatabase',
CREDENTIAL = OnPremSQLCredential -- Credential object must be created separately
);
- Create external table in Azure Synapse Analytics:
-- Create external table to query product inventory data from on-premises SQL Server
CREATE EXTERNAL TABLE ProductInventory
(
ProductID INT,
ProductName VARCHAR(100),
QuantityOnHand INT,
LastUpdated DATETIME
)
WITH (
LOCATION = 'dbo.Inventory', -- Table or view name in SQL Server database
DATA_SOURCE = OnPremSQLServer,
-- Specify credential if required
CREDENTIAL = OnPremSQLCredential
);
- Query external table in Azure Synapse Analytics:
-- Query external table to analyze product inventory data
SELECT
ProductName,
SUM(QuantityOnHand) AS TotalInventory
FROM
ProductInventory
GROUP BY
ProductName
ORDER BY
TotalInventory DESC;
Step 2
External Tables in Snowflake
Let’s demonstrate how to create and query an external table from the data stored in Amazon S3.
- Create stage for external data:
-- Create a stage pointing to Amazon S3 bucket
CREATE OR REPLACE STAGE s3_stage
URL = 's3://your-bucket-name/path/to/files/'
CREDENTIALS = (AWS_KEY_ID = 'your-access-key-id' AWS_SECRET_KEY = 'your-secret-key');
-- List files in the stage
LIST @s3_stage;
- Create external table:
-- Create external table to query customer demographic data
CREATE OR REPLACE EXTERNAL TABLE customer_demographics
(
customer_id INT,
first_name VARCHAR(50),
last_name VARCHAR(50),
age INT,
city VARCHAR(50),
state VARCHAR(50),
country VARCHAR(50)
)
WITH LOCATION = @s3_stage
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1);
-- Describe external table schema
DESCRIBE customer_demographics;
- Query external table:
-- Query external table to analyze customer demographics
SELECT
country,
COUNT(*) AS num_customers
FROM
customer_demographics
GROUP BY
country
ORDER BY
num_customers DESC;
External Tables in Google BigQuery
- Create external data source (cloud storage):
-- Create an external data source pointing to Google Cloud Storage
CREATE EXTERNAL DATA SOURCE gcs_data_source
TYPE = GOOGLE_CLOUD
OPTIONS (
bucket_uri = 'gs://your-bucket-name/path/to/files/'
);
- Create external table:
-- Create external table to query customer purchase data
CREATE EXTERNAL TABLE customer_purchases
(
transaction_id INT64,
customer_id INT64,
product_id INT64,
purchase_date DATE,
amount FLOAT64
)
USING CSV
OPTIONS (
skip_leading_rows = 1, -- Skip header row
format = 'CSV',
field_delimiter = ','
)
LOCATION 'gs://your-bucket-name/path/to/files/';
- Query external table:
-- Query external table to analyze customer purchases
SELECT
customer_id,
COUNT(transaction_id) AS num_transactions,
SUM(amount) AS total_spent
FROM
customer_purchases
WHERE
purchase_date BETWEEN DATE('2023-01-01') AND DATE('2023-12-31')
GROUP BY
customer_id
ORDER BY
total_spent DESC;
External Table in Snowflake for Azure Blob Storage
- Create an external stage:
-- Create an external stage pointing to Azure Blob Storage
CREATE OR REPLACE STAGE sales_stage
URL = 'azure://<storage-account-name>.blob.core.windows.net/<container-name>/sales/'
CREDENTIALS = (AZURE_SAS_TOKEN = 'your-sas-token');
Create External Table:
-- Create external table to query sales data from Azure Blob Storage
CREATE OR REPLACE EXTERNAL TABLE sales_data_external
(
OrderID INT,
ProductID INT,
Quantity INT,
Price DECIMAL(10, 2),
OrderDate DATE
)
USING (DATA_SOURCE_NAME = 'AZURE_STORAGE',
LOCATION = '@sales_stage'
);
- Query external table:
-- Query external table to analyze sales data
SELECT
OrderDate,
SUM(Quantity * Price) AS TotalSales
FROM
sales_data_external
WHERE
OrderDate BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY
OrderDate
ORDER BY
OrderDate;
Conclusion
Both PolyBase and Snowflake external tables offer powerful capabilities for querying external data, but their suitability depends on the specific needs and infrastructure of the organization. PolyBase is a robust choice for organizations with diverse data ecosystems and a strong Microsoft technology presence. In contrast, Snowflake external tables excel in cloud-native environments, providing seamless integration with cloud storage services and scalable performance. By understanding the differences and strengths of each approach, organizations can make informed decisions to optimize their data querying strategies and achieve efficient data integration.
Opinions expressed by DZone contributors are their own.
Comments