Reading Table Metadata With Flight SQL
An introduction to Apache Arrow Flight SQL, the benefits of Flight SQL, and how to read table metadata with the Python Flight SQL client to explore your data.
Apache Arrow Flight SQL is “a new client-server protocol developed by the Apache Arrow community for interacting with SQL databases that makes use of the Arrow in-memory columnar format and the Flight RPC framework.” In this article, we’ll explore some of the advantages of Arrow Flight SQL, as well as how to read table metadata from an SQL database. Reading table metadata from an SQL database is crucial for several reasons.
First, it provides essential information about the table's structure, including column names, data types, and constraints.
Second, by understanding this structure, developers can craft accurate and efficient SQL queries.
Third, metadata helps maintain data integrity and consistency, making sure that operations align with the table's design.
Lastly, it aids in database documentation and schema evolution, helping teams to keep track of changes and maintain interoperability.
Benefits of Arrow Flight SQL
To grasp the advantages of Arrow and Arrow Flight SQL, one must first recognize their predecessors. APIs like Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are traditional gateways to databases. Think of JDBC as the Java counterpart to ODBC, which has its roots in C. These APIs set the standards for database interactions, like crafting SQL queries, managing outcomes, and error responses. For years, JDBC and ODBC have been the go-to for:
Launching queries
Setting up prepared statements
Procuring SQL dialect details and supported types
Yet, they don't establish a wire format — the blueprint that dictates how to organize data for efficient network transmission. This is where formats like JSON, XML, and Protobuf come in, requiring a definition by the database driver.
Arrow Flight SQL offers a breath of fresh air for database developers. Instead of inventing a new wire protocol, they can adopt the Arrow columnar format. While ODBC and JDBC lean toward row-based architectures, they're less accommodating to columnar formats. This is where Arrow shines — it negates the need to reshape data found in row-based APIs, facilitating seamless communication with Arrow-centric databases. This happens without cumbersome data transformations or intensive serialization processes.
Additionally, Arrow Flight SQL equips developers with built-in encryption and authentication features, substantially minimizing the developmental workload. Much of this convenience stems from the Flight RPC framework, which is layered over gRPC. To top it off, Arrow Flight SQL introduces more sophisticated tools like parallel data access. This feature breaks down large data read or write tasks into smaller, concurrent assignments. Considering the data might be spread across numerous disks, nodes, or database sections, a single query could simultaneously tap into various partitions, thereby enhancing efficiency.
Reading Table Metadata
Here is a Python script to read table metadata with the Apache Arrow Python Flight SQL Client:
from flightsql import FlightSQLClient, TableRef
from pyarrow.ipc import read_message, read_schema

# Instantiate a FlightSQLClient configured for a database
client = FlightSQLClient(host='<your host>',
                         token='<your token>',
                         metadata={'database': '<your database>'},
                         features={'metadata-reflection': 'true'})

def punch_ticket(flight_info):
    ticket = flight_info.endpoints[0].ticket
    reader = client.do_get(ticket)
    table = reader.read_all()
    return table

table_ref = TableRef("task_log")
flight_info = client.get_tables(include_schema=True)
tables_table = punch_ticket(flight_info)
print(type(tables_table))
print(tables_table.column_names)

name_column = tables_table.column('table_name').to_pylist()
schema_column = tables_table.column('table_schema').to_pylist()

column_definitions = []
tables_meta_data = {}

def pack_schema(schema_bytes):
    msg = read_message(schema_bytes)
    schema = read_schema(msg)
    column_definitions.append(tuple(zip(schema.names, schema.types)))

for schema_bytes in schema_column:
    pack_schema(schema_bytes)

tables_meta_data = dict(zip(name_column, column_definitions))
print(tables_meta_data)

# for name, schema in tables_meta_data.items():
#     print(f"Table Name: {name}")
#     print("Schema:")
#     for column, column_type in schema:
#         print(f"column: {column}, type: {column_type}")
The script above does the following:
Imports and configuration: First, it imports modules related to FlightSQL and PyArrow. It also creates a FlightSQLClient instance to connect to a specific database using the provided host, token, and other metadata.
Retrieve table data with `punch_ticket` function:
Accepts Flight information as input.
Extracts a ticket from the Flight information.
Uses this ticket to fetch data from the database, converting it into a table format.
Returns the table.
Request table information:
Establishes a reference to the "task_log" table.
Requests details about available tables, including their schemas, via `get_tables(include_schema=True)`.
Extraction of column data:
Prints the type and column names of the retrieved table.
Converts the 'table_name' and 'table_schema' columns into Python lists for further processing.
Decode schema with `pack_schema` function:
Converts the provided schema bytes into readable schema information.
Appends each schema's column names and types to the `column_definitions` list.
Construct and display metadata:
Pairs table names with their corresponding column definitions.
Prints the constructed metadata dictionary (`tables_meta_data`), showing each table's name alongside its schema.
Specific Example
You can use Apache Arrow Flight SQL to get table metadata from any SQL database. Let’s take a look at how to get data from InfluxDB v3, for example. InfluxDB is a database for storing time series data. It’s built on the Apache ecosystem, leveraging technologies like DataFusion, Arrow, and Parquet to enable efficient writes, storage, and querying.
We just have to instantiate the client:
# Instantiate a FlightSQLClient configured for a database
client = FlightSQLClient(host='us-east-1-2.aws.cloud2.influxdata.com',
                         token='xxx==',
                         metadata={'database': 'cpu'},
                         features={'metadata-reflection': 'true'})
The rest of the script remains the same. In this instance, we are querying a database called `cpu`. This works for any InfluxDB table with a schema type of `iox`; all tables in InfluxDB v3 Cloud have this schema type. You can also run the SHOW TABLES statement in the InfluxDB UI to confirm a table’s schema type.
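If you’d rather stay in Python than switch to the UI, the same check can be run through the client itself: `FlightSQLClient.execute` returns a `FlightInfo`, which is redeemed with `do_get` just like before. A hedged sketch — `show_tables` is our own helper name, and it assumes the `client` instance configured above:

```python
def show_tables(client):
    # execute() runs the query and returns FlightInfo rather than rows
    info = client.execute('SHOW TABLES')
    # Redeem the ticket to stream the result back as an Arrow table
    reader = client.do_get(info.endpoints[0].ticket)
    return reader.read_all()

# Usage (requires a live connection):
# tables = show_tables(client)
# print(tables.to_pandas())
```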
Final Thoughts
Apache Arrow Flight SQL is a powerful tool for transporting large datasets over a network interface. Understanding how to read table metadata is important for crafting SQL queries.