Query a Database With Arrow Flight
This post walks you through querying a database with Arrow Flight and Arrow Flight SQL, the differences between the two libraries, and the advantages of each.
Join the DZone community and get the full member experience.
Join For FreeArrow Flight is a “new general-purpose client-server framework to simplify high-performance transport of large datasets over network interfaces.” Flight uses gRPC and IPC protocols as its foundation. The structure of Flight primarily focuses on the transmission of Arrow record batches. The protocols for transmitting methods and data use Protobuf outlines, providing compatibility with clients equipped to handle gRPC and Arrow individually, even if they do not support Flight directly. Moreover, additional enhancements fine-tune Flight to reduce performance penalties typically associated with Protobuf use, particularly those related to unnecessary memory duplication.
You can also use Apache Arrow Flight SQL to query a database. Apache Arrow Flight SQL is “a new client-server protocol developed by the Apache Arrow community for interacting with SQL databases that makes use of the Arrow in-memory columnar format and the Flight RPC framework.” Essentially, the Flight SQL clients wrap the underlying Flight client. However, it also provides methods for defining streams of Arrow record batches and how to read them. FlightSQL is useful for writing database-agnostic code for different databases that support it. This more general approach is what this article covers. For example, if you are creating a UI or other user experience that might query different kinds of databases.
However, if you intend to use InfluxDB, we suggest using the upstream Flight libraries directly, because they have simpler dependencies and support both InfluxQL and SQL. In this post, we’ll learn how to use Apache Flight to query a database. Specifically, we’ll be querying InfluxDB 3.0, however, you can use this approach to query a variety of other data tools. We’ll also learn how to use FlightSQL for the sake of comparison.
Requirements
To follow this tutorial, please ensure you meet the following requirements.
Python Environment: Ensure Python is installed on your machine.
Dependencies: Install necessary Python packages like Yarrow, which includes support for Arrow Flight SQL.
pip install pyarrow
Specific Example
Querying a database (like InfluxDB v3) with Arrow Flight involves the following steps:
- Create a FlightClient by passing in the URL.
- Write a JSON script that has the necessary information for running the query on the specific platform. For example, you include the query and query language that you want to use to query your database.
- Create a ticket from that JSON script.
- Return a reader that you can use to do what you want.
from pyarrow.flight import FlightClient, Ticket
import json
host = "us-east-1-2.aws.cloud2.influxdata.com"
database = 'my_db'
sql = "SELECT * from my_table"
client = FlightClient(f"grpc+tls://{host}:443")
ticket_data = {
"namespace_name": "my_db",
"sql_query": sql,
"query_type": "sql",
}
ticket_bytes = json.dumps(ticket_data)
ticket = Ticket(ticket_bytes)
flight_reader = client.do_get(ticket)
arrow_table = flight_reader.read_all()
data_frame = arrow_table.to_pandas()
print(data_frame.to_markdown())
Note how you specify the sql_query. If you are querying InfluxDB exclusively, you could easily switch between SQL and InfluxQL, making ArrowFlight a more convenient tool for querying InfluxDB specifically.
Comparison: Leveraging Apache Arrow Flight SQL Python Client
Sometimes the best way to learn is by contrast or comparison. Here’s an example of how to query InfluxDB with the Apache Arrow Flight SQL Python Client. First, you need to install the Arrow Flight SQL Python Client with:
pip install flightsql-dbapi
As you can see, using the Arrow Flight SQL Python Client is quite similar to using the Arrow Flight Python Client. However, the FlightSQL Client contains Readers that return Arrow Record batches in streams and methods to read the data from those streams. It also doesn’t provide a query language option, as SQL is used by default.
from flightsql import FlightSQLClient
# Instantiate a FlightSQLClient configured for a database
client = FlightSQLClient(host='cluster-id.influxdb.io',
token='DATABASE_TOKEN',
metadata={'database': 'DATABASE_NAME'},
features={'metadata-reflection': 'true'})
# Execute the query to retrieve FlightInfo
info = client.execute("SELECT * FROM home")
# Extract the token for retrieving data
ticket = info.endpoints[0].ticket
# Use the ticket to request the Arrow data stream.
# Return a FlightStreamReader for streaming the results.
reader = client.do_get(ticket)
# Read all data to a pyarrow.Table
table = reader.read_all()
print(table)
It’s also important to note that the way the FlightSQL clients return streams of data differs quite a lot across different languages. Refer to this repo to see more examples of how to query InfluxDB with C++, Go, and Java the Arrow Flight SQLClients. While those examples are specific to InfluxDB,
Final thoughts
Apache Arrow Flight is a powerful tool for transporting large datasets over a network interface.
If you need any help with Arrow Flight SQL or InfluxDB, please reach out using our community site. I’d love to hear about what you’re trying to achieve and what features you’d like InfluxDB to have. I hope this tutorial helped familiarize you with how to query a database with Arrow Flight.
If you’re interested in learning more about how to use Apache Flight SQL instead, I encourage you to take a look at the following resources:
- A repository with examples on how to query InfluxDB with the Java, Go, and C++ Flight SQL clients
- InfluxData Documentation on using Apache Arrow Flight Clients
- InfluxData Documentation on using Python Flight SQL DBAPI Client
If you’re using InfluxDB specifically, I recommend that you use one of our client libraries because they further wrap those Flight libraries to make writing and reading data to InfluxDB with Arrow simpler. Here are some resources for the InfluxDB v3 client libraries:
- All of the InfluxDB v3 client libraries are maintained here.
- The InfluxDB v3 client library documentation.
Opinions expressed by DZone contributors are their own.
Comments