How to Create Data Lineage With the Tableau GraphQL Metadata API
Tableau has a rich metadata API exposed through a GraphQL interface that you can use to extract your data lineage.
Join the DZone community and get the full member experience.
Join For FreeI love data. The ways it can be used to curate value and express relationships never ceases to amaze me. To this extent, visualizing data is often one of the most powerful ways to share insights and Tableau certainly is one of - if not the - most popular data visualization tools on the market. It's extremely simple for non-technical users to develop rich and meaningful graphs with a pretty intuitive UI and there are some really nice features under the hood that are used to speed up query performance when extracts are stored within Tableau.
My absolute favorite Tableau feature is that you can query your metadata using the same GraphQL API that Tableau itself uses. A portion of the metadata exposed includes the lineage for the fields, sheets, tables and data stores that exist within your Tableau Site. Exposing the metadata via an extensive API like this is a really forward thinking idea from the team behind Tableau.
How to Use the Tableau Metadata API
The Tableau Metadata API is exposed via GraphQL and Tableau is wrapped in a python library, the Tableau Server Client. This library is one of the easiest APIs to use - Tableau has simplified all authentication and serialization to allow users to just focus on the query they want to execute.
Pros:
- The graph enables many different entities and data assets within Tableau to be queried
- The API performance is really good, even when requesting a large number of multidimensional relationships
- The Python client is extremely simple and intuitive to use, handling the authentication and serialization for the user
Cons:
- The documentation is sparse - it's not clear when to expect upstream or downstream data lineage assets to be provided or when they will be null
- A "full" lineage for each data asset is not available, you can only extract lineage from one step upstream or one step downstream (at least from what I could tell from using the API)
- Tableau releases a new API version every quarter or so but the docs do not depict which features are available in which version
Let's look at some code that can be used to query your Tableau metadata.
Authentication
You can use the Tableau API by authenticating with your username and password but the more secure and suggested approach is to use a client token. I've also created a simple helper function below to authenticate and execute queries.
x
import os
import tableauserverclient as TSC
TOKEN_NAME = os.environ.get('TOKEN_NAME' ,'some-token')
TOKEN = os.environ.get('TOKEN', 'your-token-value')
SITE_NAME = os.environ.get('SITE_NAME', 'your-site')
# If using Tableau Online this might be 'https://prod-useast-b.online.tableau.com'
SERVER = os.environ.get('SERVER', 'your-server')
SERVER_VERSION = os.environ.get('SERVER_VERSION', '3.9')
tableau_auth = TSC.PersonalAccessTokenAuth(TOKEN_NAME, TOKEN, SITE_NAME)
server = TSC.Server(SERVER)
server.version = SERVER_VERSION
# Helper function to run queries
def run_query(query):
with server.auth.sign_in(tableau_auth):
resp = server.metadata.query(query)
resp = resp['data']
if isinstance(resp, list):
resp = resp[0]
return resp
Define the Query
The Tableau Metadata API is a fantastic way to start learning GraphQL since Tableau handles all of the serialization for you and their Graph follows a consistent and easy to understand set of conventions.
The function below executes a query that will return all calculated fields that exist within your Site. The beautiful thing here with GraphQL is that we can simultaneously ask Tableau to return all of the fields that reference each of the calculated fields and we can go even deeper to request all of the sheets for each field that is referencing a calculated field.
x
def get_all_calculated_fields(batch_size=100):
all_calculated_fields = []
has_next = True
start = 0
while has_next is True:
query = """
{
calculatedFieldsConnection (first: %s, offset: %s){
nodes {
id
name
formula
referencedByFields {
fields {
id
name
sheets {
id
name
}
}
}
}
pageInfo {
hasNextPage
endCursor
}
}
}
""" % (batch_size, start)
resp = run_query(query)
all_calculated_fields.extend(resp['calculatedFieldsConnection']['nodes'])
start = start + batch_size
if resp['calculatedFieldsConnection']['pageInfo']['hasNextPage'] == False:
has_next = False
return all_calculated_fields
Create Your Data Lineage
Now that you have your metadata from Tableau, how you structure and use the output is completely up to you. This example will define edges and nodes. These are the fundamental building blocks for network relationships and data lineage.
x
def format_nodes_and_edges(calc_fields):
nodes = []
edges = []
for calc in calc_fields:
# Add each calculated field to the nodes
calc_field_name = 'CalcField - ' + calc['name']
nodes.append(calc_field_name)
# For each field that references the calculated field, add a node
for ref_field in calc['referencedByFields']:
for field in ref_field['fields']:
# Calculated fields may show up under referenced fields, if that
# happens, do not overwrite the existing node
if field['id'] not in nodes:
field_name = 'Field - ' + field['name']
edges.append((calc_field_name, field_name))
# Create a reference to each sheet that uses this field
for sheet in field['sheets']:
sheet_name = 'Sheet - ' + sheet['name']
nodes.append(sheet_name)
edges.append((field_name, sheet_name))
return list(set(nodes)), edges
View the Edges and Nodes
Running all of the functions above will now result in creating the objects required to visualize your data lineage. The nodes and edges can be plugged into just about any network visualization tool, such as NetworkX, to view the output.
x
calculated_fields = get_all_calculated_fields()
nodes, edges = format_nodes_and_edges(calculated_fields)
nodes
# [
# ...
# 'CalcField - Click-to-Open',
# 'Sheet - Sheet 5',
# 'CalcField - Minutes of Delay per Flight',
# 'Sheet - Opportunities ',
# ...
# ]
edges
# [
# ...
# ('CalcField - Difference from Region', 'Field - State'),
# ('Field - State', 'Sheet - Obesity Scatter Plot'),
# ('Field - State', 'Sheet - Obesity Map'),
# ...
# ]
Closing Thoughts
I applaud Tableau for enabling this form of data access even though I believe this is a highly underutilized and under-leveraged benefit. Many companies do not fully make use of this metadata from Tableau to the full extent. Understanding how data moves and dependencies between data is such a critical feature especially as organizations try to maintain well-managed practices and controls around how their data is used. As you look to leverage Tableau metadata and data lineage within your company, make sure that you're taking the extra step to connect that data lineage with the upstream processes to give a complete and comprehensive perspective of your lineage.
Published at DZone with permission of Grant Seward. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments