Neo4j and Virtual Nodes/Relationships
Let's take a look at Neo4j and virtual nodes and relationships.
Overview
It's often the case that the database schema used for loading data doesn't translate well to querying or reporting, such as generating aggregate or summary reports from the source-of-truth transactional data. With relational databases, a common solution is to define a second schema better aligned with the reporting requirements. The data is then extracted from the transactional database, transformed, and loaded into the new schema, a process known as ETL (extract, transform, load).
Schemas created for Neo4j and other NoSQL databases may present similar reporting challenges, which can be solved in different ways. For Neo4j, virtual nodes and relationships are a means of transforming the data in place without creating a separate schema or data store.
Context
The data set used in this article comes from federally mandated filings for political lobbying in the United States. For more background, refer to Loading US Lobbying Data into Neo4J and Analyzing US Lobbying Data in Neo4J. The source project can be found on GitHub.
The data schema used is below:
Definitions
- Filing: Represents a single lobbying effort. It holds the details of the filing (unique identifier, period represented, dollar amount spent, date of filing, detailed description) and has relationships to additional details.
- Client: Special interest groups — e.g., corporations, non-profits, industries, national and international governments — advocating for/against legislation or regulations under consideration by the federal government.
- Lobbyist: A professional hired by the client to present the client's position and persuade the federal government to adopt that position with regard to proposed legislation and regulations.
- Registrant: The organization performing lobbying activities on behalf of the client, registered with the US government. Clients may lobby on their own behalf, acting as both client and registrant, or may hire firms that specialize in lobbying and employ lobbyists.
- Government Entity: A department, regulatory agency, commission, or branch of government that is lobbied. Multiple entities are usually associated with a single filing; by far the most lobbied entities are the two chambers of the legislative branch, the Senate and the House of Representatives.
- Issue: Filings are assigned to general categories to simplify reporting, such as Education, Transportation, and Natural Resources. The specifics of the lobbying effort are described with each filing.
Problem: Visualizing the Raw Data
Let's explore the loaded data set, starting with a registrant and branching out. You'll see how cluttered the browser becomes as we navigate relationships by expanding nodes.
Step 1
MATCH a single (:Registrant).
MATCH (r:Registrant) RETURN r LIMIT 1
Step 2
Double-click on the (:Registrant) node to display related nodes.
Step 3
Expand a single (:Filing) node.
Step 4
Expand a (:GovernmentEntity) node.
Step 5
Expand another (:GovernmentEntity) node.
(:Filing) nodes quickly overwhelm the graph because they are the unifying node type, related to all other node types. Unlike the other node types in this schema, every filing is a singleton. Any worthwhile Cypher query, whether used for visualization or for returning tabular data, must include (:Filing), adding complexity and noise to the results.
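To get a rough sense of how heavily one node type dominates the graph, a query along these lines counts nodes per label combination. It is only a sketch: it uses nothing beyond core Cypher, the aliases are illustrative, and it scans every node, so it may be slow on a large database.

```cypher
// Count nodes per label combination, largest first.
// Scans the whole graph; intended for a quick look only.
MATCH (n)
RETURN labels(n) AS nodeLabels, count(*) AS nodeCount
ORDER BY nodeCount DESC
```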
Solution: Virtual Nodes and Relationships
Definition
Virtual nodes and relationships are created using the Neo4j APOC library in a Cypher statement. Unlike nodes and relationships created and stored in Neo4j, virtual nodes and relationships are transitory and exist only during query execution.
After installing the APOC library, change the Neo4j configuration file to grant the apoc.nodes functions unrestricted access.
dbms.security.procedures.unrestricted=apoc.nodes.*
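Once Neo4j has been restarted with the new setting, one simple way to confirm that APOC is loaded is to ask it for its version. This is a minimal check rather than a full verification of the security setting, and the alias name is illustrative.

```cypher
// Returns the installed APOC version; the call fails if APOC is not available.
RETURN apoc.version() AS apocVersion
```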
Usage
The functions for creating virtual nodes and relationships are fairly simple.
WITH apoc.create.vNode(['vnode'], {name: 'one'}) AS one,
     apoc.create.vNode(['vnode'], {name: 'other'}) AS other
RETURN one,
       other,
       apoc.create.vRelationship(one, 'related', {name: 'one-to-other relationship'}, other) AS vrel
Both virtual nodes and virtual relationships may have properties, supplied at creation time as a Cypher map literal.
You can tell persisted and virtual nodes/relationships apart by their Neo4j-generated IDs, which are negative for virtual ones.
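The ID convention can be checked with a short query such as the sketch below. It assumes APOC is installed and that your Neo4j version still supports the id() function on virtual entities (newer releases deprecate id()); the alias names are illustrative.

```cypher
// Create a throwaway virtual node and inspect its generated ID.
WITH apoc.create.vNode(['vnode'], {name: 'one'}) AS one
RETURN id(one) AS generatedId,
       id(one) < 0 AS looksVirtual
```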
Example 1: What Government Entities Does a Registrant Lobby?
The question is whether registrants target specific government entities in their lobbying efforts. Individual filings aren't important here; what we want to know is how many filings each registrant made and their total dollar amount. To do this, we can create virtual nodes for the registrants and government entities and a virtual relationship between the two.
Solution #1
The following is the complete Cypher command, which I'll break out into explainable chunks.
// Part 1: find the filings of interest and collect the distinct names
MATCH (r:Registrant)-[:FILED]->(f:Filing)-[:TARGETED_AT]->(g:GovernmentEntity)
WITH r, f, g, SUM(f.amount) AS amt,
     apoc.date.fields(LEFT(f.receivedOn, 10), 'yyyy-MM-dd') AS received
WHERE received.years = 2018 AND
      received.months = 3 AND
      amt > 100000 AND
      g.name <> 'SENATE' AND
      g.name <> 'HOUSE OF REPRESENTATIVES'
WITH COLLECT(DISTINCT r.name) AS registrants,
     COLLECT(DISTINCT g.name) AS gents
// Part 2: create one virtual node per name, then index the nodes by name
WITH [gname IN gents | apoc.create.vNode(['gent'], {name: gname})] AS gNodes,
     [rname IN registrants |
        apoc.create.vNode(['Registrant'], {name: rname})] AS rNodes
WITH apoc.map.groupBy(gNodes, 'name') AS gvs,
     apoc.map.groupBy(rNodes, 'name') AS rvs
// Part 3: re-query the persisted nodes with the same filter
MATCH (r:Registrant)-[:FILED]->(f:Filing)-[:TARGETED_AT]->(g:GovernmentEntity)
WITH gvs, rvs, r, f, g, SUM(f.amount) AS amt,
     apoc.date.fields(LEFT(f.receivedOn, 10), 'yyyy-MM-dd') AS received
WHERE received.years = 2018 AND
      received.months = 3 AND
      amt > 100000 AND
      g.name <> 'SENATE' AND
      g.name <> 'HOUSE OF REPRESENTATIVES'
// Part 4: connect the virtual nodes with an aggregating virtual relationship
RETURN rvs,
       gvs,
       apoc.create.vRelationship(rvs[r.name], 'LOBBIED',
         {filingCnt: COUNT(f), filingAmt: SUM(f.amount)},
         gvs[g.name]) AS rel
Part 1
Identify and filter the persisted data of interest, in this example, filings from March 2018 over $100,000, ignoring the legislative branch since the vast majority of filings include either the House, Senate, or both.
The names of registrants and government entities matched are collected into a list to allow iterating later.
MATCH (r:Registrant)-[:FILED]->(f:Filing)-[:TARGETED_AT]->(g:GovernmentEntity)
WITH r, f, g, SUM(f.amount) AS amt,
     apoc.date.fields(LEFT(f.receivedOn, 10), 'yyyy-MM-dd') AS received
WHERE received.years = 2018 AND
      received.months = 3 AND
      amt > 100000 AND
      g.name <> 'SENATE' AND
      g.name <> 'HOUSE OF REPRESENTATIVES'
WITH COLLECT(DISTINCT r.name) AS registrants,
     COLLECT(DISTINCT g.name) AS gents
Part 2
Next, iterate through the names and create a virtual node for each one. The created nodes are then grouped into a map keyed by each node's name property.
WITH [gname IN gents | apoc.create.vNode(['gent'], {name: gname})] AS gNodes,
     [rname IN registrants |
        apoc.create.vNode(['Registrant'], {name: rname})] AS rNodes
WITH apoc.map.groupBy(gNodes, 'name') AS gvs,
     apoc.map.groupBy(rNodes, 'name') AS rvs
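To see what apoc.map.groupBy produces in isolation, the sketch below applies it to two literal maps instead of virtual nodes. The sample names and the cnt property are made up for illustration; the result is a single map whose keys are the name values and whose values are the original entries, which is what lets a later step look up a node by name.

```cypher
// Group a list of maps by their 'name' property.
// The sample data here is hypothetical and not taken from the lobbying data set.
WITH [{name: 'ALPHA', cnt: 1}, {name: 'BETA', cnt: 2}] AS items
WITH apoc.map.groupBy(items, 'name') AS byName
RETURN byName,
       byName['ALPHA'] AS alphaEntry
```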
Part 3
Re-query the persisted nodes from which the virtual nodes were created, applying the same filter as in Part 1.
MATCH (r:Registrant)-[:FILED]->(f:Filing)-[:TARGETED_AT]->(g:GovernmentEntity)
WITH gvs, rvs, r, f, g, SUM(f.amount) AS amt,
     apoc.date.fields(LEFT(f.receivedOn, 10), 'yyyy-MM-dd') AS received
WHERE received.years = 2018 AND
      received.months = 3 AND
      amt > 100000 AND
      g.name <> 'SENATE' AND
      g.name <> 'HOUSE OF REPRESENTATIVES'
Part 4
Create a virtual relationship between the registrant and government entity directly, adding properties which aggregate the filings in a useful way.
RETURN rvs,
       gvs,
       apoc.create.vRelationship(rvs[r.name], 'LOBBIED',
         {filingCnt: COUNT(f), filingAmt: SUM(f.amount)},
         gvs[g.name]) AS rel
Visualization
The results are much easier to understand when the explicit filings are removed and the aggregates are included as properties on the [:LOBBIED] relationships. In the Neo4j browser, select a relationship to see the total number of filings and the dollar amount spent by the registrant.
Solution #2
Virtual nodes are useful when the persisted nodes are transformed into something more useful than the base node. Solution #1 created them to demonstrate how, but they aren't actually required, since virtual relationships can connect persisted nodes.
The following Cypher gets the same results without creating virtual nodes.
MATCH (r:Registrant)-[:FILED]->(f:Filing)-[:TARGETED_AT]->(g:GovernmentEntity)
WITH r, f, g, SUM(f.amount) AS amt,
     apoc.date.fields(LEFT(f.receivedOn, 10), 'yyyy-MM-dd') AS received
WHERE received.years = 2018 AND
      received.months = 3 AND
      amt > 100000 AND
      g.name <> 'SENATE' AND
      g.name <> 'HOUSE OF REPRESENTATIVES'
RETURN r, g,
       apoc.create.vRelationship(r, 'LOBBIED',
         {filingCnt: COUNT(f), filingAmt: SUM(f.amount)}, g) AS rel
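Newer Neo4j releases can reject a RETURN that mixes aggregates such as COUNT(f) with expressions that are not grouping keys. One possible rewrite, offered as a sketch rather than the original author's query, computes the aggregates in a preceding WITH so that r and g are explicit grouping keys. It assumes f.receivedOn is a string beginning with a yyyy-MM-dd date (as the LEFT(...) call above implies) and applies the $100,000 threshold to each filing's own amount.

```cypher
// Aggregate per (registrant, government entity) pair first,
// then build the virtual relationship from the aggregated values.
MATCH (r:Registrant)-[:FILED]->(f:Filing)-[:TARGETED_AT]->(g:GovernmentEntity)
WHERE f.receivedOn STARTS WITH '2018-03'   // assumed string date, March 2018
  AND f.amount > 100000
  AND NOT g.name IN ['SENATE', 'HOUSE OF REPRESENTATIVES']
WITH r, g, count(f) AS filingCnt, sum(f.amount) AS filingAmt
RETURN r, g,
       apoc.create.vRelationship(r, 'LOBBIED',
         {filingCnt: filingCnt, filingAmt: filingAmt}, g) AS rel
```

Filtering on the string prefix also avoids calling apoc.date.fields for every filing, though whether that matters depends on the size of the data set.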
Conclusion
Transforming a Neo4j schema inline with virtual nodes and relationships provides insights into your data that were not readily available from the original, persisted data. While this article focused on simplified visualizations, transformed tabular results can also be generated by using Cypher's ability to chain query parts together (the WITH clause) and including the virtual nodes/relationships in the chain.
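As a rough illustration of the tabular case, the sketch below builds the virtual relationship in a WITH clause and then reads its properties back as plain columns. It makes assumptions: the date filter treats f.receivedOn as a string beginning with yyyy-MM-dd, the column aliases and the LIMIT are arbitrary, and apoc.any.property is used because it is intended to work on virtual as well as persisted entities.

```cypher
// Chain the virtual relationship through WITH, then flatten it into rows.
MATCH (r:Registrant)-[:FILED]->(f:Filing)-[:TARGETED_AT]->(g:GovernmentEntity)
WHERE f.receivedOn STARTS WITH '2018-03'   // assumed string date
WITH r, g, count(f) AS filingCnt, sum(f.amount) AS filingAmt
WITH r, g,
     apoc.create.vRelationship(r, 'LOBBIED',
       {filingCnt: filingCnt, filingAmt: filingAmt}, g) AS rel
RETURN r.name AS registrant,
       g.name AS governmentEntity,
       apoc.any.property(rel, 'filingCnt') AS filings,
       apoc.any.property(rel, 'filingAmt') AS amount
ORDER BY amount DESC
LIMIT 10
```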