Sampling A Neo4j Database
After reading the interesting blog post of my colleague Rik van Bruggen on “Media, Politics and Graphs”, I thought it would be really cool to render it as a GraphGist, especially as he had already shared all the queries as a GitHub Gist.
Unfortunately the dataset was a bit large for a sensible GraphGist representation, so I looked for a way to extract a smaller sample of the raw data that he made available (see his blog post for the link).
Considering my last blog post on creating data from sampling a cross product, this should be much easier. We know we want to have all nodes with the labels PARTY, SHOW and GENDER in our graph as well as a sample of GUEST nodes with their relationships.
The first part is easy:
MATCH (n) WHERE n:PARTY OR n:SHOW OR n:GENDER RETURN n;
The second part uses something that was not helpful in my last exploration: random sampling applied directly to a match filters only the first node pattern in the match, and all relationships/paths emanating from each node that passes the filter are still traversed.
MATCH(n:GUEST)-[r]->() WHERE rand() < 0.1 RETURN n,r;
The number you compare rand() to is the fraction of GUEST nodes you want to get back, in this example 0.1, i.e. roughly 10%.
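To get a feeling for what this filter does, here is a minimal Python sketch (not Neo4j code) that mimics the behaviour described above: each guest is kept with probability 0.1, and every relationship of a kept guest comes along. The toy data and the names `sample_guests` and `guests` are made up for illustration.

```python
import random

def sample_guests(guests, fraction, rng):
    """Keep each guest independently with probability `fraction` and
    return (guest, target) rows for all relationships of the kept guests."""
    rows = []
    for guest, targets in guests.items():
        if rng.random() < fraction:      # same idea as WHERE rand() < 0.1
            for target in targets:       # all relationships of a kept guest survive
                rows.append((guest, target))
    return rows

# Hypothetical toy data: 1000 guests, each linked to a party, a show and a gender.
guests = {
    f"guest{i}": [f"party{i % 5}", f"show{i % 3}", f"gender{i % 2}"]
    for i in range(1000)
}

rng = random.Random(42)  # fixed seed so the run is repeatable
rows = sample_guests(guests, 0.1, rng)
kept = {guest for guest, _ in rows}
print(len(kept), "guests kept,", len(rows), "relationship rows")
```

Because every guest is an independent coin flip, the sample is only approximately 10% of the guests and changes from run to run.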
Now I have two nice queries that get me the data. How can I bring them together? With UNION ALL:
MATCH (n) WHERE n:PARTY OR n:SHOW OR n:GENDER RETURN n, null as r UNION ALL MATCH(n:GUEST)-[r]->() WHERE rand() < 0.1 RETURN n,r;
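The `null as r` in the first part is there because Cypher requires all parts of a UNION to return the same column names. As a rough illustration of that rule, here is a small Python sketch that concatenates two result sets and keeps duplicates, as UNION ALL does. The function `union_all` and the row values are hypothetical stand-ins, not real query output.

```python
def union_all(first, second):
    """Concatenate two lists of row dicts, keeping duplicates (like UNION ALL).

    Raises ValueError if the rows do not all share the same column names."""
    columns = {tuple(sorted(row)) for row in first + second}
    if len(columns) > 1:
        raise ValueError(f"column names differ: {sorted(columns)}")
    return first + second

# Hypothetical rows standing in for the two query results.
fixed_nodes = [{"n": "SHOW:DWDD", "r": None}, {"n": "SHOW:P&W", "r": None}]
guest_rows = [
    {"n": "GUEST:1", "r": "VISITED->DWDD"},
    {"n": "GUEST:1", "r": "VISITED->P&W"},
]

combined = union_all(fixed_nodes, guest_rows)
print(len(combined))  # rows from both parts, nothing de-duplicated
```

Without the placeholder column in the first part, the two row shapes would differ and the combination would be rejected.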
And where do I get the Cypher statements that I can use to populate my GraphGist database setup? Fortunately my dump command made it into the Neo4j-Shell, so we can just run it on the command line and redirect the output into a file:
bin/neo4j-shell -path talkshow/graph.db \
  -c 'dump MATCH (n) WHERE n:PARTY OR n:SHOW OR n:GENDER RETURN n, null as r UNION ALL MATCH(n:GUEST)-[r]->() WHERE rand() < 0.1 RETURN n,r;' \
  > talkshow/sample.cql
Don’t forget the semicolon at the end! Looking at sample.cql we see something like:
begin
create (_0:`SHOW` {`Modularity Name`:"B&vD", `id`:"B&vD", `label`:"B&vD", `modularity_class`:3, `weighted outdegree`:0.000000})
create (_1:`SHOW` {`Modularity Name`:"P&W", `id`:"P&W", `label`:"P&W", `modularity_class`:4, `weighted outdegree`:0.000000})
create (_2:`SHOW` {`Modularity Name`:"DWDD", `id`:"DWDD", `label`:"DWDD", `modularity_class`:5, `weighted outdegree`:0.000000})
...
...
create _509-[:`VISITED` {`quantity`:1}]->_5
create _509-[:`VISITED` {`quantity`:1}]->_2
create _509-[:`VISITED` {`quantity`:1}]->_1
create _509-[:`VISITED` {`quantity`:1}]->_0
;
commit
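As a quick sanity check on how big the sample turned out, one could count the create statements in the dump. The following is a small sketch that assumes the dump has the shape shown above (nodes as `create (_N...` and relationships as `create _N-[...`). The helper name `count_creates` and the shortened excerpt are made up for illustration.

```python
import re

def count_creates(dump_text):
    """Count node and relationship create statements in a dump of this shape."""
    nodes = len(re.findall(r"create \(_\d+", dump_text))
    rels = len(re.findall(r"create _\d+-\[", dump_text))
    return nodes, rels

# Shortened excerpt in the same shape as sample.cql;
# a real file could be read with open("talkshow/sample.cql").read().
excerpt = """begin
create (_0:`SHOW` {`id`:"B&vD"})
create (_1:`SHOW` {`id`:"P&W"})
create _509-[:`VISITED` {`quantity`:1}]->_1
create _509-[:`VISITED` {`quantity`:1}]->_0
;
commit"""

nodes, rels = count_creates(excerpt)
print(nodes, "nodes,", rels, "relationships")
```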
We can now use this file to populate the database for our GraphGist, and here it is in all its beauty: the GraphGist “Media, Politics and Graphs”. In the end I chose not to use Rik’s GitHub Gist with the queries, but instead copied the nice text and pictures from his blog post into the GraphGist.
You might notice that some of the parties go without connections. That would need some tweaking of the sampling, which I leave as an exercise for you.
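One possible direction for that tweak, sketched in Python rather than Cypher: sample the guests as before, then top up with one random guest for every party that ended up without any sampled member. This is only an illustration under the assumption that each guest belongs to exactly one party. The names `sample_with_party_coverage` and `guest_party` and the toy data are hypothetical.

```python
import random

def sample_with_party_coverage(guest_party, fraction, rng):
    """Sample guests at `fraction`, then add one guest for every party left uncovered."""
    sampled = {g for g in guest_party if rng.random() < fraction}
    covered = {guest_party[g] for g in sampled}

    # Group guests by party so an uncovered party can be topped up.
    by_party = {}
    for guest, party in guest_party.items():
        by_party.setdefault(party, []).append(guest)

    for party, members in by_party.items():
        if party not in covered:
            sampled.add(rng.choice(members))  # one random member of that party
    return sampled

# Hypothetical toy data: guest -> party; the party "tiny" has a single guest.
guest_party = {f"g{i}": f"p{i % 4}" for i in range(200)}
guest_party["lonely"] = "tiny"

sampled = sample_with_party_coverage(guest_party, 0.1, random.Random(7))
print(sorted({guest_party[g] for g in sampled}))
```

The top-up slightly skews the sample towards small parties, and a party with no guests at all would still stay unconnected.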
Have fun
Michael
Published at DZone with permission of , DZone MVB. See the original article here.