Analyzing the Panama Papers With Neo4j: Data Models, Queries, and More
In this post, we look at the graph data model used by the International Consortium of Investigative Journalists (ICIJ) and show how to construct it using Cypher in Neo4j. We dissect an example from the leaked data, recreating it using Cypher, and show how the model could be extended.
These structures were uncovered from leaked financial documents and analyzed by the journalists. They extracted the metadata of the documents using Apache Solr and Tika, then connected all the information using the leaked databases, creating a graph of nodes and edges in Neo4j, and made it accessible through Linkurious' visualization application.
The Steps Involved in the Document Analysis
- Acquire documents
- Classify documents
  - Scan/OCR
  - Extract document metadata
- Whiteboard domain
  - Determine entities and their relationships
  - Determine potential entity and relationship properties
  - Determine sources for those entities and their properties
  - Work out analyzers, rules, parsers, and named entity recognition for documents
- Parse and store document metadata and document and entity relationships
  - Parse by author, named entities, dates, sources, and classification
- Infer entity relationships
- Compute similarities, transitive cover, and triangles
- Analyze data using graph queries and visualizations
Finding triads in the graph can reveal inferred connections. Here, bob has an inferred connection to companyb through companya.
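As a rough illustration of the triad idea, the sketch below builds that three-node example and then queries for the inferred connection. The :owns relationship between the two companies and the node names are assumptions made up for this example; they are not part of the ICIJ model. The two statements are meant to be run one after the other.

```cypher
// Hypothetical mini-graph: bob is an officer of companya,
// and companya is linked to companyb (the :owns type is an assumption).
CREATE (bob:officer {name: "bob"})-[:is_officer_of]->(a:company {name: "companya"}),
       (a)-[:owns]->(:company {name: "companyb"});

// Open triads: an officer reaches a second company through a first one
// without being directly connected to it.
MATCH (o:officer)-[:is_officer_of]->(via:company)-[:owns]->(c:company)
WHERE NOT (o)-[:is_officer_of]->(c)
RETURN o.name AS officer, via.name AS through, c.name AS inferred_company;
```

Each returned row is a candidate for an inferred relationship that an analyst could review before adding it to the graph.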
From Documents to Graph
A model of the organizational domain of business relationships in a holding is simple and similar to the model you would use for business registries, a common use case for Neo4j. At a minimum, you have:
- Clients
- Companies
- Addresses
- Officers (both natural people and companies)
- (:officer)-[:is officer of]->(:company)
  - with these classifications:
    - protector
    - beneficiary, shareholder, director
    - beneficiary
    - shareholder
- (:officer)-[:registered address]->(:address)
- (:client)-[:registered]->(:company)
- (:officer)-[:has similar name and address]->(:address)
These entities have specific relationships; for example, a person is the "officer of" a company. This is a basic domain that you can populate from documents about a tax-haven shell company holding, a.k.a. the #PanamaPapers.
Initially, you classify the raw documents by types and subtypes (like contract or invitation). Then you attach as much direct and inferred metadata as you can. Direct metadata comes from the document type itself (like the senders and receivers of an email or the parties of a contract). Inferred metadata is gained from the content of the documents, using techniques like natural language processing, named entity recognition, or plain text search for well-known terms such as distinctive names or roles.
The first step in building your graph model is to extract those named entities from the documents and their metadata. This includes companies, persons, and addresses. These entities become nodes in the graph. For example, from a company registration document, we can extract the company and officer entities.
Some relationships can be taken directly from the documents. In the previous example, we would model the officer as directly connected to the company:
(:officer)-[:is_officer_of]->(:company)
Other relationships can be inferred by analyzing email records. If we see several emails between a person and a company, we can infer that the person is a client of that company:
(:client)-[:is_client_of]->(:company)
We can use similar logic to create relationships between entities that share the same address, have family ties or business relationships, or regularly communicate.
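To make the two patterns above concrete, here is a minimal sketch of how the extracted entities could be written to the graph. All names are hypothetical placeholders, and the source property that marks a relationship as inferred is an assumption added for this example, not something taken from the ICIJ data.

```cypher
// Entities extracted from a (hypothetical) company registration document.
// MERGE avoids creating the same node twice when several documents
// mention the same entity.
MERGE (c:company {name: "example holdings ltd"})
MERGE (o:officer {name: "jane doe"})
// Direct relationship, stated in the document itself.
MERGE (o)-[:is_officer_of]->(c)
// Inferred relationship, for example from repeated email traffic.
MERGE (cl:client {name: "example law firm"})
MERGE (cl)-[:is_client_of {source: "inferred from email"}]->(c)
```

Keeping a marker such as source on inferred relationships lets later queries separate what a document states from what the analysis concluded.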
- Direct metadata -> entities -> relationships to documents
  - Author, receivers, account holder, attached to, mentioned, co-located
  - Turn plain entities / names into full records using registries and profile documents
- Inferred metadata and information from other sources -> relationships between entities
  - Related to people or organizations from the direct metadata
  - Same addresses / organizations
  - Find peer groups / rings within fraudulent activities
  - Family ties, business relationships
  - Part of the communication chain
The Graph Data Model Used by the ICIJ
Issues With the ICIJ Data Model
There are some modeling and data quality issues with the ICIJ data model. The ICIJ data contains a lot of duplicates, only a few of which are connected by a “has similar name or address” relationship. Most of the others can be inferred from the first and last part of a name together with addresses and family ties. It would also be beneficial for the data model to actually merge those duplicates, so that certain duplicate relationships could be merged as well.
In the ICIJ data model, shareholder information like the number of shares, issue dates, etc. is stored on the “officer” node, even though an officer can be a shareholder in any number of companies. It would be better to store that shareholder information on the “is officer of – shareholder” relationship.
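A small sketch of what relationship-level shareholder data could look like follows. The officer, the companies, and the property names shares and issue_date are hypothetical; they only illustrate the idea and do not reflect the actual ICIJ properties.

```cypher
// One officer holding shares in two companies; the share data lives on
// each relationship instead of on the officer node.
CREATE (o:officer {name: "example shareholder"}),
       (c1:company {name: "example company one"}),
       (c2:company {name: "example company two"}),
       (o)-[:ioo_shareholder {shares: 100, issue_date: "2015-01-01"}]->(c1),
       (o)-[:ioo_shareholder {shares: 2500, issue_date: "2016-06-30"}]->(c2);

// Each holding can now be read independently of the others.
MATCH (o:officer {name: "example shareholder"})-[r:ioo_shareholder]->(c:company)
RETURN c.name AS company, r.shares AS shares, r.issue_date AS issued
ORDER BY company;
```

With the properties on the relationship, an officer with several holdings no longer needs one set of share properties that can only describe a single company.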
Some of the boolean properties could be represented as labels, e.g. “citizenship=yes” could be a person label.
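A minimal sketch of that conversion might look like this. It assumes the flag is stored as the string "yes" in a citizenship property, which is a guess about the data layout rather than something confirmed by the ICIJ export.

```cypher
// Add a :person label to every officer flagged as a natural person.
// The property name and value are assumptions about the source data.
MATCH (o:officer)
WHERE o.citizenship = "yes"
SET o:person
RETURN count(o) AS relabeled
```

Natural people could then be selected with MATCH (p:officer:person) instead of filtering on the property.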
How Could You Extend the Basic Graph Model Used by the ICIJ?
The domain model used by the ICIJ is really basic, containing just four types of entities (officer, client, company, address) and four relationships between them. It is more or less a static view of the organizational relationships and doesn’t include interactions or activities. Looking at the source documents and the other activities outlined in the report, there are many more things that could enrich this graph model and make it more expressive.
We can model the original documents, their metadata, and their relationships to people. Some of those relationships are inferred from people being part of conversations, being mentioned in documents, or being the subject of documents. Other interesting relationships are aliases and interpretations of entities that were used during the analysis, which allow other journalists to reproduce the original thought processes.
The sources of additional information, like business registries, watch lists, census records, or other journalistic databases, can also be added. Human relationships like family or business ties can be created explicitly, as can implicit relationships inferring that the actors are part of the same fraudulent group or ring.
Another missing aspect is the activities and the money flow. Examples of activities are the opening and closing of accounts, the creation or merger of companies, the filing of records for those companies, or the assignment of responsibilities. For the money flow, we could track the banks, accounts, and intermediaries used in the monetary transactions mentioned, to get an overview of the amounts transferred and the patterns of transfers. Those patterns can then be applied to extract additional fraudulent money flows from other transaction systems.
Graph data is very flexible and malleable: as soon as you have a single connection point, you can integrate new sources of data and start finding additional patterns and relationships that you couldn’t trace before.
- New entities
  - Documents: e-mail, PDF, contract, DB record, …
  - Money flow: accounts / banks / intermediaries
- New relationships
  - Family / business ties
  - Conversations
  - Peer groups / rings
  - Similar roles
  - Mentions / topic-of
  - Money flow
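To give a feel for such an extension, the following sketch adds a few of these new entities and relationships and then summarizes the money flow. Every label, relationship type, property, and value here (account, bank, document, held_at, transfer, mentions, the amounts) is a hypothetical choice for illustration; none of it comes from the ICIJ data.

```cypher
// Hypothetical money-flow fragment: three accounts at one bank, two
// transfers between them, and a document that mentions the first account.
CREATE (b:bank {name: "example bank"}),
       (a1:account {number: "acc-1"})-[:held_at]->(b),
       (a2:account {number: "acc-2"})-[:held_at]->(b),
       (a3:account {number: "acc-3"})-[:held_at]->(b),
       (a1)-[:transfer {amount: 5000}]->(a2),
       (a2)-[:transfer {amount: 4800}]->(a3),
       (:document {type: "e-mail"})-[:mentions]->(a1);

// Total amount transferred per pair of accounts, largest first.
MATCH (src:account)-[t:transfer]->(dst:account)
RETURN src.number AS from_account, dst.number AS to_account, sum(t.amount) AS total
ORDER BY total DESC;
```

Because the accounts would connect to the existing officers and companies through a single shared node, the same queries used on the basic model could be extended to follow the transfers.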
Let’s Look at a Concrete Example
Let’s look at the family of Azerbaijan’s president, Ilham Aliyev, who was already the topic of a GraphGist by Linkurious in the past. We see his wife, two daughters, and son depicted in the graphic below.
Quoting the ICIJ “The Power Players” publication (emphasis for names added):
The family of Azerbaijan President Ilham Aliyev leads a charmed, glamorous life, thanks in part to financial interests in almost every sector of the economy. His wife, Mehriban, comes from the privileged and powerful Pashayev family that owns banks, insurance and construction companies, a television station and a line of cosmetics. She has led the Heydar Aliyev Foundation, Azerbaijan’s pre-eminent charity behind the construction of schools, hospitals and the country’s major sports complex. Their eldest daughter, Leyla, editor of Baku magazine, and her sister, Arzu, have financial stakes in a firm that won rights to mine for gold in the western village of Chovdar and Azerfon, the country’s largest mobile phone business. Arzu is also a significant shareholder in SW Holding, which controls nearly every operation related to Azerbaijan Airlines (“Azal”), from meals to airport taxis. Both sisters and brother Heydar own property in Dubai valued at roughly $75 million in 2010; Heydar is the legal owner of nine luxury mansions in Dubai purchased for some $44 million.
We took the data from the ICIJ visualization and converted the 2D graph visualization into graph patterns in the Cypher query language. If you squint, you can still see the same structure as in the visualization. We only compressed the “is officer of – beneficiary, shareholder, director” relationship to ioo_bsd and prefixed the other “is officer of” relationships with ioo.
We didn’t add shares, citizenship, registration numbers, or addresses that were properties of the entities or relationships. You can see them when clicking on the elements of the embedded original visualization.
Cypher Statement to Set Up the Visualized Entities and Relationships
CREATE
(leyla: officer {name:"leyla aliyeva"})-[:ioo_bsd]->(ufu:company {name:"uf universe foundation"}),
(mehriban: officer {name:"mehriban aliyeva"})-[:ioo_protector]->(ufu),
(arzu: officer {name:"arzu aliyeva"})-[:ioo_bsd]->(ufu),
(mossack_uk: client {name:"mossack fonseca & co (uk)"})-[:registered]->(ufu),
(mossack_uk)-[:registered]->(fm_mgmt: company {name:"fm management holding group s.a."}),
(leyla)-[:ioo_bsd]->(kingsview:company {name:"kingsview developents limited"}),
(leyla2: officer {name:"leyla ilham qizi aliyeva"}),
(leyla3: officer {name:"leyla ilham qizi aliyeva"})-[:has_similiar_name]->(leyla),
(leyla2)-[:has_similiar_name]->(leyla3),
(leyla2)-[:ioo_beneficiary]->(exaltation:company {name:"exaltation limited"}),
(leyla3)-[:ioo_shareholder]->(exaltation),
(arzu2:officer {name:"arzu ilham qizi aliyeva"})-[:ioo_beneficiary]->(exaltation),
(arzu2)-[:has_similiar_name]->(arzu),
(arzu2)-[:has_similiar_name]->(arzu3:officer {name:"arzu ilham qizi aliyeva"}),
(arzu3)-[:ioo_shareholder]->(exaltation),
(arzu)-[:ioo_bsd]->(exaltation),
(leyla)-[:ioo_bsd]->(exaltation),
(arzu)-[:ioo_bsd]->(kingsview),
(redgold:company {name:"redgold estates ltd"}),
(:officer {name:"willy & meyrs s.a."})-[:ioo_shareholder]->(redgold),
(:officer {name:"londex resources s.a."})-[:ioo_shareholder]->(redgold),
(:officer {name:"fagate mining corporation"})-[:ioo_shareholder]->(redgold),
(:officer {name:"globex international llp"})-[:ioo_shareholder]->(redgold),
(:client {name:"associated trustees"})-[:registered]->(redgold)
Interesting Queries
Family ties via last name
MATCH (o:officer)
WHERE toLower(o.name) CONTAINS "aliyev"
RETURN o
Family involvements
MATCH (o:officer) WHERE toLower(o.name) CONTAINS "aliyev"
MATCH (o)-[r]-(c:company)
RETURN o, r, c
Who are the officers of a company and what are their roles
MATCH (c:company)-[r]-(o:officer) WHERE c.name = "exaltation limited"
RETURN *
Show joint company involvements of family members
MATCH (o1:officer)-[r1]->(c:company)<-[r2]-(o2:officer)
// count is a separate alias so that it does not shadow the company variable c
WITH o1.name AS first, o2.name AS second, count(*) AS count,
     collect({ name: c.name, kind1: type(r1), kind2: type(r2) }) AS involvements
WHERE count > 1 AND first < second
RETURN first, second, involvements, count
Resolve duplicate entities
MATCH (o:officer)
RETURN toLower(split(o.name, " ")[0]), collect(o.name) AS names, count(*) AS count
Resolve duplicate entities by first and last part of the name
MATCH (o:officer)
WITH split(toLower(o.name), " ") AS name_parts, o
WITH name_parts[0] + " " + name_parts[-1] AS name, collect(o.name) AS names, count(*) AS count
WHERE count > 1
RETURN name, names, count
ORDER BY count DESC
Transitive path from Mossack to the officers in that example
MATCH path=(:client {name: "mossack fonseca & co (uk)"})-[*]-(o:officer)
// skip paths that run through the similar-name links between duplicates
WHERE none(r IN relationships(path) WHERE type(r) = "has_similiar_name")
RETURN [n IN nodes(path) | n.name] AS hops, length(path)
Shortest path between two people
MATCH (a:officer {name: "mehriban aliyeva"})
MATCH (b:officer {name: "arzu aliyeva"})
MATCH p=shortestPath((a)-[*]-(b))
RETURN p
Further Work: Extension of the Model
Merge duplicates
Create a person node and connect all officers to that single person. Reuse our statement from the duplicate detection.
MATCH (o:officer)
WITH split(toLower(o.name), " ") AS name_parts, o
WITH name_parts[0] + " " + name_parts[-1] AS name, collect(o) AS officers
// originally natural people have a “citizenship” property
WHERE name CONTAINS "aliyev"
CREATE (p:person { name: name })
FOREACH (o IN officers | CREATE (o)-[:identity]->(p))
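One caveat with the statement above is that running it twice creates a second set of person nodes and identity relationships. A possible variant, sketched here under the assumption that the normalized name is good enough as a key, swaps CREATE for MERGE so the statement can be rerun safely.

```cypher
MATCH (o:officer)
WITH split(toLower(o.name), " ") AS name_parts, o
WITH name_parts[0] + " " + name_parts[-1] AS name, collect(o) AS officers
WHERE name CONTAINS "aliyev"
// MERGE only creates the person if no node with that name exists yet.
MERGE (p:person {name: name})
// Likewise, each identity relationship is only created once.
FOREACH (o IN officers | MERGE (o)-[:identity]->(p))
```

The trade-off is that two different people whose names normalize to the same first and last part would be merged into one person, so the result still needs a manual check.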
Introduce family ties between those people
CREATE (ilham:person {name: "ilham aliyev"})
CREATE (heydar:person {name: "heydar aliyev"})
WITH ilham, heydar
MATCH (mehriban:person {name: "mehriban aliyeva"})
MATCH (leyla:person {name: "leyla aliyeva"})
MATCH (arzu:person {name: "arzu aliyeva"})
// child_of points from each child to the parent
FOREACH (child IN [leyla, arzu, heydar] | CREATE (child)-[:child_of]->(ilham) CREATE (child)-[:child_of]->(mehriban))
CREATE (leyla)-[:sibling_of]->(arzu)
CREATE (leyla)-[:sibling_of]->(heydar)
CREATE (arzu)-[:sibling_of]->(heydar)
CREATE (ilham)-[:married_to]->(mehriban)
Show the family
MATCH (p:person) RETURN p
Family ties to companies
MATCH (p:person) WHERE p.name CONTAINS "aliyev"
OPTIONAL MATCH (c:company)<--(o:officer)-[:identity]-(p)
RETURN c, o, p
GraphGist
You can explore the example dataset yourself in this interactive graph model document (called a GraphGist). You can find many more for various use cases and industries on our GraphGist portal.
Related Information
- http://neo4j.com/news/neo4j-powers-panama-papers-investigation/
- http://panamapapers.icij.org
- https://panamapapers.icij.org/the_power_players/
- Stairway to Tax Heaven game: https://panamapapers.icij.org/stairway_tax_heaven_game/
- https://linkurio.us/panama-papers-how-linkurious-enables-icij-to-investigate-the-massive-mossack-fonseca-leaks/
- E-mail analysis
- Investigative journalism
- https://www.icij.org/offshore/offshore-companies-provide-link-between-corporate-mogul-and-azerbaijans-president
- http://blog.bruggen.com/2015/03/what-do-linkurious-icij-and-swissleaks.html
- Existing GraphGists
Published at DZone with permission of William Lyon, DZone MVB. See the original article here.