Graphs and geneaology. modelling the data

03 Jan 2017

Lately I have been writing a little bit about a network I made of the relationships of European nobles without actually describing what the network looks like, so I figured I would amend that with a quick introduction to the database structure.

The source of my data is the genealogical database from Hull University created by Brian Tompsett. It was scraped directly from the site and since I was mainly interested in Scandinavian royalty I started my search from Harald Bluetooth, the first king of Denmark, and then traversed the graph through children, parents and spouses. I expected this to get me the Scandinavians since the families are quite intermarried, but when looking through the network later on I found even the Caliphs of Baghdad, so I am guessing that I ended up downloading more or less the entire Hull database.

The data is modeled in a fairly simple way. There are 57975 people in the network represented by a (:Person) node which can be connected to other people either through [:married_to] or [:child_of] relationships. Additionally there are 9846 (:Title) nodes that people can be connected to through [:holds] relationships. Using the titles as nodes makes sense since I often end up searching for specific titles such as “King of…” or “Queen of Britain” for instance. It would however make sense to split it into a (:Title) and (:Demesne) where “Demesne” means the actual territory that comes with the title such as Denmark or Northwich. In that way I can avoid using the Neo STARTS WITH query for text search and maybe even add GIS coordinates to the demesnes.

The nodes and relationships also have the following properties.

`(:Person)`	`(:Title)`	`[:married_to]`	`[:holds]`
name	name	date	acceded_date
born
died
born_year
died_year
family

The notable part about the properties is that I both include actual dates as a string and the year as an integer. Hopefully the next version of Neo4j will include support for datetime types which will make this a bit less confusing. I may also at some point separate out the family property as a node and create a [:part_of] relationship. The most notable missing property is the gender of the people, which really decreases the fun analysis possibilities, but sadly it wasn’t available in the source data.

That’s basically it for the genealogy network, hopefully I will be able to use it to do some interesting analysis.

Graphs and geneaology finding inbred nobles with neo4j

22 Dec 2016

There are some things that most people think they know about noble families: That they were rich, rare and horribly inbred. The first two are outside of my expertise, but the genealogy data I implemented in Neo4j I am able to look at inbreeding. Particularly we will be able to quantify the degree of inbreeding using the inbreeding coefficient, which is used both by dog breeders and geneaologists to determine how inbred the children of a set of parents will be. I will probably write more about the derivation of it in a later post, but for I have only added a small section with its definition.

First off a bit of history

Marrying children to close relatives has been done in many cultures through history, sometimes because you wanted wealth and titles to stay in the immediate family and not be lost through inheritance, sometimes because there simply were noone else around and sometimes just because they were royals so they could do what they wanted to do. On the other end of the spectrum there were plenty of laws against marriages between family or cosanguineous marriages as well. The laws adopted by the early Catholic church forbade marriages where there were less than four ancestors between husband and wife which meant no first cousin marriages allowed, in ancient China marriages were only allowed between people of different surnames and while first cousin marriages were allowed in Islam per example of the Prophet Muhammad everything closer than that was not.

Measuring inbreeding

As mentioned, the inbreeding coefficient of a child is based on the number of common ancestors between its parents and the distance to the common ancestors, or equivalently the length and number of paths between the parents when the family tree is seen as an undirected graph. Summed over each path, $p$ , with length $n$ the formulation is

$F(x) = \sum_p^P \frac{1}{2}^n$

The first thing to notice is that there is no limit to the length of the paths that are considered, which can lead to very long execution times and for long path lengths the contribution to $F(x)$ quickly becomes insignificant. For these reasons we can set the max path length to 10 and still be relatively on target.

Results or Who’s the derpiest of them all

Now that we have a way to measure inbreeding we can start applying it. I decided to look at every king and emperor and the top ten list of inbred monarchs can found in the table below. The top guy with an inbreeding coefficient really is a good indicator that out methods are working. Charles the II was famous for suffering from several disabilities, probably due to the massive inbreeding in the Habsburg dynasty. He did not speak before he was four, could not walk before he was eight and when he died the physician performing the autopsy said that his body, “did not contain a single drop of blood; his heart was the size of a peppercorn; his lungs corroded; his intestines rotten and gangrenous; he had a single testicle, black as coal, and his head was full of water (source).”

Date of Birth	Dynasty	Inbreeding coef.	Name
1661-11-06	Habsburg	0.176758	Charles II of Spain
1857-01-01	de Bourbon	0.174805	Alfonso XII of Spain
1822-05-13	de Asis	0.166016	Francisco
1578-04-14	Habsburg	0.166016	Philip III of Spain
1554-01-20	Avíz	0.146484	Sebastian of Portugal
1334-08-30	Ivrea Burgundy	0.135742	Peter the Cruel of Castile
1767-05-13	de Bragança	0.132812	John VI of Portugal
1069-01-01	Sanchez	0.125000	Pedro I of Aragón
1073-01-01	Sanchez	0.125000	Alfonso I the battler of Aragón
1075-01-01	Sanchez	0.125000	Ramiro II the Monk of Aragón

So Charles the II really was the poster child of why a homogenous gene pool is a bad thing and that he pops up at the top of the list gives us some confidence that our calculations are correct.

Among the other luminaries on the list we see a lot of Spanish and Portuguese monarchs in particular and as far as I understand from a bit of googling, this was the result of marriages often being arranged to keep the peace between different Iberian families while at the same time attempting to keep the power in their family. Very few modern monarchs show up on this list, probably a result of marrying relatives going out of fashion and the only currently living person in the top-20 is King Olav V of Norway who pops in at a 16th place with an inbreeding coefficient of 0.08, not that that seems to have affected him in the slightest.

It should be noted that there are several problems with the list above. First of all there might be people for whom data is missing, for instance if two people marry and only the parents of one of them are known, then we have no idea of their degree of relatedness. At some point I will probably do some analysis of the completeness of the data, but for now we can assume that the values are lower bounds. Secondly inbreeding is an inherited trait meaning that if your common ancestors are inbred then that leads to a higher degree of inbreeding in your children. This can actually be computed, but the contributions should be relatively small and implementing it would lead to the algorithm becoming recursive which again leads to longer computational times. All of these are issues that I plan on tackling later on, both by adding data to the network and possibly calculating the corrected inbreeding coefficient for a few people to see the magnitude of the differences.

Neo4j implementation

Now to the Cypher query itself. Given that we want to find the inbreeding coefficient for children of persons a and b, we find the length of all paths between them up to and including lengths of 10.

MATCH (a:Person), MATCH (b:Person)
WHERE a.name = {:name_a}
AND b.name = {:name_b}
OPTIONAL MATCH (a)-[p1:child_of *..5]->(common_ancestor)<-[p2:child_of *..5]-(b)
WITH size(p1) + size(p2) AS path_length, common_ancestor
RETURN DISTINCT path_length, c

The first part of this query is pretty self-explanatory

MATCH (a:Person), MATCH (b:Person)
WHERE a.name = {:name_a}
AND b.name = {:name_b}

simply matches the people by their names which are set by the standard Neo4j parameter syntax.

OPTIONAL MATCH (a)-[p1:child_of *..5]->(common_ancestor)<-[p2:child_of *..5]-(b)

matches all common ancestors at a maximum distance of 5 to either a or b and returns the paths. Now, there is an issue with this statement. We wanted to find every path between two people that had a path length smaller or equal to 10, however if one person is removed 4 steps from the common ancestor and the other is 6 steps removed we will not find them using this since p1 and p2 are broken into two. The reason for not just using an undirected relationship like

OPTIONAL MATCH (a)-[p2:child_of *..10]-(b)

is that by not using directedness we are including descendants in our query, i.e. if two people have a child together the path through the child would be returned. There are ways to fix this, but all of them add significantly to the calculation time and the contribution from the excluded paths will be very small due to path length, so I choose to go with this conservative estimate. The final part of the statement

WITH size(p1) + size(p2) AS path_length, common_ancestor
RETURN DISTINCT path_length, c

simply finds the path length as the sum of both paths and returns distinct paths along with their path lengths. If we did not return distinct paths then every path would be counted twice; once from a to b and once from b to a.

The results from this query is now plugged into the equation for inbreeding coefficient and voíla, we can now figure out just how inbred every person in our geneaology graph is.

Donald And The Viking Kings

17 Nov 2016

Donald Trump is not only everybodys favourite president-elect of the United States, he is also related to the John of Gaunt, a 14th century English duke.

John of Gaunt, English Duke who fathered 14 children and basically is to blame for the mess we're in today. <br>Image credit: Lucas Cornelisz de Kock (1495–1552)

John of Gaunt, English Duke who fathered 14 children and basically is to blame for the mess we're in today.
Image credit: Lucas Cornelisz de Kock (1495–1552)

Recently I got a hold of Brian Tompsetts incredibly extensive database of European royals and loaded it into a Neo4j graph database. Through that it was easy to look up John of Gaunt and it turns out that 16 generations before John, Trump is a descendant of Harald the II Bluetooth. For non-Danes that is the first king of Denmark, who famously introduced Christianity to the country to reduce the risk of attacks from its larger, Christian neighbors to the south.

Whether Trump has other things than using born-again Christianity for political gains in common with the longboat sailing and slave selling King of Denmark remains to be seen and while Bluetooth did build a wall to keep his southern neighbors out history says nothing about whether he made the Holy Roman Empire pay for it. If not then Trump has plenty of other noble forebears to use as inspiration; just through John of Gaunt he is directly related to at least 220 titles, including a Khan of the Cumans, a word which literally means “pale, unhealthy yellow” in Turkic, a reference to the Cumans being blonde. Which I guess explains the hair.

Among Trumps other royal relatives we find another descendant of John of Gaunt, Louis XIV of France and if there is such a thing as a gene for taste (or lack thereof) in interior decoration it seems to have been expressed in both the coming president and the Sun King of France.

To the left the Versailles bedroom of Louis XIV. To the right the entrance to Donald Trumps New York Apartment. Image credit: <a href="http://www.fotopedia.com/items/jmhullot-z2Umsz0Z8aY">Jean Marie Hulot</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=34640481)">CC BY 3.0</a>

To the left the Versailles bedroom of Louis XIV. To the right the entrance to Donald Trumps New York Apartment. Image credit: Jean Marie Hulot, CC BY 3.0

Before his royal ancestry is attributed to either an illuminati plot or some “ruling gene”, eugenics nonsense it should be noted that smart people figured out that any European that had children in the year 1000 is an ancestor of every European alive today. So Trump has no more right to the Kingdom of Denmark than anyone else, something that is probably a relief for most Danes.

Finding Common Neighbors In Neo4j

24 Jun 2016

The number of neighbors shared by two nodes $x$ and $y$ is the most basic and often among the best predictors of a connection between a source node, $x$ , and the target node, $y$ . Usually this number is formulated as the Common Neighbors index which is just the size of the intersection of the 1-neighborhoods of $x$ and $y$

\[ CN(x,y) = | \Gamma(x) \cup \Gamma(y) | \]

This index is pretty brilliant since it is easy to find and it is often a surprisingly good predictor. It also forms the basis for a lot of other indices such as Adamic/Adar and the Jaccard index.

Finding the number of common neighbors is luckily very easy in Neo4j. If we have some arbitrary network what we want to do is to find the number of common neighbors between every possible pair of source and target nodes, regardless of whether a link exists between them or not. The first order of the day will be to find the 1-neighborhood of every source node and then say that nodes in the 1-neighborhood of the neighbors are potential target nodes.t

MATCH (source)--(neighbor)--(target)
WHERE NOT (source) = (target)

The first MATCHstatement in the query finds all the paths in the network that consists of 3 nodes which is the same as finding every pair of source and target node that share common neighbors. The following WHERE statement ensures that the target and source node are not the same, since we’d otherwise end up with a bunch of self-loops in the data.

These two statements are in themselves enough to find the number of common neighbors between nodes in the network, but it is also interesting to know whether a link actually exists between the source and target so we can evaluate the performance of the number of common neighbors as a predictor. This is accomplished by the OPTIONAL MATCH statement which tries to find links between the source node and the target node and returns null if no link exists.

OPTIONAL MATCH (source) - [con] - (target)

Finally we can return the results of the query. We use RETURN DISTINCT to ensure that each source and target node is only returned once (if we didn’t then we would get a row for each neighbor as well) and we find the number of common neighbors as the count of neighbor nodes for each pair of source and target nodes. The boolean IS NOT NULL statement at the end of the line returns true if a link was found between the source and target and false otherwise.

RETURN DISTINCT source.node_id AS source_id, target.verdict_id AS target_id, count(neighbor) AS common_neighbors, con IS NOT NULL as is_connected

Together this leads to the following query

MATCH (source)--(neighbor)--(target)
WHERE NOT (source) = (target)
OPTIONAL MATCH (source) - [con] - target
RETURN DISTINCT source.node_id AS source_id, target.verdict_id AS target_id, count(neighbor) AS common_neighbors, con IS NOT NULL as is_connected

Common Neighbors in directed, temporal networks

While the above query works fine for an undirected network a little more work is necessary for directed networks. If for example we’re talking about a network of scientific papers an outgoing link means a citation from the source to the target paper. In general we can assume that a paper only can cite papers that are older than itself, so the paper publication date is encoded in a property called created_at. This changes the MATCH statement to the following

MATCH (source)->(neighbor)--(target)
WHERE NOT (source) = (target)
AND source.created_at > target.created_at

with the additional clause making sure that paper doing the citing is newer than the cited paper. Adding direction to the link between the source and target makes sense since you won’t know who will cite you when you are writing the paper.

All in all this leads to the following statement

MATCH (source)->(neighbor)--(target)
WHERE NOT (source) = (target)
AND source.created_at > target.created_at
OPTIONAL MATCH (source) - [con] - (target)
RETURN DISTINCT source.node_id AS source_id, target.verdict_id AS target_id, count(neighbor) AS common_neighbors, con IS NOT NULL as is_connected
ORDER BY common_neighbors DESCENDING

which will return results that look like this

source_id	target_id	common_neighbors	is_connected
A “Rose Is a Rose Is a Rose Is a Rose,” but Exactly What Is a Gastric Adenocarcinoma?	Carbon Monoxide: To Boldly Go Where NO Has Gone Before.	10	True
You Probably Think This Paper’s About You: Narcissists’ Perceptions of Their Personality and Reputation.	A Lucky Catch: Fishhook Injury of the Tongue.	8	False
Role of Childhood Aerobic Fitness in Successful Street Crossing.	An-arrgh-chy: The Law and Economics of Pirate Organization	1	True

Where we’re assuming that the paper title is used as the ID and the results are ordered by descending order of common neighbors.

Older Newer

Simon Thordal