Friday, February 25, 2011

A short introduction to Isonymy (surname genetics)

Think you need a lab to do population genetics research?  Chances are, you already have everything you need for basic population structure analysis right at home.  All you really need is a phonebook, a set of equations, and either a pencil, paper, and a calculator or, preferably, a computer with a good spreadsheet program.  What follows is a very short introduction to using surnames as genetic markers in population and kinship studies.

In our culture, as in other European-derived societies, surnames are inherited from the father.  As such, they behave as a system analogous to Y-chromosome inheritance, which has been shown to reflect actual genetic relationships within and between populations.  Biological anthropologists have used surnames for decades as measures of genetic relationships.  Medical geneticists have also used surnames to study genetic linkages with various diseases, such as certain cancers.

There are two basic types of isonymy (literally "same name") studies: random and non-random.  Non-random studies are generally geared toward analyses of inbreeding levels in a population, and usually center on how often married couples share a surname.  This was the technique employed by George Darwin (son of Charles Darwin) in his landmark 1875 study of first-cousin marriages in England.

Random isonymy, by contrast, involves calculating the kinship coefficient for a population based on the repetition of surnames throughout that population.  Calculating random isonymy, either within a single population or between two populations, is the first step in any kinship study.  Unfortunately, there is no single standard formula, since different researchers have worked out their own methods of calculating it.  One common method for calculating isonymy within a single population is:

$$I_{ii} = \frac{\sum_{k} n_{ik}\,(n_{ik} - 1)}{N_i\,(N_i - 1)}$$

Iii equals random isonymy.  nik equals the number of occurrences of a particular surname (say, Anderson) within the population.  Multiply that count by itself minus 1.  So if Anderson occurs 6 times in a population, you would multiply 6 * 5, getting 30.  Repeat this for every surname in the population.  Surnames that occur only once drop out of the equation (since 1 * 0 is always 0).  Then add up the results of all these calculations, which is what the sigma (Σ) symbol indicates.  Moving to the bottom of the equation, Ni equals the total number of individuals in the population (that is, the total number of surname entries, not the number of distinct surnames).  Multiply Ni by Ni minus 1.  Then divide the numerator by the denominator and you have the random isonymy for that population.
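If you would rather let a computer do the counting, the steps above can be sketched in a few lines of Python.  This is only an illustration under my own assumptions: the function name random_isonymy and the ten-entry sample list are made up for the example, and the input is assumed to be one surname per individual, already cleaned up.

```python
from collections import Counter


def random_isonymy(surnames):
    """Random isonymy within one population.

    surnames: one entry per individual (hypothetical input format).
    Returns sum_k n_k * (n_k - 1) divided by N * (N - 1).
    """
    counts = Counter(surnames)      # n_ik: occurrences of each surname
    total = sum(counts.values())    # N_i: total number of surname entries
    if total < 2:
        raise ValueError("need at least two individuals")
    # Surnames that occur once contribute 1 * 0 = 0 to the numerator.
    numerator = sum(n * (n - 1) for n in counts.values())
    return numerator / (total * (total - 1))


# Made-up sample: 6 Andersons, 3 Bakers, 1 Clark (10 entries in all).
sample = ["Anderson"] * 6 + ["Baker"] * 3 + ["Clark"]
print(random_isonymy(sample))  # (30 + 6 + 0) / (10 * 9) = 0.4
```

In the sample, Anderson contributes 6 * 5 = 30, Baker contributes 3 * 2 = 6, and Clark drops out, so the result is 36 / 90 = 0.4.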

From the random isonymy score, you can calculate other metrics of kinship within a population.  The simplest of these is Lasker's coefficient of relationship (which corresponds to Wright's coefficient of relationship), obtained by dividing random isonymy by 2 (Lasker, 1977).  Divide random isonymy by 4 and you have the coefficient of inbreeding.  These are the very basic elements of surname genetics, which I hope to explore in greater depth in the course of this blog.
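As a quick sketch of those two derived figures, here they are in Python.  The function names are my own, and the isonymy value of 0.4 is chosen purely for illustration rather than taken from real data.

```python
def lasker_relationship(isonymy):
    """Lasker's coefficient of relationship: random isonymy divided by 2."""
    return isonymy / 2


def inbreeding_coefficient(isonymy):
    """Coefficient of inbreeding: random isonymy divided by 4."""
    return isonymy / 4


isonymy = 0.4  # illustrative value only, not a real measurement
print(lasker_relationship(isonymy))     # 0.2
print(inbreeding_coefficient(isonymy))  # 0.1
```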

Darwin, G.  1875.  Marriages between first cousins in England and their effects.  Journal of the Statistical Society of London, 38: 153-184.

Lasker, GW. 1977.  A coefficient of relationship by isonymy: a method for estimating the genetic relationship between populations.  Human Biology, 49: 489-493.


  1. Neat. How does the choice of name affect the interpretation of the results? "Grebeldinger", say, isn't a terribly common name in the US, and it would be reasonable to assume that most, if not all, Grebeldingers in an area were related at some point... "Smith", or even "Cooper", or all those other occupational names, probably were never all related. On the other hand, spelling variation of the name could skew the results in the other direction somewhat - I'm descended from a Van Keuren, but other branches of the family are Van Curen and Van Kurin (unfortunately, they fell into some serious illiteracy around 1800 or so). How do you handle factors like those?

  2. When you conduct a surname study, you are assuming that the surnames are, to some extent, monophyletic; that is, that everyone bearing a given surname shares the same origin. Of course this model breaks down somewhat in the cases you mentioned (Smith and Cooper), and you have to assume that your results will to some extent overestimate the level of kinship within a population.

    You also have to determine ahead of time how strongly you are going to assume a monophyletic origin for the surnames. For example, is O'Brian the same as O'Brien? In general, you need to come up with a rule for your study and assume that slight differences in spelling either are or are not significant and stick with that rule for the whole population.

    Nevertheless, pairings of surname data with actual genetic data show that isonymy does reasonably approximate actual levels of kinship. I'll track down those studies and try to have links to them in the near future.
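One way to make the "pick a spelling rule and stick with it" idea concrete is to run every surname through the same normalization step before counting.  The sketch below is only an assumption about how that might look: the VARIANTS table and the normalize function are hypothetical, and which spellings get merged is a decision the researcher makes for each study.

```python
import re

# Hypothetical variant table, decided before the study starts:
# each key is a cleaned-up spelling, each value the surname it is merged into.
VARIANTS = {
    "obrian": "obrien",
    "vancuren": "vankeuren",
    "vankurin": "vankeuren",
}


def normalize(surname, variants=VARIANTS):
    """Apply one fixed spelling rule to a surname.

    Lowercases, strips everything except the letters a-z (spaces,
    apostrophes, hyphens; accented letters are dropped too), then
    merges any spelling listed in the variant table.
    """
    key = re.sub(r"[^a-z]", "", surname.lower())
    return variants.get(key, key)


print(normalize("O'Brian"))    # obrien
print(normalize("O'Brien"))    # obrien
print(normalize("Van Curen"))  # vankeuren
```

Leaving the variant table empty gives the stricter rule, where every distinct spelling counts as its own surname.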

  3. When we assume a monophyletic origin, it is for single surnames. For example, all persons with the surname Cooper are assumed to be descendants of a single Cooper. However, in the real world, most surnames have multiple origins.