Raffaele Sollecito and the Y-haplotype lottery

In my last post, I gave my take on what the review of DNA evidence in the Knox/Sollecito appeal said about the alleged murder weapon. The other item they examined is the clasp from Meredith Kercher’s bra, on which, the prosecution claims, Raffaele Sollecito’s DNA was found.

The main issue for the court with regard to the clasp, as with the knife, looks like being the likelihood or otherwise that the DNA results observed might have been the result of contamination. That’s something I’ll look at in my next post.

Before that, I thought it would be a good idea to consider whether there might be any problems with the interpretation of the DNA found on the clasp as belonging to Sollecito. Clearly, if there are, then it may be possible to reject the clasp as evidence before even getting to the question of contamination.

Stefano Conti and Carla Vecchiotti, the authors of the review, do raise concerns about the way electronic printouts from the DNA tests were analysed. If you’ve read the review and, particularly, if you’ve only read the conclusion to the review, you might be forgiven for thinking that the match to Sollecito is in some doubt. However, I don’t think this is a correct reading of what has been written, and I don’t think the authors can possibly have intended to leave that impression. I’ll explain why.

Two different sorts of DNA test were run on the sample from the clasp: a test of autosomal DNA (the type of test most people are thinking of when they refer to a “DNA profile”) and a test looking specifically for Y chromosomes.

In each case, Conti and Vecchiotti draw attention to peaks on the graphs that were originally interpreted as “stutters” – small false peaks which are particularly common in readings of mixed DNA samples such as this one, where the DNA of the victim was also present. It’s normal practice to just pretend they are not there. The review, though, questions whether all of these peaks really are stutters. They suggest that any peak over 50 RFU might in theory represent a genuine allele, regardless of its size compared to the “main” peaks observed.

With regard to the tests run to identify Y-chromosomes in the sample, the review agrees with the finding of compatibility between the sample and Sollecito’s Y haplotype. You can see how this is difficult to deny by looking at page 129 of the review. At each locus point in the computer printout (i.e. each group of peaks you can see), the largest of the peaks corresponds perfectly in each case to Sollecito’s profile.

What the review is suggesting here is not that there is any doubt about that, but that some of the smaller peaks which are above 50 RFU might not be stutters. Instead, they may point to the presence of an additional Y-chromosome or chromosomes other than Sollecito’s. This, in turn, calls into question a working hypothesis of police scientist Patrizia Stefanoni in the original tests: that the DNA she was looking at represented a mixture of the victim plus one other person. I’ll examine the potential implications of this in my next post.

In her autosomal DNA testing of the sample from the clasp, Stefanoni found a match with Sollecito over 16 locus-points. That’s a very strong finding.

To give you an idea of how strong, I’ve done my own calculation. For each locus, I looked up the population frequencies for the relevant alleles (using the highest value present relating to Italy) in this database. Because some of these are overlaps with Kercher’s profile (i.e. they would be present even if Sollecito’s DNA were not) I then erred very much on the safe side and gave them a dummy frequency of 1. I then used the numbers to calculate the genotype frequency for each locus (slightly tedious to explain how that is done – please let me know if you care).

All you then need to do to get an overall frequency (or “random match probability”) for the 16 loci is multiply all the genotype frequencies together and invert the result (i.e. divide one by it).

I’ve either explained all that clearly or I haven’t. In either case, what I came up with was a random match probability of one in about six thousand billion.
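The steps above can be sketched in Python. This is only an illustration under stated assumptions: it uses the standard Hardy-Weinberg formulas for genotype frequency (p squared for a homozygote, 2pq for a heterozygote), which the post does not confirm as the exact method used, and the allele frequencies are made-up placeholders rather than the case data. It also makes no attempt to reproduce the dummy frequency of 1 given to alleles shared with Kercher.

```python
from math import prod


def genotype_frequency(p, q=None):
    """Expected genotype frequency at one locus (Hardy-Weinberg assumption).

    p and q are allele frequencies; q=None means the genotype is homozygous.
    """
    if q is None:
        return p * p      # homozygote: p^2
    return 2 * p * q      # heterozygote: 2pq


def random_match_probability(genotype_freqs):
    """Return N, where the profile is expected in about 1 in N people."""
    return 1 / prod(genotype_freqs)


# Hypothetical allele frequencies for three loci, NOT the real case data.
example_loci = [(0.20, 0.10), (0.15, None), (0.30, 0.05)]
example_freqs = [genotype_frequency(p, q) for p, q in example_loci]

print(f"about 1 in {random_match_probability(example_freqs):,.0f}")
```

With real data, `example_loci` would hold one entry per locus (16 in this case), filled in from a population frequency table.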

I’m not a DNA expert. This is a very rough estimate and there may be methodological issues with my calculation. For example, maybe it would be better to use worldwide frequency figures rather than just concentrating on Italy. But the real point is not the specific figure I came up with, just the fact that it is very, very large. Genotype frequency values tend to be lower than 0.3. So, whatever specific values you use, once you’ve multiplied a load of them together and then inverted, you are guaranteed to end up with a very large number.
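A rough floor can be put under that claim with a few lines of Python. The sketch assumes the least favourable case under the rule of thumb above, namely that every locus has a genotype frequency of exactly 0.3; the function name is just for illustration.

```python
def minimum_one_in(n_loci, max_genotype_freq=0.3):
    """Smallest possible 'one in N' figure if no genotype frequency exceeds the cap."""
    return 1 / (max_genotype_freq ** n_loci)


# Floors for the locus counts discussed in this post and its comments.
for n in (11, 12, 16):
    print(f"{n} loci: at least 1 in {minimum_one_in(n):,.0f}")
```

Even on that pessimistic assumption, 16 loci give a floor in the hundreds of millions. Real genotype frequencies are mostly well below 0.3, which is why the actual figures come out so much larger.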

The specific problem identified in the review is that at four of the locus points there are peaks present which were disregarded as a stutter in the original testing, but which could be considered genuine peaks, according to Conti and Vecchiotti. You can get an idea of what they are talking about from page 120 of the review. The large peaks seen here are those that match with the victim. The others are, according to Stefanoni, a mixture of stutters and matches to Sollecito.

If we accept the possibility that some of the stutters may be real peaks, then this may have an effect on the random match probability.

There may be a number of potential ways of attempting to account for this. But, even if we were to bend over backwards for the defence and scrap the affected loci altogether, we would still have an overall frequency of one in about 22 billion over the 12 remaining loci. That slashes the odds quite considerably, but it still doesn’t exactly result in a small number.

Now, you might be tempted to formulate a line of thinking, based on the idea that those ambiguous stutters/peaks might represent some unknown person or persons, that perhaps the DNA reading which looks like it matches Sollecito is actually a random combination of DNA from other people. The trouble is, though, that this proposition doesn’t make any difference to the maths. What we’re talking about is just the frequency (to all intents and purposes, the probability) of that particular combination of alleles occurring in sequence. How they got there doesn’t really matter, to the extent that there’s just no realistic statistical possibility that they represent anything other than Sollecito’s DNA.

Still not convinced? Okay, well consider the Y-haplotype DNA. That’s a perfect match over 17 loci. A Y-haplotype match is generally not held to be usable for identification simply because it is not unique. Your Y-chromosome (if you have one) is, barring random mutations, likely to be shared with your father and other Y-chromosomed members of your immediate family. Or, if your Y-haplotype is a very common one, your not-so-immediate family. It’s not impossible that you have exactly the same Y-chromosome as hundreds of thousands of other people. Or it might be a handful. But, without knowing which it is, your Y-chromosome can’t be used to identify you.

Unfortunately, it also seems that available data isn’t extensive enough to reliably assign a population frequency to a Y-haplotype. However, it is possible to classify a Y-haplotype as “rare”. At the original trial, Stefanoni referred to having checked Sollecito’s Y-haplotype in something called the YHRD database. You can do this yourself if you want. The search form is here and Sollecito’s Y-haplotype reading is on page 130 of the review. What you’ll find is that, out of 36,477 Y-haplotypes in the database, none match Sollecito’s.

[Note: I redid this search in April 2013, by which time the database had grown to 112,005 Y-haplotypes, but there was still no match for Sollecito.]
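Zero matches cannot give a frequency, but it can give a rough ceiling on one. The Python sketch below uses the standard zero-observation confidence bound: if a haplotype had frequency p, the chance of seeing no matches in n independent samples would be (1 - p)^n, so the largest p still consistent with that at 95% confidence is 1 - 0.05^(1/n). It rests on a big assumption, which is that the database behaves like a random sample of the relevant population, and that may well not hold here.

```python
def upper_bound_frequency(n_database, confidence=0.95):
    """Upper confidence bound on a haplotype's frequency after 0 matches in n samples.

    Assumes the database is a random sample of the relevant population.
    """
    alpha = 1 - confidence
    return 1 - alpha ** (1 / n_database)


# Database sizes quoted in this post (original search and the April 2013 recheck).
for n in (36_477, 112_005):
    bound = upper_bound_frequency(n)
    print(f"n = {n:,}: frequency below about 1 in {1 / bound:,.0f}")
```

That works out at roughly three divided by the database size, so the ceiling tightens as the database grows. It says nothing about how many relatives share the haplotype.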

I’m thinking of the Y-haplotype as being like the bonus number in a lottery game. It’s not worth much as an identifier on its own, but it’s priceless if you’ve already matched all the main numbers.

NOTE: This post originally contained some incorrect figures and has now been corrected.


10 Responses to Raffaele Sollecito and the Y-haplotype lottery

  1. maundy says:

    I hope the nod to J.K. Rowling in the title is not too subtle.

  2. Hi Maundy,

    Again, I think you need to be clear on the difference between contamination and secondary transfer. I think that secondary transfer is more likely for the clasp than for the knife. There are some methodological issues with what you did, but that is not the most problematic aspect of your analysis. It is bad enough that Dr. Stefanoni apparently wished to use a threshold of 50 RFU here but not with respect to the knife blade profile, where 22 of 29 peaks fall below 50 RFU. This is fairly obvious bias, and one textbook on the subject specifically cautions practitioners to set the threshold prior to the analysis to avoid such a problem.

    However, even worse is the existence of a peak of 108 RFU in D5S818 with a repeat number of 13. This is in the wrong position to be a stutter and it is more than twice the threshold. Massei wrote (pp. 208-209), “Dr Stefanoni declared that she had not considered that peak as an allele or as noise…” This makes no sense; it has to be something. Moreover, there are other peaks that are about as tall as one of 65 RFU that she did count. If there are other individuals who contributed to the DNA on the clasp (and Conti and Vecchiotti imply that there are), then one is obliged to explain how their DNA was deposited. If theirs was deposited innocently, then why not Mr. Sollecito’s?

    • maundy says:

      Hi Chris,

      I’m not sure what you are driving at with the comparison between Stefanoni’s methodology in the case of the clasp and that in the case of the blade. The blade sample is a peak-for-peak match with Kercher. The only way the equipment could have been clearer about it would have been by printing out a picture of her face. So, regardless of RFU value, it would have been beyond absurd to have considered whether any of the peaks might be stutters. I think that’s a country mile beyond dispute.

      Don’t get me wrong, I am not saying that Stefanoni was necessarily right, in the case of the clasp, to ignore all of the peaks that she did. In fact, I’m planning a further post about how the disputed peaks might be interpreted.

      Since you bring up the additional peak at D5S818, one of the things that occurs to me (and I wonder why it didn’t occur to Stefanoni) is that this is a fairly clear candidate allele which maps to Amanda Knox’s DNA profile.

      • The blade sample is only a peak-for-peak match with Meredith’s sample if you ignore the fact that there are peak imbalances of about threefold in some loci, if you ignore the presence of two extra alleles in one locus, and if you ignore the threshold of 50 RFU. These problems are understandable if one concedes that the sample is in the low template range of analysis. I think the main problem here is that you have misunderstood my argument. If we allow a lower threshold for the blade, then there is no obvious reason to reject other peaks that fall below 50 RFU. If one is arguing for two different thresholds, at least two criteria should be met. One is that the threshold should be set prior to the analysis (as per Rudin and Inman’s textbook, pp. 14-15). Two is that the reason for setting different thresholds should be objective; therefore, other scientists should be able to examine the reasons for it and agree.

        The fact that you are bringing Amanda’s reference profile into the discussion tells me that you are analyzing the bra clasp DNA with her profile in hand. Surely you know that this is prohibited until the last step in the analysis discussed in Butler’s textbook, Forensic DNA Typing (p. 158), and I am sure that you can appreciate the very good reasons for it.

      • maundy says:

        I think it’s very likely that the sample is low template. But I don’t think you’re correct in thinking that there should be a strict common standard for cleaning across tests. The blade sample gives a very clear and unambiguous reading. We know – more to the point, Stefanoni knew – that the concentration was low. So the indicators of low copy are just that and there’s no reason to discount the peaks. Indeed, post hoc we can say that a suggestion otherwise really amounts to a hypothesis that a profile identical to Kercher’s was, to a large extent, randomly formed by noise. Which would be a daft thing to propose. In contrast, the bra clasp sample is clearly mixed (so, a greater possibility of stutters) and it’s very clear that at least some peaks can be written off. A decision to treat that printout in the same way as the blade printout would have made no sense.

  3. Hi Maundy,

    Dr. Tagliabracci disputed Dr. Stefanoni’s claim that the bra clasp matched Raffaele’s profile with respect to at least five loci. Judge Massei wrote (pp. 296-297), “Consequently, there are apparently a considerable number of loci that are not the subject of dispute, a number which seems to be greater than the number of disputed loci and greater than the number of six loci with reference to which Professor Tagliabracci had previously declared, before the current systems were available, it was enough …it must be held that, for the greatest number of loci at least, the peaks were so clear and the interpretation so sound that they could not be contested. Consequently, the overall result should be considered fully reliable, even disregarding the repetition of the analysis. It should however be noted that Dr. Stefanoni, during the hearing at which she testified, had offered suitable explanations and answers which this Court considers acceptable.”

    Raffaele’s appeal document correctly notes that Massei’s argument about the numbers of disputed and undisputed loci is contrary to the principles of forensic genetics. Let us assume that the data are clear enough to avoid ambiguity and consider the following analogy. Suppose that a winning lottery number is 12497635834, and I have a lottery ticket that is 12497235834. And suppose I claim that since my ticket has 10 out of the 11 numbers identical, I am a winner. That argument makes as much sense as Massei’s first argument does. Massei’s second argument appears to be that Dr. Stefanoni was right and Dr. Tagliabracci was wrong, but it gives no reason for this assessment.

    Your calculation is based on thirteen loci. Can you explain why you chose that number?

    • maundy says:

      Don’t know whether you’ve noticed my mea culpa above. It should be 16 points, not 13.

      Tagliabracci may well have been right to dispute five loci. There’s at least one additional peak (the 13 in D16S539) which C&V don’t contest, but I can’t understand why they haven’t. Perhaps they have a reason, but that would make five altogether. But even then, that leaves 11 loci. So, let the defence have all five and there’s still enough for a definitive match.

      The lottery analogy won’t go where you’re trying to take it, because there’s no such thing as a jackpot in DNA analysis. It’s about how many “numbers” you do match, not how many you don’t. Unless you’ve got an enormous amount of time on your hands, there will always be a few billion that elude you. To put it another way, the important thing to consider is that the odds of getting the first ten numbers in your example are one in ten billion. Failing to get the eleventh doesn’t change those odds.

      Apart from which, C&V are not really making the argument that there is no match to Sollecito at the loci they discuss. What they are suggesting is that there may be something else there as well.

  4. Hi Maundy,

    With due respect you seem to have fallen into the same error as Massei, the one noted by Raffaele’s appeal as being contrary to the fundamental principles of forensic genetics. If Raffaele’s reference profile matches a piece of evidence at 11 loci but is different at 5 loci, the interpretation is exactly the same as if his reference matches at 5 but is different at 11: he is excluded as a contributor to that piece of evidence. I ran this sort of question by a forensic geneticist some time ago just to be certain. Your analysis would be valid if we were dealing with a partial profile of 11 loci and no alleles at all at the rest of the loci, but that is not how I understand the debate over the clasp.

    • maundy says:

      “This sort of question”? So, am I right in thinking not the precise question we are talking about?

      Two things to consider. Firstly, what you are saying would certainly be true in cases where only a small number of loci are in consideration. If, say, you had three matches then a fourth which was incompatible, then your result is negative because it is possible that the DNA belongs to someone other than your suspect. But if you had twelve and then a thirteenth that was incompatible the situation is different, because the odds against the DNA belonging to anyone other than your suspect are so astronomical by the time you get to locus twelve. Quite possibly, your first reaction might be to pinch yourself, because it would be a bit like taking a perfectly-defined pair of handprints from a crime scene and then finding that your suspect matched nine of the digits but not the tenth. It’s just not possible to find two people who will give you identical readings at twelve (or even eleven) loci. Instead, you would be forced to conclude that you were dealing with a mixed sample, including your suspect and an unknown contributor.

      Secondly, there does not seem to be actual incompatibility with Sollecito at any allele (i.e. there is a peak everywhere you would expect to see one – this is according to Massei, and I don’t see it contradicted anywhere else). What C&V say is that, at four or more loci, there are additional peaks which they would not rule out. This suggests the possibility of contributions to the sample that are not from Kercher or Sollecito, rather than the possibility that the match to Sollecito is wrong.

      I do think things may be more interesting if the defence can go beyond the four loci highlighted by C&V, though.

  5. Maundy, you wrote, “It’s just not possible to find two people who will give you identical readings at twelve (or even eleven) loci.” Yet you do allow for coincidence explaining three matches. How many matches are necessary, to your way of thinking?
