Summary: ~ 600 words on a moderately controversial recent paper that has been discussed endlessly elsewhere. Reading this is unlikely to be a good use of your time.
What Tomasetti et al. did was:
- Choose a set of ~ 30 cancers
- Find the lifetime incidence for each of those cancers
- Independently, find estimates for the number of stem cells and the rate of stem cell division in each of the tissues from which those cancers arise, and then use these to calculate the lifetime number of stem cell divisions in that tissue
- Find the correlation of the lifetime incidence of the cancer with the lifetime number of stem cell divisions in that tissue
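The steps above can be sketched in a few lines. This is a minimal Python illustration rather than the code used for this post: the tissue names and numbers are made up, and treating lifetime divisions as simply (number of stem cells) × (divisions per stem cell over a lifetime) is a simplifying assumption, not necessarily the exact formula in the paper.

```python
import math

# Hypothetical, made-up tissues:
# name -> (stem cells, divisions per stem cell per lifetime, lifetime incidence)
tissues = {
    "tissue_a": (1e6, 10.0, 0.0002),
    "tissue_b": (5e7, 50.0, 0.001),
    "tissue_c": (2e8, 500.0, 0.004),
    "tissue_d": (1e8, 5000.0, 0.05),
}

def lifetime_divisions(stem_cells, divisions_per_cell):
    # Simplifying assumption: every stem cell divides the same number of times.
    return stem_cells * divisions_per_cell

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Step 3: lifetime number of stem cell divisions per tissue.
divisions = [lifetime_divisions(s, d) for s, d, _ in tissues.values()]
# Step 2: lifetime incidence per cancer.
incidence = [inc for _, _, inc in tissues.values()]

# Step 4: correlate the two, here on a log10 scale.
r = pearson([math.log10(v) for v in divisions],
            [math.log10(v) for v in incidence])
print(f"log-log Pearson r = {r:.2f}")
```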
To make sense of their results, I decided to reproduce their analysis. First, I manually copied and cleaned up the formatting of the data from their PDF (!). One of the first things I did after loading it into R was to plot it:
This didn’t look like the plot in their paper, but I realized that was because I hadn’t taken the logs. Once I did, it looked much more similar (although I still flipped the axes):
The Pearson correlation coefficient is much higher in the log-log data (0.80) than in the non-transformed data (0.53). The Spearman correlation coefficient is the same in both (0.81), because taking logs is a monotonic transformation and so leaves the ranks unchanged. This is a good example of how the Spearman correlation is robust to such transformations, whereas the Pearson correlation is not.
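To see the rank argument concretely, here is a small self-contained Python sketch. The numbers are made up (they are not the paper's data), and the `pearson`, `ranks`, and `spearman` helpers are written out by hand purely for illustration. Spearman is computed as the Pearson correlation of the ranks, so it cannot change when both variables are log-transformed.

```python
import math

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    # 1-based ranks; tied values share the average of their ranks.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(xs), ranks(ys))

# Made-up values spanning several orders of magnitude.
divisions = [1e6, 5e7, 2e9, 8e10, 3e12]
incidence = [0.0001, 0.0004, 0.003, 0.02, 0.01]

log_div = [math.log10(v) for v in divisions]
log_inc = [math.log10(v) for v in incidence]

raw_pearson = pearson(divisions, incidence)
log_pearson = pearson(log_div, log_inc)
raw_spearman = spearman(divisions, incidence)
log_spearman = spearman(log_div, log_inc)

print("Pearson  raw vs log:", raw_pearson, log_pearson)
print("Spearman raw vs log:", raw_spearman, log_spearman)
```

On this toy data the Pearson coefficient changes a lot once logs are taken, because a single very large value dominates the untransformed scale, while the two Spearman values agree.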
So what does figure one mean? I interpret it as showing a positive relationship between the lifetime number of stem cell divisions in a tissue and the lifetime incidence of cancer in that tissue (variance explained of ~ 0.66). This in turn suggests that cancers arising from tissues with more stem cell divisions are due more to the “luck” of whether or not one of those divisions happened to introduce a mutation than to a genetic predisposition.
Of course, the rate of stem cell divisions and/or mutations could still be influenced by an environmental factor, but it’d be less likely that any hypothetical environmental factor would affect the risk of mutation in all stem cells in all tissues at the same rates.
However, this does not suggest that “66% of cancers are caused by bad luck”, for many reasons, including the fact that the residuals are not weighted by the proportion that each individual cancer makes up of total cancer rates.
The next section of their paper ranks the cancers based on a score meant to quantify the amount of cancer that occurs above and beyond what you’d expect from the stem cell division rate in that tissue, and then clusters the cancers into two groups based on this ranking.
It’s slightly troubling that they call this one-dimensional k-means clustering “machine learning”. K-means clustering is indeed commonly considered a machine learning method, but with two clusters in one dimension it reduces to estimating a single breakpoint, which is not quite in the spirit of the high-dimensional problems that “machine learning” usually connotes. [Editor’s note: this was slightly edited for clarity/correctness].
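To illustrate the breakpoint point, here is a minimal Python sketch under stated assumptions: the scores are invented, and `best_breakpoint` is a hypothetical helper, not the paper's implementation (whose details I am not sure of). In one dimension the best two-cluster solution always splits the sorted values into a lower and an upper group, so minimizing the k-means objective amounts to trying every cut position.

```python
def sum_squared_error(group):
    # Within-cluster sum of squared distances to the cluster mean.
    mean = sum(group) / len(group)
    return sum((v - mean) ** 2 for v in group)

def best_breakpoint(values):
    """Split sorted values into two contiguous groups minimizing the
    total within-cluster sum of squares (the k-means objective, k = 2)."""
    xs = sorted(values)
    best = None
    for i in range(1, len(xs)):
        low, high = xs[:i], xs[i:]
        cost = sum_squared_error(low) + sum_squared_error(high)
        if best is None or cost < best[0]:
            best = (cost, low, high)
    _, low, high = best
    # Report the breakpoint as the midpoint between the two groups.
    threshold = (low[-1] + high[0]) / 2
    return threshold, low, high

# Invented one-dimensional scores, one per cancer type.
scores = [1.0, 9.0, 1.5, 11.0, 2.0, 10.0]
threshold, low, high = best_breakpoint(scores)
print(threshold, low, high)
```

An iterative k-means routine started from different initial centers may or may not land on the same split. The exhaustive search above simply makes explicit that the whole result is one threshold on a sorted list.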
All in all, while the data being in PDF format was slightly annoying and I wasn’t 100% sure of the implementation details of their k-means clustering, I was able to reproduce pretty much all of their results. This was a useful exercise for me and it is a good sign for the longevity of their insights. Simplicity is the ultimate sophistication.
1: My code for this post is on github.
2: Since I haven’t read through every line of the paper with a hyperdontic comb, let me state that this is probably my fault for missing something.