Statistics, Clergy Abuse, and Homosexuality: Correlation and Aggregation
Updated: Feb 7, 2019
The previous post gave a brief outline of Dr. Paul Sullins’ recent paper on homosexuality and clergy abuse and discussed an initial concern about his use of data. This post continues that discussion by outlining some basic issues with correlation. We hope to help equip Catholics to better evaluate data presented on this very controversial issue. We have linked within the article to definitions of various terms, and we highly recommend reviewing them, as Catholics only stand to gain from understanding how statistics can be used and misused to support positions.
Correlation, the statistical measure underlying Sullins’ claims, is a measure of how closely two variables move together. Two variables are highly positively correlated when, as one of them grows or shrinks, the other tends to do the same; they are negatively correlated when one tends to fall as the other rises. Correlation does not take into account the scale or units of the two variables, and a high correlation does not by itself show that the two variables have anything to do with each other, so it should be interpreted with some caution. In one famous example of spurious correlation, an economics professor calculated a 99% correlation between the S&P 500 and the price of butter in Bangladesh. There is even a website devoted to collecting amusing examples of the phenomenon.
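For readers who want to see the point about scale and units for themselves, here is a minimal sketch in Python. The numbers are hypothetical and chosen only for illustration, and the helper name `pearson` is our own; it computes the standard Pearson correlation coefficient by hand. Rescaling or shifting either variable (as a change of units would) leaves the coefficient unchanged.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data, made up purely for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]

# The same data expressed in different "units":
# x multiplied by 1000, y divided by 100 and shifted by 7.
x_rescaled = [1000 * v for v in x]
y_rescaled = [v / 100 + 7 for v in y]

r_original = pearson(x, y)
r_rescaled = pearson(x_rescaled, y_rescaled)

print(r_original, r_rescaled)  # the two values agree up to rounding error
```

Because the coefficient ignores units in this way, it says nothing about how large the movements in either variable actually are.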
Furthermore, correlation can be very sensitive to the level of aggregation of the two variables. Take, for example, the two vectors (0, 1, 0, 2) and (3, -1, 5, -2). It is easy to see that they are negatively correlated. When the first vector goes up from 0 to 1, the second vector goes down from 3 to -1. When the first vector goes down from 1 to 0, the second vector goes up from -1 to 5, and so on. These two vectors have a correlation coefficient of -0.92, representing an extremely strong negative correlation.
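However, suppose the two vectors are aggregated into buckets of two values each. The first aggregated vector is ([0 + 1 =] 1, [0 + 2 =] 2), and the second is ([3 + -1 =] 2, [5 + -2 =] 3), yielding the vectors (1, 2) and (2, 3). These have a correlation coefficient of 1, a perfect positive correlation. (With only two points in each vector the coefficient can only be 1 or -1, but the reversal of sign is the point: aggregation has turned a strong negative relationship into a positive one.)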
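The calculation above can be checked with a short Python sketch. The helper names `pearson` and `aggregate` are our own, introduced only for illustration; the vectors are the ones used in the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def aggregate(values, bucket_size):
    """Sum consecutive values into buckets of the given size."""
    return [sum(values[i:i + bucket_size])
            for i in range(0, len(values), bucket_size)]

first = [0, 1, 0, 2]
second = [3, -1, 5, -2]

# Correlation of the original, disaggregated vectors.
r_raw = pearson(first, second)

# Aggregate each vector into buckets of two values, then correlate again.
first_agg = aggregate(first, 2)    # [1, 2]
second_agg = aggregate(second, 2)  # [2, 3]
r_agg = pearson(first_agg, second_agg)

print(round(r_raw, 2))  # -0.92
print(round(r_agg, 2))  # 1.0
```

The same underlying numbers produce a strong negative correlation before aggregation and a perfect positive one after it, which is why the level of aggregation needs to be stated and justified.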
This is important because the data presented in Sullins’ Figures 9 and 10 has been aggregated into five-year periods. His charts therefore leave open the question of whether the relationship between the variables in each figure would be as strong, run in the same direction, or even be present at all if the data were disaggregated.
Sullins does admit this last point, though only deep in the paper and not in the materials edited for broader release.
Finally, there is a further problem: Table 1 (which, again, Sullins uses to support his correlations in Figures 9 and 10) does not aggregate the data. Rather, Table 1 appears to consider each year individually. Sullins does not explain or even note this difference. Nonetheless, in our next post we will consider some of the data in Table 1 on its own.
Posts in this series: