The Structure and Interpretation of Computer Science Academic Metrics
I’ve been unhappy with metrics lately, watching academics (of all people) apply metrics without interpretation. I’m losing my mind over it.
This was free-written in one sitting; don’t @ me.
Publication Count
This probably started when my department introduced a new policy explicitly measuring publication count. They wanted an objective, transparent metric of research productivity. They decided (well, some of them decided, and pushed through a policy against all department norms) that the metric of research productivity would be (drum roll please): publication count. At least 1 paper per year (ish) would be required.
As if “1 paper per year” is a meaningful metric. As if “1 paper” is a thing that exists. We all know there is no consistent interpretation of “1 paper”. There was much discussion on this point. Everyone admitted that “1 paper” varies greatly from field to field. Some papers are small, some are large; some take many years to research, write, and polish, while others are small results easily accepted at small venues. But surely, “1 paper” is the bare minimum anyone could possibly be expected to publish in any field. So the policy was codified.
Publication count has long been a proxy for research productivity, and while it is objective and transparent, it’s also meaningless.
Number of physical chairs in a department is also objective and transparent, and almost equally meaningless. Surely most chairs in a department are assigned to people doing research? They need a place to sit in order to do research. So as the number of physical chairs increases, the research productivity increases, because more researchers are sitting in chairs doing research. Sure, there are differences between chairs. Some are rolly chairs, some have lumbar support, some are mere stools. Some, of course, are assigned to admin staff, and really only indirectly support research productivity. It’s not a perfect metric, but it’s objective and transparent!
I’m generally skeptical of metrics. “When a measure becomes a target, it ceases to be a good measure.” — Goodhart’s Law. Merely the act of measuring changes people’s actions. It provides an incentive to game the metric. You can use evolutionary game theory to show that introducing a metric as a fitness function will destroy any other variable, say, quality. If only the fittest survive, then anyone who better optimizes the fitness function wins and survives. When the fitness function is “number of publications”, and if this is not identical to “quality” (it’s not; it takes longer to polish a paper than to merely publish one), then trying to maintain “quality” is a losing strategy. Any rational actor seeking to survive will change their strategy, avoid “quality”, and optimize “quantity”. Any irrational actor, seeking to maintain “quality” over “quantity”, will lose and not survive.
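Here is that argument as a toy replicator-dynamics sketch in Python (every number is invented purely for illustration; this is a cartoon of the argument, not a model of any real department): if fitness is raw publication count and a quick paper counts the same as a polished one, the “quality” strategy goes extinct.

```python
# Toy replicator dynamics for two strategies. All numbers are made up for illustration.
# Strategy "quality": spend the year polishing, publish 1 paper.
# Strategy "quantity": publish 3 quick papers.
# Fitness = publication count, because that is what the policy measures.

def replicator_step(share_quality, fit_quality=1.0, fit_quantity=3.0):
    """One generation: each strategy's share grows with its fitness relative to the mean."""
    share_quantity = 1.0 - share_quality
    mean_fitness = share_quality * fit_quality + share_quantity * fit_quantity
    return share_quality * fit_quality / mean_fitness

share = 0.95  # start with nearly everyone maintaining quality
for year in range(25):
    share = replicator_step(share)

print(f"share still playing 'quality' after 25 generations: {share:.2e}")  # effectively zero
```

The specific numbers don’t matter; any fitness gap between “quantity” and “quality” compounds every generation until quality is gone.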
“1 paper” is not a measure of research productivity. It is a measure of how many papers one has written, which is loosely correlated with, but not the same as, research productivity. It’s fairly easy to write papers with no research content. There are lots of venues that will accept a tutorial paper, or a neat idea paper, or a controversial opinion, or a talk abstract. I’ve written some of these; it’s not bad to write them. But they’re not necessarily the same as research productivity. I would not equate my ACM Opinion article (a paper, with a DOI) with a POPL paper. That would be absurd. I wrote that as a blog post late one night after getting mad about ACM fees. Probably my “lowest tier” publication is a paper at Scheme Workshop. Scheme Workshop appears to be slowly dying, doesn’t have formal proceedings (most of the time?), gets few submissions, and the acceptance rate is almost 100%. But that paper involved a year of writing code, documenting, deploying to hundreds of users, testing, writing 24 pages of technical explanation, and preparing and delivering a 30-minute talk. That’s some research output. I’d have been smarter, more fit, to submit a neat little toy in Scheme that took no time to write about.
How could academics, trained in studying and understanding data in minute detail, so naively apply metrics this way? Don’t tell me; I know the answer. They wanted to make decisions more easily, and they found an objective, transparent metric to do that.
Emery Berger’s Ranking
This is literally the reason http://emerybergersrankings.org was created: to provide an objective, transparent metric to aid grad students in making decisions about which grad school to go to.
I also hate Emery Berger’s Ranking. The goal is to provide “a metrics-based ranking of computer science institutions”. A ranking of what? A metric that measures what? It measures the output at a completely arbitrary list of conferences, divided by author count per paper. What does that tell you? Why is the school that is Number 1 in Arbitrary Conference Set Papers Divided By Author Count the “best” school? The one you want to spend ~6 years of your life studying at? At best, Emery Berger’s Ranking tells you that going to the most highly Ranked school will probably result in you producing some conference publications, likely with fewer coauthors than at other institutions? If that’s your goal, this is a great metric! Otherwise, it’s a bad metric. Honestly, US News and World Report is a better metric for choosing a grad school, as it takes into account things like how nice the campus is, which greatly contributes to my happiness on a day-to-day basis.
And even that much is an overstatement of what the metric measures, because it arbitrarily chooses a cutoff for default inclusion in the metric: an acceptance rate of roughly 30% or below. Why a conference’s rejection rate is at all related to which is the best school is completely beyond me, but this means the ranking is actually the order on Number of Papers at Arbitrary Conferences with an Acceptance Rate Below About 30%, Divided by Author Count.
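To be concrete about what is actually being sorted on, here is a minimal sketch of that computation (the schools, papers, and conference list below are hypothetical; I’m only illustrating the arithmetic the ranking describes: count papers at the chosen venues, split each paper’s credit by its author count, and sort):

```python
# Hypothetical papers: (institution, venue, author_count). Purely illustrative.
papers = [
    ("School A", "CONF_X", 2),
    ("School A", "CONF_X", 5),
    ("School B", "CONF_X", 1),
    ("School B", "CONF_Y", 3),  # CONF_Y isn't on the arbitrary list, so it counts for nothing
]
arbitrary_conference_list = {"CONF_X"}  # the arbitrary list of "top" conferences

scores = {}
for institution, venue, author_count in papers:
    if venue in arbitrary_conference_list:
        # Each counted paper contributes 1 / (author count) to its institution.
        scores[institution] = scores.get(institution, 0.0) + 1.0 / author_count

# School A: 1/2 + 1/5 = 0.7.  School B: 1/1 = 1.0.  So School B is "better".
print(sorted(scores.items(), key=lambda item: -item[1]))
```

In this sketch, one solo paper at a listed venue beats two collaborative ones, and a paper at anything off the list counts for nothing.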
All of these things are proxies, loosely correlated with, but not the same as, quality. Rejection rate is a proxy for quality of a venue, since a highly prestigious venue (not the same as quality, but correlated) will attract more submissions. Since a conference can only accept so many talks, more submissions will be correlated with (but not the same as) a higher rejection rate. Author count is correlated with (but not the same as) a measure of effort per publication, since (in theory) more authors should reduce the effort involved in publishing a paper. Why this particular set of conferences? Well obviously those are “the very top conferences in each area”, according to Emery Berger’s judgement.
So this metric measures SOMETHING, but it’s not what is the best CS grad school, or the most research-productive CS department, or really anything other than “Number of Papers at Arbitrary Conferences with an Acceptance Rate Below About 30%, Divided by Author Count”.
But it is objective and transparent (at least, if you actually spend the time to interpret it and don’t just sort by it). And it does let people make decisions more easily. Not good decisions, but they sure are easy!
We know for sure that Emery Berger’s Ranking has had exactly the effect predicted by evolutionary game theory. We have examples of committees using it to judge the quality of publications (note: it cannot measure that; see the interpretation above). We have examples of people choosing venues not based on the best fit for the work, but to optimize the Emery Berger’s Ranking of their research. Of course they would; that’s the winning strategy when the fitness function for promotion and tenure uses a venue’s Emery Berger’s Ranking (see above).
ICLR Points
Imagine my surprise when a new metric based on the same dataset underlying Emery Berger’s Ranking popped up, and after one look… I LIKED IT.
https://cspubs.org roughly measures the relative effort required to publish 1 paper, in terms of ICLR papers. What is this, a metric that seeks to measure effort per publication? That is actually a thing this data COULD measure. The data takes into account the number of authors, which is pretty related to effort: more coauthors mean you can produce more research. It takes into account rejection rate, which is related to effort: more frequent rejections and resubmissions require more effort. It takes into account area norms about how much work is required to get a paper at a particular venue, which is pretty much the definition of effort. Since it’s directly measuring effort, it’s not a measure that is easily gamed: if you want to increase your ICLR points, then you, what, switch to publishing in venues that require more effort? That will almost certainly net you exactly the same number of ICLR points per unit of effort, since they measure effort.
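Here is a minimal sketch of what that kind of effort accounting looks like (the venue weights below are placeholders I made up, not the numbers cspubs actually computes, and the split by author count is just how I’m illustrating it):

```python
# Hypothetical effort weights, in units of "ICLR papers per paper at this venue".
# These are made-up placeholders, not cspubs's actual numbers.
effort_weight = {
    "ICLR": 1.0,      # the unit of measure
    "POPL": 2.0,      # pretend a POPL paper takes twice the effort of an ICLR paper
    "WORKSHOP": 0.3,  # pretend a small workshop paper takes much less
}

def iclr_points(record):
    """Sum effort weights over (venue, author_count) pairs, splitting credit among coauthors."""
    return sum(effort_weight[venue] / author_count for venue, author_count in record)

# Two hypothetical researchers, each with a raw publication count of 2:
print(iclr_points([("POPL", 2), ("ICLR", 1)]))          # 2.0 ICLR points
print(iclr_points([("WORKSHOP", 1), ("WORKSHOP", 1)]))  # 0.6 ICLR points
```

Same publication count, very different scores; and moving your papers to heavier-weight venues only raises your score to the extent you actually do the extra work, which is why this is hard to game.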
It’s not a perfect metric. It’s not precise. Is the difference between 1 POPL paper and 1 OOPSLA paper 1.2 ICLR papers? Uh… I dunno, not all OOPSLA papers are the same, not all POPL papers are the same, but maybe? But there is at least a difference in effort.
And yet, I saw people complaining. “Why doesn’t cspubs like ICFP and OOPSLA?” That’s not what it says! It says ICFP and OOPSLA require less effort, which is probably true! They have a lower rejection rate, which decreases effort!! This is quite a good metric for that.
Or maybe I’m just biased because it says PL papers require the 2nd most effort out of any area in CS. Damn straight I put a lot of effort into my papers.