What is peer reviewing?

:: academia

I’ve been doing, and experiencing, a lot of peer reviewing lately. I’ve been ranting about it on Twitter as I get reviews that don’t help me and, in many ways, hurt me, and lauding reviews that provide useful constructive feedback (even if I disagree with it or the decisions). I’ve been trying to figure out how to provide good reviews and avoid negative aspects of reviewing.

I need to get the thoughts out of my head. These are not declarations of what peer reviewing is or should be, but my attempt to work through those questions.

The Scientific Aspect of Reviewing

scientific | adjective: based on or characterized by the methods and principles of science.

Science | noun: a systematically organized body of knowledge.

Mostly we think about the scientific aspect of peer reviewing. If we’ve written a scientific piece, we have tried to answer a question, scientifically. We have posed a research question, hypothesized some answer, tried to evaluate the answer objectively, and presented all of that clearly to advance the state of knowledge.

The point of peer review, then, is to check that we have indeed done some science. To check that our question is reasonable, that our evaluation is not flawed, and that we have indeed advanced the state of knowledge (and communicated that knowledge clearly, at least relative to some community).

This interpretation of peer review is intuitive, but is it real?

Here are the evaluation criteria from the calls for papers of some major programming languages conferences I’ve been involved in. They’re all SIGPLAN conferences, so they are all fairly similar:

  • POPL 2022:

    The Review Committee will evaluate the technical contribution of each submission as well as its accessibility to both experts and the general POPL audience. All papers will be judged on significance, originality, relevance, correctness, and clarity. Each paper must explain its scientific contribution in both general and technical terms, identifying what has been accomplished, explaining why it is significant, and comparing it with previous work.

  • PLDI 2022:

    Reviewers will evaluate each contribution for its accuracy, significance, originality, and clarity. Submissions should be organized to communicate clearly to a broad programming-language audience as well as to experts on the paper’s topics. Papers should identify what has been accomplished and how it relates to previous work.

  • ICFP 2022:

    Submissions will be evaluated according to their relevance, correctness, significance, originality, and clarity. Each submission should explain its contributions in both general and technical terms, clearly identifying what has been accomplished, explaining why it is significant, and comparing it with previous work. The technical content should be accessible to a broad audience.

Correctness

Accuracy or correctness appear in all of these. This makes sense; one major aspect of reviewing is to make sure the scientific work hasn’t made any mistakes. Of course, the reviewing itself might make mistakes. The goal isn’t to do a perfect job, but to try. And if we try, keep doing science, keep checking each others’ work, keep asking questions and even re-asking questions, eventually we’ll approach something resembling the truth. I like this criterion and think it’s probably the most important one.

Originality

Originality appears in all three as well. This one is a little odd. If the point of science is to advance the state of knowledge, it makes sense that scientific work should be original, i.e., new, novel, producing something that was previously unknown. But it also seems a little at odds with the previous criterion. One great way to check correctness is to reproduce or replicate prior results, double-checking existing work and making sure we get the same answer. The emphasis on originality or novelty seems to be at odds with this goal. We could interpret it a little more generously, by considering replication and reproduction as new in the sense that they are new evaluations of an old question, so they are still original work. That’s okay with me. But it does require a little care in the interpretation of the reviewing criteria....

Clarity

Clarity appears in all three as well. This is interesting, as it might seem irrelevant to the idea of a rigorous and objective evaluation of a research question.

It may even seem... not true of essentially any research paper. If you’ve ever tried to read one of my research papers and you’re not literally in my field (and even many people who are), you’d probably say they’re hard to read. Maybe not very clear, maybe lots of obtuse notation, vocabulary, methodologies, ideas... But the reviews judged them to be clear enough to accept.

So clarity seems to be quite ambiguous, quite subjective. At the very least it’s relative. Let’s say it’s relative to the community reviewing and calling for papers.

But why is it important, as long as the work is correct? Well, the point of science isn’t merely evaluating a research question, but advancing the state of knowledge. We can’t very well advance knowledge if no one in the community of interest, or only a small fraction of it, can understand what we’ve done.

So it’s important for the work to be not only well evaluated, but clearly (relative to some community) well evaluated, and well communicated. This way it advances knowledge for many people, and not just the authors.

So far, so scientific. All these criteria directly relate to the original scientific goal.

Relevance

POPL and ICFP include "relevance". I take this to mean relevance to the field. For example, submitting a machine learning paper to POPL is probably not relevant, even if it is of the highest quality science.

This is related to the advancement of knowledge, since only reviewers who are familiar with the area, methodologies, state of knowledge, etc., are going to be capable of assessing the other criteria.

This leads to some problems at the borders between areas. What if a paper is relevant, but draws heavily on another field, and as a result there isn’t very much expertise available to review it? One hears stories about these papers having a hard time finding a place in any venue. I guess I should take a charitable view of such papers, but then, that may require sacrificing review quality.

Or what about work that is revolutionary, in the sense that it creates completely new models or methodologies? Such work would argue that it is relevant, but it would be difficult for the reviewers to judge it so. It might seem completely irrelevant: no one has used it yet, and perfectly good existing techniques (which will, by nature, be clearer and easier to judge the correctness of) still apply to many of the examples.

I don’t think there’s a good solution here. Revolutions are hard. Founding new fields, subfields, etc, is hard.

Significance

The last criterion for PLDI is significance, which also appears for POPL and ICFP.

I have no idea what this means. The dictionary provides:

significance | noun [mass noun]: the quality of being worthy of attention; importance

This seems like a very suspect criterion. How do we know, a priori, which ideas are important, which are worthy of attention? That requires us to understand, in advance, what impact the work could have. Are we psychic?

I suppose, in some cases, it might be somewhat clear that work is not important. Sam TH, whom I love for pushing against my (and others’) simple takes to advocate for complexity, recommended this paper: https://doi.org/10.1007/s11245-006-0005-2. It introduces a strawman in the context of philosophy: people inventing and then studying a game called "chmess". One could investigate all sorts of questions about chmess scientifically, but they would all be insignificant, because the made-up game affects no one in the world.

In this case, it’s clear that the questions are not significant, but it introduces the key problem with significance: the answer is a relative one (like clarity). Maybe a question is not significant to one field, but is to another, because that field knows how to apply that kind of question to some other problem that is significant, and so on.

I don’t know how to detect significance then.

The chmess paper suggests asking whether you could explain the research question to an outside audience and convince them of its importance. This seems like a low bar, however. Almost every paper I read begins with a motivation section, which does exactly that: here’s some interesting theoretical problem, and here’s how it could, in principle, be used to solve or address or make progress on some real-world problem. Perhaps that just means we’re all good at (convincing ourselves that we’re) working on significant problems.

Supposing I could accurately judge significance by this definition, how does it relate to the original goal: ensuring that a scientific work advances knowledge? Well, if the result is unimportant, one might argue (if one were a pragmatist, in the philosophical sense) that unimportant knowledge is not useful, and therefore not knowledge. But it still seems quite difficult to judge the utility of knowledge.

It reminds me of number theory, which was long considered to be without practical application. Until we invented cryptography.

While I was trying to understand significance on Twitter, others proposed an alternative definition: that the research question is large enough (to be important?). This seems like an even more suspect definition. For one, what is large enough? Completely subjective. The prior definition of significance is relative to a community, but this one is relative to an individual’s expectations. For two, it has perverse results in practice. In an effort to ensure a result is big enough, an author is incentivized to make a result larger (or appear larger), perhaps artificially. The knowledge is withheld from the community and society until it reaches some arbitrary bar. And that bar is unavoidably pushed higher: each year, new scientists joining the field calibrate against the current bar, new authors strive to beat it (thanks to the prior incentive), and the baseline is reset.

And for what? What part of our original scientific goal is achieved by ensuring a result is "large enough"? It doesn’t help advance knowledge to withhold a result for being small, if it is correct, and clear, and original.

I see none.

At best, this definition seems to be a response to something unscientific... see the next section.

The Social Aspect of Reviewing

social | adjective: relating to society or its organization.

There’s another aspect to peer review, which I will call a social aspect. Science is, of course, a social process, so these are not unrelated. But I want to separate them.

The previous aspects of reviewing dealt explicitly with the scientific mission: the advancement of knowledge. But it is us humans who are attempting to do that, in a large context with many systems and pressures that we interact with and within.

For example, to advance knowledge, we must keep up with the state of knowledge. One purpose of reviewing I’ve heard advocated is to act in defense of human attention, so that we can focus on the advancements that are relevant and important. A peer reviewer’s job is, in part, to reject "noise" from the process, so that it is even possible to know what the state of knowledge is.

This seems related to the second definition of "significance" in the prior section. If a result is "too small", perhaps it becomes a distraction. New authors spend too much time learning it and pushing it to a state of being useful, taking more time than it would have taken the original authors, thus wasting time. Or perhaps they realize the result does not scale to as many settings as originally assumed, but only after wasting a great deal of time, attention, and effort.

But this requires us to make the call on what others will want to, and/or should, pay attention to. That’s a hard call to make; I’m not sure whether I can, or should, make it.

It could also be a reaction to gaming the system that employs scientists and funds research. A bunch of small publications still adds up to a large publication count, which can (through the magic of relying on metrics) convince people to provide funding or jobs to one person, but not another. This, in turn, can prevent others from doing useful work to advance knowledge; they could have made use of those jobs or that funding.

I’m not going to rant here about the Tyranny of Metrics, for which there’s a whole book I heartily recommend.

Metrics exist, and we have to work with them, so it’s worth paying attention to them. We don’t want people gaming our system, even if our system is stupid. We should try to change it, but that’s not always possible.

Still, I think it’s worth being careful of doing more harm than good when trying to prevent people from gaming the system. I’m not sure how easy it is to recognize, a priori, the "salami slicing" of a work into lots of little pieces. Science is incremental.

On the other side, does a venue have a responsibility to game unjust systems? Some systems evaluate venues based on inclusion in certain indexes, acceptance rates, classification as journal or conference, etc., and people’s jobs rely on these things. Should a venue do things like turn itself into a journal (as PACMPL has done) to game an unjust evaluation system? What about targeting a certain acceptance ratio, which POPL explicitly does? This acceptance ratio is important for maintaining a high ranking in the CORE ranking system.

See ICFP’s downgrade from A* to A, which cites its acceptance ratio of 33% as being too high: http://portal.core.edu.au/core/media/justification/CORE2021/4612ICFP.pdf

Do reviewers have a responsibility to maintain a high ranking, to the benefit of the venue and the people submitting to it, by rejecting a certain proportion of papers? Without doing so, we risk the scientific endeavour: the venue and its community may no longer be able to effectively advance knowledge.

I dunno man I’m not a moral philosopher.