CHI Reviewing: A Guide and Examples
Thank you for volunteering to review for CHI. If you aren't familiar with reviewing for CHI, or haven't reviewed for the last couple of years, please read this thoroughly; our approach has changed considerably. This short guide explains what you need to do in order to assist the Papers Co-chairs and Associate Chairs (ACs) on the Papers Committee in selecting a top-quality program of papers. It also addresses your equally important task of providing helpful comments to feed back to authors.
This document describes how to summarize a paper's contribution, explains what is needed in the review (a review is not just a vote for or against acceptance; it is input to a discussion amongst the 40+ chairs and associate chairs), and provides examples of both suitable and unsuitable ways of writing CHI reviews. It also emphasizes the importance of being complete, constructive and polite, and of turning your reviews in on time. If you cannot complete a thoughtful review by the review deadline, please contact your associate chair as soon as possible.
Summarizing the Contribution
When you first log on you will see a list of the papers assigned to you. The review form asks you to state, in two or three sentences, what contribution the paper aims to make to the field of Human-Computer Interaction. The Call for Participation advises authors that their papers will be reviewed in terms of their contribution, so we ask you to take some care to identify it concisely and accurately. Give some thought to the type of contribution, which may be:
Further examples are provided of how not to summarize the contribution, viz.:
The key point you should keep in mind is that your review is not just a vote for whether a paper will be accepted: it is input to a discussion amongst the committee members. You are assisting the committee, and your AC in particular, by providing an argument for or against acceptance. In some cases there will be wide divergence amongst reviewers' numerical ratings of the paper. In these cases your argument, if clearly and succinctly stated, can ultimately have more influence than the rating alone. To do so your review need not be of great length. But if it provides just a rating without an adequate rationale it will be virtually useless. Please therefore help your AC put together an overall case for acceptance or rejection by attending to:
Reading the paper thoroughly. To write a thorough review you will need to understand in detail what the authors have achieved and how they achieved it. It is often necessary to read a paper more than once, or to spend considerable time focusing on complex arguments. Please set aside sufficient time to read each of your assigned papers.
Summarizing your main points. Your AC and fellow committee members will need to understand your argument quickly, and an initial summary can help them a lot. Use the remainder of the review to expand on the summary's main points and mention other matters.
Relevant past work. If you're aware of past work against which this paper can be judged, please mention it, whether or not the authors have cited it. See an example of this in Review 2. You need not comment on the thoroughness of the paper's citations, as in Review 5 and Review 8, for this is not what the form requests.
Significance of contribution and benefit. It's particularly helpful to your AC if you can assess the contribution objectively, preferably in terms of the past work you've identified (see Review 1, Review 3). Please don't just give a short, unexplained answer or avoid this point altogether.
Coverage of all the criteria. Your AC will pull together all the reviewers' comments on relevant past work, contribution's significance, benefit, validity and originality. You're not obliged to cover every one of these, but please don't devote the whole review to just one issue. Equally, don't cover all the criteria in a cursory manner without backing up each claim.
While it is important to point out weaknesses and validity issues, it is equally important to identify the contribution of a submission. Ultimately a submission's acceptance depends on its contribution, not perfection.
Reviewing "as is". The tight deadlines of CHI rule out any possibility of checking whether authors make changes demanded by the committee. Therefore the decision on whether or not to accept must be made on the basis of what the authors submit for review. Please do recommend improvements, but don't make acceptance conditional on these changes.
Polite, temperate language. However much you may dislike the paper, try to say so in a manner helpful to the authors and informative to the AC. Please don't vent your anger or shower abuse.
There are further parts of the form after this, covering written presentation, optional further comments for the author, and optional comments for the AC and other committee members only. We hope you find these parts easy to fill in.
Meeting the Deadline
The reviewing and selection of CHI papers takes place against very tight deadlines. After all the reviews are submitted the ACs prepare meta-reviews summarizing their recommendations. These are presented at a Papers Committee meeting early in November, during which decisions must be made on all papers, no matter how many have been submitted. If reviews aren't ready in time, meta-reviews can't be written, and the committee won't have sufficient information to make its decisions. Your help is essential to ensuring these deadlines are met. In particular, we rely on you to submit your reviews on time. If you cannot complete a thoughtful review by the review deadline, please contact your associate chair as soon as possible.
The example reviews include the text that might be entered in the review form. They include four reviews suitable for CHI, each of which follows recommended practice but each in its own way. These are based on actual CHI reviews that have been altered to disguise the content of the paper in question. They also include four unsuitable reviews that we have written specially to illustrate the pitfalls we hope we have helped you to avoid.
Examples of reviews suitable for CHI
Review 1 - paper medium-rated - review suitable
This review does a first-rate job of summarizing its main points and then assessing the paper's contribution in terms of relevant past work. It provides helpful feedback to the authors concerning the presentation of the work. It is preceded by a contribution summary that mentions reservations about the originality of the work.
The paper presents a set of six guidelines on menu design, drawn from two experiments studying menu selection in the presence of other targets on a GUI desktop. These can inform choices between certain menu types in UI design. However, some of the guidelines appear to have been published already.
This paper does an excellent job of citing and summarizing past work in the area. The studies seem robust and their findings generalizable. The research does not seem to offer much, however, beyond what has already been published.
The three most related papers are probably the two by Offord et al. and the one by Masters and Selisky. With these as context, the six design guidelines seem accurate, but fairly incremental. Guidelines 1, 2 and 6 appear to be restatements of prior research (particularly Offord, Masters). Guidelines 3 and 4 appear to be a summary of the paper's experimental findings. Guideline 5 is very interesting and novel. But the studies seem to be summarizable as "we found the same results for a 5-element pull-down menu and for more freeform menus as Offord did for pie menus." While the result is rigorous, it is only a small incremental step.
I found the paper hard to follow in places, because it consistently reported details but did not offer me any opportunities to use these details in seeing a larger picture. While the studies were rigorous, the visual presentation of the results was not. Specifically, in Figure 4, are the results on a scale of 0 to 20, and is it displaying the mean? In Figure 6, are these results the mean per subject, out of 200 trials, with a theoretically unbounded maximum number of errors? Please explain, and also add confidence bars.
Of great importance: On page three (and also later), what units are "N"? My assumption is that the numbers are a fraction of the maximum force the phantom is capable of generating, but this is unclear.
Review 2 - paper medium-rated - review suitable
This is a good example of a review organized around the main criteria listed on the review form. It identifies an important body of relevant past work, and explains why the contribution and validity are considered inadequate. The section on originality gives credit, as it should, for original methods used in the research.
The authors develop a cognitive model to predict state transitions in Instant Messaging. The authors validate the state predictions in two experiments. Based on their model, they make predictions regarding interest levels of messages.
Past work: There have been several projects in the past that used cognitive models to predict Instant Messaging usage. Unfortunately the authors do not seem to be aware of this work, which mostly took place in the Ubicomp and CSCW communities (see http://www.ubico.com/papers.html, http://www.cscwj.com/conf01.html, but there are many other papers around). Using cognitive models to predict the interestingness of phone messages is new, though, as far as I can tell.
Significance: The presented work COULD POTENTIALLY be significant. Determining interesting phone messages by methods that go beyond the analysis of access frequency could yield better results than those that have been obtained so far. Unfortunately it has not been shown that the authors' model is superior. The graphical comparison of two messages in Figs. 3 and 4 is meaningless.
Benefit: If the aims of the authors were to be achieved, this would be very beneficial for researchers in the area of personalization and in instant messaging.
Validity: Inadequate. While the match between predicted states and users' indications of their current states seems largely convincing, there is no rigorous proof that one of these states really corresponds to interestingness.
Originality: To my knowledge the combination of supervised and unsupervised learning described here has not been used before in this kind of study.
Review 3 - paper high-rated - review suitable
This review does a particularly effective job of explaining the significance of the empirical findings that make up the paper's contribution. It does this by briefly summarizing two pieces of past work and showing how this paper helps in understanding their results. It also offers useful suggestions for optional improvements.
This paper presents empirical findings on the effects of factors in the design of collaborative learning environments. It builds on a growing body of empirical work examining this issue, and its findings are oriented towards helping designers of learning environments.
This paper offers a significant contribution, building strongly on previous related work. The authors present a thorough review of this work, including the important result by Finzi demonstrating the performance-boosting effect of one-on-one training, and Carlson's recent study showing a similar effect for teams that engaged in social chat while browsing documentation. The authors have set out to identify and separate the effects of three possible underlying factors in these earlier studies: interactivity, visibility, and online information. The significance of this work is high, given the increasing importance of distance learning and cooperative work among people who may have never met face to face. The benefit others can gain is an understanding of how to help remote collaborators train each other and have a higher probability of success in distance learning.
This study is well thought out, well executed, and well presented. Although the degree to which results such as these can be generalized to real training situations is unknown, other researchers should be able to take up the fundamental result with confidence. By using the Tower of Hanoi paradigm, the study also builds on a substantial body of social science research investigating training and cooperative behavior under different conditions in such tasks.
One question I had concerned Figure 3, which purports to show significant differences in a post-experimental measure of learning, but for which the ratings look substantially the same (varying between 55 and 65 on a scale labeled "ratings"). It would probably help to have some additional information or some comment in the text about why this is significant (i.e., there must be small but consistent differences across subjects?).
I think the last paragraph, in introducing some of the outstanding issues and limitations of a laboratory study such as this, needs to be expanded for the CHI audience. After all, the point of investigating these factors is to learn how to make application of them to real-world situations. So the consideration of how well the results may generalize, and some suggestions as to how investigators (especially HCI researchers interested in distance learning) might begin to test the generality of the results, would be highly appropriate for this venue.
Review 4 - paper high-rated - review suitable
This review explains clearly what the paper is contributing and how readers can benefit, while providing adequate coverage of other aspects such as past work and validity.
The paper presents an excellent example of how the CCIP model can be applied to an HCI problem. Its contribution is twofold: (1) it applies a cognitive architecture to gestures, thus creating a new angle on HCI problems that have not yet received attention and (2) it further establishes CCIP as a valid architecture to model applied HCI problems.
This paper builds on previous work by Dingle (1999), Spencer (2001b), Carlyle and Welch (2001) and, more broadly, Brown (2001), Brown and Green (1998) and Walker and Fellowes (2000).
As mentioned above, the paper makes a significant contribution because (1) In contrast to what might be expected, not much research has gone into the costs and benefits of the use of gestures in GUIs. This is even more surprising given the growing use of GUIs. (2) It further establishes CCIP as a valid architecture to model applied HCI problems. This model has not been around as long but has qualities that make it very useful to this line of research. Until recently it was mainly applied to CVE studies (like TXF) but it is beginning to be applied to direct HCI issues.
The results matter and can benefit others because they show how the perceptual-motor capabilities of CCIP can be used to "see" and manipulate objects in a GUI and how the general activation learning mechanism in CCIP (retrieve versus compute paradigm) enables one to model not only the initial differences in the experimental conditions but also in some detail the learning curves. Furthermore, both the experiments and the modeling itself are done with expertise and reported in a clear and organized fashion.
Other researchers can confidently take up these results. However, it would be advisable for researchers/practitioners to interact with the CCIP system to ensure that varying inputs do indeed yield the predicted results.
This line of research is gaining importance. Several other researchers are using the same approach. This does not however depreciate the original value of this work, which I regard to be relatively high. One thing I did miss was any kind of suggestion as to how the cognitive architecture could be improved to further increase its reliability. For instance, at this point there are some restrictions on how the software must be developed to enable the model to interact directly with it. This could be avoided in the future by including artificial eyes and hands separate from the cognitive model.
Examples of reviews unsuitable for CHI
Review 5 - paper low-rated - review unsuitable
This review is far too brief. The reviewer has provided only a few words on each criterion, with no supporting rationale. As a result the AC will find it hard to give weight to its low rating, or to explain the conflict between this rating and the review's neutral stance. The request to mention relevant past work has been misinterpreted. The review is preceded by a contribution statement that digresses into reviewing the paper.
This paper presents guidelines drawn from two experiments involving a GUI desktop application in which different styles of menu design were provided. I found the results rather obvious in places, and felt the authors overstated the potential benefit. For example, their first two guidelines restate standard practice in UI design. I thought the paper's written presentation could be improved.
Past work: The citations are adequate. Significance: This work does not really make a major advance in the development of menus. Benefit: There is little of benefit to interface designers here. Validity: The results seem valid. Originality: The experimental design was quite interesting.
Review 6 - paper medium-rated - review unsuitable
Here the contribution summary outlines the work the authors have carried out, not what they have contributed. The review dwells too much on the reviewer's subjective reactions and on questions the reviewer would like the authors to answer, never addressing the main review criteria.
The authors review the cognitive modeling literature and suggest that this approach can generate more accurate predictions of IM interest. They conduct two experiments to verify the state transitions predicted by their model and on this basis make predictions of interest level.
I enjoyed reading this paper. The results are really unexpected, and I could use them directly in my work. But I found the paper raised more questions for me than it answered. First of all, I would really like to know if there is a relationship between interest level and time constraints. When does one dominate over the other? Which factors influence state transitions, and which don't? Also, I want to know whether other types of model can be applied to the same problems? What advantages would these models have? I think there's a lot of important future work to be done here, and would have liked to hear more about it.
Review 7 - paper low-rated - review unsuitable
This review uses wholly inappropriate language to tell us that the reviewer disliked the paper. It will be hard for the authors or the AC to take its comments or rating seriously.
This paper attempts to tell us something about the design of collaborative learning environments, but in fact tells us nothing. A complete waste of authors' and readers' time.
I found most of this paper's findings obvious in the extreme, and its ponderous presentation of them almost comical -- "we consider interactivity has different implications from visibility" -- I ask you! Other findings completely contradicted the results of the study, for example the significant differences in a post-experimental learning measure. I just about gave up at this point and most readers will do the same. I struggled on however, wading through a description of the experiment that wouldn't even get a passing grade in high school. If CHI accepts this paper, I'll tell all my colleagues to stay away.
Review 8 - paper medium-rated - review unsuitable
The contribution summary never states what the contribution is. The review plunges almost at once into a lengthy comment on one particular issue. It finishes with a recommendation for provisional acceptance, which CHI cannot offer - papers can be accepted only as submitted.
The empirical studies are innovative, but a flaw in the interpretation of results detracts from the paper's overall contribution.
Past work was adequately cited, and the contribution is of some significance. The weakness in this paper, however, lies in the interpretation of the results. It isn't enough to compare the results of retrieval with those of target selection - other cues must be identified and factored out. The authors need to deal properly with Probert's Law, which can predict the same phenomena, rather than just mention it in passing. When it comes to learning curves, the authors have introduced a Poisson distribution without adequate justification. The unsuitability of this model is obvious when you look at the second-order differences. I would have liked to see much more attention paid to the results of the incomplete third experiment, rather than the two main experiments. I recommend that the paper be accepted provisionally, on the basis that these points will be addressed in the final version.