This presentation discusses the need to improve the review process for paper selection and suggests tweaks to ensure fairness and integrity. It also highlights the importance of choosing PC chairs and members with expertise in the field.
Tweaks to improve our review process • T. N. Vijaykumar
Review expertise data • From a recent conference: • I define low expertise as a score of 1 or 2 out of 4 (1 - no familiarity, 2 - some familiarity, 3 - knowledgeable, 4 - expert). These scores are self-assigned (people may not choose 4 out of humility, but 3 should be common). A key point: nobody will choose 2 when they really are a 3 or 4, so a low expertise score genuinely means low expertise.
Review expertise data • I would think that low-expertise reviews for any PC member should be < 3 out of 18 reviews. • Of 49 PC members: • 13 people (more than a quarter of the PC) had low expertise for 50% or more of their reviews (9+ out of 18) • 25 people (more than half the PC) had low expertise for 33% or more of their reviews (6+ out of 18) • This has happened before, though any PC/ERC member can easily check it on HotCRP (in 2 clicks!); a scripted version of the same check is sketched below
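The same check can be scripted in a few lines. A minimal sketch, assuming a hypothetical CSV export of reviews with `reviewer` and `expertise` (1-4) columns; the file name and column names are illustrative, not HotCRP's actual export format:

```python
import csv
from collections import defaultdict

# Hypothetical export: one row per review, with a "reviewer" column and a
# self-assigned "expertise" column (1 = no familiarity ... 4 = expert).
totals = defaultdict(int)   # reviews per reviewer
low = defaultdict(int)      # low-expertise (1 or 2) reviews per reviewer

with open("reviews.csv", newline="") as f:
    for row in csv.DictReader(f):
        name = row["reviewer"]
        totals[name] += 1
        if int(row["expertise"]) <= 2:
            low[name] += 1

# Count reviewers whose low-expertise fraction crosses the 33% and 50% marks.
over_third = sum(1 for r in totals if low[r] / totals[r] >= 1 / 3)
over_half = sum(1 for r in totals if low[r] / totals[r] >= 0.5)
print(f"{over_third} reviewers with >= 33% low-expertise reviews")
print(f"{over_half} reviewers with >= 50% low-expertise reviews")
```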
Review score data • Historically, a paper with an average score of Weak Accept usually gets in. 23 out of 49 PC members gave Weak Accept or higher to 8+ of the 18 papers they reviewed (many personal acceptance rates exceed 50%). At the historical acceptance rate of 20%, and assuming random paper-to-PC assignment, the probability of drawing 8+ "good" papers out of 18 is < 2% (see the sketch below) • One non-randomness is that papers on a topic tend to go to the same people, but that cannot change this probability from 2% to 50%
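A quick way to verify the < 2% figure: under random assignment, the number of accept-worthy papers among a member's 18 is roughly Binomial(18, 0.2). A minimal sketch of the tail probability, using only the Python standard library:

```python
from math import comb

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 18 papers per PC member, 20% historical acceptance rate:
# chance that 8 or more of them are truly "good" papers.
print(f"P(X >= 8) = {binom_tail(18, 0.2, 8):.4f}")  # ~0.016, i.e. < 2%
```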
Review score data • These 23 over-positive PC members overlap heavily with the 25 members above who had 33% or more low-expertise reviews. • If a paper goes to two such over-positive PC members, it will likely get in irrespective of what is in the paper (over-negative PC members are addressed later) • The papers have to be ranked for PC discussion (there is no other way). Then, the lucky papers will be top-ranked in the PC discussion and will push out unlucky papers reviewed by the other PC members
Review score data • So, the unlucky papers will likely get rejected irrespective of what is in them. Or, during the discussions, some of the lucky papers will get rejected, but only after wasting time that again pushes out the other papers. • Basically, half the PC is grading out of 100 and the other half out of 400, but the papers are rank-ordered on absolute scores without considering this bias. • This has happened before, though any PC/ERC member can check it in HotCRP (in 3 clicks!)
My guess (no real data) • I think this has been happening for ALL of architecture's 40 years: we did not have HotCRP to collect data in the first 25 years, and nobody has looked at the data in the last 15 • The community has grown, so the negative/positive gap is larger and there are more outlier reviewers • In the past, our ~20% acceptance may have come from 30% for half the PC and 10% for the other half (so luck matters). Now, it may be coming from 35% for one half and 5% for the other half (so luck matters even more)
Axioms • Papers are imperfect; so are reviews • 150+ reviewers, 1500+ reviews → only prevention will work, not cure • Reviewers must adhere to community standards for key metrics • problem importance, idea novelty, implementation realism, scientific experiments • No more, no less than long-standing standards • Reviewers must avoid unilateral standards • Reviewers must be fair and accountable • Chair/PC/ERC must look out for authors
Axioms • Paper outcome should not depend on who reviews it but on content • Our process can fail in many ways → vigilance • We all are in this together → don't let good papers get killed or bad papers get in • Next time your good paper will be killed • Next time bad papers will evict your good paper • Every unjust reject of someone → stochastically you get ripped off a bit • Following slides in order of time • Tweaks already present/possible within HotCRP
1: Choosing PC chair/member • Will singularly decide process integrity • PC members should have published papers in equivalent conferences recently so they know community standards • E.g., >= 3 papers/patents/products in last 5 years • PC chair should be a recently active researcher so they know who does what (to get good reviews) and know all that could go wrong • E.g., above + submitted to 30%+ of equivalent conferences in last 5 years • Ensure PC expertise covers last year's submissions (not accepted papers – a common mistake?) • If already true → easy!
1: Choosing PC chair/member • E.g., 320 submissions • Option 1: two-round process, smaller PC+ERC • First round: 2 PC + 1 ERC reviews for all papers • Second round: 1 PC + 1 ERC reviews for top ~40% of papers (2x historical acceptance rate) • 50 PC (15 papers each) + 45 ERC (10 papers each) • Bottom 60% → little chance of accept, if truly bottom • still rebutted, resurrected if flagged (later) • First round should be solid → small PC a must • Neither lenient (→ too many in second round) nor harsh (→ unfair) • Chair monitors per-PC, per-ERC scores (later) • Smaller PC+ERC → higher chance of solid reviews
1: Choosing PC chair/member • Option 2: one step → 5 reviews each • 3 PC + 2 ERC reviews for all papers • 64 PC (15 papers each) + 128 ERC (5 papers each) • OR 75 PC (12 each) + 150 ERC (4 each) • Smaller PC+ERC → better expertise, lower variability • Fewer, less-variable reviews (Option 1) better than more, more-variable reviews (Option 2) • ONLY IF Option 1 has a smaller PC • Large PC+ERC → key reason for wild variability in review quality and standards • 10 ERC reviews → better calibration than 5, load ok • 15 PC reviews → reasonable load • The per-member loads for both options are re-derived in the sketch below
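As a sanity check on the per-member loads quoted for both options (taking the 320 submissions from the example above), a small sketch that just re-derives the slides' arithmetic:

```python
submissions = 320

# Option 1: two rounds, smaller PC + ERC (50 PC, 45 ERC).
first_pc, first_erc = submissions * 2, submissions * 1     # 2 PC + 1 ERC per paper
second = round(submissions * 0.4)                           # top ~40% get a second round
second_pc, second_erc = second * 1, second * 1              # 1 PC + 1 ERC per paper
print("Option 1: PC load", (first_pc + second_pc) / 50,     # ~15 papers each
      "| ERC load", (first_erc + second_erc) / 45)          # ~10 papers each

# Option 2: one step, 5 reviews per paper (3 PC + 2 ERC; 64 PC, 128 ERC).
print("Option 2: PC load", submissions * 3 / 64,            # 15 papers each
      "| ERC load", submissions * 2 / 128)                  # 5 papers each
```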
2: Conflict of interest • Both positive (pro) and negative (con) conflicts • If your paper shows that a previous paper does not work, then mark those authors as conflicts and point to the text/graph in your paper • Previous work "good" (speedup 1.3) and yours "better" (speedup 1.6) -- is NOT a conflict • Previous work speedup 0.9, yours 1.3 -- is a conflict • "Previous work breaks virtual memory" -- is a conflict • Chair should be able to find 5 reviewers (out of 400+?) unconflicted with the previous work's authors • Bogus conflict → immediate reject
3: Reviewer expertise • In the review form, only two choices: "expert" and "knowledgeable" (keep "no/some familiarity" for calibration, but not clickable) • Give reviewers 1 week after paper assignment to read the title+abstract and mark expertise • If neither choice applies, the reviewer must return the paper instead of silently giving a weak reject/accept • Within 1 week so the chair has time to reassign • Require 5 (at least 4) knowledgeable reviewers, else false rejects occur when the only expert reviewer is negative • Chair has to work harder, but that's their job
4: Review instructions • Currently both false accepts and false rejects • All-positive reviewers → high scores for all papers, so their better ones likely land on top • All-negative reviewers → low scores for all papers, so their better ones still likely land on top • Both together → bad papers push out good papers in the inevitable ranking and PC discussion list • bad papers in AND good papers out (worsens false-reject impact) • Half the PC grading out of 100, the other half out of 400
4: Review instructions • Only prevention will work, not cure • No per-PC scaling of scores can separate one member's weak accepts from each other → scaling all of a PC member's scores loses selectivity, the VERY goal of reviews! • Instruct BEFORE reviewing starts • PC chairs should not say "be positive" because over-positivity causes problems; they should say "be fair – no more or less than long-standing standards" • Historically, a paper with an average score of "Weak Accept" is highly likely to be accepted • VERY IMPORTANT for review calibration
4: Review instructions • 18 papers per PC member, historical acceptance rate 20% → Prob("good" papers < 1) < 2%, Prob("good" papers > 7) < 2% (re-derived in the sketch below) • Assumes random paper assignment • Common non-randomness: papers on the same topic go to the same reviewers, but that cannot change 2% to 50% • Different topics don't have different acceptance rates; sub-groups breaking community standards do • Per PC member, [1,7] weak accepts+ expected • Most – [3,4], some – [2,5], a few – [1,7] • Per ERC member, 5 papers → [0,2] weak accepts+ expected
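The [1,7] and [0,2] ranges come from the same Binomial(n, 0.2) model. A minimal sketch (standard library only; `binom_range` is just a helper defined here, not an existing API):

```python
from math import comb

def binom_range(n, p, lo, hi):
    """P(lo <= X <= hi) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(lo, hi + 1))

# PC member: 18 papers at a 20% base rate -> [1,7] weak accepts+ expected.
print(f"PC:  P(1 <= X <= 7), n = 18: {binom_range(18, 0.2, 1, 7):.3f}")  # ~0.97
# ERC member: 5 papers at a 20% base rate -> [0,2] weak accepts+ expected.
print(f"ERC: P(0 <= X <= 2), n = 5:  {binom_range(5, 0.2, 0, 2):.3f}")   # ~0.94
```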
4: Review instructions • Large PC/ERC → ~2 PC, ~4 ERC members outside this range possible • Must justify to chair • Most PC → [2,5] weak accepts+ • Most ERC should give [0,1] weak accepts+ • ¼ or ½ of the PC/ERC can't claim to be outliers • PC/ERC should work hard to choose the right papers • We claim to be a quantitative community → time to act like one • Chair and ALL of the PC/ERC should monitor the per-PC stats from day ONE and flag outliers
4: Review instructions • Concern: restricting over-positive reviewers → paper scores may be low and close • But high scores due to false accepts → bogus • Few false accepts → even with false rejects, good papers likely on top • False rejects later • Average scores may be close & some reviewers may not give more than weak accepts → rank papers on the weak-accepts+ count → better separation & less bias (a small sketch of this ranking is below) • Alternative: each reviewer must spend all n tokens, but new calibration, huge departure, still false rejects
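A minimal sketch of the suggested ranking rule, assuming an illustrative 1-6 score scale where 4 means Weak Accept (the scale, threshold, and paper names are made up): rank primarily by the count of weak-accept-or-higher scores, breaking ties by the average.

```python
# Illustrative review scores on a made-up 1-6 scale (4 = Weak Accept).
papers = {
    "paper_A": [4, 4, 3, 4],   # three weak accepts, modest average
    "paper_B": [6, 2, 2, 5],   # two enthusiastic scores, dragged down by two rejects
    "paper_C": [3, 3, 3, 4],
}

WEAK_ACCEPT = 4

def rank_key(scores):
    # Primary key: number of reviewers at weak-accept or higher.
    # Secondary key: average score, used only to break ties.
    return (sum(s >= WEAK_ACCEPT for s in scores), sum(scores) / len(scores))

ranking = sorted(papers, key=lambda p: rank_key(papers[p]), reverse=True)
print(ranking)  # ['paper_A', 'paper_B', 'paper_C']
```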
5: Pre-rebuttal • Request reviewers to pace themselves, else poor review quality • Chair can't micromanage • Chair can monitor the "Procrastination" graph in HotCRP and send reminders every week • Hide other reviews/reviewer names until the rebuttal discussion, for review independence • else no-op reviews gang up with the negative reviews • One week for reviewers to read rebuttals and change scores, then two days for authors to see if scores changed sanely, else flag the review
6: Review accountability • Currently, the review says X, the rebuttal says not X, and the review ignores the rebuttal! • Contrary to wide belief, a rebuttal can't "convince" reviewers; most are not looking to be convinced • Rebuttals are for flagging review mistakes and for reviewer accountability • Show full review scores during rebuttal so authors can check that the review text and scores match
6: Review accountability • Authors should flag only factual mistakes, not harsh opinions (later) • "Paper A already did it" when paper A did not do the key contributions • "does not feel right" without saying what • "done somewhere" without a reference • "Paper does X" when the paper avoids X • No-op review with a reject but no critique • NOT for "clever idea but did not show this breakdown so hard reject" – harsh but subjective • NOT for "X incremental over Y" – subjective (later)
6: Review accountability • The paper's most positive reviewer, unconflicted with the flagged reviewer, checks the flag • Hence reviewer names are not shown at this stage • Flag valid → discard the review (else no accountability) • Many valid flags → discard the reviewer! • Else no accountability • Bogus flag(s) → immediate reject • AFTER flag resolution, make all reviews/rebuttals visible to all reviewers of the paper for post-rebuttal discussion • Overall, false rejects likely decrease
7: Post-rebuttal review • Often rushed (hence flawed) and negative, yet becomes the deciding vote • Never heard of a last-minute positive review • When the chair asks for a review, tell the authors it is coming • Tell the reviewer the review will be rebutted • Give authors a chance to rebut, even if only from 2am to 6am on the day of the PC meeting • Authors WILL rebut • Should allow authors to flag it like other reviews
8: Post-rebuttal discussion • Should be 3 weeks • Target the PC meeting to be a sanity check • 3 weeks for discussion vs. 10 minutes at the PC meeting • Few false accepts and few false rejects → only ~20% (the historical rate) of papers will be viable (not fixed at 20%, but it has been so for decades, so hypothetical arguments are hollow) → should get a nearly-final program BEFORE the PC meeting • Fewer than 10 papers to be discussed beyond a sanity check at the PC meeting • 80+ "discuss" papers in 1 day → absurd/unfair → outcome due more to PC stamina than content
9: PC meeting • Mostly a sanity check • Ensure the lowest accept is better than the highest reject • Avoid unfair votes by PC members who didn't read the paper • If 3 PC members who read the paper, wrote reviews, and read the rebuttal over 4-6 weeks can't decide, can 50+ tired people who have not read the paper and hear a 5-minute summary after meeting for 5+ hours? • Only PC members who read the paper should vote • Other PC members can ask questions to help the vote • If still no consensus (but 2 positive), accept • Even 1 positive, 1 neutral could be an accept
10: Authors’ Bill of Rights • Right to fair and knowledgeable review • Right to be reviewed against community standards, not unilateral standards • Right to unconflicted review • Right to flag unfair review • Right to reviewer accountability • Right to rebut all reviews (even post-rebuttal) • Right to know that every rebuttal is read