Guest Post by Shira Gabriel: Don't Go Chasing Waterfalls

 [DISCLAIMER: The opinions expressed in my posts, and guest posts, are personal opinions, and they do not reflect the editorial policy of Social Psychological and Personality Science or its sponsoring associations, which are responsible for setting editorial policy for the journal.] 


Guest post by Shira Gabriel

Don’t go chasing waterfalls, please stick to the rivers and the lakes that you're used to.

I haven’t always been the most enthusiastic respondent to all the changes in the field around scientific methods.  Even changes that I know are for the better, like attending to power and thus running fewer studies with more participants, I have gone along with grudgingly.  It is the same attitude I have towards eating more vegetables and less sugar.  I miss cake and I am tired of carrots, but I know that it is best in the long run. I miss running dozens of studies in a semester, but I know that it is best in the long run. 

It is not like I never knew about power but, like many other people in the field, I never focused on it.  I had vague ideas of how big my cell sizes should be (ideas that were totally wrong, I have learned since) and I would run studies using those vague ideas.  If I got support for my hypotheses -- great!  But when I didn’t, I would spend time with the data trying to figure out what went "wrong" -- I would look to see if there was something I could learn.  I would look to see if there was some other pattern in the data that could tell me why I didn’t find what I predicted and perhaps clue me in to some other interesting phenomena. 

You know, like Pavlov trying to figure out why his saliva samples were messed up and discovering classical conditioning.  That was me, just waiting for the moment when I would discover my own version of classical conditioning¹.

I am going to be honest here: I love doing that.  I love looking at data like a vast cave filled with goodies where one never knows what can be found.  I love looking for patterns in findings and then thinking hard about what those mean and whether, even though I had been WRONG at first, there was something else going on – something new and exciting. I love the hope of discovering something that makes sense in a brand new way. It is like detective work and exploration and mystery, all rolled in one.  I’m like Nancy Drew in a lab coat².

Before anyone calls the data police on me, the next step was never publication.  I didn’t commit that particular sin. Instead, if I found something I didn’t predict, I would design another study to test whether that new finding was real or not.  That was step two.

But this movement made me look back at my lab and the work we have done for the past 17 years³ and realize that, although this has worked for me a couple of times, I have also chased a lot -- A LOT -- of waterfalls that have turned into nothing. In other words, I have thrown a lot of good data after bad.⁴

And looking back, the ones that did work -- that turned into productive research areas with publishable findings -- were the ones that had stronger effects that replicated over different DVs right from the start.

When I chased something that wasn't as strong, I wasted huge amounts of my time and resources and, worse yet, the precious time of my grad students.  That happened more than is comfortable for me to admit.

So, I think a big benefit for me of our new culture and rules is that I spend less time chasing waterfalls.  My lab spends more time on each study (since we need bigger Ns) but we don't follow up unexpected findings unless we are really confident in them.  If just one DV or one interaction looks interesting, we let it go for what it likely is -- a statistical fluke.

And we don't just do that because it is what we are now supposed to do, we do it because empirically speaking I SHOULD have been doing that for 17 years.  I spent too much time chasing after patterns that turned out to be nothing⁵.

So I think my lab works better and smarter now because of this change.

As long as I am being so honest, I should admit that I miss chasing waterfalls.  Just last week, one of my current PhD students⁶ and I were looking at a study that is a part of a really solid research program of hers that thoughtfully and carefully increases our science.  And in her latest dataset, I felt the mist of a far-off possible waterfall in an unexpected interaction.  Could this be the big one? It was tempting, but we aren’t going to chase the waterfall.  As much as it seems fun and exciting, our science isn’t built on the drama and danger of waterfalls. To paraphrase the wise and wonderful TLC, I am sticking to the rivers and lakes that I am used to.  That is how science advances.


  1. Still waiting, in case you were wondering.
  2. I don’t wear a lab coat. More like Nancy Drew in yoga pants and a stained sweatshirt, but same difference.
  3. I am really old.
  4. Or is it bad after good? I can never remember which way it is supposed to go.
  5. How can you tell if an unexpected finding is a waterfall or classical conditioning? You can’t. But here are the four things I now look for: are the effects consistent across similar DVs; can we look back and find similar things in old datasets; do we have a sound theoretical explanation for the surprising findings; and, finally, can that theoretical explanation lead to other hypotheses that we can look at in the data.  Only if a good chunk of that works out will we move on to collect more data. And yah, “good chunk” is not quantifiable. Sometimes Nancy Drew has to follow her instincts.
  6. Elaine Paravati. She rocks.

results blind vs. results bling*




in many areas of science, our results sections are kind of like instagram posts.  beautiful, clear, but not necessarily accurate. researchers can cherry-pick the best angle, filter out the splotches, and make an ordinary hot dog look scrumptious (or make a lemon look like a spiffy car).**  but what's even more fascinating to me is that our reactions to other people's results are often like our reactions to other people's instagram posts: "wow! that's aMAZing! how did she get that!"

i've fallen prey to this myself.  i used to teach bem's chapter on "writing the empirical journal article," which tells researchers to think of their dataset as a jewel, and "to cut and polish it, to select the facets to highlight, and to craft the best setting for it."  i taught this to graduate students, and then i would literally turn around, read a published paper, and think "what a beautiful jewel!"***

as with instagram, it's impossible to mentally adjust our reaction for the filtering the result could have gone through.  it's hard to imagine what the rough cut might've looked like.  it's hard to keep in mind that there could've been other studies, other measures, other conditions, other ways to clean the data or to run the analyses.  and we never know  - maybe this is one of those #nofilter shots.

in short, we can't avoid being blinded by shiny results.  

what can we do?

there are a few stopgaps.  for example, as an author, i can disclose as much as possible about the process of data collection and analysis, and the results (e.g., the 21 word solution).  as a reader, i'll often pause when i get to the end of the method section and ask myself - is this study well-suited to the researchers' goals?  would i think it should be published if i had to evaluate it just based on the method?    

another partial solution is pre-registration, the #nofilter of scientific research. by pre-registering, a researcher is committing to showing you the raw results, without any room for touching-up (for the planned analyses - the exploratory analyses can be looked at from any and all angles).  with a good pre-registration, readers can be pretty confident they're getting a realistic picture of the results, except for one problem. the editors and reviewers make their evaluations after seeing the results, so they can still consciously or unconsciously filter their evaluation through biases like wanting only counterintuitive, or significant, findings.  so pre-registration solves our problem only as long as editors and reviewers see the value of honestly-reported work, even if it's not as eye-catching as the filtered stuff. as long as editors and reviewers are human,**** this will likely be a problem.

the best solution to this problem, however, is to evaluate research before anyone knows the results.  this is the idea behind registered reports, now offered by Collabra: Psychology, the official journal of the Society for the Improvement of Psychological Science.  an author submits their paper before collecting data, with the introduction, proposed method, and proposed analyses, and makes a case for why this study is worth doing and will produce informative results.  the editor and reviewers evaluate the rationale, the design and procedures, the planned analyses and the conclusions the authors propose to draw from the various possible results.  the reviewers and editor give feedback that can still be incorporated into the proposed method.  then, if and when the editor is satisfied that the study is worth running and the results will be informative or useful regardless of the outcome, the authors get an "in principle acceptance" - a guarantee that their paper will be published so long as they stick to the plan, and the data pass some basic quality checks.  the final draft goes through another quick round of review to verify these conditions are met, and the paper is published regardless of the outcome of the study.

registered reports have many appealing characteristics.  authors can get feedback before the study is conducted, and they can get a guarantee that their results will get published even if their prediction turns out to be incorrect, freeing them to be genuinely open to disconfirmation of their predictions.  it's nice to be able to have less at stake when running a study - it makes for more objectivity, and greater emotional stability.*****  

for science, the advantage is that registered reports do not suffer from publication bias - if all results are published, the published literature will present an unbiased set of results, which means science can be cumulative, as it's meant to be.  meta-scientists can analyze a set of published results and get an accurate picture of the distribution of effects, test for moderators, etc.  the only downside i can think of is that journals will be less able to select studies on the basis of projected citation impact - the 'in principle acceptance' means they have to publish even the findings that may not help their bottom line.  call me callous but i'm not going to shed too many tears over that.

not everything can be pre-registered, or done as a registered report.  for one thing, there's lots of very valuable existing data out there, and we shouldn't leave it hanging out to dry.  for another, we should often explore our data beyond testing the hypothesis that the study was originally designed to test.  many fascinating hypotheses could be generated from such fishing expeditions, and so long as we don't fool ourselves into thinking that we're testing those hypotheses when we're in fact just generating them, this is an important part of the scientific process.

 the fact that we keep falling for results bling, instead of evaluating research results-blind, just means we're human.  when you know the results are impressive, you're biased to think the method was rigorous.  the problem is that we too easily forget that there are many ways to come by impressive-looking results.  and even if we remember that filtering was possible, it's not like we can just magically know what the results look like unfiltered. it's like trying to imagine what your friends' unfiltered instagram pictures look like. 

are we ready to see what science looks like without the filters?  will we still get excited when we see it au naturel?  let's hope so - to love science is to accept it for what it really looks like, warts and all.


* this title was inspired by a typo.  
** although i'm using agentic verbs like "filtering" or "prettying up," i don't believe most of these distortions happen intentionally.  much of the fun in exploring a dataset is trying to find the most beautiful result we can.  it's hard to remember everything we tried, or what we might have thought was the best analysis before we knew how the results looked.  most of the touching up i refer to in this post comes from researchers engaging in flexible data analysis without even realizing they're doing so.  of course this is an assumption on my part, but the pretty large discrepancies between the results of pre-registered studies and similar but not-pre-registered studies suggest that flexibility in data analysis leads to exaggerated results.
***  words you'll never actually hear me say.
**** canine editing: not as effective as it sounds. 
***** personality change intervention prediction: if registered reports become the norm, scientists' neuroticism will drop by half a standard deviation. (side effect: positive affect may also take a hit)





alpha wars



i was going to do a blog post on having thick skin but still being open to criticism, and how to balance those two things.  then a paper came out, which i’m one of 72 authors on, and which drew a lot of criticism, much of it from people i respect a ton and often agree with (one of them is currently in my facebook profile picture, and is one of the smartest people i know).  so this is going to be a two-fer blog post.  one on my thoughts about arguments against the central claim in the paper, and one on the appropriate thickness for scientists’ skin.

PART I: the substantive argument*

in our paper we argue that we should introduce a new threshold for statistical significance (and for claiming new discoveries), and make it .005. 

i want to take a moment to emphasize some other things we said in the paper.  we said that .005 should not be a threshold for publication.  we also said that we can keep the p < .05 threshold and call results that meet this threshold, but not the .005 threshold, “suggestive” (and not enough to claim a new discovery).  so, in some ways, this is not very different from what many people already implicitly (or explicitly?) do when we read papers – treat p-values as a quasi measure of the strength of the evidence, and maintain a healthy amount of skepticism until p-values get pretty low (at least in the absence of pre-registration, more on this below).
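
a minimal python sketch of the two-threshold labeling described above. the function name, the parameter names, and the "not significant" label for p-values at or above .05 are assumptions made for illustration; the paper's own wording may differ.

```python
def label_result(p, significant=0.005, suggestive=0.05):
    """Label a p-value under the two-threshold scheme (labels are illustrative)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if p < significant:
        return "statistically significant"  # meets the proposed .005 threshold
    if p < suggestive:
        return "suggestive"                 # below .05 but not below .005
    return "not significant"                # assumed label, not from the paper


for p in (0.001, 0.03, 0.20):
    print(p, "->", label_result(p))
```

note that nothing here is a publication rule: the labels only describe how strongly a result is claimed.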

here are some arguments against the claim in our paper, and my thoughts on each of these.  many of these arguments came out of a long and lively discussion with Ellen Evers and Daniël Lakens (and from Lakens’s blog post).  i’m focusing on these arguments because i find them compelling or interesting to think about, so i am curious to hear more arguments for or against any of the points below.  this is me playing with the ideas here. i am sure there are flaws in my thinking.

1.  if we lower alpha to .005, the ratio of the type I error rate (when the null is true) to the type II error rate (when the null is false) will be way off (1:40).**


that would be true if the type I error rate really were equal to what we say our alpha level is, and the type II error rate really were equal to what our power analyses tell us beta is.  i don’t believe either of those is the case. i believe our type I error rate is actually much higher than our field’s nominal alpha level.  and the type II error rate is much lower than beta as calculated from power analyses (see my argument for this claim here – basically, p-hacking and QRPs help us make things significant, and that increases type I error, which we know, but also decreases type II error, which we rarely talk about).  so, under the current state of affairs in our field, lowering the nominal alpha level to .005 may actually bring the ratio of type I to type II error rates closer to the 1:4 that Cohen and others suggested than it is right now.
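
a toy calculation of this point, under loudly stated assumptions: a researcher gets k independent shots at significance (say, k unrelated DVs), each test has nominal type I error alpha and type II error beta, and any one significant result counts as a success. the function name and the independence assumption are purely illustrative; real flexibility in data analysis is correlated and messier than this.

```python
def effective_error_rates(alpha, beta, k):
    """Error rates when any of k independent tests reaching significance counts as a hit."""
    type_1 = 1 - (1 - alpha) ** k  # at least one false positive when the null is true
    type_2 = beta ** k             # all k tests miss when the effect is real
    return type_1, type_2


for k in (1, 3, 5):
    t1, t2 = effective_error_rates(alpha=0.05, beta=0.20, k=k)
    print(f"k={k}: type I = {t1:.3f}, type II = {t2:.4f}, ratio = 1:{t2 / t1:.2f}")
```

with k = 1 the ratio is the nominal 1:4. as k grows, the type I error rate climbs and the type II error rate shrinks, so the real ratio drifts away from 1:4 in the direction described above.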

of course this would change if practices change a lot.  if we consistently pre-register our studies and follow our pre-registration, and only interpret p-values for the planned analyses, then our using a .005 alpha level would indeed lead to a 1:40 ratio of type I error rates when the null is true to type II error rates when the null is false. and i agree that this is too small a ratio in many circumstances.  so, i’ll revisit my opinion on what threshold we should use as a default for drawing meaningful conclusions when our practices have changed drastically. (i still use p-values when i’m not doing confirmatory research, so probably the first step is for me to quit doing that.*** one reason i haven’t is that i suspect it would make it very hard for my grad students to publish their work in outlets that would help them get jobs.  life is hard. principles ain’t cheap.)

for the same reason, i would be in favor of exempting the results of key, pre-registered analyses from this lower alpha.  i don’t know exactly what the right alpha should be, but when the rules of Neyman-Pearson hypothesis testing are followed (pre-registered study with key analysis specified), i’m less worried about an out-of-control type I error rate.

2.  we should abandon thresholds, or NHST, altogether.


on thresholds: i have a hard time separating my feelings on the pragmatism of this position (very low) from my answer to “in a perfect world, would this be the right approach?” i don’t know.  i definitely think thresholds cause a lot of problems. but it’s hard for me to imagine humans abandoning them.  as an editor, i find thresholds useful because otherwise editorial decisions would feel even more arbitrary to authors.  i am sympathetic to authors who feel that the criteria on which their submissions will be evaluated are not transparent enough, or are impossibly vague.  i know thresholds aren’t the only solution to this problem (and are far, far from sufficient), but i am not imaginative enough to come up with something that would adequately replace them. 

also, if we want to encourage less black-and-white thinking, aren’t two thresholds (one for “suggestive” results and one for “statistically significant” results) better than one?  isn’t this a (baby) step towards thinking in gray?

but i get it. maybe to accept that thresholds are here to stay is to give away the farm.  instead of spending effort tweaking a broken system and trying to make it a little less broken, maybe i should spend more time trying to change the system (see also #4 below).

on abandoning NHST or frequentist approaches: i have to admit, much of this debate is over my head.  i have a lot more to learn about the philosophy and math behind Bayesian approaches.

3.  requiring results to meet a p < .005 threshold before they’re considered strong evidence is going to make it prohibitively difficult to conduct studies.


i have to admit, i’m a bit surprised by this argument. it seems like one of the few things we all agree on is that of course a single study, especially one with a high p-value, is far from conclusive, and of course we need mounds of evidence before drawing strong conclusions.  this kind of view is often expressed, but rarely heeded in discussion sections of papers, or in press releases.  in my view, introducing a new threshold is a way to try to enforce this skepticism that we all say we want more of. (particularly because we’re not saying that results shouldn’t be published until they meet this threshold, just that they should be considered nothing more than suggestive). why not make a rule that holds us to this standard we espouse?  if we can still consider results below .05 suggestive, we can still claim everything we’ve been saying we should claim, but not more.

even with a much lower threshold, a single study can’t be conclusive.  but at least it can have a chance of providing much stronger evidence.  and the costs of powering studies to this new threshold are not exorbitant, in my opinion (i know there is strong disagreement on this.  it's hard to know what should count as a ridiculous sample size expectation.  six years ago many, many people considered n = 20 per cell ridiculously high.).  for a correlation of .20, you need 191 people to have 80% power at the .05 threshold, and 324 people to have 80% power at the .005 threshold.  i don’t know about you, but that’s a way smaller price than i expected to have to pay for dividing our alpha level by ten.  moreover, if you decide to only power your study to the .05 threshold, you should still get a p-value below .005 about 50% of the time.  so if you’re running a series of studies and consistently missing the .005 threshold, something’s wrong.****
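
the sample-size figures above can be roughly checked with the Fisher z approximation for a two-sided test of a correlation, using only the python standard library. this is a sketch, not the calculation behind the post's numbers: the function names are made up, and the approximation lands a few participants above the 191 and 324 quoted above, which presumably come from a different (exact) method.

```python
import math
from statistics import NormalDist

_norm = NormalDist()  # standard normal distribution


def n_for_correlation(r, alpha, power):
    """Approximate n for a two-sided test of rho = 0 (Fisher z approximation)."""
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    z_power = _norm.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)


def power_at_other_alpha(alpha_design, power_design, alpha_new):
    """Approximate power at alpha_new for a study sized for power_design at alpha_design.

    Normal approximation; ignores the negligible chance of significance in the wrong direction.
    """
    shift = _norm.inv_cdf(1 - alpha_design / 2) + _norm.inv_cdf(power_design)
    return 1 - _norm.cdf(_norm.inv_cdf(1 - alpha_new / 2) - shift)


print(n_for_correlation(0.20, 0.05, 0.80))     # n needed at the .05 threshold
print(n_for_correlation(0.20, 0.005, 0.80))    # n needed at the .005 threshold
print(power_at_other_alpha(0.05, 0.80, 0.005)) # chance of p < .005 when powered for .05
```

under this approximation the two sample sizes come out at about 194 and 327, and the last value is very close to .50, in line with the "about 50% of the time" claim.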

what does make sense to me about this argument is that lowering the threshold to .005 leads to a more conservative science.  yes.  absolutely. where i think i disagree with the spirit of the argument is that, implied in that criticism is the view that our past/current standards are reasonable or balanced (i.e., not particularly conservative or liberal), and so a drastic shift towards greater conservatism would be draconian.  in my view, our past (i’ll stay agnostic about current) standards were incredibly liberal – i am pretty convinced that it was relatively easy to, without realizing it, get a significant result even when the effect didn’t exist.  i know this is controversial, but i think it’s important to be clear about where the disagreement is.  i want our standards to become more conservative, but not because i think we should have a very conservative type I error rate or false discovery rate (FDR). i probably want about the same type I error rate or FDR as people who oppose the reforms i’m for.  the difference is that i think our current standards put us way, way above that mark, and so we need to shift to a more conservative standard just to hit the target we thought we’d been aiming for all along. 

4.  the real problem is that p-values, as currently used, are not meaningful, and we should fix that problem rather than hold p-values to a higher standard.


this is the most compelling argument i’ve heard so far.  i have tried to push for changes that would bring us closer to being able to interpret p-values in the way that (as i understand it) the Neyman-Pearson approach to NHST describes. that is, more pre-registration, and more transparency (e.g., if you say you predicted something, show us the pre-registration where you wrote down your prediction and how you would test it, and if you didn’t pre-register, refrain from claiming that, or interpreting statistics as if, you did).

if i could choose between introducing a new threshold for significance and getting people (including myself) to follow Neyman-Pearson rules (or not use p-values when we aren’t following those rules – also ok), i would choose the latter.  and in a world where we were careful to only interpret p-values when Neyman-Pearson rules were followed, i might advocate lowering alpha, but probably not all the way to .005.

so this is me admitting that by recommending a new threshold, i am caving a bit.  my faith in a future in which we achieve transparency and incentivize truly confirmatory research sometimes wavers, and i want a more immediate way to address what i perceive to be a pretty serious problem. this may be a bad compromise to make.  i’m still trying to figure that out.

these criticisms, and others, have definitely made me reconsider the recommendations in our paper.  i haven’t yet come around to the view that they’re horribly misguided – i still don’t see much harm in them – but i'm considering the possibility that there are better ways to achieve the same goals, or that our efforts are better spent elsewhere.  right now, i still feel that introducing a new threshold will actually help the other goals – encourage more transparency and more pre-registration.  this is based mostly on my perception that it will be easier to get a result below that threshold by engaging in these practices than by p-hacking (because p-hacking down to .005 is pretty hard, from what i understand). so, if we assume that people respond to incentives, or are drawn towards the least effortful path, adding this hurdle could make suboptimal research practices less attractive, by making them less effective.  if p-hacking isn’t going to save you, then the costs of pre-registering are lower (i.e., tying your hands is less costly when the loopholes you’re closing by pre-registering were unlikely to work anyway).  but i acknowledge that, for me, the end goal is to improve research design and practices, and the change to statistical interpretation is, in large part, a means to that goal.  that, and my playing around with G*power leads me to the conclusion that, when an effect is really there, and you’ve done a halfway decent job of designing your study, you should get a p-value below .005 quite often, so this threshold isn’t as scary as it might seem.

PART II: the process

i talk big about embracing criticism, fostering skepticism, etc.  i also agree with many people who’ve expressed the view that you have to have a healthy layer of thick skin to be in science.  these two views seem at odds to me, and i’ve been thinking about how to reconcile them.

i now have a bit more direct experience on which to base my reflection.  it's still going to sound platitudinous. here’s what i’ve come up with:

-look hard for kernels of logic/reasoning or empirical evidence.  sometimes it’s garnished with a joke at your expense, or accompanied by a dose of snark, but often there’s at least a nugget of real argument in there.  the criticism may still be wrong, but if it takes the form of a good faith effort at reasoning or empiricism, it’s probably worth entertaining.

-talk it out.  find people who you know are willing to tell you things you don’t want to hear.  ask them what they think about the criticisms.  ask them which ones they think you should take most seriously, and if there are any you can dismiss as not in good faith. if possible, talk directly with the people who are criticizing you. (my thinking benefited tremendously from talking to Lakens and Evers).

-take your time.  you’ll notice when your emotional reaction starts to subside and you can laugh a bit at your own initial defensiveness.  don’t decide what to do about the criticism until you’ve reached that point.  if you don’t reach that point, it’s not because you were never defensive.

-if, after all that, there is some criticism that clearly seems outside the realm of reasoning/empiricism, or is not in good faith, this is where it’s time to suit up and put on the thick skin. let it go.  go play with your cat, or watch a funny video, or bake a cake.*****

some people have a harder time with the first three steps (being permeable), some people have a harder time with the last step (being impermeable).  it's not easy to know when – and be able – to go back and forth.  but developing those skills is pretty important in science (life?).  also – i don’t think anyone is very good at doing these things alone.  find people who help you be more open to criticism, and people who help you be more resilient.  they might be the same people (those are pretty amazing people, keep them), or they might not (those people are ok, too).

* i speak only for myself, and not for my co-authors or anyone else.

** what should the ratio be?  i have no idea, and there’s clearly not just one answer, but apparently our field likes 1:4 a lot (alpha = .05, beta = .20).

*** are p-values useful to report even when you’re doing exploratory research?  i have some thoughts on this (inspired by Daniël Lakens, again), but that’s for another blog post.

**** this raises the question of whether every study needs to meet the .005 threshold to qualify for the new threshold.  here the answer is clearly no – we can identify what the distribution of p-values across multiple studies should look like if your effect is real, and it’s not 100% significant p-values (because of sampling error), regardless of the threshold. but if you have some ugly p-values, you can only confidently chalk up that messiness to sampling error if your readers can rule out other sources of messiness, like flexibility in data analysis or selective reporting – that is, if everything is pre-registered and all studies are reported.  also a blog post for another day.

***** emotion suppression gets such a bad rap.



what is rigor?



recently i wrote a thing where i was the eight millionth person to say that we should evaluate research based on its scientific qualities (in this case, i was arguing that we should not evaluate research based on the status or prestige of the person or institution behind it).  i had to keep the column short, so i ended with the snappy line "Let's focus less on eminence and more on its less glamorous cousin, rigor." 

simple, right?

the question of how to evaluate quality came up again on twitter,* and Tal Yarkoni expressed skepticism about whether scientists can agree on what makes for a high quality paper.  

[image: tweet by Tal Yarkoni]

there's good reason to be skeptical that scientists - even scientists working on the same topic - would agree on the quality of a specific paper.  indeed, the empirical evidence regarding consensus among reviewers during peer review suggests there is ample disagreement (see this systematic review, cited in this editorial that is absolutely worth a read).

so, my goal here is to try to outline what i mean by "rigor" - to operationally define this construct at least in the little corner of the world that is my mind.  i have no illusions that others will share my definition, nor do i think that they should - there is some value to different people having different approaches to evaluating work, because then each may catch something the others miss (this is one of the reasons i'm not too concerned about low agreement among peer reviewers - we often select them on purpose to focus on different things or have different biases).  nevertheless, i think we should try to build some consensus as a field about what some important criteria are for evaluating the quality of research, and maybe if we start trying to articulate these more explicitly, we can see where people's views overlap and where they don't.  i think the exchange between Simmons and Finkel & Eastwick (and Simmons's response) was a nice example of that.  

what is rigor?  to me, this is what teaching research methods is all about. trying to identify specific features that make for a good scientific study.  so it won't surprise you that many of the things on my list of what makes a study rigorous will be super basic things that you learned in your first year of college.  but it's amazing to me how often these issues are overlooked or waved away in studies by seasoned researchers. thus, even though many of these things will seem too obvious to mention, i think they need to be mentioned. still, this list isn't even close to exhaustive, and it's very heavily oriented towards social and personality psych studies.  i would love to see others' lists, and to work together to come up with a slightly more comprehensive list.**

so, what do i ask myself when i'm reading and evaluating a scientific paper in my field (e.g., when i'm evaluating job candidates, trying to decide whether to build on a finding in my own research, reviewing a manuscript for a journal, etc.)?***  i ask myself two broad questions.  there are many more specific questions you could list under each one, but these broad questions are how i orient my thinking.

1. is the narrow claim true: did the authors find the effect they say they found?

questions about the research design:
-is the sample size adequate and is it clear how it was determined?  
-is the population appropriate for the questions the authors are trying to answer?
-are the measures and manipulations valid (and validated)?
-are there confounds? selection effects? selective dropout? (i.e., were the groups equal to begin with, and treated equally other than the manipulated variable?)
-could there be demand characteristics?
-did the authors design the study such that null results could be informative? (big enough sample, validity checks, etc.) 
-do the authors report all conditions and all measures collected? 
-do the authors provide links to their materials?
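
for the "is the sample size adequate" question above, a rough check is to ask how many participants a study would need to detect the effect it claims. here's a minimal sketch, assuming a two-sided, two-group comparison of means and a normal approximation. the function name and the effect sizes are made up for illustration, and dedicated power software (which uses the t distribution) will ask for slightly more people.

```python
import math
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """approximate n per group for a two-sided, two-group comparison of means.

    d is the assumed standardized effect size (cohen's d). this uses the
    normal approximation, so it slightly underestimates what a t-test needs.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


# hypothetical effect sizes, just to show how fast n grows as d shrinks
for d in (0.5, 0.2):
    print(d, n_per_group(d))
```
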
questions about the analyses:
-are the analyses appropriate? do they test the key research question(s)? (this might seem like it's super straightforward, but it's often pretty complicated to identify the key statistical analysis that directly tests your prediction (something you become painfully aware of when you try to do a p-curve).)
-are the test statistics correct?  (e.g., run statcheck)
-do the authors report all equally reasonable analyses they could have done to test their question (e.g., different decisions about data exclusions, transformations, covariates, etc.)? 
-if there were multiple analyses or multiple ways to conduct each analysis, do the authors address the increased chance of false positives (because of multiple comparisons)?
-do the authors provide their data and analysis code?  if so, are the results reproducible?
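
the multiple comparisons question above is easy to put numbers on. here's a minimal sketch, assuming independent tests for the first function. the holm step-down correction is just one standard way to control the familywise error rate (it's my choice for illustration, not something the list prescribes), and the function names and example p-values are made up.

```python
def familywise_error_rate(k, alpha=0.05):
    """chance of at least one false positive across k independent tests of true nulls."""
    return 1 - (1 - alpha) ** k


def holm_reject(p_values, alpha=0.05):
    """holm step-down correction: which hypotheses survive at familywise level alpha?"""
    m = len(p_values)
    # visit the p-values from smallest to largest, remembering original positions
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # the threshold loosens as we go: alpha/m, alpha/(m-1), ..., alpha
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


print(round(familywise_error_rate(5), 3))         # five tests at alpha = .05
print(holm_reject([0.01, 0.04, 0.03, 0.005]))     # hypothetical p-values
```
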
2. is the broader claim true: does the finding mean what the authors claim it means?
-are there alternate explanations?
-if the authors are making causal claims, are those justified?
-is the strength of the authors' conclusion calibrated to the strength of their evidence? (there are various ways to think about what "strong evidence" looks like: precise estimate, small p-value, large Bayes Factor, etc.)
-if results are not robust/consistent, are the conclusions appropriately calibrated?
-do the authors focus on some results and ignore others when interpreting what they found?
-do the authors extrapolate too far beyond the evidence they have?  are they careful about any generalizations beyond the population/setting/measures they examined? do they include a statement of constraints on generalizability?
-if the research is not pre-registered, is it presented as exploratory and is the finding presented as a preliminary one that needs to be followed up on?
-do the authors seem open to being wrong? do they take a skeptical approach to their results?  do they have any conflict of interest that may limit their openness to disconfirming evidence?
then what?
if a paper passes both of these hurdles, there are still questions left to ask.  what those questions are depends on your goal.  you might ask whether the study is important or interesting enough to spend more time on (e.g., build on with your own research).  you might ask if the research topic is a good fit with your department.  you might ask if the study is novel or groundbreaking or definitive enough for a particular journal. you might ask what implications it has for theory, or how its findings could be applied in the real world.  
to be honest though, if a paper passes both of these hurdles, i'm pretty impressed.****  maybe my standards will go up as the norms in our field change, but right now, a paper that passes these hurdles stands out to me. 
if there were three people like me, who defined rigor according to the criteria outlined above, then would we agree about which papers are better than others?  which candidate we should hire?  i don't know.  i'd like to think agreement would be slightly better than without a shared definition.  if so, that could be an argument for editorial teams, search committees, grant panels, and other evaluating bodies to try to come up with at least a loose set of criteria by which they define rigor.  perhaps that is a step towards this land of unicorns and roses where we don't just rely on metrics, where scientific work is evaluated on its own merit, and not based on status, fads, or the evaluator's hunger level.*****
* you can literally debate anything on twitter.  even powdered cheese.
** if you like this idea, you might like the SIPS workshop on "how to promote transparency and replicability as a reviewer" (osf page here.  SIPS 2017 program here.) 
*** i don't actually ask myself each of these sub-questions every time.  as with many operationalizations, this is a flawed attempt at articulating some of the things i think i do implicitly, some of the time.  mmmmm operationalization. funtimes.
**** i don't mean passes every single question - i'm not crazy like reviewer 2. i just mean if i can answer yes to "is the narrow claim true?" and to "is the broader claim true?"
***** no, that's probably not a real thing.


be your own a**hole



do you feel frustrated by all the different opinions about what good science looks like?  do you wish there were some concrete guidelines to help you know when to trust your results?  well don't despair!  

it's true that many of the most hotly debated topics in replicability don't have neat answers.  we could go around and around forever.  so in these tumultuous times, i like to look for things i can hold on to - things that have mathematical answers.  

here's one: what should we expect p-values for real effects to look like?  everyone's heard a lot about this,* thanks to p-curve, but every time i run the numbers i find that my intuitions are off. way off.  so i decided to write a blog post to try to make these probabilities really sink in.  

do these p-values make me look fat?
let's assume you did two studies and got p-values between .02 and .05.  should you be skeptical? should other people be skeptical?
many of us have gotten used to thinking of p = .049 as the world's thinnest evidence, or maybe even p = .04.  but what i'll argue here is that those intuitions should stretch out way further.  at least to p = .02, and maybe even to p = .01 or lower.
let's get one thing out of the way: if you pre-registered your design, predictions, and key analyses, and you are interpreting the p-values from those key analyses (i.e., the confirmatory results) then you're fine.  you're doing real Neyman-Pearson NHST (unlike the rest of us), so you can use the p < .05 cutoff and not ask any further questions about that.  go eat your cookie and come back when the rest of us are done dealing with our messy results.
now for the rest of us, who don't pre-register our studies or who like to explore beyond the pre-registered key analysis (i.e., who aren't robots**), what should we make of two results with p-values between .02 and .05?
the math is simple.  using an online app (thanks Kristoffer Magnusson, Daniel Lakens, & JP de Ruiter!) i calculated the probability of one study producing p-values between .02 and .05 when there is actually a true effect. it's somewhere between 5% and 15% (i played around with the sample size to simulate studies with power ranging from 50% to 80%).  so what's the probability of getting two out of two p-values between .02 and .05?  at "best" (i.e., if you're doing underpowered studies), it's 15% x 15% = 2.25%.  with decent power, it's around 0.25% (i.e., 1 in 400).
[screenshot of the app's output: the percentages in the blue box are the ones i'm focusing on here.]
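
if you'd rather check this kind of number without the app, here's a minimal sketch. it assumes a two-sided z-test (a normal approximation, which is not necessarily what the app computes), and the function names are made up for illustration. under this approximation the chance of a single p-value landing between .02 and .05 comes out at roughly 14% with 50% power and roughly 12% with 80% power, so treat any exact percentage as dependent on the test and the power you assume.

```python
from statistics import NormalDist

_STD = NormalDist()


def prob_p_between(power, lo=0.02, hi=0.05, alpha=0.05):
    """P(lo < p < hi) for a two-sided z-test that has the given power at alpha."""
    # effect (in standard-error units) implied by the stated power,
    # ignoring the negligible chance of significance in the wrong direction
    delta = _STD.inv_cdf(1 - alpha / 2) + _STD.inv_cdf(power)

    def prob_p_below(p):
        crit = _STD.inv_cdf(1 - p / 2)
        # both tails of the two-sided test count as "p below the cutoff"
        return (1 - _STD.cdf(crit - delta)) + _STD.cdf(-crit - delta)

    return prob_p_below(hi) - prob_p_below(lo)


def prob_all_between(power, k, lo=0.02, hi=0.05):
    """chance that k independent studies all land in the (lo, hi) window."""
    return prob_p_between(power, lo, hi) ** k


for power in (0.5, 0.8, 0.95):
    print(power, prob_p_between(power), prob_all_between(power, 2))
```

the second function is just the "multiply the probabilities" step from the paragraph above, so it extends directly to three or more studies.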
if you agree with this math (and there's really not much room for disagreement, is there?), this means that, if you get two out of two p-values between .02 and .05, you should be skeptical of your own results.  if you're not skeptical of your own results, you make everyone else look like an asshole when they are skeptical of your results.  don't make them look like assholes.  be your own asshole.
yes, it's possible that you're in that 2%, but it would be unwise not to entertain the far more likely possibility that you somehow, unknowingly, capitalized on chance.***  and it's even more unreasonable to ask someone else not to think that's a likely explanation.  
and that's just two p-values between .02 and .05.  if you have even more than two studies with sketchy p-values (or if your p-values are even closer to .05), the odds you're asking us to believe are even smaller.  you're basically asking anyone who reads your paper to believe that you won the lottery - you managed to get the thinnest evidence possible for your effect that reaches our field's threshold for significance.
of course none of this means that your effect isn't real if your p-values are sketchy.  i'm not saying you should abandon the idea.  just don't stop there and be satisfied with this evidence. the evidence is telling you that you really don't know if there's an effect - whether you remember it or not, you likely did a little p-hacking along the way. that's ok, we all p-hack.  don't beat yourself up, just design a new study, and another, and don't stop until the evidence is strong in one direction or the other.****
and i don't just mean the cumulative evidence.  yes, you can combine the p = .021 study, the p = .025 study, and the p = .038 study to get a much smaller p-value, but that still doesn't explain how you got three high p-values out of three studies (extremely unlikely).  even with the meta-analytic p-value at p < .005, a reader (including you) should still conclude that you probably exploited flexibility in data analysis and that those three results are biased upwards, making the cumulative (meta-analytic) evidence very hard to interpret.  so keep collecting data until you get a set of results that is either very likely if there's a true effect (i.e., mostly small p-values) or very likely under the null (i.e., a flat distribution of p-values).  or, if you're brave and believe you can design a good, diagnostic study, pre-register and commit to believing the results of the confirmatory test. 
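
to see where a "much smaller p-value" can come from, here's a minimal sketch using fisher's method, one common way to combine independent p-values. the post doesn't say which method it has in mind, so that choice is an assumption, and the function name is made up. the three inputs are the p-values from the paragraph above.

```python
import math


def fisher_combined_p(p_values):
    """combine independent p-values with fisher's method."""
    k = len(p_values)
    # under the null, -2 * sum(ln p) follows a chi-square with 2k df
    half_stat = -sum(math.log(p) for p in p_values)
    # for even df the chi-square survival function has a closed form
    tail_sum = sum(half_stat ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_stat) * tail_sum


combined = fisher_combined_p([0.021, 0.025, 0.038])
print(combined)  # comes out below .005
```

the combined value is only as trustworthy as its inputs: the method takes each p-value at face value and assumes the studies are independent and unbiased, which is exactly what a run of p-values just under .05 calls into question.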
if that's too expensive/time-consuming/impossible, then do stop there, write up the results as inconclusive and be honest that there were researcher degrees of freedom, whether you can identify them or not. maybe even consider not using NHST, since you didn't stick to a pre-registered plan.  make the argument that these results (and the underlying data) are important to publish because this small amount of inconclusive evidence is super valuable given how hard the data are to collect. some journals will be sympathetic, and appreciate your honesty.*****
we talk big about how we want to preserve a role for creativity - we don't want to restrict researchers to pre-registered, confirmatory tests.  we need space for exploration and hypothesis generation.  i wholeheartedly agree; everything i do is exploratory.  but the price we have to pay for that freedom and creativity is skepticism.  we can't have it both ways.  we can't ask for the freedom to explore, and then ask that our p-values be interpreted as if we didn't explore, as if our p-values are pure and innocent. 
* brent roberts says it's ok to keep repeating all of the things.
** this is not a dig at robots or pre-registerers.  some of my favorite people walk like robots.
*** yes, that's a euphemism for p-hacking.
**** my transformation into a bayesian is going pretty well, thanks for asking.  if you drink enough tequila you don't even feel any pain.
***** probably not the ones you were hoping to publish in, tbh.