power grab: why i'm not that worried about false negatives

[DISCLAIMER: The opinions expressed in my posts are personal opinions, and they do not reflect the editorial policy of Social Psychological and Personality Science or its sponsoring associations, which are responsible for setting editorial policy for the journal.] 


i've been bitching and moaning for a long time about the low statistical power of psych studies.  i've been wrong.

our studies would be underpowered, if we actually followed the rules of Null Hypothesis Significance Testing (but kept our sample sizes as small as they are).  but the way we actually do research, our effective statistical power is actually very high, much higher than our small sample sizes should allow.  

let's start at the beginning.

background (skip this if you know NHST)

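[table: Null Hypothesis Significance Testing (over)simplified]

                              null is true          null is false
p > .05 (not significant)     correct rejection     false negative
p < .05 (significant)         false positive        hit (power)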

in this table, power is the probability of ending up in the bottom right cell if we are in the right column (i.e., the probability of rejecting the null hypothesis if the null is false).  in Null Hypothesis Significance Testing (NHST), we don't know which column we're in, we only know which row we end up in.  if we get a result with p < .05, we are in the bottom row (and we can publish!*  yay!).  if we end up with a result with p > .05, we end up in the top row (null result, hard to publish, boo).  within each column, the probability of ending up in each of the two cells (top row, bottom row) adds up to 100%.  so, when we are in the left column (i.e., when the null is actually true, unbeknownst to us), the probability of getting a false positive (typically assumed to be 5%, if we use p < .05 as our threshold for statistical significance) plus the probability of a correct rejection (95%) add up to 100%.  and, when we are in the right column (i.e., when the null is false, also unbeknownst to - but hoped for by - us), the probability of a false negative (ideally at or below 20%) plus the probability of a hit (i.e., statistical power; 80%) add up to 100%.

side note: even if the false positive rate actually is 5% when the null is true, it does not follow that only 5% of significant findings are false positives.  5% is the proportion of findings in the left column that are in the bottom left cell.  what we really want to know is the proportion of results in the bottom row that are in the bottom left cell (i.e., the proportion of false positives among all significant results).  this is called the Positive Predictive Value (PPV) and would likely correspond closely to the rate of false positives in the published literature (since the published literature consists almost entirely of significant key findings).  but we don't know what it is, and it could be much higher than 5%, even if the false positive rate in the left column really was 5%.**
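a back-of-the-envelope sketch of the PPV point, in python.  the share of tested hypotheses where the null is false (`prior_true`) is an assumption made up for illustration, since nobody knows the real value; the 80% power and 5% false positive rate are the textbook numbers from the background section.

```python
def ppv(prior_true, power, alpha):
    """Share of significant results that are true positives.

    prior_true: assumed share of tested hypotheses where the null is false
                (a made-up input; nobody knows the real value)
    power:      P(significant | null is false)
    alpha:      P(significant | null is true)
    """
    hits = prior_true * power
    false_positives = (1 - prior_true) * alpha
    return hits / (hits + false_positives)


# hypothetical scenarios: same 5% false positive rate, same 80% power,
# different shares of true effects among the hypotheses we test
for prior in (0.5, 0.25, 0.1):
    share_false = 1 - ppv(prior, power=0.80, alpha=0.05)
    print(f"prior={prior:.2f}: {share_false:.0%} of significant results are false positives")
```

under these made-up inputs, if only 10% of the hypotheses we test are true, 36% of significant results are false positives, even with a 5% false positive rate and 80% power.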

back to the main point.

we have small sample sizes in social/personality psychology.  small sample sizes often lead to low power, at least with the effect sizes (and between-subjects designs) we're typically dealing with in social and personality psychology.  therefore, like many others, i have been beating the drum for larger sample sizes.

not background

our samples are too small, but despite our small samples, we have been operating with very high effective power.  because we've been taking shortcuts.

the guidelines about power (and about false positives and false negatives) only apply when we follow the rules of NHST. we do not follow the rules of NHST.  following the rules of NHST (and thus being able to interpret p-values the way we would like to interpret them, the way we teach undergrads to interpret them) would require making a prediction and pre-registering a key test of that prediction, and only interpreting the p-value associated with that key test (and treating everything else as preliminary, exploratory findings that need to be followed up on). 

since we violate the rules of NHST quite often, by HARKing (Hypothesizing After Results are Known), p-hacking, and not pre-registering, we do not actually have a false positive error rate of 5% when the null is true.  that's not new - that's the crux of the replicability crisis.  but there's another side of that coin.  

the point of p-hacking is to get into the bottom row of the NHST table - we cherry-pick analyses so that we end up with significant results (or we interpret all significant results as robust, even when we should not because we didn't predict them).  in other words, we maximize our chances of ending up in the bottom row.  this means that, when we're in the left column (i.e., when the null is true), we inflate our chances of getting a false positive to something quite a bit higher than 5%.  

but it also means that, when we're in the right column (i.e., when the null hypothesis is false), we increase our chances of a hit well beyond what our sample size should buy us.  that is, we increase our power. but it's a bald-faced power grab.  we didn't earn that power.

that sounds like a good thing, and it has its perks for sure.  for one thing, we end up with far fewer false negatives. indeed, it's one of the main reasons i'm not worried about false negatives.  even if we start with 50% power (i.e., if we have a 50% chance of a hit when the null is false, if we follow the rules of NHST), and then we bend the rules a bit (give ourselves some wiggle room to adjust our analyses based on what we see in the data), we could easily be operating with 80% effective power (i haven't done the simulations but i'm sure one of you will***).  
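a rough sketch of what such a simulation might look like, in python.  everything specific here is an assumption for illustration, not a model of any real study: two groups of 20, a z-test with known SD = 1, and a "p-hacker" who measures three independent outcomes (each with the same true effect) and reports whichever one reaches p < .05.

```python
import math
import random


def p_value(group_a, group_b):
    """Two-sided z-test p-value for a difference in means, with known SD = 1."""
    n = len(group_a)
    diff = sum(group_b) / n - sum(group_a) / n
    z = diff / math.sqrt(2 / n)
    return math.erfc(abs(z) / math.sqrt(2))


def rejection_rate(effect, n_per_group, n_outcomes, n_studies, rng):
    """Share of simulated studies with at least one outcome at p < .05."""
    significant = 0
    for _ in range(n_studies):
        for _ in range(n_outcomes):
            control = [rng.gauss(0, 1) for _ in range(n_per_group)]
            treated = [rng.gauss(effect, 1) for _ in range(n_per_group)]
            if p_value(control, treated) < 0.05:
                significant += 1
                break  # the p-hacker stops at the first significant outcome
    return significant / n_studies


rng = random.Random(2016)  # arbitrary seed, only for reproducibility
n = 20
effect = 1.96 * math.sqrt(2 / n)  # effect size giving ~50% power for one planned test

for label, k in (("one planned test", 1), ("best of three outcomes", 3)):
    power = rejection_rate(effect, n, k, 5000, rng)
    false_pos = rejection_rate(0.0, n, k, 5000, rng)
    print(f"{label}: power ~ {power:.2f}, false positive rate ~ {false_pos:.2f}")
```

with independent outcomes the expected values are easy to check by hand: 1 - .5^3 is about .88 for power, and 1 - .95^3 is about .14 for the false positive rate.  real p-hacking usually involves correlated outcomes and other kinds of flexibility, so the actual numbers would differ.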

what's the downside?  well, all the false positives.  p-hacking is safe as long as our predictions are correct (i.e., as long as the null is false, and we're in the right column).  then we're just increasing our power.  but if we already know that our predictions are correct, we don't need science.  if we aren't putting our theories to a strong test - giving ourselves a serious chance of ending up with a true null effect - then why bother collecting data?  why not just decide truth based on the strength of our theory and reasoning?

to be a science, we have to take seriously the possibility that the null is true - that we're wrong.  and when we do that, pushing things that would otherwise end up in the top row into the bottom row becomes much riskier.  if we can make many null effects look like significant results, our PPV (and rate of false positives in the published literature) gets all out of whack. a significant p-value no longer means much.

nevertheless, all of us who have been saying that our studies are underpowered were wrong.  or at least we were imprecise.  our studies would be underpowered if we were not p-hacking, if we pre-registered,**** and if we only interpreted p-values for planned analyses. but if we're allowed to do what we've always done, our power is actually quite high.  and so is our false positive rate.


other reasons i'm not that worried about false negatives:

  • they typically don't become established fact, as false positives are wont to do, because null results are hard to publish as key findings.  if they aren't published, they are unlikely to deter others from pursuing the same question.
  • when they are published as side results, they are less likely to become established fact because, well, they're not the key results.
  • if they do make it into the literature as established fact, a contradictory (i.e., significant) result would probably be relatively easy to publish because it would be a) counter-intuitive, and b) significant (unlike results contradicting false positives, which may be seen as counter-intuitive but would still be subject to the bias against null results).

in short, while i agree with Fiedler, Kutzner, & Krueger (2012)***** that "The truncation of research on a valid hypothesis is more damaging [...] than the replication of research on a wrong hypothesis", i don't think many lines of research get irreversibly truncated by false negatives.  first, because the lab that was testing the valid hypothesis is likely motivated to find a significant result, and has many tools at its disposal to get there (e.g., p-hacking), even if the original p-value is not significant.  second, because even if that lab concludes there is no effect, that conclusion is unlikely to spread widely.

so, next time someone tells you your study is underpowered, be flattered.  they're assuming you don't want to p-hack, or take shortcuts, that you want to earn your power the hard way. no help from the russians.*******

* good luck with that.

** it's not.

*** or, you know, use your github to write up a paper on it with Rmarkdown which you'll put in your jupyter notebook before you make a shinyapp with the figshare connected to the databrary. 

**** another reason to love pre-registration:  if we all engaged in thorough pre-registration, we could stop all the yapping and get to the bottom of this replicability thing. rigorous pre-registration will force us to face the truth about the existence and magnitude of our effects, whatever it may be.  can we reliably get our effects with our typical sample sizes if we remove the p-hacking shortcut?  let's stop arguing****** and find out!

***** Fiedler et al. also discuss "theoretical false negatives", which i won't get into here.  this post is concerned only with statistical false negatives.  in my view, what Fiedler et al. call "theoretical false negatives" are so different from statistical false negatives that they deserve an entirely different label.

****** ok, let's not completely stop arguing - what will the psychMAP moderators do, take up knitting?

******* too soon?

psychMAP moderators, when they're not moderating




now is the time to double down on self-examination



it can be tempting, when contemplating the onslaught that science is likely to face from the next administration and congress, to scrub away any sign of self-criticism or weakness that could be used against us.  as a "softer" science, psychology has reason to be especially nervous.*

but hiding our flaws is exactly the wrong response.  if we do that, we will be contributing to our own demise. the best weapon anti-science people can use against us is to point to evidence that we are no different from other ways of knowing.  that we have no authority when it comes to empirical/scientific questions.  our authority comes from the fact that we are open to scrutiny, to criticism, to being wrong.  the failed replications, and the fact that we are publishing and discussing them openly, are the best evidence we have that we are a real science.  that we are different from propaganda, appeals to authority, or intuition.  we are falsifiable.  the proof is that we have, on occasion, falsified ourselves. 
we should wear our battle with replicability as a badge of honor.  it's why the public should trust us to get things right, in the long run.  it's why it means something when we have confidence in a scientific discovery.  we should be proud of the fact that we don't take that word - discovery - lightly.  we scrutinize, criticize, attempt to falsify. that's why we will survive attacks on science.
yes, our failures will be used against us.  we will lose some battles.  but if we let those attacks scare us away from self-criticism and self-correction, we will have lost the war.
when we find a flaw in our process or in our literature, we need to responsibly communicate what it means and what it doesn't mean.  the failed replications i've seen in the last few years have been careful to do that.  those producing these critiques and "failures" are not trying to tear us - or anyone - down.  they are demonstrating just how tough we are.
the next few years are going to get even more challenging, but we need to resist the temptation to give in to a bunker mentality, to shield ourselves from criticism.  we need to be even more transparent, even more honest, than we have been until now.  we cannot fight ignorance with secrecy, we must face it head on with openness and faith in the scientific enterprise.

* if you came here for the jokes, you're out of luck.  this is a Very Serious post. mostly because my hotel room does not have a mini bar.**

** canada is, apparently, not perfect.

a little bit louder now

[DISCLAIMER: The opinions expressed in my posts are personal opinions, and they do not reflect the positions or policies of any institution with which I am affiliated.] 


i can't even begin to imagine how many times in her life hillary clinton has had to bite her tongue.  through everything, through trump, i never saw her even begin to lose her cool.  
on nov. 8th, i realized that i had assumed that all of that self-control, all of that turning the other cheek, meant that she had earned a win.  i made the same gamble on hillary's behalf that women and minorities make every day: that if we take the blows without flinching, if we don't play the woman/minority card, if we keep demonstrating our competence, people will have no choice but to recognize it.  
one lesson from this election is that this strategy doesn't always work.  it assumes a reality that does not yet exist.  we will not get extra credit for our patience and forbearance in the face of sexism, racism, homophobia, or xenophobia.  no one will pat us on the back for biting our tongues.  it is time to start speaking up.
i hope that when people watched the debates, and admired clinton's strength, they made the parallel to the less extreme situations that women and minorities face every day.  we don't often face someone as vile as trump, but if we want to be successful we have to be prepared to absorb smaller slights on a regular basis, to keep our cool when faced with ignorant, unfair, or offensive comments.  
why don't we call them out?  because it's extremely difficult.  because there are too many things to call out.  because there is backlash.  because we often don't know for sure how much sexism is to blame for any particular event.  because often the person being sexist is someone we like and respect, and we don't want to make them uncomfortable. because we often don't want to derail the conversation, or take away from the larger goal of the group. because there is gaslighting. because it's fucking exhausting to speak up.  
it is very hard to decide whether and when to speak up.*  i don't know the right answer, but hillary's loss has convinced me that the answer is: more.  so when it feels like the right thing to do, i will try very hard to say something.  saying words out loud is not my strong suit, so we'll see how it goes.  i'll probably fail, a lot.**  
i'll also try to speak up more when people are good. when people stick their necks out for women, for minorities, for people who have less voice, less visibility, more to lose. i think one reason i underestimated how much of an uphill battle hillary was facing is because i know a lot of good people. people who believed in me stubbornly, persistently, fiercely. people who treated me as an authority on my own experience, and on other topics, well before i thought i had anything worth saying, or trusted my own voice. i cannot thank those people enough, but i can try to pay it forward. 
speaking up scares the shit out of me. but if you do it with me, it'll seem a little less scary.*** i'm not talking about being a jerk, i'm talking about being more honest.  flinching, if you feel like flinching.  giving other people the chance to hear your experience.  telling people how things look and sound to you does not make you a nasty woman. and staying silent will not guarantee any reward.
* for example, i deliberated for a long time about whether or not to publish this blog post.  i'm still not sure about it.  presumably if you are reading this i decided to post it.  yikes.

** i have already failed, twice, since writing this sentence.

*** reading Lindy West's book, Shrill, also makes it seem a little less scary.  go read it.

who will watch the watchers?


he makes it look easy.
i wasn't going to write a blog post about susan fiske's column.  many others have already raised excellent points about the earlier draft of her column, and about the tone discussion more generally.  but i have two points to make, and poor self-control.
point #1.
i have a complicated relationship with the tone issue.  on one hand, i hate the "if you can't stand the heat, get out of the kitchen" attitude.  being able to stand the heat is not a personal accomplishment.  it's often a consequence of privilege.  those who have been burnt many times before (by disadvantage, silencing, etc.) are less likely to want to hang out in a place with a ton of heat.  and we need to make our field welcoming to those people who refuse to tolerate bullshit.  we need more people like that, not fewer.
on the other hand, i don't think the self-appointed data police are the most egregious source of bullshit.  the vast majority of methodological freedom fighters* are doing something good for the field.  they are literally doing the things i teach my undergrad research methods students to do.  they are reading papers, critiquing the methods, double-checking the stats, evaluating the conclusions.  if more people want to volunteer their time to check our work, we should welcome them. even if they are robots.** i'd rather spend my energy trying to fix the other sources of bullshit that i think are much more pernicious - mainly the fact that many early career researchers receive inadequate support and are often taken advantage of in various ways.  we don't treat our young well, and we need to do better.***  this is not new to this era, and it doesn't get enough attention.
so, point #1: let's turn down the heat in the kitchen so that everyone feels welcome, but the source of the most heat is not the data police.  it's the same source that has been creating heat since the beginning of mankind: status/power/privilege/entitlement.  
point #2.
fiske's column places a great deal of emphasis on the importance of gatekeepers.  she criticizes some forms of communication for being "uncurated" and "unmoderated."  instead, she advocates for critics to make "their rebuttals and letters-to-the-editor subject to editorial oversight and peer review."  she praises changes made through APS because "APS innovates via expert consensus" and this way research is "judged through monitored channels."  she prefers these channels because they "offer continuing education, open discussion, and quality control."
full disclosure: i am one of those gatekeepers.****  i am also a board member of APS, and proud of some of the things the organization has done.  but i find this line of argument extremely problematic.*****  
the "expert consensus" and "quality control" that fiske trusts is a big part of what got us into this mess.  i think we can forgive our colleagues if they no longer feel that the gatekeepers are always wise arbiters of quality.
moreover, if a reader wants to critique a published paper, especially if she believes she has found a major flaw in the paper, she is implicitly critiquing the journal it was published in and its expert judgment.  it seems unreasonable to me to require that she only be allowed to express that criticism if it meets with the approval of the journal's editor (and reviewers, who are likely to include the author whose work is being critiqued).
more generally, gatekeepers need to be held accountable, and open to criticism.  fiske seems to want to grant them more power and control over what views are allowed to be expressed.  in contrast, i would like there to be more avenues for people to express criticisms of the system.  what fiske calls "quality control" i would call a system that is responsive to some perverse incentives and that is designed to protect itself and its reputation.  
the healthiest thing we can do for the peer review system is to give its critics a voice.  we should expose the process and its flaws to the sunlight, including through uncensored post-publication peer review.  journals, editors, and societies should welcome open discussion of the current system, and post-publication critiques of their output.  we bemoan how hard it is to find reviewers before publication, and then smear the people who donate their time to post-publication review.  if we care about the quality of our published record, we should appreciate both kinds of feedback.
a while back on facebook, someone jokingly suggested a ratemyeditor.com website.  it wasn't a serious proposal, but it reflects a real concern: what are researchers supposed to do when they want to critique the system?  who will watch the watchers?  rather than concentrate even more power in the hands of the gatekeepers, i say let's open the doors to more voices.
* credit for this phrase goes to sanjay srivastava. 
** i'm not gonna name names, but a certain someone does walk a lot like a robot.  not saying who, just saying, would a non-robot come up with the phrase "methodological freedom fighter"?

*** i was lucky to have excellent support from my advisor and many others as an early career researcher, so i am mostly not speaking from personal experience.  except for that time a famous social psychologist walked up to me and grabbed my ass at my first SPSP.  true story.
**** in case it's not already super obvious, the views i express here do not necessarily reflect the views of any journal or society i am affiliated with. 


i have found the solution and it is us


bear, having recently joined SIPS

i have found scientific utopia.*  

sometimes, when i lie awake at night, it's hard for me to believe that science will ever look the way i want it to look,** with everyone being skeptical of preliminary evidence, conclusions being circumscribed, studies being pre-registered, data and materials being open, and civil post-publication criticism being a normal part of life.
then i realized that utopia already exists.  it's how we treat replication studies.
i've never tried to do a replication study,*** but some of my best friends (and two of my grad students) are replicators.  so i know a little bit about the process of trying to get a replication study published.  short version: it's super hard.
we (almost always) hold replication studies to an extremely high standard.  that's why i'm surprised whenever i hear people say that researchers do replications in order to get an 'easy' publication.  replications are not for the faint of heart.  if you want to have a chance of getting a failed replication**** published in a good journal, here's what you often have to do:
  • carefully match the design and procedure to the original study. except when you shouldn't. adapt the materials to your local time and culture, but don't change too much.  pre-test your materials if possible.  get the design approved by people who probably wish you weren't doing this study.
  • pre-register everything. be thorough - expect people to assume you were biased (more biased than the original researchers) and that these biases might have led you to inadvertently cherry-pick. 
  • run a shitload of participants.
  • repeat.
  • one more time, you forgot a potential moderator that the reviewers just thought of.
  • make sure that nowhere in your manuscript do you state with confidence that the evidence shows there is no effect.  admit that you might have made a type II error.  circumscribe all your conclusions and give serious consideration to alternative explanations, even if you think they're very unlikely.
  • make all of your materials publicly available.  expect people to question your expertise, so provide as much information as possible about all details of your method.
  • make all of your data publicly available.  this is not optional.  you don't get the same liberties as original researchers - open data is an imperative.
  • get through peer review, with a decent chance that one of your reviewers will perceive your manuscript as a personal attack on them.
  • once your paper is published be prepared for it to be picked apart and reanalyzed.  your conclusions will not be allowed to stand on their own, they will be doubted, and your results will be discounted for a variety of reasons.
  • you will be told by well-respected senior scientists that you are harming your field, or wasting your time, or betraying your lack of creativity, or that you must be trying to take people down.  
from the tone of my writing, i wouldn't blame you for assuming that i think we are unfair to replication studies.  actually, i think we are unfair to original studies.  i would like us to treat original studies more like we treat replication studies (except maybe that last point).
imagine, if you can, a world where we hope for the following in original research:
  • pour tons of thought and effort into the study design before you think about collecting data.  show the design and procedure to skeptics who don't believe your theory.  let them pick it apart and make changes.  build a study that even they would agree is a strong test.
  • think about whether your materials would work in other times and places, so that researchers who want to replicate your study in other contexts know what factors to consider.
  • pre-register your study.  document everything assuming people will want to rule out the possibility that you got your result by inadvertently exploiting flexibility in the research process. 
  • run a shitload of participants.
  • assume your readers will not believe your first study results, and will propose hidden moderators.  repeat it a couple times just to be sure, testing some potential moderators along the way (pre-register these, too).
  • draw very circumscribed conclusions.  seriously consider the possibility that your result is a type I error.  entertain alternative explanations that are not sympathetic to your theory, and do not wave them away.  sit with them.  cuddle them.  let them into your heart.
  • make all of your materials and data publicly available. 
  • tell us about anything in your file drawer, let readers decide if it really belongs in the file drawer.
  • expect others to reanalyze your data, criticize your design, analyses, and conclusions, and propose post-hoc alternative explanations.  admit that they might be right, we can't know for sure until future research directly tests these explanations.
  • be accused of letting your bias creep in, in ineffable ways that cannot be cured by pre-registration and open data, and accept that this is a possibility.
  • accept that it will take a lot more to determine if the effect is real.  consider an adversarial collaboration.
of course not every study (original or replication) can have all of these features, and we should be flexible about tolerating studies with various limitations.  there are important side effects of raising standards in absolute, black and white ways.  we need room for studies that fall short on some criteria but are important for other reasons.  i know, because my own work falls far short of many of these ideals.  but these are nevertheless ideals we should strive for and value.  
one of the beautiful things about replication studies is that they make everyone see the value of these practices.  people who didn't care much for open data, pre-registration, etc., are often in favor of those practices when it comes to replication.  people who are resistant to the idea that original researchers' motives and biases could lead to systematic distortions in the published literature are more sympathetic to this concern when it comes to replication researchers' motives and biases.  
can you imagine what we'd say if a replication author refused to share their data or materials for no good reason?  can you imagine the reaction if we found out they hid some studies in the file drawer?*****  can you imagine how we'd react if their conclusions went way beyond the evidence?
there seems to be a consensus that replication studies, including materials, data, and results, belong to all of us. we feel we are owed complete transparency, and we expect to all have a say in how the studies should be evaluated and interpreted.
those are the correct reactions, but not just to replications.  these reactions provide an opportunity to build on this new common ground.  let's apply those values to all research, not just replications.
* no, i don't mean SIPS.  i'm barely even going to mention SIPS in this blog post.  or our new website.  with its new domain name: http://improvingpsych.org/  
*** hypocrisy is one of my favorite pastimes. for example just this morning i ate a corndog for breakfast while simultaneously believing people should refrain from eating meat AND should eat a healthy balanced breakfast.

**** not that we know whether a replication will succeed or fail ahead of time (in all the cases i know of, the replicators were genuinely curious about the effect and open to any outcome).  but i'm talking about failed replications here because i think they're subjected to more scrutiny.  if the replication results come out positive, you face different challenges in getting those results published (some people think it's harder to publish successful replications than failed ones, but the evidence is not clear). 

cj and jöreskog, having recently discovered GIFs