what is rigor?

[DISCLAIMER: The opinions expressed in my posts are personal opinions, and they do not reflect the editorial policy of Social Psychological and Personality Science or its sponsoring associations, which are responsible for setting editorial policy for the journal.] 


recently i wrote a thing where i was the eight millionth person to say that we should evaluate research based on its scientific qualities (in this case, i was arguing that we should not evaluate research based on the status or prestige of the person or institution behind it).  i had to keep the column short, so i ended with the snappy line "Let's focus less on eminence and more on its less glamorous cousin, rigor." 

simple, right?

the question of how to evaluate quality came up again on twitter,* and Tal Yarkoni expressed skepticism about whether scientists can agree on what makes for a high quality paper.  

[screenshot of Tal Yarkoni's tweet]

there's good reason to be skeptical that scientists - even scientists working on the same topic - would agree on the quality of a specific paper.  indeed, the empirical evidence regarding consensus among reviewers during peer review suggests there is ample disagreement (see this systematic review, cited in this editorial that is absolutely worth a read).

so, my goal here is to try to outline what i mean by "rigor" - to operationally define this construct at least in the little corner of the world that is my mind.  i have no illusions that others will share my definition, nor do i think that they should - there is some value to different people having different approaches to evaluating work, because then each may catch something the others miss (this is one of the reasons i'm not too concerned about low agreement among peer reviewers - we often select them on purpose to focus on different things or have different biases).  nevertheless, i think we should try to build some consensus as a field about what some important criteria are for evaluating the quality of research, and maybe if we start trying to articulate these more explicitly, we can see where people's views overlap and where they don't.  i think the exchange between Simmons and Finkel & Eastwick (and Simmons's response) was a nice example of that.  

what is rigor?  to me, this is what teaching research methods is all about: trying to identify the specific features that make for a good scientific study.  so it won't surprise you that many of the things on my list of what makes a study rigorous will be super basic things that you learned in your first year of college.  but it's amazing to me how often these issues are overlooked or waved away in studies by seasoned researchers. thus, even though many of these things will seem too obvious to mention, i think they need to be mentioned. still, this list isn't even close to exhaustive, and it's very heavily oriented towards social and personality psych studies.  i would love to see others' lists, and to work together to come up with a slightly more comprehensive list.**

so, what do i ask myself when i'm reading and evaluating a scientific paper in my field (e.g., when i'm evaluating job candidates, trying to decide whether to build on a finding in my own research, reviewing a manuscript for a journal, etc.)?***  i ask myself two broad questions.  there are many more specific questions you could list under each one, but these broad questions are how i orient my thinking.

1. is the narrow claim true: did the authors find the effect they say they found?

questions about the research design:
-is the sample size adequate and is it clear how it was determined?  
-is the population appropriate for the questions the authors are trying to answer?
-are the measures and manipulations valid (and validated)?
-are there confounds? selection effects? selective dropout? (i.e., were the groups equal to begin with, and treated equally other than the manipulated variable?)
-could there be demand characteristics?
-did the authors design the study such that null results could be informative? (big enough sample, validity checks, etc.) 
-do the authors report all conditions and all measures collected? 
 -do the authors provide links to their materials?
 
questions about the analyses:
-are the analyses appropriate? do they test the key research question(s)? (this might seem like it's super straightforward, but it's often pretty complicated to identify the key statistical analysis that directly tests your prediction (something you become painfully aware of when you try to do a  p-curve).)
-are the test statistics correct?  (e.g., run statcheck; see the sketch after this list for the basic idea)
-do the authors report all equally reasonable analyses they could have done to test their question (e.g., different decisions about data exclusions, transformations, covariates, etc.)? 
-if there were multiple analyses or multiple ways to conduct each analysis, do the authors address the increased chance of false positives (because of multiple comparisons)?
-do the authors provide their data and analysis code?  if so, are the results reproducible?
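
a quick aside on the statcheck question above: statcheck is an R package, but the core check is simple enough to do by hand.  here's a minimal python sketch of the basic idea - recomputing the p-value implied by a reported test statistic - with made-up numbers for illustration.

```python
# a rough, statcheck-style consistency check (not the actual statcheck R package):
# recompute the p-value implied by a reported test statistic and compare it to the
# reported p.  the reported numbers here are made up for illustration.
from scipy import stats

reported_t, df, reported_p = 2.31, 58, 0.02      # e.g., a paper reports t(58) = 2.31, p = .02
implied_p = 2 * stats.t.sf(abs(reported_t), df)  # two-sided p implied by t and df
print(f"reported p = {reported_p}; p implied by t({df}) = {reported_t}: {implied_p:.3f}")
```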
 
2. is the broader claim true: does the finding mean what the authors claim it means?
 
-are there alternate explanations?
-if the authors are making causal claims, are those justified?
-is the strength of the authors' conclusion calibrated to the strength of their evidence? (there are various ways to think about what "strong evidence" looks like: precise estimate, small p-value, large Bayes Factor, etc.)
-if results are not robust/consistent, are the conclusions appropriately calibrated?
-do the authors focus on some results and ignore others when interpreting what they found?
-do the authors extrapolate too far beyond the evidence they have?  are they careful about any generalizations beyond the population/setting/measures they examined? do they include a statement of constraints on generalizability?
-if the research is not pre-registered, is it presented as exploratory and is the finding presented as a preliminary one that needs to be followed up on?
-do the authors seem open to being wrong? do they take a skeptical approach to their results?  do they have any conflict of interest that may limit their openness to disconfirming evidence?
 
then what?
 
if a paper passes both of these hurdles, there are still questions left to ask.  what those questions are depends on your goal.  you might ask whether the study is important or interesting enough to spend more time on (e.g., build on with your own research).  you might ask if the research topic is a good fit with your department.  you might ask if the study is novel or groundbreaking or definitive enough for a particular journal. you might ask what implications it has for theory, or how its findings could be applied in the real world.  
 
to be honest though, if a paper passes both of these hurdles, i'm pretty impressed.****  maybe my standards will go up as the norms in our field change, but right now, a paper that passes these hurdles stands out to me. 
 
if there were three people like me, who defined rigor according to the criteria outlined above, then would we agree about which papers are better than others?  which candidate we should hire?  i don't know.  i'd like to think agreement would be slightly better than without a shared definition.  if so, that could be an argument for editorial teams, search committees, grant panels, and other evaluating bodies to try to come up with at least a loose set of criteria by which they define rigor.  perhaps that is a step towards this land of unicorns and roses where we don't just rely on metrics, where scientific work is evaluated on its own merit, and not based on status, fads, or the evaluator's hunger level.*****
 
 
* you can literally debate anything on twitter.  even powdered cheese.
** if you like this idea, you might like the SIPS workshop on "how to promote transparency and replicability as a reviewer" (osf page here.  SIPS 2017 program here.) 
*** i don't actually ask myself each of these sub-questions every time.  as with many operationalizations, this is a flawed attempt at articulating some of the things i think i do implicitly, some of the time.  mmmmm operationalization. funtimes.
**** i don't mean passes every single question - i'm not crazy like reviewer 2. i just mean if i can answer yes to "is the narrow claim true?" and to "is the broader claim true?"
***** no, that's probably not a real thing.


 

be your own a**hole

[DISCLAIMER: The opinions expressed in my posts are personal opinions, and they do not reflect the editorial policy of Social Psychological and Personality Science or its sponsoring associations, which are responsible for setting editorial policy for the journal.] 


do you feel frustrated by all the different opinions about what good science looks like?  do you wish there were some concrete guidelines to help you know when to trust your results?  well don't despair!  

it's true that many of the most hotly debated topics in replicability don't have neat answers.  we could go around and around forever.  so in these tumultuous times, i like to look for things i can hold on to - things that have mathematical answers.  

here's one: what should we expect p-values for real effects to look like?  everyone's heard a lot about this,* thanks to p-curve, but every time i run the numbers i find that my intuitions are off. way off.  so i decided to write a blog post to try to make these probabilities really sink in.  

do these p-values make me look fat?
 
let's assume you did two studies and got p-values between .02 and .05.  should you be skeptical? should other people be skeptical?
 
many of us have gotten used to thinking of p = .049 as the world's thinnest evidence, or maybe even p = .04.  but what i'll argue here is that those intuitions should stretch out way further.  at least to p = .02, and maybe even to p = .01 or lower.
 
let's get one thing out of the way: if you pre-registered your design, predictions, and key analyses, and you are interpreting the p-values from those key analyses (i.e., the confirmatory results) then you're fine.  you're doing real Neyman-Pearson NHST (unlike the rest of us), so you can use the p < .05 cutoff and not ask any further questions about that.  go eat your cookie and come back when the rest of us are done dealing with our messy results.
 
now for the rest of us, who don't pre-register our studies or who like to explore beyond the pre-registered key analysis (i.e., who aren't robots**), what should we make of two results with p-values between .02 and .05?
 
the math is simple.  using an online app (thanks Kristoffer Magnusson, Daniel Lakens, & JP de Ruiter!) i calculated the probability of one study producing p-values between .02 and .05 when there is actually a true effect. it's somewhere between 5% and 15% (i played around with the sample size to simulate studies with power ranging from 50% to 80%).  so what's the probability of getting two out of two p-values between .02 and .05?  at "best" (i.e., if you're doing underpowered studies), it's 15% x 15% = 2.25%.  with decent power, it's around 0.25% (i.e., 1 in 400).
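
if you want to check these numbers without the app, here's a minimal sketch (my own code, not the app's) of the same calculation for a two-sided, two-sample t-test.  the effect size (d = .5) and the sample sizes are illustrative choices, picked to span roughly 50% to 80% power.

```python
# probability that a two-sided, two-sample t-test on a true effect lands in the
# .02 < p < .05 window.  d = .5 and the n's below are illustrative assumptions.
import numpy as np
from scipy import stats

def prob_p_in_window(d, n_per_group, lo=0.02, hi=0.05):
    """P(lo < p < hi) when the true standardized effect is d."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)           # noncentrality parameter
    def prob_p_below(alpha):
        t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-sided critical value
        return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)
    return prob_p_below(hi) - prob_p_below(lo)

for n in (32, 48, 64):
    one_study = prob_p_in_window(d=0.5, n_per_group=n)
    print(f"n = {n}/group: P(.02 < p < .05) = {one_study:.2f}, two out of two = {one_study**2:.3f}")
```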
 
[screenshot of the app's results - percentages in the blue box are the ones i'm focusing on here]
 
if you agree with this math (and there's really not much room for disagreement, is there?), this means that, if you get two out of two p-values between .02 and .05, you should be skeptical of your own results.  if you're not skeptical of your own results, you make everyone else look like an asshole when they are skeptical of your results.  don't make them look like assholes.  be your own asshole.
 
yes, it's possible that you're in that 2%, but it would be unwise not to entertain the far more likely possibility that you somehow, unknowingly, capitalized on chance.***  and it's even more unreasonable to ask someone else not to think that's a likely explanation.  
 
and that's just two p-values between .02 and .05.  if you have even more than two studies with sketchy p-values (or if your p-values are even closer to .05), the odds you're asking us to believe are even smaller.  you're basically asking anyone who reads your paper to believe that you won the lottery - you managed to get the thinnest evidence possible for your effect that reaches our field's threshold for significance.
 
of course none of this means that your effect isn't real if your p-values are sketchy.  i'm not saying you should abandon the idea.  just don't stop there and be satisfied with this evidence. the evidence is telling you that you really don't know if there's an effect - whether you remember it or not, you likely did a little p-hacking along the way. that's ok, we all p-hack.  don't beat yourself up, just design a  new study, and another, and don't stop until the evidence is strong in one direction or the other.****
 
and i don't just mean the cumulative evidence.  yes, you can combine the p = .021 study, the p = .025 study, and the p = .038 study to get a much smaller p-value, but that still doesn't explain how you got three p-values that close to .05 out of three studies (extremely unlikely if the effect is real).  even with the meta-analytic p-value at p < .005, a reader (including you) should still conclude that you probably exploited flexibility in data analysis and that those three results are biased upwards, making the cumulative (meta-analytic) evidence very hard to interpret.  so keep collecting data until you get a set of results that is either very likely if there's a true effect (i.e., mostly small p-values) or very likely under the null (i.e., a flat distribution of p-values).  or, if you're brave and believe you can design a good, diagnostic study, pre-register and commit to believing the results of the confirmatory test. 
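
(if you want to check that combined number, here's one way to pool the three p-values - Fisher's method, which is just one reasonable choice, not necessarily how you'd meta-analyze in a real paper.)

```python
# pooling the three p-values from the example above with Fisher's method.
from scipy import stats

stat, p_combined = stats.combine_pvalues([0.021, 0.025, 0.038], method="fisher")
print(f"meta-analytic p (Fisher's method) = {p_combined:.4f}")
```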
 
if that's too expensive/time-consuming/impossible, then do stop there, write up the results as inconclusive and be honest that there were researcher degrees of freedom, whether you can identify them or not. maybe even consider not using NHST, since you didn't stick to a pre-registered plan.  make the argument that these results (and the underlying data) are important to publish because this small amount of inconclusive evidence is super valuable given how hard the data are to collect. some journals will be sympathetic, and appreciate your honesty.*****
 
we talk big about how we want to preserve a role for creativity - we don't want to restrict researchers to pre-registered, confirmatory tests.  we need space for exploration and hypothesis generation.  i wholeheartedly agree; everything i do is exploratory.  but the price we have to pay for that freedom and creativity is skepticism.  we can't have it both ways.  we can't ask for the freedom to explore, and then ask that our p-values be interpreted as if we didn't explore, as if our p-values are pure and innocent. 
 
* brent roberts says it's ok to keep repeating all of the things.
** this is not a dig at robots or pre-registerers.  some of my favorite people walk like robots.
*** yes, that's a euphemism for p-hacking.
**** my transformation into a bayesian is going pretty well, thanks for asking.  if you drink enough tequila you don't even feel any pain.
***** probably not the ones you were hoping to publish in, tbh.
 
 

Perspectives You Won't Read in Perspectives: Thoughts on Gender, Power, & Eminence

[DISCLAIMER: The opinions expressed in my posts are personal opinions, and they do not reflect the editorial policy of Social Psychological and Personality Science or its sponsoring associations, which are responsible for setting editorial policy for the journal.] 

 

This is a guest post by Katie Corker on behalf of a group of us

 
 
Rejection hurts. No amount of Netflix binge watching, nor ice cream eating, nor crying to one's dog* really takes the sting out of feeling rejected. Yet, as scientific researchers, we have to deal with an almost constant stream of rejection - there's never enough grant money or journal space to go around. 
 
Which brings us to today's topic. All six of us were recently rejected** from the Perspectives on Psychological Science special issue featuring commentaries on scientific eminence. The new call for submissions was a follow-up to an earlier symposium entitled "Am I Famous Yet?", which featured commentaries on fame and merit in psychological research from seven eminent white men and Alice Eagly.*** The new call was issued in response to a chorus of nasty women and other dissidents who insisted that their viewpoints hadn't been represented by the scholars in the original special issue. The new call explicitly invited these "diverse perspectives" to speak up (in 1,500 words or less****).  
 
Each of the six of us independently rose to the challenge and submitted comments. None of us were particularly surprised to receive rejections - after all, getting rejected is just about the most ordinary thing that can happen to a practicing researcher. Word started to spread among the rejected, however, and we quickly discovered that many of the themes we had written about were shared across our pieces. That judgments of eminence were biased along predictable socio-demographic lines. That overemphasis on eminence creates perverse incentives. That a focus on communal goals and working in teams was woefully absent from judgments of eminence.***** 
 
Hm. It appeared to us that some perspectives were potentially being systemically excluded from Perspectives!****** Wouldn't it be a shame if the new call for submissions yielded yet more published pieces that simply reinforced the original papers? What would be the point of asking for more viewpoints at all? 
 
Luckily it's 2017. We don't have to publish in Perspectives for you to hear our voices. We hope you enjoy our preprints, and we look forward to discussing and improving******* this work.
 
The manuscripts:
 
 
-Katie Corker
on behalf of the authors (Katie Corker, Fernanda Ferreira, Åse Innes-Ker, Cindy Pickett, Lani Shiota, & Simine Vazire)
 
 
* Kirby has had enough of your shit: [photo of Kirby]



** Technically, some of the six of us got R&Rs, but the revisions requested were so dramatic that we have a hard time imagining being able to make them without compromising the main themes of our original pieces. 
 
*** It shouldn't come as a surprise that Eagly's commentary concerned issues relating to gender and power. It was also the only co-authored piece (David Miller was also an author) in the bunch. Eagly and Miller's title was "Scientific eminence:  Where are all the women?" Hi Alice and David - we're over here!
 
**** Doesn't appear that the original special issue had such a tight word limit, but who are we to judge?
 
***** We were also disheartened to notice that, ironically, many of the themes we raised in our pieces surfaced in the treatment we received from the editor and reviewers. For instance, we were told to provide evidence for claims like "women in psychological science may face discrimination." One reviewer even claimed that white men are actually at a disadvantage when it comes to receiving awards in our field. We collectively wondered why we, the "diverse voices," were seemingly being held to a higher standard of evidence than the pieces in the original symposium. Color us surprised that as members of a group stereotyped as less competent, and as outsiders to the eminence club, we had to work impossibly hard to be seen as competent (see Biernat & Kobrynowicz, 1997, among many others).
 
****** On the other hand, it's entirely possible that there were 20 such submissions, and the six of us represent the weakest of the bunch. Hard to get over that imposter syndrome...
 
******* We've chosen to post the unedited submissions so that you can see them in their original form. Of course, the reviewers did raise some good points, and we anticipate revising these papers in the future to address some of the issues they raised.
      
 

looking under the hood

[DISCLAIMER: The opinions expressed in my posts are personal opinions, and they do not reflect the editorial policy of Social Psychological and Personality Science or its sponsoring associations, which are responsible for setting editorial policy for the journal.] 

[image: lipstick on a hippo]
 

before modern regulations, used car dealers didn't have to be transparent.  they could take any lemon, pretend it was a solid car, and fleece their customers.  this is how used car dealers became the butt of many jokes.
 
scientists are in danger of meeting the same fate.*  the scientific market is unregulated, which means that scientists can wrap their shaky findings in nice packaging and fool many, including themselves.  in a paper that just came out in Collabra: Psychology,** i describe how lessons from the used car market can save us.  this blog post is the story of how i came up with this idea.
 
last summer, i read Smaldino and McElreath's great paper on "The natural selection of bad science."  i agreed with almost everything in there, but there was one thing about it that rattled me.  their argument rests on the assumption that journals do a bad job of selecting for rigorous science. they write "An incentive structure that rewards publication quantity will, in the absence of countervailing forces, select for methods that produce the greatest number of publishable results." (p. 13).  that's obviously true, but it's not necessarily bad.  what makes it bad is that "publishable result" is not the same thing as "solid study" - if only high quality studies were publishable, then this wouldn't be a problem. 
 
so this selection pressure that Smaldino and McElreath describe is only problematic to the extent that "publishable result" fails to track "good science."  i agree with them that, currently, journals*** don't do a great job of selecting for good science, and so we're stuck in a terrible incentive structure.  but it makes me sad that they seem to have given up on journals actually doing their job. i'm more optimistic that this can be fixed, and i spend a lot of time thinking about how we can fix it.****
 
a few weeks later, i read a well-known economics article by Akerlof, "The market for "lemons": Quality uncertainty and the market mechanism" (he later won a nobel prize for this work).  in this article, Akerlof employs the used car market to illustrate how a lack of transparency (which he calls "information asymmetry") destroys markets.  when a seller knows a lot more about a product than buyers do, there is little incentive for the seller to sell good products, because she can pass off shoddy products as good ones, and buyers can't tell the difference.  the buyer eventually figures out that he can't tell the difference between good and bad products ("quality uncertainty"), but that the average product is shoddy (because the cars fall apart soon after they're sold). therefore, buyers come to lose trust in the entire market, refuse to buy any products, and the market falls apart.
 
it dawned on me that journals are currently in much the same position as buyers of used cars.  the sellers are the authors, and the information asymmetry is all the stuff the author knows that the editor and reviewers do not: what the authors predicted ahead of time, how they collected their data, what the raw data look like, what modifications to the data or analyses were made along the way (e.g., data exclusions, transformations) and why, what analyses or studies were conducted but not reported, etc.  without all of this information, reviewers can only evaluate the final polished product, which is similar to a car buyer evaluating a used car based only on its outward appearance. 
 
because manuscripts are evaluated based on superficial characteristics, we find ourselves in the terrible situation described by Smaldino and McElreath: there is little incentive for authors to do rigorous work when their products are only being evaluated on the shine of their exterior.  you can put lipstick on a hippo, and the editor/reviewers won't know the difference. worst of all, you won't necessarily know the difference, because you're so motivated to believe that it's a beautiful hippo (i.e., real effect).
 
that's one difference between researchers and slimy used car dealers***** -- authors of scientific findings probably believe they are producing high-quality products even when they're not.  journals keep buying them, and until recent replication efforts, the findings were never really put to the test (at least not publicly).  
 
the replicability crisis is the realization that we have been buying into findings that may not be as solid as they looked. it's not that authors were fleecing us, it's that we were all pouring lots of time, money, and effort into products that, unbeknownst to us, were often not as pretty as they seemed.
 
the cycle we've been stuck in, and the one described by Smaldino and McElreath, is the same one Akerlof explained with the used car market.  happily, that means Akerlof's paper also points to the solution: transparency.  journals have the power to change the incentive structure.  all they need to do is reduce the information asymmetry between authors and reviewers by requiring more transparency on the part of authors.  give the reviewers and editors (and, ideally, all readers) the information they need to accurately evaluate the quality of the science.  if publication decisions are strongly linked to the quality of the science, this will provide incentives for authors to do more rigorous work.
 
we would laugh at a buyer who buys a used car without looking under the hood.  yet this is what journals (and readers) are often doing in science.  likewise, we would laugh at a car dealer who doesn't want us to look under the hood and instead says "trust me," but we tolerate the same behavior in scientists.
 
but, you might say, scientists are more trustworthy than used car dealers!  sure,****** but we are also supposed to be more committed to transparency.  indeed, transparency is a hallmark of science - it's basically what makes science different from other ways of knowing (e.g., authority, intuition, etc.).  in other words, it's what makes us better than used car dealers.  
 
 
* if you've listened to paula poundstone on NPR, you might think it's already too late
 
** full disclosure: i am a senior editor at Collabra: Psychology.  i tried to publish this paper elsewhere but they didn't want it.  i do worry about the conflict inherent in publishing in a journal i'm an editor at, and i have been trying to avoid it.  my rationale for making an exception here is that publishing in Collabra: Psychology is not yet so career-boosting that i feel i am getting a major kickback, but perhaps i am wrong.  also, if it helps, you can see the reviews and action letter from a previous rejection, which i submitted to Collabra for streamlined review. [pdf]
 
*** the other journals.
 
**** sometimes i lie awake in bed for hours thinking about this.  well, doing that or reading the shitgibbon's tweets and planning my next act of resistance.
 
***** sorry used car dealers.  i have little reason to smear you. well, except for the time you didn't want to let me test drive a car then called me a "bold lady" when i insisted that i could be trusted with a manual transmission. (i was 28 years old.) (and i have always driven a stick shift.) (because i am french.) 
 
****** not actually sure.  
 
[image: underwater hippos]
 

power grab: why i'm not that worried about false negatives

[DISCLAIMER: The opinions expressed in my posts are personal opinions, and they do not reflect the editorial policy of Social Psychological and Personality Science or its sponsoring associations, which are responsible for setting editorial policy for the journal.] 



i've been bitching and moaning for a long time about the low statistical power of psych studies.  i've been wrong.

our studies would be underpowered, if we actually followed the rules of Null Hypothesis Significance Testing (but kept our sample sizes as small as they are).  but the way we actually do research, our effective statistical power is very high - much higher than our small sample sizes should allow.  

let's start at the beginning.

background (skip this if you know NHST)

Null Hypothesis Significance Testing (over)simplified:

                         null is true           null is false
p > .05 (don't reject)   correct rejection      false negative (miss)
p < .05 (reject)         false positive         hit (i.e., statistical power)

in this table, power is the probability of ending up in the bottom right cell if we are in the right column (i.e., the probability of rejecting the null hypothesis if the null is false).  in Null Hypothesis Significance Testing (NHST), we don't know which column we're in, we only know which row we end up in.  if we get a result with p < .05, we are in the bottom row (and we can publish!*  yay!).  if we end up with a result with p > .05, we end up in the top row (null result, hard to publish, boo).  within each column, the probability of ending up in each of the two cells (top row, bottom row) adds up to 100%.  so, when we are in the left column (i.e., when the null is actually true, unbeknownst to us), the probability of getting a false positive (typically assumed to be 5%, if we use p < .05 as our threshold for statistical significance) plus the probability of a correct rejection (95%) add up to 100%.  and, when we are in the right column (i.e., when the null is false, also unbeknownst to - but hoped for by - us), the probability of a false negative (ideally at or below 20%) plus the probability of a hit (i.e., statistical power; 80%) add up to 100%.

side note: even if the false positive rate actually is 5% when the null is true, it does not follow that only 5% of significant findings are false positives.  5% is the proportion of findings in the left column that are in the bottom left cell.  what we really want to know is the proportion of results in the bottom row that are in the bottom left cell (i.e., the proportion of false positives among all significant results).  this is the complement of the Positive Predictive Value (PPV - the proportion of significant results that are true positives), and it would likely correspond closely to the rate of false positives in the published literature (since the published literature consists almost entirely of significant key findings).  but we don't know what it is, and it could be much higher than 5%, even if the false positive rate in the left column really was 5%.**
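
to make that concrete, here's a tiny back-of-the-envelope sketch.  alpha and power are the textbook values from the table above; the base rates of true hypotheses are made up, because that base rate is exactly the thing we don't know.

```python
# how many significant results are false positives, as a function of the base rate of
# true hypotheses.  the base rates here are illustrative assumptions, not estimates.
def false_positives_among_significant(base_rate_true, alpha=0.05, power=0.80):
    hits = base_rate_true * power                # bottom right cell: null false, p < .05
    false_alarms = (1 - base_rate_true) * alpha  # bottom left cell: null true, p < .05
    return false_alarms / (hits + false_alarms)  # i.e., 1 - PPV

for base_rate in (0.5, 0.25, 0.10):
    share = false_positives_among_significant(base_rate)
    print(f"P(null is false) = {base_rate:.2f}: {share:.0%} of significant results are false positives")
```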

back to the main point.

we have small sample sizes in social/personality psychology.  small sample sizes often lead to low power, at least with the effect sizes (and between-subjects designs) we're typically dealing with in social and personality psychology.  therefore, like many others, i have been beating the drum for larger sample sizes.

not background

our samples are too small, but despite our small samples, we have been operating with very high effective power.  because we've been taking shortcuts.

the guidelines about power (and about false positives and false negatives) only apply when we follow the rules of NHST. we do not follow the rules of NHST.  following the rules of NHST (and thus being able to interpret p-values the way we would like to interpret them, the way we teach undergrads to interpret them) would require making a prediction and pre-registering a key test of that prediction, and only interpreting the p-value associated with that key test (and treating everything else as preliminary, exploratory findings that need to be followed up on). 

since we violate the rules of NHST quite often, by HARKing (Hypothesizing After Results are Known), p-hacking, and not pre-registering, we do not actually have a false positive error rate of 5% when the null is true.  that's not new - that's the crux of the replicability crisis.  but there's another side of that coin.  

the point of p-hacking is to get into the bottom row of the NHST table - we cherry-pick analyses so that we end up with significant results (or we interpret all significant results as robust, even when we should not because we didn't predict them).  in other words, we maximize our chances of ending up in the bottom row.  this means that, when we're in the left column (i.e., when the null is true), we inflate our chances of getting a false positive to something quite a bit higher than 5%.  

but it also means that, when we're in the right column (i.e., when the null hypothesis is false), we increase our chances of a hit well beyond what our sample size should buy us.  that is, we increase our power. but it's a bald-faced power grab.  we didn't earn that power.

that sounds like a good thing, and it has its perks for sure.  for one thing, we end up with far fewer false negatives. indeed, it's one of the main reasons i'm not worried about false negatives.  even if we start with 50% power (i.e., if we have 50% chance of a hit when the null is false, if we follow the rules of NHST), and then we bend the rules a bit (give ourselves some wiggle room to adjust our analyses based on what we see in the data), we could easily be operating with 80% effective power (i haven't done the simulations but i'm sure one of you will***).  
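
to give a flavor of what such a simulation could look like, here's a rough sketch: a single round of optional stopping when the effect is real.  the effect size, sample sizes, and stopping rule are all illustrative choices, not a claim about how anyone actually p-hacks.

```python
# one unplanned "peek": if the planned-n test isn't significant, add more participants
# and test again.  d = .5, the n's, and the single peek are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d, n_start, n_add, n_sims = 0.5, 50, 50, 10_000
planned_hits, peeked_hits = 0, 0

for _ in range(n_sims):
    a = rng.normal(d, 1, n_start)                # "treatment" group, true effect present
    b = rng.normal(0, 1, n_start)                # control group
    if stats.ttest_ind(a, b).pvalue < .05:
        planned_hits += 1
        peeked_hits += 1
        continue
    # not significant, so "bend the rules a bit": collect more data and test again
    a = np.concatenate([a, rng.normal(d, 1, n_add)])
    b = np.concatenate([b, rng.normal(0, 1, n_add)])
    if stats.ttest_ind(a, b).pvalue < .05:
        peeked_hits += 1

print(f"power if we stop at the planned n:      {planned_hits / n_sims:.2f}")
print(f"effective 'power' with one sneaky peek: {peeked_hits / n_sims:.2f}")
```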

what's the downside?  well, all the false positives.  p-hacking is safe as long as our predictions are correct (i.e., as long as the null is false, and we're in the right column).  then we're just increasing our power.  but if we already know that our predictions are correct, we don't need science.  if we aren't putting our theories to a strong test - giving ourselves a serious chance of ending up with a true null effect - then why bother collecting data?  why not just decide truth based on the strength of our theory and reasoning?

to be a science, we have to take seriously the possibility that the null is true - that we're wrong.  and when we do that, pushing things that would otherwise end up in the top row into the bottom row becomes much riskier.  if we can make many null effects look like significant results, our PPV (and rate of false positives in the published literature) gets all out of whack. a significant p-value no longer means much.

nevertheless, all of us who have been saying that our studies are underpowered were wrong.  or at least we were imprecise.  our studies would be underpowered if we were not p-hacking, if we pre-registered,**** and if we only interpreted p-values for planned analyses. but if we're allowed to do what we've always done, our power is actually quite high.  and so is our false positive rate.

also

other reasons i'm not that worried about false negatives:

  • they typically don't become established fact, as false positives are wont to do, because null results are hard to publish as key findings.  if they aren't published, they are unlikely to deter others from pursuing the same question.
  • when they are published as side results, they are less likely to become established fact because, well, they're not the key results.
  • if they do make it into the literature as established fact, a contradictory (i.e., significant) result would probably be relatively easy to publish because it would be a) counter-intuitive, and b) significant (unlike results contradicting false positives, which may be seen as counter-intuitive but would still be subject to the bias against null results).

in short, while i agree with Fiedler, Kutzner, & Krueger (2012)***** that "The truncation of research on a valid hypothesis is more damaging [...] than the replication of research on a wrong hypothesis", i don't think many lines of research get irreversibly truncated by false negatives.  first, because the lab that was testing the valid hypothesis is likely motivated to find a significant result, and has many tools at its disposal to get there (e.g., p-hacking), even if the original p-value is not significant.  second, because even if that lab concludes there is no effect, that conclusion is unlikely to spread widely.

so, next time someone tells you your study is underpowered, be flattered.  they're assuming you don't want to p-hack, or take shortcuts, that you want to earn your power the hard way. no help from the russians.*******

* good luck with that.

** it's not.

*** or, you know, use your github to write up a paper on it with Rmarkdown which you'll put in your jupyter notebook before you make a shinyapp with the figshare connected to the databrary. 

**** another reason to love pre-registration:  if we all engaged in thorough pre-registration, we could stop all the yapping and get to the bottom of this replicability thing. rigorous pre-registration will force us to face the truth about the existence and magnitude of our effects, whatever it may be.  can we reliably get our effects with our typical sample sizes if we remove the p-hacking shortcut?  let's stop arguing****** and find out!

***** Fiedler et al. also discuss "theoretical false negatives", which i won't get into here.  this post is concerned only with statistical false negatives.  in my view, what Fiedler et al. call "theoretical false negatives" are so different from statistical false negatives that they deserve an entirely different label.

****** ok, let's not completely stop arguing - what will the psychMAP moderators do, take up knitting?

******* too soon?


[image: psychMAP moderators, when they're not moderating]
 

 

 

 
 
   