How much practice is needed before daily actions are performed in a way that feels habitual?

Based on reviews by Benjamin Gardner, Wendy Wood and Adam Takacs
A recommendation of:

How long does it take to form a habit?: A Multi-Centre Replication


Submission: posted 26 May 2022
Recommendation: posted 17 January 2023, validated 17 January 2023
Cite this recommendation as:
Dienes, Z. (2023) How much practice is needed before daily actions are performed in a way that feels habitual? Peer Community in Registered Reports.


Even small changes in daily life can have a significant impact on one’s health, for example going to the gym at regular times and eating a healthy breakfast. But how long must we do something before it becomes a habit? Lally et al. (2010) tracked the subjective automaticity of a novel, daily (eating or exercise-related) routine. Based on 39 participants, they found a median time of 66 days. This estimate has never been replicated with their exact procedure, so it remains unclear how well it holds up. Yet the estimate is useful for knowing how long we must effortfully make ourselves perform an action before we come to do it automatically.
In the current study, de Wit et al. (2023) propose a four-centre near-exact replication of Lally et al. (2010), for which they aim to test 800 subjects to provide a precise estimate of the time it takes to form a habit.
The Stage 1 manuscript was evaluated over four rounds of review. Based on detailed responses to the reviewers' comments, the recommender judged that the manuscript met the Stage 1 criteria and therefore awarded in-principle acceptance (IPA).
URL to the preregistered Stage 1 protocol:
Level of bias control achieved: Level 4. At least some of the data/evidence that will be used to answer the research question already exists AND is accessible in principle to the authors (e.g. residing in a public database or with a colleague), BUT the authors certify that they have not yet accessed any part of that data/evidence.
List of eligible PCI RR-friendly journals:
1. Lally, P., van Jaarsveld, C. H. M., Potts, H. W. W., & Wardle, J. (2010). How are habits formed: Modelling habit formation in the real world. European Journal of Social Psychology, 40, 998–1009.
2. de Wit, S., Bieleke, M., Fletcher, P. C., Horstmann, A., Schüler, J., Brinkhof, L. P., Gunschera, L. J., & Murre, J. M. J. (2023). How long does it take to form a habit?: A Multi-Centre Replication, in principle acceptance of Version 4 by Peer Community in Registered Reports.
Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

Evaluation round #4

DOI or URL of the report:

Version of the report: v3

Author's Reply, 12 Jan 2023

Decision by Zoltan Dienes, posted 03 Dec 2022, validated 03 Dec 2022

We are almost there, but there is an inferential problem in your Design Table. For H2 (and similarly H3) you say "Therefore, not obtaining a significant effect will not be treated as evidence against the theory", but this does not follow from your claim to have determined a minimally theoretically interesting effect. The point arises because you have not determined a minimally interesting effect, but rather an effect you happened to be powered to detect (which does seem very small, so it seems quite relevant to inference). The way to put it, if you are basing inferences on power, is that you will report the minimal effect you are powered to detect without claiming evidence against the theory as a whole, precisely because you have not determined a theoretically relevant minimal effect. The next column switches to a different inference procedure, what I call "inference by intervals". Namely, you say "Conversely, if the lower bounds of confidence intervals are below the minimal interesting effect, then this indicates that performing the behaviour consistently is irrelevant for automatization." If you did have a minimally interesting effect, it is only when the upper bound of the CI fell below it that you could conclude the effect is no larger than the minimally interesting one; if merely the lower bound is below it, your CI may span interesting and uninteresting effects, so one would withhold judgment. But as you don't have a theoretically justified minimally interesting effect, this does not quite work anyway. If you stick to power, you can simply say that a non-significant result does not count against the theory as a whole, but that you were powered to pick up effects larger than X units difference.
(Incidentally, I prefer inference by intervals, as you have in the interpretation column, rather than simply relying on power, because it makes use of the data you actually obtained: the CI tells you what effects you are allowed to rule out in the light of the data, no matter what you were originally powered to detect. But then inference hinges on the minimally interesting effect size, so for conclusions to relate to theory, that minimal effect had better depend on theory rather than on theory-irrelevant issues such as a researcher's resources.)
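The "inference by intervals" logic described above can be written as a simple decision rule. The sketch below is illustrative only (the function name, numbers and units are placeholders, not part of the authors' pipeline):

```python
def interval_inference(ci_lower, ci_upper, min_interesting):
    """Classify a confidence interval against a minimally interesting effect.

    Only when the *upper* bound falls below the minimally interesting effect
    can one conclude the effect is too small to matter; a CI spanning both
    interesting and uninteresting values warrants withholding judgment.
    """
    if ci_lower > min_interesting:
        return "effect is at least minimally interesting"
    if ci_upper < min_interesting:
        return "effect is smaller than any interesting effect"
    # CI spans both interesting and uninteresting values
    return "withhold judgment"

# Placeholder numbers (units: difference in rated automaticity)
print(interval_inference(ci_lower=0.1, ci_upper=0.4, min_interesting=0.5))
# → effect is smaller than any interesting effect
```

Note that a lower bound below the minimally interesting effect (the rule quoted from the Design Table) would trigger the "withhold judgment" branch here whenever the upper bound is still above it, which is exactly the recommender's objection.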

Evaluation round #3

DOI or URL of the report:

Version of the report: v3

Author's Reply, 01 Dec 2022


Dear Zoltan,

Please find the rebuttal letter and the revised manuscript attached. We look forward to hearing from you!

All the best,


Decision by Zoltan Dienes, posted 11 Nov 2022

Thank you for your excellent revision. Some of my points have not yet been fully taken on board; this is not surprising, as RRs often require thinking in ways people are rarely used to.

1) Minor point: p. 4, "even more powerful". Given what has been said so far, the issue is not power, which is about testing hypotheses, but the precision of estimation (i.e. narrower confidence intervals). So it would be better to say your study will have more precision.

2) The main issues concern control of multiple testing, but especially how one draws inferences without justifying what minimal effect size is of scientific relevance, for rows 2 and 3 of the Design Table.

But first, row 1. The final column of the first row is no longer accurate; you will not be testing Hull's theory.

In terms of "interpretation" in the first row, you can leave it as it is. But whether the best estimate is really exactly 66 is not really the point. I would think that if the overall confidence interval included any values in, say, the range 55 to 75, the original estimate was really pretty good. (Imagine you ran a million subjects and no CIs included 66, but they were all tightly formed around 61 days. One would still conclude that the original study had done a fine job of estimating the time taken.) So you might want to rephrase along these lines.

Second row. Three tests are suggested here. Will you conclude that consistent performance is important if any one of them is significant? If so, you should use a Bonferroni or other familywise error correction. If not, specify how you will draw conclusions: will you infer consistency is important only if all three tests are significant, for example? You also do not justify a minimally interesting effect size, i.e. give reasons why some minimal effect is relevant to this scientific problem. Thus, a non-significant result does not count against any theory, and the inference in the final column is incorrect.
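If the conclusion is drawn whenever any one of the three tests is significant, a Bonferroni correction divides alpha across the family. A minimal sketch of the correction (the p-values are made up for illustration):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return which tests survive a Bonferroni familywise correction.

    Each test is evaluated at alpha / m, where m is the family size, so the
    chance of any false positive across the whole family stays <= alpha.
    """
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# Three tests of whether consistent performance matters (placeholder p-values);
# the per-test threshold is 0.05 / 3 ≈ 0.0167
print(bonferroni_significant([0.012, 0.030, 0.20]))
# → [True, False, False]
```

Under this rule the middle test, significant at the uncorrected 0.05 level, no longer supports the conclusion once the family is taken into account.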

Third row. Be clear about which family you're correcting for: are you correcting for the fact that you are looking at 5 DVs, or for the pairwise comparisons within each of those ANOVAs? State how you will correct for each. The same issue raised for row 2 also arises here: you find it hard to give reasons why some effect is of minimal interest. That is not surprising, but it does mean a non-significant result is not support for any conclusion. What if an effect smaller than even the small ones you are powered to detect were still interesting? Then you do not have evidence against any theory.

So one has two choices: either give scientific reasons why an effect size is relevant (in the case of the significance testing you are doing, that means a minimally interesting effect); or else forgo any claims that one can find evidence against an effect existing. In the latter case one could simply estimate relevant mean differences and draw no existential conclusions. So for your second and third rows you could just estimate relevant mean differences and make claims like: given there is an effect, it is between these bounds (i.e. a 95% CI). The CIs would still be corrected for multiple testing. If you go this way, do not slip in any claims of having shown there is an effect or no effect; just stick to saying how big it is.

A third option is to suspend judgment for all non-significant results. This was a position Fisher himself sometimes took. I personally don't like this option because it means there is no way of getting evidence against a theory that predicts an effect. Many RR journals will not accept it either.

Evaluation round #2

DOI or URL of the report:

Version of the report: v3

Author's Reply, 04 Nov 2022


Dear Zoltan,


Thanks for your patience. We have addressed the suggestions and comments in the manuscript and revision letter below, and look forward to hearing from you.


All the best,


Decision by Zoltan Dienes, posted 27 Jul 2022

Dear Lukas

I now have three positive reviews back for your submission, which as a whole are enthusiastic about the planned research. In addition to the other points raised by the reviewers, please also address the following in your revision:

1) Wood raises the issue of whether data have already been collected. You say in your cover letter that they have not, but the verb tense used in other places implies it might be otherwise, e.g. "The detailed study protocol, materials, anonymized raw data, code used in the analyses and output are permanently stored on Open Science Framework (" (which I didn't have access to), and the frequent use of the past tense in the Method. To be clear, assuming data have not been collected, for Stage 1 use the future tense throughout for things that have not happened, including the Method (e.g. "the data will be permanently stored.."). Future tense can then be changed to past tense in all cases when the Stage 2 is submitted. (Of course, if data have been collected, then clarify this as well, and also what precautions are in place for bias control.)

2) In the Results section indicate *exactly* what analyses you will do. The specification should be so clear that there is no analytic flexibility left. Make sure, for example, that anyone could fit the curves exactly the same way you will, so that they could precisely reproduce your results with your data. Be clear what the other "plausible curve" is that you will fit in addition (I know you refer to the original authors, but this information should be in your manuscript if it figures in your analytic pipeline). Be clear exactly what comparisons are done with what error control (more on this below). Another example: "We will also perform multiple regression analyses to determine whether impulsivity, personal need for structure and conscientiousness were related to curve parameters and performance variables." State how many regressions you will perform. Specify the variables for each regression. If you mention analyses in the Results section, also put them in the Design Table and justify power for each analysis and what conclusions hang on each. Alternatively, leave out any analyses you do not want to tie down exactly (or ensure are properly powered) and put them in a non-pre-registered section in the Stage 2. (There are other cases I have not mentioned where the analysis needs to be tied down.)

3) Relatedly, in comparing linear and s-shaped functions you say you will do a sign test on R2s. But you also provide cogent reasons why this is a flawed strategy, and hence say you will also use BIC and AIC to aid the reader. This leaves a lot of inferential wriggle room. A possible outcome is a higher R2 for the s-shaped than the linear function because of the difference in the number of parameters, but with BIC coming down the other way. Has your manipulation check on the form of the function failed or not? Justify one measure as most suitable (I presume it will be BIC, AIC or a related measure, e.g. an informed Bayes factor) and test the form of the function with that measure, stating your criteria for good enough evidence for choosing the s-shape over linear functions (or one s-shape over another).
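To make the model-comparison rule concrete, both candidate functions can be fitted and their BICs compared directly. The sketch below is illustrative only: it uses synthetic data and a generic four-parameter logistic as the s-shaped candidate, not the authors' registered curve or pipeline.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)
days = np.arange(1, 85)
# Synthetic automaticity ratings following a clearly s-shaped trajectory
true = 1 + 5 / (1 + np.exp(-(days - 40) / 8))
y = true + rng.normal(0, 0.3, size=days.size)

def linear(t, a, b):
    return a + b * t

def s_shaped(t, lo, span, mid, scale):
    return lo + span / (1 + np.exp(-(t - mid) / scale))

def bic(y, yhat, k):
    """BIC for a least-squares fit: n*log(RSS/n) + k*log(n); lower is better."""
    n = y.size
    rss = np.sum((y - yhat) ** 2)
    return n * np.log(rss / n) + k * np.log(n)

p_lin, _ = curve_fit(linear, days, y)
p_s, _ = curve_fit(s_shaped, days, y, p0=[1, 5, 40, 8])

bic_lin = bic(y, linear(days, *p_lin), k=2)
bic_s = bic(y, s_shaped(days, *p_s), k=4)
print(f"BIC linear: {bic_lin:.1f}, BIC s-shaped: {bic_s:.1f}")
# The penalty term k*log(n) handles the difference in parameter numbers,
# which is why BIC can come down the other way from raw R2
```

With data this clearly s-shaped, the logistic fit wins despite its extra two parameters; the point of pre-specifying the criterion (and the evidence threshold) is to decide the ambiguous cases without wriggle room.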

4) As per my previous point, it seems to me that providing evidence for your exponential function is more an outcome neutral test than a test of theory. You point out that the original authors were motivated by Hull's theory of habit; and thus you frame the choice of function as testing Hull's theory, by testing the predicted shape of the function defining the increase of habit strength over time. However, strength will be measured on Likert scales, therefore with fixed minimum and maximum values. A linear function is therefore a priori ruled out if testing continues long enough. Something at least approximating an s-shape is guaranteed by the nature of the measurement. Therefore no theory can be at stake depending on the outcome of this test. Rather, obtaining something like an s-shape is a necessary precondition for your study in order to estimate when a habit has formed. 

5) I take it you plan to perform all pairwise comparisons between the 5 data sets (your 4 plus the original) with Kolmogorov-Smirnov tests, which is 10 tests. How will you control the familywise error rate? Given that corrected alpha, determine N to control power to detect a difference you are interested in. How far apart should median numbers of days be before the difference is interesting? As Gardner points out, 66 days is just a rough figure; 59 is in the same ballpark. Is there any way (other than reaching deep into one's soul) of specifying how far away from 66 would start to be interesting? (e.g. ) Once you have justified an interesting difference, use simulations to determine the power of the KS test to detect effects that are just interesting. (And you should note in the paper that the KS test is sensitive to more than location differences, as a proviso on your analysis.)
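A simulation of KS power at the corrected alpha might look like the following. This is a sketch only: the log-normal shape of the days-to-habit distribution, its spread, and the 15-day shift are all placeholder assumptions, not values from the manuscript.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_power(shift_days, n_per_group, alpha, n_sim=1000, seed=0):
    """Estimate by simulation the power of a two-sample KS test to detect a
    location shift between two skewed 'days to habit' distributions."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sim):
        a = rng.lognormal(mean=np.log(66), sigma=0.5, size=n_per_group)
        b = rng.lognormal(mean=np.log(66 + shift_days), sigma=0.5, size=n_per_group)
        if ks_2samp(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sim

# Bonferroni-corrected alpha for 10 pairwise comparisons among 5 data sets
print(ks_power(shift_days=15, n_per_group=200, alpha=0.05 / 10, n_sim=500))
```

Because the KS test is sensitive to any distributional difference, not just a median shift, power estimated this way depends on the whole assumed distribution, which is exactly why the simulated distributions need justifying.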

6) For RQ3, specify an interesting effect size in raw units: what difference in rated automaticity would be just interesting? (When several Likert ratings are combined, I find using an average over the number of ratings, rather than a sum, useful for putting the final number on the same scale as the rating itself, so one has a more intuitive grasp of what one unit is.) Otherwise we just have a "medium effect size" plucked out of the air; and being standardized, it depends on measurement noise and reliability. But presumably what is interesting is the actual difference in automaticity.
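The averaging point can be seen in two lines (the ratings below are made up, and a 1-7 response scale is assumed):

```python
# Averaging rather than summing Likert items keeps the composite on the
# original 1-7 response scale, so "one unit" retains an intuitive meaning.
ratings = [3, 4, 4, 5]                  # one participant's automaticity items
summed = sum(ratings)                   # scale depends on the number of items
averaged = sum(ratings) / len(ratings)  # still on the 1-7 scale
print(summed, averaged)  # → 16 4.0
```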

7) In terms of what defines an interesting effect, Takacs asks "for RQ4, would a non-significant result prove that 'complexity is irrelevant for automatization'?" You can simply qualify the conclusion so that it applies to this particular difference in complexity. (If in addition you could quantify or measure the complexity difference (and I am not requiring that you do), it would help place the conclusion in perspective too.) How will you control the familywise error rate over the number of DVs? Determine interesting differences in raw units, then determine power for those differences, taking into account the corrected alpha. Specify the IV and its levels: are there more than 2, given that you are performing a KW? Will there be post hocs? What conclusions follow from different patterns? There is also inferential flexibility in specifying both ANOVA and KW. Justify one, or provide a decision procedure for choosing between them (one that does not allow wriggle room).




Reviewed by Benjamin Gardner, 26 Jun 2022

This is a note-perfect replication of a seminal study on habit formation (Lally et al., 2010). The report meets all criteria for a Stage 1 replication study: the research questions are scientifically valid; the hypotheses are logical and plausible; the methodology is sound and feasible, and as described, permits replication. I have only two comments.

1. One important methodological difference between the original study and the present study, as the authors openly acknowledge (on p6), is that, whereas the participants in the original study met with the researcher in person in a lab, replication study participants will meet the researcher online via video conferencing. This is important because motivation is needed to initiate and maintain a habit formation attempt before habit solidifies. This difference could feasibly affect results in two ways. First, providing support, advice and/or guidance in person might be inherently more motivating than doing so online. Second, participants who are willing to travel to a lab in central London to participate (as in Lally et al's study) may be inherently more motivated than those who are only required to meet via video conferencing. Do the authors view the difference in meeting format as a problem? If so, how might it affect their results, and to what extent can they mitigate this?

2. Hypothesis 2 focuses on testing whether habit really does peak after 66 days, as Lally et al found. This seems overly restrictive; even if Lally et al's findings are 'true', I very much doubt that a replication of this result would find habit to peak at exactly 66 days. (For example, Keller et al [2021] found a once-daily behaviour to peak in habit strength after 59 days. While not exactly 66 days, this finding intuitively appears in keeping with Lally et al's findings.) Will the authors conclude that Lally et al's findings have not been replicated if the peak habit duration is NOT 66 days? Or is there an acceptable range within which a peak other than 66 days might sit *and* Lally et al's findings be supported?

Benjamin Gardner

University of Surrey, UK

Reviewed by Wendy Wood, 10 Jul 2022

This is an important research project that proposes to replicate an earlier investigation by Lally et al. (2010). The authors are correct that this earlier investigation had very few participants and consequently unstable results, even though it has been cited over 2000 times on Google Scholar. The opportunity to replicate this research, and in addition to assess individual differences, makes this a highly useful piece of research for science and for popular understanding. For these reasons, I believe it should be accepted for publication.

The guidelines for evaluating registered reports we were given are:

1A. The scientific validity of the research question(s): I take this as a question of how important the research is, and I answered this above.

1B. The logic, rationale, and plausibility of the proposed hypotheses: This project will be informative whatever the results. 

1C. The soundness and feasibility of the methodology and analysis pipeline: The research has apparently already been conducted, and the data analysis is already in progress (?). For this reason, I am not going to comment in any detail on the methods and procedures except to note that it would be helpful to add a kind of intention-to-treat analysis that assesses the effects of participant attrition on the conclusions drawn.

1D. Whether the clarity and degree of methodological detail is sufficient: The authors rely heavily on the original project and the protocol given, and so clarity is presumably assured.

1E. Whether the authors have considered sufficient outcome-neutral conditions: In this case, the one major threat to validity is the daily report procedure. The authors do not address this or provide any insight into how they will handle it. But it leaves readers wondering whether the results would be the same if participants were not reminded each day about their behavior by the habit questionnaire.

Although I am very favorable toward acceptance, I wonder why the authors are submitting a registered report Stage 1 for a project with already-collected data (Level 5 or 6?). How far have the authors proceeded with data collection and analysis? I think the project is so worthwhile that it should be attractive at a number of journal outlets. So I'm not clear why this publication format.

Reviewed by Adam Takacs, 05 Jul 2022

Evaluation round #1

DOI or URL of the report:

Author's Reply, 24 Jun 2022


Dear Zoltan,

Thanks for your comments, and my apologies for the delay in getting back to you. We have addressed your remarks by adding power calculations for the third and fourth research questions in the revised study design table. Moreover, we rewrote sections of the analysis and interpretation column to enhance readability.

We look forward to your comments.

All the best,


Decision by Zoltan Dienes, posted 09 Jun 2022, validated 11 Nov 2022

Dear Dr de Wit


Thank you for your clearly written submission "the shape of habits". Before I send it out to review, one point needs to be addressed. Namely, you should demonstrate the sensitivity to answer each question in your Design Table; or, put another way, you should be able to, or at least know whether you can, obtain evidence against any existential claims made. At the moment, N is determined with respect to one issue, namely the precision with which the median number of days is estimated. But you also ask a number of other questions, and it needs to be shown whether you have the power or sensitivity to answer each of them, considered on its own terms. Some journals require a certain power or Bayes factor threshold for any hypothesis testing; and for those that have no set requirement, it should be known whether you did or did not, e.g., have the power to justify asserting H0 (or have the resources to achieve good Bayesian evidence). Some advice is given here for approaching the problem: For each row of your Design Table, show that you will (or will not) have the means to obtain evidence that would show the claim wrong if it were wrong.



