Overall, I find this paper to be an important and useful step forward in NLP modeling of psychological items and scales. As always, there is some room for improvement on the margins, but I think it's a great sign that I am eagerly awaiting the results of this Registered Report. One note on the feedback below: my comments are not ordered by priority; they follow the ordering of the Registered Report instead.

-----

Introduction: I think the first 2-3 pages try to cover too much, especially the 2nd page (roughly paragraphs 5-8). As a reader, I felt that it took a lot of effort to get through, though I was also aware that the authors were trying to contextually frame NLP/AI modeling for psychologists who may be encountering it for the first time (not easy to do!). To be clear, the authors did a nice job with the text as given -- I view it as accurate and reasonably concise (though it is unavoidably jargony). But I think much or all of this could be "off-loaded" by directing readers to prior work using citations the authors already have in place. This technique is done well later when explaining bi-directional encoding. In fact, I think it would (almost) be okay to simply cut the first ~1000 words, beginning the paper with the last paragraph of the 2nd page ("Wulff & Mata (2023) used...") and then expanding that paragraph extensively. In other words, I'd like to hear more about negation (a really big issue!), why scale aggregation is better, why ada-002 should not be seen as the gold standard, why training a model for the task at hand is a big deal (it is!), etc. As for the consequences of cutting all the background information, I think this is probably okay. Given the typical timing of registered reports, many readers will be caught up on these techniques (and their relevance to questions of structure) by the time the paper comes out. For those who are not yet familiar, something other than a technical empirical paper like this would probably be needed (e.g., a "Current Directions" article).

Personally, I would not call these synthetic correlations. For me, "synthetic correlation" already has a meaning (though I acknowledge the term is not widely used, especially in social science): with sufficient co-occurrences between two variables, it's possible to derive plain-old, empirically observed correlations; with zero or too few co-occurrences, it is sometimes possible to derive *synthetic* correlations from other known parameters (e.g., by estimating distributions based on joint co-occurrences with other variables). I don't think this meaning matches the coefficients called 'synthetic' in the paper, but I may be mistaken. Either way, I think it's more important and informative to describe these correlations as *semantic*, as this would help emphasize the difference from the typical (ratings-based) correlations used in psychological research. I realize this is a major change to a small thing because the term 'synthetic' appears in many places, but I really think it's worth considering. They are semantic (which is important to highlight), and I don't think they're synthetic (but maybe... if so, explain how?).
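To make concrete what I mean by semantic: the coefficients at issue are derived from the item text alone (e.g., something like a cosine similarity between item embeddings) rather than from anyone's ratings. The sketch below is only meant to illustrate that idea; it assumes the sentence-transformers library and an off-the-shelf encoder as placeholders, not the authors' actual pipeline.

    # A minimal sketch of a "semantic" correlation: a coefficient derived
    # from the item text alone rather than from ratings of the items.
    # Assumes the sentence-transformers library; the encoder name is a
    # placeholder, not the authors' actual pipeline.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

    items = [
        "I am the life of the party.",  # public-domain IPIP items
        "I don't talk a lot.",
    ]
    embeddings = model.encode(items)  # one vector per item text
    semantic_r = util.cos_sim(embeddings[0], embeddings[1]).item()
    print(semantic_r)

Calling these coefficients semantic also keeps the negation problem front and center: reversed item pairs like the one above are exactly where language-only similarity and ratings-based correlations can come apart.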
Stage 1: Polarity Calibration. The description is wonderful! But I was left with many questions and wanting more. "The fine-tuned model was then trained to predict this new criterion, which combined the magnitude and direction of the similarity." Can you say more about how this was done and, more importantly, how it went? I thought Figure 2 might address the latter question to some extent, but the caption was largely inadequate for making sense of the images, in my opinion. This was particularly true on my first pass through the manuscript, as I did not yet know where the empirical data came from (that is introduced in the next section). I don't think a lot of detail is needed about fine-tuning, but a few more sentences would help; this will be completely new for most readers (even many who know a bit about AI).

Stage 2: Again, the text provided is very good (i.e., the description of data sources and partitioning), but some further description of how the authors "fine-tuned the model to focus on text segments that convey psychologically relevant information" is missing. I also think some readers may find this interpretation a bit of a stretch, in that it seems to assume the semantic correlations should be roughly equivalent to the rating data. (I actually suspect this may prove to be the case, but I also think it should remain an open question for a while; certainly, one can argue that they aren't the same, that psychology is not just language.) At the end of this section, the statement that the focus was "on enhancing the model's ability in predicting item correlations within the test partition" seems more straightforward and accurate.

Why "mnemosyne"? It's a cool name, but it may warrant some description if the hope is that it will catch on or be used beyond the current project. If that's not the plan, it may be easier to give the model a more informative name (boring is sometimes functional).

In the pilot study, the use of random subsets of items to obtain unbiased reliability estimates seemed strange to me, and I think it should probably be dropped. The analysis of reliability estimates for published scales is useful and interesting, but randomly subsetting items and then applying exclusion criteria to estimate de novo reliabilities does not, in my opinion, provide information beyond what is already shown in the full item-level correlation results. If anything, I find it a bit misleading.
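To illustrate why I see the subsetting exercise as redundant: when reliability is computed from a correlation matrix, standardized alpha is a direct function of the number of items and their average inter-item correlation, so the reliability of any random subset is already fully determined by the item-level correlations the authors report. A quick sketch (the correlation matrix here is made up purely for illustration):

    # Standardized alpha is fully determined by the item intercorrelations,
    # so reliabilities computed for random item subsets add nothing beyond
    # the item-level correlation results themselves.
    # The correlation matrix below is invented purely for illustration.
    import numpy as np

    def standardized_alpha(r):
        """Standardized Cronbach's alpha from an item correlation matrix."""
        k = r.shape[0]
        r_bar = r[~np.eye(k, dtype=bool)].mean()  # average inter-item correlation
        return k * r_bar / (1 + (k - 1) * r_bar)  # Spearman-Brown form

    rng = np.random.default_rng(0)
    r = np.full((6, 6), 0.35) + rng.normal(0.0, 0.03, (6, 6))  # toy 6-item matrix
    r = (r + r.T) / 2  # symmetrize the noise
    np.fill_diagonal(r, 1.0)

    print(standardized_alpha(r))  # full item set
    subset = rng.choice(6, size=4, replace=False)  # a random 4-item subset
    print(standardized_alpha(r[np.ix_(subset, subset)]))  # same correlations, re-aggregated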
For the scale analyses, I think it would be useful to include a detailed analysis of cases with very high and very low r's (especially low). This might provide some ideas about which topics are not explained by semantics alone. These results seem especially important for the pre-registered work to be done next.

In the Measures section, I think it's imperative to mention an inclusion criterion relating to intellectual property. I would recommend making this explicit and omitting any content that has not been clearly licensed or placed in the public domain. This is actually a major issue, in my opinion, and it's much better to leave out any measures with unclear IP status. If APA PsycTests has done this for you (I think they have, right?), that's great; it will make it easy to add a statement to the inclusion criteria that puts readers' concerns to rest. Of course, not everything that's publicly available for various uses is public domain. In my experience, this can get quite complicated!

Also in Measures: I agree that a uniform response format is best, but I would give slight preference to a 6-point format. I am aware that the prior work here suggests that 5 vs. 6 is not a critical point, but I have some good reasons. For one, psychometrically, there is evidence that more than 5 options is better than 5 or fewer (6 seems to be the threshold at which most things do fine when treated as continuous), and that an even number of options is slightly better than, or equivalent to, an odd number (probably better, because there is no neutral option). Second, and more important, I am familiar with a very large dataset that includes many of these measures and uses a 6-point response scale, and it would be nice if the datasets could someday be directly compared (even descriptively).

The note for Table 1 has an error: "Because increasing the number of scales is costlier than number of scales..."

Finally, an overall comment: I have some confusion about the structure of this Registered Report. Every outlet is different, but my own experiences with this submission format have still made use of the familiar Intro/Method/Results/Discussion structure. This manuscript departs from that structure quite a lot, which is manageable but a bit confusing. My point here is simply to suggest that the authors take care to order the full manuscript in a way that's easy for readers to digest.