Shadows of wisdom: Classifying meta-cognitive and morally grounded narrative content via large language models.
Computational social science
Content analysis
NLP
Social science methods
Wisdom
Journal
Behavior research methods
ISSN: 1554-3528
Abbreviated title: Behav Res Methods
Country: United States
NLM ID: 101244316
Publication information
Publication date: 29 May 2024
History:
accepted: 14 May 2024
medline: 30 May 2024
pubmed: 30 May 2024
entrez: 29 May 2024
Status: ahead of print
Abstract
We investigated the efficacy of large language models (LLMs) in classifying complex psychological constructs, such as intellectual humility, perspective-taking, open-mindedness, and search for a compromise, in narratives from 347 Canadian and American adults reflecting on a workplace conflict. Using state-of-the-art models such as GPT-4 in few-shot and zero-shot paradigms, as well as RoB-ELoC (RoBERTa-fine-tuned-on-Emotion-with-Logistic-Regression-Classifier), we compared their performance with that of expert human coders. Results showed robust classification by LLMs, with over 80% agreement, F1 scores above 0.85, and high human-model reliability (median Cohen's κ across top models = .80). RoB-ELoC and few-shot GPT-4 were the standout classifiers, although they were somewhat less effective in categorizing intellectual humility. We offer example workflows for easy integration into research. Our proof-of-concept findings indicate the viability of both open-source and commercial LLMs for automating the coding of complex constructs, potentially transforming social science research.
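The three agreement statistics mentioned in the abstract (percent agreement, F1, and Cohen's κ) can be illustrated with a minimal sketch, assuming binary codes (1 = construct present, 0 = absent) assigned by one human coder and one model to the same narratives. The function names and the toy labels below are hypothetical; they are not the study's data, code, or published workflow.

```python
from collections import Counter


def percent_agreement(a, b):
    """Proportion of items to which two coders assign the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    p_o = percent_agreement(a, b)
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from each coder's marginal label proportions
    p_e = sum((counts_a[k] / n) * (counts_b[k] / n) for k in set(a) | set(b))
    if p_e == 1:
        # Both coders used one identical label throughout; kappa is undefined
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


def f1_score(reference, predicted, positive=1):
    """F1 for the positive class, treating `reference` (e.g., human codes) as ground truth."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must be of equal length")
    pairs = list(zip(reference, predicted))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy codes for ten narratives (1 = construct present, 0 = absent)
human = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
model = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]

print(round(percent_agreement(human, model), 2))  # 0.8
print(round(cohens_kappa(human, model), 2))       # 0.58
print(round(f1_score(human, model), 2))           # 0.83
```

In this toy example, 80% raw agreement corresponds to a κ of only about .58, because both coders mark 60% of narratives as positive, which pushes expected chance agreement to .52. This is one reason κ is usually reported alongside percent agreement rather than in place of it.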
Identifiers
pubmed: 38811519
doi: 10.3758/s13428-024-02441-0
pii: 10.3758/s13428-024-02441-0
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Grants
Agency: Social Sciences and Humanities Research Council of Canada
ID: 435-2014-0685
Agency: John Templeton Foundation
ID: 62260
Copyright information
© 2024. The Psychonomic Society, Inc.