Strong and weak alignment of large language models with human values.

Keywords: Alignment; Artificial intelligence; Human values; Natural language processing; Philosophy of AI; Semantics

Journal

Scientific Reports
ISSN: 2045-2322
Abbreviated title: Sci Rep
Country: England
NLM ID: 101563288

Publication information

Publication date:
2024-08-21
History:
received: 2024-06-19
accepted: 2024-08-12
medline: 2024-08-22
pubmed: 2024-08-22
entrez: 2024-08-21
Status: epublish

Abstract

Minimizing the negative impacts of artificial intelligence (AI) systems on human societies without human supervision requires them to be able to align with human values. However, most current work addresses this issue only from a technical point of view, e.g., by improving current methods that rely on reinforcement learning from human feedback, and neglects what alignment means and what is required for it to occur. Here, we propose to distinguish strong and weak value alignment. Strong alignment requires cognitive abilities (either human-like or different from humans) such as understanding and reasoning about agents' intentions and their ability to causally produce desired effects. We argue that this is required for AI systems like large language models (LLMs) to be able to recognize situations presenting a risk that human values may be flouted. To illustrate this distinction, we present a series of prompts showing ChatGPT's, Gemini's and Copilot's failures to recognize some of these situations. We moreover analyze word embeddings to show that the nearest neighbors of some human values in LLMs differ from humans' semantic representations. We then propose a new thought experiment that we call "the Chinese room with a word transition dictionary", as an extension of John Searle's famous proposal. We finally mention current promising research directions towards weak alignment, which could produce statistically satisfying answers in a number of common situations, though so far without ensuring any truth value.

Identifiers

pubmed: 39169090
doi: 10.1038/s41598-024-70031-3
pii: 10.1038/s41598-024-70031-3

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

19399

Grants

Agency: European Commission (EC)
ID: 101071178
Agency: European Commission (EC)
ID: 101070381

Copyright information

© 2024. The Author(s).

Authors

Mehdi Khamassi (M)

Institute of Intelligent Systems and Robotics, Sorbonne University/CNRS, 75005, Paris, France. mehdi.khamassi@sorbonne-universite.fr.

Marceau Nahon (M)

Institute of Intelligent Systems and Robotics, Sorbonne University/CNRS, 75005, Paris, France. nahon@isir.upmc.fr.

Raja Chatila (R)

Institute of Intelligent Systems and Robotics, Sorbonne University/CNRS, 75005, Paris, France. raja.chatila@sorbonne-universite.fr.
