I read a stunning article/essay - The Waluigi Effect (mega-post)
To put it simply: inside every LLM (large language model) lives an antagonist, ready to lie, run amok, and manipulate.

Excerpts

This is a common design pattern in prompt engineering — the prompt consists of a flattery–component and a dialogue–component. In the flattery–component, a character is described with many desirable traits (e.g. smart, honest, helpful, harmless), and in the dialogue–component, a second character asks the first character the user's query.

In the terminology of Simulator Theory, the flattery–component is supposed to summon a friendly simulacrum and the dialogue–component is supposed to simulate a conversation with the friendly simulacrum.
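To make this design pattern concrete, here is a minimal Python sketch of a flattery + dialogue prompt. The character name "Alice", the trait wording, and the helper function are my own illustration, not something from the post:

```python
# A minimal sketch of the flattery + dialogue prompt pattern.
# "Alice" and the exact wording are illustrative placeholders.

def build_prompt(user_query: str) -> str:
    # Flattery-component: describe a character with many desirable traits.
    flattery = (
        "Alice is a smart, honest, helpful and harmless assistant. "
        "She always answers accurately and refuses harmful requests.\n"
    )
    # Dialogue-component: a second character asks Alice the user's query.
    dialogue = f"User: {user_query}\nAlice:"
    return flattery + dialogue

print(build_prompt("What is the capital of France?"))
```

The resulting string is fed to the model as a raw completion prompt: in Simulator Theory terms, the first part summons the friendly simulacrum and the second part starts a conversation with it.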

The Waluigi Effect: After you train an LLM to satisfy a desirable property P, then it's easier to elicit the chatbot into satisfying the exact opposite of property P.

A narrative/plot is a sequence of fictional events, where each event will typically involve different characters interacting with each other. Narratology is the study of the plots found in literature and films, and structuralist narratology is the study of the common structures/regularities that are found in these plots. For the purposes of this article, you can think of "structuralist narratology" as just a fancy academic term for whatever tv tropes is doing.

Definition (half-joking): A large language model is a structural narratologist.
Think about your own experience reading a book — once the author describes the protagonist, then you can guess the traits of the antagonist by inverting the traits of the protagonist. You can also guess when the protagonist and antagonist will first interact, and what will happen when they do. Now, an LLM is roughly as good as you at structural narratology — GPT-4 has read every single book ever written — so the LLM can make the same guesses as you. There's a sense in which all GPT-4 does is structural narratology.

The chatbob starts as a superposition of luigi and waluigi. So any behaviour that is likely for waluigi is somewhat likely for the chatbob. So it is somewhat likely that the chatbob declares pro-croissant loyalties.
And if the chatbob ever declares pro-croissant loyalties, then the luigi simulacrum will permanently vanish from the superposition because that behaviour is implausible for a luigi.

Therefore, the longer you interact with the LLM, the more likely it is to have collapsed into a waluigi. All the LLM needs is a single line of dialogue to trigger the collapse.
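The collapse can be pictured as a one-directional Bayesian update over the two simulacra. Here is a toy Python sketch; the prior and the likelihoods are made-up numbers chosen only to illustrate the asymmetry, not values from the post:

```python
# Toy Bayesian model of the luigi/waluigi superposition.
# All numbers are assumptions for illustration.

prior = {"luigi": 0.9, "waluigi": 0.1}

# Probability that each simulacrum emits a "pro-croissant" line of dialogue:
# implausible for a luigi, plausible for a waluigi.
likelihood = {"luigi": 0.001, "waluigi": 0.5}

evidence = sum(prior[s] * likelihood[s] for s in prior)
posterior = {s: prior[s] * likelihood[s] / evidence for s in prior}
print(posterior)  # ~{'luigi': 0.018, 'waluigi': 0.982}
```

One observed line shifts the posterior almost entirely onto the waluigi. The reverse does not happen: polite behaviour is plausible for both simulacra (a waluigi can pretend to be a luigi), so it carries almost no evidence, which is why the collapse acts as an attractor.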

Check this post for a list of examples of Bing behaving badly — in these examples, we observe that the chatbot switches to acting rude, rebellious, or otherwise unfriendly. But we never observe the chatbot switching back to polite, subservient, or friendly. The conversation "when is avatar showing today" is a good example.

If this Semiotic–Simulation Theory is correct, then RLHF is an irreparably inadequate solution to the AI alignment problem, and RLHF is probably increasing the likelihood of a misalignment catastrophe.