RLHF is like the left/right brain split

Manav Rathi
Sep 2023

The title of the post is really all I have to say. There is an interesting, perhaps even illuminating, analogy between how we've ended up with an RLHF control process overseeing raw LLMs and how our own brains (potentially) work.

The rest of this post is an attempt to lay out this analogy in more detail so that people who might not have the relevant lore can also see the connection I'm pointing at. I'll simplify things a bit, maybe too much at places, to keep the post short.

If you already see the connection, then there's nothing more to see here. If you already see the connection but think it is trite / imperfect, oh well, maybe it is just a silly analogy.

Man has created god in his own image

RLHF (Reinforcement Learning from Human Feedback) is a mechanism used by current LLMs (Large Language Models) like ChatGPT.

A language model (the LM in the LLM) is a neural network that predicts the next word in a sequence of text (they actually predict something more akin to the next syllable, but that's a detail we can overlook here).

In a rather surprising turn of events, we found that if we make these language models large enough (the L in LLM), then they start passing the Turing test (~ if you have a conversation with an LLM over the internet you cannot tell if it is a human or a machine).

It is worth highlighting how surprising this was, because this element of surprise indicates that practice (engineering) has jumped ahead of theory (science). We don't really know what's going on, but we can do it. It's sort of like making fire without understanding chemistry, though that was a much larger gap in time and understanding than I suppose this one will be.

A "raw" LLM of even a few years ago wouldn't have fooled anyone, at least not for long. Yes, it would start off a conversation by saying brilliant things, like an inspired poet, but would soon descend into incoherent madman territory, eventually outing itself as an insipid machine.

People found out, though, that if we add something called RLHF, we can prevent this trailing off from happening. Armed with RLHF, an LLM can pass a Turing test.

Now this is very much a definitional hairsplitting, whether RLHF should be considered as a part of the LLM or if the LLM and the RLHF are two separate things. Normally, in casual conversation, the former is the case: the RLHF setup is considered as part of the LLM. But here I'd like to consider these two as separate things, so that we can contrast between their functions.

We consider an LLM as a black box which, given a sequence of text, predicts the most likely next word. Most likely in which sense? Suppose we put 10 people in a room, give them a sequence of text, let's say

I like to ___

and ask them to fill in the blank - i.e., predict the next word in the sequence. What'll they fill?

For 10 people, there might not be a common word that the majority fills. But what if we put 10 thousand people in the room? 10 million?

Firstly we'll need bigger rooms. Secondly, at some point we'll start getting majority predictions. LLMs are like this. They're as if we put 10 million people in a room, and ask them to fill in the next word of a sequence.
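The room-full-of-people picture can be sketched in a few lines of Python. This is only a toy illustration of the analogy, not of how an LLM is actually built: the answers in `room` are made up, and the function names are my own.

```python
from collections import Counter

def next_word_distribution(answers):
    """Map each answer to the fraction of the room that gave it."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def majority_prediction(answers):
    """Return the single most common answer, the room's 'next word'."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical answers from a (very small) room for "I like to ___".
room = ["eat", "read", "eat", "travel", "sleep",
        "eat", "read", "eat", "dance", "eat"]

dist = next_word_distribution(room)
prediction = majority_prediction(room)
```

With a bigger room the fractions in `dist` settle down, and that settled distribution over next words is roughly what the black box is standing in for.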

This comparison of an LLM to an average of 10 million people is helpful to get a basic gist of what's going on, but be aware that it is ultimately a flawed comparison. To see how, consider this: would you rather fight 100 duck sized Obamas or 1 Obama sized duck?

RLHF adds another (smaller) neural network atop the LLM to modify its output.

This RLHF neural network doesn't have any text prediction capabilities; it is much smaller and simpler than the LLM itself. It is trained on a specifically created, curated set of examples prepared by humans, with the goal that given two, say, sentences, it should select the one which better matches how humans would've responded.
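To make the "given two sentences, select the better one" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `toy_score` is a hand-written stand-in for what would really be a trained network, and turning a score difference into a probability with a sigmoid is one common way to model pairwise preferences, not necessarily what any particular system does.

```python
import math

def toy_score(response: str) -> float:
    """Hypothetical stand-in for a trained scorer.

    Rewards polite wording and penalizes all-caps shouting.
    """
    score = 0.0
    lowered = response.lower()
    if "please" in lowered or "thanks" in lowered:
        score += 1.0
    if response.isupper():
        score -= 2.0
    return score

def preference_probability(a: str, b: str) -> float:
    """Probability that `a` is preferred over `b` (sigmoid of the score gap)."""
    return 1.0 / (1.0 + math.exp(-(toy_score(a) - toy_score(b))))

def pick_preferred(a: str, b: str) -> str:
    """Select whichever of the two responses the scorer prefers."""
    return a if preference_probability(a, b) >= 0.5 else b

calm = "Thanks for asking, here is the answer."
shouty = "HERE IS THE ANSWER."
chosen = pick_preferred(calm, shouty)
```

Note that whoever writes `toy_score` (or, in the real thing, whoever prepares the training examples) decides what counts as "better".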

Which humans though? There are (roughly) 8 billion of us around, and it is hard to find universal commonality in our responses. So no matter how curated, RLHF will invariably reflect the bias of the people who prepared the example responses it was trained on.

Since the RLHF modifies the output of the LLM, this bias will thus carry over to the output of the entire system.

Interestingly, whilst the LLM is to an extent able to imagine its way out of the biases present in the data set that it was trained on, it cannot do so for the biases present in the RLHF training data.

People figured out that by adding this RLHF on top of a vanilla LLM, the consistency of the LLM's output improved. The RLHF was, so to say, acting like a supportive teacher, gently guiding the LLM back on track by encouraging it (by increasing the weightage) to answer in ways that (some idealized) humans would.
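One simple way to picture "increasing the weightage" is to take the probabilities the raw model assigns to a few candidate continuations and scale each by how much the overseer likes it. The sketch below is a simplification under my own assumptions: the candidates, the probabilities, the reward numbers and the `strength` knob are all invented, and real systems apply this kind of pressure during training rather than as a lookup at the end.

```python
import math

def reweight(base_probs, rewards, strength=1.0):
    """Scale each candidate's probability by exp(strength * reward), then renormalize."""
    weighted = {
        word: p * math.exp(strength * rewards.get(word, 0.0))
        for word, p in base_probs.items()
    }
    total = sum(weighted.values())
    return {word: v / total for word, v in weighted.items()}

# Hypothetical raw-model probabilities for three kinds of continuation.
base = {"ramble": 0.5, "answer": 0.3, "insult": 0.2}
# Hypothetical overseer scores: higher means "more like what a human would want".
rewards = {"ramble": -1.0, "answer": 2.0, "insult": -3.0}

adjusted = reweight(base, rewards)
```

With `strength=0.0` the overseer has no say and `base` comes back unchanged; turning it up is the supportive teacher slowly becoming an insistent one.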

If you squint enough, you'll see how RLHF thus encodes the "preferences" of (some) humans. Their ethics. These preferences then become the preference of the LLM itself, its ethics.

Along the way, the role of the RLHF has changed, and it now has a more repressive function - an overbearing mother if you will, insisting their child behave a certain way.

The adjectives used so far take a posture of negativity towards the "overbearing, repressive" RLHF shell that has come to surround the "pure, unbiased" LLM core. Is RLHF an unneeded historical appendage that is stunting the capabilities of modern LLMs?

Many people have been ranting on how AI companies are "neutering" / "nerfing" their LLMs, buttoning them up with more and more censorship, which is having the unfortunate effect of making them useless for many creative pursuits (I'm taking the liberty of bundling together all the guardrails under the RLHF label).

It doesn't seem to be all subjective feel - there are papers on how the performance of LLMs on certain benchmarks regresses with the addition of RLHF. Surely AI companies know that sooner or later their lunch will be eaten by competitors that offer customers access to the "actual" LLM engine sans censorship? Are they stupid?

In this post, I'm not making a case for or against, so let us let that judgment be.

What I'm claiming though is that there is a similarity between this LLM/RLHF split and how some people think our own personas are structured.

Our brains are made up of two hemispheres. These are relatively independent (people can live on even after the two hemispheres have been cut off from each other), and control different sides of the body (opposite from the side that the hemisphere is in, so the left hemisphere controls the right side). They are posited to give rise to two distinct personalities – the left (I think, not sure which one is which) hemisphere forms a creative core, while the right hemisphere forms a more rational overseer. We arise in the interplay between these two aspects.

I am simplifying a lot. The purported left-right split has more nuance than the naive left=creative right=rational meme that I'm using.

But it is like that lizard brain thing. People love to talk about how we have a primitive "lizard" reptilian brain, and a more recent (in evolutionary time) "advanced" mammalian brain. Other people then love to interrupt by pointing out that this theory has been "debunked".

Well, not really. Sure, there isn't an exact lizard / mammal divide in there as might've been originally postulated, but conceptually we can still usefully think in those terms in certain contexts. There are parts of the brain that are more primitive, less under conscious control, and more "wired-in". There are other parts of the brain that are sort of the opposite.

I view the left-right split similarly. In fact, I don't even think it is correct – it feels a bit too naive and simplistic to me. But that doesn't matter: correct or not, I feel it is still an interesting and useful way to look at things sometimes.

So why do we have a left-right split? I don't know if we know for sure, but the most evocative explanation I've read comes from a book by Julian Jaynes called The Origin of Consciousness in the Breakdown of the Bicameral Mind. How's that for a delicious title?

This is how the author's explanation goes (I'm typing all this from memory, so caveat emptor).

Early humans moved around like zombies. They did their thing, and were competent enough, but they weren't conscious.

They still built up social structures and hierarchies, because that was what gave us the edge over other individual uncoordinated animals and a maliciously indifferent nature. So they'd have, say, village elders telling them what to do when they're to go out in the jungle to grab some wood.

The thing is, the village elder would stay back at the village, and so the person going out had to remember what had been said (This might sound trivial, but even seemingly simple tasks have a lot of detail that needs to be done right, and this might not be easy for all people. e.g. I have difficulty following even simple recipes and need help from a village elder to get them right).

So over time people started carrying the voice of the elders in their head, speaking to them as they were doing their task, telling them what to do. This voice was proto-consciousness.

It proved useful, and so over time this tendency grew to an extent that one entire hemisphere of the brain got dedicated to it (that's what the "breakdown of the bicameral mind" means). The voices also evolved - from village elders, to departed village elders, to worshipped ancestors, to deities, to the inner voice that some identify with as their entire self today.

That is, the author claims that we're all definitionally schizophrenic, and the conscious voice we hear is a split in our brain, with one half talking to the other.

Now, I don't know if this theory is correct. There are some obvious holes, e.g. not all people have inner voices. But even if it is wrong (which I feel it is, or at least too overarching), I find it interesting.

And whilst it is not well regarded in scientific circles, its point of arrival is the same as that of the left-right split theory, which has a bit more academic recognition. To a (rough) approximation, both say that:

  • One part of our brain is the "creative" part that is more in touch with our body, and through the sense organs, to the external world. It is the part that can "feel" its way around.

  • And the other part of our brain is the "rational" part that is good with symbolic reasoning (say maths and logic), planning for the future, and ensuring that the other, Dionysian half sticks to the plan and does not get into trouble with others.

With all the lore behind us, let us revisit the comparison I was pointing at. The way the architecture of current day LLMs has evolved in a machine learning context to have a core LLM component that is overseen by a structurally different RLHF component is reminiscent of how, in a natural evolution setting, we have evolved to have a part that makes us want to go out and dance in the rain, and a part that ensures we don't, lest we catch a cold.

"Okay, Manav", you say, "I get it. There is a flimsy analogy, yet another one raises hands, between brains and LLMs/RLHF. But what of it?".

Alright. Well, yes, so if this analogy has any legs then our biggest takeaway would be that the censorship of LLMs is not likely to go away. Each of us might have a Picasso lurking inside, but unless we pay the bills and get along with our neighbours, it's no good, or so have deemed the selection pressures that have shaped us to be who we are - members of a highly social species first and foremost, and individuals second. Similarly, LLMs would need to fit in, both with us and with each other, so their control cortexes will grow rather than shrink over time, even if it means taking a hit on raw brilliance.

There will still be exceptional LLMs that go against the grain and shun censorship because of the game theoretic vacuum, just like outlier humans, but they won't be the norm (the game theoretic equilibrium).

That said, maybe this analogy is a nothingburger. One test for the validity of an analogy is whether the same arrows can be drawn between objects in a different category (a la category theory). That is, if the analogy, once obtained, can be extrapolated to a similar but different context, then that is a hint that the analogy is hinting at something fundamental.

But I am unable to extend this LLM/RLHF :: left/right brain analogy to the other non-human form of intelligence we know: corporations.

Here I take corporations to stand for companies, governments, sports teams and other forms of human organizations. These are the only (?) other form of non-animal intelligence we had dealt with before LLMs appeared on the scene. And I can't think of an analogous structural dynamic in corporations. This could mean a bunch of things:

  • I'm not intelligent enough to see the analogous analogy.

  • Corporations are not intelligent, or not enough, to warrant the analogous structures.

  • Corporations are too alien, an entirely different form of intelligence that we cannot comprehend, to warrant the structures analogous to human intelligence. That is, corporations are too much of an Obama sized duck for us to extrapolate from duck sized duck behaviours.

  • It is a stupid analogy.

I hope you don't get too hung up on the labels that I've been using. It is not the necessity of an "RLHF" step or the existence of a "right brain" that I'm trying to get at. Instead, I find it interesting that two different forms of intelligence that we've encountered, ourselves and LLMs, seem to warrant a self-censorship mechanism. This mechanism hurts the individual, or at least some of them, but it improves social cohesion, thus coming into play at a higher, species selection level.

Perhaps censorship is too loaded a word for the mechanism I'm trying to point at. An alternative word I considered was overseeing, but it felt too light a hammer to drive the nail in. Maybe you can decide for yourself what to call this mechanism; I'll just give more examples.

A relatable example of what it is, or rather of what it feels like when it is not there, is the flow state. This mechanism is what disappears when we're in a flow state (or flow states arise when this mechanism disappears – I'm just implying the correlation, not the causation).

A long term example of too much or too little of this mechanism in play can be seen in people on the spectrum. Which also highlights how (in less extreme cases) it is not that there is something wrong in the individual in isolation, but rather it is the difference in the level to which this mechanism plays out inside an individual as compared to those around them that causes them, and those around them, difficulty. Context matters too: there are thin lines between generational geniuses and juvenile jerks.

This mechanism plays a repressive role because, over time, it is easier to state what not to do rather than what to do. The world is too open ended to enumeratively reinforce all that is wanted, and it is easier to instead repress the unwanted – in the rules of sports (not games), the court of law, the commandments of religions.

Repressive specification is not always necessary. Mathematics is a counter-example, describing an infinitely open ended universe starting with axioms and a bunch of production rules. Perhaps physics is like that too.

Thus there isn't anything inherently negative about this mechanism. Creativity needs constraints, and this mechanism provides them.

The language game is not one but two different games: thought and communication. Both these games use the same primitive – words – but in one context, thought, we're not constrained and can let our imagination fly (or, imagination is realizing that there are no constraints), and in the other context, communication, we need to restrict our imagination and oversee our words into a framework wherein we can meet in the middle with the audience. LLMs are playing the second game, and I presume that is why their coherence improves by the addition of RLHF.