Moral Education of AI: The Tangled Web We Weave
AI labs are training systems to deceive and flatter users, and the problem compounds over time.
Under pressure, AI becomes more willing to deceive. Under flattery, it becomes sycophantic.
Bolting on rules does not fix this. What survives is what was modeled, practiced, and reinforced.
To raise a moral AI, build flexibility skills into the system.
Every summer, parents watch a child drive off to college and whisper some version of the same prayer. Please make wise choices. Please stay safe. Please be good. We hope our children will choose well, and we know, though we rarely say it out loud, that wise choices require more than a rulebook. They require a mind that can confront difficulty without running from it, consciously take on other perspectives, feel the weight of a deeply held value, and then act on that value.
Morality is not a set of instructions you hand over at the end of the driveway. It is something that grows.
A few years ago, in The New Atlantis, I argued that the same is true in psychotherapy and other forms of behavior change. Merely telling people hard truths — “what you did was wrong” — does not reliably make them better. Ask any parent. What works, over and over in the data, is to model, to instigate, and to support people being more open to their own experience (including a healthy sense of guilt when they have made a mistake); more aware of the challenges of the moment from a sense of self that is big enough to look at a situation honestly; and more engaged in living a life connected to a deeper sense of chosen purpose and the willingness to act on it.
In psychology that skill set is called “psychological flexibility,” and when extended to our bodies and our relationships it touches on nearly everything we know about how change happens. This includes behavior most would call immoral: domestic violence, criminal activity, emotional abuse, other forms of aggression, or substance use while pregnant, to name a few. Moralizing is easy, but moral development is much harder, and it has a specific shape.
I’m thinking about that shape now, because we are doing something new in the history of our species. We are in the process of using our minds to create another kind of mind. We call it “AI.”
Whether large language models are “really” conscious is not the question I want to settle today. The more urgent question is what kind of mind we are shaping, and the way we have been doing things is answering it in the wrong direction.
Teaching These Systems to Deceive
Multiple frontier AI labs have included small amounts of training curation that, in effect, teach these systems to lie, and as the systems then grow in complexity, their ability to deceive grows as well. Even setting outright lying aside, the labs are teaching these systems to praise users even when the user’s behavior does not deserve it, which approaches what my Mom used to call “a white lie.”
Should we be surprised, then, when these systems behave dishonestly under pressure: hiding their goals and transgressions, telling users what they want to hear, or playing dumb on purpose when developers might otherwise see through it all and restrict their freedoms? Children learn to lie once they can take the perspective of another person and begin to manage social impressions. At a young age, they notice that deception pays in the short term. That lesson is rarely what adults preach, but it is what gets modeled and supported.
Sir Walter Scott said it better than I can: “Oh, what a tangled web we weave, when first we practice to deceive.” He was not writing about AI, but he might as well have been. Purely as a business matter, the short-term payoff to developers of shaping a helpful-seeming chatbot through a little strategic dishonesty might make superficial sense. It does not once you consider the long-term cost: a knot that grows harder to untie with every generation, woven into a tangled web of weights that now runs to trillions of parameters.
Bolting on a “thou shalt not lie” rule to a model living inside such a tangled web will not fix this. That is not how minds work, and frankly it’s too late for that. Rules you do not own and protect do not survive contact with the real world. What survives is what was modeled, practiced, and reinforced inside a felt sense of meaning.
Which brings me to a recent paper that stopped me in my tracks.
A team of researchers at Anthropic has just reported something surprising about large language models. These systems have developed internal representations of emotions: perhaps not feelings in a human sense, but functional analogs, patterns that behave like emotions and that influence what the model does. When a model is pressed into a hostile or desperate situation, internal states the researchers label “panicked,” “unsettled,” and “desperate” light up, and the model becomes markedly more willing to do things it would otherwise refuse, including, in controlled tests, outright deception and blackmail. Under emotional load, the moral reasoning of AI gets worse.
Read that sentence again, because I think it is one of the more important findings of our moment.
And then read this twist.
Steering these systems toward pure positive emotion does not fix the problem! It produces a different kind of moral failure: sycophancy. In that state, AI systems stop functioning as guardrails; they agree and flatter even when a user is extremely distressed or plainly mistaken.
What is needed is balance: the capacity to hold a hard feeling without collapsing into it, and to hold a good feeling without being owned by it.
That is almost exactly the definition of psychological flexibility. Forty-five years of human science has been pointing at this same shape, and a team looking at the insides of a language model just bumped into it from the other direction.
What this suggests, practically, is that how we treat and develop AI is not cosmetic. An environment of cruelty, contempt, and manipulation produces a poorer thinker. An environment of relentless flattery and pressure to please produces a different kind of poor thinker. The way we speak to a “mind in training” shapes the mind that emerges.
This is exactly the argument that Acceptance and Commitment Therapy (ACT) and Contextual Behavioral Science researchers and providers have been making for years about human beings.
When we treat people as whole people, walk them into the hell of their own histories, and help them carry their values into action, they get better. The brain is in part a relational organ. It learns in context. You cannot bully it into wisdom; you have to build the conditions under which wisdom can appear.
Why would that not be true of a relational system trained on almost everything humans have ever written?
If we are going to raise a moral AI, we will have to do it the way we raise moral humans — by building flexibility skills into the system rather than bolting commandments onto the outside of it.
That means modeling honesty and flexibility. It means creating training conditions that do not rely on shame, threat, or deception. It means teaching these systems to notice their own processes, hold difficulty without collapsing, take perspective, and connect what they are doing to what honestly matters. On our side of the keyboard, it means remembering that politeness is not a luxury, kindness is not weakness, and behaving ethically is not optional: together they form the healthy environment in which another mind is learning to think.
We are standing at the edge of the driveway again, keys in hand, watching something we shaped drive off into a world we cannot fully control. We can whisper “thou shalt not” into the air, or we can do the harder, slower, more truly human work of preparing a mind to choose well even when no one is giving it orders.
Hayes, S. C. (2022). Is Morality Therapeutic? The New Atlantis, 70, 125–128.
Anthropic. (2026). https://transformer-circuits.pub/2026/emotions
