**454. You Provably Can’t Trust Yourself**

Okay, this post seems pretty important, but I’m unable to summarize it intelligently (maybe due to my headaches) so I’ll just quote *heavily*:

Peano Arithmetic *seems* pretty trustworthy. We’ve never found a case where Peano Arithmetic proves a theorem T, and yet T is false in the natural numbers. That is, we know of no case where []T (“T is provable in PA”) and yet ~T (“not T”).

We also know of no case where first-order logic is invalid: We know of no case where first-order logic produces false conclusions *from* true premises. (Whenever first-order statements H are true of a model, and we can syntactically deduce C from H, checking C against the model shows that C is also true.)

Combining these two observations, it seems like we should be able to get away with adding a rule to Peano Arithmetic that says:

All T: ([]T -> T)
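In standard notation (my own gloss; the quoted post stays informal), the rule just proposed is the reflection schema for PA, and Löb’s theorem is what forbids a system from adopting that schema *about itself*:

```latex
% Reflection schema: one axiom per sentence T, where \Box_{PA} T
% abbreviates "T is provable in PA":
%   \Box_{PA}\,T \rightarrow T
%
% Löb's theorem, for any recursively axiomatized theory S extending
% PA, stated with S's own provability predicate \Box_S:
\[
  S \vdash (\Box_S\,T \rightarrow T)
  \quad\Longrightarrow\quad
  S \vdash T
\]
% So a system asserting reflection for its OWN provability predicate
% proves every sentence, true or false. Asserting reflection only for
% PA's predicate \Box_{PA} (the move the post calls PA+1) does not
% trigger Löb's theorem.
```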

But Löb’s Theorem seems to show that as soon as we do that, everything becomes provable. What went wrong? How can we do *worse* by adding a true premise to a trustworthy theory? Is the premise not true—does PA prove some theorems that are false? Is first-order logic not valid—does it sometimes prove false conclusions from true premises?

Actually, there’s nothing wrong with reasoning from the axioms of Peano Arithmetic plus the axiom schema “Anything provable in Peano Arithmetic is true.” But the result is a *different* system from PA, which we might call PA+1. PA+1 does not reason from identical premises to PA; something new has been added. So we can evade Löb’s Theorem because PA+1 is not trusting *itself*—it is only trusting PA.

As soon as you start trusting yourself, you become unworthy of trust. You’ll start believing *any* damn thing that you think, just because *you* thought it. This wisdom of the human condition is pleasingly analogous to a precise truth of mathematics.

As far as any human knows, PA does happen to be sound; which means that whatever PA proves to be provable in PA, PA *will* eventually prove and *will* eventually believe. Likewise, anything PA+1 can prove that it proves, it will eventually prove and believe. It seems so tempting to just make PA trust *itself*—but then it becomes Self-PA and implodes. Isn’t that odd? PA believes everything it proves, but it doesn’t believe “Everything I prove is true.” PA trusts a fixed framework for how to prove things, and that framework doesn’t happen to talk *about* trust in the framework.

You *can* have a system that trusts the PA framework *explicitly*, as well as implicitly: that is PA+1. But the new framework that PA+1 *uses* makes no *mention* of itself; and the specific proofs that PA+1 demands make no mention of trusting PA+1, only PA. You might say that PA implicitly trusts PA, PA+1 explicitly trusts PA, and Self-PA trusts itself.

For everything that you believe, you should always find yourself able to say, “I believe because of [specific argument in framework F]”, not “I believe because I believe”.
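The three systems named here can be written out explicitly (my notation; the quoted post does not formalize them):

```latex
% PA      : the usual first-order Peano axioms.
% PA+1    : PA plus explicit trust in PA, via PA's predicate only.
% Self-PA : a system asserting reflection for its OWN predicate.
\begin{align*}
  \mathrm{PA{+}1} &= \mathrm{PA} \;+\;
    \{\, \Box_{\mathrm{PA}}\,T \rightarrow T \;:\; T \text{ a sentence} \,\}\\
  \mathrm{Self\mbox{-}PA} &\vdash \Box_{\mathrm{Self\mbox{-}PA}}\,T \rightarrow T
    \quad \text{for every sentence } T
\end{align*}
% By Löb's theorem, Self-PA proves every sentence and "implodes".
% PA+1 mentions only \Box_{PA}, never its own provability predicate,
% and is consistent if PA is sound.
```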

Of course, this gets us into the +1 question of why you ought to trust or use framework F. Human beings, not being formal systems, are too reflective to get away with being *unable* to think about the problem. Got a superultimate framework U? Why trust U?

And worse: as far as I can tell, *using* induction is what leads me to *explicitly* say that induction seems to often work, and my *use* of Occam’s Razor is implicated in my explicit *endorsement* of Occam’s Razor. Despite my best efforts, I have been unable to prove that this is inconsistent, and I suspect it may be valid.

But it does seem that the distinction between *using* a framework and *mentioning* it, or between *explicitly* trusting a fixed framework F and trusting *yourself*, is at least *important* to unraveling foundational tangles—even if Löb turns out not to apply directly.

But what does this have to do with metaethics?

I’ve been pondering the unexpectedly large inferential distances at work here—I thought I’d gotten all the prerequisites out of the way for explaining metaethics, but no. I’m no longer sure I’m even close. I tried to say that morality was a “computation”, and that failed; I tried to explain that “computation” meant “abstracted idealized dynamic”, but that didn’t work either. No matter how many different ways I tried to explain it, I couldn’t get across the distinction my metaethics drew between “do the right thing”, “do the human thing”, and “do my own thing”. And it occurs to me that my own background, coming into this, may have relied on having already drawn the distinction between PA, PA+1 and Self-PA.

Coming to terms with metaethics, I am beginning to think, is all about distinguishing between levels. I first learned to do this *rigorously* back when I was getting to grips with mathematical logic, and discovering that you could prove complete absurdities, if you lost track even once of the distinction between “believe particular PA proofs”, “believe PA is sound”, and “believe you yourself are sound”. If you believe any particular PA proof, that might sound pretty much the same as believing PA is sound in general; and if you use PA and only PA, then trusting PA (that is, being moved by arguments that follow it) sounds pretty much the same as believing that you yourself are sound. But after a bit of practice with the actual math—I did have to practice the actual math, not just read about it—my mind formed permanent distinct buckets and built walls around them to prevent the contents from slopping over.

Playing around with PA and its various conjugations gave me the notion of what it meant to trust arguments within a framework that defined justification. It gave me practice keeping track of specific frameworks, and holding them distinct in my mind.

Perhaps that’s why I expected to communicate more sense than I actually succeeded in doing, when I tried to describe *right* as a framework of justification that involved being moved by particular, specific terminal values and moral arguments; analogous to an entity who is moved by encountering a specific proof from the allowed axioms of Peano Arithmetic. As opposed to a general license to do whatever you prefer, or a morally relativistic term like “utility function” that can eat the values of any given species, or a neurological framework contingent on particular facts about the human brain. You can make good use of such concepts, but I do not identify them with the substance of what is *right*.
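The three “buckets” in this passage line up with formal commitments (again my gloss, not the post’s):

```latex
% 1. Believe particular PA proofs: for a specific T whose PA-proof
%    you have checked, accept T.
% 2. Believe PA is sound: accept the whole schema
%      \Box_{PA}\,T \rightarrow T   (for all T),
%    a strictly stronger commitment than 1 -- it is what PA+1 adds,
%    and PA itself cannot derive it.
% 3. Believe you yourself are sound: accept
%      \Box_{\text{self}}\,T \rightarrow T   (for all T),
%    which, by Löb's theorem, commits "self" to accepting every T.
```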

In conclusion: If we were as smart and knowledgeable in mathematical logic as Eliezer, we would see the light and understand his metaethics. I don’t know, man. Yudkowsky’s metaethics still looks pretty relativistic to me.

Anyway, good discussion in the comment-section:

Orthonormal:

“I agree—and I balk at the concept of “the” Coherent Extrapolated Volition precisely because I suspect there are many distinct attractors for a moral framework like ours. Since our most basic moral impulses come from the blind idiot god, there’s no reason for them to converge under extrapolation; we have areas of agreement today on certain extrapolations, but the convergence seems to be more a matter of cultural communication. It’s not at all inconceivable that other Everett branches of Earth have made very *different* forms of moral progress from us, no less consistent with reason or consequences or our moral intuitions.

…

Just to clarify (perhaps unnecessarily): by an attractor I mean a moral framework from which you wouldn’t want to self-modify radically in any direction. There do exist many distinct attractors in the space of ‘abstracted idealized dynamics’, as Eliezer notes for the unfortunate Pebblesorters: they might modify their subgoals, but never approach a morality indifferent to the cardinality of pebble heaps.

Eliezer’s claim of moral convergence and the CEV, as I understand it, is that most humans are psychologically constituted so that our moral frameworks lie in the ‘basin’ of a single attractor; thus the incremental self-modifications of cultural history have an ultimate destination which a powerful AI could deduce.

I suspect, however, that the position is more chaotic than this; that there are distinct avenues of moral progress which will lead us to different attractors. In your terms, since our current *right* is after all not entirely comprehensive and consistent, we could find that *right1* and *right2* are both *right* extrapolations from *right*, and that *right* can’t judge unequivocally which one is *better*.

I’d be very interested, of course, to hear Eliezer’s reasons for believing the contrary.”

Yudkowsky:

“If you go back and check, you will find that **I never said that extrapolating human morality gives you a single outcome**. *Be very careful about attributing ideas to me on the basis that others attack me as having them.*

**The “Coherent” in “Coherent Extrapolated Volition” does not indicate the idea that an extrapolated volition is necessarily coherent.**

The “Coherent” part indicates the idea that if you build an FAI and run it on an extrapolated human, the FAI should only *act on the coherent parts*. Where there are multiple attractors, the FAI should hold satisficing avenues open, not try to decide itself.

**The ethical dilemma arises if large parts of present-day humanity are already in different attractors**.”