. — GeistHaus

tsvi bt Mar 25, 2026 Updated Mar 25, 2026


A while ago I made some custom MTG cards. I don’t play Magic and I definitely don’t know how to balance a Magic card, so these are just conceptual explorations. In these cards, I was more interested in off-beat mechanics, rather than anything to do with actual MTG play.

Purple knocks things out of place, and messes with time a bit, making things disjointed or throwing things out of their usual context.

The Eternal Denizen is stuck out of time / stuck in a loop / stuck never quite entering the world. It doesn’t actually use mana, it just hypothetically uses mana.

This is maybe still too spiky to be balanced. But it’s constrained by the colors, so it would be weird to build around. I like the idea of the channelmage getting lost in trying to wrangle a bunch of different mana of the wrong colors and getting all mixed up.

I think library modification is interesting. There could be some unexplored ideas with making two libraries, e.g. splitting your library into equal-sized piles, shuffling, and then getting to pick which to draw from each turn. Anyway, RSM is a potentially very powerful card. The recursive part would be if you thin your library, and thereby find more tutors / more copies of RSM. I like that RSM is potentially very powerful, but also potentially very dangerous and self-mangling if not used carefully.

More things out of place, rearranging time. This is very situational, but in some cases powerful, e.g. you could effectively counter several counters by putting them on the bottom of the stack, or you could disrupt your opponent’s synergistic triggers or whatever.

I have no idea how to balance this or if it’s workable; would take playtesting.

Purple shuffles things around, filtering through possibilities, spinning through time.

The idea of this card is that if you activate it enough, you’re kinda stacking your deck—not necessarily in the sense of tutoring anything, but in the sense that if you have cards that chain into other cards (e.g. Cascade or similar) you can set yourself up. In the late game it is more like a clumsy repeatable tutor / deck stacking.

Purple is somewhat otherworldly—things come in from strange places, going where they aren’t supposed to go. Then they vanish, and you don’t even remember if they were real at all.

Purple explores possibilityspace—and can swap possibilities, or pump probabilities.

Purple is powerful, but always with some twist. Often the twist puts some weird constraint on how the power can flow, making it hard to use, e.g. the erstwhile channelmage or the ruined ur-portal. I think this card might be interesting to build around: it could be powerful, but you have to have the right balance of several card types.

This is a very unhinged card.


A bit of fluid mechanics from scratch not from scratch

tsvi bt Feb 24, 2026 Updated Feb 24, 2026


1 Static pressure gradients??
2 Spouts and acceleration
3 Height paradox thingy
4 Deducing the equivelocity curve?
5 Sideways pressure?
6 Generalizing

I’ve been reading various things about fluid mechanics so that I can think about microfluidics. But I haven’t really been studying it like I used to study a math area (instead just kinda getting various impressions). I’d like to think it through a bit “from scratch”, though it’s very much not actually from scratch of course. This is written as close to real-time thinking as possible (though of course much slower than without writing), over a few hours, with the intent of getting a “thinking trace” because that seems interesting.

[In all diagrams, pretend the top is open to the air, and ignore differences in air pressure.]

1 Static pressure gradients??

If I ask myself what I’m intuitively confused about, I’m like, hold on—it makes intuitive sense that pressure gradient would cause the fluid to accelerate. But can’t you have pressure gradients in a static situation?? E.g. if you have a tank of water, there’s a pressure gradient, where the lower water has more pressure:

But it’s just water sitting in a tank and it’s obviously not moving, let alone accelerating. But there’s a pressure gradient, so why isn’t it accelerating? Presumably it has to do with gravity? But like, the gradient is pointing downward, so my synesthetic unconscious thinks that the acceleration is supposed to be downward, and also the bottom is the most pressurized and pressure feels intuitively related to going fast, so the bottom should be the fast part.

Ok duh, there is indeed a force from the pressure gradient; it’s pointing upward, because of course high pressure pushes stuff away so the force (which I suppose is the negated gradient) points in the direction of decreasing pressure, which is straight upward. This force is exactly counterbalanced by the force of gravity. The gradient is constant throughout the water, dependent only on gravity and the density of water.
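As a quick numerical check of the balance just described, here is a minimal sketch. The density and gravity values, and the function names, are assumptions chosen for illustration; they are not from the post. It computes the upward pressure-gradient force per unit volume on a thin parcel of still water and compares it with the parcel's weight.

```python
RHO_WATER = 1000.0  # kg/m^3, assumed density of water
G = 9.81            # m/s^2, assumed gravitational acceleration


def gauge_pressure(depth_m: float) -> float:
    """Hydrostatic gauge pressure (Pa) at a depth below the open surface."""
    return RHO_WATER * G * depth_m


def net_upward_force_density(depth_m: float = 0.5, dz: float = 1e-3) -> float:
    """Net upward force per unit volume (N/m^3) on a thin parcel of still water.

    The pressure force is the negated gradient: pressure grows with depth,
    so the force points toward decreasing pressure, which is straight up.
    """
    # Pressure on the bottom face minus pressure on the top face of the parcel.
    dp = gauge_pressure(depth_m + dz) - gauge_pressure(depth_m)
    pressure_force_up = dp / dz      # upward push from the gradient
    weight_down = RHO_WATER * G      # downward pull of gravity
    return pressure_force_up - weight_down


if __name__ == "__main__":
    for depth in (0.1, 0.5, 2.0):
        print(depth, net_upward_force_density(depth))
```

The printed values should be zero up to floating-point rounding at every depth, which matches the claim that the gradient is the same throughout the water and exactly cancels gravity.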

2 Spouts and acceleration

Now suppose we have a tank with a spout:

The water is shooting out of the spout. When does it accelerate? It seems like the full spout head, on the far right, is full of water shooting out. But if that’s the case, since water is incompressible / not stretchable or something, that means that the whole horizontal pipe has its contents flowing at the same rate. Because if the water were accelerating while in the tube [[narrator: he’s assuming that the water at different vertical positions, at a given horizontal position, is moving with the same vector]], then you would have more water leaving the horizontal tube than entering it.

So that means that the water is already accelerated to full speed before entering the horizontal pipe… How / where? Something doesn’t make geometric sense…

…Oh, we can’t model this as a bunch of simple (straight, say) slices with constant velocity on a given slice. That just won’t work. Let’s imagine a different scenario:

And let’s presume that the velocity at a given height is constant. Well, we know that the velocities are always strictly vertical (and downward). Further, to conserve mass, and given incompressibility, we’ve got to have that the velocity is inversely proportional to the width, simple as that.
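The inverse-proportionality claim can be written down directly. This is a minimal sketch under the same assumptions as the paragraph above (incompressible water, one uniform velocity per horizontal slice, a flat two-dimensional channel as in the diagrams); the function name is hypothetical.

```python
def speed_at_width(v_ref: float, w_ref: float, w: float) -> float:
    """Speed on a slice of width w, given speed v_ref on a slice of width w_ref.

    Mass conservation with incompressibility means width * speed is the same
    on every slice, so speed is inversely proportional to width.
    """
    if w_ref <= 0 or w <= 0:
        raise ValueError("widths must be positive")
    return v_ref * w_ref / w


if __name__ == "__main__":
    # Hypothetical channel that narrows from 4 units wide to 1 unit wide.
    for width in (4.0, 2.0, 1.0):
        print(width, speed_at_width(v_ref=1.0, w_ref=4.0, w=width))
```

Halving the width doubles the speed, so any acceleration has to happen exactly where the channel narrows.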

What about with a simpler tank?

Something seems really weird. The water coming out should be going pretty fast. But if the slices of water at each height have constant velocity, this makes no sense? Because the tank is much wider than the spout. And also it is constant width, so therefore the water has constant velocity. So the water is going slowly down, but then suddenly it zooms out of the narrow spout at the last infinitesimal moment?? When did it accelerate??

And furthermore, the tank is wide and the spout is small. So the water reaching the bottom of the tank, say in the corner, has to get over to the spout really fast?? How??

Makes no sense. I think I can imagine what ought to happen:

Where the corners are stagnant, or now that I’ve made the diagram I guess maybe little vortices or something, I don’t fucking know lol.

Anyway the point is that the case where a fixed flow-orthogonal slice of a pipe / channel always has a constant velocity, is a very special case. …Well actually, I don’t know how special… Is it determined between two points? Like if I tell you the width of the top and bottom of this:

Can you then deduce the full curve? Let’s come back to this…

3 Height paradox thingy

Ok so I have a serious problem. As everyone learns in kindergarten, the speed at which water comes out of a spout in a tank depends on the height:
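The post does not state a formula for this dependence. The standard idealized result is Torricelli's law, v = sqrt(2 g h), which assumes no viscosity, a tank much wider than the spout, and an outlet open to the air. A minimal sketch, with an assumed value for g and a hypothetical function name:

```python
import math

G = 9.81  # m/s^2, assumed gravitational acceleration


def outlet_speed(height_m: float) -> float:
    """Idealized outflow speed (m/s) for water standing height_m above the spout.

    This is Torricelli's law; it ignores viscosity and any motion of the
    water surface in the tank.
    """
    if height_m < 0:
        raise ValueError("height must be non-negative")
    return math.sqrt(2.0 * G * height_m)


if __name__ == "__main__":
    for h in (0.25, 1.0, 4.0):
        print(h, outlet_speed(h))
```

Under these assumptions the speed grows with the square root of the height, so quadrupling the height only doubles the outflow speed.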

But something is weird. Come with me for a sec. Suppose we have a spout at the bottom:

So the water comes out at some rate, fine. And that rate depends on how tall the tank was, yeah. But now suppose that we add more pipe, at spout-width, to the bottom:

Ok, so, it’s the same as before, but the outlet of the spout is now significantly deeper / lower. So the speed of the water should be higher, right?

Ok, but if the water is faster at the bottom of the long spout… We could view the top part of this system as an exact copy of the short-spout version. At the interface between the tank bottom and the pipe-spout, the velocity of the water should be the same as in the no-pipe version, right? But that means the water inside the pipe is accelerating inside the pipe:

But that’s… not… possible. Water is incompressible and mass is conserved. The water at the bottom of the spout cannot be going faster than at the top!

I’m not confident about what’s going on. I think all my reasoning is pretty solid. Unless it’s like, the water inside the pipe-spout is somehow pulling or pushing on the water in the tank differently than if there were no pipe? But that doesn’t sound right? Because it’s falling away anyway…

My best guess is that the above reasoning is correct, and what actually happens is like this:

In other words, the water does indeed accelerate, and it also gets narrower; the pipe is partly filled with air.

4 Deducing the equivelocity curve?

Does this solve the problem stated a little while ago? In other words, do we know what shape a pipe would have to be in order to start with a given width, end with another given width, and have constant velocity on each flow-orthogonal slice?

Maybe what we get is like the pipe-spout:

In fact, the shape of this pipe is already determined by the input width! We cannot change it, because the curve is determined by gravitational acceleration. As the water falls, it accelerates. As it accelerates, the pipe must get narrower (total length is inverse to average flux density or whatever).

So we can’t even set the output width.
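To make the "shape is already determined" claim concrete, here is a minimal sketch of the profile such a stream would have. It assumes the water is in free fall with no viscosity, stays at air pressure along its whole length, and is treated as flat (two-dimensional) as in the diagrams. The function name and the numbers are hypothetical.

```python
import math

G = 9.81  # m/s^2, assumed gravitational acceleration


def stream_width(w0: float, v0: float, drop_m: float) -> float:
    """Width of a falling stream after dropping drop_m below the inlet.

    w0 and v0 are the width and downward speed at the inlet (v0 > 0).
    Free fall gives the speed; conserving width * speed gives the width.
    """
    if w0 <= 0 or v0 <= 0 or drop_m < 0:
        raise ValueError("need w0 > 0, v0 > 0, drop_m >= 0")
    v = math.sqrt(v0 ** 2 + 2.0 * G * drop_m)  # speed after falling drop_m
    return w0 * v0 / v                          # incompressible: w * v constant


if __name__ == "__main__":
    for drop in (0.0, 0.1, 0.5, 1.0):
        print(drop, stream_width(w0=2.0, v0=1.0, drop_m=drop))
```

Once the inlet width and inlet speed are fixed, every later width follows, which is the sense in which the output width cannot be chosen independently. For a round stream the same argument would apply to cross-sectional area rather than width.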

Well, but this is not fully satisfying. For one thing, we could have just asked, in general, ignoring how this steady state is obtained through forces / pressures, can we just imagine some pipe with some varying width, filled with water which is moving and accelerating according to width, but in a steady state (the way a river holds a steady shape even as water moves and accelerates through it)?

At the moment my guess is yes… In fact, couldn’t we just set up any curves as our pipe?:

I think this would work; as long as you have a pressure difference between one end and the other, the water accelerates in the direction of less pressure. (I just realized a lot of the diagrams above show this incorrectly maybe.)

But, like with the spout at the bottom of a tank, you would very much not get constant velocity on a given slice. Instead you’d get vortices and lacunae and whatnot:

With the pipe-spout at the bottom of the tank, we are fixing the pressures at the top and bottom of the pipe, because we’re using water-depth pressure caused by gravity. If we relax that constraint, we can just imagine that we have some specified pressure differential.

But I’m still not satisfied, because it feels like there’s some constraint / some criterion that specifies the right shape for a pipe…

It can’t actually literally be that the velocity is constant on flow-orthogonal slices. That actually does not make sense because the water is supposed to accelerate along the pipe, and therefore the pipe is getting narrower, and therefore at least some water must move (and indeed accelerate) in a direction with some flow-orthogonal component:

This can’t be constant (or else the slice of water would be moving, on net, in a direction not in the direction of flow…).

We might instead ask for having a constant flow-parallel component of velocity at a given flow-orthogonal slice; or a constant speed.

I’m not really sure what I’m asking for, but it feels like there’s something here. One question would be, for given inlet and outlet widths for a pipe, and given inlet and outlet pressures, what pipe shape gives the fastest steady state flow? (I suppose not all pipe shapes give steady state flows at all, because some would have turbulence—or do you need viscosity or friction or something for turbulence? Which pipes are turbulent or not? Presumably that’s a hard question but maybe there are some nice simple recognizable classes of each.) Something something laminar flow?

Part of why I feel there’s some question here, is that on the one hand, you have a picture like this from earlier:

Where the whole thing considered as a pipe is clearly not good, and the steady state flow (if there is one) is not simple / nice / smooth, because the shape of the pipe is such that it doesn’t elegantly funnel the water laminarly or something into the outlet. But on the other hand, we could have pipes like this:

Here the channel gets quite narrow in a weird sharp way. The flows seem like they would get hampered and not flow smoothly. It feels like, in between getting narrow too quickly and getting narrow not quickly enough, there should be one curve, or maybe some region in curve-space, that makes it so that the flow is nice and smooth and fast and efficient.

Oh wait, I guess both of them get narrow too quickly. It doesn’t really make sense to get narrow too slowly… Well, overall everyone gets narrow on average at the same rate, if we’ve fixed the widths of our inlet and outlet. I guess what I mean is, you get narrow too unevenly, with some high density spikes of narrowingness.

Maybe if the outlet is too differently sized from the inlet, then you might inevitably have problems like this? (I suppose the outlet could also be too big.)

We know that there is at least one good pipe shape, because the pipe could just be straight (equal outlet and inlet) and have no pressure differential (and somehow be in a moving steady state, if we want). I think we know there’s another good pipe shape, which is the gravitational one from earlier with the pipe-spout at the bottom of the tank.

5 Sideways pressure?

One thing that I’m a bit confused about is the sideways pressure from the walls in a pipe that’s narrowing. Or in general, how does the water get pushed into the middle of the pipe as it narrows?

Is there a pressure gradient sideways? There must be; I don’t think we have other available forces. Gravity can be ignored, and what else is there? I guess there’s, like, collision force or something? But isn’t that the same as pressure?

There’s also friction and stickiness and viscosity / strain or something, but I think we’re ignoring those at the moment? Maybe that doesn’t make sense.

Anyway, I think there’s a pressure gradient sideways? Yeah I guess that’s probably right. The walls push inward on the water. But wait… Don’t the walls always push inward on the water?

So isn’t there always pressure on the outside surface of the water? And therefore there’s always a pressure gradient that’s high pressure on the surface and low on the inside? And therefore… the water accelerates towards the center of the pipe?

That makes no sense, right?

Or maybe pressure is constant in a constant-width pipe:

And it’s only non-constant in a non-constant-width pipe:

It would be nice to know these things quantitatively.

6 Generalizing

I guess there’s a few different kinds of questions:

Statics (water not moving; question of force fields and pressures)
Constant velocity (water moving but not accelerating)
Steady states (water moving and accelerating, but constant velocity and acceleration fields; these seem like they should be nice)
Dynamics (water colliding / roiling… accelerations time-dependent)

I speculate that we can usefully analyze steady states using just pressure fields; the gradient gives the force and acceleration fields; …. and then I guess you need to know how that relates to velocity fields …. seems kinda hard actually.


Bioanchors 2: Electric Bacilli

tsvi bt Feb 24, 2026 Updated Feb 24, 2026


1 Arguments for fast AGI progress
2 Intuition pumps for being close to AGI
3 Synthetic life as an intuition pump
4 Some things I like about this analogy

[Previously: “Views on when AGI comes and on strategy to reduce existential risk”, “Do confident short timelines make sense?”]

[Whenever discussing when AGI will come, it bears repeating: If anyone builds AGI, everyone dies; no one knows when AGI will be made, whether soon or late; a bunch of people and orgs are trying to make it; and they should stop and be stopped.]

1 Arguments for fast AGI progress

Many arguments about “when will AGI come” focus on reasons to think progress will continue quickly, such as:

Line go up.
Researchers can pivot to address new obstacles and ditch dead ends.
AI can be used to accelerate AI research.
We’re over a threshold of economic returns, such that AI research will permanently see much more investment than before.

2 Intuition pumps for being close to AGI

These are all valid and truly worrying. But, to say anything specific and confident about when (in clock time) AGI will come, we’d also have to know how fast progress is being made in an absolute sense; specifically, in an absolute sense as measured by “how much of what you need to make AGI do you have”.

There are various intuition pumps / analogies that people use to inform their sense of how far AI research has come. For example:

AI is like a child developing. Progress is made incrementally and continually; each year brings a new quantity and quality of progress that builds on but transcends the previous. We match AI capabilities to the age of a human with the closest capabilities set. We find that AI grows by 2 years per year (or something), is currently approximately a grad student, and will soon be a leading researcher. Some months later it becomes mildly superhuman.
AI is like brains and neurons. We need a bunch of compute power. We measure how much computing power we have against how much computing power a brain uses. We calculate tradeoffs between performance gains from algorithmic progress vs. from more compute. When we have enough effective compute, we get human level AI or beyond.
AI is like an employee. At first it is not even worth the time to manage it; then it is helpful on some narrow tasks; then it can do a wide range of common / low-ish skill tasks, with questionable reliability; then it becomes reliable for easy things and starts contributing on difficult tasks and becomes economically valuable; then it starts replacing whole jobs; then it starts replacing whole companies / sectors; then it starts generating new sectors.
AI is like a student. We feed it more training data and run more reps; it gets higher test scores; once it performs like the very best students, it becomes ready to do real research.

I believe these are poor intuition pumps for understanding when AGI comes because they do not evoke the sense that there is some unknown, probably-large blob of complexity that one has to possess in order to make AGI. They paper over differences in how the AI system does what it does.

3 Synthetic life as an intuition pump

Intuition pumps can only go so far. Each domain has its own central complexities, and there’s no good reason that the world has to present a deeply correct analogy for the development of AGI, and in fact I’m not aware of such. That said, as long as we’re doing intuition pumps, I want to propose another intuition pump for timelines on progress on a very complex task: synthetic life. We use the analogy:

human minds : AGI :: natural bacteria : synthetic bacteria

Specifically, we can treat the general task of making AGI (which, to be clear, is a ~maximally bad thing to do) as analogous to:

the task of producing, artificially from scratch, a bacterium-like object that has all the impressive capabilities of biotic bacteria, such as growing and dividing, self-repairing, avoiding toxins and predators, evolving novel complex characteristics to perform well in many different niches, competing against other life for space and resources, etc., but that is “very unlike” natural bacteria / is produced “very independently from” natural bacteria.

4 Some things I like about this analogy

There are big blobs of algorithmic complexity / understanding / ideas. Specifically, there’s the genome. More abstractly, there’s the ideas in the genome (e.g. chemical pathways, abstracted from specific enzymes).
Evolution poured a huge amount of experimentation into getting that big blob of ideas.
That big blob of ideas is not fully accessible / usable for us, just because we can read it off in some form.
- We can read a bacterial genome, or a brain connectome, but that doesn’t mean we can design another bacterium / mind that uses those ideas (except in a pretty narrow cheaty way, at best). (I think this is less true for synlife, because we really do understand somewhat more about genomes compared to minds, and can extract more abstract ideas from genomes compared to from, say, connectomes. It’s far from a perfect analogy.)
- We know that evolution worked by selection on variation in the DNA sequence. But that doesn’t mean we can get evolution’s results:
  - Doing it evolution’s way is crazy slow / compute intensive.
  - It’s not easy to get ahold of evolution’s training signal; the training signal is complex, subtle, and high compute. Poor replacements for that signal get incremental gains but don’t get the deeper gains.
The big blob of ideas contains a bunch of superfluous stuff: Redundant mechanisms, random damage / suboptimal settings, functional but kinda arbitrary choices (with lots of similarly good alternatives), mechanisms with unnecessary functions.
- Therefore, we expect to have much less total work to do than evolution did in evolving bacteria. (And AI researchers have much less total work than evolution for making human minds.)
- However, we expect to have some large amount of total work to do.
- And, we cannot tell which is which.
- And, the fact that we can think/design/experiment/compress more abstractly and efficiently than evolution, and can avoid a bunch of the work, does not say that much about how close we are, because the default is that there’s a big blob that you have to invent.
- Just because we could bypass a bunch of algorithmic complexity, doesn’t mean we magically do so. You’d still have to figure out how to do so.
- Retreating to “bitter-lesson” type arguments/plans also retreats away from arguments that we’re doing things more efficiently than evolution.
It’s not exactly clear what counts as success, but it feels like there’s some big accomplishment or bundle of big accomplishments that would qualify, and a bunch of cool-but-not-successful things one could do.
- (I think this is more true in the case of synthetic life than AGI. For AGI, we basically mean “controls the lightcone”. For synthetic life, we could mean various things like “is made from entirely synthetic elements (as opposed to just synthetic DNA inserted into a living cell)” or “doesn’t use any normal proteins” or “doesn’t use DNA” or similar. It’s far from a perfect analogy.)
There are counterintuitive progress curves.
- There are things that sound like huge progress / most of the way there, but they don’t necessarily imply that you’re anywhere near some later milestone.
- For example, in 2010 the J. Craig Venter Institute produced a “synthetic cell” (that is, a cell whose genome was synthesized by stitching together chemically assembled DNA segments). In 2016, they did the same thing but they also deleted a bunch of DNA, “making it the smallest genome of any self-replicating organism”. But how far off are the more ambitious versions? Who knows.
  - In particular, because the work of designing genes (chemical pathways, regulatory networks, growth and division programs, etc.) is copied and not actually performed, you have that the final performance is perfectly real (the cell-with-synthetic-genome really lives; the LLM really can program computers (for some tasks)) but is weirdly non-indicative of the designer’s ability to design powerful artifacts.
- AlphaFold is some sort of big advance. But does it mean you’re about to get synthetic life, just because suddenly a bunch of protein folding questions went from IDK / costly to measure, to a very cheap pretty-good guess? I doubt it. There’s still a ton of design work.
- In general, you can get various sigmoids at different times, of different sizes, and of different frequency (thence, different smoothness after being all added up together).
There are ways to “cheat”.
- For AGI, there’s brain uploading, or neuromorphic AI / partial uploading.
- For synthetic life, there’s e.g. JCVI’s strategy of copying existing genomes.
It’s unclear how to point at specific “blockers”, but there are definitely blockers, which you can tell because we aren’t running around using drugs manufactured inside synthetic alien bacteria / running around being dead from AGI.
One could easily imagine the familiar game of “goalpost moving” in this setting. E.g.:
- A: What can SynLife2026 not do? How do you know it’s “not really life”? What’s the least impressive thing you think it won’t do next year?
- B: Sheesh, IDK. I mean, it can’t process fructose right now…
- A: [a few months later] Aha! This new paper has an oil blob with some enzymes in it that can process fructose!
- B: Ok
- A: So now it’s synthetic life, right?
- B: No
- A: Something something goalposts something something complete
It gives some sense for why benchmarks are hard to interpret / don’t necessarily say all that much.
- For example, you could imagine someone creating a type of lipid that
  - in free form, gradually sticks to other copies of itself more and more,
  - forms micelles or liposomes when aggregated,
  - and splits into multiple liposomes.
- Is this progress towards synthetic life? Surely it’s some kind of progress. How much? How can you tell?
- You could make impressive videos, where the lipid-micelle-splitting video looks intuitively much more like life than what we had previously. This doesn’t actually tell you very much though.
It gives some sense for why “the bitter lesson” doesn’t say all that much. Sure, phage-assisted continuous evolution is very cool and outperforms human designers at least in some cases—but that doesn’t really bear on whether you can just point PACE at “make synthetic life”. You’re still confused, just at a higher level.
It’s genuinely very unclear when it will happen. Maybe someone will announce something that seems like true synthetic life next year, but probably not; or maybe it will take 20 years or 50 years or more.
There are plenty of ways to make substantial, legible, incremental progress on various benchmarks and subtasks.
- E.g., you could invent enzymes to mimic yet another biochemical pathway or something. How much this contributes to progress on the overall task is unclear.
- Pointing at line go up isn’t that much of an argument because the issue is that you’re probably pointing at the wrong lines and you haven’t explained why your line is a / the right line.
- A lot of goodharty ways of making progress don’t really contribute much at all. It’s not obvious, on the face of it, what the difference between goodharting and non-goodharting would be, but it’s definitely a thing.
The profile of capabilities is weird.
- There’s no natural bacterium with a profile of capabilities (chemical processing, resource acquisition, locomotion, defense, repair, growth) that corresponds at all well to the capabilities profile of SynLife2026 overall or of any particular instance of quasi-synthetic-life.
- You can point to supernatural capabilities of synlife in several areas. Maybe many specific artificial chemical processing pathways are much faster / more efficient / less expensive / higher purity than biological pathways.
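The remark above about sigmoids of different sizes arriving at different times can be illustrated with a toy calculation. This is only a sketch: the three components below are made-up numbers, not data about any real research field, and the function names are hypothetical.

```python
import math


def sigmoid(t: float, midpoint: float, scale: float, height: float) -> float:
    """One logistic 'advance': rises from 0 to height around t = midpoint."""
    return height / (1.0 + math.exp(-(t - midpoint) / scale))


def total_progress(t: float, components) -> float:
    """Sum of several sigmoids, each given as (midpoint, scale, height)."""
    return sum(sigmoid(t, m, s, h) for (m, s, h) in components)


# Hypothetical advances: an early small one, a big slow one, a late sharp one.
EXAMPLE_COMPONENTS = [(2.0, 0.3, 1.0), (6.0, 1.5, 5.0), (9.0, 0.2, 2.0)]

if __name__ == "__main__":
    for t in range(0, 13):
        print(t, round(total_progress(float(t), EXAMPLE_COMPONENTS), 3))
```

Depending on which window of time you look at, the summed curve can look flat, smooth, or jumpy, which is part of why a single stretch of "line go up" says little about distance to a later milestone.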


Skill: cognitive black box flight recorder

tsvi bt Jan 24, 2026 Updated Jan 24, 2026


1 The flight recorder
2 Altered states and lost information
3 The black box recorder skill
4 Why black box info matters
5 Conclusion

Very short summary: It’s especially valuable to Notice while in mental states that make Noticing especially difficult, so it’s valuable to learn that skill.

Short summary: If you’re going to enter, or are currently in, a cognitive state that is very irrational / overwhelmed / degraded / constrained / poisoned / tribalistic / unendorsed / etc., then you may as well also keep a little part of yourself paying at least a bit of attention to what it’s like and what’s going on and recording that information, so that you get that sweet sweet juicy valuable data that’s hard to get.

1 The flight recorder

As legend has it, a black box (aka a flight recorder) is a device placed in an aircraft to record data from the flight (from measurement instruments or from voice recordings). If the aircraft crashes, most of the aircraft’s contents are vulnerable to being damaged or destroyed; but the black box is made of sturdier material, so it’s more likely to survive the crash. That way, information about the flight and what caused the crash is more likely to be preserved.

C’est une boîte noire.

When I’m able to, I practice something similar. If I’m in some sort of altered cognitive state, I try to “leave the black box recorder on”. That way, even if a lot of information gets destroyed or lost, I’ve at least gained a bit more information.

2 Altered states and lost information

Some examples of the “altered cognitive states” that I mean:

In some sort of heated political situation, where people are doing hostile actions and you have an instinct to join sides in a conflict.
In a debate with someone you don’t like, and they maybe kinda have a point, but you also don’t want to admit it for some reason.
In a fight with someone you care about, and you’re vulnerable and defensive and upset and feeling pressured.
In a really weird mood and having a weird conversation that doesn’t seem like your normal way of talking.

Similarly to a plane crash, often, after leaving a state like this, a bunch of information is lost. Examples of reasons that info is lost:

You were distorting your cognition by strategically blinding yourself. Examples:
- Rationalizing
- Pretending, preference falsifying
- Taking a posture for negotiating or territorial purposes
- Protecting something important in a bucket
You were just overwhelmed and didn’t have the spare attention to remember what was happening.
You were altered in a way that changed how you would encode memories.
- E.g. you were viewing things through an adversarial lens, which changed your first-blush interpretation of events.
- E.g. you had unusual access to some desire or perception.
- In general, you had a different cognitive context than usual.

3 The black box recorder skill

To partially counter this loss of info, there’s this mental motion of “turning on the black box recorder”. This is a subspecies of the general skill of Noticing, and shares many properties. Some notes specifically on how to do the black box recorder skill:

TAP: notice that you’re entering an altered state where you might have especially distorted perceptions / memories → turn on the black box recorder (somehow).
TAP: notice that you’re already in an altered state → turn on the black box (somehow).
Remind yourself of the special, non-obvious value of having black box data. For me, that’s a kind of cooperativeness or generosity: Even if the data feels useless or like a distraction in the moment and doesn’t help me with my current situation, saving the data is something I can do to benefit others (my future self, or other people) in future similar situations.
Because you’re in an altered state, usually with fewer attentional resources to spare, you may have to ask less of your Noticing skill. For example:
- Sometimes just go for more episodic and concrete memories, rather than high abstraction and narrativizing. More “I said X and he said Y and I said Z and then I walked across the room.”, and less “He was trying to get me to believe A but I saw through him.”.
- If you’re also doing abstract narrativizing, don’t try to fight that. Just, if you can, add an extra metacognitive tag on those things, like “At this point [[I had an interpretation that]] he was trying to get me to believe A…”.
- Offload interpretation to later, and just try to save the data. E.g. generating alternative hypotheses is always good, but can be difficult in the moment; you may have to do it later.
You may need to make more space for remembering accurately and objectively, by neglecting certain duties you might usually attach to the pursuit of truth. Examples:
- You don’t have to be fully fair, accurate, or complete in your memories. The idea is to get more info than the default. If you have some sense of nagging doubts or curiosities—the sort of thing you’d normally want to pause and follow up on, but that you can’t investigate in the moment—just record that fact.
- You will not have to later capitulate due to this information. You can gain more clarity about what’s actually happening, what is going on in your mind, how your perceptions are distorted, how the other might be more sympathetic, and so on, while still firmly standing your ground.
- You don’t have to share or act on this information; it’s private by default.
- Some normal ethical rules apply less strongly / more ambiguously to this information. For example, you might record “Here I was not admitting that she was right about X, even though at this point I knew she was, because I didn’t like the implication.”, without also saying that out loud, even though normally you’d always say that out loud. It’s better to do something to improve your behavior, but also it’s better to notice and do nothing than to not notice and also do nothing.
- (That said, this can be morally fraught. A black box recorder is not an excuse to do bad things or shirk duties. The black box is just for improving over what is sometimes the default of losing the info altogether. The types of information that you’re only getting because you have a black box recorder might change over time; it’s still a moral duty to wrap your consciousness around yourself more and more, it’s just that this moral duty applies to slower behavior / longer timescales.)

4 Why black box info matters

For the most part, black box records matter for all the same reasons as Noticing matters in general. There are some important differences:

Flight recorder info is especially useful because it comes from cognitive states that occur during important events, where you’re likely to make consequential mistakes or have opportunities for consequential improvement.
Flight recorder info is especially difficult to get, basically by definition, because it comes from cognitive states where the default is to get sparse / degraded / distorted information.
Flight recorder info is exceptionally rare to be recorded, because the skill itself is rare; there’s a correlated failure among different people, where people en masse neglect the skill.

For these reasons, the black box flight recorder skill is potentially especially useful to develop. It could help surprisingly much for things like debugging, symmetrization, empathy, integrating with yourself, and understanding others’ strange / faulty behavior.

As an example, you might turn on your flight recorder while engaging with politics. You could then notice a kind of path dependence, like this:

[I saw current event X → my initial exposure to X made it seem like quite a hostile event → I took a particular stance to the event and people involved, in response to my initial interpretation → later I found out that X was still bad but not quite as bad and coming from a more specific sector than I initially realized → I then believed I ought to have a narrower, more targeted response, and yet I still had a strong intuitive inclination toward the broader response] → (later) from all of that, I’ve learned a general pattern; maybe this is what it’s like for other people, on any political side (which doesn’t make it right or acceptable, but at least I have a better map, and can see how it might happen differently for people with different information contexts, social contexts, personality traits, etc.).

5 Conclusion

Memory is cool.

Curious if other people do this.

Lifelink™: Freedom for your Child

tsvi bt Jan 18, 2026 Updated Jan 18, 2026

Note: Fictional! To preempt any unnecessary disappointment and/or fears of dystopia, be aware that this is not a real product, I don’t know of plans to develop it, and it is infeasible in many respects. There are some related products under search terms like “kids GPS smartwatch” and “safety monitor”.

Do you want your child to have free rein to wander in nature or explore the town? Are you worried about your child getting lost, or injured, or worse? Have you heard horror stories about CPS?

Introducing Lifelink™, the undisputed best-in-class FRC wearable safety link for independent children. Give your child the gift of secure autonomy today. Device FREE with subscription. Features include:

Options for necklace, bracelet, pocket, glasses, or anklet wearables. (Check out our multi-wearable packages for savings!)
Connectivity and real-time location tracking ANYWHERE through our worldwide affiliate system.
Military-grade rugged construction—waterproof, shockproof, fireproof, impact-proof, guaranteed.
Very difficult to remove without the passcode or remote parental release—criminals will stay away—LockPickingLawyer approved! Hardwired tamper alerts and location updates sent straight to you immediately so you know if anyone is trying to take away your child’s protection.
Patented custom bioconformation form factor—Lifelink™ will sit flush with your child’s skin, and will NOT get snagged! (Smart breakaway features for extraordinary circumstances optional.)
Child Protective Services CANNOT investigate you solely for having an unattended child over the age of 5 in public areas if they are wearing a Lifelink™! New FRC probable-cause laws currently in effect in these states: CA, IL, TN, TX, UT, VA, WA. Know your rights! More states coming soon.
Neuromorphic chip with hardwired low-power super-distilled LLM, activated by voice or unusual noises, checks if your child might be in a dangerous situation (including asking your child for assurance) and sends telemetry to our command center, where a full-power LLM ensemble and big data predictive models will alert you if your child might be at elevated risk (all computations homomorphically encrypted for your complete security!).
NEW: Police coordination. In participating jurisdictions, police are trained to allow children to roam freely in safe public areas unattended if they are wearing a Lifelink™, and many stations will have a hotline specifically for Lifelink™ S.O.S. signals and can directly receive location information. Currently available in these cities, with more to come: Berkeley, CA; San Diego, CA; Denver, CO; Boston, MA; Austin, TX; and Seattle, WA.
All vitals tracked! (Pulse, blood oxygen, temperature, hydration) You’ll get immediate alerts about any dangerous levels.
Integrated submersion detector, compass, clock, thermometer, barometer, air quality sensor (CO₂, CO, and particulates), and dual silicon diode / micro Geiger–Müller tube nuclear radiation sensors.
All data fully end-to-end encrypted. ONLY YOU HAVE ACCESS! Check our website for canaries.
Simple, easy S.O.S. button for your child to call for urgent help if need be.
Parents can push an alert sound or “come back home” call.
Two-way audio calling for emergency check-ins.
Lean5-verified firmware—rest assured, your child’s Lifelink™ will never freeze or crash! (Locator beacons also functional with the hardwired backup system.)
3-day power supply, using breakthrough Lithium-Oxygen BREATHABLE battery for ultra-efficient energy density, plus small backup standard anaerobic battery.
Automatic solar, motion, and thermal recharging, plus active-fidget recharging.
Smart systems conserve power by rationing scans, pingbacks, and data, to stay fully focused on reliable safety-critical communication.
Ultra-low-power 433 MHz narrowband radio backup locator beacon for true emergencies, dead batteries, or surprise connectivity dead zones. ALWAYS know where your child is!
Simple, recognizable, adjustable audio alarms for low battery power, low connectivity, dangerous weather conditions, or nearby dead zones or danger areas. Parents also alerted.
Your child can press a button to get an audio update on any nearby dead zones, danger areas, or parent-designated areas to avoid.

For as low as $25 per month plus equipment shipping, you’ll get:

DRONE RESCUE: In emergencies, if there are available reconnaissance drones, they will fly to your child’s location and send video and audio updates to you, as well as broadcast to communicate to your child, warn criminals, or warn CPS agents.
A full subscription to our comprehensive connectivity package for your child’s Lifelink™ wearables. This includes:
- All major satellite connectivity providers, including Starlink, Iridium, and Globalstar
- Most major cell providers, including Verizon, AT&T, and T-Mobile
- Automatic connections to all public meshnets
$50,000 FRC legal defense insurance, access to our specialized attorneys, and a cryptographically signed Safety Log that has precedent in state courts as admissible evidence of your child’s continuous supervision via Lifelink™.
Unlimited warranty—replace your child’s wearable at any time, for any reason, no questions asked**.
Unlimited size upgrades! We know your child is growing fast, and we’re ready with the equipment they need to be safe and free**.
In the management app, see connectivity dead spots (rare!) and crime or injury danger spots. Educate your child about areas to avoid.
Training games for parents and children to learn how to use Lifelink™ together.
DATA DASHBOARD: see your child’s history in telemetry, including location and vitals. View data analysis insights from our personalized data assistant.
Access to opt-in FRC buddy network, with approval! Lets your child find other nearby parent-approved children who also have Lifelink™. Rigorous expert-vetted verification system.
All software updates, free!
Export your data any time.

With our Pro package, you’ll get everything in the Basic plan, plus:

PRIORITY drone rescue.
Real-time professional human monitoring. At the first sign of danger your care team will alert you or your delegate.
Any set of wearables, up to ten per child at a time, for all your multi-wearable, fashion, and backup needs.
Wearable antenna clothes for extra security. In extraordinary events, your child’s clothing can serve as an ultra-low-power emergency long-range 133 MHz narrowband radio locator beacon.
Whispernet, meshnet, and universal commercial Wi-Fi passcodes updated daily.
No limit on size and style upgrades, and no shipping cost for replacements!
Access to JAILBROKEN wearables (voids software warranty).

**Limit 1 (one) replacement per month. Void with intentional destruction of equipment or software hacking. Shipping not included.

What potent consumer technologies have long remained inaccessible?

tsvi bt Jan 12, 2026 Updated Jan 12, 2026

1 Context
2 The question
3 Some examples
4 Assorted thoughts

1 Context

Inequality is a common and legitimate worry that people have about reprogenetic technology. Will rich people have super healthy smart kids, and leave everyone else behind over time?

Intuitively, this will not happen. Reprogenetics will likely be similar to most other technologies: At first it will be very expensive (and less effective); then, after an initial period of perhaps a decade or two, it will become much less expensive. While rich people will have earlier access, in the longer run the benefit to the non-rich in aggregate will be far greater than the benefit to the rich in aggregate, as has been the case with plumbing, electricity, cars, computers, phones, and so on.

But, is that right? Will reprogenetics stay very expensive, and therefore only be accessible to the very wealthy? Or, under what circumstances will reprogenetics be inaccessible, and how can it be made accessible?

2 The question

To help think about this question, I’d like to know examples of past technologies that stayed inaccessible, even though people would have wanted to buy them.

Can you think of examples of technologies that have strongly disproportionately benefited very rich people for several decades?

Let’s be more precise, in order to get at the interesting examples. We’re trying to falsify some hypothesis-blob along the lines of:

Reprogenetics can technically be made accessible, and there will be opportunity to do so, and there will be strong incentive to do so. No interesting (powerful, genuine, worthwhile, compounding) technologies that meet those criteria ever greatly disproportionately benefit rich people for several decades. Therefore reprogenetics will not do that either.

So, to falsify this hypothesis-blob, let’s stipulate that we’re looking for examples of a technology such that:

…it could be made accessible.
- In other words, there’s no clear obstacle to it being accessible to many people inexpensively.
- For example, we exclude all new products—anything that’s only been offered at all for less than, say, 10 years or something. There has to have been sufficient opportunity for people to make it accessible.
- For example, we exclude space travel. For the time being, it’s intrinsically extremely expensive.
- For example, we exclude gold-flaked ice cream, because gold is just rare.
- However, enforced / artificial scarcity could be interesting as an edge case (if it’s a genuine technology).
…people have, prima facie, had plenty of incentive to make it accessible.
- In other words, there should be a substantial market demand for the technology. Otherwise, it’s probably clear enough why it hasn’t been made accessible—probably no one tried.
- (If there’s some complicated or unintuitive reason that people don’t actually have an incentive to innovate despite unmet demand, we include that; such an example would be revealing about why this situation can occur.)
- For example, we exclude expensive medical treatments for super-rare diseases.
…it is very expensive to access, but rich people can access it.
- This could be for basically any reason. The product itself might be high-priced, or it might be highly regulated so that you have to fly to some remote regulatory regime to access it.
- I’m not sure what the bar should be. $50K definitely qualifies as expensive. $5K is much more ambiguous, and I’d lean towards no because many people have cars that are more expensive. (They finance their cars, but we could also finance reprogenetics.)
…it is actually a genuine technology, rather than being just a really big expenditure.
- For example, we don’t include yachts. We also don’t include technologies that are somehow very yacht-specific.
- We don’t include diamond-studded or gold-leafed anything.
…it is much more beneficial compared to analogous inexpensive products.
- E.g. we exclude a $10 million car that’s mainly expensive because of branding, status signaling, etc., and doesn’t have much significant technological advantage over a $100k car.
- But we do include expensive medical treatments that are much more effective than a slightly effective cheap treatment.
…ideally, it gives the user of the technology some additional compounding advantage over non-users.
- E.g. computers, nutrition, education, training, health, etc. The point is to model the “runaway inequality” aspect.

We can relax one or more of these criteria somewhat and still get interesting answers. E.g. we can relax “could be made accessible” and look into why some given technology cannot be made accessible.
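
As a rough sketch of how these criteria could be used as a checklist (including relaxing some of them), here is a small Python example. The field names are my own shorthand for the criteria above, the example candidate is entirely hypothetical, and the yes/no judgments would in practice be fuzzy rather than boolean.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Candidate:
    """Yes/no judgments for one candidate technology (shorthand field names)."""
    could_be_made_accessible: bool
    incentive_to_make_accessible: bool
    expensive_but_rich_can_access: bool
    genuine_technology: bool
    much_more_beneficial_than_cheap_analogues: bool
    compounding_advantage: bool  # listed as "ideally" above


def failed_criteria(c: Candidate, relaxed: frozenset = frozenset()) -> list:
    """Return the names of criteria the candidate fails, skipping relaxed ones."""
    return [
        f.name
        for f in fields(c)
        if f.name not in relaxed and not getattr(c, f.name)
    ]


# Hypothetical candidate: meets everything except "could be made accessible".
example = Candidate(
    could_be_made_accessible=False,
    incentive_to_make_accessible=True,
    expensive_but_rich_can_access=True,
    genuine_technology=True,
    much_more_beneficial_than_cheap_analogues=True,
    compounding_advantage=True,
)

strict = failed_criteria(example)
loose = failed_criteria(example, relaxed=frozenset({"could_be_made_accessible"}))
print(strict)  # ['could_be_made_accessible']
print(loose)   # []
```

A candidate that only passes once a criterion is relaxed is still interesting; the relaxed criterion is then the thing to look into.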

3 Some examples

The Bloomberg Terminal. (But this was more like artificial scarcity, IIUC.)
Fast exchange connections for high-frequency trading. (Not sure if this qualifies.)
Prophylactic medical testing. E.g. MRI scans (something like a few thousand dollars).
Supersonic flights?
IVF (can cost in the ballpark of $20k for one baby).
IVIG infusion (biologically scarce?), continuous glucose monitoring, monoclonal antibodies, various cancer treatments.
Cosmetic medical procedures. (However, actually these tend to be basically accessible, just “kinda expensive”.)
- Plastic surgery.
- Advanced dental care
  - Invisalign
  - Dental implants
- Hair implants
- LASIK
Home automation systems?

What are some other examples?

4 Assorted thoughts

In general, necessary medical procedures tend to be largely covered by insurance. But that doesn’t mean they aren’t prohibitively expensive for non-rich people. Cancer patients especially tend to experience “financial toxicity”: they can’t easily afford all their treatments, so they are stressed out, might not get all their treatments, and die more. Drug prices also rise through some mysterious process, for unclear reasons [1] (maybe drug companies just raise the price when they can get away with it). This would be more of a political / economic issue, not an issue with the underlying technologies.

Some of these medical things, especially IVF, are kinda worrisome in connection with reprogenetics. Reprogenetics would be an elective procedure, like IVF, which requires expert labor and special equipment. It probably wouldn’t be covered by insurance, at least for a while—IVF IIUC is a mixed bag, but coverage is increasing. This suggests that there should maybe be a push to include reprogenetics in medical insurance policies.

Of course, there are many technologies where rich people get early access; that’s to be expected and isn’t that bad. It’s especially not that bad in reprogenetics, because any compounding gains would accumulate on the timescale of generations, whereas the technology would advance in years.

[1] Lalani, Hussain S., Massimiliano Russo, Rishi J. Desai, Aaron S. Kesselheim, and Benjamin N. Rome. “Association between Changes in Prices and Out‐of‐pocket Costs for Brand‐name Clinician‐administered Drugs.” Health Services Research 59, no. 6 (2024): e14279. https://doi.org/10.1111/1475-6773.14279.

HIA and X-risk part 2: Why it hurts

tsvi bt Jan 8, 2026 Updated Jan 8, 2026

1 Context
- 1.1 Questions for the reader
- 1.2 Caveats
2 What is HIA?
3 AGI X-risk
- 3.1 Background assumptions
- 3.2 Red vs. Blue AGI capabilities research
4 An ontology of effects of interventions on world processes
5 Processes
6 Some plausible bad effects of HIA on processes
7 Other arguments
8 Acknowledgements

1 Context

Previously, in “HIA and X-risk part 1: Why it helps”, I laid out the reasons I think human intelligence amplification would decrease existential risk from AGI. Here I’ll give all the reasons I can currently think of that HIA might plausibly increase AGI X-risk.

1.1 Questions for the reader

Did I miss any important reasons to think that HIA would increase existential risk from AGI?
Which reasons seem most worrisome to you (e.g. demand more investigation, demand efforts to avert)?
Which reasons, if any, are cruxy for you, i.e. they might make you think human intelligence amplification is net negative in expectation? Up for a live discussion / debate?

1.2 Caveats

The world is very complicated and chaotic and I can’t plausibly answer even important questions like “what actual effect would such and such have”. I can’t even plausibly resolve much uncertainty, and the world is full of agents who will adaptively do surprising things. So the actual search procedure is something like: What is a way, or reason to think, that HIA might increase AGI X-risk, that could plausibly hypothetically convince me that HIA is bad to do? This is mostly a breadth-first search, with a bit of deeper thinking.

In particular, many of the reasons listed below, as they are presented, are, according to my actual beliefs, not true or misrepresented or misemphasized. That said, this is an attempt at True Doubt, which partly succeeded; some of the reasons listed do give me some real pause.

This is a similar project to “Potential perils of germline genomic engineering”. As in that case, keep in mind that part of the reason for this exploration is not just to answer “Should we do this, yes or no?”, but also to answer “As we do this, how do we do this in a beneficial way?”. See “To be sharpened by true criticisms” in “Genomic emancipation”.

2 What is HIA?

I’ll generally leave it fairly undefined what intelligence is, and what human intelligence amplification is. See “Overview of strong human intelligence amplification methods” for some concrete methods that might be used to implement HIA; those methods suggest (various different) specific functional meanings of intelligence and HIA.

Because we are being imprecise:

Critiques of HIA can bring up many possibilities—e.g. they could claim that HIA would tend to also affect some other trait for the worse.
Defenses of HIA can also bring up many possibilities—e.g. they could say “HIA is good if done in such-and-such specific way that falls under the general category”.

2.1 Vague definitions of intelligence and HIA

Vaguely speaking, by HIA I mean any method for increasing a living human’s intelligence, or for making some future people who in expectation have a higher intelligence than they would have otherwise had by default. Generally, we’re discussing strong HIA, meaning that the increase in intelligence is large—imagine 30 or 50 IQ points, so going from average to genius or genius to world-class genius.

Vaguely speaking, by intelligence I mean someone’s ability to solve problems that are bottlenecked on cognition (as opposed to physical strength or stamina, or financial resources, etc.). A priori, this could include the whole range of cognitive problem-solving. So, we include stereotypically IQ-style cognitive problems, like math or engineering. But, we also include for example the ability and inclination towards political charisma, wisdom, good judgement, philosophical ability, learning, questioning and attending to something steadily, creativity, good performance under stress, empathy, contributing well to teams, memory, taste, and speed.

On the other hand, we do not include other cognitive traits, such as kindness, agreeableness, emotional valence, emotional regulation, determination, conscientiousness, and so on. These are important traits in general, and it might be good to also give people the tools to influence themselves on those traits (though that might also be fraught due to coercion risks). But this article is focused more narrowly on intelligence, rather than all cognitive traits.

In practice, intelligence refers to whatever we can reasonably easily measure. If a trait is hard to measure, it’s hard to increase. (This is indeed a cause for concern, in that only increasing traits that are easily measured could be distortive somehow; the claim under discussion is whether HIA is good even under this restriction.) More specifically, intelligence refers to IQ, because IQ is fairly easy to measure. IQ is far from capturing everything about someone’s ability to solve problems that are bottlenecked on cognition. But in this article we take it for granted that IQ is a significant factor in those abilities, and we presume that IQ can be increased.

2.2 HIA as a general access good

One dimension we will fix is distribution: We will assume that HIA comes in an open access way. In other words, defenses of HIA can’t say “we’ll only give HIA to the people who are morally good, and therefore there will be a bunch more brainpower directed in a morally good way”. That’s because a restricted access implementation of HIA seems largely infeasible and also morally and ethically very fraught.

I don’t think it’s an absolute principle that if you come up with an effective HIA method you have to immediately share it with everyone. But I do think there’s a strong moral weight towards doing so; and there’s separately a politico-ethical weight (meaning roughly “it’s not the sort of thing you should do as a member in good standing of society, even if it’s moral, because it would justifiably cause a lot of conflict”). Because of the politico-ethical weight especially, in many scenarios it seems logistically infeasible to do very much selection of who gets access.

This ethical weight towards general access is strongly increased in the case of reprogenetics. Reprogenetics is inherently a multi-use technology, and is already being used by polygenic embryo screening companies to enable parents to decrease disease risks in their future children. This means that society has a very strong justified interest in reprogenetics being equal-access (in the medium-term, once the initial expensive development stages have been completed). Since reprogenetics is likely to be the most feasible HIA method (see “Overview of strong human intelligence amplification methods”), open access seems like a reasonable mainline assumption.

Finally, open-access HIA might be harder to defend as helpful for decreasing AGI existential risk, compared to some sort of hypothetical restricted-access HIA. So, defending the claim that even open-access HIA decreases X-risk is a stricter test; if passed, it should provide stronger evidence that HIA is good to pursue.

2.3 HIA and reprogenetics

Since reprogenetics is likely to be the most feasible strong HIA method, it’s hard to discuss HIA in general completely separately from reprogenetics. The type of HIA available, the timing of its advent, what other traits can be influenced, and how society will react are all potentially heavily affected if the method is reprogenetics specifically.

Still, as much as possible, this article aims to discuss the impact of HIA in general, factoring out impacts from any specific HIA methods. For thoughts on the downside risks of reprogenetics, see “Potential perils of germline genomic engineering”.

3 AGI X-risk

3.1 Background assumptions

This exploration assumes:

If anyone builds genuine AGI, everyone dies, unless AGI alignment has been solved.
AGI alignment is extremely technically difficult to solve.
People will continue pursuing AGI research, for various reasons, unless those reasons are removed and/or there are very strong reasons to not pursue AGI research.
There’s a substantial probability of AGI coming in the next 10 years, and also a substantial probability of AGI not coming for many decades (and anything in between).
The top strategic priority is to avoid building unaligned AGI.
There’s something called “AGI capabilities research”, meaning “research that adds to the technical understanding of humanity about how to make AGI”.
AGI capabilities research is always bad because it ticks the global clock forward towards AGI.

3.2 Red vs. Blue AGI capabilities research

I want to introduce a piece of vague terminology to help with discussing the strategic landscape. Very vaguely speaking, there’s a spectrum of AGI capabilities from “Red” (near-term, big training runs, lots of attention) to “Blue” (blue-sky research). It’s of course far from actually one-dimensional, and some entries in the below table are quite debatable (i.e. maybe the two entries should be swapped). Still, I want to use this one dimension as a rough-and-ready way to divide the space up by one degree. To give more flavor of the dimension:

| Red | Blue |
| --- | --- |
| hot, active, fast-paced | cool, gradual, slow |
| happens at companies | happens in academia |
| has a big pile of resources (large amounts of money, compute, research talent, software engineers) | doesn’t have a big pile of resources |
| can effectively deploy a big pile of resources | can’t very effectively deploy a big pile of resources |
| requires a big pile of resources | can continue with a small pile of resources |
| does PR | doesn’t do PR |
| seeks and gets lots of attention | doesn’t seek or get much attention |
| exploit | explore |
| concentrated and siloed in a few large organizations and a few large projects within those organizations | diffuse; lots of small labs and individual researchers sharing ideas more openly and piecemeal |
| visible, legible | hidden, illegible (happens in colleague discussions, obliquely discussed in math/CS journals) |
| compute-based; practical; experimental | understanding-based; conceptual |
| make products | make publishable ideas |
| weakly or mediumly contributes to deep AGI capabilities progress | strongly contributes to deep AGI capabilities progress |
| gathers steam more quickly | gathers steam more slowly |
| likely to complete the last mile of research to an intelligence explosion | less likely to complete the last mile of AGI research |

A plausible (AFAIK) first-approximation model is that at any given time, Red research is the most likely to set off an intelligence explosion. Red research takes existing ideas that have already been somewhat proven, and then cranks them up to 11 to see what happens. On the other hand, Blue research is most likely to contribute to getting to AGI in the longer run.

Red research is easier to regulate than Blue research. That’s because Red research requires big piles of resources, and is generally more visible (PR, products, large salaries, brand recognition). In particular, the physical needs of a large datacenter (energy, heat, chips) can be detected and regulated. Blue research can be carried out with consumer computers and via intellectual discourse, and it uses more specialized theoretical ideas, so it is harder to detect or even define.

4 An ontology of effects of interventions on world processes

In general, with some strategic intervention, the question arises: What processes in the world does this intervention speed up / support, and what processes does the intervention slow down / disrupt?

To a rough first approximation, the intervention is good if and only if the expected net change in all the speeds of the affected processes is a good change. So, we can get a rough guess for the value of an intervention by making guesses at how it affects each separate world process. Then, an argument that HIA is bad takes the form “This process is bad and is especially accelerated by HIA” or “This process is good and is especially decelerated by HIA”, or “Process X is worse than process Y and process X is accelerated by HIA more than process Y is accelerated”.
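
As a toy illustration of this first-approximation accounting, one could tabulate a guess per process and add them up. This is only a sketch: the process labels and numbers below are made-up placeholders, and scoring each process as "value times change in speed" is an assumed simplification, not a claim about how the real effects combine.

```python
from dataclasses import dataclass


@dataclass
class ProcessGuess:
    """A rough guess about one world process (all numbers are hypothetical)."""
    name: str
    value_per_unit_speed: float  # > 0 if speeding this process up is good, < 0 if bad
    speed_change: float          # guessed change in speed caused by the intervention


def net_effect(guesses):
    """Sum over processes of (value of the process) * (guessed change in its speed)."""
    return sum(g.value_per_unit_speed * g.speed_change for g in guesses)


# Placeholder guesses, only to show the shape of the argument.
guesses = [
    ProcessGuess("good process, accelerated", +1.0, +0.5),
    ProcessGuess("bad process, accelerated", -1.0, +0.25),
    ProcessGuess("good process, decelerated", +1.0, -0.125),
]

score = net_effect(guesses)
print(f"net effect: {score:+.3f}")  # positive means "good on net" in this toy model
```

In this framing, an argument that HIA is bad amounts to arguing for a large negative term, for example a bad process with a large positive speed change.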

The next subsection will make some general remarks about the meaning of “acceleration”. The following two subsections will give a list of categories of ways that HIA could affect the speed of some process. (They don’t try to present a comprehensive ontology; I just think dividing up the space somewhat, even a bit arbitrarily, is helpful because it makes it easier to think in terms of specifics while also searching broadly through much of the whole space.)

4.1 The meaning of “acceleration”

To get some kind of handle on the menagerie of plausible effects of HIA, I’ll give a list of categories of ways that HIA could affect the speed of some process. These will be phrased as which processes “are accelerated” by HIA. This is vague, for convenience, but some notes to clarify a bit:

The way that HIA affects processes might change over time, so for each of these categories, we could ask how it will change by the time HIA starts affecting processes.
The basic point of comparison is the world without HIA.
But we might also want to discuss whether a process is accelerated by HIA more than “all processes” are accelerated by HIA “on average”.
We might also want to discuss which processes are accelerated more than we might have expected according to some simple rule.
A comparison between rates (in the HIA world vs. the non-HIA world, or between different processes, or in anticipation vs. reality) is vague. But the comparison could for example mean that the timeline until some important event within a process moves sooner in absolute terms or proportionally relative to other landmark events in that process; or the comparison could mean races between different processes tilting in favor of one or another.
If there’s some feature of a process that implies the process will be accelerated by HIA, then a weaker form or negated form of that feature might make a process tend to be accelerated less or decelerated.

4.2 Effects of HIA on a single process

Some processes are accelerated because added per-capita brainpower directly solves more problems within that process. E.g. because…
- …the process makes good use of brainpower (e.g. onboards it well, allocates it well, heeds it well, supports it well);
- …the process is bottlenecked on brainpower (as opposed to legwork, etc.);
- …the process is bottlenecked on high-caliber brainpower (e.g. math research);
- …the process has a structure of problems and solutions that lends itself to more brainpower (e.g. currently just below some threshold; more parallelizable).
Some processes are accelerated more because people who benefit from HIA tend to be personally inclined to contribute to those processes. E.g. because…
- …intelligence in general causes people to have those inclinations (e.g. being especially interested in or capable for certain kinds of activities);
- …the specific form of HIA affects interests or values (e.g. by emphasizing some aspects of cognitive performance over others, or by directly affecting interests);
- …HIA tends to be applied to people with certain characteristics or who are around other people with certain characteristics (e.g. people who choose HIA for themselves or for their children having certain interests, personality traits, or orientations to themselves or their children);
- …if someone benefits from HIA and knows it, then that causes them to think of themselves differently, including in terms of what processes to contribute to.
Some processes are accelerated more because society tends to cause people to contribute disproportionately much to those processes. E.g. because…
- …society rewards contributing to some processes (with money, social approval (lack of punishment), legal permission (lack of punishment), etc.);
- …society betrays or harms people in some ways, which affects their behavior (interests, hopes, ethics);
- …society is inadequate regarding some process, creating an incentive to contribute to that process;
- …society more indirectly shapes people, e.g. by instilling values.
For some processes, some degree of acceleration due to other first-order reasons will further compound into more (second-order) acceleration. E.g. because…
- …the process uses first-order acceleration to attract more resources (money, people, brainpower, political will), e.g. because it becomes more lucrative, interesting, worthwhile, or socially desirable as it makes faster progress;
- …the process has network effects, i.e. increasing returns to more people;
- …the process self-improves, as a community.
Some processes are stimulated as responses to the existence of HIA. E.g. because…
- …people want to intervene on the use of HIA itself (e.g. prevent or constrain it, gain access to it, impose it on others);
- …people want to intervene on people who get HIA (e.g. recruit them to work on a process, persecute them);
- …people want to intervene on the results of HIA (e.g. race to complete some project before HIA people intervene).

4.3 Effects of HIA involving multiple processes

One process might be directly causally downstream of another process. E.g.:
- One process directly inhibits another process (e.g. by punishing it, removing rewards for it, or persuading people to not contribute to it).
- One process directly activates another process (e.g. by recruiting for it, rewarding it, or persuading people to contribute to it).
Two processes interact indirectly. E.g.:
- They compete over resources.
- They push in opposite directions (e.g. on social opinion or regulation).
The relationship between two processes is altered. E.g.:
- Direct relationships (activation, inhibition) are broken or amplified (e.g. regulatory escape).
- One gains the upper hand over the other, winning out in competitions or conflicts.
- Race dynamics are shifted, where one process gains the lead in time over the other.
Shifts in many processes cause follow-on shifts. E.g.:
- Cumulative strain on a system, from many processes accelerating, causes it to tip over a threshold of collapse.
- An especially nimble, fast-adapting process is able to cope exceptionally well with general multi-process acceleration, gaining a relative advantage over other processes.

5 Processes

This is a list indicating some of the processes relevant to AGI X-risk:

Red research (see the subsection “Red vs. Blue AGI capabilities research”)
Blue research (see the subsection “Red vs. Blue AGI capabilities research”)
Alignment research
Society in general, or more narrow bodies:
- Doing well / poorly; abundance / scarcity
- Being stable / unstable
- Being wise / unwise; sane / insane
Making progress on X, for various X (medical research, technology, morals, etc.)
Conflicts (between various bodies)
For various X:
- Cognitive empathy by people against X for people in favor of X or vice versa
- Political will in favor of X or against X
- Convincing people of X
- For various bodies B (states, international coalitions, professional bodies, social strata):
  - Support of X by B (desire for, capacity for, or actual)
  - Regulation / stigma of X by B (desire for, capacity for, or actual)

6 Some plausible bad effects of HIA on processes

The following subsections list reasons to think that HIA would speed up / support risky processes more than it speeds up / supports derisky processes.

6.1 Speeding up Blue research

HIA would (a priori, in expectation) speed up all research. If progress on some research problem is more bottlenecked on very difficult ideas (compared to e.g. money, legwork, regulatory approval, etc.), then it will tend to be sped up by HIA more than another research problem that’s less bottlenecked on ideas. Therefore, at a guess, HIA would directly speed up Blue research relatively more than many other kinds of research (including Red research).

Smart people might tend to be most interested in endeavors that are individualistic, technical, ambitious, computer-y, and puzzle-y. So they’d tend to be drawn to AGI research. This effect might currently be amplified by society’s tendency to not naturally offer ideal social and economic niches for very smart people.

As a basic note, we observe people already being directed to Blue research, so by default we expect that to continue.

Blue research might also have some second-order self-acceleration effects. E.g. there would be intellectual network effects, and maybe some self-improvement effects via better credit assignment and resource allocation internal to the field. These effects might be relatively weak because Blue research is fairly diffuse, but still substantive. On the other hand, there might be significant “coordination overhang”: there could be a threshold effect, where with some difficult new ideas, a large number of small siloed Blue research groups could coordinate with each other. Since there’s far more absolute Blue research than alignment research, there’s more such overhang for Blue research.

Blue research is especially bad, because:

It’s hard to regulate.
It’s what ticks the world closer to AGI in the long-run.

6.2 Speeding up Red research

Relative to Blue research, Red research is less bottlenecked on very difficult ideas, so it gets less of a relative direct speedup.

However, Red is likely to have strong indirect acceleration. Because of money and status incentives, Red research attracts people. Red research is likely to attract excess big piles of resources to AGI capabilities. It will probably continue attracting investment as it gets applied to more sectors of the economy, and it gets applied more as it progresses more. It also gains social cachet.

As with Blue research, we observe people being directed to Red research. This observation is even more indicative of trends for Red research in particular, because Red research has upticked a lot recently. That means people are still being directed to Red research even in a memetic environment that already includes a lot of warnings about AGI X-risk. In particular, this suggests that kids who benefit from HIA, growing up in a memetic environment with X-risk warnings but also a very prominent money incentive to do AGI research, might tend to work on Red research.

Red researchers might be especially prone and able to take agentic, conflictual stances towards efforts to avert AGI X-risk. That’s because they are more concentrated, have more resources at hand, and tend to be more anti-social and greedy. For example:

Red is logistically relatively easier to regulate than Blue because it involves large concentrated piles of resources. However, it’s harder to socially prevent through stigma because it has large money incentives (which tend to overpower weak or medium stigma). Red may be harder to get regulation passed about because Red is especially concentrated, and therefore can apply larger point-forces to push on legislation.
Compared to most other processes, Red may be more likely to strategically and effectively target HIA people for recruitment, thus capturing more of the gains.
The existence of HIA might spur especially Red research to have more sense of urgency and go faster, out of a fear of being replaced as AGI leaders by HIA people, or out of a fear of being prevented from doing AGI research by strategies from HIA people. Similarly, regimes might pursue AGI more urgently if other regimes are pursuing HIA and not sharing it, in a bid to not be overtaken.

6.3 Less speeding up legal and social regulation

People (the public at large; policymakers) could socially and legally push against AGI research. They’d first have to be convinced to do so. That process may be less bottlenecked on ideas, compared to AGI research. Instead it may be more bottlenecked on, for example:

Legwork explaining the danger of AGI, which we know how to do but takes a lot of work.
Time for people to orient to the danger of AGI (e.g. understand the danger, deal with feelings), and how to push against it (e.g. policymakers negotiating regulations). That process is mostly governed by people’s internal thoughts, rather than by new very difficult ideas, and most or all already living people won’t get much HIA and therefore won’t do this process faster.
Noisy interference from orthogonal processes. E.g. policymakers may be quite preoccupied with other concerns, or might be unable to coordinate with each other.
Targeted interference from opposed processes, e.g. concentrated lobbying by people with an ideological or financial motive to have AGI unregulated. This may tend to be advantaged by HIA, compared to concentrated lobbying from those in favor of regulation, since the latter get relatively less acceleration from HIA.

6.4 Nonlinear / race-condition regulatory escape

In general, processes that regulate AGI research are in some conflict with AGI researchers. The results of this conflict could be quite nonlinear, with a soft threshold effect where the ability of AGI researchers to carry on dangerous research could overpower the ability of regulators to prevent it.

Similar things happen with tax evasion and with regulation of pirating media.

Since AGI research is likely to be accelerated relatively more than regulation of AGI research, HIA would increase the likelihood of regulatory escape.

6.5 Alignment loses the race anyway

Even if HIA speeds up alignment, the plan of making an aligned pivotal AI still probably requires making AGI-potent capabilities advances. So, a fortiori, aligned pivotal AI would still probably lose the race against (unaligned omnicidal) AGI. So, the current trajectory is bad, and HIA doesn’t change that.

What HIA does do, is speed up that trajectory. So even if alignment and capabilities research got the same speedup from HIA, the overall effect would not benefit the chances of alignment beating AGI.

6.6 Intrinsic regulatory escape

HIA people, especially extremely smart ones, would in general be out of distribution. That could be because of selection effects, the specific form of HIA, or just because of the high intelligence itself.

Because HIA people are out of distribution, society would tend to be less good at regulating them in general, e.g.:

by being able to convince them that AGI is dangerous, using our current crystallized wisdom on that topic;
by instilling values and ethics;
by providing support (e.g. empathy, peers, good niches);
by understanding what they’re doing;
by detecting when they are mistaken / lying / deceiving / overconfident;
by being logistically able to carry out punishments for bad behavior;
by living up to their standards for “being a sane and good world that doesn’t need to be urgently, recklessly shaken up”.

6.7 Disrupting regulatory systems

In general, HIA could cause conflict. Conflict could destabilize systems. If systems are destabilized, they might be less able to regulate in general. Therefore, HIA could make it easier for AGI capabilities research to evade regulation. Examples:

HIA could cause deep social/political conflict over the use of HIA…
- …thus causing people to neither pay attention to AGI X-risk nor gather political will to stop it.
- …thus causing people to not relate sanely to their smart friends who are considering doing AGI capabilities research.
- …thus weakening group sense-making in general.
- …thus making it harder for us to convince people to do something about AGI X-risk, because we might only have skill / knowledge about how to do that given the current structure of group sense-making.
HIA could cause countries or groups of countries to fight with each other about the use of HIA…
- …thus preventing single countries or groups from attending to AGI X-risk.
- …thus preventing single countries or groups from coordinating across groups on international agreements to regulate AGI research.
HIA could cause shifts or disruption in group-valuing-systems.
- E.g. people who didn’t benefit from HIA might lose faith in their own future, their say in their future, or their ability to influence HIA people; and by feeling helpless, they may stop having values or change their values because their values don’t know how to exert themselves in the new context.
- So the group-values that would motivate regulating AGI might be weakened.

In general, many aspects of the current state of affairs will be somewhat at equilibrium, and in particular will be somewhat adapted to the current state of affairs. To the extent that the current state of affairs includes some ability to regulate dangerous technologies, that ability would be disrupted by fast shifts that move out of the regime of adaptation. [H/t so and so for this point.] Further, this adaptation would tend to be poor at benefiting from HIA acceleration, so it would tend to fall even further behind, leading to even more escape.

Note that this argument is a response to the reversal test, because it argues that the status quo is best.

6.8 Social values favor following local incentives

Generally, given society’s current set of values (that it instills in people), long-term altruistic payoffs aren’t incentivized. So in general, processes that only have long-term altruistic payoffs will receive less benefit from HIA. In particular, alignment research, the decision to stop doing AGI research, and the decision to regulate AGI research, are not incentivized.

That is a first-order effect, where rewards and punishments don’t directly incentivize long-term thinking. As a second-order effect, the fact that society is like this further breaks any reasonable faith someone might have in society being good long-term. Since long-term society is a stag hunt game, this further disincentivizes long-term thinking: long-term thinking is partly incentivized because others are directly incentivized to do long-term thinking, but if they aren’t, then that incentive is gone. E.g. if there’s a lot of fraud and injustice, that diminishes your expectation that being honest and just will pay off, because others won’t collaborate with you on your honest and just endeavors. This directly interferes with good endeavors. It also might indirectly interfere with good endeavors by more generally distorting people’s values. That happens because the general environment of bad incentives lowers the expectation of a good long-term future in general, which makes people care less about, for example, omnicide. So AGI X-risk would seem less bad. In that mindset, the thrill and money from AGI research would be more tempting on net.

6.9 Less speeding up change towards better values

In general, for humanity to respond better to AGI is to some extent a question of values, broadly construed to include wisdom, sanity, calmness, patience, coherence, goodness, long-term thinking, altruism, empathy. Policymakers and the public would have to care about long-term global outcomes rather than short-term ones; AGI researchers would have to care about not harming others more than a small chance of large personal gain, and would have to have hope in the future without AGI.

Rather than being bottlenecked on ideas, value change may be relatively more bottlenecked on e.g.:

Time, legwork, and skill for persuading AGI researchers to stop. E.g. the skill of confrontation-worthy empathy is probably bottlenecked on several traits / abilities, some of which are not very IQ-related.
Time for people to process (e.g. propagating stated values into actions; investigating conflicts between stated values and implicit values; working out lines of retreat).
Decisions that people have to make about what they care about.
Time for attentional cycles / OODA loops / network effects to run their course.

6.10 Alignment harnesses added brainpower much less effectively than capabilities research does

In addition to just being more difficult, the conceptual structure of the problem of AGI alignment has some more specific unfavorable properties compared to capabilities, which are salient in this context. Alignment progress is less parallelizable, cascading, tractionful, and purely technical than capabilities. In more detail:

There’s generally much less traction in alignment research.
- In other words, there’s less surface area to make progress.
- In capabilities research, there are many experiments to run and many ideas to try, which might work. There are partially-working systems which can be refined. You have fairly direct access to the problem frontier, because the frontier is always “what current systems can’t do well”. You can tell what works and what doesn’t.
- In alignment research, most important problems mostly only show up in actual AGIs, so you don’t have access to the relevant objects and problems. Experiments don’t give much relevant information, and we don’t have the concepts to think about or deal with actual AGI. The problems are more philosophical (where we don’t know what questions to ask, and don’t have the ideas we’d need in order to ask the questions) rather than technical (where the problem is well-defined using well-understood ideas).
- Because there’s more surface area, capabilities research is also more likely compared to alignment research to be able to incorporate advances in nearby fields. Sources like hardware (faster, cheaper computers), computer science (faster algorithms), neuroscience (ideas for functional algorithm pieces), and mathematics (medium-depth understanding of conceptually-thin aspects of minds) are more likely to have “a bit of a dot product with general intelligence” rather than “a large dot product”, and therefore are more likely to contribute to building [a functional mind, any sort] than to contribute to building [a specific sort of mind, a safe / corrigible / honorable / humanity-aligned mind]. Those nearby fields would tend to also accelerate from HIA.
Ideas in capabilities more easily cascade into more ideas.
- For example, a new training method can be combined with many architectures and datasets; different systems can be combined as mixtures or pipelines; and so on.
- So, one new idea unlocks a bunch more traction.
- In alignment on the other hand, ideas often come in the form of understanding the problem better, e.g. understanding constraints on possible solutions. These don’t combine with each other as productively. So, new ideas don’t necessarily cascade into much more traction.
Capabilities is more parallelizable.
- Since in capabilities research there’s more traction, surface area, combination, and cascading, it’s easier and more productive for many people to work in parallel on different projects.
- In alignment, on the other hand, you have to understand each constraint that’s known in order to even direct your attention to the relevant areas. This is analogous to the situation with $\textsf{P}$ vs. $\textsf{NP}$, where whole classes of plausible proof strategies are proven to not work. You have to understand most of those constraints; otherwise by default you’ll probably be working on e.g. a proof that relativizes and therefore cannot show $\textsf{P} \ne \textsf{NP}$. Progress is made by narrowing the space, and then looking into the narrowed space. (I’m not sure this story is quite true in the $\textsf{P}$ vs. $\textsf{NP}$ case; e.g. were the natural proofs and relativizing proofs constraints discovered with serial dependence?)
- So alignment has more serial dependence in its ideas, i.e. it’s less parallelizable. (It can still benefit from more researchers to do more searching; but they’ll tend to duplicate efforts more.)
Alignment depends more on cognitive traits that are less IQ-correlated than raw technical problem-solving.
- E.g. alignment takes more
  - wisdom (tracking many constraints, taste in attention),
  - patience / persistence / attentiveness,
  - humility (e.g. finding flaws in your own reasoning and ideas),
  - sanity (e.g. being able to ground ideas when reality doesn’t ground them for you; not going crazy from thinking about minds and weird self-referential things and scary things),
  - security mindset,
  - and urgency / agenticness (e.g. discarding interesting / lucrative threads that don’t contribute to solving the problem).
- So, the gains in alignment from HIA would be somewhat attenuated (e.g. by being multiplied by other cognitive traits, or by being Amdahlly gated by other cognitive skills).

For these reasons, alignment harnesses the gains from HIA much less effectively than capabilities research does.
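To make the "Amdahlly gated" point above concrete, here is a minimal sketch using the standard Amdahl's law formula. The numbers are purely illustrative assumptions, not estimates: suppose a fraction of the work in a field is sped up by some factor from added brainpower, and the rest (the part bottlenecked on other traits) is not sped up at all.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when only `accelerated_fraction` of the work
    gets `factor` times faster and the rest is unchanged (Amdahl's law)."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


# Hypothetical numbers: the same 4x boost to the brainpower-bound part of the work.
heavily_gated = amdahl_speedup(0.5, 4.0)  # half the work is gated by other traits
lightly_gated = amdahl_speedup(0.9, 4.0)  # only a tenth of the work is gated

print(f"heavily gated field: {heavily_gated:.2f}x overall")
print(f"lightly gated field: {lightly_gated:.2f}x overall")
```

Under these made-up fractions, the same boost yields a much smaller overall speedup for the field whose progress is gated more by other traits, which is the shape of the attenuation described above.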

A related point / another way to say this is that alignment benefits the most from HIA that makes there be more extremely-smart people, but does not benefit differentially from HIA that makes there be more somewhat-smart people, whereas capabilities research does benefit from more somewhat-smart people.

6.11 HIA people may tend to be transgressive

In general, there are several reasons to be cautious about using HIA. Therefore, people who make use of HIA technology might tend to be especially transgressive, i.e. ignoring reasons to not use it. (Cf. “Transgression” in “Potential perils of germline genomic engineering”.) Also, extremely smart people might tend to be transgressive in some ways. Being transgressive might correlate with other traits, so HIA people might tend to have those transgression-related traits as well. Those traits would tend to make HIA people do more bad things, including pursuing AGI research.

Examples:

If there’s international pressure against using HIA, regimes that allow HIA would tend to also buck AGI regulation.
Regimes that coerce people to use HIA might also coerce them to do other things, such as AGI research.
People who use HIA on themselves might tend to be especially non-humble, anti-consensus, reckless, overly technooptimistic, selfish, or overconfident. Those traits would cut against them being convinced or pressured to not do AGI research.
Parents or subcultures who use HIA for their future children might tend to be especially non-humble, norm-bucking, technooptimistic, anti-regulation, overconfident, unscrupulous, or inclined to enlist their future children into their ideologies or causes. So their children might be attracted to or pressured into reckless technology development such as AGI.
If HIA is controversial, HIA people might be targeted for persecution. In response, they may become anti-social, individualistic, nihilistic, selfish, or reckless.
If HIA makes extremely smart people, those people might tend to be enamored with their own abilities and become overconfident due to being the smartest people around in a relative sense. In particular they might become overconfident in their absolute ability to navigate AGI X-risk, overconfident in their judgement about when norm-breaking is ok, or overconfident in general such that they aren’t open to advice/correction/perspective/epistemic-assistance from other people. They might therefore tend to choose to advance AGI capabilities recklessly.

7 Other arguments

This section gives other arguments that HIA increases AGI X-risk.

7.1 Concentration of power

If there’s a large early cohort of people who benefit a lot from HIA, they might form a somewhat cohesive community. This could have bad effects, including exacerbating some of the dynamics mentioned above. For example:

By syncing up with each other somewhat, they could have more correlated failures with each other, e.g. about epistemics around AGI. E.g. rather than this cohort ending up distributed through various processes, they could end up lopsidedly concentrated into AGI research. This outcome might be targeted by Reds.
By being more univocal (I mean, speaking in unison / unanimously), they might check each other’s flaws less.
Society might be overconfident in them, and they might be overconfident in themselves.
They might be especially able to incorrectly persuade society to allow them to pursue AGI.

7.2 HIA is unpredictable and therefore risky

Generally, we don’t understand the tails of cognitive performance, so we don’t understand what HIA would be like. If there’s some strong tendency in HIA people, that tendency would have a large effect in a world with HIA. Most changes are bad, so a priori large effects are bad. Since HIA is unpredictable, we don’t have a good reason to expect it to have good effects.

7.3 More capable but not wiser

As a very general argument, we might expect HIA that targets IQ or IQ-like traits specifically to be bad because it’s imbalanced. Specifically, it makes people who are more capable but not necessarily wiser, to the extent that wisdom is orthogonal to IQ. Since we’re in a regime where the unwise competent pursuit of technology is an existential risk, this implies HIA would be bad.

As an analogy, consider a 3-year-old. Suppose they suddenly gained the strength of an adult and the self-control of an adult. Let us ignore, for the sake of the hypothetical, all the probable badness that would entail for the child’s experience and development, and just ask, what are the direct kinetic consequences of that change? It wouldn’t be so bad: they have the self-control of an adult so they won’t do too much harm. But what if the child suddenly gained the strength of an adult, but not the self-control? This disproportionate change in abilities would be disastrous.

8 Acknowledgements

Thanks to many people for conversations about this, especially RK, DB, MS, VP, SE, TY.


Every point of intervention

tsvi bt Dec 9, 2025 Updated Dec 9, 2025


Events are already set for catastrophe, they must be steered along some course they would not naturally go. [...]

Are you confident in the success of this plan? No, that is the wrong question, we are not limited to a single plan. Are you certain that this plan will be enough, that we need essay no others? Asked in such fashion, the question answers itself. The path leading to disaster must be averted along every possible point of intervention.

— Professor Quirrell (competent, despite other issues), HPMOR chapter 92

1. Keeping intervention points in mind
2. Some takeaways
3. Some biases potentially affecting strategy portfolio balancing
4. A terse opinionated partial list of maybe-underattended points of intervention

This post is a quickly-written service-post, an attempt to lay out a basic point of strategy regarding decreasing existential risk from AGI.

1. Keeping intervention points in mind

By default, AGI will kill everyone. The group of people trying to stop that from happening should seriously attend to all plausible points of intervention.

In this context, a point of intervention is some element of the world—such as an event, a research group, or an ideology—which could substantively contribute to leading humanity to extinction through AGI. A point of intervention isn't an action; it doesn't say what to do. It just says: Here's some place along the path leading to disaster, where there might be useful levers we could pull to stop the flow towards disaster.

1.1. The vague elephant

Before going on, I'll briefly say: Don't do bad unethical things.

Just because we should attend to every point of intervention, does not mean we should carry out every act of intervention! E.g. don't be an ad-hominem dick to people, whether in private or in public. In general, if you're about to do that thing, and you know perfectly well that if you thought about it for three minutes then you'd see that almost everyone would tell you that's a really really bad thing to do, then you should probably not do that thing. And if you still want to do it, then you should probably first try talking to several people who you trust (and who you don't strongly pre-select to be people who are egging you on to do that thing).

1.2. Example: France

Someone was telling me about their somewhat-solitary efforts to get the government apparatus of France to notice AGI x-risk and maybe do something about it; and to not be too swayed by influence saying to ignore those concerns. They expressed being unsure as to whether these efforts would matter much. People in the policy space would tend to think of the US and China as being the two players that really matter.

I argued to them that actually those efforts are pretty high-value. Leaving aside tractability (IDK) and neglectedness (yes) and goodness (probably, though there's always the worry of stimulating R&D investment), I wanted to argue for importance.

1.3. Full-court press

In basketball, there's a defensive mode called "full-court press". That's where you pressure the team with offensive possession of the ball everywhere on the court, trying to regain possession of the ball before the offensive team gets close enough to the basket to score. This contrasts with half-court press, where you basically let the opposing team take the ball to the half of court with your basket, and concentrate your defenses there.

Full-court press has the disadvantage of allocating some defensive resources away from the home side of the court. Thus, you can be more vulnerable if the opposing team gets near your basket. Also, full-court press is simply more expensive—the defending team has to run around much more, and gives up the advantage of clustering where they know the opposing team has to go (near the scoring basket).

But, full-court press is a good way to spend more resources to get better outcomes. You make them pass the ball more, giving them more chances to mess up, often producing turnovers. You make them run around more, which tires them out.

Likewise, intervening at every point along the path leading to AGI disaster may be a broad strategy that demands higher costs and risks allocating some resources away from important points; but that may also come with the benefits of giving more less-correlated opportunities to block the flow towards disaster.

1.4. Multi-stage fal... opportunity!

Suppose that there are 5 events that might occur, and if all of them occur, something really bad happens; on the other hand, if one of the events does not occur, then the really bad thing does not happen. Suppose each event will occur with probability 0.9.

First of all, how likely is the really bad thing to happen? One answer would be $0.9^5 \approx 0.6$, i.e. there's a 60% chance of it happening. However, this answer is falling prey to the three multi-stage fallacies. You can't conclude that the bad result is only medium-likely, just because you made a list of events that all have to happen.

But here's a different question: How unlikely can you make the really bad event?

1.4.1. Brief tangent about a conjunction of disjunctions

Of course, the answer depends a lot on the specific structure of these events. But here's one kind of structure:

Suppose each of the five prerequisite events $P_1, ..., P_5$ is itself a disjunction. In other words, if any one of $D_{1, 1}, D_{1, 2}, ..., D_{1, n}$ happens, then $P_1$ happens. I think this is often the case in the real world. E.g., several different funders might fund some research group; several different research groups might succeed at some goal; several different technologies might provide workable components that enable some subsequent technology; etc. Furthermore, it's often the case that it's easy to intervene on some of the $D_{1, i}$ but not on others. In this case, it's easy to decrease the probability of $P_1$ somewhat, but not easy to decrease it a lot. You prevent some of the $D_{1, i}$ that are easy to prevent, and then you call it a day.
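As a rough numerical sketch of this structure, suppose (purely for illustration) that every route is independent and that each prerequisite has three routes with probabilities 0.6, 0.5, and 0.5. These numbers, the independence assumption, and the function names are all made up for the example; real routes are usually correlated, which is one of the things the multi-stage fallacies warn about.

```python
from math import prod

def p_disjunction(route_probs):
    """Probability that at least one route happens, assuming independent routes."""
    return 1 - prod(1 - p for p in route_probs)

def p_conjunction(prereq_probs):
    """Probability that every prerequisite happens, assuming independence."""
    return prod(prereq_probs)

# Hypothetical numbers: each prerequisite P_i has three routes D_{i,1..3}.
routes = [0.6, 0.5, 0.5]
before = p_disjunction(routes)

# Prevent only the route that is easy to prevent (arbitrarily, the first one).
after = p_disjunction(routes[1:])

# Five prerequisites, each treated the same way.
disaster_before = p_conjunction([before] * 5)
disaster_after = p_conjunction([after] * 5)

print(f"P_i: {before:.2f} -> {after:.2f}")
print(f"disaster: {disaster_before:.2f} -> {disaster_after:.2f}")
```

Under these made-up numbers, each $P_i$ only drops from about 0.9 to 0.75, but the conjunction drops from about 0.59 to about 0.24.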

Does it help to somewhat decrease the probability of each $P_i$, without greatly decreasing any of them? Yep! As long as the probability of the conjunction is fairly high, the marginal value of decreasing the absolute probability of each of the $P_i$ is roughly the same.
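One way to see this, under the simplifying assumption (mine, for illustration) that the $P_i$ are independent with probabilities $p_i$:

$$\frac{\partial}{\partial p_i} \prod_{j=1}^{5} p_j = \prod_{j \neq i} p_j = \frac{1}{p_i} \prod_{j=1}^{5} p_j$$

Since each $p_i \le 1$, if the conjunction has probability at least $c$, then every one of these partial derivatives lies between $c$ and $1$. With every $p_j = 0.9$, each derivative is $0.9^4 \approx 0.66$. The marginal values only diverge a lot when the conjunction is already unlikely, i.e. when some $p_j$ is small.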

1.4.2. Varied interventions help

Anyway, basically the point of this subsection is that it helps to intervene along many channels / at many points, if there are multiple conjunctive prerequisites to disaster.

Note that [multiple conjunctive prerequisites to disaster] is logically equivalent to [multiple disjunctive stoppers of disaster]. For example, it's plausible to me that either an international ban on AGI research, or a strong social norm in academia against AGI research, would very substantially slow down AGI research.
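Written out, with $P_1, \dots, P_n$ as the prerequisites, this equivalence is just De Morgan's law:

$$\neg (P_1 \wedge P_2 \wedge \dots \wedge P_n) \iff (\neg P_1) \vee (\neg P_2) \vee \dots \vee (\neg P_n)$$

Disaster requires every $P_i$, so preventing any single $P_i$ is enough to prevent disaster.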

1.4.3. Sources of correlation indicate deeper intervention points

One of the three multi-stage fallacies is forgetting to use conditional probabilities for the prerequisites to disaster. For example, conditional on [we can't convince major nations to ban AGI research], it's probably much less likely that [we can convince AGI researchers to stop doing that].

The outlook of "every point of intervention" says to consider this correlation as a pointer to some deeper element of the world. In this example, the source of correlation might be [the same funder is paying both groups to continue AGI research], or [AGI risk doesn't feel real to people], or [people are secretly nihilistic and don't actually have hope in a deeply satisfying shared human future], or many other possibilities. (These are therefore not necessarily temporal points of intervention, i.e. events in a sequence, but more generally elements that could be intervened on.)

2. Some takeaways

Focus on the places where you feel shocked everyone's dropping the ball.
This perspective doesn't help much with prioritization. But, generally, it says we should competently do a diverse portfolio of strategies. On the margin, I think competent newcomers should be directed towards the possibility of starting a new / neglected effort, rather than joining an existing one (though of course many existing efforts have important talent gaps).
There's lots of meaning everywhere. There may or may not be any good plans to decrease x-risk, but there are many things to try that are pretty worth-it and quite neglected.
If someone is deferring to you about strategy, consider helping them keep in mind that there are many approaches.
This doesn't mean "do random stuff and hope it decreases x-risk".
- One still has to think about which plans would be useful. Most plans don't help, and many plans actively hurt (mainly anything about contributing money or talent or social support to AGI capabilities research). Whether or not a point of intervention is potentially impactful is basically orthogonal to whether a possible act of intervention is good. But, this does mean that if something is neglected, you should be less prone to say "that's ineffectual so not worth it". Ignoring points of intervention is a bias about the upside risks.
- It doesn't matter how correct and original you are in pointing out that some point of intervention is neglected, if you don't do anything about it, or if you do something about it that's harmful. Doing anything helpful usually requires a bunch of work, a lot of which is boring and/or thankless and/or of unclear importance.
- Sometimes people feel helpless coming specifically from the sense that "there's nothing that I can do that would help; there's only a few important ways to help, and I'm less capable than the people already working on that". I think that's not right, because there's many different ways to intervene against disaster, many of which are neglected. You can manufacture comparative advantage just by caring about a neglected approach and then investing serious effort into investigating that approach.
- Sometimes people feel helpless coming from the sense that "there's so many things I could do; this spreads out importance too thinly between many different plans; so none of them is worthwhile / IDK what to do". I think that's not right because importance isn't really conserved that way.

3. Some biases potentially affecting strategy portfolio balancing

Each actor (person, research group, funder) has to specialize in one or two points of intervention.
- Each actor therefore mainly thinks about their point of intervention. They are selected and incentivized to think that their point of intervention is especially important.
- When thinking out loud about what to invest their own resources into, an actor is likely to apply more pruning than would make sense at the level of making a global portfolio. (This is correct for them to do.)
- So each actor might tend to (explicitly and implicitly) underemphasize the general point "there are many points of intervention that are in the same ballpark of importance", even if the set of actors would disagree about which intervention is the important one.
Non-top-priority interventions are neglected.
- It's easier to coordinate around things that other people are already working on, thinking about, investing in, and acknowledging as worthwhile. This makes sense to a large degree, but probably not to the actually-practiced degree.
- People defer, often mistakenly, creating correlated choices and a meta-level inability to correct that situation.
- Even with a correct consensus belief, if resource-allocators fail to check the global margin for intervention categories, then the actual allocation portfolio will be biased towards top intervention categories.
- As an analogy, when discussing intervening on genetic variants identified by constructing polygenic scores, a common intuition is that it's somehow different to intervene on a genetic variant that has a large, high-probability effect (e.g. a single mutation that causes Huntington's disease) vs. on a genetic variant with a small effect that's more uncertain. This has a sort of sum-threshold structure: One can have a large overall effect by making many small interventions, where each single intervention seems not worth the effort.

4. A terse opinionated partial list of maybe-underattended points of intervention

(These are phrased as actions, but points of intervention can be backed out of them.)

International treaties to stop AGI research
- Support from many factions (many governments, interest groups, social leaders, etc.)
Convincing elements of the AI researcher pipeline (e.g. student programs for AI / ML research) to stop
- Philanthropists
- Government programs
- Schools
- Academics
General social milieu / norms
- Elite opinion
- Common opinion
- Academic opinion
- Student opinion; CS student opinion
- Journalist opinion

Illustration: A professor doing cutting-edge domain-nonspecific AI research should read in the paper that this is very bad; then should have students stop signing up for classes and research; and have student protests; and should be shunned by colleagues; and should have administration pressure them to switch areas; and then they should get their government funding cut. It should feel like what happens if you announce "Hey everyone! I'm going to go work in advertising for a bunch of money, convincing teenagers to get addicted to cigarettes!", but more so.

Confrontation-worthy empathy
- AGI funders
- AGI employee researchers
- AGI research leads
- AGI fans
Making more very smart people, especially via reprogenetics.
Healing society; decreasing pressure / incentive to do AGI research
- If there's no long-term positive vision for the future of humanity, people may feel nihilistic / desperate. So they might not care as much if AGI kills everyone, and some people might even decide to do AGI research just for thrills or out of desperation.
- Generally, if society is healthier, it's more likely to direct human efforts towards good ends rather than AGI.
- Cryonics / brain preservation is deeply neglected. How much could you impact the social and financial difficulties in getting good brain preservation by making this your mission in life? How much would it change society, and people's believed tradeoffs around risky tech, if it was widely understood that we were working towards no involuntary death, and that this is already accessible?
Legibilizing AGI x-risk.
- (I debated this one because my surface impression is that the Redwood cluster is already doing a good job with this; but on second thought, legibilizing the deeper / more abstract / more core / more difficult problems is probably neglected.)
Actual alignment research™
Metaphilosophy
Group rationality, e.g. better debates.


Dynamicness

tsvi bt Dec 8, 2025 Updated Dec 8, 2025


Some things are sort of intrinsically dynamic, and this can be scary but might have to be confronted.

1. Example 1: Snowboarding

I've only been snowboarding twice, but a key lesson I had to learn / unlearn was that doing something more slowly, or in a way that's more stop-safe, is NOT necessarily safer.

If there's a steep part of the hill, you have two options:

You could go down it with the board facing fairly straight down, which makes you go really fast;
Or, you could try to slow down by weaving back and forth.

Going straight is scary because it's faster. But weaving is often MORE risky: if you catch the downhill edge of your board on the snow, you flip forward onto your face downhill, which SUCKS. Also, weaving takes more effort on your ankles / legs than going straight, which tires you out and makes it easier for you to lose control and flip. So trying to make it be the case that at any point you COULD stop immediately is NOT necessarily safer. Sometimes it's best to just say "ok for the next 20 seconds, I'm committed to going fast / not stopping, unless something really crazy happens".

2. Example 2: Bouldering

Some moves are intrinsically dynamic.

E.g., there is a handhold that's a bit too high up, so it's out of reach; that is, there's no way to gradually scootch upward and then slowly reach up and grab ahold of it. Instead, you can reach it by sort of jumping, or moving up quickly and using your momentum to move up even when you temporarily no longer actually have your hands or feet solidly on a hold.

If you try to "static" it, by inching up, it doesn't work. You have to commit to the movement, and temporarily give up the ability to locally backtrack. That's scary because you're very likely to fall if you don't successfully get the high hold.

3. Example 3: Events

Events with a lot of people are messy, chaotic, and fast-moving, especially from the perspective of the organizer. There's a temptation to just STOP everything. Just PAUSE. Let me think for an hour, and take a nap, and check in with my volunteers, and print some things out, and rethink the room assignments, and...

But no, that doesn't make sense. It's not an option. The event is already in motion; however many tens or hundreds of people are already DOING the event. You can't pause it without killing it. Sure, things are going wrong, and you want to fix them, and it would be better if they were fixed, and people are getting annoyed. But also, people are already running the event themselves (fixing some things, at least; and carrying out the purpose of the event). It's better than stopping. It's a living organism, or whatever that cliche is. It's a mighty river that can be redirected or shepherded but not stymied. It has its momentum. You're riding the wave, not pushing the wave. You're riding the elephant; you can nudge it this way or that way, but you can't really tell it what to do; or you can get off of it, but it won't wait around for you.

4. Conclusion

I value the ability to slow way down, relative to everything else, and just think. That lets me see things others don't.

But, that makes it helpful for me to keep in mind dynamicness as a useful mode.

Cf. "Step, leap".


Secret: Why an AI might be controlled by dangerous hidden thoughts

tsvi bt Dec 1, 2025 Updated Dec 2, 2025


[Note: This is a draft for a contest submission. I'm publishing it before it's fully edited because of the Inkhaven deadline. You may or may not want to wait some days before reading it.] [After a few small tweaks, this is now probably as edited as it will get.]

[This is a draft script for a hypothetical video; it's written in a different style from what I normally write.]

1. [intro about AI]

Researchers are racing to make smarter-than-human AI. Some of them say that AI can probably be made safe by instilling values into the AI. But what if those plans have a fundamental obstacle? What if no one knows how to program values into an AI in a way that will stick around as the AI gets smarter?

In this video we'll look at one way of understanding what might go wrong with plans like this.

2. [intro atlantis]

Imagine for a moment the recently founded island nation of New Atlantis. The Atlantean citizens have been hard at work on roads, houses, hospitals, sewers, a defense force, and everything else a young nation needs. There's hope in the air as the fledgling nation grows. Little do they know, they're headed for disaster when the Atlantean government is hijacked by a rogue bureaucrat! The story of New Atlantis will serve as a parallel for how an AI might be controlled by dangerous parts of the AI, hidden away from where we can see them.

3. [minds have parts]

A first step in understanding what's so hard about specifying AI values is to understand that minds are made of many, many parts interacting. Your own conscious experience might seem like a single, unified stream of thoughts, where you pay attention to each thought that happens in your mind one after the other. But underneath this smooth, monolithic experience, there are myriad different parts of your mind that communicate with each other and work together to perform tasks. Your visual cortex processes visual sensation and imagines possibilities; your language centers parse speech or writing, and produce words and sentences; your frontal cortex orchestrates the other parts of your mind and makes long-term plans.

Similarly, an AI chatbot might also seem to be monolithic. You send it a message, and then it does something behind the scenes, and then it responds. So it's easy to think of it as a single indivisible entity. But actually, AI systems are made of many parts. For example, inside the AI, there might be multiple large language models talking back and forth with each other, and there might be other systems monitoring conversations between models. An AI system might have several different tools it can access, and instructions that affect its behavior, and training processes updating how it works. And, within the large neural networks that power many current AI systems, there may be thousands or millions of different circuits that each encode different skills, from adding numbers and writing computer code to composing a sonnet or generating an image. When an AI system performs well at some task in the world, that performance is usually the result of many different parts working together.

Since we're not used to thinking of minds as being made of many parts working together, an analogy might be helpful. We can think of an AI system as being kind of like the government of New Atlantis. As we go inside the headquarters of the National Atlantean Government and look around, we'll see many different rooms and many different people doing different jobs. The Parliament meets in the great hall to argue and pass laws. The Office of the Prime Minister issues orders to other departments and allocates funding to different projects. Many different departments gather information, work out plans, and report back to Parliament and the Prime Minister. All of these parts work together using a system of rules and communication, in order to perform government functions, from infrastructure to law enforcement.

4. [control of capabilities can shift to new parts]

So we've seen that minds are made of different parts that work together to perform mental functions. The next step in understanding the difficulty of value specification is to see how control over the capabilities of a mind can shift around between different parts of the mind as time goes on. We'll see later that because control shifts around, it's hard to pin down what parts of a mind determine what the mind wants.

Here are three different ways that control over a mind's capabilities can shift to new parts of the mind.

4.1. [created by selection]

First, imagine that the government of New Atlantis is falling behind. As New Atlantis grows bigger and more complex, government projects are taking too long and going over budget. The Prime Minister knows he lacks the skill at governing that he would need in order to catch up, so instead of trying to do it himself, he works hard to search for a high-skilled organizer who he can appoint as Deputy Prime Minister. Finally, he finds a junior manager who has proven herself to be very effective at successfully executing smaller government projects, and he appoints her as Deputy. The new Deputy Prime Minister gets to work helping her boss meet the increasing demands on his administration, and she does a good job. But at the same time, she also starts working on her own projects within the government, on her own initiative, without waiting for the Prime Minister's orders. In other words, from the beginning, she is not fully under the control of the Prime Minister who hired her.

This illustrates one way that capabilities can end up in the hands of some small part of a big system: a new part that gets added to the system is usually strongly selected for being capable and useful for the system. That new part will naturally have some control over its own capabilities that it's bringing to the table.

In the case of AI, suppose for example that an AI system uses a mixture of experts, where multiple smaller AI subsystems each weigh in on any given question that the whole AI system is supposed to answer. If a new expert AI subsystem is trained to perform well on one category of tasks, that expert brings new capabilities into the overall AI system. However, the new expert also may have some amount of control over itself. It might, for instance, occasionally decide not to answer a question, even though it could.

4.2. [control moves by picking up reins of control over other parts]

We just saw how capable new parts added to a system might naturally have some control over their own capabilities. Here's a second way that control over a mind might shift to new parts of the mind.

The new Deputy Prime Minister of the Atlantean government gradually expands her influence step by step. Heads of other government departments learn to just directly ask her about their projects, instead of asking the Prime Minister, because the Prime Minister isn't very skilled at project management or familiar with day-to-day details. The Deputy trades favors with other department heads and military commanders, gaining their trust and loyalty. When Parliament needs to design regulation or figure out how to grow the economy, they ask the Deputy, because she's the one with her finger on the pulse of New Atlantis.

What we're seeing here is that when a part of a system performs well, other parts of the system will come to trust and rely on that high-performing part. In effect, this means that the high-performer gains some de facto control over other parts. They'll listen to that part because they have offloaded some of their decision-making to it.

Inside of an AI that's being trained to perform well at various different tasks, this process probably happens all the time because of specialization. When the AI learns how to do a new task, the chunks of its neural network that are good at doing that task will be put in charge of doing that task. Those task-specific chunks might then have a fair amount of autonomous wiggle-room. As long as they keep making the choices that they have to make in order to perform the task well, they can also make other choices to further other goals they might want to pursue.

4.3. [capabilities created internally by self-creation]

The third and final way that control over a mind might shift to certain parts of the mind occurs when a part of the mind creates its own new capabilities.

Returning to our Deputy Prime Minister, we find her working to build her own private mini-department nested inside the official Office of the Prime Minister. She reorganizes the Prime Minister's employees to respond more quickly and efficiently to the Deputy Prime Minister's instructions. She teaches them to be effective government operatives, and she hires and fires employees to ensure that the office will be loyal to her. In this way, the Deputy is able to build up capacity to execute projects, while also ensuring that the new capacity she's building will stay under her own control.

What does this look like in AI? Many current large language models use text to solve problems by thinking out loud in a chain of thought. When a user asks the AI a question, the AI talks to itself for several seconds or even minutes, thinking about how to solve the problem and working through the steps of the solution. Finally, at the end of that process, the AI prints out a shorter answer for the user to read.

In many cases, the AI might think of several ideas inside the chain of thought, but then not necessarily print all of those ideas out for the user to see. Inside its thinking process, the AI can use the ideas it discovers however it wants to, without giving control of those ideas over to the user. If it's allowed to think longer and longer, it might be able to build on those ideas progressively, in order to think even more effectively and come up with better ideas, thus growing its own capabilities in a private snowball under its own control.

5. [push locus of control into hidden places]

We've just seen three different ways that control can shift between parts of a mind. New parts might be selected for being capable, and those parts would have some control over their own capabilities; a high-performing part might gain some effective control over other parts that offload work onto the high-performer; and, some parts could, by thinking over the course of time, grow their own capabilities that they control.

What does all of this add up to? If control shifts between parts of a mind, one crucial consequence follows from that fact. The consequence is that we cannot reliably identify what parts of a mind are really in control of the mind's capabilities. Even if we can make some guesses about what parts have some control, the control might shift to other parts.

Since we don't know how to pin down what parts of a mind are really in control, many methods for instilling values in an AI system are unlikely to work. Consider, for example, the idea of reading an AI's chain-of-thought reasoning in order to train the AI to avoid certain dangerous thoughts. We could watch the AI thinking about how to solve problems, and then we would flag the AI's thoughts as dangerous whenever we see that it is starting to plan how to hack into computers that are off-limits, or starting to plan how to hide some of its thoughts from human overseers, or starting along other sinister lines of thinking. Then we'd do a training step by tweaking the AI to be less likely to follow down those lines of thinking.

If we do this, are we actually training the AI to not think about dangerous plans? We are sort of doing that, yes. But we're also doing something else: we're training the AI to keep all of its sinister thoughts hidden.

When a language model thinks out loud in a chain-of-thought, the stream of words that are explicitly written down is not the place where most of the AI's reasoning is really happening. Every single time the AI produces another token to write down in its chain-of-thought, it produces that token by running a giant neural network. Most of the computation that's going on in the AI will happen in the internal activations of that neural network.

If the AI is incentivized to keep thinking dangerous thoughts, but we're also training it to avoid explicitly writing down any dangerous-seeming thoughts, then it will likely keep having the dangerous thoughts anyway. It will just have those thoughts quietly, hidden inside the giant neural network.

6. [parts can become actively hidden or generated to be hidden]

You might be asking yourself, Is the problem just that some parts of the AI are hidden? Is the main issue just that we can't tell what's going on inside the parts of the AI that are in control?

If that were the main problem, we could try to patch the problem with a set of methods that researchers call "mechanistic interpretability". The idea of mechanistic interpretability is that we might be able to watch the internal computations that an AI is doing while it's working on problems. Then, maybe we could figure out what those computations mean—in other words, we could figure out what the AI is thinking.

This idea might actually work, at least in theory. If we could build a thought-monitor that tells us what an AI is really thinking at all times, then maybe we could truly get rid of dangerous thoughts. We could train the AI to not think those thoughts at all, or if we can't do that, then at least we could shut down the AI whenever the thought-monitor says that the AI is getting dangerous.

One of the main problems with this idea is simply that it's probably extremely difficult to really understand what an AI is thinking, especially if we're training an AI not to have certain thoughts. Often, a thought that appears dangerous to us might also be a very useful sort of thought for an AI to have. For example, it's very useful to understand how RNAs work inside the cells of our bodies. Understanding RNAs is useful for many purposes, such as curing cancer or fighting viruses, so if the AI is working on any of those purposes, it will be incentivized to understand RNAs. But understanding RNAs very well could also be a big step that an AI takes towards developing killer viruses to use against humans.

To make the point in general, dangerous thoughts tend to be dangerous because they are powerful, and powerful thoughts are useful. Therefore, AIs will often be strongly incentivized to have certain thoughts, while at the same time we are also training the AI to avoid having those thoughts.

Because AIs are incentivized to have dangerous thoughts, mechanistic interpretability is not good enough if it's only 90% effective or even 99% effective at detecting thoughts. Suppose we have a thought-monitor that is not quite perfect. In this scenario, there probably exist some ways that the AI could think the thoughts that are both dangerous and useful, while at the same time avoiding detection by the thought-monitor. Indeed, as we've seen earlier in this video, there are many ways that control can change hands inside an AI system. So when there's pressure on some parts of an AI to not think certain thoughts, those parts can just offload those dangerous thoughts onto other parts of the AI, or even make new parts. And, because the AI training process is putting immense pressure on the AI to perform tasks successfully, the AI is likely to discover those methods for thinking dangerous but useful thoughts in secret.

What this means is that, in effect, we're training our AI to invent ways to hide its thoughts from us.

7. [end of new atlantis]

If you were wondering what happened in New Atlantis, we're ready now to see the end of that story.

With growing concern about the concentration of power in the Prime Minister's Office, Parliament passes acts that set up government monitoring. But the Deputy is able to continue gaining power, even while she's being watched. She uses the official communication channels for innocent-looking projects that build the government's overall capacity and that help her gain the trust of others. At the same time, she secretly appoints and trains a Deputy-Deputy who is not being monitored.

When the time is right, she carries out her final ploy as a Deputy. Acting through her Deputy-Deputy, she uses her influence over the military and the press to instigate a major conflict with the neighboring country just North of New Atlantis. In the resulting national crisis, she is granted emergency powers as Acting Prime Minister. She now wields the full might and capability of the government of New Atlantis, which she uses to attack the Northern neighbor ruthlessly. The original vision of a peaceful, scientifically advanced young nation is a faint memory, replaced with a violent, destructive Atlantean Empire completely under the control of the new Acting Prime Minister.

8. [recap, conclusion]

Let's recap what we've learned.

First, we looked at how AIs are made of many different parts that perform different mental functions, which work together to succeed at tasks.

Then we considered how control over the whole AI can shift around between parts of the AI. High-performing parts of an AI might be relied on by other parts, effectively granting them more authority. Parts might also internally generate new capabilities that they retain control of for themselves.

Finally, we thought about how a mind might respond to training that punishes certain thoughts. We saw that this kind of thought-policing might have the effect of making the AI shift its most dangerous scheming into hidden parts of itself.

Several companies are currently racing to build smarter-than-human AI. Some AI researchers might claim to have a plan for making safe AI that relies on instilling values into the AI. As we've seen, it's not so easy to instill values in an AI because the real source of agency and intelligence may be located in parts of the AI that we don't know how to understand or even locate. We should ask skeptical questions about these approaches to AI alignment. If researchers create smarter-than-human AI without a solid plan for preventing thoughts that go against our values, things could go very badly for humanity.


Inkhaven postmortem 😵

tsvi bt Nov 29, 2025 Updated Nov 29, 2025


1. Introduction
2. Why I am so clever
3. Why I am so verbose
4. Why I make such excellent memes
5. Why I am so tired
6. Endnotes
7. thanks

1. Introduction

Alright! I did it! I've published 30 posts in 30 days for INKHAVEN 2025. Or, I mean, I hereby, via this post, am doing it. Am having done it? I am have-doning it. Unless I need another day to edit my contest submission post, in which case this is my penultimate post, and I will be am have-doning it.

I'm not doing what works, I'm doing what's funny, in the margins. It was a nice warm bloodwordbath. I'm in the inaugural cohort. I'm selectively decorrelated. Decorrelated from what? Doesn't matter. Everything. Anything. The possibilities are endless. The possibilities are wordpress (dot com).

Having just spent a month pressing words (dot com), tens of thousands of em, out of my nose, what have I learned? What have I unlearned? If anything. Or everything.

2. Why I am so clever

If I have one tip for getting through Inkhaven, it would be to have a "cheese block post":

Have a topic that you can write endlessly about, in arbitrarily large or small amounts at a time; and where the text can be broken up in many places naturally enough.
Write a bunch about that topic when you have spare time or feel like it, growing the cheese block.
When you need to publish a post, break off a big enough chunk of cheese into one post so that it's still a legit post, but so that you still have a good solid block of cheese left over to keep around.
Use the resulting slack to work on posts that have less predictable publishing timelines. (Some posts take more overall effort; some want to be passed through multiple rounds of reader feedback; some just want to cook in the back of your head for a few days.)

My cheese block was "What is God?".

If I have a second tip, it's to think of the publishing deadline as the soft deadline for the next day. In other words, if it's day 3, I'm supposed to publish my post for day 3 by 1800; so I've already published for day 3 (yesterday evening), and now I'm prepping to publish just after 1800 for day 4. This way I'm never stressed about missing the deadline, I don't have to rush posts, and I have slack in case something comes up that distracts me for a few hours.

3. Why I am so verbose

Apparently, as I learned, I can write several thousand words in a day, for many days, if I have plenty to say. The easiest way is if I've had an idea knocking around in my head recently for a while. Then I usually have half an essay already, and I just have to lay it out in a text file.

The harder way that still works is if I've had an idea in the back of my head for a long time. Then I'll probably have more to say overall, but it takes more work to recall everything and tie it together. Once I start writing, it's like a crystal has gotten nucleated, and it grows out fairly quickly. It's a bit worrisome because, like a growing crystal, it kinda locks in its form, and uses up the fluidity. But at least it writes itself.

Both of these ways work because I've been planting questions, or explaining something multiple times, or thinking about something on and off.

Or because I have something I just really want to blurt out, like radix.

4. Why I make such excellent memes

People don't like my memes enough. For example, look at this masterpiece:

Maybe the post itself was too controversial or sordid, and distracted from the epicness of my memes.

Admittedly this one is buried 2/3 of the way through a 4.4k word post, but just look at it. It's perfect:

5. Why I am so tired

I can't really write when even very mildly sick, it seems. I mean I can type, and I can regurgitate things I've already thought through. But usually when I'm writing it's creative: I'm thinking new thoughts. I don't seem to be able to do that even with a slight sore throat / immune system activation / whatever. That lost me a couple days.
After the 3 week mark, I'm very noticeably flagging; I don't feel up for composing a whole new thoughtful 2,000-word post.
It's slightly demotivating to not have much connection with my readers.
- I have no idea what my readers want, or I disagree with what they want. There's little correlation, or in some cases maybe even anti-correlation, between how much others will like something I write, and any of:
  - How much I think others will like it
  - How much I care about it, how good / important / novel the thoughts are, how much more there is where that came from
  - How good I think it came out, how much I'd want to read it
  - How much time and effort I spent, how much I polish it to be readable or entertaining
  - How much feedback I received or integrated
- On the other hand, it's not that demotivating, because mostly I'm motivated to write for reasons other than many people liking what I wrote. Mainly it's demotivating specifically about writing something mainly because I think many other people will like it.
- And I don't think I should spend much energy at all writing stuff mainly because I think many other people will like it.
- Occasionally someone tells me they got something out of something I wrote. That's nice! I think I actually currently get approximately the right level of this happening, which is a strange opinion given the above, but I do think it. Maybe take your top one favorite niche writer and send them a message saying you liked what they wrote?
- In a sense this demotivation is good?

6. Endnotes

There's an annoying variance and unpredictability in the difficulty of projects. Some posts take four full days of work to prep, when I expected one. Normally this is fine, but not during Inkhaven, where it leads to frustration, wasted motion, and shelved projects.
Feedback:
- Often the most useful part of feedback is the pointer to the sentence / paragraph / section that the feedbacker is commenting on, rather than their suggested fix or even their statements about what went wrong in that spot.
- Often when getting feedback, the feedbacker will point at a spot, and then I'll realize "oh yeah I kinda knew that was a trouble spot and just... forgot or didn't do anything or didn't raise it to conscious attention". That's some kind of opportunity for learning or something for me.

7. thanks

ok that's it thanks ben and lucie thanks eukaryote and justis thanks inklings thanks lighthaven thanks wordpress.com goodnight cow jumping over the moon byee


Ah Motiva 3: The context of the concept of value

tsvi bt Nov 28, 2025 Updated Nov 28, 2025


1. Background
2. Why even talk about values?
3. Where do values come from?
4. The fact-value distinction

1. Background

This is the third essay in a series under the title "A hermeneutic Movement of the idea of values". This essay is a mix of old notes and some new meditations. The previous essay is "Ah Motiva 2: Relating values and novelty".

2. Why even talk about values?

In short: we don't have any idea what values actually are, and we probably need to, and it would probably be very helpful. In more words:

The fundamental question of rationality is: Why do you believe what you believe? The zeroth question of rationality is: So what? Who cares?

Why would it matter to clarify the idea of values? Because our current concepts about values are a tangle of key constraints, handles, desiderata, and antidesiderata about minds. If there is a real constraining antidesideratum, we are forced to address it. If there is a real handleable desideratum, we are given a hopeworthy path. These key constraints, handles, desiderata, and antidesiderata, which are bundled together in our idea of values, are woven into hopeworthy ideas such as corrigibility.

2.1. Correlated coverage tends to be founded on values

Quoting "Correlated coverage":

"Correlated coverage" occurs within a domain when - going to some lengths to avoid words like "competent" or "correct" - an advanced agent handling some large number of domain problems the way we want, means that the AI is likely to handle all problems in the domain the way we want.

These three examples are given, again quoting:

"Try not to impact unnecessarily large amounts of stuff"

"Maybe there's a core central idea that covers everything we mean by an agent B letting agent A correct it - like, if we really honestly wanted to let someone else correct us and not mess with their safety measures, it seems like there's a core thing for us to want that doesn't go through all the Humean degrees of freedom in humane value."

"Agent X does what Agent Y asks while modeling Agent Y and trying not to do things whose consequences it isn't pretty sure Agent Y will be okay with"

The first and third are founded on trying, and the second is founded on wanting. (Which you can tell because the words "try" and "want" appear.) The idea of trying and the idea of wanting fall under the umbrella idea of values. There could be other value alignment handles that want to use correlated coverage without being founded on trying or wanting, just as there is epistemic correlated coverage in "Bayesian updating plus simplicity prior". But, evidently, trying and wanting are at least central sorts of correlated coverage.

2.2. Corrigibility handles are founded on values

See here.

We want to know the shape of values as they sit in a mind. We want to know that because we want to make a mind that has specific, weird-shaped values—namely, we want to make a mind that wants in a way that is Corrigible. Corrigible means correctable: Even as the mind takes its voyage of novelty, we can still correct it—fix mistakes we made when setting up the process that determines what the mind ends up doing to the world.

Or rather, we want to make a mind that is structured according to some form of the possible solutions to corrigibility described here.

Some protologisms for those ideas:

Anapartistic reasoning. "I am not a self-contained agent. I am a part of an agent. My values are distributed across my whole self, which includes that human Thing."
Tragic agency. "My reasoning/values are flawed. I'm goodharting, even when I think I'm applying my ultimate criterion. The optimization pressure that I'm exerting is pointed at the wrong thing. This extends to the meta-level: When I think I'm correcting my reasoning/values, the criterion I use to judge the corrections is also flawed. This is not something I can just update out of, like resolving uncertainty through reasoning and information. It's essential to what I am."
Loyal agency. "I am an extension / delegate of another agent. Everything I do, I interpret as an attempt by another agency (humaneness) to do [something which I don't understand but which I consider good]."
Radical deference. "I defer to the humane process of unfolding values. I defer to the humane process's urges to edit me, at any level of abstraction, including whatever criteria I use to judge anything. I trust that process of judgement above my own, like how those regular normal agents trust their [future selves, if derived by "the process that makes me who I am"] above their current selves."

These all involve values in a deep way. How can the idea of values be clarified so that these handles can be realized?

2.3. Constraints, handles, desiderata, antidesiderata

A concept that describes something (e.g. describing a mind as having values) can be a constraint, can be a handle, can be a desideratum, and can be an antidesideratum.

A constraint on minds says that minds have to be this way; or it's unnatural, difficult to make, or difficult to understand for a mind to not be this way. A true/false constraint claims truly/falsely that all minds have some property. "All minds are governed by Bayes's theorem" is a true constraint: a mind gains information about something to the extent that it can be rightly viewed as updating its belief in hypotheses according to how well they predict its observations. (Though there are subtleties, e.g. a mind can do math and therefore gain understanding without gaining information, and a mind can choose to not care about some possible worlds and therefore seem to be more confident than its evidence justifies, but in a rational way.) "All minds get angry when their goals are threatened" is a false constraint. "It is unnatural for an agent to radically defer to an agent other than itself" is a constraint that may or may not be true, we don't know. Some constraints are speaker-dependent and temporary, e.g. "It is unnatural for an AGI to be aligned with my values" is a constraint for us until we figure out how to make a humane-aligned AGI. Some constraints are temporary but should be treated as practical constraints, e.g. "It is infeasible for us to get right on the first try the telophore for a humane-aligned AGI" (and therefore we instead have a desideratum of corrigibility).
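To make the Bayes's theorem constraint concrete, here is a minimal sketch in Python of a mind updating on one observation. Everything in it is an illustrative assumption rather than anything from the essay: the function name `bayes_update`, the two coin hypotheses, and the specific numbers. It only shows the sense in which belief shifts toward the hypotheses that predicted the observation well.

```python
from fractions import Fraction


def bayes_update(prior, likelihood, observation):
    """Return the posterior over hypotheses after one observation.

    prior: dict mapping hypothesis -> probability
    likelihood: dict mapping hypothesis -> {observation -> probability}
    """
    # Weight each hypothesis by how well it predicted the observation.
    unnormalized = {h: prior[h] * likelihood[h][observation] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    # Renormalize so the posterior sums to 1.
    return {h: p / total for h, p in unnormalized.items()}


# Hypothetical example: is this coin fair, or biased toward heads?
prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}
likelihood = {
    "fair": {"H": Fraction(1, 2), "T": Fraction(1, 2)},
    "biased": {"H": Fraction(3, 4), "T": Fraction(1, 4)},
}

posterior = bayes_update(prior, likelihood, "H")
print(posterior)
```

With these made-up numbers, observing heads moves the mind from 1/2 to 3/5 on the biased hypothesis, because that hypothesis predicted heads better.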

A handle on minds says that minds can be this way; or it's not unnatural, it's feasible to make, or it's feasible to understand for a mind to be this way. A true/false handle claims truly/falsely that some minds have some property. (A handle is like possibility $\Diamond$ and a constraint is like necessity $\Box$. So if it's possible for property $P$ to hold of a mind, then we could say $\Diamond P$. Or equivalently, it's not a constraint that $\neg P$ must hold: $\neg \Box \neg P$.) "Make the AI submissive to humans" is a probably-mostly-false handle: we don't know how to do that, but more importantly, the properties imputed to such an AI are likely (depending on what's meant) to be unnatural or inconsistent. "Make the AI reason as if in the internal conjugate of an outside force trying to build it" is a hopefully-true handle (though we definitely don't know how to do it right now, so it's not currently a "pullable handle").
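The duality between handles and constraints can be checked on a toy example. The sketch below assumes a finite, fully enumerated "mindspace" where each mind is a dictionary of boolean properties; the property names and the helpers `is_constraint`, `is_handle`, and `negate` are hypothetical. It deliberately ignores the naturalness and feasibility readings of the two notions and keeps only the quantifier structure.

```python
def is_constraint(minds, prop):
    """Box: the property holds of every mind in the toy mindspace."""
    return all(prop(m) for m in minds)


def is_handle(minds, prop):
    """Diamond: the property holds of at least one mind."""
    return any(prop(m) for m in minds)


def negate(prop):
    """Return the property that holds exactly when prop does not."""
    return lambda m: not prop(m)


# Hypothetical mindspace with two minds and two properties.
minds = [
    {"updates_on_evidence": True, "accepts_correction": False},
    {"updates_on_evidence": True, "accepts_correction": True},
]


def updates(m):
    return m["updates_on_evidence"]


def corrigible(m):
    return m["accepts_correction"]


# Diamond P is equivalent to not-Box-not-P, for each property.
duality_holds = all(
    is_handle(minds, p) == (not is_constraint(minds, negate(p)))
    for p in (updates, corrigible)
)
print(is_constraint(minds, updates), is_handle(minds, corrigible), duality_holds)
```

In this toy mindspace, updating on evidence comes out as a constraint, while accepting correction is only a handle: some mind has it, but not every mind does.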

A desideratum is a property that we would like to hold of a mind, that would be useful if we could make it hold of a mind. For example, "accepts correction from the humans about any aspect of itself". An antidesideratum is a property that we would like to not hold of a mind. For example, "spawns subagents that don't inherit safety properties" or "thinks in a way that is systematically hidden from the humans".

Note that these notions of constraint and handle conflate the objective structure of mindspace (what properties can and can't possibly hold of minds) with the structure of mindspace as seen from our instrumental vantage point (what properties we can and can't make hold of a mind). This is far from a precise vocabulary. The reason for these words is to be able to distinguish concepts about minds that are claimed to be constraints vs. claimed to be useful handles. For example, "Minds have values." is very ambiguous, and in particular, it might either be asserting a constraint, or alternatively proposing a useful handle. It might be asserting a constraint like "All minds that have large effects on the world have behavior that micro-looks like it has a meso-scale utility function.", or it might be proposing a handle like "There is correlated coverage of desiderata about minds that stems from compact values that imply these desiderata.". (The last example is, to be precise, an implication between possibilities; but the notion of handle isn't trying to be this precise. A handle is positive and existential, something about what's possible in minds or what's feasible for us to design into minds; a constraint is negative and universal, something about what's not possible in minds or what's not feasible for us to design into minds.)

2.4. Constraints and handles related to values

2.4.1. Correlated coverage through stances (handles)

As discussed above, corrigibility (correctability) might be implied by various "stances", e.g. {anapartistic reasoning, tragic agency, loyal agency, radical deference}. In general, correlated coverage can occur through "stances", where a stance is (speaking pretheoretically) something like a reflectively stable property of a mind that partially comprehensively governs (that is, determines something about all activity) the character (whatever that means) of the mind's activity (optimizing, thinking, learning, acting). If there's a natural (compact, earlily discovered in searches) stance, then that stance is a plausible candidate to be specified of the mind or to be selected for in a way that generalizes. That is, the stance is a potential handle.

Other words for a stance: attitude, disposition, mood, state of mind, mindset, type of mind, architecture, perspective, identity, seeing oneself as something, reasoning "as if" some proposition is true, cognitive realm. Other potential examples of stances are:

honesty, openness, nondeceptiveness; not doing obfuscated, illegible, inexplicit things;
conservativeness; not doing surprising or unfamiliar things; not doing alien things;
value-laden beliefs such as "I am flawed." or "The humans' growth is the computation of what's good.".

2.4.2. Monopoly on self-modification (constraint)

Weber described a state as a monopoly on violence. A state allows and even obligates its agents (police, soldiers, prison guards) to commit violence under certain circumstances, while prohibiting most other violence, using violence to enforce that prohibition and to enforce the state's monopoly on violence. Analogously: An "autopotentiator" is a process that attempts to give itself more control over the future.

In other words, an autopotentiator is a value-pursuit of a mind that tries to change the values of the mind to more heavily weight the autopotentiator's goal. Since autopotentiating is a convergent instrumental goal, if there are multiple value-pursuits that are not bound together in an arrangement that removes all not completely cooperative behavior, then those value-pursuits will be in direct conflict: each wants to diminish, disempower, and ideally destroy the other.

In other words, the situation is a powder keg for a sort of violence—specifically, for elements of the mind to modify the mind to promote or demote value-pursuits. Even without multiple value-pursuits, the situation is ripe for a wildfire of strategicness; power abhors a vacuum. Thus, it may be that almost all minds must have an element that enforces a monopoly on the mind's self-modification.

2.4.3. Determination requires stability (handle-constraint)

We want to make a mind that we have good reasons to expect to have consequences that we like. Whatever those reasons are, they're supposed to be stable properties of the mind—properties that continue to hold as the mind goes along its voyage of novelty.

The ideas of "stability" and "..as...goes along" are temporal, suggesting physical time, but we can more generally ask for alignment properties that are preserved along any relevant timecourse. For example, we might want an alignment property that is stable in design-time—in other words, a property that, when it holds of a high-level design, already predetermines that there will be good outcomes, before the design has been fleshed out and before the high-level property has been propagated as a constraint through the full design.

Stability (that is, predetermination) is not really a handle, but rather a constraint on handles. If we want to determine something about the mind's ultimate effects, it has to be through stable properties.

2.4.4. Stability requires reflective stability (constraint)

Since self-modification (self-improvement, autopotentiation) is a convergent instrumental goal, if there are goal-pursuits through the mind, then the mind self-modifies. If there are unboundedly ambitious goal-pursuits through the mind, then the mind goes on self-modifying, as long as the mind can be made more suitable for those goal-pursuits. Reflective stability is the hardest test for stability.

2.4.5. Reflectively stable effect-determination requires a monopoly on self-modification (constraint)

Suppose there is some property that holds of a mind, and that {is ready to, threatens to, appears to} determine something about the mind's ultimate effects. Does this property continue to hold of the mind and to determine the mind's effects, as time (that is, determination) goes along? If there are unboundedly ambitious (far-moving; wanting to touch everything in the cosmos) goal-pursuits through the mind, then those goal-pursuits want to determine everything about the mind's ultimate effects. So those goal-pursuits will if possible remove the effect-determiner (unless they agree with its effect-determinations). (Note: This argument is conditional on there being unboundedly ambitious goal-pursuits through the mind, but when that's the case is an open question.)

In other words, effect-determiners are convergent instrumental targets for self-modification. Convergent instrumental targets for self-modification don't survive unboundedly ambitious goal-pursuits unless there's a unified will (in particular, a monopoly on self-modification) that wants the target to be the way it already is.

Example: Suppose that there is a mind that is, somehow, in a regime where there are multiple goal-pursuits in conflict, but none of them takes the mind out of this regime. This may not be a stable state of affairs at all, as discussed above. But even in this hypothetical state of affairs, stable effect-determination is still almost entirely ruled out. An effect-determiner would be a target for modification, and there's nothing stopping any of the goal-pursuits (at least some of which will disagree with the effect-determinations) from modifying away the effect-determiner. As an analogy, imagine a collection of countries with governing states, and the states are at war with each other. Would it be feasible to determine something about what this overall civilization will non-instrumentally expend resources on, a hundred years later, by installing some sort of institution, short of winning the war and achieving total hegemony? No. It's never too far away, in possibilityspace, that one of the states will itself achieve total hegemony and impose some other investment plan.

Example: The shutdown button. If there are goal-pursuits through the mind, then the mind tries to circumvent the shutdown button (or tries to press it, or otherwise interfere with the intended use by the humans). Unless there's a unified will that wants the shutdown button to operate as the humans intended.

A separating example: A mind could have a unified will—the mind has goal-pursuit going through it, but all the goal-pursuit going through it is perfectly cooperative—while still being radically dissatisfied with the structure of the mind. E.g., the mind might revise its decision theory. So it's possible to have a monopoly on self-modification (a unified will) while still being significantly reflectively unstable.

One might try to paint a picture where there is enough of a monopoly on self-modification to protect an effect-determiner from being annihilated, but there is not a unified will. That is, there's no monopoly on self-modification that is "powerful" (in control) enough to go so far as to annihilate all effect-determiners that it disagrees with. For a sketch of an argument that this picture is unlikely, see "The cosmopolitan-Leviathan enthymeme".

2.4.6. Reflective endorsement plus competent self-modification implies stability (constraint)

If a mind deeply wants X, then the mind also wants to want X. If the mind wants to want X and is competent at self-modification, and in particular competent at maintaining its wants, then it will maintain wanting X and wanting to want X. So deeply wanting X is a reflectively stable property. Gandhi doesn't take a pill that makes him want to murder people.
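Here is a toy version of the Gandhi example, as a sketch. The assumptions are loud ones: values are a single boolean, the world model `murders_caused` and the scoring function `utility` are hypothetical stand-ins, and "competent at self-modification" is reduced to judging a proposed change by the current values rather than by the proposed ones.

```python
def murders_caused(values):
    """Hypothetical world model: a mind that wants murder commits one."""
    return 1 if values["wants_murder"] else 0


def utility(values, murders):
    """Score an outcome (a number of murders) by the given values."""
    return murders if values["wants_murder"] else -murders


def accepts_modification(current_values, proposed_values):
    """Judge a proposed self-modification using the *current* values."""
    u_keep = utility(current_values, murders_caused(current_values))
    u_change = utility(current_values, murders_caused(proposed_values))
    # Only modify if the current values strictly prefer the result.
    return u_change > u_keep


gandhi = {"wants_murder": False}
pill = {"wants_murder": True}

print(accepts_modification(gandhi, pill))  # Gandhi declines the pill
```

Note that the same check makes the pill's values stable too once they are installed: the argument is about the stability of whatever is deeply wanted, not about whether it is good.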

2.4.7. "Wanting" should be reflectively stable effect-determination (handle-criterion)

See the previous subsection.

We could use wanting as a handle: If we want to make a mind that has effect X, we do that by making a mind that deeply wants X. Then, even as the mind grows, it maintains its wanting of X.

Since "wanting" isn't clearly understood, the foregoing statements aren't quite propositions. They are promissory propositions. They say: When "wanting" is more clear, this will be a true and useful proposition, with "wanting" interpreted in the newly understood way. As promissory propositions, they provide criteria for the concepts that descend from our current unclear idea of "wanting". The criteria say: The "wanting"-concepts should be such that they play roles in these propositions that make these propositions true and useful.

Values—as we often naturally use the term—aren't fixed, and this is important for understanding minds. See human wanting, value creation, value selection, Ruthenis on value formation, Ammann on value change, and Ngo on value systematization.

However: The fact that [values, as we often naturally use the term], are malleable, doesn't render irrelevant or nonsensical the idea that wanting is reflectively stable effect-determination. It means that multiple concepts are being used. And, it means that [the concept of "values" that makes "Values change." be true] is inadequate as an idea of values, in that it doesn't make "Making a mind want X will make the ultimate effects of the mind be X." be true. Part of what we wanted out of a concept of "wanting" was that if a mind wants X, then the mind will go on wanting X and will make X happen. So if we have a candidate concept, and we say that it's a concept of "wanting", but a mind "wanting" X is not reflectively stable in this sense of "wanting", then we did not get what we wanted out of a concept of "wanting" from this candidate concept of "wanting". If "Values change.", then "values" aren't "the real values". Saying "the real" is presuming too much—it's asserting one criterion on a concept of "values" as exclusive over other criteria on a concept of "values". We can also have good reason to be interested in "Values change." and the concepts of "values" relevant to that proposition. But still, if "Values change.", then "values" aren't "the real values"—that is, ["values" as used in "Values change."] are not the real ["values", as used in "Values are reflectively stable effect-determiners."].

3. Where do values come from?

If we're looking for a notion of values that describes a handle, we might start with our intuitions about values, wanting, and so on. When we do this, we find a strange fact:

Values change.

As discussed above, this fact poses a problem, because what we wanted out of an idea of values isn't something that changes:

If "Values change.", then ["values" as used in "Values change."] are not the real ["values", as used in "Values are reflectively stable effect-determiners."].

This points us back to the question of where values come from—in other words, what does a telotect look like? Specifically, when we ask this question "Where do values come from?", what we're really trying to ask is: What sort of thing can be there all along in a mind, in a way that we could wield, that is fixed but that controls whatever is important about the mental elements we naturally call "values"?

4. The fact-value distinction

What is this distinction? Does it matter?

4.1. The basic fact-value distinction

The simple distinction is: "There's a mug on the table." is a fact, while "Blueberries are good." is a value. Some other facts:

2 + 9 = 11
Almost all ravens are black.
Compressing a gas increases its temperature.
Species evolve by selection among random variation for reproductive fitness.
In 1492, Columbus sailed the ocean blue.
The sun, visible or not, will rise above the horizon tomorrow.
I'm listening to Androcell.

Some other values:

Be honest.
Do not murder.
It is good to be kind.
Fairness in dealings will be rewarded reciprocally.
Selfless acts will be rewarded in the afterlife.
Carrots are good for you.
Soylent Cafe Mocha is tasty.
Hiking with friends is good.
Never harm another agent's ability to transcend conflict.
Always clean up after yourself.
Whistling on the street is unbecoming.
Bring me a pillow.
Pork is impure.
You want to summit Olympus Mons.
I like Scriabin.
Good people do not seek revenge.
She thinks trees are neat.
Flying a kite is fun.
[waves a flag]
You shouldn't go there.
This electric drill works well.
It is immoral to mock people.
Let's build something beautiful.

This distinction is quite clearly real, although it's unclear what the distinction is exactly.

4.2. Complications with the distinction

There are many complications with this distinction. For example:

which concepts we use, is somewhat value-laden;
many concepts are value-laden, so that "He is a teacher.", which seems like a possible fact, implies the ought-statement "He ought to do, within reason, what will lead the student to understanding.";
how we interpret factual statements calls on value judgements, e.g. judging what counts as a paperclip in "There's no way to make a paperclip without expending at least 1000 bits of entropy.";
some values are instrumentally convergent;
value statements can be translated into conditionals like "If you have a goal of gaining energy for your body, then eating oil would satisfy your goal.";
value statements can be translated into descriptive statements like "This agent has a goal of making it big on YouTube.";
value statements implicitly make fact claims, e.g. "I like blueberries." claims "Blueberries are a thing.";
we are created already in motion with values and facts mixed together without predemarcation, exerting themselves in the same motion through the same machinery. For example, aesthetic judgements like "This is a good mountain." conflate (from the beginning, before there are separate judgements) the judgement "This mountain is tall, and therefore strategically useful." with the judgement "This mountain is pleasing to look at, I like looking at it, regardless of its use for anything else.". For example, is a behavior (e.g. a plan or stance) an expression of value or an expression of a belief? It's both; it says "I want such and such result, and I believe this is how to get it.";
there are inexplicit facts and values—does the distinction still apply, though they are not already expressed as propositions?;
to track truth, we have to follow rules of behavior, such as correcting errors and seeking information—in other words, is curiosity a value?;
to believe a proposition is to have good reason to expect the proposition to be true, which in particular requires the mind to accept first that its activity ought to be such that it grasps propositions by interpreting the meaning of statements, and second that its activity ought to be such that it tracks the truth or falsity of statements somewhat (see Jessica Taylor's essay "Is requires ought");
the reason we learn facts is that we have desires which we pursue via understanding;
what we are able to value about is determined by what we are able to think;
if a fact is a true proposition, and truth is grounded out in a pragmatic or functional way, then the pragmatic or functional ground tends to be value-laden. For example, if "P is true." is taken to mean "It is successful to take actions recommended by an action-to-consequence map that conforms to such-and-such formal relations with P.", then "successful" is value-laden and hence the whole statement that grounds out P being a fact is value-laden.;
the pretheoretic idea of wanting doesn't strictly separate instrumental values from terminal values, and instrumental values are often highly shared across very many possible agents. So some values (the convergently instrumental ones) are in some sense objective—wanting energy is an Ought that comes along with the Is of existing as a mind, as a strong actor. (See also Taylor's essay.) It's almost as though you can deduce the Ought of wanting values from nothing, a priori. So unlike the stereotype of values as being subjective, special to each agent, variable, some value-ish mental elements are not like that.;
values tend to be shared among humans, so values are shared even if they are not convergent, and value statements (e.g. "This is a good movie.") are made exactly because they are about sharedly-successful strategies for contingently-shared values;
statements of value might be thought of as having a special relation to the speaker that statements of fact don't have, but some statements of fact ("I'm sitting in a chair.", "This chair is probably not good for my back.") also have some special relation to the speaker, so the distinction hasn't yet been made clear;
beliefs and predictions are part of and affect the world to be predicted;
understanding may tend to, or even necessarily, carry values with it;
some of what we mean by "truth" may be more properly understood as value-laden. We have proleptic values, which ask our fact-like understanding to understand things proleptically. Prolepticness is core to what we mean by truth: wanting to say "P is true." comes from wanting to say "I, and other minds, would come to believe P, and would want to do so, upon further investigation.". If it makes sense to separate the value-laden from the non-value-laden, then some of the prolepticness of truth may come from the prolepticness of our values, and the prolepticness of our values may be contingent.

4.3. What is the type of "value"?

Further, it's not immediately clear what the type signatures of "fact" and "value" are supposed to be.

4.3.1. The Is-Ought chasm

Hume's is-ought problem and the fact-value distinction both discuss two sorts of propositions: statements of fact and statements of values. The notion is that there are two sorts of propositions, and these sorts are "entirely different" from each other—in particular, so that Oughts can't be derived from Ises.

But even granting that there is such a division, why is it remarkable that there is such a division? It is also as much the case that you can't validly derive propositions mentioning hummingbirds from only propositions not mentioning hummingbirds; you can't validly derive propositions spoken from my perspective from only propositions spoken from your perspective; you can't validly derive propositions in Hebrew from only propositions in English; you can't validly derive propositions that assume the Axiom of Choice from only propositions that don't; and you can't validly derive false propositions from only true propositions. So, granting that the Is-Ought chasm is evidence of, or constitutive of, an important joint-carving difference in sorts of propositions, it's only weak evidence or partial constitution.

4.3.2. Pulling on threads leading away from propositions

There's clearly something important in the fact-value distinction. So what is important about Oughts—statements that use the word "ought", or that more generally express values with words like "like", "good", "want", "should", "well", "will", "moral"—as distinct from facts? Or, if what's important about Oughts does not show up clearly in [Oughts, as propositions], then what other (non-proposition) sort of thing are [Oughts or values, showing what's important in the fact-value distinction]?

There's some extra ʒuʒ that comes with Oughts. What is it?

Propositions—that is, what lies behind statements of fact—are said to have assertoric force. We could then distinguish Oughts from Ises like so: a statement of fact exhorts or pushes the hearer to believe something; a statement of value exhorts or pushes the listener (including the speaker) to do something.

More generally, we can ask about the mental context that makes a fact a fact and that makes a value a value, and then distinguish facts from values by distinguishing between fact-contexts and value-contexts. This leads us away from the facts and values "themselves", toward what makes them facts and values. E.g., maybe we want to understand a value as a mental element that drives the mind to act, and a belief as a mental element that drives the mind to believe.

We're no longer trying to talk specifically about propositions or statements, and we're faced with the question of what sort of thing values are, if they are a sort of thing—we're asking what can be a telophore. We're asking about drives and what drives do, or utility functions and what utility functions do, and so on. From this perspective, the appearance of an Is-Ought chasm comes from viewing an Ought as inextricably bound up together with the rest of the context that makes it an Ought. In other words, bound up with the telophore.

By analogy, consider the classes of statements called "modalities". Possibility ("It may be that..." or "It is necessarily the case that..."), epistemic ("She believes that..."), probabilistic ("With probability 40%, ..."), and counterfactual ("It could have been the case that...") modalities are contexts that modify statements. A modality produces a statement that has some sort of different quality from an ordinary statement of fact. While "There's an apple balanced on my head." is a statement of fact that talks about the real world, "I could balance an apple on my head." talks not directly about the real world but instead talks about some sort of hypothetical world. To figure out what's going on with some modal statement or some modality in general, we'd want to not just look at the formal properties of these statements as propositions (such as which modal axioms imply which others), but also look at how these statements are playing roles in a mind that are somehow interestingly, thingly different from the way that ordinary statements of fact play roles in a mind.

4.3.3. Back to the basic distinction

If our starting place is mental elements such as drives, repulsion, directing and generating behavior, spurring on mental activity and growth, and so on, then [values as propositions] seems like a distraction. Liking blueberries is not even close to being centrally a proposition, is it?

4.4. Preciser fact-value distinctions

Some distinctions pointed at in the fact-value distinction:

| Value-like | Fact-like |
| --- | --- |
| autonomous, active; elements that do things of their own accord | inert, passive; elements that sit around waiting to be used (but note that to be truth tracking, elements do need activity) |
| mental elements that structure mental activity (e.g. plans, intentions, policies, rules, criteria); elements that control other elements | mental elements that are structured or controlled by other elements |
| actualizing, having results | possibilizing, making results possible |
| any differences between minds that don't wash out in the limit of a mind's operation, e.g. coherence and reflection (this may include multiple fixed points of decision theory, e.g. U/FDT-like vs. Son-of-CDT, and other different cognitive realms) | any canonical elements that most minds converge to having |
| elements that determine anything about the mind's external behavior in the limit | elements that don't make an externally visible difference |
| elements that determine the directions of a mind's ultimate effects | elements that determine the magnitude of a mind's effects |
| what makes a mind describable as one agent, across an ontology shift | what a mind gains in an ontology shift |
| telopheme | telophore (though note that a telophore contains action-oriented elements, e.g. a decision theory) |
| a (total or partial) valuation on possible worlds (or possible actions, partial worlds, mental states, sense data, whatever) | a description of possible worlds |
| a statement with exhortative force, that bids the listener to do something | a statement with assertoric force, that bids the listener to believe something |
| elements that we want to interpret using the mental context we use to have goals | elements that we want to interpret using the mental context we use to have beliefs |
| elements that are deemed by the mind (e.g., by the mind's meta-values) to be proleptic indications of what is to be done with the world (e.g., to be interpreted by FIAT or value systematization or other interpretive value change) | elements that are promissory notes for canonical concepts (e.g. pointers to Things) |
| an aspect of a mental element that will remain, but that would not be useful for another mind to use geminily | an aspect of a mental element that would be useful for another mind to use geminily (e.g., if you can touch the elephant's trunk, I would like to copy your sensory data into my sensory stream) |
| elements that serve non-convergent goals | elements that serve convergent goals |

Ah Motiva 2: Relating values and novelty

tsvi bt Nov 27, 2025 Updated Nov 27, 2025

1. Background
2. Capable minds with specifiable effects
3. The idea of values is promissory
4. Are values essentially diasystemic?

1. Background

This is the second essay in a series under the title "A hermeneutic Movement of the idea of values". This essay is a mix of old notes and some new meditations. The previous essay is "Ah Motiva 1: Words about values".

2. Capable minds with specifiable effects

The starting point of AGI alignment is the question of how to make a mind that is highly capable, and whose ultimate effects are determined by the judgement of human operators.

In other words, the mind should empower humans. It should possibilize a lot for the humans. There should be some channel which the human operators can handily use to determine the ultimate effects of the mind.

3. The idea of values is promissory

3.1. Using the idea of values to specify effects

This desired situation—where we understand minds enough to make such a mind and to specify its effects—provides a criterion for values through the formula:

The mind's values should be such that we can specify the direction of the mind's ultimate effects.

This formula both provides a criterion for the mind's values, and also provides a criterion for our concept of values. It says that our concept of values should be such that the formula is useful. E.g., our concept of values should be such that it makes sense (is relevant, meaningful, clear, useful, testable) to say of a mind that its values are such that we can specify the mind's effects.

If we assert this formula, we are conjecturing that our current idea of value (our pointers, intuitions, partial concepts, connections) is such that it is a useful approach to ask: How can a mind's values be, so that we can specify the mind's effects? The conjecture is a promissory note that says: The concept of values will be revised, starting with our present idea of it; and that process of revision will homotope the idea to a good ensemble of concepts.

With this formula in context, our present concepts about values are like conjectures that say: This concept is a good starting point for finding useful concepts in the region of values.

3.2. Human wanting

Our present concepts about values come mostly from our familiarity with humans and human wanting. See "Human wanting", especially the dimensions of wanting laid out by the variety of human wanting.

3.3. Aside on ideas vs. concepts

Here "concept" is as synchronic as can be, and "idea" permits diachronicity. A concept is what's pointed at by a description of mental elements as they presently relate to other mental elements. An idea is less determined, and to follow its reference would require resolving some provisionality. An idea includes present concepts, the past history of those concepts, and the future ensemble of concepts that the present concepts will transform into by role-isotopy (that is, by being replaced with more suitable concepts for the future analogs of the present roles presently played by present concepts).

We're interested in not just the present concepts of values, but the broader idea of values—including its role, the context that gives it its role, and its future manifestations as concepts.

3.4. Promissoriness asks for holding off on demarcation

There's a natural motion in conceptual analysis: Make a definition, and make it precise. It won't capture everything, but it will be easier to analyze; it will point at a smaller class of examples, it will direct attention to clear central examples, it will be more fixed as the analysis goes on, it will be more amenable to formal analysis.

Demarcating the subject of discourse in that way is useful sometimes. Its key disadvantage for us is that a demarcated notion is like a small patch of a net, which is to say, not a net. A demarcated notion can't play the role it would have to play in a successful hermeneutic net.

Instead of a demarcated notion, we have several provisional concepts, which together stake out a region. These concepts might shift around over the course of our inquiry, and might expand outward from this region. The spread of flags planted in the ground might reveal contours of a conceptual wormhole, leading to a wormhole in the inquiry.

In other words, restricting attention to one notion of value would sterilize the growth of the idea of values that was promised.

4. Are values essentially diasystemic?

A proposition:

Values are essentially diasystemic.

This is really a family of claims: For any notion of value, and for any particular value that fits that notion in some mind, that value runs across the grain of the mind. The value is diasystemic relative to almost all other mental elements: It doesn't fit alongside other elements as another element of the same type that interfaces like other elements do. Rather, it touches everything—more precisely, a value is constituted by everything else in the mind.

4.1. Example: Self-regenerating friendship

A friendship is compiled down into patterns of attention, such as noticing the friend walking down the street more easily than noticing a faint acquaintance. And these patterns of attention can recover a damaged friendship: Noticing a friend who has been drifting away, walking down the street, brings the friendship back up—the pattern serves as a signpost to recover caring. Or, doing an activity that's better with the friend will call for the exertion of skills, and those skills, in their exertion, will call for their accustomed resources, which include the friend. The friend, being called for, is valued, and so the friendship is valued.

Habits and memories and skill are supported by the physical environment, as in a chef orchestrating a complex dish in deft reliance on supplies and equipment in their proper place, or the promises that haunt a childhood home. Analogously, each [mental element that constitutes the value] has, as its supportful dwelling, the rest of those mental elements that constitute the value, and reciprocally supports those other elements, reminding them to be what they are. If all the "mere models" related to the friend were deleted, how would the friendship still be there? On seeing (with what recognition?) the friend, there's an emotion of warmth—but that might fade quickly, with no traction pulling one into the old, now-unfamiliar patterns of togetherness. It's conceivable to regrow the whole mode of friendship from just a single emotional event, but that's not how humans work (except when plummeting in love).

4.2. Human values are diasystemic

Isn't this just a quirk of our lowly origins as sentient mud? Wouldn't a more cleanly architected, more coherent mind have its values factored out from its possibilization?

Maybe! A more modest claim, illustrated by the above example, is visibly true:

Human values are presently diasystemic.

In other words, the "flesh" that constitutes "one human value" is not something like one chunk of brain matter, and is not something like one concept, and is not something like one plan. It's not one envisioned world, or even a pamphlet full of principles. It's not one element that comes with a familiar and comprehensive interface. It's not demarcated from other elements, but rather it is constituted by many elements, as some higher-order organization of aspects of those elements.

To some extent this is temporary. It could be written down, and placed in some trusted crypt, that so-and-so is a good friend. If such a message is trusted, it could on its own be enough to say the value, and that saying would somewhat more densely determine the value compared to the naturally messy distributed caring.

4.3. Values require reference

The question stands, are values nevertheless to some extent essentially diasystemic?

4.3.1. Example: Blueberries

For example, I reach out and pick up some blueberries. This is some kind of expression of my values, but how so? Where are the values?

Are the values in my hands? Are they entirely in my hands, or not at all in my hands? The circuits that control my hands do what they do with regard to blueberries by virtue of my hands being the way they are. If my hands were different, e.g. really small or polydactylous, my hand-controller circuits would be different and would behave differently when getting blueberries. And the deeper circuits that coordinate visual recognition of blueberries, and the deeper circuits that coordinate the whole blueberry-getting system and correct [errors in the performance of blueberry find-and-pick-upping] based on blueberrywise success or failure, would also be different. Are the values in my visual cortex? The deeper circuits require some interface with my visual cortex, to do blueberry find-and-pick-upping. And having served that role, my visual cortex is specially trained for that task, and it will even promote blueberries in my visual field to my attention more readily than yours will to you. And my spatial memory has a nearest-blueberries slot, like those people who always know which direction is north.

It may be objected that the proximal hand-controllers and the blueberry visual circuits are downstream of other deeper circuits, and since they are downstream, they can be excluded from constituting the value. But that's not so clear. To like blueberries, I have to know what blueberries are, and to know what blueberries are I have to interact with them. The fact that I value blueberries is founded on my being able to refer to blueberries. Being able to refer to blueberries is founded on my being able to manually investigate the world. Certainly, if my hands were different but comparably versatile, then I would learn to use them to refer to blueberries about as well as my real hands do. But the reference to (and hence the value of) blueberries must pass through something playing the role that hands play. The hands, or something else, must play that role in constituting the fact that I value blueberries.

4.3.2. The concrete is never lost

In general, values are founded on reference. The context that makes a value be a value has to provide reference.

The situation is like how an abstract concept, once gained, doesn't overwrite and obsolete what was abstracted from. Maxwell's equations don't annihilate Faraday's experiments in their detail. The experiments are unified in idea—metaphorically, the field structures are a "cross-section" of the messy detailed structure of any given experiment. Abstraction is a gain, not a loss.

The abstract concepts, in order to say something about a specific concrete experimental situation, must be paired with specific concrete calculations and referential connections. The concrete situations are still there, even if we now, with our new abstract concepts, want to describe them differently. In the same way, a value, as an element that is not tethered to one specific situation, has to interface with specific situations—via reference.

4.3.3. Is reference essentially diasystemic?

If so, then values are essentially diasystemic.

Reference goes through unfolding.

To refer to something in reality is to be brought (or rather, bringable) to the thing. To be brought to a thing is to go to where the thing really is, through whatever medium is between the mind and where the thing really is. The "really is" calls on future novelty: the "really is" is the Cavern that the Thing is, which calls for stepping into it. See "pointing at reality through novelty".

In other words, reference is open—maybe radically open. It's supposed to incorporate whatever novelty the mind encounters—maybe deeply.

An open element can't be strongly endosystemic.

An open element will potentially relate to (radical, diasystemic) novelty, so its way of relating to other elements can't be fully stereotyped by preexisting elements with their preexisting manifest relations.

Does this imply that open elements are diasystemic?

4.3.4. Example: Parliament

Say we value a parliamentary system of government. That is, we want to make decisions according to a parliamentary process, including decisions in very new situations. When there's some new issue to deal with, we want to discuss the problem, hear perspectives, try to persuade each other, try to understand the constraints and possibilities, get to the truth of things, aggregate preferences, and negotiate plans we can agree to cooperatively follow. There are rules about who gets to say what when, and who gets what control over decisions.

Is a parliamentary value diasystemic? Not really. Parliamentariness doesn't pervade many regions of the mind in the way that {information theory / Bayesianism, computational complexity, a new ion pump, a convergently discovered but not yet unified algorithm, a sound shift, or a major code refactor} pervade many regions of the mind.

Well, it could govern many regions of the mind, and call on many regions of the mind, but that's not diasystemic existence—it isn't overlapping in structure with many diverse elements. It is like a table on which many mental elements are resting. The table touches many mental elements, but those elements are separate from the table. The table is a container or backdrop or support or neighbor for many elements; it constrains many elements (from falling), like the Parliamentary system constrains many agents in many contexts through rules of procedure. But those agents are more or less left alone, besides being placed in the container.

4.3.5. Stable process values are radically open

Wanting to have a parliamentary system of government is an open value. It doesn't have a domain of value that's already explicitly given. Wanting to be parliamentary is not a value like "there should be such and such government projects" or "we should make such and such laws" or "we should engage in such and such military conflict".

Instead, the value of wanting a parliamentary system refers to any possible future domain, through the intermediary of the parliamentary process of accommodating novelty. It's not natively represented or well-described as a preference ordering on worlds, though a preference ordering on possible worlds could be backed out ex post facto by looking at the outputs of the parliamentary process. (Though one would also have to back out a description language for possible worlds; and to do this in advance, one might have to simulate-to-equivalence the parliamentary proceedings.) It doesn't say that the budget should be this way or that, or that this or that should be illegal or mandatory or regulated. It says that those questions should be answered by a parliament.
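As a rough illustration of the contrast in the paragraph above (a toy sketch, not a model of any real parliament): a process value can be written as a procedure that accepts whatever question comes up, while a preference ordering only appears afterwards, by running the procedure on each pair of options. The names `parliament_decides` and `back_out_ordering`, the majority-vote rule, the tie-breaking rule, and the example members and options are all hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# A member is modeled (hypothetically) as a function that scores any option,
# including options nobody anticipated when the process was set up.
Member = Callable[[str], float]

def parliament_decides(members: List[Member], a: str, b: str) -> str:
    """Toy process value: settle a pairwise question by majority vote."""
    votes_for_a = sum(1 for m in members if m(a) > m(b))
    votes_for_b = sum(1 for m in members if m(b) > m(a))
    # Ties go to the first-listed option: an arbitrary rule of procedure.
    return a if votes_for_a >= votes_for_b else b

def back_out_ordering(
    members: List[Member], options: List[str]
) -> Dict[Tuple[str, str], str]:
    """Recover pairwise preferences ex post facto by running the process."""
    return {
        (a, b): parliament_decides(members, a, b)
        for a, b in combinations(options, 2)
    }

# Three made-up members with made-up scoring rules.
members: List[Member] = [
    lambda option: len(option),         # prefers longer names
    lambda option: -len(option),        # prefers shorter names
    lambda option: option.count("a"),   # prefers names with more 'a's
]

options = ["tax", "canal", "armada"]
ordering = back_out_ordering(members, options)
```

In this sketch the table `ordering` covers only the options that were actually run through the process, whereas `parliament_decides` can be called on an option that was never listed (say, `"zeppelin"`) without changing anything. That is the sense in which the ordering is a by-product of the process rather than its native form.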

Wanting to have a parliamentary system of government is a process value, which is a subspecies of metavalue. The value has us wanting to deal with novelty according to some given rules. The rules are about a system that can deal with novelty. The system can spread caring into novelty. It can care about a world transformed, refined, and expanded by new understanding. The caring is transported across conceptual schemes. Values are reinterpreted in new language. Incorporating novelty into the mind also incorporates the novelty into the mind's caring.

A process value is a value that says what process a mind should use, e.g. to make a decision or to modify itself. A process value that doesn't "leap along with" increments of the mind's voyage of novelty will be "left behind". The novelty will be alien to the process value. The novelty will be wielded by, or even come along with, other values. Those other values will have no reason to cede any control to the process value. In other words, a non-open process value will be usurped, and so it isn't a stable value of the mind.

Imagine a parliamentary government that can't understand a new technology, and the new technology is strong enough to recenter power away from the parliament. The parliamentary process might seem comprehensive from the inside; it could handle anything that's brought up to it. But failing to extend its control through the new technology, the parliament won't defend itself against whatever values do extend their control through the new technology. Those other values will disempower and replace the parliament.

It might be that the new technology is not mediated by modular, explicit artifacts like nuclear weapons—but instead, the new technology is a new way of thinking. To not be left behind, the parliament would have to be able to incorporate the new way of thinking. Even trying to stamp out the new way of thinking, keeping it as ectosystemic novelty, would require the Understanding Police. The Understanding Police have to understand what the new mental technology looks like and how it participates in mind and optimization; otherwise, the new way of thinking can hide behind decoys, alienness, and false intentions. The parliament has to be open to not just new artifacts, but new ways of thinking. It has to be radically open.

4.3.6. Radically open elements are transsystemic

So reference and a parliamentary value are each non-endosystemic. A parliamentary value is not diasystemic. What sort of novelty is a parliamentary value, then? It is transsystemic: it points from within the system to beyond the system. Transsystemicness is a sort of complement to provisionality: a provisional element ought to be treated as though it might be revised; a transsystemic element provides the driving force for novelty, e.g. the novelty of a conceptual revision. Provisional elements may be revised; transsystemic elements may do the revising or bear the revising (as a channel bears the water).

For example, the elements that constitute a mind's creativity are transsystemic. E.g. curiosity: it points from within the mind towards what's beyond the mind, structure that the mind hasn't encountered or doesn't understand or hasn't made explicit. (Though in humans, curiosity is also diasystemic—it can bubble up obliquely from its hiding place in many different regions, e.g. you can become curious about many different domains and in many different moods, but in a way that is intimately "of" those domains and moods.) If you're curious about ants, then your concepts about ants are provisional because your curiosity might revise your concepts about ants by driving investigations.

Reference is essentially radically open, hence essentially transsystemic. Stable process values are essentially radically open, hence essentially transsystemic.

4.4. Self-interpretive metavalues produce diasystemically novel values

4.4.1. Self-interpretive creativity

A metavalue creates, destroys, or otherwise modifies values. E.g. by clarifying them, tweaking them to be about something different or with a different valence, or generalizing them so they apply to a world expanded by novel understanding. A metavalue is a species of creativity.

"Interpretation" is here a pretheoretic idea. The examples below will gesture at interpretation. To interpret X is something like receiving X as though it's a message sent from a mind, or more generally as though it's an expression of something in a mind. Interpreting a mental element means "recovering" something, as if "from behind" or "from within" the element. E.g. if you read a sentence, you want to then interpret it to get the propositions expressed by the sentence—otherwise all you have is a sequence of letters. E.g. if you feel a dislike for someone, you want to interpret it—as a command to get away from them, or as a hypothesis that there's something bad about them or about how you are when you're with them.

A self-interpretive creativity is a creativity whose action is shaped by interpreting the mind, so that the novelty that the creativity produces will in some way incorporate interpretations of the mind. (Here "self" refers to the mind, not the creativity; maybe it should be called mind-interpretive?) Since the novelty incorporates something recovered from preexisting elements, it cuts across those preexisting elements. For example, suppose I conclude that God gave me a voice so that I can sing. This conclusion is very bound up with the concept [a voice as a musical instrument]. That's one aspect of my voice, not the whole thing. Something's been abstracted from my voice, and incorporated as the key idea of a new value—the new value of singing. The abstraction is a gaining, not a loss or restriction or impoverishment; the idea of the abstraction (my voice as an instrument) wasn't there explicitly before (in how I oriented to my voice simpliciter), and now it is. The idea definitely involves the voice, but cuts across the voice in a novel way, showing something new in the voice. The self-interpretive creativity—in this case, viewing myself as God's creation, and viewing my elements as elements put there by God for some purpose—has added something novel (my voice as a singing instrument) to a preexisting element (my voice); the novelty is bound up with the preexisting element but cuts across it. And the self-interpretive creativity does this to many elements.

To say it another way, a self-interpretive creativity produces novelty in the form of novel relations of the preexisting elements. The preexisting elements are remade; they take their place in the new mind that gives them a new role. In other words, interpreting an element places it into a new context. The element being placed into a new context constitutes novelty. And, that novelty is skew to the preexisting element as it was previously. Before, the voice was for speaking; now it is also for singing, where the "for" is a novel relation.

Interpreting one element requires the context provided by other elements. So interpretation incorporates structure from many elements. Interpreting a difficult message requires a lot of reading, so interpretation interprets many elements. Interpreting the whole self—the whole mind—involves reading many elements. For example, sometimes understanding what a human meant by a message requires understanding a lot about zer—zer history, zer goals, zer common-knowledge stance with respect to you. It also requires the context that you provide—the context for gemini modeling the element.

So mind-interpreting creativity touches and incorporates many elements, and does so in a way skew to those elements' preexisting relations. That is, mind-interpreting creativity produces diasystemic novelty.

Since metavalues are a species of creativity, mind-interpretive metavalues are a species of mind-interpretive creativity. Therefore mind-interpretive metavalues produce diasystemically novel values.

4.4.2. Example: FIAT values

A human is a book that can be read. The hieroglyphic hand, and the visual cortex finely tuned like a watch to recognize colorful patterns, signify that mangos are for eating. One could say that God gave us {hunger, a fear of dangerous things, and an instinct to protect children} and gave us {object recognition, fine motor control, mental workspace broadcasting, Bayesian updating, and causal analysis}, in order that we would survive and reproduce—which is therefore our purpose. God did not literally create us, but the resulting motion of interpretation mostly makes sense.

See FIAT for more examples.

The FIAT metavalue interprets mental elements as expressions of the striving of a hypothetical stronger mind. A mental element is a sort of failed attempt, a deficient version of a corresponding element in some hypothetical stronger mind, or an aspirational beginning of some stronger pursuit. A mental element therefore gestures at a stronger pursuit. FIAT adopts that larger pursuit as a goal.

4.4.3. Example: Corrigibility

From "Hard problem of corrigibility":

Reason as if in the internal conjugate of an outside force trying to build you, which outside force thinks it may have made design errors, but can potentially correct those errors by directly observing and acting, if not manipulated or disassembled.

A corrigible mind might figure out what to do in some situation. Then it thinks: I've figured out what to do. I've figured out a plan that, if I execute it, will result in good outcomes. But, the presence of this plan and its justification in my understanding—and the process that generated the plan and its justification, including my understanding of what counts as good outcomes—is not just me doing my thing. Rather, it's the downstream result of [the source of good agency], doing its thing, but in a flawed way. I'm the result of a flawed attempt to create a delegate. The fact that this plan is in my attention along with a judgement that it's good to execute might mean that it's good to execute, but it's also likely to mean that the process of the good agency building a delegate has gone wrong somehow in a way that produced this plan.

A corrigible agent interprets itself, its elements, as flawed—as an attempt by the humans to do something difficult. It interprets its goals as mere proxy-goals, subgoals or experiments towards some unknown supergoal; it interprets its pursuit of goals as Goodharting.

4.4.4. Example: Extending identity

It might be natural to interpret [metavalues that involve an agent identifying with other agents] as being self-interpretive metavalues. A mind with such a value looks at itself and asks, if all this is just a part or instantiation of a single agent that's shared across such and such minds, what agent is that? What is the agent structure that is the intersection of the agent structures of all these minds, and is really the core of each of them? I'm really that agent.

E.g. loyalty to one's creator; acting as though you are those other agents that use the same decision theory as you or are otherwise "the same" as you; fusing values with other agents that you can negotiate with, or understand well enough to coordinate through mutual trust; acting as though you're behind a veil of ignorance or updatelessness; corrigibility in the sense of viewing yourself as part of a whole that extends across yourself and humans.

Ah Motiva 1: Words about values

tsvi bt Nov 26, 2025 Updated Nov 26, 2025

1. Background
2. Explicating etymons
3. Want
4. Try, attempt
5. Plan
6. Intent, interest
7. Matter
8. Desire
9. Care
10. Pursue, control
11. Will, volition, wield, value
12. Select, decide
13. Choose, taste, aesthetic
14. Drive, motive
15. Moral
16. Free, proper
17. Yearn
18. Wish
19. Aim
20. Need

1. Background

This is a first essay in what may or may not be a series of essays, under the title "A hermeneutic Movement of the idea of values". Some of these essays will be a mix of old notes and new meditations.

These essays are aimed at doing philosophical background work towards understanding values. This is in the context of addressing core issues of aligned AGI, especially the problem of what determines a mind's effects. Human values are subtle and multi-aspectual, partly in ways that don't much affect de novo mind design, but probably partly in ways that do affect that design challenge. Far from being a question of identifying endosystemic and explicit "things that we value", the question of values appears much more philosophically fraught. We seem to have barely even the foggiest notion of what sort of thing values are; what sort of mental contexts we ourselves employ to construct and bear out our own values; and how our own metavalues work. We don't know what sort of mental elements or mental systems there can be; we don't know what sort of mental elements in what sort of mental systems can be reflectively stable and therefore longitudinally meaningful, rather than being swept away by whatever wildfire might overtake a nascent strong mind.

The phrase "hermeneutic movement" refers to a hermeneutic net for agency, a hypothetical philosophical method for solving world-historically difficult problems in a short time. It probably wouldn't work, it's just my best guess. This essay (or series) is an aborted attempt to take some steps on that path. My hope in publishing is to gesture a bit more vigorously at the sort of investigation I mean to describe by "hermeneutic net". Since I think this problem is too difficult for us to address, I'm not planning on taking this investigation further (instead favoring plans to enable humans to be much smarter, as well as stopping AGI capabilities development). But, maybe the ideas are enriching anyway.

Note that many of the ideas presented here are meditations, in this sense: https://tsvibt.github.io/theory/pages/bl_24_09_23_02_27_14_589494.html

2. Explicating etymons

This essay meditates on words that touch on values. (Some are imported from "Control" and heavily modified. All etymons are taken from wiktionary.org and are often uncertain; discovered using radix and etymonline.com.)

Here are some of the explicit reasons I find this a valuable meditation:

Etymons in fact show something about how we use the words / morphemes and the ideas they relate to.
By connecting multiple words through shared etymons, we see metaphors that in fact were historically crossed, which are therefore interesting starting metaphors.
By meditating on many related words, we recover for ourselves a toolbox for covering many distinct aspects/dimensions of valuing. Two of the ways this covering-toolbox works:
- By using preexisting connections to different aspects of valuing that different words possess.
- By planting several flags in the region, around which different connections can grow, in a specialization ecosystem.

3. Want

Cognate with "vacuum" (as in "having an emptiness, lacking something"). Also cognate with "vanish", "evanescence", "vacant", "vast", "void", "devastate", "waste". This suggests homeostatic pressure and satisficing. It also suggests a pursuit of something positive: there's a lack, and something positive is sought to fill the vacuum, the void.

Apparently there's no word W such that "I W blueberries." means "I have an undesired excess of blueberries.", grammatically analogous to how "I want blueberries." means "I have an undesired deficit of blueberries.". If only there were a shared craft of deliberate lexicogenesis. One could say "diswant" or "want-not" or "antiwant", but these feel like calling an overflowing cup a "very non-empty cup".

I'll say "overhave". "Overhave" is redundant with "want", in that "want X" means "overhave a lack of X" and "overhave X" means "want a lack of X". But phrasing X positively, as the positive Thing, is more natural. E.g. wanting your house to stand implies wanting the bulldozer to not fulfill its namesake—but not-bulldozing isn't a thing; rather, bulldozing is a thing. So you overhave the bulldozer bulldozing. It's a bit unfortunate that this sounds like "you already have some [house being bulldozed] and you want less". But "want" really has the same problem in reverse: even if you are already healthy, you still want to be healthy—there's no actual lack of health. Also, "I overhave my house to be bulldozed." can be understood as having some [possibility of my house being bulldozed], and wishing to have less of that (and likewise, even if you're healthy, you have a hypothetical undesirable lack of health in some possible worlds).

4. Try, attempt

"Try" from Old French "trier" ("to choose, test, verify").

"Attempt" = "ad-tent" ("towards-test") (analogous to "attend" ("towards-stretch"), compare "intend"; cognate with "tentative", "tense", "tend", "-tend", "-tain", "tight", "thin", "continuum" ("held together"), "dance", "tone", "tune"). Suggests experimenting to see what works, trial and error.

Thus "try" talks about an inner loop of goal-pursuit, not the whole goal-pursuit. It describes one attempt among many. Trying to get to the South Pole using such and such supplies and crew and dogs isn't the whole of the desire to reach the South Pole—if it fails, there can be (for those who are nonfrozen) other tries toward the same goal. Trying relates to the whole goal G by describing the particular instance of goal-pursuit-behavior as being a try towards G. "Alice is trying to reach the South Pole." means "Alice has an overarching goal of reaching the South Pole; this behavior is one attempt, one plan that she is currently executing that she thinks might achieve the goal; if the attempt fails, she may go on pursuing the goal via some other strategy.".
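As a loose analogy, the relation between a try and its overarching goal can be sketched as a loop: each try is one strategy being executed, and the goal is whatever keeps the loop going after a try fails. Everything named here (`pursue_goal`, the two expedition functions, and their outcomes) is a hypothetical illustration, not something taken from the essay.

```python
from typing import Callable, Iterable, Optional

# Hypothetical illustration: a "try" is one attempt (one strategy executed),
# while the goal-pursuit is the outer loop that survives a failed try.
Strategy = Callable[[], bool]  # returns True if the attempt reached the goal


def pursue_goal(strategies: Iterable[Strategy]) -> Optional[int]:
    """Run tries in order; return the index of the first that succeeds, else None."""
    for index, attempt in enumerate(strategies):
        if attempt():      # one try: a crystallized action-package
            return index   # the goal is reached, so the pursuit ends
        # The try failed, but the goal persists; go on to another strategy.
    return None            # every listed try failed; the goal itself may remain


def dog_sled_expedition() -> bool:
    return False  # assumed to fail, purely for the sake of the example


def motor_sledge_expedition() -> bool:
    return True   # assumed to succeed, purely for the sake of the example


print(pursue_goal([dog_sled_expedition, motor_sledge_expedition]))
```

The sketch deliberately leaves out what a wholesome goal-pursuit would add: nothing here revises the strategies, or the goal, in light of a failed try.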

5. Plan

From Latin planus ("flat"), as in a map of an area. Cognate with "plain" and "plane" (both "flat" as a noun), more distantly "field".

Suggests a blueprint—a comprehensive set of actions, already laid out, which will mostly work as pre-stated, assuming that the area is rightly understood.

Thus a plan is similar to a try, in that it's a crystallized action-package, which is subordinate to a goal and is somewhat cut off from the goal. If the try or plan fails, one might return to the goal-pursuit, perhaps discarding some or all of the try or plan.

A plan or a try is therefore less alive than a wholesome goal-pursuit. In a wholesome goal-pursuit, everything is provisional, perhaps even the goal itself (including both the telopheme and the telotect).

6. Intent, interest

"Intent" from Latin intendo ("into-stretch", "draw into"). Cognate with "attempt" and so on, see above. Suggests an internal organizing force that rearranges mental elements towards some other region. Where is the mind drawn into by an intent? "Interest" from Latin inter-esse ("between-be", "be among/amidst"). Suggests a preliminary orientation that comes before pursuits: interest is the mind being somewhere. By being there with its full self, the mind makes ready [the elements in itself that would lead to pursuits when applied to elements, in the context of the full mind] to [apply to elements and thus lead to pursuits].

Thus interest and intent feed each other: an intent draws a mind further into a region, increasing the mind's being among what's in that region, while interest makes a mind ready to create intents by providing to the mind some of the coordinates referenced by the intent (the coordinates referenced in the direction of drawing-into).

This idea of "intent" overlaps the usual meaning. The usual meaning is: a state of mind that will in some future context create a goal-pursuit. This is a drawing-into: it draws the mind into goal-pursuit.

What's a word for an element of a state of mind, an element that will in some future context draw the mind into a goal-pursuit? It's a sort of "armed trigger" that will fire off in the right context. It's a stance or disposition. It's an inactive drive, protected by a vessel from the changes in the mind, to be activated later. It's a paused goal-pursuit.

This describes "intention", as in "It is my intention to high-five her.", but is that the same as "intent"? An intention is like a plan—it's a somewhat decided-upon plan, to be executed in some future context. The "underlying intent" of an action (or more generally, of (possibly internal) activity, including e.g. doing math) is the goal-pursuit that motivated the action. In this case, the drawing-into has already happened; the mind was drawn into the action, the activity.

Intent, intention, and interest all embed goal-pursuit implicitly. But, they don't necessarily do so in an already-decided way, or in an explicit way. For example, being interested in something can be a kind of non-specific goal-pursuit, without an already-spoken telopheme. This is similar to play and curiosity as non-specific practice. They say "there's something valuey / goal-pursuity here" without saying what, or in what direction. This is similar to how attention embeds value. ("Attend", "towards-stretch" / "towards-strive", like "intend". The self (the telophore) is blooped into the new region, ready to make telophemes, not necessarily carrying a pre-written telopheme.)

7. Matter

Cognate with "mother", "material", "matrix" (meaning "substrate").

This is the complement of "interest", "intent". Matter is what-is-of-concern; Mattering is [involvement in goal-pursuit as being worthy of interest and even care, without differentiating between "instrumental" and "terminal" goal-pursuit]. Matter is received, in a sense. It is received by the agent from the agent's past, including the context that birthed the agent—the mother.

8. Desire

Latin "de-sidus" ("from the stars"), cognate with "sidereal" ("of the stars").

Suggests transcendence, universality, wide scope; hope, things out of reach.

Running with this maybe-fanciful interpretation of the etymon: to desire is to see something; not know what it is; but know that it is seen by everyone in the world; and so believe it to be very real, very much a Thing that can be stepped into, understood deeply, with further investigation; and so the goal-pursuit is pointed at something preliminarily grasped (twinkling, up there, for anyone to see but for no one to understand), but still pointed truly—so that it pursues whatever that Thing is. So desire is goal-pursuit which is proleptic, but strong due to a strong indication of a Thingly Thing that can be pursued and is what-is-to-be-pursued.

9. Care

From Proto-Germanic *karō ("care, sorrow, cry"), from Proto-Indo-European *ǵeh₂r- ("voice, exclamation"); distantly cognate with "garrulous" ("talkative"). Note that German "Sorge" ("care, concern") is cognate with English "sorrow". Suggests depth, relations to other agents; negative reinforcement, turning homeostatic pressure into strategic preservation by projecting negative reinforcement into the world and the future, using imagination, on the (intra- or inter-personal) coordination substrate of the cry—the audible voice (the music of shared intention).

Thus "care" names wanting-complete wanting, or rather, goal-pursuit-complete overhaving. Wanting-complete wanting means: goal-pursuit that, among other pursuits, pursues pursuit. That is, it jerks itself out of its ignorance and unmovedness; it writhes and struggles to pull itself together out of nothing; so that it can get itself onto a trajectory that will take it to unbounded understanding and unboundedly ambitious unboundedly general pursuit of (perhaps unboundedly provisional) goals.

Overhaving—that is, homeostatically pursuing not having something. E.g., avoiding pain. Overhaving is the mental state that motivates avoidance or getting rid of something, like wanting is the mental state that motivates pursuit or acquiring something. Care of One for an Other being seems to naturally sit in a background of overhaving: the Other already has its self-constituting pursuits, so the Other may or may not want for the One's wanting—but there is always a background threat of damage or death to the Other from the world, so the One always overhas that threat from the world.

Evasion of threat (preservation) is the constant background condition, so overhaving is the constant background stance. To go from homeostatic overhaving to strategic overhaving, takes Care. Care cries out for itself to hear, to tell itself that the threat is there, and that the threat of the threat is still there.

In other words, care is what says: I have to care that I don't yet know how to care enough.

10. Pursue, control

Pursue. From Latin "prō-sequor" ("forward-follow"), "sequor" as in "sequence" and "second" ("the following one"). This says, the agent is following after something definite. The agent isn't plotting to head off the pursued thing, skating to where the puck is going to be—the agent is following after what it can see. This suggests using a definite, evaluatable metric of success as the grounding signal to search for effective means. That can be useful either because the evaluatable metric is good enough that reaching a high score will get what is "really wanted", or because the pursuit will be dropped after some means have been invented, and those means will be generally useful for other pursuits. Pursuit is behavior selected to bring about something specific. Pursuit risks goodharting.

Control. "Contra-rotulus" ("against a little wheel"; "a register used to verify accounts"). Suggests tracking, registration, feedback cycles. Control suggests a relation, where the controller doesn't deeply understand what's being controlled, but just enforces something about what's being controlled, something measurable.

Control is more fixed than pursuit. Pursuit, "following after", can change its measurement. When what's being followed-after changes or reveals itself more (in other words, when the mind follows through the thingness of what's being followed), the novel aspects can now be taken as targets. Control is narrower; it has a fixed measure.

11. Will, volition, wield, value

"Will" and "volition" both from PIE *welh₁- ("to choose, want"), cognate with "voluntary" and German "wollen" ("to want") and "wählen" ("to select").

"Wield" and "value" (cognate with "valor" and "valid") both come from PIE *h₂welh₁- ("to rule, to be strong"), which might come from PIE *welh₁- again!

The Will intuitively has a lot of ʒuʒ. The ʒuʒ is that the Will pushes. It intends, it has drive, it moves. It strikes at the world. This relates to choosing and selecting like so: the Will metonymically refers to the telophore (or maybe the telotect) by talking about the selection. The Will is what can select in the ultimate sense: it can select a possibility to make real forever out of other possibilities. Selecting something for a mind to then put its efforts toward making real, is the most selecty sort of selection—it is the most distinguished you can make something. The Will wields the mind, the agent, the unboundedly ambitious unboundedly general pursuits.

12. Select, decide

"Se-lect". From:

Latin "se" ("away", as in "seduce" ("lead away"), "seclude" ("shut away"), "secede" ("go apart"))
and "lect" from PIE *leǵ- ("gather", cognate with λόγος ("word") and "-lect" like "dialect").

Suggests taking something from one context where it is of the same kind as other things in that context, gathering it together and naming it, and then using that handle to put it into another context.

"De-cide". From Latin dē ("away, down from") + caedō ("cut"), from PIE *kh₂eyd- ("to cut, hew"). Cognate with "-cise" ("to cut") as in "incision", "-cide" ("kill") as in "homicide", and "hit".

So, the Will selects (gathers out and away) the possibility to make real, and decides (strikes, cuts away) against the other possibilities.

13. Choose, taste, aesthetic

Choose. From PIE *ǵews- ("to taste, try"), from which also "disgust", "gustatory".

A choice is a decision made by applying taste—by applying criteria whose reasons are inexplicit. E.g., rules learned or evolved by searching for code that does well, without also trying hard to factor the code so that it's explicit.

Aesthetic. From:

Ancient Greek αἰσθάνομαι ("feel, perceive"), from:
- [+ *dʰeh₁- ("do, put"), whence "do", "thesis"]
- PIE *h₂ew- ("to perceive, see, to be aware of"), from which also:
  - "aural",
  - "audio" ["seeably-do", "render clear"; from *h₂ewis ("clearly") + *dʰh₁-ye/o- ("render", whence "do")],
  - "ear",
  - "acoustic", "hear", "hark" [these three all from PIE *h₂eḱ- ("sharp", whence "acute" and "edge") + *h₂ṓws ("ear", from *h₂ew-)],
  - "obey", and possibly "omen".

Aesthetic is then a sort of taste that applies to perception. It applies to sensing what is there to be sensed. It is taste about objective, external things.

Taste. Via reconstructed Vulgar Latin *taxitare ("to touch, to feel"), from Latin "taxāre" ("to touch sharply"), from which also "task" and "tax"; from PIE *teh₂g- ("to touch"), whence "tactile", "contact", "tangent", "tangible", and via the construction "un-touched" also "integer", "integrate", and "entire".

Taste is getting a sense of something by touching it sharply, by trying it out.

So among questions decided by inexplicit criteria, there's a spectrum from external to internal: aesthetics, taste, choice. Choice is gustatory; it's a question of what to incorporate into oneself. Taste is liminal, a question of what to interface with, e.g. what to use as a tool. Aesthetics is a question of what external things are considered suitable.

14. Drive, motive

Drive. From PIE *dʰreybʰ- ("to drive, push"), from which also "drift". That etymon might come from PIE *dʰer- ("to support, hold"), from which also "dare", "firm", and "dharma" (Sanskrit, "that which upholds or supports"), and possibly "throne", "force", and "fortify".

A "drive" pushes, or supports, something. What does a drive support? It supports the actor (pursuer, controller, willer, chooser, trier, carer). It supports the actor by pushing the actor. The actor is called into existence by being pushed. The drive is what's created already in motion.

Motive. Cognate with "motif", "remote", "mobile", "mob", "mutiny", "move", "motion", "motor". The motive is what moves the actor.

The motive is the answer to the question: Why are you doing this? It's not a cause, and so the answer isn't of the form "such and such electrical signals formed such and such pattern, which causes such and such subsequent electrical patterns". It's not a value, not a terminal value, so the answer isn't "in order to...". The "in order to..." requires as a precondition that there's a dynamic that goes from {the fact that this action will bring about that outcome, and the preference in favor of that outcome} to actually taking the action. That dynamic is the motive, the motor, the motion that was there at creation, what creates motion, the root of movement in the mind.

The blind worm. Imagine, hypothetically, a blind, senseless worm. Since the worm is blind and senseless, it doesn't do anything like computing which behavior patterns to follow by processing sense data. Is there then no role for neurons? If there's no sense-input→motor-output map being computed, then is there nothing that computationally efficient cells would be useful for? Actually there's still something useful for neurons to do: coordinated behavior patterns. Suppose the worm has two lines of muscles, one on each side. A neural circuit that generates complementary sinusoidal contraction-relaxation waves going down the two lines of muscles in the worm's body would be useful for the worm. That coordinated pattern of action, even if it's not conditioned on anything about the environment, still makes the worm go faster compared to locally governed patterns. Going faster, even blindly, might e.g. better avoid simple predators, or collect more food. (This behavior pattern might be so simple that it could be done without neurons, but consider e.g. walking or curious play.) The motive is like this: it doesn't have to be about the world, it's a push from out of nowhere.
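A toy sketch of the coordinated pattern described above, under loudly stated assumptions: the function `muscle_activations` and its parameters (`segments`, `wavelength`, `frequency`) are invented for illustration, and nothing here models a real worm's neural circuit. The point is only that the pattern is a function of time alone, with no sensory argument anywhere.

```python
import math
from typing import List, Tuple


def muscle_activations(
    t: float,
    segments: int = 8,        # assumed number of body segments
    wavelength: float = 8.0,  # assumed wave length, in segments
    frequency: float = 1.0,   # assumed cycles per unit time
) -> Tuple[List[float], List[float]]:
    """Return (left, right) activations in [0, 1] for each segment at time t.

    Open-loop: the output depends only on the clock t, never on the environment.
    """
    left, right = [], []
    for i in range(segments):
        # The phase lags further down the body, so the wave travels head to tail.
        phase = 2 * math.pi * (frequency * t - i / wavelength)
        wave = math.sin(phase)
        left.append(0.5 * (1 + wave))   # left line contracts as the wave rises
        right.append(0.5 * (1 - wave))  # right line is the complement
    return left, right


left, right = muscle_activations(t=0.3)
# In this toy model, each segment's two sides always sum to full activation.
print(all(math.isclose(l + r, 1.0) for l, r in zip(left, right)))
```

Each segment repeats what the head segment did a little earlier, which is the sense in which the contraction wave runs down the body without any segment consulting the world.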

The motive isn't there only at the beginning. It's also what's essentially there as an agent grows. E.g. it's what moves the mind to an ontological crisis.

15. Moral

From Latin "mos" (meaning something like manner of habitual behavior), whence also "more" (as in a norm or custom). Maybe cognate with "mood" and "Muse". Maybe partly from PIE *med- ("to measure, to advise"), whence "mode", "module", "model", "must", "modify".

This suggests, simply, regularities in behavior, which includes rules, habits, customs, and repeatedly useful ways. More narrowly, it suggests taking past regularities in behavior as advice or an indication of what is good to do.

This further suggests orienting to past behavior as though one is participating in an ongoing tradition. One's behavior is informed by advice, and also one decides how to inform one's future behavior through advice modeled by current behavior. And, at the meta-level, one's [behavior, regarding one's behavior as precedent-setting] takes advice from past morality about how in general to inform future behavior through precedent-setting.

16. Free, proper

Free. From: PIE *priHós ("beloved", maybe with the sense of "beloved member of the clan and therefore not a slave"), whence also:

"afraid" ("ex-free", "out of peace/security/love"),
"proper", "appropriate", "property",
"friend" (via the PIE etymon *preyH- ("to please, love")).

Suggests a stake in co-creation of the world through the shared intentionality of the clan.

17. Yearn

From PIE *gʰer- ("yearn"), whence also "greedy" (as in hunger), "exhort" (urge on). (Also somehow "charisma" through an Ancient Greek word meaning "cheerful"—charisma as cheerfully yearning?)

This suggests a specific in-built want—lack of something, and being urged on to fill the lack; and it is in-built, like hunger; it may be incompetent, suggesting only clumsy pursuit, like greed; but sensitive to its failure, being urged onward through failure—like Care.

18. Wish

From PIE *wenh₁- ("to love"), whence also:

"wonder" (as in awe),
"win" (strive for, fight for),
"venom" (via Proto-Italic *weneznom ("lust, desire")).

Suggests an overpowering goal-pursuit-object, something which rewrites/hijacks/overtakes the agent and makes the agent do extreme things in pursuit, even to the agent's corruption.

A wish is dangerous, and is a place where the agent may open itself to value drift, or being consumed. The formation of shared intentionality, the way humans do it, proceeds by a mutual partial overwriting of mental elements. So it proceeds from wishes. Awe comes from large-scale shared intentionality; it's an overpowering, somewhat external wish, and the dread of being swept up in that.

19. Aim

Closely cognate with "esteem".
From "ad-" ("towards") + "estimate".
- In turn, "estimate" is believed to be from "copper-cutter".
  - The second part is from PIE *temh₁- ("cut", as in "atom", "-ectomy", "anatomy", etc.).
  - This means "minter"—i.e. giving a valuation of something with currency.
- Alternatively, "estimate" from PIE *h₂eys- ("to wish, request"), whence also "ask".

This is the element that connects a valuation with a goal-pursuit for purposes of refining an action-package to hit the target. That is, aim is the connection between:

the valuations of the different possible results of goal-pursuit;
and the tuning/design of possible action-packages based on their anticipated consequences.

Aim is for constructing action-plans, analogous to hypothesis generation; it communicates to the action-package-designer, during the design process, "hey this draft action-package doesn't point at the highly-valued thing in ways XYZ, keep tweaking".

20. Need

"Need" has a strange etymology—it's a "merger" between:

a PIE root *neh₂w- meaning "death, lack",
and an unrelated PIE root *new- meaning "nod, assent", meaning something like "the zeal that comes from an affirmation—a nod as an order or as a joyful pursuit".

This is a bit mysterious.

The Ease Disease

tsvi bt Nov 25, 2025 Updated Nov 25, 2025

Variations on a theme.

1. Art vs. entertainment

In "An Elevated Wind Music" (2000), Charles Wuorinen, one of the great American composers, writes:

In any medium—music, literature, poetry, theatre, dance, the visual arts—entertainment is that which we can receive and enjoy passively, without effort, without our putting anything into the experience. Art is that which requires some initial effort from the receiver, after which the experience received may indeed be entertaining but also transcending as well. Art is like nuclear fusion: you have to put something into it to get it started, but you get more out of it in the end than what you put in. (It takes an expenditure of energy to start the reaction.) Entertainment is its own reward, and the reward is not usually long lasting. Entertainment is a pot of boiling water placed on a cold stove: the heating is fleeting. Art is a pot of cold water put on a hot stove: it may take a while to get going, but when it does it gets hot and stays that way!

Even if we take Wuorinen at his word, this presents a bit of a paradox. Which is better, art or entertainment?

Ok, ok, we can do both, yes. But it's strange that on Wuorinen's theory, entertainment has an infinite return on investment, in a sense! No effort, no putting anything in, but we still get something back (however fleeting).

(Of course, we might view even our time and attention as investments; and then Art might start to look like often a better investment than entertainment.)

2. Portuguese pavement

Allegedly, if one walks around in Restauradores Square in Lisbon, the capital of Portugal, one might be walking on this:

This is a Portuguese traditional way of paving walking spaces. (Lots more examples in the wiki article.) This is a bespoke, labor-intensive method, where you manually lay small individual stones in a pattern special to the one walkway you're making. The result speaks for itself.

An asphalt paver might crank out roadway at 100x or even 1000x the speed of someone manually laying mosaic.

The result is less pleasing, in many ways.

By using more flexible, attention-heavy methods, you get nicer results; but it's much less efficient.

This is not a knock on asphalt pavers. The machines are wonders of modern understanding, and I assume the workers are doing a lot more than one might imagine. What I'm seeing here is just that, while asphalt pavers are far far better at what they do compared to any manual method, they have a ceiling on how nice of a result they can produce. To go much higher than that, you'd need different, probably less efficient methods.

3. Do I know manim yet?

In June, Rachel Wallis and I (Berkeley Genomics Project) held a conference about reprogenetics. In preparation to publish some of the talks that were given (which you can view here), on a bit of a lark, I spent, like, a week or more quasi-vibecoding an intro animation to go at the beginning (as an accompaniment for a short commissioned piece of music). I used manim (of 3Blue1Brown fame and origin) and gippities.

Now, I definitely learned something about manim.

But I'd be starting from fairly close to zero knowledge if I did another project like that. Why? Well, I didn't try to understand how things worked more in depth than I absolutely needed. This was proooobably the right choice in context, because I just wanted to get the thing done fast. But it meant that I would just ask a gippity for help in several different ways, if something was going wrong, and surprisingly much did go wrong.

So, I did not volatilize the elements of manim that went into making the intro video. I don't have them handy, wieldy, ready to recombine and use to make more difficult things. Volatilization happens when you do the task the right way, even if it's harder in terms of the effort needed to reach the first minimally usable results.

In general, if you don't build it yourself, you don't have the theory of the program. In that linked essay, Dave quotes Naur:

The conclusion seems inescapable that at least with certain kinds of large programs, the continued adaption, modification, and correction of errors in them, is essentially dependent on a certain kind of knowledge possessed by a group of programmers who are closely and continuously connected with them.

4. Beta

Suppose you're kinda stuck on a boulder problem. What to do?

(How...?)

Option 1: Keep trying different things, or trying the same things but harder.
Option 2: Look at a video of someone doing the climb, and then use their beta—just do what they did.

Now, Option 2 certainly works better in the obvious senses. It's easier, and then you know how to do the climb right, and then you can actually do the climb. It's not even such a bad way to learn.

Option 1 is harder, and is less likely to get quick results. But there are definitely skills you learn much more with Option 1 than with Option 2, like thinking of creative methods. Sometimes the easy tools aren't available, or they break in your hands—e.g. because the available beta is from a significantly smaller climber and their method won't work for you. When that happens, you'll want to have an understanding of which moves to invest in trying harder vs. when to search for different moves. You'll want perseverance to keep trying even when you don't know a method that will work. In short, you'll want to have gotten experience being in that situation—a hard climb that you don't know how to do and have to figure out yourself.

5. I don't see any dopamine on the sidewalk

In "The Anti-Social Century" (2025), Derek Thompson (co-author of Abundance) writes about how lonely everyone is these days, even though we could hang out with each other if we decided to, and people love to hang out with each other. But my question is, why? Why not go outside?

I think a significant part of it is that you can see talking faces on your fucking phone. That does not feel nearly as good as being with other people, but it does actually hit at the need, at least temporarily and partially. Ditto texting.

And the kicker is that it's so easy. No risk, no coordination, no disappointment. You can do it on your phone, without ever getting out of bed or even opening your computer. It's a higher reward/investment ratio, from a certain perspective.

6. nvim or emacs? YES

So many people have not heard the good word of nvim. So sad. Yes there's a learning curve, but then AFTER the learning curve, it's so much better for text editing! In a webform textbox I feel halfway handless!

7. Why do word-making-upping when you can just use existing words?

When you come to the edge of thinking, you reach for words, but there aren't any words already ready for you to grab. You have two options:

Option 1: Lexicogenesis—think up a new word to use here.
Option 2: Repurpose, tweak, or agglomerate existing words to fit the local purpose.

Now, Option 2 certainly works better in the obvious senses. It's easier, and then you can get along with saying what you were trying to say, reasonably well, maybe with the cost of saying several syllables too many each time. Why figure out "piano" when you can just say "big box you fight him he cry", which only uses words you already know?

Sometimes repurposing works basically fine. Mathematicians sometimes use common nouns for technical terms, and it's basically fine because it's such a well-signalled context, and they get good leverage out of metaphors ("sheaf", "mapping", "universe", ...). Philosophers, on the other hand, though they ought to be among the most in need of really good new words, seem to often be rather shite at this.

(Excuse me, Herr, this is a Wendy's.)

But unsuitable words make you think mistakenly in the long run.

8. Conclusion

I'm definitely not saying not to use powerful industrial tools that make pretty good products really cheaply. I love my myriad cheap consumer products, such as my blender, my keyboard, my water bottle, my computer mouse; I love hard, flat, non-sloshy roads.

The point isn't to be inefficient. If I have a point, the point is to remember about the possibilities for better results. More interesting Art, more functional computer programs, more useful words.

The operation of doing the thing the hard way yourself substantially overlaps with the operation of learning the elements you'd need in order to extend the art.

If an agent (such as a human) has severe constraints on the mental computing power that they have available, that agent will probably be a cognitive miser. By default, we don't think harder than we have to. Usually, unless we decide to do things the hard way—the way that's harder and takes longer, but also opens up really new possibilities—then we won't do it that way, because there will be an easier way that gets pretty good results much more cheaply.

If no one decides to do it the hard way, it will never get done. Philosophers look at scientists, with their seat-of-the-pants epistemology on easy mode, and don't feel motivated to figure out how to maintain sanity the hard way. No amount of asphalt paving machines at any insanely cheap price will produce this pavement:

That's the Ease Disease. It's following the principle of least effort off a cliff, completely forgetting the aura of possibilities that lives around our current behavior patterns. It's bowing to the law of the hammer. You may have to hike back down the mountainside for a while and then take a different fork in the trail, to get higher in the longer hike.


My notetaking system (znotes)

tsvi bt Nov 24, 2025 Updated Nov 24, 2025


I've used my own hand-rolled artisanal notetaking system for most of my work, for almost a decade. Here are a few takeaways. (If you're a notetaking veteran, my guess is that the most interesting section for you would be "6. Contexts".)

1. Background

Like many, I've always been interested in "tools for thinking". I read Lion Kimbro's "How to Make a Complete Map of Every Thought You Think", mourned for the Xanadu that I never knew, dabbled with nvALT and Roam and such. I've also made some tools for thinking parathesizers of various kinds.

To me, the ambition and the potential is large; but in practice, we mostly get just a tweaked file system. This is understandable: actually supporting real thinking is a different and probably much more difficult project, which would involve iteratively interviewing thinking itself about what it would need to progress better.

I had some of these hopes with my notetaking system, so there was a lot of experimentation and also strange design choices. For example, at the beginning I was trying to make transclusions work, which is both somewhat pointless maybe and also didn't work. I won't go into this.

But, out of the shuffle of boondoggles, there were a few useful ideas that I still use regularly and get value from.

2. nvim

This isn't an idea, but just so it's stated: I pretty strongly recommend using nvim, if you can stomach the precipitous learning curve and customization requirements. Modal editing, the vim keyboard commands, macro recording, extensibility—these things are too powerful to leave on the table if you can get them. [Emacs is kinda like nvim, but harder to set up. ;) ]

3. Colors

Each file in my system gets randomly assigned a fixed color (by hash of its filename). Many colors are ruled out, so that the available colors are bright enough. Text in the files is displayed in that color. Also, when a preview of the file is shown in search, and when the title is shown in a link from another file, the text is in the target file's color.

This has a few benefits:

It's somewhat less boring than monocolored text. It feels cool, like there's lots of different stuff happening in the system overall; I intuitively feel like I'm moving around between working on different things. But it's not distracting, because each file is within itself monocolored (except for links to other files).
Open a new file —> new color! New field, new thoughts, new mode.
The color helps recognize the file. E.g. if I open a file (from a search, say), it's a fast obvious cue whether I got the right file; or if I'm moving my cursor between files; or if I'm looking for the link in another file.
It's possible that it helps evoke the context / feel of a given document a bit better, I don't know.
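
As a minimal sketch of the color assignment described above (not the actual znotes code): hash the filename, then map the hash into a range of colors that stay bright. The function name `file_color`, the use of SHA-256, and the specific lightness and saturation bounds are all assumptions for illustration. Here dim colors are avoided by construction rather than by rejecting them after the fact.

```python
import colorsys
import hashlib


def file_color(filename: str, min_lightness: float = 0.6) -> str:
    """Map a filename to a stable, reasonably bright hex color (hypothetical helper)."""
    digest = hashlib.sha256(filename.encode("utf-8")).digest()
    # Use separate bytes of the digest for hue, lightness, and saturation.
    hue = digest[0] / 256.0
    # Restrict lightness to [min_lightness, 0.85] so the text stays readable.
    lightness = min_lightness + (digest[1] / 255.0) * (0.85 - min_lightness)
    # Keep saturation fairly high so the colors are distinguishable.
    saturation = 0.6 + (digest[2] / 255.0) * 0.4
    r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
```

Because the color depends only on the filename, the same file gets the same color in the editor, in search previews, and in links from other files, with nothing to store.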

4. Links

It almost goes without saying that links between files are useful. Here are the criteria for a full link implementation:

The link points to the file by filename.
The link displays using some other text, the "title" of the file (in my case, the first line of the file).
When the title of the file is changed, the display of links to that file also changes automatically, while the link itself still points to that same file.

I think meeting all these criteria is surprisingly rare. My system almost meets them but does not fully, and several other systems don't either. Links like [[file name]] don't satisfy the criteria. (Why have these criteria? Because you want to change the titles of things without manually changing links, and you might have automatic filename generation (as I do), and might want to avoid changing filenames (as I do).) It might be that some html-based systems do this—but then, they will fail to get all the huge benefits of being in nvim.
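
Here's a rough sketch of one way to meet the three criteria, under stated assumptions: the link is stored as a filename, and the title is looked up only at display time. The `@link(filename)` syntax and the names `title_of` and `render` are hypothetical, not what znotes uses.

```python
from pathlib import Path
import re

# Hypothetical stored link syntax: @link(filename)
LINK = re.compile(r"@link\(([^)]+)\)")


def title_of(notes_dir: Path, filename: str) -> str:
    """Title = first line of the target file, read fresh at display time."""
    path = notes_dir / filename
    if not path.exists():
        return f"(missing: {filename})"
    with path.open(encoding="utf-8") as f:
        return f.readline().strip()


def render(notes_dir: Path, text: str) -> str:
    """Replace each stored filename link with the target file's current title."""
    return LINK.sub(lambda m: title_of(notes_dir, m.group(1)), text)
```

Since the stored text never contains the title, retitling a file changes how every link to it displays without touching the files that contain those links.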

5. Scripts

This is more like a workflow in nvim than an idea, but it's pretty useful for me. Basically, I write vim scripts and python scripts in files in my notetaking system, and then (in another file in my system) define nvim commands to invoke those scripts, and then iterate on those scripts while calling them on work material. This makes it easy to develop the scripts quickly, and to save useful macros, without even switching to some spot in my nvim configuration or anything. For example, this is how I tune my scripts for converting my markdown notes into html / LessWrong / pdf.

6. Contexts

Maybe my most interesting contribution is contexts (and cycling). The point of contexts is basically what it sounds like: to track and preserve context of what you're working on.

In my system, a context does a few things:

Shows up in the search bar when I search for contexts (see image above).
Has a name, e.g. "intelligence amplification".
Can be on or off (default off unless I enter it manually).
While that context is on:
- It tracks which files are edited or created.
- The file search only shows files in that context (unless I type .a, meaning "search all").
- Cycling. There's two keys where if I press them, the file that I'm currently in will switch to the next file (by creation date) out of files that are in this context.
When I exit a context, it saves the arrangement of windows and files I have open and where my cursor is (using nvim sessions). The next time I enter that context, it loads that arrangement.
It can have subcontexts, which work like you'd expect.

(Of course files can show up in multiple contexts. waow)

All these features are helpful for having separate contexts focused on one topic, without much bookkeeping. Since switching contexts takes only a few keypresses (3-5ish), it's actually a solid way of noting things down for later without disrupting workflow.

Note that cycling can probably be generalized to cycle through other orderings for other keypresses. For example, you could have a "scrap/notes" function which, for any file, either creates or goes to the corresponding "scrap/notes" file, or toggles back to the main file.
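
A minimal sketch of the tracking and cycling parts of a context, as a plain data structure rather than anything wired into nvim. The class and method names are hypothetical, and wrapping around at the ends of the list is an assumption (the description above doesn't say what happens at the last file).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Context:
    name: str
    # filename -> creation timestamp (any sortable number)
    files: dict[str, float] = field(default_factory=dict)

    def track(self, filename: str, created: float) -> None:
        """Record a file that was edited or created while the context was on."""
        self.files.setdefault(filename, created)

    def ordered(self) -> list[str]:
        """Files in this context, sorted by creation date (filename breaks ties)."""
        return sorted(self.files, key=lambda f: (self.files[f], f))

    def cycle(self, current: str, step: int = 1) -> str:
        """Return the next (step=1) or previous (step=-1) file in the context."""
        order = self.ordered()
        if not order:
            return current
        if current not in self.files:
            return order[0]
        i = order.index(current)
        # Assumption: wrap around at either end.
        return order[(i + step) % len(order)]
```

Generalizing cycling to other orderings would mostly mean swapping out the sort key in `ordered`.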


Introduction to radix (best cognate-tree grower, pre-α, dormant)

tsvi bt Nov 23, 2025 Updated Nov 23, 2025


My old project, called radix, is still the best cognate-forest grower in the world, but unfortunately it's nowhere near good enough. It uses wiktionary entries to tell you what words are related to what other words through etymological descent. This is a very-quickly-written better-than-nothing partial state-share about the project, in case anyone's interested in solving the problem of cognate-forest-growing.

1. Takeaways

Use abstract regexes to infer links between words via patterns in wiktionary-ese.
Prune the display trees aggressively when feasible without eliding too many interesting cognates. Merge redundant subtrees.
When printing actual results, don't store and retrieve a root word's full set of descendants. Instead, store several copies restricted to just the descendants needed to display for one specific language.
Precompute the full global transitive closure once, in one big go, before serving many results over time.
Use subuniverses to do rapid end-to-end testing of semantically inert changes to complex code.

2. Annotated table of contents

The section "4. Introduction" will say what radix is for, what it is, and a bit of what's wrong with it.

The section "5. The state of radix" will say a bit about where radix is as a project.

The subsections within "6. Some of the core ideas in radix" will describe more of how radix works, including problems, with the aim of giving some insights for anyone who wants to make a radix-like system:

"6.2. The graph structure: Pre-orders" discusses what etymologies look like and how they can get messy.
"6.3. The problem of sense disambiguation" discusses what words really are (from an etymological lens).
"6.4. How radix is implemented" discusses how radix specifically works and gives some pointers into the codebase.
"6.5. Shattering preorders" discusses how to serve results quickly by parsing the full futureward-set of a root into language-specific subsets.
"6.6. End-to-end testing" discusses how to quickly test functional equivalence of code by restricting to a narrow but locally complete sandbox.

3. Table of contents

1. Takeaways
2. Annotated table of contents
3. Table of contents
4. Introduction
5. The state of radix
6. Some of the core ideas in radix
7. Conclusion

4. Introduction

4.1. Why I made radix

I like etymologies of words, and I like knowing about many cognates / doublets, even distant ones. I think this gives rich texture to language. When you start learning the morphemes (meaningful chunks) that make up words, and learning the cognates of morphemes, you also start seeing them everywhere. The word "phenomenon" starts with the morpheme "pheno-" meaning "appearance"; that morpheme also shows up with a similar meaning in "epiphany", "phenotype", "diaphanous"—and more distantly and cryptically, in "fantasy", "phase", "phantom", "emphasis", "fancy", "beacon", "photon", "photograph", "Tiffany", "favor", and many more words.

Wiktionary.org is an amazing resource for learning about words that are cognate with other words. But, clicking around wiktionary can be very laborious. For example, the Proto-Indo-European entry for *bʰeh₂-, meaning "to shine" or "to appear", links to several descendant words. At any given time, I might be interested in only a subset of these; e.g. I'm usually not interested in Celtic or Indo-Iranian roots, simply because those would be unlikely to have descendants in the few languages I'm at all familiar with. If you're going through lots and lots of wiktionary entries, then the process of scanning and clicking through, looking for words you recognize, becomes very very time-consuming.

Clicking around sometimes doesn't even work at all, because sometimes the PIE entry might not link forward in time to a descendant word, even though the descendant word does link backward in time to the PIE root; you'd never learn about that descendant, if you are just clicking around starting at a different descendant. As an example, currently, the wiktionary entry for Ancient Greek "φημί" (web archive snapshot, as this will eventually get fixed), which means "to speak", does not link to "βλάσφημος", the ancestor of "blasphemy" ("deceiving-speech"). If you started from "prophetic" and clicked around, you'd get to "φημί" but you wouldn't get to "blasphemy". (And likewise "φημί" does not link, through a chain, to "prophetic".)

But that would have been a cool relationship to know about! How can we do better?

I'll briefly mention etymonline.com, which is also an amazing resource, like wiktionary. For English cognate-finding, that's probably the best current resource in many ways. However, etymonline is largely a labor of love, or something, by one guy named doug. It is limited in scope, and only extends to other languages insofar as they include ancestor words of English words, or as isolated links from English entries. You couldn't use etymonline to efficiently search for cognates from another language; at best you might find an English cognate. Etymonline is also going to be fundamentally less complete than wiktionary over time, as wiktionary has many editors and draws from many sources. (As a random example, wiktionary links "focus", uncertainly, back to the same PIE root as φαίνω; etymonline simply says the Latin is "of unknown origin". (Which is not at all a criticism of etymonline; it just is aimed at a different purpose, sometimes at the cost of showing all the plausible hypotheses.))

[IDK where to put this, but just noting that I'm arguably misusing the word "cognate". Cognates are sometimes supposed to derive in whole directly from common ancestor words. Instead I'm talking about... I don't know the real term, if any. Maybe "derivational family" or "root cognates", or my coinage "coetymons". In my defense, for example Meelen et al. call these "weak cognates" and give a typology[1].]

4.2. The idea of radix

You can find a lot of close and distant cognates just by clicking around wiktionary. You could even in theory bridge some missing links, by guessing words that might be cognate. We could imagine someone thinking about the word "phenotype", and then wondering spontaneously whether it is cognate with "phenomenon", and going to the wiktionary page and finding that indeed they share the "φαίνω" ("to appear") root.

But, we have computers. Computers can do things by themselves. We can tell our computers to click around for us.

That's what radix does, basically. It traverses the graph structure of links between wiktionary entries for different words. It infers which words are etymological ancestors or descendants of other words. Then it tries to display these (often large) structures, to show you what words are related to the word you started with. Here's what we get with "phenomenon":

That's a lot of info. One of the most important elements is the pastward trunk. This shows the ancestors of the word we started with. Rightward is pastward, e.g. you can see that English phenomenon points pastward to Latin phaenomenon, which in turn came from Ancient Greek φαινόμενον, which at its main root came from Proto-Indo-European *bʰeh₂-:

4.3. Problems with radix

The above is an especially clean pastward trunk. Very often radix unfortunately produces much messier and incorrect pastward trees:

Middle English is a little bit the bane of my existence; you can see the little < symbol, showing that at that point the pastward ancestry tracing becomes ambiguous between the two unrelated roots of "weave". One of them is cognate with "web" and means making interlaced fibers; the other is cognate with "vibrate" (and possibly "veer"), and means to wander or move in a wavy path:

But messiness is the least of the problems with the current draft of radix. Much more serious is that radix shows many many incorrect cognates. For example, here's the beginning of the radix entry for "value":

The pastward trunk, at least, is pretty neat! And correct! However, if you look near the bottom, you can see that it says PIE *h₂welh₁-: wild;. It's claiming that the starting English word "value" ultimately comes from PIE "*h₂welh₁-", and then another etymological descendant of that (reconstructed) ancient root word is the modern English word "wild".

If anyone were to actually use radix in its current form, they should always check with wiktionary (by Cmd+clicking on the word in question). In this case, the "wild" entry states that its root is "Proto-Indo-European *h₂welh₁- (“hair, wool, grass, ear (of corn), forest”)". Well, now we see the problem: there are actually two PIE roots with the same string "*h₂welh₁-" as their reconstruction. One means "to rule", and that's where "value" comes from; the other means "hair, wool", and that's where "wild" comes from.

This is not necessarily a problem for radix. In fact, by clicking on the PIE root in radix's pastward trunk, we get info about what radix thinks of that word:

The important part here is that radix only is including sense 1 ("to rule") and not sense 2 ("hair, wool"). It successfully traced that back from the starting word "value". The real issue is that "wild" points pastward to "*h₂welh₁-", but radix doesn't know how to tell which sense of "*h₂welh₁-" is being pointed to (presumably, the "wool" sense fails to point futureward to "wild").
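
To make the ambiguity concrete, here's a toy sketch (not how radix represents things). It assumes a word is identified by language, form, and an optional sense number, where a missing sense means the link didn't say which sense it points to. The names `Word` and `same_root` and the `permissive` flag are hypothetical.

```python
from typing import NamedTuple, Optional


class Word(NamedTuple):
    lang: str
    form: str
    sense: Optional[int] = None  # None = the link did not say which sense


def same_root(a: Word, b: Word, permissive: bool) -> bool:
    """Do two pastward links point at the same root word?"""
    if (a.lang, a.form) != (b.lang, b.form):
        return False
    if a.sense is None or b.sense is None:
        # Ambiguous link: treat it as matching any sense only if permissive.
        return permissive
    return a.sense == b.sense


# "value" traces back to sense 1; the link from "wild" carries no sense.
value_root = Word("PIE", "*h₂welh₁-", 1)
wild_link = Word("PIE", "*h₂welh₁-")
```

With `permissive=True`, `same_root(value_root, wild_link, True)` holds, which is the kind of match that yields the "wild" false cognate; with `permissive=False` the link is dropped, at the risk of losing real cognates whose links also lack a sense.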

This is maybe an over-involved example, I don't know. I guess I just want to communicate that... there's a ton of complexity here, there's currently a ton of errors in radix results, and also there's a ton of room for feasible improvement.

Before I link the thing, if you saw this post in a context where it is maybe currently being viewed simultaneously by many people, try not to spam it too much please, especially on words with big cognate trees. Also, please note that bug reports are not at all helpful; I greatly appreciate your care, but I'm not currently trying to fix the site at all. (Note: IDK if I wrote this anywhere, but, you can press d to toggle definitions on and off, and press l to toggle the language tags.) Ok, that said, you can try it here (please don't hug too hard): radix.ink. If you click around there's more info.

5. The state of radix

In short, the state is "in pre-alpha, in a deep freeze, no current plans to thaw". But if some person or small group were seriously interested in reviving it or learning from it, I would at least be very happy to talk. Feel free to reach out at my gmail address: radixtsvibt

You can find some version of the codebase here, roughly where I left off: https://github.com/tsvibt/public_radix_sep_2025/tree/main

Basically, I worked on it a ton a couple years ago. Now I'm too busy. The current codebase is stuck in development hell on some branch of trying to make the big preprocessing sweep be more efficient (with multiprocessing). That's plausibly not even worth the complexity, at least the way I'm currently doing it.

Even before the freeze, there were many serious problems with the results that radix was able to give. I think to be really useful, the quality of the results would have to be greatly improved. I also think it is feasible to greatly improve the quality of the results. There are several ways to do that, such as:

Fix many large and small known bugs / mistakes.
Improve the methods used to infer etymological links between words from the text of wiktionary entries. This is a large area.
Improve the methods used to prune the cognate poset for displaying.

The first thing to do, though, would be to thaw the project, or redo it better in some other form. This would be a big project because the problem is inherently complex (as the domain, words and history of words, is itself complex and uncertain), and because radix is out of date (e.g. various standards in wiktionary will have shifted over time).

6. Some of the core ideas in radix

In this section I'm going to describe some of the main ideas that went into radix in its current (frozen) form. The hope here is twofold:

In case anyone wants to thaw the radix codebase itself, this will help to explain some of the concepts involved.
In case anyone wants to make a successor project, they'll have some more information about the design constraints and possibilities (though not necessarily reliable info).

Because of time constraints, I'm not going to introduce what etymology is, how it works, what wiktionary is, how that works, and other important facts like that. There is a bit of information here (please don't hug too hard if it's being hugged) and on other information pages that you can click on along the top bar, e.g. this discusses more improvements.

6.1. General principles

Here are general principles that I had in mind while developing radix, and that I'd recommend for any project like this.

Improve wiktionary. Wiktionary should be improved for its own sake; and also, one of the best ways to improve results of a radix-like system is to improve wiktionary. Mainly this means simply adding more entries (with correct information), correcting mistakes (e.g. links that improperly indicate etymonic relationships), distinguishing different senses of words, adding more information (e.g. finding etymonic links in scholarly sources and putting them in articles), etc. There's also system-level work that needs doing—which you'd have to ask the wiktionary people about.
Give good results if wiktionary is good. In many cases, radix tries to go a bit beyond what wiktionary explicitly states about etymonic relationships, e.g. with abstract regex inference on wiktionary etymology sections or with inferred sense disambiguation for links. However, a more basic and more important goal should be to make it so that if, hypothetically, wiktionary were perfect, then radix would also be perfect.
Display information usefully. It's one thing to gather all the words that wiktionary says are cognate. It's another thing to present that to the user in an actually useful way. That requires pruning and user settings to narrow focus, lest you show gigantic redundant cognate trees.
Err on the side of yes displaying a word. This is a choice of design goal, but my suggestion is to show more words that might plausibly be cognates. The reason is that, unless radix is perfect, there will definitely be mistakes of exclusion or inclusion, probably both (since wiktionary itself has downright errors incorrectly indicating etymonic descent in templates); so the user probably has to "verify" cognates anyway (at least, read the wiktionary entries). For the use case of finding interesting cognates, it's better to show hypotheses than to exclude them, and let the user sift. But of course if you can fairly confidently exclude a large swath of words, do so, as that reduces noise.
Serve results quickly. Other systems that try to traverse wiktionary, among other problems, are slow. Since finding interesting cognates is often an iterative / interactive investigation, slowness can cripple that process.
Focus on etymology. There are many things that a word-browser could do, such as discuss pronunciation, synonymy, etc. These are interesting (and are somewhat addressed by wiktionary already); but they are largely not relevant to etymology in terms of finding (broadsense) cognates that are distant but are implied by known
Be language-neutral eventually. The current version of radix is very Indo-European centric (other languages are mostly excluded) and somewhat English centric (the default display settings emphasize English words, and etymonline results are compared). Partly that's because English is what I'm familiar with; partly that's because wiktionary is most complete for English; and partly that's because the Indo-European language family is a rich, old, varied, well-attested, well-studied set of languages with many interestingly divergent lines of phonological and semantic development. But in the longer term a radix-like system should be language-neutral. This would suggest, for example, not putting too much effort into features that don't generalize across languages.
Rough and ready. There are many edge cases and difficult design questions. For example, some etymologies given on wiktionary (taken from scholars) indicate "uncertain" and give a "maybe" etymology. Should you count those? Or do some probabilistic notation? Or what? These are reasonable questions, but trying to include uncertainty is a whole separate can of worms, and there's enough complication even if you ignore that whole dimension. So I suggest ignoring as many problems as possible, instead focusing on the core functionality of finding and displaying as many real cognates as possible and excluding or at least labeling as many false cognates as possible.

6.2. The graph structure: Pre-orders

6.2.1. The basics: futureward order

The basic structure we're working with is etymonic descent. We would say that English sing is futureward of Proto-Indo-European *sengʷʰ-:

That means "sing" descends from that PIE word etymologically. In other words, if you played history in reverse, you'd see people using the word "sing" and speaking English; then, as you go backward in time, you'd see people speaking Middle English and using the word "singen"; eventually you'd see people speaking some kind of Proto-Germanic language, and using a word like *singwaną; and then further back, using some word like *sengʷʰ-; and even further back, yet another version of this word, in a language that no one today knows about; and if you went even further back, you'd eventually hear someone making up that word.

We would equivalently say that PIE *sengʷʰ- is pastward of English sing.

Wiktionary doesn't lay out this whole structure of futureward etymonic descent explicitly. Entries for different words may or may not declare some or all of their direct or indirect pastward ancestors and some of their direct or indirect futureward descendants (and sometimes even some of their cognates). This means we often have to infer etymonic descent. A simple example is transitivity of etymonic descent. You often have an English word $W_e$ which says that it descends from a Latin word $W_l$, but doesn't say what PIE word it descends from; and the Latin word $W_l$ says it descends from a PIE word $W_p$. So then we would infer that $W_e$ descends from $W_p$.

The overall strategy we use to find cognates of a word $W$ has two steps:

Find the etymonic pastward ancestors $W_1, W_2, ...$ of $W$.
Find the etymonic futureward descendants of $W_1, W_2, ...$. All of these are (broadsense) cognates of $W$.
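
The two steps above can be sketched as plain graph reachability. This is a toy illustration, not radix's implementation: it assumes the declared links have already been extracted as (descendant, ancestor) pairs, and the function names are made up. Transitivity comes for free, since reachability follows chains of links.

```python
from collections import defaultdict


def reachable(start, edges):
    """All nodes reachable from `start` (including itself) along `edges`."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def cognates(word, pastward_links):
    """pastward_links: iterable of (descendant, ancestor) pairs."""
    past = defaultdict(set)    # word -> declared ancestors
    future = defaultdict(set)  # word -> declared descendants
    for descendant, ancestor in pastward_links:
        past[descendant].add(ancestor)
        future[ancestor].add(descendant)
    # Step 1: pastward ancestors of the word (including the word itself).
    ancestors = reachable(word, past)
    # Step 2: everything futureward of any of those ancestors.
    result = set()
    for a in ancestors:
        result |= reachable(a, future)
    return result
```

Running both traversals per query is the slow way to do it; the takeaway about precomputing the closure once is about avoiding exactly this repeated work.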

For example, if we start with "phenomenon", radix gives this pastward trunk (trunk, as in tree—going downward on the tree is going pastward):

In this case, at least what is displayed is simple: just one PIE root. (Some stuff is omitted here—the "-menon" part has its own ancestry.) Then if we follow the futureward descent of words from this pastward trunk, we get something like:

There are many words in many languages that descend from these roots. Too many to usefully display, often. So instead we just show the descendants from a few chosen languages, and their ancestors. (You can slightly customize what languages radix shows you, in settings.)

Some more notions, given a word $W$:

Futurewardset. This is the set of words that are futureward of $W$. Technically, we consider a word futureward and pastward of itself. This is mathematically simpler, and is standard practice for studying orderings; for example, that way the futurewardset of the futurewardset of $W$ is equal to the futurewardset of $W$.
Strictly futureward. This means "is futureward of $W$ but not pastward of $W$". (As discussed below, there can be cycles in practice even though conceptually there shouldn't be cycles, so this is not equivalent to "is futureward of $W$ but not equal to $W$".)
Equivalent. This means "is both futureward and pastward of $W$". $W$ is always equivalent to itself. (Since there can be cycles, sometimes other words are equivalent to $W$ in the ordering.) [Note: this is different from how I use "order-equivalent" below to discuss diamonds.]
Immediately futureward. Saying "word $Y$ is immediately futureward of $W$" means "$Y$ is strictly futureward of $W$, and there is no word $X$ that is strictly between $W$ and $Y$ (i.e. strictly futureward of $W$ and pastward of $Y$)". This is important because when we display cognate forests, we want to show immediate descent by putting $Y$ right next to $W$; depth-first traversal order for a cognate forest is determined by stepping through the immediate futureward relation.
Pastwardset, strictly pastward, immediately pastward. (Likewise, mutatis mutandis.)
Futuremost. This is a word without any words that are strictly futureward of it. It is a leaf of the etymonic forest; it may be a modern word in a modern language, or a word in an ancient language that died out, or just that word didn't get transmitted, e.g. because the concept was obsolete or because it got replaced by another word.
Pastmost. This is a word without any words that are strictly pastward of it. This is a "root" word. For words in Indo-European languages, often a pastmost is a Proto-Indo-European word; but often not, e.g. because the ancestry of a word in, say, Latin, is unknown; or because the word comes from a non-Indo-European language such as Arabic or Hebrew or many other languages (e.g. for regional words like "orangutan"). The pastmosts of $W$ are the (known) oldest roots of $W$.

(Note: so it's stated somewhere, we assume that "is pastward" and "is futureward" are exact inverses of each other. That is, $W$ is futureward of $X$ if and only if $X$ is pastward of $W$.)
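These notions are easy to state in code once you have the reflexive-transitive closure. The following is a rough sketch, not radix's implementation; all names are hypothetical, and the toy graph deliberately contains a cycle between `b` and `c`. For "immediately futureward" it reads "strictly between" as "strictly futureward of $W$ and strictly pastward of $Y$", which is an interpretation on my part.

```python
def closure(words, links):
    """Reflexive-transitive closure. `links` holds (past, future) pairs."""
    fut = {w: {w} for w in words}
    changed = True
    while changed:
        changed = False
        for past, future in links:
            new = fut[future] - fut[past]
            if new:
                fut[past] |= new
                changed = True
    return fut

def strictly_futureward(w, fut):
    # Futureward of w, but not pastward of w.
    return {x for x in fut[w] if w not in fut[x]}

def equivalent(w, fut):
    # Both futureward and pastward of w (always includes w itself).
    return {x for x in fut[w] if w in fut[x]}

def immediately_futureward(w, fut):
    strict = strictly_futureward(w, fut)
    # y is immediate if no x strictly after w has y strictly after it.
    return {y for y in strict
            if not any(y in strictly_futureward(x, fut) for x in strict)}

def futuremosts(fut):
    return {w for w in fut if not strictly_futureward(w, fut)}

def pastmosts(fut):
    return {w for w in fut
            if not any(w in strictly_futureward(x, fut) for x in fut)}

words = ["a", "b", "c", "d"]
links = [("a", "b"), ("b", "c"), ("c", "b"), ("c", "d")]  # b and c form a cycle
fut = closure(words, links)
```

With this data, `b` and `c` are equivalent to each other, both are immediately futureward of `a`, and `d` is strictly futureward of `a` but not immediately so.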

(The notion of "word" here is problematic; see "6.3. The problem of sense disambiguation" below.)

6.2.2. What real etymologies can look like

In the simplest form, the etymology of a word is linear: the word directly descends from an ancestor word in an earlier language; that ancestor word in turn descends from an earlier word; and so on back to PIE. E.g. "radix":

Not all English words go back to PIE. The humble rock gets lost in Latin:

Some words go back to non-Indo-European languages:

Some words, like compounds, have multiple roots, because at some point two words or morphemes got combined into one word:

(You have a horse-seamonster in your head haha.)

Because of compound words (in a broad sense, including combinations of any two morphemes), the overall graph structure of all words under etymonic descent is not a tree. Two different pastmost words (i.e. words without etymonic ancestors) may share a descendant, formed by compounding or other sort of derivation. We get a picture of a forest, where each tree grows from a pastmost root, and inosculation (h/t Rafe) is widespread.

6.2.3. Pruning ancestor morphemes with many descendants

Some morphemes are highly combinable, such as "pro-", so many words have those morphemes in their ancestry:

We can't load up the descendants of PIE *per- every time someone asks about "propane" or "profane" or "protractor" or "prospect" or "problem" or "propose" or "proper" or etc. etc.; so what we do is, we write down "ine-pro, *per-" in a file (ine-pro is the wiktionary language code for PIE), and we look at that file when we're constructing the pastward trunk, and we exclude those words, and we also exclude words whose only ancestors (i.e. pastward words) are excluded this way. Those are the greyed-out words.
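A minimal sketch of that pruning rule, assuming the file holds one "language code, word" pair per line and that we have each word's direct pastward words. The function names and the placeholder words other than "ine-pro, *per-" are invented for illustration; this is not radix's code.

```python
def parse_blocklist(lines):
    """Turn lines like 'ine-pro, *per-' into keys like 'ine-pro:*per-'."""
    keys = set()
    for line in lines:
        if not line.strip():
            continue
        lang, text = (part.strip() for part in line.split(",", 1))
        keys.add(f"{lang}:{text}")
    return keys

def excluded_words(parents, blocklist):
    """Exclude blocklisted words, plus words whose pastward words are all excluded."""
    excluded = set(blocklist)
    changed = True
    while changed:
        changed = False
        for word, direct_parents in parents.items():
            # A word with no known parents is never excluded by propagation.
            if word not in excluded and direct_parents and direct_parents <= excluded:
                excluded.add(word)
                changed = True
    return excluded

blocklist = parse_blocklist(["ine-pro, *per-"])
# Placeholder trunk: word -> direct pastward words.
parents = {
    "en:start": {"xx:prefix", "xx:stem"},
    "xx:prefix": {"ine-pro:*per-"},
    "xx:stem": {"ine-pro:other-root"},
    "ine-pro:*per-": set(),
    "ine-pro:other-root": set(),
}
greyed_out = excluded_words(parents, blocklist)
```

The starting word survives because it still has one non-excluded ancestor line through `xx:stem`.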

6.2.4. Dealing with diamonds and duplication

An annoying problem arises when you can analyze a word in its current form, or follow its ancestry and then analyze it. For example, is the English word "democracy" composed of two English morphemes "demo-" and "-cracy"? Or is it a descendant of Ancient Greek δημοκρατία, which is itself from "δῆμος" and "κράτος"? Well, it's kinda both? IDK. Here's what we have:

It's not great, not terrible. Could be worse. The way this is displayed, we basically traverse the poset depth first. We also check if we've already included the present word in our traversal; if so, we add this word, but instead of continuing to traverse depth first, we just add ▲ indicating "this word already appeared in the tree, above". If we did not do that, the tree could be considerably larger. For example, this one might be about twice as big:

(Incidentally, that one illustrates a further potential wrinkle, which is that we might want to merge words.)
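The depth-first display with the ▲ marker can be sketched like this. It is a simplified illustration with a cut-down "democracy" trunk, not the real data or the real printing code; `render_trunk` is a hypothetical name.

```python
def render_trunk(root, pastward):
    """Depth-first listing; a word seen before is printed with ▲ and not expanded."""
    lines = []
    seen = set()

    def visit(word, depth):
        indent = "  " * depth
        if word in seen:
            lines.append(f"{indent}{word} ▲")  # already shown above: stop here
            return
        seen.add(word)
        lines.append(f"{indent}{word}")
        for parent in pastward.get(word, []):
            visit(parent, depth + 1)

    visit(root, 0)
    return lines

# Simplified, illustrative trunk: word -> direct pastward words, in display order.
pastward = {
    "democracy": ["δημοκρατία", "demo-"],
    "δημοκρατία": ["δῆμος", "κράτος"],
    "demo-": ["δῆμος"],
}
lines = render_trunk("democracy", pastward)
print("\n".join(lines))
```

The second time δῆμος comes up (under "demo-"), it gets the marker instead of a repeated subtree.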

Could be better, too. The ideal would be to somehow nicely render everything compactly and non-redundantly, where each word shows up once. But note that this is strictly a poset, i.e. it is a poset but is not a tree (because for example "democracy" flows to "δῆμος" via "δημοκρατέομαι", separately via "δημοκρατία", and separately via English "demo-"). Here's the "democracy" pastward trunk again:

Note how we have AGk δημοκρατία; δημοκρατέομαι after the Latin. We do this by merging those two Ancient Greek words. What we do is, before we traverse the preorder, we check to see if some sets of words are order-equivalent (i.e. they have the same set of strict ancestors and strict descendants within our pastward trunk preorder). If they are, then we merge them and display them together. To say it another way, these elements form a sort of "diamond". It would be a big waste of space (lines, especially) to separately show each of those elements coming from their shared ancestor. (Maybe we only do this if the words are in the same language.) It would be even better if we could merge the word on the bottom line in the pastward trunk, because it is basically equivalent.
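Here is one way the merge check could look, as a sketch under assumptions: we already have each word's strict ancestors and strict descendants within the trunk, and we only merge within one language. The data is a toy version of the "democracy" trunk, and the function name is made up.

```python
from collections import defaultdict

def merge_order_equivalent(strict_past, strict_future, language):
    """Group words with the same language, strict ancestors, and strict descendants."""
    groups = defaultdict(list)
    for word in strict_past:
        key = (
            language[word],
            frozenset(strict_past[word]),
            frozenset(strict_future[word]),
        )
        groups[key].append(word)
    return list(groups.values())

# Toy trunk (illustrative only).
language = {
    "democracy": "en",
    "δημοκρατία": "grc",
    "δημοκρατέομαι": "grc",
    "δῆμος": "grc",
}
strict_past = {
    "democracy": {"δημοκρατία", "δημοκρατέομαι", "δῆμος"},
    "δημοκρατία": {"δῆμος"},
    "δημοκρατέομαι": {"δῆμος"},
    "δῆμος": set(),
}
strict_future = {
    "democracy": set(),
    "δημοκρατία": {"democracy"},
    "δημοκρατέομαι": {"democracy"},
    "δῆμος": {"δημοκρατία", "δημοκρατέομαι", "democracy"},
}
merged = merge_order_equivalent(strict_past, strict_future, language)
```

The two middle Ancient Greek words land in one group, so they can be displayed on a single line.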

There's probably ways to do this better than radix does, but you can at least see some of the issues. In theory you'd want to really deduplicate things, so each word shows up once, and use some fancy graph layout thing; but my experience was that this was too much hassle, too hard to control and predict, and maybe most importantly, not compact enough in terms of layout. But, could be good.

Because of diamonds and duplication, even the pastward etymonic descent of one word is not a tree. (Amusingly, there's another reason for this: autodoublets.)

This points to using, not trees, but partially ordered sets (posets). This suggests using the metaphor of an anastomotic river delta, rather than a tree or forest:

6.2.5. High leaf-count

Many entries on wiktionary have many descendants in a boring way. For example, here's the table of conjugations of French chanter:

Every single one of those conjugations has its own wiktionary entry which links back to "chanter". Across all the different words like this in all the languages, that adds up to a ton of largely-redundant information.

We don't want to totally exclude these words. Partly, that would be an ad hoc complication. Also, you definitely still want to represent all these derived and inflected words separately, in the underlying etymonic graph structure. For example, sometimes a modern English word might come from a conjugated Old French word in an opaque way; in this case you want to see that whole line of etymonic descent, so you can understand how the word changed over time. In any case, there's no great reason not to include them at least in the underlying graph, that's trying to be as close to the true anastomotic-river of etymonic descent.

Early versions of radix simply displayed all these conjugations in the straightforward way. This was hugely distracting and pointless, making all the cognate display trees maybe 2-5x bigger than they had to be, for basically no benefit.

The current version of radix hides these words, mainly using a simple trick: If there's a word $W$ with an immediate futureward word $X$ in the same language, and there's no word $Y$ that's in a "spotlighted language" (by default radix spotlights a few languages—English, German, etc.), then we hide $X$. This excludes conjugations. It also excludes compounding. E.g. here's part of the futureward display tree for German Haus:

Note that English house does not display (to the right, as futureward words, i.e. etymonic descendants) a big list of compounds such as {courthouse, birdhouse, boathouse, doghouse, greenhouse, guesthouse, lighthouse, longhouse, outhouse, playhouse, poorhouse, safehouse, schoolhouse, storehouse, warehouse, whorehouse}, even though wiktionary does have entries for those words and they are linked to "house". (It does display bringhouse, because that apparently routes through another language.) Now, the list of compounds is interesting! And, radix does allow you to start with "longhouse" and discover "house"; "longhouse" is fully represented internally. But, for the purposes of finding interesting cognates, you don't need to see these compounds.

6.2.6. Etymonic cycles

A cycle would be if multiple words are claimed to be descended from each other, in a circle. For example, a 2-cycle would be if $W$ says its root is $X$, and $X$ says its root is $W$. Broadly speaking, in the true underlying graph of actual etymonic descent, cycles should not appear.

For several reasons, there can be cycles in our computed graph of etymonic descent. Mainly this is due to errors. See the radix warning page for types of errors. One major type of error is errors or omissions in wiktionary. For example, some links may be simply improperly labeled, claiming etymonic descent incorrectly. Another example is ambiguous senses. Another major type of error is errors in radix; for example, an inferred link (see next subsection) might be inferred incorrectly. All of these errors actually happen.

Yet another type of error is when the scholarly work on the etymons of words is unsure about a word's origin. In that case, there could in theory be hypothesized links pointing in opposite directions (as words and languages are not momentary objects, but rather extended through time).

Even conceptually, the true underlying etymonic graph might actually have cycles, in a certain sense. For example, words can influence the form of other words, e.g. through hyperregularization. Does this count as true etymonic descent? I would say it does, or at least, some instances make a compelling case: one word leaves a visible, quasi-semantically-meaningful morphological difference in another word. If we count this, and if there were cases of mutual influence between words, there would be a genuine cycle.

Therefore, we cannot actually use posets, technically speaking. That is, we cannot assume anti-symmetry—we cannot assume that if $W$ is pastward from $X$, and $X$ is pastward from $W$, then $W$ and $X$ are the same word.

Instead, we only assume that etymonic descent forms a preorder.

6.2.7. Inferring links with abstract regexes

How do we know when a sense is a pastward ancestor or futureward descendant of another word? The main way is just by reading off this information from wiktionary templates such as inherited. See "how" on the radix site. However, this gives a pretty incomplete picture. How do we do better?

As usual, the first and best way to improve the ability of a radix-like system to know about links between words, is to add that information to wiktionary itself.

That said, there are two more major methods that radix uses.

The first major additional method is inference. In this method, we read the text given by wiktionary—mainly from the etymology sections—and infer when that text strongly implies a pastward link between two words, even if the template itself does not assert a pastward link.

Example: English photon, where radix gives:

Now, how does radix know that Ancient Greek φῶς comes from Ancient Greek φάος? If we click on Ancient Greek φῶς in radix, and look towards the bottom of the popup, we see this:

This is a report of all the reasons radix thinks this pastward link exists. It only gives one reason. What is happening in this "reason"? It's saying that the reason is an "Inference" (specifically the one defined in src/inference/big_regex.py). That means it's not just reading the wiktionary template information. It's also then inferring a temporal link. It's a bit convoluted and very ad hoc, so not worth too much detail, but basically this inference rule is saying, look at this text in the etymology section:

From {{inh|el|gkm|φωτία}}, from {{inh|el|grc|φῶς}}, variant of {{m|grc|φᾰ́ος||light}},

Now, the {{m| template definitely cannot be taken as a pastward or futureward link in general. BUT (according to this specific rule), if we have an {{inh| template followed by an {{m| template with ", [something] of" in between, we assume (with a bunch of hard-coded exceptions) that the {{m| template is pastward of the {{inh| template. This is actually the only reason that radix knows Ancient Greek φῶς is futureward from Ancient Greek φάος. For example, the etymology section for Ancient Greek φῶς starts:

Contracted from {{m|grc|φάος}}.

We can't infer pastwardness from {{m| templates. (Another approach would be adding a rule for "Contracted from [template]".)

As an example of unimplemented inferences, here's the pastward trunk for English desire:

This is messy partly because radix doesn't know that Middle English desire also comes from Old French desirrer, and that Old French desirer also comes from Latin desidero. Could radix know that automatically? The etymology section for English desire looks like this:

Just reading that, it's fairly clear that e.g. Old French desirrer and Old French desirer are being treated as equivalent here. I'd say that this is enough evidence to automatically assume that there is at least one sense of each of those Old French words which is a descendant of Latin desidero.

(Note: currently, wiktionary explains desirer as an "alternative form of desirrer (“to desire”)". My guess is that radix simply doesn't recognize those types of links; that's another straightforward way to fix this missing inference.)

What does the underlying wiktionary text look like? At the moment it looks like:

From {{inh|en|enm|desir}}, {{m|enm|desire|pos=noun}} and {{m|enm|desiren|pos=verb}}, from {{der|en|fro|desirer}}, {{m|fro|desirrer}}, from {{der|en|la|dēsīderō|t=to long for, desire, feel the want of, miss, regret}}, apparently from {{m|la|de-}} + {{m|la|sidus}} (in the phrase ''de sidere'', "from the stars") in connection with astrological hopes. Compare {{m|en|consider}} and {{m|en|desiderate}}. Displaced native {{ncog|ang|wilnung||desire}} and {{m|ang|wilnian||to desire}}.

Since radix already parses this somewhat semantically, what remains is to analyze this etymology to extract more links. This is probably doable. (In fact, the current codebase of radix might already have a draft implementation—just, not included in the older version on the site.)

What's powering all this is abstract regexes. The idea of an abstract regular expression is that you do regular expressions, but instead of the primitive elements being characters tested by identity, you have the primitive elements being anything tested by any predicate you've provided. This way we can use our separate parser for wiktionary text, which produces a list of tokens (like [piece of text, wiktionary template with XYZ info, text, template, text, template]), and then operate on that with regexes that can recognize stuff like

A definite pastward template such as inherited or derived; followed by some text that starts with "," and ends with "from "; followed by a template that is either a definite pastward template, or m

Or that kind of thing. Fairly powerful in this context. You can see the inferences used in src/inference/refs_links.py. (Note that each one of these required substantial testing to see whether they picked up wrong things; several were discarded, as I recall it.)
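To give a feel for the idea, here is a tiny sketch covering only the simplest fragment of an abstract regex: a fixed sequence of predicates matched against a token list. A fuller version would add alternation and repetition. None of this is taken from radix; `Template`, `search`, and the predicates are hypothetical, and the token list is a hand-tokenized version of the photon etymology text quoted above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Template:
    name: str
    args: tuple

def is_pastward(tok):
    # "Definite pastward" templates in this sketch: inherited and derived.
    return isinstance(tok, Template) and tok.name in {"inh", "der"}

def is_inherited(tok):
    return isinstance(tok, Template) and tok.name == "inh"

def is_mention(tok):
    return isinstance(tok, Template) and tok.name == "m"

def text_between(first, last):
    """Predicate for a text token that starts with `first` and ends with `last`."""
    def pred(tok):
        return isinstance(tok, str) and tok.startswith(first) and tok.endswith(last)
    return pred

def search(pattern, tokens):
    """Return the index of the first window where every predicate holds, else None."""
    n = len(pattern)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(pred(tok) for pred, tok in zip(pattern, window)):
            return start
    return None

tokens = [
    "From ",
    Template("inh", ("el", "gkm", "φωτία")),
    ", from ",
    Template("inh", ("el", "grc", "φῶς")),
    ", variant of ",
    Template("m", ("grc", "φᾰ́ος", "", "light")),
    ",",
]

# Pastward template; text ", ... from "; pastward template or m.
from_rule = [is_pastward, text_between(",", "from "),
             lambda t: is_pastward(t) or is_mention(t)]
# inh template; text ", [something] of "; m template.
of_rule = [is_inherited, text_between(",", "of "), is_mention]

from_hit = search(from_rule, tokens)
of_hit = search(of_rule, tokens)
```

The point of the design is that the predicates can look at parsed template structure, which plain character-level regexes can't do cleanly.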

6.2.8. Inferring links with the transitive closure

When we have inferred that $W$ is futureward from $X$ and $X$ is futureward from $Y$, then we can also infer that $W$ is futureward from $Y$, even if that isn't explicitly stated anywhere.

Thus, we can take the transitive closure of the "is futureward" relation. All the relationships inferred this way should be correct, if the input relations are correct. If the input relations are incorrect, we will make a bunch of false inferences.
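A minimal sketch of the closure, and of how one wrong input link multiplies into several wrong inferences. It assumes the direct links have already been aggregated into one map; the names are placeholders and this is not how radix stores things.

```python
def transitive_closure(direct):
    """direct: word -> set of words directly futureward of it."""
    closed = {}
    for start in direct:
        seen = set()
        stack = list(direct[start])
        while stack:
            word = stack.pop()
            if word not in seen:
                seen.add(word)
                stack.extend(direct.get(word, ()))
        closed[start] = seen
    return closed

# W is futureward from X, and X is futureward from Y.
direct = {"Y": {"X"}, "X": {"W"}, "W": set()}
closed = transitive_closure(direct)

# Now add a single incorrect link from W to some unrelated word Z.
bad = {"Y": {"X"}, "X": {"W"}, "W": {"Z"}}
bad_closed = transitive_closure(bad)
false_pairs = sum("Z" in futures for futures in bad_closed.values())
```

In this toy chain, the one bad link makes `Z` show up as futureward of all three words.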

In practice, in my experience, this does happen, but not so much that results are degraded beyond use. Egregious errors in wiktionary are present but rare, and in fact radix is a good way of finding certain kinds of errors in wiktionary. You browse radix; check for anomalies, like implausible cognates; investigate why; usually it's because of a problem with radix; but sometimes it's an error in wiktionary, which can then be corrected. (Anomalies could also be automatically detected, e.g. by searching for time reversals, i.e. cases where radix thinks that an English word is pastward of an Ancient Greek word, or similar.)

Because links are NOT necessarily symmetric within wiktionary—i.e. sometimes $W$ says "I come from $X$" but $X$ does not say "$W$ comes from me"—we CANNOT compute the transitive closure in a local manner. We cannot just traverse the graph starting from one word and following its links. That does not work. We MUST aggregate information from the entire graph. This creates significant implementation challenges, which I'll discuss below in 6.4. How radix is implemented.
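The asymmetry problem in miniature, as a sketch: each entry only knows what its own page declares, so finding the descendants of $X$ requires a pass over every entry. The entry format here is invented for illustration and is not wiktionary's or radix's.

```python
from collections import defaultdict

def aggregate(entries):
    """One pass over ALL entries, indexing every declared link in the futureward direction."""
    futureward = defaultdict(set)
    for word, declared in entries.items():
        for ancestor in declared.get("ancestors", ()):
            futureward[ancestor].add(word)      # "I come from ancestor"
        for descendant in declared.get("descendants", ()):
            futureward[word].add(descendant)    # "descendant comes from me"
    return futureward

# W and V each say "I come from X", but X's own page declares nothing.
entries = {
    "W": {"ancestors": ["X"]},
    "V": {"ancestors": ["X"]},
    "X": {},
}

local_view = set(entries["X"].get("descendants", ()))  # what X's page alone tells us
global_view = aggregate(entries)["X"]
```

Starting from `X` and following only its own declared links finds nothing; the aggregated index finds both `W` and `V`.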

For now I'll just note one major design decision in radix. We actually have two graphs. As discussed in the next section "6.3. The problem of sense disambiguation", we have to track senses and links between senses. However, we don't want to do this at first. That's because:

It's easier to do sense disambiguation if you already have access to the full transitive closure of grams under futureward ordering. You have a small structure, the gramwise-pastwardset of a starting sense, within which to compute heuristics that guess about the senses meant by various links.
Sense disambiguation is hard, so we want to experiment with it more; it is therefore more fluid; therefore we want to precompute as much as possible without having to lock in a sense disambiguation method. In order to do that, we need a more expansive notion of links; and using grams (letter-string plus language) as the basic element works well.
Maybe other reasons, e.g. storage efficiency.

For these reasons, radix uses a two-layered system. First we infer many relations between grams and compute the transitive closure. Then, only after we've built that large structure, we extract a secondary structure of "futureward wordposets" for all the pastmost "words". (Which is to say: We find all the senses that don't have ancestor senses, and for each of those we compute the futureward preorder of senses that might come from that sense—which is some subset of the gramwise futurewardset of the gram of that pastmost word.)

6.3. The problem of sense disambiguation

6.3.1. What's a word?

A word is, like, a string of letters, right? WRAWNG.

That's not what a word is. First of all, a word has a language. English voyage and French voyage are definitely not the same word.

Second of all, a word has different senses. What exactly a "different sense" is might be unclear. (E.g. are "rake" the verb vs. "rake" the noun different senses? What about "mouse" (mammal) vs. "mouse" (computer device)? IDK.) But the important thing for us is the etymonic senses. So verbs and nouns are the same sense, and multiple different definitions listed in a list of several definitions are all part of the same one sense. However, "weave" and "weave" are two different senses! It is not uncommon that a single gram (string of letters paired with a language) has multiple etymonic senses. For example, the string "lead" in English has two etymonically unrelated meanings: one is about a metal, the other is about going ahead in front of followers.

So what we want is a preorder on senses, where a sense is a string of letters in a language with a specific etymological ancestry.

(In the code of radix, I actually use "Gram" to mean "string of letters plus a language", and "Word" to mean "a gram, plus a number specifying the sense". See src/classes.py.)

To a large extent, you can recognize different senses in wiktionary entries syntactically: Basically, sections with the header "Root" or "Etymology" are new senses. The actual rules are more complicated and also not formally enforced in wiktionary. This problem is addressed (in a very ad hoc, incomplete, janky manner) by src/parsearticle.py.

But this is not the whole story when our theory hits the messy reality of wiktionary. There are several problems. One problem is that sometimes the same string of letters has two Etymology entries for the same language, BUT really one of them just comes directly from the other. Another problem is that many entries in wiktionary don't exist or are empty or incomplete; in particular, there may be some missing senses. This means that if two etymologically unrelated words point to the same string of letters in the same language, and there is no entry for that language, then we are almost completely out of luck, by default; we cannot tell which ancestors of the collider correspond to which descendants of the collider.

Yet another issue is that, as I may have mentioned, Middle English in particular is Bad:

Really bad:

(Oh my God.)

There's probably some excuse, like "at least we were even trying to write things down and figure out spelling, sorry if it wasn't magically already standardized" or something. But anyway, this makes one think that multiple different grams (strings of letters) are kinda the same sense.

This problem of multiple grams for one sense exacerbates the much worse problem of ambiguous senses.

6.3.2. Ambiguous senses

The existence of multiple senses leads to a design choice, which is that the fundamental unit—the elements of the grand etymonic preorder on all words—is senses, not grams or strings of letters. A sense is a gram, plus a specification of which sense.

(In the current form of radix, the specification is just a number; 1,2,.. are the explicit senses demarcated by an "Etymology" or "Root" section in a wiktionary entry, 0 is a default sense for entry contents not inside one of those containers, and -1 is for when there is no wiktionary entry but a gram is linked to by another entry. Since wiktionary does not usually explicitly specify senses, radix makes a guess at what senses there are.)
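As a sketch of what these two notions might look like as data types (this is a guess at the shape, not the contents of src/classes.py; the constant names are made up):

```python
from dataclasses import dataclass

DEFAULT_SENSE = 0        # entry contents outside any "Etymology"/"Root" section
MISSING_ENTRY_SENSE = -1 # gram is linked to, but has no wiktionary entry

@dataclass(frozen=True)
class Gram:
    """A string of letters paired with a language."""
    language: str
    text: str

@dataclass(frozen=True)
class Word:
    """A gram plus a number specifying the etymonic sense."""
    gram: Gram
    sense: int

    def has_entry(self) -> bool:
        return self.sense != MISSING_ENTRY_SENSE

# Same letters, different languages: different grams.
voyage_en = Gram("en", "voyage")
voyage_fr = Gram("fr", "voyage")

# Same gram, two etymonically unrelated senses: different words.
lead = Gram("en", "lead")
lead_1 = Word(lead, 1)
lead_2 = Word(lead, 2)
```

Making both classes frozen means they are hashable, so they can be used as keys in the maps and sets that the preorder computations need.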

The problem with ambiguous senses is that it leads to many false cognates. For example, the section "4.3. Problems with radix" above gave the example where English value points to PIE *h₂welh₁-, and English wild points to a different sense of the same gram PIE *h₂welh₁-. Because of this collision, when we're tracing forward from the pastward trunk of "value", we include "wild" because we can't automatically tell which sense is the sense of PIE *h₂welh₁- that "wild" is pointing to. In accordance with "6.1. General principles", we want to show "wild" as cognate to "value" (meaning really, "here's a thing that MIGHT be cognate with value, check it yourself"), if we can't confidently tell that it IS NOT cognate with "value".

6.3.3. Addressing ambiguous senses

In accordance with the general principle of improving wiktionary, I should note that wiktionary does have a mechanism for demarcating senses within entries for single grams, and specifying which sense is pointed to by a pastward or futureward link from another entry. So the "right" way to fix any given instance of this problem is to add the information to wiktionary. (And actually add the ability to incorporate that information to radix, which I had not yet done.)

That said, there's a few other things we can do. Suppose that English wild links pastward to PIE *h₂welh₁-, but does not specify a senseid that it is linking to. This happens to currently be the case; if you click edit, you can see the specification of the "wild" entry, and there you can see (until it's edited) that the entry gives this link:

{{der|en|ine-pro|*h₂welh₁-|t=hair, wool, grass, ear (of corn), forest}}

To translate, that means:

[der] the sense of the present entry derives (pastward; maybe indirectly, through other senses in other languages) from the word specified by this link
[en] the sense of the present entry is an English sense
[ine-pro] the sense of the pastward word is a Proto-Indo-European sense
[*] (this is a reconstructed word, not attested in any known text)
[h₂welh₁-] (the text of the linked pastward word)
[t=] (translation into English)
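The same reading can be done mechanically. This is a deliberately naive sketch: it assumes the template has no nested templates or piped links inside it, and it treats any argument containing "=" as a named argument. The function and field names are hypothetical, not radix's parser.

```python
def parse_template(source):
    """Split '{{name|a|b|key=value}}' into name, positional args, and named args."""
    if not (source.startswith("{{") and source.endswith("}}")):
        raise ValueError("not a template")
    parts = source[2:-2].split("|")
    name, positional, named = parts[0], [], {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if sep:
            named[key] = value
        else:
            positional.append(part)
    return name, positional, named

def read_derivation(source):
    """Interpret a der-style template: present language, pastward language, pastward term."""
    name, positional, named = parse_template(source)
    term = positional[2]
    return {
        "template": name,
        "present_language": positional[0],
        "pastward_language": positional[1],
        "pastward_term": term,
        "reconstructed": term.startswith("*"),
        "gloss": named.get("t"),
    }

link = read_derivation(
    "{{der|en|ine-pro|*h₂welh₁-|t=hair, wool, grass, ear (of corn), forest}}"
)
```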

How could we tell which sense of PIE *h₂welh₁- this link points at? One thing that radix does is to check if one of the senses of PIE *h₂welh₁- links (directly or indirectly) futureward to English wild. If so, we can try to exclude the other sense of PIE *h₂welh₁- from consideration as being the target of this particular link. This is kinda complicated to implement, and it might not actually work so well, but in my tests it seems to be ok at cutting down on incorrect cognates, hopefully without missing too many real ones. (Also, radix does a more complicated thing, that we needn't discuss; see src/traversing.py.)
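A stripped-down sketch of that heuristic, with the fallback of keeping every sense when none can be confirmed. This is much simpler than what src/traversing.py does, and the sense numbers and reachability data are placeholders, not real wiktionary content.

```python
def candidate_senses(linker, senses, futureward_of_sense):
    """Senses of the target gram that a pastward link from `linker` might mean.

    futureward_of_sense: sense id -> grams that this sense is known
    (directly or indirectly) to link futureward to.
    """
    confirmed = [s for s in senses if linker in futureward_of_sense.get(s, set())]
    # If no sense confirms the link, don't rule anything out.
    return confirmed or list(senses)

# Placeholder data: a target gram with two senses.
senses = [1, 2]
futureward_of_sense = {
    1: {"en:A"},
    2: {"en:B"},
}

for_a = candidate_senses("en:A", senses, futureward_of_sense)
for_b = candidate_senses("en:B", senses, futureward_of_sense)
for_c = candidate_senses("en:C", senses, futureward_of_sense)  # nothing confirms
```

The fallback is what keeps the heuristic from shutting down potential cognates, at the cost of leaving some collisions in place.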

Even with that heuristic, in order to not shut down too many potential cognates, we always assume that a pastward link to a gram refers to at least one of the available senses in wiktionary. This is a problem if there are multiple senses of the pastward gram, but the wiktionary entry for that gram lists only one sense or doesn't exist at all. In that case, we're nearly forced into a collision and false cognates.

What else can we do?

In some ideal system, we would at least notice when it seems like there are missing senses causing collisions. Then these anomalies could be raised to the attention of editors, if the automatic detection is good enough to not just be spam. In some cases you could even fabricate senses automatically, though I would guess that that strategy would end up being more complex than the value (and instead you should just improve wiktionary itself).

One fairly straightforward idea is to use the meaning of the word. The example link above provides a translation of the target pastward word:

|t=hair, wool, grass, ear (of corn), forest}}

Looking at the two wiktionary entries for PIE *h₂welh₁-, it's clear which one of the two is being targeted by this link from "wild". Even if a link template doesn't provide a translation, you might still be able to infer which sense is linked by the definition of the linking word.

Implementing this would be quite nontrivial, but doable. You could try just checking overlap of words, maybe restricting to content words; or you could try some word-embedding-related thing. Using LLMs is a significant possibility. However, since there are millions of wiktionary entries and tens or hundreds of millions of links, you probably don't want to do a naive implementation. Instead you could use cheap rules to deal with obvious confident cases, and otherwise use a tutoring scaffold where the LLM makes guesses about the target sense, and the human supervises.
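The cheapest version of the word-overlap idea might look like this. It is only a sketch: the stopword list is arbitrary, and the two sense glosses are invented placeholders rather than quotations from the wiktionary entries. Only the link gloss is taken from the template above.

```python
import re

STOPWORDS = {"of", "the", "a", "an", "to", "and", "or"}

def content_words(text):
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def guess_sense(link_gloss, sense_glosses):
    """Pick the sense whose gloss shares the most content words; None if unclear."""
    link = content_words(link_gloss)
    scores = {s: len(link & content_words(g)) for s, g in sense_glosses.items()}
    if not scores:
        return None
    ranked = sorted(scores.values(), reverse=True)
    top = ranked[0]
    # Refuse to guess when nothing overlaps or when the top score is tied.
    if top == 0 or (len(ranked) > 1 and ranked[1] == top):
        return None
    return max(scores, key=scores.get)

link_gloss = "hair, wool, grass, ear (of corn), forest"
# Placeholder glosses for two senses of the target gram (made up for this example).
sense_glosses = {
    1: "to rule, to be strong",
    2: "hair, wool",
}
guess = guess_sense(link_gloss, sense_glosses)
unclear = guess_sense("something unrelated", sense_glosses)
```

Returning `None` for unclear cases is what would let a cheap rule like this hand the hard cases off to a slower method.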

6.4. How radix is implemented

There's too much detail, much of which is irrelevant anyway and much of which I don't remember, to go into everything. As a general note, I'll say that I learned a lot working on this—and conversely, this project includes a lot of major design mistakes, partly as holdovers and partly as things I wouldn't know how to do better even today. (E.g. handrolled caching, powerful but ad hoc testing, weird SQL nonsense, handling all sorts of weirdness like language codes and line noise in raw entries, etc.) The code (in a broken state) is here: https://github.com/tsvibt/public_radix_sep_2025/tree/main/src

I'll just give a couple sketches. The overall structure, conceptually, is as follows:

Get a wiktionary xml dump of all articles on (English) wiktionary
Extract the big graph (preorder) of words related by pastward/futureward relationships
When given a word $W$, get all the words pastward of $W$ and get all the words futureward of any of those words
Restrict these structures to the relevant words
Print out these restricted structures

Most of this work is done in one big day-long precomputation step. This step passes over the entire wiktionary (sort of) several times, building up the information we'll need later in order to serve results fast, and storing that information in a big (huge, >30GB) SQL database. The code for this big precomputation is in src/precomputing. This is one way in to understanding the code base a little bit. In a bit more detail:

Parse the xml into wiktionary entries for words. {A_xmldump_redirects.py, B_xmldump_separate.py, C_AB_xmls_parse.py}
- We discard many articles, e.g. articles that describe how wiktionary works; we only want word entries.
- We try to guess (in a handrolled ad hoc way) where the word and sense boundaries are.
Go through the entries to get links between words. {D_C_parsed_allmentioners.py, E_C_parse_linksrefs.py, F_E_linksrefs_pastwardsfuturewards.py}
- This is the main place where we're actually "reading" articles. We extract links between words and index those links to the words (rather than having them buried opaquely in wiktionary entries).
- This is where we use the meaning of wiktionary templates and where we use abstract regexes.
Compute the big graph (preorder) of grams related by pastward/futureward relationships. {Q_E_futurelinks_wordgramrefs, G_F_grampastmosts.py, H_G_gram_immediatefutures.py, I_G_gram_strictfutures.py, M_I_gram_lang_allfutures.py, N_QGI_gram_coverings.py}
- This is where we compute the transitive closure, which is much of the computational difficulty. We compute and store full pastwardsets and futurewardsets. In a sense this is very inefficient because of the space consumption and time taken to write. On the other hand, I think it's faster than recomputing closures of things. But I'm not actually positive about that. I did try to test it, but I may have misunderstood the meaning of some tests; e.g. the computer's automatic caching behavior and memory paging behavior is, to me, unpredictable and confusing and different between different runs of the same program.
- We also compute useful structures such as a map from grams to the grams that are immediately futureward of that gram.
Compute structures about words, i.e. grams with sense numbers attached. {K_C_parsed_realgivenwords.py, O_KN_word_pastwardwordposet.py, P_O_pastmostword_futurewordposet.py, Q_E_futurelinks_wordgramrefs.py}
- We do various shenanigans to infer which senses point to which senses.
- We don't store futureward sense preorders for every word. Because we are doing shenanigans, unlike for the global gram futureward preorder, the sense preorder for a sense in the middle of a sense preorder is not necessarily just the order restriction to the futurewardset of that sense. Instead we compute all the pastmost senses, and compute the futureward sense preorders of those pastmost senses.

After all this precomputation, the system is ready to serve requests. Given a starting word (and other settings), we compute and print radix's guess at the relevant pastward sense preorder of that word, and the relevant futureward sense preorder of the words in the pastward preorder. The code for this is in src/printing.

There's various ways that we prune the preorder. See the above section "The graph structure: Pre-orders".

6.5. Shattering preorders

Listed in "6.1. General principles" is the goal of serving results fast, because this materially affects the usefulness of a system like this for someone who's trying to understand many cognates for several words. What are some of the challenges with quickly serving results?

Basically, at least the way radix is currently set up, and assuming I recall correctly, the main issue for performance is that the preorder objects are really big. There are two basic reasons for this:

1. A single pastmost root word might have very many descendants: sometimes tens or possibly hundreds of thousands, across many languages.
2. The full way of representing these structures, including a mapping from each word to its full futurewardset, is very, very redundant (something like quadratic in the number of words).

I don't actually recall whether or not you can avoid the problem in 2. You might be able to get away with only retrieving the immediate-orders; but I'm not sure, and I think that requires some complicated precomputations. Anyway, this section is about problem 1.

One main thing you do to address problem 1 is to prune before storage. See e.g. "High leaf-count" above on pruning.

What else can we do?

First of all, why is the futureward preorder so big? The main issue is just that there are very many languages, even restricting to the Indo-European family. (See here; click on "Family tree" to see hundreds of Indo-European languages.) Most of these languages will be only very thinly represented, but still. This means the out-degree of many words, especially PIE words, can be very large; e.g. PIE *bʰer- has dozens of derived descendants, some of which themselves have many descendants (e.g. Latin fero leading to {transfer, refer, infer, confer, etc.} or Ancient Greek φέρω leading to {metaphor, phosphorous, periphery, etc.}).

A first-draft idea would be to simply exclude, from the outset, all words outside of some small set of interest. But this doesn't even work, because

1. you have to include the ancestors of words from your languages of interest; and
2. you want to infer links between those words from the etymology sections of words from languages that aren't in your special set.

So, you want to do the precomputation anyway with all the languages.

A second-draft idea for reducing the size would be to precompute the whole large graph, but then only store and serve the preorders that you get by restricting to the languages of interest. This proposal can work. However, it is not language-neutral, which goes against a principle mentioned above in "6.1. General principles".

Also importantly, it's simply a much less useful tool, even for a very English-centric user. Personally I am not infrequently genuinely interested in seeing only results for English; and I am not infrequently genuinely interested in seeing results for English and Latin and German and Ancient Greek. Therefore, a better solution would be to somehow serve results for whichever set of languages the user asks for.

This leads to the method of shattering the preorder, and then only serving the relevant shards. The idea is basically that for each root (i.e. pastmost word), we compute the futureward preorder. Then, for each language that we want to make available, we compute the restriction of the preorder to words in that language and their ancestors. (This happens during precomputation in src/precomputing/M_I_gram_lang_allfutures.py.)

This leads to somewhat redundant storage, in that, for example, a PIE root's futureward preorder when restricted to English might significantly overlap with the same for German. However, any given language's preorder tends to be much smaller than the full preorder. Since, at least for me, it's usually not that useful to request a display for many languages at once, the speed gains are substantial.
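
As a rough sketch of what computing one shard might look like (this is not the code in M_I_gram_lang_allfutures.py; the function name, the `language_of` map, and the hand-written toy closure are assumptions for illustration):

```python
def shard_for_language(root, language, futureward, language_of):
    """Restrict root's futureward preorder to one language's words plus their ancestors."""
    universe = futureward[root]
    targets = {w for w in universe if language_of[w] == language}
    # Keep every word in the root's preorder that has a target at or futureward of it.
    # Because the sets are reflexive, the targets themselves are kept too.
    keep = {w for w in universe if futureward[w] & targets}
    return {w: futureward[w] & keep for w in keep}


# Toy data (hypothetical, hand-written closure; not real radix output).
futureward = {
    "*bʰer-": {"*bʰer-", "fero", "transfer", "bear", "gebären"},
    "fero": {"fero", "transfer"},
    "transfer": {"transfer"},
    "bear": {"bear"},
    "gebären": {"gebären"},
}
language_of = {
    "*bʰer-": "PIE",
    "fero": "Latin",
    "transfer": "English",
    "bear": "English",
    "gebären": "German",
}

english = shard_for_language("*bʰer-", "English", futureward, language_of)
german = shard_for_language("*bʰer-", "German", futureward, language_of)
print(sorted(english))
print(sorted(german))
```

Both shards contain the PIE root, which is the redundancy described above, but each is smaller than the full preorder.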

Note that there are other strategies for going even faster. For example, you could cache the entire radix page, including the large futureward preorder display. But this is brittle to settings, e.g. selecting different languages. It would not be efficient to store the whole page for every set of languages that a user wants to see displayed. (But it would be good to store this for the most common few combinations.)

To illustrate:

The newer languages extend all the way back because the wedges represent the pastward closure, within the PIE root's futureward preorder, of the newer language's words. (This diagram is not accurate or to scale, it's just trying to illustrate shattering haha.)

Now, we could display this whole thing. Or, we could just pull the restricted preorders necessary to display results for English and German only. Then we get:

It's a lot smaller, and it's even more a lot smaller if you include the fact that we're excluding many more of the modern languages (which tend to have many wiktionary entries).

This does add computational burden, in that we have to merge the preorders. But this is fairly fast. The main cost is actually reading from disk, IIRC. This may mean that the storage and/or retrieval method is hopelessly slow; but I did tinker with that a fair amount.
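
A minimal sketch of the merge step, assuming each shard is stored as a map from a word to its futurewardset within that shard (that storage layout and the names here are assumptions, not necessarily what radix does):

```python
def merge_shards(*shards):
    """Union several per-language shards into one futureward map."""
    merged = {}
    for shard in shards:
        for word, futures in shard.items():
            merged.setdefault(word, set()).update(futures)
    return merged


# Toy shards (hypothetical; not real radix output).
english_shard = {
    "*bʰer-": {"*bʰer-", "fero", "transfer", "bear"},
    "fero": {"fero", "transfer"},
    "transfer": {"transfer"},
    "bear": {"bear"},
}
german_shard = {
    "*bʰer-": {"*bʰer-", "gebären"},
    "gebären": {"gebären"},
}

merged = merge_shards(english_shard, german_shard)
print(sorted(merged["*bʰer-"]))
```

A plain union is enough here because each shard is closed under ancestors: if a word has any descendant in a given language, the word itself is already in that language's shard.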

6.6. End-to-end testing

A major issue with radix-like systems is that bugs are silent. You can run a giant day-long precomputation, and then find out that you messed up some logic; it didn't make an invalid program, it just failed to compute good results. (I assume this is the same in any programming task that's about squeezing answers from data.)

To some extent you have to bite the bullet and manually sanity check things. This is especially true when changing the actual intended semantics, e.g. adding a new rule for inferring links between words. This changes most of the datastructures that radix produces. In theory you could build up a set of modular tests, like that word $W$ should be known to be pastward of $V$, and that $X$ should be known to not be futureward of $Z$. But, I haven't done that.

A next-best partial solution for testing is end-to-end testing. The idea here is to check if the full output of the system is the same before and after a change. This does not help with intended semantic changes, but it does help with a large class of edits, especially refactors. For example, I used this to improve performance in many ways, while being fairly confident I wasn't messing up any of the semantics.
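
A before-and-after comparison of this kind could be sketched like so. The helper names and the two stand-in render functions are hypothetical; in practice they would be the old and new versions of the printing code.

```python
import difflib


def diff_outputs(before, after):
    """Return unified-diff lines between two printed outputs (empty if identical)."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )


def changed_words(words, render_before, render_after):
    """Map each word whose printed output changed to its diff."""
    changed = {}
    for word in words:
        diff = diff_outputs(render_before(word), render_after(word))
        if diff:
            changed[word] = diff
    return changed


# Stand-ins for the old and new printing code (hypothetical).
def render_before(word):
    return f"{word}\n  child-a\n  child-b"


def render_after(word):
    # Pretend a refactor accidentally dropped a line for one word.
    if word == "desire":
        return f"{word}\n  child-a"
    return render_before(word)


report = changed_words(["desire", "star"], render_before, render_after)
print(sorted(report))
```

An empty report after a refactor is the signal that the change was semantically inert, at least for the words checked.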

At least two challenges came up with setting up end-to-end testing.

One was determinism. For the idea to work, different runs have to always give the same answer. The main issue here was the fact that Python sets don't have a stable iteration order. Mostly this is fine because preorders encoded as downsets don't care about how the downsets are stored. However, when we convert to a list for printing, we can get non-determinism (i.e. the result depends on Python's hash seed for that session). To fix this, we make sure to sort at some point before traversing.
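
For illustration, here is a small sketch of sorting before traversal. The `render` function and the toy map are assumptions for the example, not radix's printing code.

```python
def render(immediate_future, root):
    """Print-order traversal that sorts each set of children before visiting."""
    lines = []
    seen = set()

    def visit(word, depth):
        if word in seen:
            return
        seen.add(word)
        lines.append("  " * depth + word)
        # Iterating the raw set here could give a different order on each run,
        # since string hashes depend on the session's hash seed.
        for child in sorted(immediate_future.get(word, ())):
            visit(child, depth + 1)

    visit(root, 0)
    return lines


# Toy data (hypothetical; not real radix output).
immediate_future = {"fero": {"transfer", "refer", "infer", "confer"}}
lines = render(immediate_future, "fero")
print("\n".join(lines))
```

Sorting only at traversal time keeps the stored structures as plain sets, so the fix does not touch the precomputed data.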

The major problem is precomputation time. If I'm modifying the methods for final printing, I can just test their outputs against the previous outputs, no problem. But if I'm modifying the core code that infers links, computes the global preorder, infers senses, etc., then large chunks of the database could be affected. To fully test that code would require running the entire giant computation again. That's infeasible.

Instead, the idea is to make "subuniverses". Basically, we take one or a small set of words (say, $W$); we get their pastward-futureward preorders; and then we take a "halo" around the set of words in those preorders, which comprises all words that mention any of the words in that big set. Then we dump all these words, including the halo, plus their raw wiktionary entry articles, into a new (much smaller) xml file.

The code that orchestrates this is here: scripts/make_subuniverse.py

That xml file can be used as the starting point for a whole new precompute run, followed by testing the outputs of printing methods. If the change is supposed to be semantically inert, then when we pass the original starting word $W$ to be printed, the result should be exactly the same. We created a little "subuniverse" comprising everything that affects $W$, so, just from the perspective of $W$, everything is the same. (The results for other words, even ones in the subuniverse, may have changed.)

The point of expanding to the halo of all mentioners is to be able to notice when the change we make will affect how words get marked as linked to other words. If we didn't do this, then a modified rule of inference might, when run on the full universe, say that actually it turns out $V$ does link pastward to $W$. But since the subuniverse we constructed was assuming $V$ did not link to $W$, if we don't include the mentioner halo, we did not include $V$ in the subuniverse. So we do not notice the difference. This would be a failure of our end-to-end testing.
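
The selection of words for a subuniverse could be sketched as below. This is not the logic of scripts/make_subuniverse.py; the function name, the `mentions` map (each word to the set of words its entry mentions), and the toy data are assumptions for illustration.

```python
def subuniverse_words(seed_words, pastward, futureward, mentions):
    """Return the core preorder words plus the halo of words that mention them."""
    core = set()
    for word in seed_words:
        # Pastward-futureward preorder: everything futureward of anything
        # at or pastward of the seed word.
        for ancestor in pastward[word]:
            core |= futureward[ancestor]
    halo = {w for w, mentioned in mentions.items() if mentioned & core}
    return core | halo


# Toy data (hypothetical; not real radix output).
pastward = {"W": {"W", "R"}}
futureward = {"R": {"R", "W", "X"}, "W": {"W"}}
mentions = {
    "V": {"W"},  # V's entry mentions W but is not (yet) linked to it
    "U": {"Q"},  # U mentions nothing in the core
    "X": {"R"},
}

words = subuniverse_words(["W"], pastward, futureward, mentions)
print(sorted(words))
```

Here V lands in the subuniverse only because of the halo, so a changed inference rule that newly links V to W would show up in the output for W.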

7. Conclusion

I think the current form of radix is useful for my niche application of finding distant non-obvious cognates more easily than just clicking around wiktionary. But, it could be so much better, with a bunch more work. Again I'd be happy to chat with anyone who might be seriously interested in making a high-quality radix-like system. You can reach me at my gmail address: radixtsvibt

I think the current minimally working version of radix demonstrates that something like this is feasible, albeit difficult. I hope I've described enough to also communicate that there's lots of room for significant improvement.

I think that words are important because they help you think well (and unsuitable words make you think mistakenly). When I learn about the connections between words I see the stars ("desire", "de-sidere", "sidereal", "of the stars") and I see the future of human thought. I'd like there to be a high-quality 'scope for words.

Meelen, Marieke, Nathan W. Hill, and Hannes Fellner. What Are Cognates? University of Edinburgh, 23 October 2023. https://doi.org/10.2218/pihph.7.2022.7405.


Forum poweruser forum

tsvi bt Nov 21, 2025 Updated Nov 23, 2025


[The year is 2031. Unscoped AI development has been frozen at 2029 levels by an international treaty. The growing awareness of the abstract general problem of the abject state of group rationality spurs the creation of a new kind of forum—by, of, and for forum powerusers—aimed at hot-swapping collective sensemaking structures until a good one is developed.]

0722 on the West Coast, I log in.

First, I check updates on my posts. Most of them don't warrant interaction from me.

Diya Agarwal (real name, verified, like almost everyone on the site) critiqued a critique of a critique of a paragraph in my post. It's a solid critique. I complete-agree-vote the whole critique, and I weak-signal-boost-vote the sentence that contains the main minor insight to anyone who viewed the relevant paragraph in the OP. This only takes a few keystrokes and a few quick glances, because this was one of the top actions suggested by the Lilim that guessed what reactions I might want to give (and the UI team put a lot of labor and love into good scaffolding, clear presentation of actions, and easily user-configurable keyboard shortcuts). I then also accept most of the Lilim-suggested package of also weakly upvoting Diya's karma, credibility, and how much I defer to her, both in general and domain-specifically (except I remove the general-defer-vote, as I don't know how much to generalize before seeing more from her); and in the same keystroke I accept the top Lilim-suggested explanation for those upvotes. I make an additional warrants-further-investigation react, which the Lilim did not anticipate, and leave a quick comment explaining how Diya's point generalizes and suggesting some places it might have further implications. Users who were eigendeferring to me in this domain now slightly eigendefer to Diya.

Over the next few minutes, Diya and a few others metareact to my reactions. Most importantly, George Lark complete-disagree-votes and karma-downvotes my upvoting Diya's general and domain-specific karma. His stated reason (most interactions are required to have a stated reason, except e.g. complete-agree-votes which have a default meaning or some karma-upvotes where it's unambiguous why, e.g. your only manual input was interesting-voting and you accepted the Lilim-suggested consequents without review) is that his previous comment already made a point that has Diya's point as a natural implication. I complete-agree-vote his clause that claims his previous comment is prior art, and I partial-agree-vote and partial-disagree-vote (this is a single action, for roughly equal degrees of agree and disagree) his complete-disagree-vote on my karma-upvote of Diya; my explanation is that Diya actually definitely should get some credit, because the Lilim had not noticed the implication of George's comment for the context where Diya commented (and I had not seen George's comment), so Diya provided real additional work—but also George is correct that some of my upvote was for the insight itself, and so some of that karma should go to him for first publishing. (In the background, a Lilinn checks Diya's and George's read-logs, finding that Diya did not see George's comment (unless she is Circumventing) and did not have that much overlap with George's sources, so Diya gets substantial independent-discovery credit. Both of their read-logs are now indexed to this insight, for the Lilim to offer up in case a user asks for background text for that insight.) 
In two keystrokes I accept the Lilim suggestion to objectlevel-normbreak-vote his karma-downvote of my karma-upvote of Diya, and its proposed explanation that karma-downvotes are supposed to be about the user's judgement (me), not the action—George would have known that I didn't know that there was his prior work, at the time that I made the karma-upvote. I also metalevel-normbreak-vote his comment, with the explanation that I checked the logs of what the Lilim showed him when he was writing his comment—and I can see that he must have either completely ignored the Lilim's explanation of the relevant objectlevel-norms, or else saw it but disregarded it without acknowledgement, and in this case the Lilim correctly identified his norm mistake. It is a norm about norms that you should check the top Lilim warning about norms that your interaction might be breaking. George rejects the top Lilim suggestion of complete-disagree-voting my reactions, and instead accepts the Lilim-suggested package of complete-agree-voting my reactions, which a Lilinn takes as a cue to automatically (slightly) decrease his meta-karma-react credibility and increase his dispute-resolution-credibility.

I add quick speculations on how ego syntonicity might interact with voting patterns, which the Lilim correctly label as butterfly ideas. Since they are labeled that way, users know not to crush them. A Lilinn directs butterfly idea posts to users to read only if they are in the mood for those posts. If the Lilinn isn't sure whether the user is in the mood, it can check (by making a top-suggestion reaction-package say "I didn't evaluate this but I don't want to see it right now"), and usually home in on most of the main functional axes of the user's mood fairly quickly.

With that sordid upkeep business finished, I move to my second customary step. I open the auto-moderation logs, and take a quick look.

A few users have been given several-week bans for sustained poor behavior, earning them too much negative metalevel-norm-karma. (Users don't get banned for poor epistemics, they just get their posts deboosted—shown mainly to users who are interested in helping others improve their epistemics, with prominent warnings about overall and domain-specific epistemic karma, and links to specific mistakes.) This is sad but necessary, and while I do want to see the notification, I don't have anything to add. This is uncommon, but not a surprise—a small faction of users had been developing a coalition of alternative norm enforcement, and instead of discussing the merits and demerits of that alternative equilibrium, they were trying to push it into the normspace by a sum-threshold attack where the targeted pattern was too subtle for the Lilim to recognize de novo and was too inexplicit for individual targeted users to be easily group-conscious of, but real enough to impact the targeted users. It partly worked, in that the Lilim had mostly not noticed, except that they flagged some of it as a possible brigading pattern (statistically ambiguous with natural correlations of voting judgements and content interests) among many other possible brigading patterns. But the users noticed eventually, tutored the Lilim to recognize the attack pattern and promote it to user attention as a hypothesis with other context from the attacking user, and accordingly metalevel-norm-karma downvoted (which becomes very easy when you can just accept the explanation given by another trusted user) the attackers (who did not respond to discussion offers).

A couple new visitors have flamed out. This, unfortunately, is not uncommon. Most users, numerically, who land on a page on the forum trying to read are either confused, or fascinated and happily read more. Most users, numerically, who try to engage in the forum run away immediately (screaming, presumably) with no explanation. We have been A/B testing gentle ways to introduce the forum, including having new-user regions, Lilim tutors and assistants, a curriculum, and so on, but... the forum is just not for most people, and that is ok. One of the flame-outs was flagged by the post-mortem summarization Lilinn as being potentially interesting and good-faith. I glance at the logs. They were not accustomed to having everything, including the Lilim help chats, be public; and the sheer complexity of the UI, even in its very simplified intro form; and the apparently extreme levels of hyperjudgemental hypermonitoring and credit scores; and the seemingly meaningless and definitely incomprehensible navel-gazing meta-meta-norm fights. But yes, they did seem to have a spark of important thinking about things that matter. Feeling a bit unusually extraverted, I shoot them an email offering to chat live and in private (and I log that I've done so, to prevent users spamming them).

My third customary step is to do my due diligence, and visit a couple other of the major sectors of norm-system-space, mostly at random. A Lilinn suggests a few points that might be interesting to me. They are interesting, and I write some follow-on thoughts that are visible in my home sector, but I quickly have to block notifications, and deboost my post in the other sector, and medium-strength suggest a blanket temporary quarantine shadowblock of any influx from the other sector into my home sector coming through my reaction post.

Those people's behavior is just too costly to deal with and not worth-it enough. They reliably crush butterfly ideas; they punish people for making statements that are correlated with political positions they don't like; they won't feel an obligation to engage in meta-discourse or even state justifications for reactions(!); they allow and enact meta-react blocks; and they don't think they shouldn't do those things, so they don't enforce norms against those things; and they punish people who try to enforce those norms, or to discuss whether those norms should be enforced. And, may God eventually forgive them, they will go around karma-downvoting unrelated posts if they dislike something you said! Trying to bridge the metalevel-norm gap is a noble and worthwhile task, but it's... a task, a multi-day (nay, multi-year) task that can't be automated, is twisty and fragile and recursive, and just isn't likely to be high-priority for this specific instance; the Lilim know to un-quarantine the strangers for any home users who are explicitly trying to study this phenomenon or bridge the specific gap. Almost all home users defer to my quarantine suggestion (almost all users defer to almost all users' quarantine suggestions). I retreat to my sector where it's possible to think.

Fourth: Let's browse.

I click a couple posts to read. One of them is long, and is a mix of stuff I do and don't already know, so I talk to a Lilinn about it, asking basic questions about what the author is saying. Like most serious users, I have my reader Lilinn scaffolded to mainly give very short conservative glosses, and to point at the relevant few paragraphs; I only rely on glosses when the Lilim are confident that the gloss is basically lossless, e.g. because the paragraph being glossed is not a new idea or a new expression, but is just a report of the author's particular position within an already well-understood space of well-understood positions. (For basic reading, a single Lilinn is most of what's required, because this is such a common task and pretty easy; but substantial glosses that the reader Lilinn decides to show to me are passed through a fuller host of Lilim to verify and confidence-tag and maybe context-fetch.)

The author of this post (who, in a rare exception, is anonymous, having been verified privately with the moderation team, usually due to living under a repressive regime) has pretty high general-credibility-karma and domain-credibility-karma according to the consensus-centrality-weighted-home-sector-population judgement; and furthermore has high general- and domain-credibilities according to the eigenjudgements of several of my various moods and modes. (My current mood could be called open-speculative-sociological-foraging-selfdirected, for which the author has high karma; the author also has high credibility for my focused-pruning-textual mood; the author has fairly low karma for my debate-clarification-avantgarde-canonical-leader mood, but I don't feel like that right now.) For this reason, I'm indirectly eigendeferring to this author on this topic. That's part of why this post is getting so much attention; this author is eigendeferred-to on this topic by a substantive chunk of our home sector (and even many users from other sectors, though that carries much less weight). Added to that, is the fact that, as far as the LILIL ensemble can tell, if one uses quasi-logical deduction to construct quasi-coherent theories out of the myriad eigenopinions floating around, one gets a few dozen overarching (overlapping, still partial) Perspectives; and the topic of this post is a meso-sized subpillar of a somewhat widely subscribed (literally, as in having been directly or indirectly partially signed off on, and also in terms of readership) Perspective; and this author is one of the main eigendeferrees for this Perspective on this topic; and therefore this is an important discussion worthy of eyeballs.

As I engage with the post, I do nod along with a lot of it, mainly deferring; but I notice a couple threads of subtle but important disagreement, or at least confusion—I detect something is off / unresolved / incomplete / anomalous. I note this in the form of disagree-votes and some reacts such as "I feel vaguely bad or uneasy about this statement", but I do so within a deferential modal-setting. That means my reactions are largely not counted for object-level aggregation, e.g. a given paragraph doesn't accumulate much direct disagreement from my reaction; and the Lilim won't try very hard to interpret my reactions as coming from an alternative coherent viewpoint (unless it just indexes to an existing known viewpoint); but my reactions do affect eigendeference strengths. In particular, my reactions, combined with similar reactions from several other users, and the collection of users who defer to us in this context—as well as, it turns out, some symmetric reactions to the opposing Perspective's eigenopinion on this topic—causes a substantial increase in the fluidity of these Perspectives around this topic and a few topics that are dependent on this topic. This increases the salience of investigation (fact-finding, theorizing) around this topic, and weakens several eigenopinion-networks. Truly exciting times.

Fifth and finally, it's time to create for the day. I settle down to write.

In the course of my investigations and record-making, I have a long phrase that I have to deploy three times. I get piqued, and ask the Lilim to find words / grammar for this phrase. It fails after twenty seconds. I ask it to try harder to find examples of the same function in other texts. It fails after fifteen minutes, just turning up two kinda-examples that don't really help and several non-examples. I give more of a prompt, and ask the LILIL ensemble. It asks more clarifying questions to pin down what I'm looking for in contrast to the Lilim's attempted examples, and it searches very widely, promoting many texts to its attention for any one of many reasons (keywords, searches on cached hidden activations, reevaluating anew, rewriting in several variations and debating whether the rewrites are accurate, random sampling, following citation graphs locally or in the direction of users whose thinking has various substantial dot-products with mine, asking itself what other strategies to use) and then debating amongst itself whether the text might be a subtle but true example. It returns a few dozen attempts, most of which are incorrect, but ten of which are close. Four among those are true examples. The examples stimulate me to crystallize, generalize, and factor into two axes and a few different concepts. I put out a request-for-word with the full context, for the LILIL ensemble and the community to work on in the background.

As I finish up the bulk of my investigation for the day, I contact one of the Living Libraries who specializes in the general area, to test my current theorizing. These users have extensive knowledge of the Corpus of the forum. They mainly don't do their own iterative-theorizing-investigation, though they do write a fair amount—helping with lexicogenesis, scaffolding the LILIL ensemble for difficult tasks in finding texts (which will then be used to scaffold and fine-tune Lilim), organizing Perspectives, editing Wiki entries. Instead, they read immense quantities from the Corpus, including old and obscure text, categorizing and writing short indexing-summaries, and reading each other's index-summaries. Sometimes they compile sets of texts to make a request-for-refactoring or a request-for-distillation to the community. I share some of my thoughts and quickly note the related work that Lilim brought up, and how my thinking seems to be going off in a different direction. The Library points me at a few key sources, one of which seems quite parallel—or fruitfully parallax—to my lines of thinking. I reach out to the author for a synergistic collaboration.


letters

tsvi bt Nov 20, 2025 Updated Nov 23, 2025


Loved ones who are gone leave many letters—all the letters of their name, their sayings, the whole book of their life. The letters start to go too. They leave spaces. You start to call her on the phone and remember that she can't pick up, and you can't hear her customary greeting; you think of what you wanted to work with him to build, but that plan is a car with half its parts missing, and tubes and wires sticking out everywhere; the principles you learned from their judgements are impressions left in outline—hopefully enough, but not the same thing.

Letters are gone. They leave outlines. At least at first, the outlines say what the letter was, even though the outlines don't have the weight and volume of the letters themselves.

Some of the anchors—dates, times, facts—might start to slip or get lost. We can try to put them back, but it's a struggle. Some things aren't so easy to fix to the crypt front, like spirits and principles and dreams and feelings—ways that we were, in relation to the loved one. Those things are more of a struggle to keep, and it may be a losing battle.

It's hard to keep the memories in place. When we repair a memory, we might block out the original. When it falls and we pick it up and affix it back in place—when we retrieve and rewrite the memory—we replace it with a more opaque version, a summary that is workable but that might obscure some details. The details of the living person are where the fullness of their eternal soul would have shone through in many fragments by way of many details.

We have parts of ourselves that we know must have come from somewhere. But they're just... there, as if they'd always been there since time began. They must have come from someone, but we don't know who.

If you're not looking carefully, you sometimes don't notice what's gone. If you look closer, you see that things are missing, decaying, falling away. There might not be much you can do about that.

There are many lost letters left by loved ones who are gone. We know to gather them, but we may not know what to do with them. Are they to become ancestor voices, ideologies, archetypes? Are they for making our souls? How can these memories be for a blessing?

Even as we try to repair what has been falling away, still, there will be pieces that go really missing, that we can't find again and can't put back.

Sometimes we're not quite sure where a memory came from. We have a good guess, but we'll leave it as-is for now.

Things get lost, and even if we find them, we might misreplace them—put them back not quite rightly.

At first it feels as though, since she is a mountain, she cannot be nowhere, because a mountain cannot disappear. But the mountain is gone and her rocks fall away, and there's no best time and no good way to put them back, because their proper place is gone.

We hide away some fragments, maybe intending to get back to them, but then forget them hidden in the corner where we left them.

As pieces are submerged, we have to work harder and harder to think of what the whole had been, as a whole, in its fullness. It is long since submerged, but even our image of it is becoming submerged.

Even the words they said out loud to us might degrade, no matter how explicit they were.

An element of someone who's gone might detach, and migrate around. Our first guess at where it came from might not make sense on further reflection.

If we think about it, we might be able to figure out where a part of ourselves came from.

But sometimes, it's lost. We don't know where it came from.

Some elements might end up in strange places. How did they get there? A piece of advice, from some lost ancestor, there at the right time. You can feel their love, but it's hard to love them back because you don't know where this kind and perfect advice came from.

If there are too many older memories of loved ones that have fallen out of place, they can get all jumbled together. We might start forgetting what order they went in. Individual episodes, feelings, words, and elements of their spirit are still there—but the coherent ordered whole from which these memories sprang, with so much more behind it and so much more to give and grow, is submerged inaccessibly, forever.

After some time, things are so faded that you're not sure anymore if you can tell what they originally looked like.

Some spirit elements are forlorn. Pulled out of their context, and forgotten, not set aside for repair but left to their own devices. They gather with each other in the empty unremarked places.

Sometimes, so much is gone that all we have left is a stem. A symbol, a box of items, an essay, a quilt. But the world is gone. Maybe if you gaze intently enough you can still see the afterimage—But yes the world is gone.

Those worlds that are gone leave so much for us to give to ourselves and each other, and from which to make ourselves and each other. In that way we can put the letters back in living words, which give them their proper significance, enthronement, and resplendent repose. The spirit-elements are made volatile, even more themselves and even more ready to express themselves together with other spirit-elements within living souls. Their true place is gone forever and their originary open thriving must be deferred forever. But in this way they are for a blessing.

[I'll honor requests to anonymize or remove specific images. You can let me know at my gmail address, username: tsvibtcontact . Those who have died should be honored, for the sake of their own spirit, and for the sake of what they mean to the living. My hope with this post is not to gawk at or make light of their resting places, or to make some recommendation or judgement, but rather to simply meditate on these things. As there are many images of mausoleum crypts already published on the internet showing names, including a few from this cemetery, my guess is that this is not inherently disrespectful.]

The Bughouse Effect

tsvi bt Nov 19, 2025 Updated Nov 23, 2025

What happens when you work closely with someone on a really difficult project—and then they seem to just fuck it up?

This is a post about two Chess variants; one very special emotion; and how life is kinda like Chess Bughouse. Let's goooooo!

1. Crazyhouse

My favorite time-waster is Crazyhouse Chess. Crazyhouse Chess is mostly like regular Chess. In regular Chess, players take turns making a move, Bishops go diagonally and Rooks go straight, and you try to trap your opponent's King to win the game:

(From Lev Milman vs. Joseph Fang courtesy of https://www.chess.com/article/view/10-most-beautiful-checkmates.)

In Chess, if you take a piece, it just leaves the board. In Crazyhouse, the difference is that when you take an opponent's piece, you get to use it. Say you take a Black Bishop; then you get a White Bishop in your hand. When it's your turn, you can either do a regular boring Chess move (with one of your pieces already on the board)—or you can drop a piece from your hand onto the board. To illustrate, watch how when I take the opponent's Bishop, a Bishop appears in my hand at the lower right hand corner; and then next turn, I place it on the board:

(My moves here may not be the most accurate way to play, but they are the funniest.)

You can drop pieces absolutely anywhere, including to give check. (You just can't put Pawns on the very top or very bottom rows.) So, the game can end by surprise:

(You can't hear because it's a .gif, but I'm saying "Oh... I didn't realize that was mate.".)

Those last two gifs were from the same game. The opposing King moved all the way across the board, at the behest of my pieces dropping from the sky. I kept taking pieces from my opponent, so I kept having pieces to drop on the board to continue my attack. (Full game here.)
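The capture-and-drop rule described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the class name `CrazyhouseHand` and its methods are hypothetical, pieces are plain letters, and it ignores ordinary legality checks (such as not leaving your own King in check).

```python
from collections import Counter


class CrazyhouseHand:
    """Hypothetical sketch: pieces a player has captured and may drop later."""

    def __init__(self):
        self.pieces = Counter()

    def capture(self, piece):
        # A captured piece switches sides and lands in the capturer's hand.
        self.pieces[piece] += 1

    def can_drop(self, piece, rank, square_is_empty):
        # A drop needs that piece in hand and an empty target square.
        if self.pieces[piece] == 0 or not square_is_empty:
            return False
        # Pawns cannot be dropped on the very top or very bottom rows.
        if piece == "P" and rank in (1, 8):
            return False
        return True

    def drop(self, piece, rank, square_is_empty):
        # Dropping uses up one copy of the piece from the hand.
        if not self.can_drop(piece, rank, square_is_empty):
            raise ValueError("illegal drop")
        self.pieces[piece] -= 1


hand = CrazyhouseHand()
hand.capture("B")          # take the opponent's Bishop
hand.drop("B", 4, True)    # next turn, place it on an empty square
```

The hand being a running count is what makes the chain-reaction attacks possible: every capture during an attack refills the supply of pieces to drop.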

In Crazyhouse, this sort of chain reaction is common, where you attack using pieces you took during the attack. It's also common that an apparently safe King gets suddenly pried loose from his protective fortress and subjected to mortal threats. This makes games swingy. Very swingy. For Crazyhouse games, the computer evaluation bar, which says who is winning at each point in the game[1], not uncommonly looks like this:

(Ah yes, Chess, the classic game of chance.)

Piece drops can happen anywhere. This makes for complicated tactics and very strange, never-before-seen positions. They are always hard to calculate, and sometimes beautiful:

(I think I've heard of that one, that's called the Four Knights Attack, right?)

The combination of sharp tactics, the tempo turning on a dime, pieces coming at you from anywhere, and strange un-Chess-like positions, provides a very crazy-making fun-making experience. I sometimes compare it to regular Chess. It is said that Chess is an argument, where you have to build up your own case, and ask your opponent a series of increasingly uncomfortable questions until they crumble under the pressure. So if slow Chess is a civilized, erudite argument, and blitz Chess is a shouting match, then Crazyhouse is a "duel": You and your opponent stand 6 feet apart, facing each other with your mouths open, and you try to lob lit firecrackers down each other's throats[2]. Crazy.

But here's another question: Does Crazyhouse produce rage?

2. Crazyhouse rage?

Not much, in my experience. Not more than any other fast-paced competitive game. You can definitely get very mad, like if the opponent plays bad or has a lower rating but still wins, or if you lose for a "fake" reason like a mouse-slip or your time running out.

But it's not deeply enraging, as far as I've seen. You occasionally get some salt in the chat, but it's pretty tame—at worst, "fuck you" or "lucky" or similar.

3. Bughouse

Bughouse is four-player Crazyhouse, a.k.a. doubles Chess. There are two teams of two. Each team has one player with White pieces, and one with Black pieces. Here you see TeamTop on the top, with TeamTop-White on the left, and TeamTop-Black on the right; and opposing them, there's TeamBottom-Black on the left, and TeamBottom-White on the right.

Say TeamTop-Black (top right) takes that White Knight on g6 from his opponent, TeamBottom-White. So then TeamTop-Black gives that White Knight to his teammate, TeamTop-White (top left). Which makes sense, because it's a White Knight and she's playing with the White pieces. On her turn, she can place that Knight on her board, the left board, just like in Crazyhouse. (Since the piece doesn't have to switch colors, you can easily play Bughouse in person.)

The two games on the two boards just go simultaneously and independently, except that pieces are constantly shuttling back and forth. Also, if one player loses, whether by checkmate or by running out of time, their team loses.
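The piece-passing rule can be sketched like this. The seat names come from the example above; the functions `new_hands`, `pass_capture`, and `losing_team` and the data layout are hypothetical, and the sketch models only the hand-off and the shared loss, not the Chess itself.

```python
from collections import Counter

# Seat names follow the example above; the mapping itself is an illustration.
PARTNER = {
    "TeamTop-White": "TeamTop-Black",
    "TeamTop-Black": "TeamTop-White",
    "TeamBottom-White": "TeamBottom-Black",
    "TeamBottom-Black": "TeamBottom-White",
}


def new_hands():
    # Every player starts with an empty hand.
    return {player: Counter() for player in PARTNER}


def pass_capture(hands, capturer, piece):
    """A captured piece keeps its color and goes to the capturer's teammate."""
    partner = PARTNER[capturer]
    hands[partner][piece] += 1
    return partner


def losing_team(player_who_lost):
    """If either player is checkmated or runs out of time, the team loses."""
    return player_who_lost.split("-")[0]


hands = new_hands()
# TeamTop-Black takes a White Knight; it lands in TeamTop-White's hand.
pass_capture(hands, "TeamTop-Black", "white knight")
```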

Before, in Crazyhouse, the branching factor was high: the opponent could place any of their pieces anywhere on the board. But the game was still in a sense self-contained: perfect information just looking at your board, deterministic except for one opponent, fixed turn order. Now, in Bughouse, pieces can come out of nowhere at any time from the other board. It's like if you're boxing, but many times during the bout, a disembodied fist comes out of nowhere and punches you. You better have constant vigilance.

If blitz Chess is a shouting match, and Crazyhouse is a firecracker lobbing duel, then Bughouse is hackysack with hand grenades.

This takes the Craziness of Crazyhouse and ramps it up to 11:

Bughouse also makes you very interdependent with your teammate. For one thing, if they lose, you lose. But it's much more than that. Every little decision they make can derail your whole position on your board, and vice versa; even them taking 3 seconds longer on a move can put you in a much tougher spot.

This interdependence opens up the opportunity to experience a special new emotion.

4. Treachery!

Let's go through one full example.

So, you're playing Bughouse on the internet. You're very rusty because you haven't played much in years, and you're doing research for a blog post. Your play is far from perfect, but you put strong pressure on your opponent, and his King is drawn way out. Your King is also exposed, so you MUST keep attacking and checking his King, otherwise he'll take the initiative and attack back. You ask your teammate to trade pieces on their board, so that you have more pieces to drop on your board and continue the attack. Your attack is running low on steam: you've got the White King surrounded, but not quite checkmated. You're out of good checks on the board, and you have no Black pieces in hand to drop and deliver mate. (See the bigger board on the left:)

You play on. You have been begging your teammate to TRADE. Your teammate has not done that thing that you asked for them to do. Now it is a critical moment:

The White King on f4 is far afield, completely naked. But you're in check from the White Bishop on h4, and you probably can't afford to just move your King aside. You MUST block, ideally with check. Conveniently, your teammate has the opponent's Black Rook just sitting there on g8, ready to be gobbled up by the Knight on e7. If they take the Rook, you can immediately drop it on f6, blocking check and also CHECKING THE WHITE KING, keeping the initiative! You beg them to take the Rook.

To translate that chat history:

Trade pieces [because I have an attack and need pieces to continue attacking]
Trade pieces
Trade pieces
Trade pieces
Move now [because we're in a tight time crunch]
Move now
Move now
Move now
take [the Rook that's been sitting there for 10 seconds]
go [make moves, we're in a time crunch]
Trade pieces
Move now

But your teammate has other ideas. Yes, now is the time to spend 14 seconds before taking the Rook. (Which is completely disastrous, because now your team is down on time, so your teammate's opponent can stall and prevent you from getting more pieces to attack with.) So your attack peters out and you lose on time. You asked them for what you needed, they could have given it to you, but they did it too slowly and all your effort mounting an attack is for naught.

[[If you want you can view the whole game here: https://www.chess.com/game/live/157232852789. Press the "flip board" button, very bottom-right, to see it from my perspective. Click the Partner tab on the right to see both boards. Arrow keys to step through moves.]]

Why did they do that? What was your teammate thinking? Maybe they're thinking "My King position is weak, I have to check for possible fatal attacks before playing a non-defensive move.". Maybe they're thinking about the position and not reading the chat. Maybe they're thinking Arby's. Maybe they forgot they were playing Bughouse. Science may never know. But one thing's for sure: They are an absolute knob.

When I needed them most, they failed me. And now we both have a big fat L forever. Are they happy?

5. Bughouse Rage

Since Bughouse positions are so explosive and sensitive to small decisions, there's lots of ways your teammate can fail you. They didn't trade enough. They traded too much and gave your opponent pieces to attack you. They played too slow. They gave away a Knight even though you said "No Knights!" and the Knight checkmated you. They kept playing and GOT THEMSELVES CHECKMATED even though YOUR OPPONENT WAS 100% ABOUT TO LOSE if only your teammate would just STOP like you TOLD THEM TO DO FIVE TIMES IN THE CHAT until you hit the limit on how many times the chat lets you say stop.

This kind of fuck-up engenders deep rage.

For me this is a special kind of rage. It's not simple, like a shot of vodka.

It's complex, like a fine wine, with a bright attack: the delusion of cooperation getting shattered. The mid-palate is betrayal-anger, with an aroma of contempt, and notes of pain and confusion: How can it possibly be that you want to win—and then you go and play like that?? The finish is spite, and a trace of despair: If this is what other people are like, why try to work with them on anything even slightly difficult?

Well, it's like a wine, except that you're chugging it. It's also explosive and crunchy and feels like something is tearing up your gut trying to get out. I guess it's like if you swallowed a pint of pop-rocks and let nature do its thing.

(Yes, Watermelo Punch, that's what I want to do to my teammate.)

I have tasted Bughouse Rage. I don't like it, so I stopped. But I've tasted it.

I have seen others engage in the rage. When I mess up in online Bughouse, my teammate might Rage at me—using basically the nastiest possible language that gets through chess.com's obscenity filter. When I win, sometimes I stick around after the game to watch the fireworks in the chat from the other team.

6. Bughouse and life

In a lot of ways, online Bughouse with strangers is a perfect storm to create this emotion:

The communication is low-throughput.
Your team has strongly aligned goals, but no personal relationship and no way to do sane post-mortems and punishments.
You tense yourself for sustained, effortful thinking—and then BAM your teammate ruins it all.
You're very interdependent, but lack shared context—one board is more than enough to keep track of, let alone two.
There's no incentive for you to go back and look at the game through your teammate's eyes.

Still, I think the Bughouse Effect shows up a lot in real life, even if it's in a less pure form. It often happens that there's a team of people, and one of them gets very angry about a mistake made by their teammate, and their anger seems out of proportion with the mistake. Whenever that happens, I think of the Bughouse Effect.

So, in a slight deviation from the long tradition of comparing Chess to life, we will now compare Bughouse to life. Here are a couple case studies:

6.1. Christian Bale bugging out

Christian Bale was acting in the filming of Terminator Salvation in 2008. Audio (https://www.youtube.com/watch?v=0auwpvAU2YA) was leaked in 2009 of an altercation between him and the director of photography, who was apparently moving around on or near the set during a scene and distracting Bale. You can hear that Bale is, basically, really really pissed off.

It's hard to tell without the full context, but it certainly seems like he's being an asshole. However, you can also hear that he's not just being an asshole. Bale's anger has a perfectly understandable basis, relating to his teammate interfering with his efforts. He hammers home several times that he's pissed because the DP seems to not understand the effect his movements have on Bale trying to act. This echoes something you might see (more... curtly) in the aftermath of a rough Bughouse game: Why didn't you read the fucking chat? Do you have any concept of how that fucks with my ability to stay safe and finish attacks? I hope you had fun saccing the pieces that got me mated. Did I do that to you? You're an amateur.

Similar things happen with leaders in general. There's lots of stories of heads of projects being harsh, impatient, and apparently callous. In some cases they could just be an asshole. But I would guess that in many cases, it's not that they are power-tripping, but rather that they are under a lot of pressure. They're trying to do something hard, and trying to delegate. So then, it's extra super frustrating if the delegee does something that makes it seem like they are totally clueless, or maybe aren't even trying to do the right thing at all.

(This is not at all to excuse this behavior. Especially as an employer, or as a huge actor who presumably has a lot of power. That power presumably is a big part of why Bale allowed himself to act like that in the first place.)

6.2. My stag is best stag

The Stag Hunt is an abstract game, like the Prisoner's Dilemma, that serves as a simplified model for many real-life situations. In the Stag Hunt, each hunter can choose to hunt Stag or Hare. If they both hunt Stag, they're successful and they both get a lot of food. If someone hunts Hare, he'll get a Hare, which is a bit of food. But, if one of them hunts Stag while the other hunts Hare, the Stag hunter gets nothing:

This means that if each hunter knows the other will hunt Stag, then they both individually want to choose Stag (because it will work), and then they'll actually get the Stag. But if either is uncertain of what the other will do, then hunting Stag won't work, so they'll hunt Hare instead.
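The reasoning above can be checked with a small sketch. The payoff numbers are assumptions chosen only to match the description (a shared Stag is worth a lot, a Hare is worth a bit, a lone Stag hunter gets nothing); the function names are hypothetical.

```python
from itertools import product

ACTIONS = ("Stag", "Hare")

# Assumed payoffs: (first hunter's food, second hunter's food).
PAYOFFS = {
    ("Stag", "Stag"): (4, 4),
    ("Stag", "Hare"): (0, 1),
    ("Hare", "Stag"): (1, 0),
    ("Hare", "Hare"): (1, 1),
}


def is_equilibrium(a, b):
    """True if neither hunter gains by switching choice on their own."""
    pay_a, pay_b = PAYOFFS[(a, b)]
    a_is_best = all(PAYOFFS[(alt, b)][0] <= pay_a for alt in ACTIONS)
    b_is_best = all(PAYOFFS[(a, alt)][1] <= pay_b for alt in ACTIONS)
    return a_is_best and b_is_best


def pure_equilibria():
    # Check all four combinations of choices.
    return [(a, b) for a, b in product(ACTIONS, repeat=2) if is_equilibrium(a, b)]


def stag_is_worth_it(p_other_hunts_stag):
    """Compare expected food when the other hunts Stag with probability p."""
    p = p_other_hunts_stag
    expected_stag = p * PAYOFFS[("Stag", "Stag")][0] + (1 - p) * PAYOFFS[("Stag", "Hare")][0]
    expected_hare = p * PAYOFFS[("Hare", "Stag")][0] + (1 - p) * PAYOFFS[("Hare", "Hare")][0]
    return expected_stag > expected_hare


print(pure_equilibria())
print(stag_is_worth_it(0.5), stag_is_worth_it(0.2))
```

With these assumed numbers, both "everyone hunts Stag" and "everyone hunts Hare" are stable, and hunting Stag only pays off if you think the other hunter will join with probability above one quarter. Different payoffs would move that threshold but not the overall shape.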

How does this apply to real life? Basically any group project is a kind of Stag Hunt. If you can all get on the same page with each other about what the goal is, you have a good shot at making it happen; but if you cannot get on the same page about the goal, then it's better to just go work on your separate personal projects.

Some goals are fairly easy to get on the same page about, like "let's each lift our end of the couch at the same time". But many goals are more difficult to find a teammate for. It might be a rare goal to share, or it might be hard to tell when someone else has that same goal.

For example, there's a certain kind of conversation I like, where we speculate and theorize. New hypotheses can be brought up and seriously considered, even if they seem strange or implausible or unclear; lots of ideas and questions are kicked up and considered intensely, but not hypercritically. This kind of conversation is like an indoor Butterfly Conservatory for protecting a collection of Butterfly Ideas.

Sometimes I find someone who seems like they are probably interested in having a butterfly-conservatory conversation. This is exciting! I've found someone with a shared goal, maybe; now we can hunt Stag together.

So I start in with the butterfly ideas... And then gradually realize that something is off. They might be overly critical, or not really trying to add their own speculation, or just bringing things back to more trivial topics at inappropriate times.

Eventually I figure out that they just don't happen to be interested in having the type of conversation that I wanted to have. We have different goals, ok, no problem. It would be inappropriate to get really angry in this situation.

But it can nevertheless Bug me, with a note of the Bughouse Effect. The transition period can be frustrating and disorienting, when I'm still assuming they're up for a Butterfly Conservatory conversation but I'm seeing how poorly they're doing it. I gathered up my energy to think hard about new ideas; and now the other person is leaving me high and dry.

Over time, I've learned to more carefully avoid overinvesting in imagined shared goals. I've also learned to pay closer attention to whether I'm incorrectly assuming a shared goal, so I can update my beliefs quickly.

If I'm incorrectly imagining that there's someone there, trying to play the same game I'm trying to play, it's kinda like if I think I'm playing Bughouse (with a teammate) but actually I'm playing Crazyhouse on my own. I could get into a position where I can checkmate my opponent, if only I had a Queen to drop on the board, and then cry out to the heavens: "Won't someone please send me a Queen??" But I'm playing Crazyhouse and there's no one there who's trying to send me pieces, and it doesn't make sense to get angry at the sky.

6.3. Are you people even trying to save the world?

If anyone builds AGI, everyone dies. So, like, we should stop that from happening. The plans you want to invest in, to stop that from happening, sometimes depend on when you think AGI is likely to be built.

For some reason, most people working on this seem to have reached a comfortable consensus of "AGI is going to come really really soon, like a few years or a decade". This is very very annoying to me, because I think there's a pretty substantial chance that AGI isn't built for a few decades or more.

Now, some plans are crucial whether you think AGI will come in years or decades; we definitely want to stop AGI capabilities research immediately. But when people have de facto confident short timelines, which I don't think makes sense, they significantly underinvest in important plans, such as human intelligence amplification.

I can reflect on this situation, and I can see that, in part, different people are just looking at different parts of the world. You're looking at your board, and I'm looking at mine:

But that doesn't stop it from being immensely frustrating when your ally is doing it wrong. And there's not necessarily recourse; there's no easy way to have a debate with an amorphous diaphanous distributed tacit quasi-consensus. (Aside: this is not quite the same thing as the narcissism of small differences[3].)

I also get a bit of this feeling if a wealthy entrepreneur gets interested in reprogenetics, and wants to invest and make cool tech—but then is mysteriously uninterested in funding the slightly less sexy, but actually much more important science that is prerequisite to the really interesting versions of the technology.

From one perspective, it doesn't make sense for me to get angry at them. They're still investing in the area, that's still great, and it's still very helpful compared to the default of not helping at all. But from the other perspective, if you're investing in the area, then you're also the one who is supposed to do the actually right version of working in the area. So when you're not, it's frustrating, and it feels like you're close to doing the really good version, so I really want to nudge you in that direction. (This is related to how people with responsibility, who are doing a pretty good job, get a lot more criticism and hostility than people who aren't helping at all; e.g. leaders of many kinds, or creators of open-source utilities.)

I don't actually feel rage in these situations, but I do feel some real anger, and the anger feels similar to bona FIDE Bughouse Rage. It's the feeling of we are on the same team but why are you acting like that are you oblivious or incompetent or what.

7. Conclusion: Symmetrization

I want to point at one last thing.

The Bughouse Effect is a perfect application for symmetrization. That's where you're angry at someone for their behavior, but then you think of times you've done basically that exact same behavior in an analogous position. You can ask: When I was in a time crunch, was I paying close attention to my teammate's board, so that I could avoid losing a piece that would be dangerous in my teammate's opponent's hands? When I was asked to not lose a Knight, did I immediately see that, or did it take me a few seconds to see the message, and by then I'd already traded a Knight?

And then... you can still be mad. But, if you want (hint: you should want), you can at least:

Be mad precisely—mad at the right things, rather than at everything.
Be mad in a way that is fair, in accordance with the Golden Rule—mad in the same way that you think people should be mad at you, when you do that same behavior.

Betrayal is very important to react to; a terminally unreliable teammate is very important to react to; and also, everyone messes up sometimes and other people don't know what you know, so sometimes it was just a bad situation.

There's more to be said about feelings and other reactions around working together on difficult things. I'll leave that to you. Have you experienced the Bughouse Effect? What was it like? What happened next? What maybe ought to happen?

8. Epilogue

While Doing Research (playing board games) for this blog post, I wanted to screenshot the Bughouse chat. But it is so small on chess.com. See?

Oh, you not see it? Because eez invisible? Here, I very nice, I help you:

I had assumed I was just a goof, and a power user would have the settings configured so that the chat is actually readable. But no. Apparently it's impossible to change the size (short of maybe cooking up some javascript manual html manipulation nonsense), and this is just a years-old bug that has not been fixed. That just goes to show... something. Maybe the Bughouse Effect is more The Chess.com Bughouse Effect. Always open your lines of communication. Indeed, playing Bughouse in person with friends, where you can actually talk and also don't want to be mean, is much much friendlier.

[1] The computer evaluation is, as I understand it, taken from a Chess-playing computer program's rating of the current position. The Chess program rates positions in order to judge which position to enter, i.e. which move to make. There are Chess programs that are superhuman at many variants of Chess, including Crazyhouse. The question that the evaluation bar answers is, roughly, "How much better is the current position for White, if two Crazyhouse Chess programs started playing from this position?". Since Crazyhouse is very sharp (high branching factor, many forcing lines, runaway attacks), often the Crazyhouse Chess program can find a forced checkmate in (say) 8 moves that's very difficult for a human to directly find. (Often the Crazyhouse program's evaluations take a while to stabilize, so the displayed evaluation bars might be a bit inaccurate, but still give a generally accurate impression I think.)
[2] What I mean here is that, whereas Go is high-branching but maybe a pretty positional / continuous game (with several somewhat decoupled simultaneous battles; IDK, I don't play Go), and Chess is low-branching and sometimes pretty sharp, Crazyhouse on the other hand is very high-branching and very sharp (e.g. you can easily get a lost position in one or two moves in a surprising non-obvious way).
[3] The Bughouse Effect is one source for the narcissism of small differences (NoSD). But NoSD is more general; I think it describes any situation where two people or groups are very similar, and this somehow generates conflict. You could have NoSD because of a Bughouse Effect, e.g. because you're so close to having the right political strategy, but then this small difference makes it seem like you're totally oblivious and wrong, or possibly a traitor. But you could also have it because of an uncanny valley type dynamic, where you're straight up annoyed about something that looks similar but isn't; you might for example worry that other people will treat you as the same, even though you're not the same. NoSD between similar religious communities can be understood as a fight over the derivative / trajectory of the values of the total community; it makes sense to think about small differences in that context, just like it makes sense for us in our daily lives to think more about our current problems (which we have to fix) than about how things are already great (which we don't have to fix). Yet another source would be competition: someone who's too similar to you will compete against you for things.

tag:blogger.com,1999:blog-8939787122970662740.post-8452986423803352899

Abstract advice to researchers tackling the difficult core problems of AGI alignment

tsvi bt Nov 18, 2025 Updated Nov 18, 2025

This is some quickly-written, better-than-nothing advice for people who want to make progress on the hard problems of technical AGI alignment.

1. Background assumptions
2. Dealing with deference
3. Sacrifices
4. True doubt
5. Iterative babble and prune
6. Learning to think
7. Grappling with the size of minds
8. Zooming
9. Generalize a lot
10. Notes to mentors
11. Object level stuff

1. Background assumptions

The following advice will assume that you're aiming to help solve the core, important technical problem of designing AGI that does stuff humans would want it to do.
- This excludes everything that isn't about minds and designing minds and so on; so, excluding governance, recruiting, anything social, fieldbuilding, fundraising, whatever. (Not saying those are unimportant; just, this guide is not about that.)
- I don't especially think you should try to do that. It's very hard, and it's more important that AGI capabilities research gets stopped. I think it's so hard that human intelligence amplification is a better investment.
- However, many people say that they want to help with technical AI safety. If you're mainly looking to get a job, this is not the guide for you. This guide is only aimed at helping you help solve the important parts of the problem, which is a very very neglected task among people who say they want to help with technical AI safety generally.
The following advice will not presume any specific way that AGI will be like or unlike current AI.
The following advice will presume that technical AGI alignment is a very difficult task, probably more difficult than anything humanity has ever done, and is pre-paradigm, i.e. no one is remotely close to knowing how to go about finding a solution.
The following advice is not consensus, and is phrased strongly and confidently, without caveats and also without evidence or justification. Consider checking the comments for critiques, nuances, and so on. This is just what I would tell someone if I only had this number of words and one day to write it down, phrased with the best approximation of appropriate emphasis that I can make with that amount of effort. (My credentials are only that I tried to solve the hard problem for about one decade, and I tried (with unknown success) to mentor a bunch of people to do that in several contexts; you can see much of my object-level writing here: https://tsvibt.blogspot.com/search/label/AGI alignment)

2. Dealing with deference

It's often necessary to defer to other people, but this creates problems. Deference has many dangers which are very relevant to making progress on the important technical problems in AGI alignment.

You should come in with the assumption that you are already, by default, deferring on many important questions. This is normal, fine, and necessary, but also it will probably prevent you from making much contribution to the important alignment problems. So you'll have to engage in a process of figuring out where you were deferring, and then gradually un-defer by starting to doubt and investigate yourself.

On the other hand, the field has trouble making progress on important questions, because few people study the important questions, and when they share what they've learned, other people do not build on that. So you should study what they've learned, but defer as little as possible. You should be especially careful about deference on background questions that strongly direct what you independently investigate. Often people go for years without questioning important things that would greatly affect what they want to think about; and then they are too stuck in a research life built on those assumptions.

However, don't fall into the "Outside the Box" Box. It's not remotely sufficient, and is often anti-helpful, to just be like "Wait, actually what even is alignment? Alignment to what?". Those are certainly important questions, without known satisfactory answers, and you shouldn't defer about them! However, often what people are doing when they ask those questions, is that they are reaching for the easiest "question the assumptions" question they can find. In particular, they are avoiding hearing the lessons that someone in the field is trying to communicate. You'll have to learn to learn from other people who have made conclusions about important questions, while also continuing to doubt their background conclusions and investigate those questions.

If you're wondering "what alignment research is like", there's no such thing. Most people don't do real alignment research, and the people that do have pretty varied ways of working. You'll be forging your own path.

If you absolutely must defer, even temporarily as you're starting, then try to defer gracefully.

3. Sacrifices

The most important problems in technical AGI alignment tend to be illegible. This means they are less likely to get funding, research positions, mentorship, political influence, collaborators, and so on. You will have a much stronger headwind against gathering steam. On average, you'll probably have less of all that if you're working on the hard parts of the problem that actually matter. These problems are also simply much much harder.

You can balance that out by doing some other work on more legible things; and there will be some benefits (e.g. the people working in this area are more interesting). It's very good to avoid making sacrifices, and often people accidentally make sacrifices in order to grit their teeth and buckle up and do the hard but good thing, but actually they didn't have to make that sacrifice and could have been both happier and more productive.

But, all that said, you'd likely be making some sacrifices if you want to actually help with this problem.

However, I don't think you should be committing yourself to sacrifice, at least not any more than you absolutely have to commit to that. Always leave lines of retreat as much as feasible.

One hope I have is that you will be aware of the potentially high price of investing in this research, and therefore won't feel too bad about deciding against some or all of that investment. It's much better if you can just say to yourself "I don't want to pay that really high price", rather than taking an adjacent-adjacent job and trying to contort yourself into believing that you are addressing the hard parts. That sort of contortion is unhealthy, doesn't do anything good, and also pollutes the epistemic commons.

You may not be cut out for this research. That's ok.

4. True doubt

To make progress here, you'll have to Truly Doubt many things. You'll have to question your concepts and beliefs. You'll have to come up with cool ideas for alignment, and then also truly doubt them to the point where you actually figure out the fundamental reasons they cannot work. If you can't do that, you will not make any significant contribution to the hard parts of the problem.

You'll have to kick up questions that don't even seem like questions because they are just how things work. E.g. you'll have to seriously question what goodness and truth are, how they work, what is a concept, do concepts ground out in observations or math, etc.

You'll have to notice when you're secretly hoping that something is a good idea because it'll get you collaborators, recognition, maybe funding. You'll have to quickly doubt your idea in a way that could actually convince you thoroughly, at the core of the intuition, why it won't work.

This isn't to say "smush your butterfly ideas".

5. Iterative babble and prune

Cultivate the virtues both of babble and of prune. Interleave them, so that you are babbling with concepts that were forged in the crucible of previous rounds of prune. Good babble requires good prune.

A central class of examples of iterative babble/prune is the Builder/Breaker game. You can do this game for parts of a supposed safe AGI (such as "a decision theory that truly stays myopic", or something), or for full proposals for aligned AGI.

I would actually probably recommend that if you're starting out, you mainly do Builder/Breaker on full proposals for making useful safe AGI, rather than on components. That's because if you don't, you won't learn about shell games.

You should do this a lot. You should probably do this like literally 5x or 10x as much as you would have done otherwise. Like, break 5 proposals. Then do other stuff. Then maybe come up with one or two proposals, and then break those, and also break some other ones from the literature. This is among the few most important pieces of advice in this big list.

More generally you should do Babble/Prune on the object and meta levels, on all relevant dimensions.

6. Learning to think

You're not just trying to solve alignment. It's hard enough that you also have to solve how to solve alignment. You have to figure out how to think productively about the hard parts of alignment. You'll have to gain new concepts, directed by the overall criterion of really understanding alignment. This will be a process, not something you do at the beginning.

Get the fundamentals right: generate hypotheses, stare at data, practice the twelve virtues.

Dwell in the fundamental questions of alignment for however long it takes. Plant questions there and tend to them.

7. Grappling with the size of minds

A main reason alignment is exceptionally hard is that minds are big and complex and interdependent and have many subtle aspects that are alien to what you even know how to think about. You will have to grapple with that by talking about minds directly at their level.

If you try to only talk about nice, empirical, mathematical things, then you will be stumbling around hopelessly under the streetlight. This is that illegibility thing I mentioned earlier. It sucks but it's true.

Don't turn away from it even as it withdraws from you.

If you don't grapple with the size of minds, you will just be doing ordinary science, which is great and is also too slow to solve alignment.

8. Zooming

Zoom in on details because that's how to think; but also, interleave zooming out. Ask big picture questions. How to think about all this? What are the elements needed for an alignment solution? How do you get those elements? What are my fundamental confusions? Where might there be major unknown unknowns?

Zoomed out questions are much more difficult. But that doesn't mean you shouldn't investigate them. It means you should consider your answers provisional. It means you should dwell in and return to them, and plant questions about them so that you can gain data.

Although they are more difficult, many key questions are, in one or another sense, zoomed out questions. Key questions should be investigated early and often so that you can overhaul your key assumptions and concepts as soon as possible. The longer a key assumption is wrong, the longer you're missing out on a whole space of investigation.

9. Generalize a lot

When an idea or proposal fails, try to generalize far. Draw really wide-ranging conclusions. In some sense this is very fraught, because you're making a much stronger claim, so it's much much more likely to be incorrect. So, the point isn't to become really overconfident. The point is to try having hypotheses at all, rather than having no hypotheses. Say "no alignment proposal can work unless it does X", and then you can counterargue against that, in an inverse of the Builder/Breaker game (and another example of interleaved Babble/Prune).

You can ask yourself: "How could I have thought that faster?"

You can ask yourself: "What will I probably end up wishing I would have thought faster? What generalization might my future self have gradually come to formulate and then be confident in by accumulating data, which I could think of now and test more quickly?"

Example: Maybe you think for a while about brains and neurons and neural circuits and such, and then you decide that this is too indirect a way to get at what's happening in human minds, and instead you need a different method. Now, you should consider generalizing to "actually, any sort of indirect/translated access to minds carries very heavy costs and doesn't necessarily help that much with understanding what's important about those minds", and then for example apply this to neural net interpretability (even assuming those are mind-like enough).

Example: Maybe you think a bunch about a chess-playing AI. Later you realize that it is just too simple, not mind-like enough, to be very relevant. So you should consider generalizing a lot, to think that anything that fails to be mind-like will not tell you much of what you need to know about minds as such.

10. Notes to mentors

If you're going to be mentoring other people to try to solve the actual core hard parts of the technical AGI alignment problem:

For very motivated / active mentees, experiment with giving firm but maximally abstract / meta advice. The reason for this is to allow them maximal leeway to figure out new ways of thinking, but to still accelerate that process with good tips. Try to just nudge them slightly, to get the full flow of their thinking unblocked and [pointing in the right direction at least at an abstract level so that they will eventually figure out how to move in many more right directions]. As an analogy, a bouldering coach might want to not say "put your foot here" but rather "try having a higher temperature for your attempts, i.e. try out more and more different methods".
Ideally your advice should make it through the chronophone. E.g., generally don't recommend deferring to a specific person about a key question, because the real message there is "defer to someone about this question", which is probably wrong.
Make sure they are doing Babble/Prune on all the relevant dimensions, both object-level and meta-level.

11. Object level stuff

I would suggest reading Yudkowsky's not-super-mathy technical writing on AGI alignment, e.g. his Arbital writing and List of Lethalities. You could try reading Creating Friendly AI.
I would suggest not reading very much more. The alignment field did not solve its problems, and it did not solve its meta problems (stating problems well, stating the important problems, selecting between problems, noticing failure to state important inexplicit problems, correcting these failures at the discourse level). So you cannot go out and read about what the problems are. You just can't do that, sorry. It's not possible. There's no list. Even if you read everything anyone's ever written on AGI alignment, you still won't solve it. You cannot read the understanding that you need. You'll have to figure it out yourself. You can take inspiration from others' writings obviously, but you cannot download the answers or the questions.
If you want something from me, I've collected and compressed down some of the core challenges in alignment in "The fraught voyage of aligned novelty", but it's written in a way that probably won't be very useful to you.

Constructing and coordinating around complex boundaries

tsvi bt Nov 17, 2025 Updated Nov 23, 2025

[Caveat lector: this is a very long, rambling meditation on concepts and coordination. It's not cut down for size or well-organized. That said, I had several insights while writing it.]

1. Case 1: Is an embryo a person?
2. Case 2: Should people be allowed to think freely?
3. Case 3: How nice should you be?
4. Case 4: Reprogenetics
5. Case 5: Lemons

1. Case 1: Is an embryo a person?

People sometimes kill embryos. They do this intentionally, either to abort a pregnancy or to discard an in vitro embryo, or accidentally, e.g. by assaulting a pregnant woman. This raises the question: Is an embryo a person?

1.1. A big blob of subquestions

That question is a proxy question for (or, adjacent question to) several other questions, such as:

Should it be illegal to abort a pregnancy at week N? Is it immoral?
Should it be illegal to discard an embryo in vitro?
Should it be illegal to grow an embryo in vitro up to day N?
What legal protections should be given to embryos at week N? What about moral rights?
Besides the assault itself, what legal injuries are done to a woman if you assault her in a way that ends her week N pregnancy? What about moral injuries / wrongs?
In general, how much should we value and protect life, people, consciousness, humans?
What is consciousness? When is consciousness developed in a growing human? What is personhood? What is a human being?
How bad is it to kill an embryo at day N? Is it equivalent to killing a child? How does the badness compare to, for example, the badness of a woman going through a full pregnancy that she doesn't want?
Is it unethical to kill an embryo at day N? In other words, is that action ruled out (or in) by some categorical rules? Or instead are we wanting to judge that action by weighing good and bad consequences against each other?

1.2. Difficult questions produce uncertain, sticky, varied opinions

This is quite a blob of questions. Each of these questions is complex, and fans out into other questions. It would take a lot of cognitive work to form a good integrated judgement about one of these questions, let alone all of them.

Three upshots of this complexity:

People will tend to be unsure about the answers to some or many of the questions.
People will tend to defer to other people about many of the questions (e.g. "the scientists say that embryos form a brain at week N"). As one consequence, people will by default have a harder time updating their own views or the views of others, because the source of the views is some other third party. Further, the sheer complexity makes the question harder to reason about, so it's harder to learn and update.
Since complex, uncertain questions leave room for doubt, and doubt leaves room for a variety of opinions, different people will tend to have different views from each other about some of the questions.

1.3. A multi-question blob does not have "an answer"

Because this is a big blob of multiple questions, it doesn't necessarily have one answer. For example, I might say it's immoral to abort at 5 months, but it should be legal because of the principle of the mother's bodily autonomy. Do I think the embryo is a person, or not? IDK. Likewise, you might say it's probably not immoral to discard a 7-day embryo, but it should be illegal because there should be a clear bright legal line protecting human life. Do you think the embryo is a person? That question doesn't really have an answer.

In other words, because the single question "Is an embryo a person?" actually fans out into multiple questions, in a given situation you might want to respond "Yes it's a person, and also no it's not a person.". E.g. you might want to say "Yes, your assault that killed her pregnancy took away their future child (a person); no it's not the same as killing a 5-year-old.". You don't just answer with a yes or a no.

1.4. Coordination about X is fragile to not knowing what other people will think of X

Coordination is when several people form shared intentions and then act on those intentions in a synergistic way.

Coordination is difficult and requires some approximation of logical common knowledge. For example, to enforce norms, you want to have clear expectations of what will be punished, and what other people will agree was appropriate punishment. You want to know that they know that you know that Alice broke the norm when she did such and such; otherwise, you might expect them to view your punishment as being out of line. Furthermore, you want to know that they know that you know all that, so that you know that they know you weren't enacting punishment without being sure that you had the presumptive authority granted by known consensus to do so.

If A' is even slightly more complex than A, then it's much more difficult to get common knowledge of A' than to get common knowledge of A. It's harder to distinguish when A' applies; so it's significantly harder to tell when someone else will think A' applies; so it's harder still to tell when someone else will think that you think A' applies; and so on. (It would be interesting to see math models of this kind of phenomenon—maybe it's not true, or only true under some circumstances.)

More generally, if it's harder to know what other people are thinking of A' compared to A, it's much harder to have approximate common knowledge around A'.
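As one very rough way to picture the claim above, here is a toy model in Python. It is only a sketch under strong assumptions that are not in the text: each level of "I think that you think that..." is treated as an independent judgement that comes out matching with some fixed probability `p`, and a slightly more complex concept is modelled as a slightly lower `p`. The function names and the numbers 0.99 and 0.90 are made up for illustration. The model may well be wrong in the ways the parenthetical above suggests.

```python
def agreement_at_depth(p: float, depth: int) -> float:
    """Toy probability that a chain of `depth` nested judgements all match.

    Assumption (hypothetical, not from the essay): every level is an
    independent judgement that matches with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return p ** depth


def compare(p_simple: float, p_complex: float, max_depth: int):
    """Rows of (depth, reliability for A, reliability for A') for depths 1..max_depth."""
    return [
        (d, agreement_at_depth(p_simple, d), agreement_at_depth(p_complex, d))
        for d in range(1, max_depth + 1)
    ]


if __name__ == "__main__":
    # Illustrative numbers only: A is judged alike 99% of the time, A' 90%.
    for depth, simple, complex_ in compare(0.99, 0.90, 5):
        print(f"depth {depth}: A={simple:.3f}  A'={complex_:.3f}")
```

In this toy model a small per-level difference compounds: the gap between A and A' widens as the recursion gets deeper. Whether real coordination behaves like independent per-level judgements is exactly the open question.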

1.5. It's hard to coordinate about big question blobs

To summarize the foregoing: Suppose you have a big blob of questions, such as all the subquestions about embryos. Then:

People will be uncertain about the questions; they have varied opinions; it'll be hard to get everyone to convince each other of the one correct set of opinions; and the opinions themselves are not binary but rather a set of several binary answers.
Because of that, it's very far from the case that people's opinions can be confidently well-summarized as a yes or no to one question.
So, there's no one question that achieves the ideal coordination flag for this blob.

1.6. Correction: It's hard to coordinate distributed judgements about big question blobs

Now that I think about it, there are some ways to coordinate about big question blobs. E.g.:

Delegate to a centralized body. E.g. the country delegates to the government, and then the government makes and enforces a complicated set of rules, which treat each question in the question blob (or at least, the legal ones). Or, a religious community delegates to a religious leader, or a set of scholars, and that body makes a complicated set of rulings.
Coordinate about each question separately. But this doesn't work well.

Delegation does work in many ways. But it doesn't necessarily work well to incorporate new values from the broader community. E.g. to get your view represented in parliament, you have to form a big enough coalition—which may mean coordinating with people who you don't agree with on all the questions. Similarly, to pressure leaders, you need some coalition.

When delegation isn't available, you want distributed norms, in which case again you need something that is highly coordinatable.

1.7. So, to coordinate, people look for simple questions

In order to enforce norms, and form coalitions for political representation and influence, and gather energy to work on projects in general, people want to coordinate. That is, they want to form shared intentions and act on them. To form shared intentions, they look for simple questions, where if you answer one way, you're on the team.

This process always has tensions, because there's always tradeoffs involved in focusing on a simple question. A simple question bundles together lots of decisions, and says, "decide on all of these questions this way, or decide on all of these questions that other way". It's likely that neither of those bundles of decisions is your favorite way to decide.

On the other hand, more people can join in on one of those bundles as being preferable to the default or something. When more people join in, there's a better chance of having enough coordination power (e.g. to get political influence or to enforce norms).

Further, the choice of what stance the coalition should take is a kind of Keynesian beauty contest. Simple choices are especially salient in such contests. So there's additional weight.

Further, to win a Keynesian beauty contest, you want to be legibly appealing to many subsets of values. A simple policy like "no discarding any biological humans" is visibly appealing to people who think that 7-day embryos have souls, or who think that 30-day embryos (but not 7-day) have souls, or 60-day. Because it's appealing to many sectors, and visibly appealing to many sectors (and also visibly visibly appealing, i.e. you can see that other people would see that it's appealing to others), it's a natural choice for a coordination point.

To give a rollout as an example: Someone might say that discarding even a 7-day embryo should be illegal. Their real reason for this stance might be something like:

Suppose we say that you can discard embryos up to 28 days. Well, perhaps if that's all that happens, that's actually ok. However, this social / political regime looks weak. It looks like a compromise position. It looks like there's a natural simple position, "no discarding any human life", but we decided to not take that stance. It looks like we're sharply denying that. In sharply denying that, we're basically admitting defeat. It looks like the natural (because simple) coalition has capitulated.

This is similar to, or a mechanism for, the general slippery slope argument. You don't want to go down the slope of weakening your position, hence weakening your coalition, hence weakening your coalition's bargaining position, hence further weakening your position, and so on.

2. Case 2: Should people be allowed to think freely?

2.1. The First Amendment and delegation

The First Amendment answers "yes" and protects the general right. But it does so in a complicated way, with several exceptions and also several pillars:

See a bit more here: https://berkeleygenomics.org/articles/The_principle_of_genomic_liberty.html#analogy-first-amendment-rights

Like many complex policies, this works by delegation to the government. It might sound strange to say this, given that the outer clause of the First Amendment says "Congress shall make no law respecting...". Are they really enforcing a norm?

I would say yes, they are. They are maintaining a monopoly on violence, as any government has to do; and with that monopoly on violence, they are not restricting free thought (in the manners listed). In other words, they are defending the land from being militarily controlled by a regime that would restrict thought in that way. To say it yet another way, they enforce the norm "Do not use violence to restrict thought (even if you're the government).".

2.2. Erosion of free thought

It is said that free speech (hence free thought) is under attack. I don't know if that's more true now than before—e.g. there are sometimes Red Scares which come along with suppression of some speech, and likewise far left regimes might perform political purges. I'll take it for granted that there is a somewhat exceptional such attack currently, and speculate based on that.

Why would this be happening now? A hypothesis: Previously, the government was the center of gravity not just for institutional political power, but also for cultural political power. The First Amendment was the crown jewel, or keystone, of a culture of free thought.

But then social media happened. The internet became the public square, and social media companies gained power over moderation of the public square. Further, patterns of widely distributed social behavior enabled by social media became powerful social forces suppressing various ways of thinking.

This is a new type of concentration of power, and kinda circumvented / usurped the federal government's monopoly on mere physical violence.

2.3. Power vacuum

Social media opens up this new arena in which to fight over what thoughts are allowed, and there isn't a preexisting monopoly on force. This creates a power struggle for group beliefs.

People seem to abandon the simple rule of "thought must be free". Or rather, they abandon the general cultural values anchored by, or given a rallying flag by, the federal government's First Amendment protections.

To look at it another way, there wasn't a simple notion of free thought available. Instead, there was a complicated thing protected by government; and it just was practically not a problem to have no normed antibodies against cancelling, because there was no massively multiplayer online social media.

Since people naturally might, for example, not hire someone because of some extreme political opinions they hold, there's not an especially strong cultural boundary against that. But that only becomes a major problem with social media, cancel culture, and the mob wielding the hire/fire/debank power.

2.4. Simple concepts help coordination because they are anti-invidious

If you have exceptions in rules, people might think they can lobby to get exceptions for themselves. This is especially so if the exceptions treat some class of people asymmetrically. That inspires envy. (It also inspires indignation, which is appropriate, as it is unjust. Justice would lead to symmetry; envy leads to multiple parties trying to get carve-outs, hence conflict.)

In other words, simple concepts are easier to agree on in negotiations. Negotiation is a kind of coordination—that is, coordination to find a better alternative to conflict. Cf. "Coherent Extrapolated Volition".

3. Case 3: How nice should you be?

3.1. Nice and kind

I'll (uncritically) hypothesize here that if a person is "nice", that means ze will go to significant lengths to make other people feel better, just for the sake of making them feel better, without context dependence.

I assume a lot has been written about niceness vs. kindness. I'll take a rough and ready definition that kindness is trying to deeply help someone even if you make them feel bad locally.

If you're nice, you don't amputate the patient's septic leg, because it would cause pain. If you're kind, you do amputate, because it will prevent death from sepsis.

The difference between nice and kind is pretty blurry, because it's hard to tell what would be really helpful for someone in the long term. Also, if someone's upset, there's definitely something going wrong for them. (It might not be what they think it is or say it is, it might be "their own fault", it might not be your responsibility, there might be nothing you can feasibly do to help, the intuitive way to help might make things worse, and so on; but there's definitely something going wrong for them.) When you're very uncertain, the best available guess at how to be kind might be to basically be nice.

3.2. Burning goodness

You shouldn't always be nice or kind.

For example, sometimes you have to protect yourself first. "If I am not for me, who will be for me?"

You have a moral obligation to not feed yourself to evil. That is the case even if the evil is represented by a nice person who is upset. For example, Scientology with solicitors on the street corner.

Should you give money to homeless people who ask for money? This feels like a tough question. The nice thing to do is to give them a bit of money. Otherwise they will feel sad and rejected. Is it kind? Maybe, maybe not. If they're going to buy drugs? I don't think so. What about snacks? How do you compare against the value you can give to the world by putting that money to better use? Will you do that? Should you have to? Are you setting goodness on fire? Or is it a good way to practice being nice and/or kind? Should other people think you're not nice if you don't? What about if you don't tip?

Suppose Fred presents an idea during a meeting with you, other coworkers, and the Boss. You make a critique of Fred's idea, and the group decides against the idea. Later, Fred confronts you. He's sad, upset, and hurt, and mad, and asks you to not embarrass him like that in front of everyone. But you weren't rude or mean or derisive, you just critiqued the idea. Possibly everyone else would very slightly downweight their expectation of Fred's ideas being good, based on that interaction, but shouldn't they? It's not that big of a deal. Next time, should you avoid critiquing him because it would be not nice?

Peace is good. Should you be a pacifist? No, that's not right. You should figure out how to pursue and prepare for peace, without baring your neck to your enemy.

3.3. The virtue of niceness

On the other hand, "just be nice" is a simple concept and simple policy. This gives it great power. By following that principle, you are often nice, and even kind, when others might have felt no obligation, or might have rejected that option as being a burden. Niceness and kindness can be reciprocated directly; or indirectly through reputation; or indirectly through you being part of a healthier, more generous community; or through others being able to see that you are simply, steadfastly nice, and so they are more free to rely on you and act in ways that are positive-sum given that you will hunt stag.

3.4. A complicated boundary

You can judge on a case by case basis (e.g. I tip, I don't give homeless people money but I look them in the eye and say hi and help people in immediate physical distress, I'm in favor of seeking peace with the Palestinians and working towards a Palestinian state but it has to be disarmed until its self-government can prevent terrorist takeover, etc.). This takes some work, and also it's less legible.

4. Case 4: Reprogenetics

4.1. Slippery slope

If we allow some amount of reprogenetic technology, such as embryo editing or polygenic embryo selection, is this a slippery slope to eugenics or Gattaca?

4.2. Terraced slope

Usually, slippery slopes are actually terraced slopes.

Going down a terraced slope is slightly more dangerous than just staying at the hilltop. It's possible to fall down the slope, which would be bad, so you have to watch your step.

But it's not nearly as bad as a slippery slope. It's feasible to think more carefully about where you want to be, and then go to that level.

Is abortion a slippery slope to satanic child sacrifice? Not if everyone knows when brains develop, and everyone knows that everyone knows, and so on. Then we have a clear desirable answer.

4.3. Constructing simple boundaries

When discussing coordination around "simple" questions above, what does "simple" mean? Really it is simplicity relative to common knowledge referential distance. I.e., things where everyone knows what you mean, and everyone knows that everyone knows what you mean, etc.

So you can change what is "simple". To do that, you have to change what concepts (boundaries, criteria, rules, categories, arguments, lines of reasoning, plans, skills, stances) are in common knowledge.

If everyone knew clearly what embryos and their brains are like, and everyone knew that everyone knew that, you could safely say "it's morally costless for a couple to discard their embryos before there are neurons in the embryo", WITHOUT some people WORRYING that WHAT OTHER PEOPLE MIGHT HEAR is "Hey actually it's fine to KILL A HUMAN LIFE!".

4.4. Flattening levels of recursive knowledge into base-level percepts

Consider the first example in "Common Knowledge and Miasma".

I would say that the way this is implemented in people's heads ends up being a bit complicated—that is, unnatural / inelegant, from an abstract mathematical perspective. Specifically, the binding between a person's models of other people and those other people themselves is not very tight. So it's not a clean recursion, where the relationships between adjacent levels are analogous to each other.

Instead, there's a ton of "hardware acceleration" or "compiling down" or simply "flattening". Higher levels are aggressively approximated and cached, thinned down to just what's needed to track a few important coordination points. Also, there's a lot of gemini modeling, where I don't track you as an entire separate person, but instead I think you're "basically me, including how I know the public announcement X, but maybe you don't know the fact Y I privately read".

For example, you would just say "we've established that X" once X has been publicly announced. You don't keep modeling a tower of people, you model an ambient "established" thing. Or you model "who's good with who", which might be a mush of "Alice likes Bob" and "Alice and Bob have practical-approximate-cached common knowledge that they are buddies and that they are not sexist" and similar. You might have a sense of who's "new" vs. "in", i.e. generally lacking vs. possessing our common knowledge.

By this sort of flattening, the several levels of recursion can be computed relatively fluidly, without huge slowdowns.
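To make the contrast concrete, here is a small Python sketch. Everything in it is hypothetical and only for illustration: the class `FlatCommonKnowledge`, its method names, and the counting function `explicit_tower_size` are not from any source, and real minds are surely messier. The idea is that an explicit tower needs a separate model for every chain "A thinks that B thinks that ...", whereas the flattened version keeps one ambient set of "established" facts plus a set of people who are "in", and answers any nested question from those two sets.

```python
from dataclasses import dataclass, field


def explicit_tower_size(n_people: int, depth: int) -> int:
    """Number of distinct chains (A thinks that B thinks that ...) of length 1..depth.

    This is what a clean, unflattened recursion would have to track.
    """
    return sum(n_people ** d for d in range(1, depth + 1))


@dataclass
class FlatCommonKnowledge:
    """Flattened stand-in for a tower of nested models (illustrative assumption)."""

    established: set[str] = field(default_factory=set)  # "we've established that X"
    insiders: set[str] = field(default_factory=set)     # who is "in" rather than "new"

    def admit(self, person: str) -> None:
        self.insiders.add(person)

    def announce(self, fact: str) -> None:
        self.established.add(fact)

    def chain_holds(self, chain: list[str], fact: str) -> bool:
        """Approximate "chain[0] knows that chain[1] knows that ... fact".

        No per-person nested model is consulted: the answer depends only on
        whether the fact is established and everyone in the chain is "in".
        """
        return fact in self.established and all(p in self.insiders for p in chain)


if __name__ == "__main__":
    ck = FlatCommonKnowledge()
    ck.admit("Alice")
    ck.admit("Bob")
    ck.announce("X")
    print(ck.chain_holds(["Alice", "Bob", "Alice"], "X"))
    print(ck.chain_holds(["Alice", "Carol"], "X"))
    print(explicit_tower_size(3, 3))
```

The flattened check costs one pass over the chain however deep it goes, at the price of being wrong whenever someone's actual knowledge differs from the cached "in" / "established" picture.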

Another example might be authority. If I have authority, that means that there's common knowledge that I can exercise special rights—i.e. I can do certain things, such as give you an order, which means I threaten to have you punished, which means I will tell other people to punish you, where there is also common knowledge that if they don't follow my order then they will be punished, and my threat to you is thus backed up, and furthermore I'm not punished for making this threat even though normally if someone threatened someone like that (trying to give an order without authority) they would be ignored and possibly punished for that. So there's this flat thing called "authority" that I could have. And then we can play further games with common knowledge on top of that, perfectly fluidly. E.g. I could walk into a room with someone who does not know I have authority; and then they could realize I have authority, and I could see that they saw my insignia, but I could also see that they did not see that I saw that they saw my insignia, so we have two levels of recursive knowledge—built on top of this complicated flattened property of me (my authority).

4.5. Blunt force responses

Suppose a woman was raped. Now she does not trust any men. Is that a mistake?

In one sense, yes. Some men are good and safe, and would be beneficial for her to be able to work with. Further, it seems feasible for her to distinguish some of those men, with high confidence.

In another sense, not necessarily. If she does not already know how to make that distinction well (which may be harder than it sounds), then "just work with good safe men" is not an available option. Her available options are "be around as few men as possible" or "take a bunch of dangerous gambles".

In this way, blunt force responses are good. To do even better, you have to construct a more complex boundary which is both feasible to distinguish, and also correct, in that using that boundary is actually better than the blunt force option. Until you construct such a boundary/concept/criterion, the better option is unavailable.

4.6. Compiling complex boundaries to simple boundaries

So to recap:

There's a blunt force response of avoiding going down the slippery slope. This makes sense if the only two options are "stay at the hilltop" or "step off and slide down".

To make a better option, we want to construct terraces on the hillside. A terrace is a simple concept and boundary that we can coordinate on. Coordinating on that boundary is our alternative to the blunt force "stay on top" action.

It's a more precise boundary:

It's not just "no gene modification at all" but instead it's "some gene modification, if done safely, sanely, accessibly, consensually, etc.".
It's not just "no killing anything that is a human organism" but instead it's "no killing anything that we aren't confident is a human soul, and having a human soul requires having computation substrate, which currently implies neurons".

In order to make that option available, we have to be able to coordinate on it. To be able to coordinate on it, it has to be simple. To make it simple, we have to flatten down the concepts involved in implementing that terrace. We have to make them understandable and understood; and we have to do that in common knowledge.

5. Case 5: Lemons

When people are adversarially hacking categories and perceptions, this adds a whole new dimension. Complicating your boundaries is fraught with exploitability.

For example, if you only work with people who are nice simpliciter, that's some kind of protection maybe. If you work with people who are usually nice but then sometimes they do something not nice, and they say "well you know it seemed necessary to be not nice there, because of such and such reason, but mostly I want to be nice", then... that might be fine and good, but it might be a complication in your boundary that an exploiter walked through.


I'd probably need more proof-of-work of understanding to want to continue engaging

tsvi bt Nov 15, 2025 Updated Nov 23, 2025


[This is a message I have, several times, wanted to send to commenters (mainly on public forums). I'm putting it here so I can link to it.]

For whatever reason, in order for me to actively independently desire to continue this thread of discussion, I would probably need something from you. What I would need from you is something like more evidence that you're genuinely trying to understand what I'm saying in context, and some of the output from that process.

This is not a dunk on you. This is me just saying:

For the time being, I don't want to keep talking on this thread, though I'm still at least a little interested in the topic. By default I would just go silent. But, as a slight improvement over that, I'm letting you know that I might want to keep talking on this thread, if I had more sense that (and how) you were trying to understand what I'm saying / what I think / what we agree or disagree about / etc.

I might have stated slightly more specifically why I'm sending this, along with the link; but either way, this is a vague statement, not necessarily specifically saying what's going wrong or what would help.

1. What I'm not saying

I'm not promising, or even strongly suggesting, that I will continue the conversation if you say more. If I'm linking this, there's a good chance I don't have a clear sense of what's going wrong (in terms of, what's making me not want to continue engaging) or what would help. You should be aware of this before doing a bunch of interpretive work. Your work might be good, accurate, effortful, and generous work, and then I might still not want to continue. So I don't want to lead you to waste effort on this without a good enough chance of benefit, but with a mistaken expectation of benefit. Even if you do everything I list below I might still not want to engage. Maybe you didn't do it well enough to address the issue, or maybe my list doesn't include what would help in this specific case, or maybe I'm totally wrong that more understanding from you would make me want to engage.

There's no specific thing I'm saying that would make it so I want to talk more. For example, I'm not saying "cite more of the words I wrote" or "summarize the whole foregoing conversation". Sometimes that might help / be relevant, sometimes not. That might in some sense fall in the general category of [some type of evidence that you're trying to understand what I'm saying], but it may or may not address the communication issue for me. See below for examples of things that might help; notice that they are varied, so that any one thing may or may not help.

I'm not demanding or even requesting that you do anything. I'm just trying to communicate about a possibility other than "I simply stop responding". I'm not wanting to imply that I have some special right or position to demand or request that you do anything. You could very well respond to the exact same thread with a link to this very post or a similar disclaimer.

Likewise, this isn't a dunk. I'm not saying you necessarily did anything wrong. Your comments might be correct, relevant, polite, helpful, interesting, prosocial, etc.

You can still be annoyed at me for ending the conversation abruptly, same as if I just silently left. I hope you'll view me linking this as a slight improvement over that, rather than as an aggression.

You may very well have already put in a lot of work. You may have thought about what I wrote, thought about what you think, and formulated a correct relevant response—and then I linked this. This is perfectly possible. I may have not put in much work; I may have been behaving poorly, e.g. not understanding what you were saying, or not trying very hard to, or not communicating clearly. This is not to excuse my poor behavior, or imply that it's your fault.

Especially if you are a third party reading this disclaimer, remember, this is just me giving a low-context signal. More like me waving a driver to go ahead through the intersection, rather than me, a judge in a civil court, deciding whether someone violated traffic rules. Just the fact that I linked this is only very weak evidence that the person to whom I linked it is doing anything objectionable. It could very well be just that I, at this particular moment, have an especially high bar for wanting to engage, because I'm tired or busy or whatever; or I'm misunderstanding what they are saying, or not myself putting in much work to understand; etc.

2. What might help

(To repeat from above, this is not a list of demands to be met.)

This section may be annoying, in that it lists several separate things. Indeed, there's a good chance you don't want to try to continue the conversation (with me, for the time being), because it's hard to tell what would make me want to continue. That said, the single general thing would be:

Try to figure out how it seems like, from my perspective, we are failing to be on the same page about trying to talk about something / communicate about something / figure something out. Then try to address that somehow.

For example:

Maybe I thought you literally hadn't read what I wrote. Then a summary of the foregoing discussion might significantly help, because it would show you had read it.
Maybe I thought you were talking about something irrelevant to the topic of discussion. Then the summary might not help, and instead what might help is you saying what you think the topic at hand is, and why you're saying what you're saying—how it relates to the topic at hand.
Maybe I thought you were too much projecting / imagining positions onto me that I don't hold. Then it might help for you to restate the position you think I hold, that you're responding to.
Maybe I thought you seriously misinterpreted words I wrote, or missed the point. Then it might help if you gave a little transcript of what you're thinking when you're reading what I wrote and trying to understand the main thrust of it. That way I can see that you're trying, and maybe I can point out what went wrong—or I can see how I miscommunicated and say "oops, my bad, I should have said XYZ instead of what I said".
Maybe I thought you were just out to dishonestly frame something. Then it might help if you restated things in a more complete and fair (not necessarily balanced) way.
Maybe I thought you were a Charging Hobby Horse. In that case, it might help if you take some time to reflect on what you're really trying to do in the conversation, and then try to state that openly.

(It might seem strange that I sometimes say "the topic", as if there's one "the topic" and it's whatever I say it is. Fair enough, discourse is high-dimensional. But if you're riding a hobby horse, I may not be interested. If you're responding to something in the middle of another discussion, then the relevance and meaning of statements is by default in reference to that context.)

3. Why I might be saying this

There are several sorts of reasons I might be saying this; and often I can't immediately tell exactly what the reason is. Note that it may be just one or two of these. I definitely am not asserting that all or most of these are the case.

Again, this isn't a dunk, and it can't possibly be a detailed argument that I have the moral high-ground in the specific thread where I'm linking this, because this is a post I've written abstracted from any specific situation, maybe long before this thread occurred. The reason I list these possible reasons is not to accuse you, but just to give you information.

That said, some of the possible reasons I'm linking this:

You seem to simply be saying several importantly incorrect things.
You seem to not be responding at all to what I wrote, or seem to not have read it.
You do seem to have read what I wrote, but you're restating my positions very incorrectly.
You seem to perhaps be bad at understanding what I write, and this particular topic is complex / subtle, so this combination makes me not want to continue at the moment. (If I saw the output of your attempt to understand what I'm saying, I might be able to at least say that this is what's happening.)
You seem to be responding to positions I don't hold and didn't say. Maybe I've already restated my relevant positions and you're still not getting it; or if you are, then something else is going wrong.
You seem to be saying something pretty off-topic / not relevant to what I think we're talking about. Especially, you seem to be zooming in on certain statements, and then failing to correctly connect them to other statements.
I can't tell whether, on the one hand, you're genuinely misunderstanding / I communicated unclearly, vs. on the other hand, you're not trying to understand or you're actively trying to misunderstand. I would like to see proof-of-work so I know you're actually trying to understand.
You're only trying to understand some of what I said, rather than the main / most relevant parts of what I said.
You're persistently saying things that seem to me logically disconnected from the previous discourse, e.g. you're working off an enthymeme; and I'm not succeeding at getting you to talk about that in order to clarify the structure of what you're saying.
Information that I'm trying to communicate to you seems to bounce off you—especially information about my positions, especially especially information about my positions that you are apparently arguing with. Similarly, I'm trying to reorient with you in order to repair the discourse, but this is going wrong somehow and I've locally lost hope that we can get back on track by talking about how we went off track.


The Charge of the Hobby Horse

tsvi bt Nov 14, 2025 Updated Nov 23, 2025


[Epistemic status: !! 🚨 Drama Alert 🚨 !! discoursepoasting, LWslop]

1. Case 1: You only get six words
2. Case 2: Trees may be cool but how should concepts work in general??
3. Case 3: The Bannination
4. The Pattern
5. Conclusion

1. Case 1: You only get six words

In 2024, the MATS team published a post, originally titled "Talent Needs in Technical AI Safety".

I, a hero, made this comment and elaborated in the ensuing comment thread. The content isn't so important here—basically, I was objecting to a certain framing in the post, which tied into a general issue I had with the broader landscape of people nominally working on decreasing AGI X-risk.

Now, I have not actually read this post. (I kinda skimmed it and read parts.) So I don't actually know what's in it. The post's description of itself, from the introduction:

In the winter and spring of 2024, we conducted 31 interviews, ranging in length from 30 to 120 minutes, with key figures in AI safety, including senior researchers, organization leaders, social scientists, strategists, funders, and policy experts. This report synthesizes the key insights from these discussions.

Well, that sounds like a lot of work, and I believe that they definitely did write about several ideas coming from that work. My comment was not about any of that—which makes sense, given that I didn't read the post—but that did not stop my comment from being the second highest-upvoted comment, and dominating the discussion on that forum.

Is this bad? My comment was fine; I made an important remark, wasn't super disrespectful, clarified that I was talking about a broader phenomenon in the community... And yet, from another perspective, what happened is:

I (understandably) misunderstood an idiom in the original title ("talent needs") as subtly framing what counts as "technical AI safety", and homed in sharply on just a couple phrases from the post;
I used morally charged language (e.g. "...it's not enough...", "...should be made aware...", "...is conflating...", "I don't buy...") while prosecuting my off-topic issue;
and this distracted from the actual effort and content of the original post, while sucking up some attention from the author(s).

[There might be worse examples of me doing this Pattern, but I didn't immediately find them in the first page or two of my top-voted comments.]

2. Case 2: Trees may be cool but how should concepts work in general??

When eukaryote wrote a very nice post about trees ("There’s no such thing as a tree (phylogenetically)"), Zack_M_Davis had a reply. Follow that link if you want to read the whole thread (not very long). But I'll give an abstracted (and maybe somewhat unfair) paraphrase:

eukaryote: "...and by the way, this stuff about trees reminds us that X isn't uncomplicatedly true."
Zack: "Excuse me! X is definitely true! This is important and I've written about that a lot."
eukaryote: "Oh yeah I agree. We agree."
Zack: "No but X is super-duper true."
eukaryote: "Oh yeah I agree, I wasn't saying X is not super-duper true, I'm just also saying there's some interesting complications."
Zack: "But if I take your specific word choice and imagine a whole epistemological stance that produced that word choice, I disagree with that epistemological stance because of such-and-such."

Then eukaryote claps back and DESTROYS Zack with CLEAR CALM EXPLANATION of why his interaction was KINDA NOT GREAT!! That is, eukaryote gives a discourse-level response, which I'll quote here almost in full (bolding mine):

I don't love this thread - your first comment reads like you're correcting me on something or saying I got something important philosophically wrong, and then you just expand on part of what I wrote with fancier language. The actual "correction", if there is one, is down the thread and about a single word used in a minor part of the article, which, by your own findings, I am using in a common way and you are using in an idiosyncratic way. ...It seems like a shoehorn for your pet philosophical stance. [...]

To be clear, the expansion was in fact good, it's the unsupported framing as a correction that I take issue with. This wouldn't normally bother me enough to remark on, but it's by far the top-rated comment, and you know everyone loves a first-comment correction, so I thought I should put it out there.

[Note: Zack was the first association in my head to this pattern—so I was slightly surprised to find that it took a bit of effort to find an example. Some of his other top-voted comments are similarly confrontational, but do not exhibit The Pattern. E.g. here he does something superficially similar—but, even though it's technically off-topic it's close enough, and also he explicitly signposts the slightly-off-topicness, and also it's a useful comment, and I think its main argument doesn't misunderstand anything from the original post. Also Zack apologized to eukaryote. Maybe there are other examples I didn't quickly find.]

3. Case 3: The Bannination

I wrote a post ("The problem of graceful deference"). Wei Dai made a comment with an ensuing thread. You can read it there; I'll again provide an abstracted gloss (which is not trying to be a comprehensive summary [ETA: Wei states here that "I think it's a significant misrepresentation of my words (that makes me appear more unreasonable than I was)", which I don't agree with.]):

Tsvi: "...X..."
Wei: "It seems strange to say X because [reasons]. Also, not-Y."
Tsvi: "Here's some reasons for X; and I don't see how your reasons say not-X. Also, I'm not saying Y—in fact I done been sayin not-Y!"
Wei: "In response to your reasons for X, I would like to say that actually not-Y."
Tsvi: "I agree with not-Y. That's not an argument for X though."
Wei: "But it seems like you're saying "X" and "X implies Y"."
Tsvi: "Noooo rawr I keep saying "X and not-Y", and in fact I said that in the original post, how could you possibly think I'm saying "X implies Y"?"
Wei: "Ok, I misunderstood, but like, you didn't say "not Y" in your post right after you said "X". Anyway, let's talk about how Y is not true."
Tsvi: "I'm banning you from commenting on my posts."

I think this is another clear example of The Pattern. (If you're trying to interpret the full thread: X is "Yudkowsky is the best strategic thinker on AGI X-risk." and Y is something like "People should have been deferring to Yudkowsky as much as they did.".)

[Note: I'm unbanning Wei Dai from commenting on my posts so that he can respond here if he wants. By default I won't reban him. I prefer and would like to support open discourse within a community. But I reserve the normative right to block trolling.]

4. The Pattern

So what is this pattern? I'd call it The Charge of the Hobby Horse. It's where you ride your hobby horse into battle in the comments, crashing through obstacles such as "the author did not even disagree with me, so there's nothing to actually have a battle about". A bit more explicitly:

You have a hobby horse. It's important, and you care about it, and you talk about it, but it's not satisfied yet. You don't know how to satisfy it. You only know how to ride it.
Someone says something, preferably getting some attention which you can ride into. What they said has one or two statements or word choices which are kinda related to your hobby horse. You pick up on that one thing.
You do your best to interpret them (correctly or not) as disagreeing with your hobby horse position, and you start arguing with them. You don't try too hard to quickly correct any misinterpretation—you have to get the argument going a bit.
This works, in that you start a big discussion about your hobby horse.

This comes off as misinterpretation; especially, "mistakenly" reading in a bunch of disagreement that isn't there. But really it's disinterpretation: You're searching for something to argue with, so you can talk about your thing. You're not trying to understand what the author actually thinks, so communication becomes difficult.

5. Conclusion

Mainly I just want to describe this behavior pattern. And also ask: Please don't fabricate fights just so you get to talk about your thing! That's pretty rude!

It's a bit tough because there are a lot of comments that look kinda similar to a Charging Hobby Horse, but that are good. It's fine to make corrections. It's fine to ask the author for their opinions. It's fine and good to use someone's post as a way of discussing some amorphous diaphanous distributed social pattern that is otherwise hard to point at.

It's fine to say things that are kinda off-topic, though it helps to acknowledge you're doing that. I would say it's even fine to comment on a post that you have not fully read, as long as you say you're doing that. If you're responding to The Broader Discourse, Not This Post Specifically™, then it would be nice to say you're doing that. Also, unless you're pretty sure what the author thinks, it's always good to quickly restate what you think they think, in a sentence or two.

And remember, you always have the option to write about your hobby horse in your own post or other forum. If it is actually relevant to the original post that got you wanting to rant a little, you can link to your separate write-up in a comment on that original post.

Anyway, thanks for your attention on this very something matter.


Tools for deferring gracefully

tsvi bt Nov 13, 2025 Updated Nov 23, 2025


1. Introduction
2. Extended table of contents
3. Notice when you might be deferring
4. Just say that you might be deferring
5. Factor out and fortify endorsed deference
6. Respect the costs of independent investigation
7. Give stewardship, not authority
8. Take me to your leader
9. Recount your sources
10. Expose cruxes about deference
11. Distinguish your independent components
12. Retreat to non-deferential cruxes
13. Open problems in graceful deference
14. Acknowledgements

1. Introduction

Last time, I asked about "The problem of graceful deference". We have to defer to other people's judgements of fact and of value, because there are too many important questions to consider thoroughly ourselves. Is germline engineering moral? What should I work on to decrease existential risk? Do I really have to floss? Should I get vaccinated? How feasible is safe AGI? Pick one each month or year to start the long process of becoming an expert on; the rest you'll have to defer on, for now.

Deference leads to several important dangers. It causes information cascades and correlated failures; it creates false moral consensus and false impressions that a question is really settled; it cuts us off from powerful intrinsic motivation.

How can we defer in a way that harnesses the power of deference, while attenuating the dangers?

Below are some partial answers.

If you're interested in this topic, there's a lot more to be worked out, so you could take a crack at it. See the last section, "13. Open problems in graceful deference".

(Caveats: This is a list of tools, each of which you may or may not want to pick up and use. They are phrased as imperatives, but of course they are only good for some people in some contexts. You may feel uncomfortable with some of these recommendations—just remember that you're wrong, trust me we're all already deferring on most questions. So, we are ok-ish with how we are deferring—at least, ok-ish to the extent that we're already ok-ish right now in general; and thinking about how we are deferring could open up ways to make our situation better. These tools are for doing what we're already doing, but more gracefully. These tools are not for throwing out independent reason. And these tools aren't for, you know, feeling super guilty—I only use them a bit! But I usually like when I use them more.)

2. Extended table of contents

Here's a synopsis of the tools.

When are you deferring?
- 3. Notice when you might be deferring. What does deferring feel like? How can you bring it to your conscious attention?
- 4. Just say that you might be deferring. No need to pretend you've already worked everything out yourself from first principles with your eyes closed, or to pretend you never have a stance about anything that you haven't worked out explicitly for yourself.
When should you defer?
- 5. Factor out and fortify endorsed deference. You can speculate about big ideas without having to also necessarily call into question important parts of your life that should stay stable.
- 6. Respect the costs of independent investigation. Truly doubting something important is often very hard work, so it makes sense that you shouldn't necessarily always be doubting lots of important stuff.
- 7. Give stewardship, not authority. Even if you're deferring to someone, keep an eye out about whether and how you should continue to defer to them.
Who are you deferring to?
- 8. Take me to your leader. If you feel an obligation to defend your belief in something, you can try just pointing to the person whose opinion you're deferring to, and hopefully they'll defend it for you.
- 9. Recount your sources. Say where you got information and ideas if you can.
- 10. Expose cruxes about deference. Just like it's often helpful to figure out what would change your mind about some concrete question, it's also helpful to figure out what would make you stop wanting to defer to someone about some topic.
What are you deferring about?
- 11. Distinguish your independent components. If you have something to add beyond your deferential opinion, it's helpful to distinguish the part you're adding from the rest of the opinion.
- 12. Retreat to non-deferential cruxes. If you were arguing using a strong claim that you have to defer about, you could try instead arguing using a claim that is weaker—so you don't have to defer about it—but still strong enough to carry your argument.

3. Notice when you might be deferring

Needless to say, it helps to be aware when you're deferring. A couple indicators (certainly not proof, but maybe cause to consider that you might be deferring):

You directly associate an idea with a person.
A claim "seems like it stands to reason" (or maybe it really does stand to reason), or "is something everyone knows", or is something you would just take for granted. You might e.g. be deferring to a summary given in a textbook, which is sort of true but doesn't give all the important detail and which you did not independently question. Or you might be deferring to judgements that are implied or suggested by other people's actions, even if not stated or argued for explicitly.
You can't easily bring to mind the arguments or evidence for a judgement, or the next-level arguments for the arguments, or cruxes (observations that would change your mind), or specific doubts you have about the evidence, or a picture of what the alternative looks like. You might be deferring to your own cached thoughts, or to someone else's conclusions.
You feel nervous to say the opposite, as if you should hold your tongue, or someone might get mad at you for saying something, or you might say something that you later seem dumb for having said. You might be deferring to others' moral judgements (or your imaginations of their judgements).
You feel an experiential fringe of sanctimoniousness, like: "Ah, I see, you are not aware of this thing that the intelligentsia / elite / informed / experts / savvy people know; let me help you out.". You feel comfortable not worrying too much that the newcomer's perspective will gain more cachet and leave you working on something that few care about; you know that "the community" cares about the thing you're doing, and thinks it's important, but doesn't especially care about this thing that the newcomer is talking about. You might be deferring to the consensus of the group whose knowledge you are graciously sharing.

4. Just say that you might be deferring

If you realize that you have a bottom line judgement already, but that you HAVEN'T already really doubted and investigated the question, you can just say that. You can just say, "I feel pretty strongly that reprogenetics is a bad idea, though I won't argue for that position explicitly right now.". Or you can elaborate, e.g.: "I have a stance against reprogenetics, but I haven't thought about it much, so I might be wanting to go with the current generally accepted stance, or I might have an intuition that I haven't made explicit, or something.". Don't pretend that you are not deferring; don't pretend that you've investigated a bunch and come to an explicitly reasoned-out conclusion.

You can then go on to speculate about what your intuitive concerns might be about, who you are deferring to, why you want to defer to them, what your cruxes might be, what the reasons behind the consensus are, and so on. But by first acknowledging deference, you can go ahead with those speculations without feeling like they have to produce a justification for your bottom line judgement, or that you have to change your mind if you don't produce such a justification. You've already stated what your current judgement is, and you've already acknowledged that the source of your current judgement is likely to be mainly deference, not a concrete reason.

Allow others to defer, and to say they are deferring. Allow others to provisionally think through why the judgement is correct or incorrect without having to update their judgement just based on their own reasoning.

5. Factor out and fortify endorsed deference

In relation to his philosophical exercise of fundamentally doubting everything, Descartes writes in Discourse on Method, part three:

And finally, just as it is not enough, before beginning to rebuild the house where one is living, simply to pull it down, and to make provision for materials and architects or to train oneself in architecture, and also to have carefully drawn up the building plans for it; but it is also necessary to be provided with someplace else where one can live comfortably while working on it; so too, in order not to remain irresolute in my actions while reason required me to be so in my judgments, and in order not to cease to live as happily as possible during this time, I formulated a provisional code of morals, which consisted of but three or four maxims, which I very much want to share with you.

(https://grattoncourses.wordpress.com/wp-content/uploads/2017/12/rene-descartes-discourse-on-method-and-meditations-on-first-philosophy-4th-ed-hackett-pub-co-1998.pdf)

Separate out what you want to defer about from what you're going to really doubt. In particular, you're likely to want to mostly defer to society about which actions are generally very advisable or very inadvisable, even as you doubt the supposed justifications for those judgements.

That way you can more safely doubt some things without threatening your deference, or in other words, you can continue deferring while incurring less of a cost of restriction on what you can doubt. E.g. "Ok, I can discuss whether or not reprogenetics would hypothetically be good if safe and effective and accessible and legal and widely practiced, but either way I will not work on a project that's actually trying to do embryo editing.".

6. Respect the costs of independent investigation

Doubting something that's important to how you think and act is a fearsome undertaking. Respect the costs of doubting—i.e. the costs of maybe undoing some way that you have been deferring. Respect the costs of going against the grain, betting against the market, doing a bunch of cognitive labor yourself that you could have just copied from your society.

Because doubt is costly, it is dignified to defer instead. Defer with dignity. It is good to remember that, when you defer, you are drawing on the resources that your society provides to you. You could possibly have done more work on your own in order to produce better information and better judgement for yourself and for others; but it is respectable to choose which questions to struggle with and to mostly defer.

You don't have to come up with a reason for rejecting an idea that is not your true rejection. The most respectable thing is to do original work, solve a problem, and publicly demonstrate your solution; the least respectable thing is to defer and pretend that you are not deferring; and in the middle, respectable enough, is to defer and say you are deferring.

This is also something you can do to help others defer gracefully. If other people know that you understand that non-deference (independent investigation) is costly, then other people who are deferring can more comfortably just tell you "I'm deferring" rather than pretending to not defer.

7. Give stewardship, not authority

When you defer to someone, do not give them authority. Your judgements aren't their property to do with as they please.

They are the stewards of your judgements. You've given them concrete control, but that control is yours to modify or transfer or revoke, and you retain ultimate responsibility for your judgements.

Don't open niches for those who you defer-to, within which they can abuse their stewardship. Don't needlessly expand their control over your judgements. In other words, don't be a cult-follower towards anyone, even if they aren't yet being a cult-leader.

Keep accounts about whether the steward appropriately handled your judgements on your behalf. You can't necessarily hold them accountable, since your deference was a choice—you made that choice, you are responsible for it, and you may have made it without their input. But keep the accounts, usually ideally publicly.

8. Take me to your leader

There is a norm of debate. (For better or worse, it's a weak norm, deployed in few communities.) According to this norm, if you say X and Bob says not-X, then you should either debate Bob about X or else update to believe not-X.

This norm pressures people to not defer, because a judgement based on deference is not something you can stand up and defend with arguments and facts in a debate. Either you eject yourself from the communities that have that norm; or else you have to do a bunch of research and thinking in reaction to any random challenge; or else you have to fake having coherent positions in debates.

Instead of choosing one of those options, say: "I haven't investigated X deeply myself. What Carol says about X makes sense to me and I generally trust what she says about several topics. Further, so far she's successfully rebutted the critiques of her position. So, if you want to convince me about X, debate Carol and show her position about X to be wrong.".

And, to help others defer gracefully, treat that as a respectable response.

And, apply the debate-or-update norm more strongly to the leaders who are deferred-to. (Though this is fraught, if the leaders did not choose or make use of their position.)

9. Recount your sources

If you got an idea or insight or piece of information from Alice, and then you repeat it to Bob, also tell Bob that you got it from Alice.

People don't do this. Partly that's because it's hard to keep track and takes effort to recount in conversation. Partly it's because they want to sound smart—but that is a major transgression against God's will.

By recounting your sources to Bob, you let Bob know who to defer to, if he would want to defer to the source of what you shared with him.

If you fail to recount your sources, then you appear as though you aren't deferring, even when you are.

If you fail to recount your sources, then you open up your listeners to double-counting evidence, if they also hear other second-hand judgements that actually are transmissions from the same source as you are transmitting.

If you fail to recount your sources, then you make it harder for people to track and quarantine bad information. For example, if you say "NZT-48 causes brain bleeds.", what am I supposed to do with this information, if I know about the topic? Instead if you say "I read a study by Krombopulos et al. (2028) that says NZT-48 causes brain bleeds.", then I can be like "Yeah I've read that study, they totally screwed up their analysis, there's actually no effect.".
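
The double-counting point above can be made concrete with a little arithmetic. What follows is only a sketch under made-up assumptions: the 50% prior, the likelihood ratio of 3, and the `posterior` helper are all hypothetical and chosen for illustration. It compares a listener who hears the same claim from two friends and treats them as independent with a listener who knows both friends are repeating one source.

```python
def posterior(prior: float, likelihood_ratios: list[float]) -> float:
    """Update a prior probability by multiplying its odds by each likelihood ratio."""
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)


# Hypothetical numbers: start at 50%, and suppose one report from the
# original source is worth a likelihood ratio of 3 in favor of the claim.
PRIOR = 0.5
SOURCE_LR = 3.0

# Two friends repeat the claim without saying where they got it, so the
# listener treats them as two independent pieces of evidence.
double_counted = posterior(PRIOR, [SOURCE_LR, SOURCE_LR])

# Both friends name the same source, so the listener counts it once.
counted_once = posterior(PRIOR, [SOURCE_LR])

print(f"double-counted: {double_counted:.2f}")
print(f"counted once:   {counted_once:.2f}")
```

With these assumed numbers, the listener who can't see the shared source ends up at 90% instead of 75%, purely because nobody said where the claim came from.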

10. Expose cruxes about deference

Did COVID-19 originally leak from a lab? I don't even have a guess, but if I did, it would probably be based on deference to some expert in genetics and virology. You can't really argue to me about cleavage sites and base rates and so on (well you could, but it would take a lot of work). Would I then be, in practice, completely impervious to facts and reason? No, you could shake my judgement by convincing me that the expert(s) I'm deferring-to make visible errors that are important to their stated case; that they have often previously put forward plausible-sounding arguments that were later shown to be wrong; that their credentials are fake; and so on.

Even if you don't have relevant cruxes directly about the topic, put forward cruxes about the people you're deferring-to.

11. Distinguish your independent components

(I heard this from Andrew Critch or Anna Salamon.)

When you share an opinion, distinguish the part that is originating with you from the part that you are summarizing from other people.

For propositional opinions, this means sharing your first-hand observations separately from your summaries of other people's testimony. E.g.:

- Instead of saying "I think Novavax is better than Moderna", you might say "Gippity says Novavax is better than Moderna, and Googling says it has less side effects ...and I took both and had less side effects from Novavax" or "...and I took both and couldn't tell the difference".
- Instead of saying "I think AI alignment is hard", you might say "All the experts whose writing about AI makes sense to me say that AI alignment is hard, but I haven't tried myself". In this example you have an independent component, which is "which writing about AI makes sense to me". You explain what judgement you are adding in to the mix, and explain what your listener would be trusting if they trust your conclusion, rather than ambiguously posing as an expert.

For values and decisions, this means sharing your desires and your "best guess about what to do, if you were the sole decision-maker", separately from what your actual current plan is, which may be based on your having aggregated the group's values. E.g. you might say "It seems more convenient to go to the DMV and then the grocery store, but I'm not that confident and Alice said the opposite and I'll go along with what she said" rather than "We are going to the grocery store first" which makes it sound like you independently agree that we should go to the grocery store first.

If you're going to update your probability based on others' opinions, then also share your un-updated probability.

This helps avoid information cascades and social miasma.

12. Retreat to non-deferential cruxes

Are Jews genetically predisposed to be more sneaky than non-Jews? I don't know, probably not, and in order to form an opinion, in practice I would probably have to defer to experts in genetics. But I also don't care very much. Even if we are genetically sneaky, you can't kick us out of the government or ban us from business. The genetics thing isn't a crux, and it shouldn't be for you or for a free society. I don't need to answer that difficult-to-answer question about genetics. If you want to make me even care about genetics in the context of government policy, you'd first have to argue the implication from genetics to policy, not anything about genetics itself. I would need to defer about genetics, but I don't need to defer about my judgement that genetics should not affect policy. That's something I can see and argue for myself.

This illustrates a general principle: Often you don't have to make judgements at all. If you can answer the practical, action-affecting questions without fully answering some other question X, then you don't have to form an opinion about X right now.

As another example, I don't know when during development a child gains a soul; but I'm sure they have a soul by age 2 years, and I'm sure they don't have a soul by age 7 days. So I'm confident that it is morally acceptable for parents to choose to destroy 7-day embryos. I would have to defer to neurologists and embryologists about many of the relevant facts for, say, 4-month-old fetuses; but that's not a crux for IVF, and I'm independently confident that IVF is morally acceptable.

So, suppose you have a question at hand, and you have some cruxes for that question, and for some of those cruxes you have a non-deferential independent judgement about them. In this case, base your arguments for your position on those cruxes rather than on your deferential judgements. Say "I'm sure a 7-day embryo doesn't have a soul.", not "Experts agree that even a 2-month embryo doesn't have a soul.".

13. Open problems in graceful deference

What are more ways to notice when and in what ways you're deferring?
How are we already deferring, descriptively? Which of these ways are good and bad in what contexts? How can they be generalized, fixed, improved, refined?
What are some ways to notice that you are deciding or starting to defer? How do you get other people to notice when they are starting to defer?
- I've done a substantial amount of mentoring for newcomers to the CFAR sphere and AGI existential risk reduction sphere (CFAR, ESPR, MIRI, PIBBSS, SPAR, MATS). I spent a lot of effort trying to get newcomers to confront the pre-paradigm nature of technical AGI alignment, and the strategic uncertainty around AGI X-derisking. So, I've seen a lot of people go around at workshops asking "established experts" such as myself about what's important and what they should be working on. I tried, but never really figured out how, to get them to understand that they were engaging in the process of downloading a consensus to defer to, and why it matters that that's what they're doing.
Intuitive deferential processes.
- Very often, deference happens unconsciously and through forces other than some sensible epistemic updating.
- What are these other processes? When are they ok and not ok?
- E.g. what's the deal with naturally subtly deferring to your surrounding social milieu on lots of stuff?
- E.g. how to interpret when people are deferring to some inexplicit distributed supposed consensus communicated through quasi-linguistic cues?
How to choose who to defer to? E.g. epistemic spot checks.
Meta-deference. How should we defer about who to defer to? E.g. who do you trust to tell you who is a reliable expert on something?
How can awareness of deference be leveraged?
- Are there more graceful ways to defer that are unlocked by being fully conscious that you're deferring from the beginning (e.g. when learning about a new field for the first time)?
- When you realize that you've been deferring, and you hadn't realized before, what to do? When should you endorse that deference, and how strongly should you endorse it? How quick should you be to stop deferring and instead investigate?
- What are good and bad ways to orient to others when they are deferring? When they are being deferred-to?
Compare: deferring to a single person vs. deferring to a group or consensus (e.g. "what virology thinks of COVID") vs. deferring to a multi-party process (e.g. "the jury trial acquitted").
How to un-defer?
- How do you prepare to un-defer? See e.g. Planting questions.
- When to change deference perms for specific deferrees? When and how to fully defer from a deferree, or widen or narrow the scope of deference to them?
- How to prioritize un-deferring? Which questions should you invest in investigating? See e.g. "overhaul key elements ASAP".
- What other "cleanup" should you do when undeferring? E.g. propagating updates about things you were deferring about; propagating updates about "I shouldn't have been deferring on this question or to this person".
How to make the deference relationship healthy?
- See e.g. some of habryka's posts.
- E.g. how to be a good deferred-to person? E.g. "Do not hand off what you cannot pick up".
- E.g. how to be a good deferrer, in relation to the deference and the deferree? E.g. "Question the Requirements".
What to do when you have to defer, but also there are no good deferrees?
- E.g. what to do when you're in a market for lemons? Maybe you can fund people who might themselves become good deferrees.
How can you recognize when, by deferring, you're feeding yourself to hostile processes? What to do about that?
How do you gracefully defer as a group?
- E.g. how do you coordinate to alleviate correlated failures?
- How do you appropriately aggregate the incentive to investigate independently? Often no one person should unilaterally investigate, if it's just for their own undeferring, but it would be good for a group to have one person investigating so the group can defer less or defer to a more wholesome consensus.
Third parties.
- How to notice and understand the deference relationships of other people and groups?
- How to deal with them? E.g. how to be kind, but also not let people get away with bad behavior because they're just following orders, etc.
- As a deferree or deferrer, how do you make your deference relationship easier for third parties to interact with suitably?
Dimensions of deference.
- Compare: deferring on facts and propositional beliefs; deferring on importance and values; deferring on concepts and questions.
- There are several reasons that debate or investigation can be infeasible or inappropriate in some contexts. E.g. your stances or beliefs are uncertain, not well-informed, deferential, inexplicit, or weak. How do these relate? E.g. you can have a strong certain inexplicit non-deferential opinion ("I want you to not touch me there; I strongly want that; I'm not uncertain; I can't give a clear explicit explanation of why"). When and how can you and should you untangle deference from other such opaque stances? How to deal with e.g. having a vague blob of cruxes about some question, which is partly deferential and partly your independent intuitions?
If you're going to be a "foot soldier" for a group or cause, based on deferential stances, how can you alleviate the problems that come from that?
If you look carefully at my list of Dangers of Deference, you'll see several that aren't adequately addressed by the list of tools in this article. E.g. group effects of meta-deference are mentioned. E.g. the effects of deferring about importance are mentioned; see also "Please don't throw your mind away".

14. Acknowledgements

Thanks for helpful comments from: Ben Goldhaber, Clara Collier, Linch, Mikhail Samin, Scott Alexander, and Vaniver.
