Unexpected Values — GeistHaus

Consider donating to AI safety champion Scott Wiener

Eric Neyman Oct 22, 2025

Written in my personal capacity. Thanks to many people for conversations and comments. Written in less than 24 hours; sorry for any sloppiness.

[Link to donate here — please use this link rather than going to his website — but please read at least the first few paragraphs!]

It’s an uncanny, weird coincidence that the two biggest legislative champions for AI safety in the entire country announced their bids for Congress just two days apart. But here we are.

On Monday, I put out a long blog post making the case for donating to Alex Bores, author of the New York RAISE Act. And today I’m doing the exact same thing for Scott Wiener, who announced a run for Congress in California today (October 22).

Much like with Alex Bores, if you’re potentially interested in donating to Wiener, my suggestion would be to:

Read this post to understand the case for donating to Scott Wiener.
Understand that political donations are a matter of public record, and that this may have career implications. Decide if you are willing to donate to Scott Wiener anyway.
- If you’ve already donated to Bores, I think there isn’t much cause for concern, with one exception. If you have the sort of federal policy role where donating against an incumbent might hurt your career, or think there’s a good chance that you will be in such a role in the near future, consider waiting to donate until Nancy Pelosi, the incumbent, announces retirement.
If you would like to donate to Scott Wiener, you can donate at this link. (Please use this link, rather than going to Wiener’s website to donate – that lets him know that the donation is from someone who cares about AI safety!)

To state my bottom line up front, I think that:

Marginal donations to Alex Bores still look a bit better than donating to Wiener. If you haven’t yet maxed out to Bores, my recommendation is to do that first.
If you have maxed out to Bores, and you are not working in a policy role where donating against an incumbent might hurt your career (and won’t be in the near future), my recommendation would be to donate now (Wednesday, October 22nd) to Wiener.
- However, if you disagree with my modeling assumptions, you might decide that it makes sense to instead wait and only donate if Nancy Pelosi retires. Edit 11/6: Nancy Pelosi has announced her retirement, so this is no longer a relevant consideration. Donations that come sooner are now more or less straightforwardly better.

[Edit 10/31: at this point, I think donations to Wiener are almost as effective as donations to Bores, and will be roughly as effective if/when Pelosi decides to retire!]

Just two days later…

AI safety champion Alex Bores announced on Monday that he’s running for Congress. In my blog post about Alex Bores, I wrote:

I expect that big opportunities that are as exciting as this one will come up very rarely (maybe once every couple of years). One comparison point: it has been reported that Scott Wiener will run for Congress. I’m super excited about this, and think that we’ll only get big donation opportunities this good once a year or so. Nevertheless, I think that donating to Bores looks a little better, mostly because it’s clearer to me that Bores will continue to prioritize AI safety.[6]

Where footnote 6 read:

[6]: Note that I wrote a version of this paragraph before I found out that Scott Wiener might run! So I don’t think I’m biased by the recency of this news; I think that, by coincidence, we’ll be getting two amazing donation opportunities only a few days/weeks apart.

Well, today, October 22nd, Scott Wiener announced that he is also running for Congress. As I indicated above, I think this is a really good donation opportunity. In this doc, I will lay out the case for donating to Scott Wiener.

My current bottom line: my current estimate is that donating to Wiener on Wednesday, October 22, looks 75% as good as donating to Bores on his launch day. This is better than any other donation opportunity that I’m aware of, and is likely to be better than any big donation opportunities that I’ll become aware of in the next several months. I have donated $7,000 (the legal maximum) to Wiener.

(While there are considerations cutting in both directions, the main reason that donating to Wiener looks worse than donating to Bores is that Wiener is challenging Nancy Pelosi, and it’s not clear whether she’ll retire. If she doesn’t, it’s unlikely that Wiener will win.)

(Worried that I’m underestimating the frequency of such opportunities, and that I’ll keep asking for money? Here’s a section on that.)

Introduction

Scott Wiener represents San Francisco in the California legislature and has been an AI safety champion. He sponsored and fought really hard for SB 1047, which is my all-time favorite piece of AI safety legislation to date (even better than the RAISE Act). Gov. Newsom vetoed the bill (in large part due to opposition from big tech), but it got really close: it passed the legislature, and my sense is that vetoing the bill was a pretty close call for Newsom.

(There’s debate on whether the second-order effects of SB 1047 were net-positive. Some argue that the second-order effects were net-negative because it galvanized opposition to AI safety legislation from big tech. Others argue that the second-order effects were net-positive because the provisions of the bill survived as ideas for public policy and shaped efforts like the RAISE Act in New York. I lean toward the second camp, but I’m not sure.)

This year, Wiener sponsored and fought for SB 53, a whistleblower protection and transparency bill that passed the legislature and was signed into law by Gov. Newsom last month. I’m a big fan of this bill, even though I don’t think it goes far enough.

Today, Scott Wiener announced that he’s running for Congress. I believe that donating to his campaign is a really good use of philanthropic dollars to mitigate AI x-risk. My blog post on Alex Bores is a good starting point for thinking about donations to Scott Wiener. However, there are some important differences to keep in mind (more here).

(If you’re already convinced to donate to Scott Wiener, you can donate here (and thank you so much!). My donation suggestion is $7,000 (the maximum), if you can afford it. But note: if you didn’t donate to Alex Bores, then before deciding to donate, please be aware that political donations are a matter of public record! If you’re potentially interested in working in the federal government, this could have career implications for you. See this post for more discussion. Additionally, if you are or soon will be a federal lobbyist or similar, consider waiting to donate until Pelosi retires. If you’re not sure whether this applies to you, feel free to reach out and I’ll try to connect you with an expert.)

Things I like about Scott Wiener

A champion of AI safety

The main thing I like about Scott Wiener is that he has been a really dedicated AI safety champion in the California legislature. This started with SB 1047, an AI safety bill that would have applied only to companies that train models with >$100 million training runs, and which would have instituted the following rules for those companies:

Whistleblower protections for employees.
Companies must “create a reasonable safety and security plan (SSP) such that their model does not pose an unreasonable risk of causing or materially enabling critical harm: mass casualties or incidents causing $500 million or more in damages.”
- The bill required the SSPs to be concrete, and to “include objective evaluation criteria for determining compliance.”
Companies “must adhere to their own SSP and publish the results of their safety tests.”
Companies must be able to shut down any models under their control, if necessary.
Companies that violate these rules and cause critical harm (as defined above) would be held liable for the harm.

(All quotes are from Zvi’s summary of the bill.)

I think that this bill was really good: very well-crafted to be as light-touch as possible while still trying to mitigate the most catastrophic harms from advanced AI.

Unfortunately, the bill generated a huge amount of opposition from big tech. Under these circumstances, lawmakers usually back down, because they don’t care enough about the bill to spend their political capital on passing it and make powerful enemies along the way. But that did not deter Wiener, who fought really hard for the bill, revising it repeatedly to try to appeal to as broad a coalition as possible while still retaining the most important parts. In the end, the bill passed the state legislature, although it was vetoed by Gov. Newsom.

From talking to the AI safety advocates working with Wiener, I gather that passing this bill through the legislature was an exhausting ordeal for him and his staff. He was constantly bombarded with personal attacks on Twitter, and made powerful enemies in big tech.

Not to be deterred, the very next year, Wiener introduced SB 53, an AI whistleblower protection and transparency bill. The bill essentially did a subset of what SB 1047 did, but was crafted to strike a balance between helping ensure that advanced AIs are built safely and having a good chance of being signed by Gov. Newsom. Essentially, the bill:

Institutes whistleblower protections for AI company employees (the least controversial part of SB 1047).
Requires companies to publish a “frontier AI framework” describing the measures that the company is taking to mitigate catastrophic risks.
Requires companies to publish “model cards” summarizing how likely their models are to cause catastrophic risks.
Requires companies to report “critical safety incidents” to the government of California.
Prohibits companies from saying anything false about catastrophic risk from their frontier models.

In summary, SB 53 is a transparency bill: companies won’t face consequences unless they lie. And while I think this bill doesn’t go far enough, I think it’s a great bill, and it says a lot about Wiener that he was willing to push for another AI safety bill in spite of the personal nightmare he endured when trying to pass SB 1047.

Further, my understanding from talking to AI safety advocates who worked closely with Wiener on the bill is that Wiener wasn’t just pushing a bill that was written and handed to him by AI safety advocates. Instead, he really understood the technical details of the bill and was key to crafting the bill.

Abundance

“Abundance” is a recently-coined term referring to the pro-growth wing of the Democratic Party. In a nutshell, abundance prioritizes solving issues that stem from scarcity. This includes things like:

Building a lot more housing.
Investing a bunch in the development of clean energy technology.
Generally being careful not to place overly burdensome regulations on industries.

I think abundance is pretty great, and Wiener has been one of the foremost champions of the abundance cause. The most recent example of this is SB 79, which makes it much easier to build housing near public transit. I’m not an expert here, but I’ve gathered from the discussion on Twitter that this is a really big deal for increasing the housing supply.

I’m not sure, but my guess is that if you asked advocates for the abundance agenda who their favorite state legislator in the entire country was, the most common answer would be Scott Wiener.

Effectiveness

Scott Wiener keeps passing important bill after important bill. I’m not sure whether he is literally the most effective current state legislator in the entire United States, but I don’t know who else it could be. When I ask ChatGPT who the most effective state legislator in the United States is, it refuses to answer; but at least when I ask it who the most effective state legislator in California is, it says Scott Wiener.

In other words: I’m not an expert, but I follow politics reasonably closely. And as far as I know, it is literally true that among the roughly 7,400 state legislators in the United States, Scott Wiener is the most effective one at counterfactually passing important bills. (A couple of policy people I talked to agreed with this assessment.)

What are the differences between Scott Wiener and Alex Bores?

In this section, I’ll presume familiarity with the case for donating to Alex Bores as laid out in my blog post, and talk about the main ways that Scott Wiener’s situation is different.

The state of the race

In this subsection, I’ll talk about differences between the state of Bores’ race and the state of Wiener’s race.

The donkey in the room: Nancy Pelosi

Alex Bores is running for an open seat: Rep. Jerry Nadler retired, and there is a race to replace him. By contrast, Scott Wiener is running against former House Speaker Nancy Pelosi, an extremely powerful member of Congress. So why is he running, and does he stand a chance?

Nancy Pelosi may decide to retire: according to Politico, Pelosi “hasn’t said if she plans to run for another term in 2026.” My current guess is that Pelosi is 60% likely to retire, but this guess is based on very little information (and I’m not sure anyone besides her inner circle has more information). (There are some relevant Kalshi markets; see footnote [1].)

Note: unfortunately, my cost-effectiveness analysis is pretty sensitive to this 60% number. This number is not resilient: I did my best to make a guess by thinking about it on my own, talking to people, etc. But I am kind of flying blind when it comes to this number.

If Pelosi doesn’t retire, I think that Wiener is quite unlikely to win. I wouldn’t completely rule it out: Pelosi is very old (will be 86 in 2026), and voters are feeling kinda burned by Dianne Feinstein and Joe Biden. According to SF Standard, a recent poll showed that 51% of SF voters prefer someone else to Nancy Pelosi (though I think the poll hasn’t been publicly released, so take this with a grain of salt). And in 2020, a far-left, no-name challenger to Pelosi with a sexual harassment scandal got 22% of the vote against Pelosi (back when she was merely 80).[2]

After chatting with a couple of people whose opinions on this I trust, I think that Wiener would have a 10% chance of winning against Pelosi.

Why didn’t he wait until Pelosi retires? Does it make sense to donate now, or only once she retires? More on that below.

Pelosi on AI safety

Nancy Pelosi urged Gov. Newsom to veto SB 1047, which may have sunk the bill. (See this Manifold market, where the probability that SB 1047 would get signed dropped from 50% to 35% on this news.)

Some people have asked me what I think about AI safety advocates backing Wiener before Pelosi announces her retirement, in terms of effects on Congressional advocacy in the coming year. My guess is that these effects will be small, especially because Wiener will have many communities strongly backing him, but I’m open to changing my mind on this.

Who else is running?

At least for now, besides Pelosi, Wiener’s main opponent would be Saikat Chakrabarti. The two main things to know about Chakrabarti are that (1) he’s a left-wing activist and AOC protégé, and (2) he’s very rich (“holds at least $50 million in Stripe equity”, apparently) and plans to mostly self-fund his campaign.

SF is a pretty left-wing place, and I think that Chakrabarti is a formidable opponent. If Pelosi retires, I’d slightly favor Wiener over Chakrabarti, because Wiener has represented SF for many years. But I think that it would be a close race.

There are also rumors that Christine Pelosi, Nancy Pelosi’s daughter, may run. I think that her main advantage would be having the support of Nancy Pelosi; however, I ultimately think that this won’t be enough for her to win. I would be surprised if she wins the race.

There will be other candidates as well – for example, I expect there to be a candidate backed by big tech if Pelosi retires – but I’m not aware of anyone in particular who’s worth noting.

Edited to add: it’s also worth keeping an eye on San Francisco Supervisor Connie Chan. If Pelosi decides not to run, she is likely to back Chan. Chan is a progressive who is more ideologically similar to Chakrabarti than to Wiener. If I had to guess right now, I would guess that Chan is a strong candidate who’s second most likely to win the seat (after Wiener) if Pelosi doesn’t run, although I’m really unsure about that.

Overall, I think that Wiener has a 45% chance of winning if Pelosi retires, and a 10% chance of winning if she doesn’t, for an overall ~30% chance of winning.
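The ~30% overall figure is the law of total probability: each conditional chance of winning is weighted by the 60% guess that Pelosi retires. A minimal sketch in Python using only the numbers stated above; the function name is illustrative, not part of any model the author published.

```python
def overall_win_probability(p_retire: float, p_win_if_retire: float, p_win_if_stays: float) -> float:
    """Law of total probability over whether Pelosi retires."""
    return p_retire * p_win_if_retire + (1 - p_retire) * p_win_if_stays


# Estimates from the text: 60% she retires, 45% win if she retires, 10% win if she stays.
p = overall_win_probability(0.60, 0.45, 0.10)
print(f"{p:.0%}")  # 0.6 * 0.45 + 0.4 * 0.10 = 0.27 + 0.04, printed as 31%
```

The exact product is 31%, which is what "~30%" rounds.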

Why doesn’t Scott Wiener wait?

If Scott Wiener knew for sure that Nancy Pelosi was not retiring, I’m not sure whether he would have decided to run. Maybe he’s more optimistic than me about his chances of beating Pelosi.

But my guess is that the main reason for Wiener to jump in early is that, if Pelosi retires late, he doesn’t want support to consolidate around Chakrabarti by the time that Pelosi retires. There’s a pretty big advantage to jumping in early: you can raise more money and get early endorsements. I do think that Wiener is somewhat more likely to win conditional on Pelosi retiring as a result of jumping in now rather than later, but it comes at the cost of burning political capital if she doesn’t retire.

Overall, I don’t know whether Wiener made the right choice by jumping in now. It seems plausible either way. But that’s the choice that he has made.

California’s top-two system

Unlike most states, California doesn’t have partisan primaries. It has a “top-two primary system”, which means that:

A primary election featuring all candidates regardless of party is held (on June 2, 2026).
The top two advance into the general election (on November 3, 2026).

One consequence of this is that if the top two candidates in the primary are Wiener and Chakrabarti, then the general election between them would be decided by all voters, including non-Democrats (who’d probably favor Wiener). This is similarly true if the top two candidates are Wiener and Pelosi.

This also has implications for the value of the first $3,500 of a donation versus the last $3,500; see below.

Scott Wiener is a really strong fundraiser

Scott Wiener has already raised $1 million through an exploratory committee (the step before announcing that you’re running). This is really impressive, but not too surprising: he has the backing of many influential groups, including the abundance movement and LGBT rights advocates. I expect Bores to also be a reasonably strong fundraiser, but not quite as strong. This means that counterfactual AI safety dollars go less far for Wiener than they do for Bores. (Though there are considerations in the other direction; see below.)

Will Scott Wiener be an AI safety champion in Congress?

In my analysis of Alex Bores, I wrote that I’m confident that Bores would make AI safety one of his top priorities.

Scott Wiener has done a ton for AI safety in California; nevertheless, I’m less confident that Wiener will continue to prioritize AI safety, compared to Bores. That’s because Wiener is generally a really effective legislator and has gotten a huge amount done besides AI safety. It’s harder to get things done in Congress, so I think Wiener will have to be more picky and won’t just be able to “do everything”. My guess is that his first priority would be YIMBY/abundance-type stuff.

However, I think that there’s a pretty good chance that AI will become a high-salience issue more generally. In those worlds, I think AI safety would very plausibly become Wiener’s first priority, and that people would look to him as Congress’ foremost expert on AI safety legislation (at least, if Bores doesn’t get elected).

I also think that Wiener is a really experienced legislator. As far as I know, he is the most effective state legislator anywhere in the country, in terms of passing important bills. This is an important consideration.

All things considered, my guess is that having Wiener in Congress would be somewhat less valuable than having Bores in Congress when AI isn’t super salient, and about as valuable when AI is really salient.

Are there positive effects from strengthening ties between Scott Wiener and AI safety?

In general, I think that if advocates for an issue donate to a candidate, that creates stronger bonds between the advocates and the candidate, and makes the candidate more likely to prioritize the issue.

I think that this applies in the case of AI safety and Scott Wiener, and is an important effect to consider. This is especially the case because Scott Wiener used a lot of political capital on SB 1047. I think that elected representatives are somewhat less likely to pick similarly bruising fights if they don’t know that the communities who care about those issues will stand behind them and support them like this.

Running for higher office

Scott Wiener is running for a seat in the House of Representatives, but many House members later go on to serve in even higher-ranking roles. Scott Wiener is generally better positioned than Alex Bores to seek higher office in the future. He’s a more experienced and powerful legislator who has built a name for himself to a greater extent than Bores has. I think that, conditional on winning, Wiener is somewhat more likely than Bores to become a senator,[3] cabinet official, high-ranking representative, or US president.

Cost-effectiveness analysis

I think that there are two separate strategies for donating to Scott Wiener, and it’s not a priori obvious which one’s better:

Donate on launch day.
Donate once Nancy Pelosi retires; if she doesn’t retire, don’t donate.

In this section, I’ll consider both options. The analysis will be a bit wonky, so I’ll state my bottom line in advance.

My bottom line

I think that:

Marginal donations to Alex Bores still look a bit better than donating to Wiener. If you haven’t yet maxed out to Bores, my recommendation is to do that first.
If you have maxed out to Bores, and you are not working in a policy role where donating against an incumbent might hurt your career (and won’t be in the near future), my recommendation would be to donate now (Wednesday, October 22nd) to Wiener.

My confidence in this recommendation is weak-to-medium. My recommendation is somewhat sensitive to the parameters of my analysis. For example, I would change my mind and believe that some donors with particularly good counterfactual uses of money would prefer to wait until Pelosi makes her decision if:

I thought that Wiener’s fundraising numbers had zero effect on whether Pelosi retires; or
I thought that money given to Wiener when Pelosi retires is only 10% less valuable than money given on day 1 (rather than my current estimate of 20%); or
I thought that Pelosi was unlikely to retire.

The intuition for why it’s better to donate now

Why is donating now better than waiting and donating if Pelosi retires? There are a lot of factors that go into my overall estimate, but I can tell a story that captures most of what the numbers are saying.

First, there are a few factors that make donations today look better than donations later. The most important effect is that money now increases Wiener’s probability of winning somewhat more than money later increases his probability of winning. (Maybe 20% more.) The second most important effect is that I think Pelosi is a little more likely to retire if she sees that Scott Wiener raised a lot of money.

Second, even if we found out right now that Pelosi was definitely going to run, donations to Wiener wouldn’t look that bad. By my estimates, they’d be about 30% as good as day-1 Bores donations.[4] That’s basically because, although Wiener is in a much worse position, he’s not out of the running, so donations still increase his probability of winning. This means that the money isn’t totally wasted in worlds where Pelosi stays in.

A summary of the numbers

I’ll put the full cost-effectiveness analysis in the appendix, but I’ll give the main numbers here.

I chose to denominate impact in “day-1 dollars given to Bores”: in other words, how good this donation opportunity is compared to how good donating to Alex Bores was on Monday.

How good is donating to Scott Wiener on day 1? The numbers I got were:

An effect size of 58% (i.e. 58% as good as Bores day-1 dollars) from the effect on increasing the probability that Wiener will win.
An effect size of 6% from donations increasing the probability that Nancy Pelosi retires.
An effect size of 11% from having Wiener’s back after his costly and impactful work, strengthening ties between Wiener and AI safety advocates.

And so overall, this comes out to: 75% as good as day-1 donations to Bores.
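Since the three effects are treated as additive, the day-1 total is simply their sum. A trivial sketch in Python; the dictionary labels paraphrase the list above and are not the author's own variable names.

```python
# Day-1 effect sizes from the text, in units of "Bores day-1 dollars" per dollar (B$/$).
day1_effects = {
    "raises Wiener's win probability": 0.58,
    "raises chance Pelosi retires": 0.06,
    "strengthens ties with AI safety advocates": 0.11,
}

total = sum(day1_effects.values())
print(f"Donating to Wiener on day 1: {total:.2f} B$/$")  # 0.75
```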

How good is waiting and donating if Pelosi retires? The numbers I got were:

An effect size of 38% from the effect on increasing the probability that Wiener will win (all coming from the 60% of worlds where Pelosi retires).
An effect size of 8% from having Wiener’s back after his costly and impactful work, strengthening ties between Wiener and AI safety advocates (all coming from the 60% of worlds where Pelosi retires).
Supposing that Pelosi doesn’t retire and so you don’t donate to Wiener, you’ll use that money in some other way (or you’ll save it or whatever). How much do you like how you would spend that money, compared to giving it to Alex Bores on day 1? Call that number c; you get to decide what c is. (If you donated to Bores, hopefully you think c < 1!). Then there is an effect size of 40%*c from this counterfactual use of money (all coming from the 40% of worlds where Pelosi does not retire).

And so overall, this comes out to: 46% + 40%*c as good as day-1 donations to Bores.

If you trust these numbers, it makes sense to wait to donate until Pelosi retires if c > 72%. But note that if you believe that your c is greater than 77%[5] and you trust my numbers, then you shouldn’t be donating to Wiener at all. (You should instead be using it in whatever way makes it more than 77% as good as donating to Bores.) So there’s a very narrow range of c-values under which it makes sense to wait (compared to either donating to Wiener now or not donating to Wiener no matter what), and even in those worlds, donating now is only very slightly suboptimal.
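The two cutoffs can be checked by comparing the three strategies as functions of c. A sketch in Python, assuming (as the text does) that "donate now" is worth a flat 0.75, "wait" is worth 0.46 + 0.40·c, and "never donate to Wiener" is worth c itself. The constant and function names are illustrative. The 77% cutoff is recomputed here as the point where keeping the money outright beats waiting, which is one reading of the text rather than a quote of footnote [5].

```python
DONATE_NOW = 0.75  # day-1 value of donating to Wiener, in B$/$
WAIT_BASE = 0.46   # 38% + 8%, both from the 60% of worlds where Pelosi retires
P_STAYS = 0.40     # share of worlds where Pelosi runs and you keep the money


def wait_value(c: float) -> float:
    """Donate only if Pelosi retires; otherwise spend the money at value c."""
    return WAIT_BASE + P_STAYS * c


def best_strategy(c: float) -> str:
    """Pick the strategy with the highest value for a given counterfactual c."""
    options = {
        "donate now": DONATE_NOW,
        "wait for Pelosi": wait_value(c),
        "never donate to Wiener": c,
    }
    return max(options, key=options.get)


# Waiting beats donating now when 0.46 + 0.4c > 0.75, i.e. c > 0.725.
wait_beats_now = (DONATE_NOW - WAIT_BASE) / P_STAYS
# Keeping the money beats waiting when c > 0.46 + 0.4c, i.e. c > 0.46 / 0.6.
never_beats_wait = WAIT_BASE / (1 - P_STAYS)

print(f"waiting is best only for {wait_beats_now:.3f} < c < {never_beats_wait:.3f}")
for c in (0.5, 0.74, 0.9):
    print(c, best_strategy(c))
```

The window where waiting wins runs from about 0.725 to about 0.767, which matches the "very narrow range" described above.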

See the appendix for a full cost-effectiveness analysis!

[Edit: the current cost-effectiveness analysis fails to account for the opportunity cost of Scott Wiener remaining in the State Senate for another two years — 2027-2028 — until he needs to leave due to term limits. I think this is an important consideration. My current all-things-considered belief is that this consideration is almost canceled out by the other neglected effect of strengthening ties between AI alignment advocates and Wiener in worlds where he loses and remains in the State Senate for those two years. However, this analysis is subject to change. I have a quantitative model here — feel free to get in touch.]

An important second-order effect

There’s a positive second-order effect to donating to Wiener, where politicians see that AI safety champions get lots of donations, and this makes them more likely to become AI safety champions themselves. This is not accounted for in the above cost-effectiveness analysis.

Logistics and details of donating

Who can donate?

Any US citizen or permanent resident (i.e. any green card holder[6]) can donate.

How much can I donate?

You can give up to $7,000: $3,500 for the primary election and $3,500 for the general election. If Scott Wiener loses the primary election, you will get back any money you donate beyond $3,500.

Above, I talked about California’s top-two primary system: the top two candidates, regardless of party, advance to the general election. This has an important consequence that potential donors should be aware of: if Wiener advances to the top two and then loses, people who donate beyond $3,500 won’t get a refund. (By contrast, if Bores loses the Democratic primary, people who donate beyond $3,500 will get a refund on everything beyond the first $3,500.)
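The refund rule described in this section can be written as a small function. A sketch in Python, assuming the simple split above (the first $3,500 counts toward the primary, anything beyond it toward the general election); `refund_due` is a hypothetical helper for illustration, not something ActBlue or the campaign provides. Under the top-two system, "advanced to the general" means finishing in the top two in the June primary.

```python
PER_ELECTION_LIMIT = 3_500  # $3,500 for the primary plus $3,500 for the general


def refund_due(donation: int, advanced_to_general: bool) -> int:
    """Dollars returned to a donor under the rule described above.

    Only the portion earmarked for the general election (anything beyond the
    first $3,500) can be refunded, and only if the candidate never reaches it.
    """
    if donation < 0 or donation > 2 * PER_ELECTION_LIMIT:
        raise ValueError("donation must be between $0 and the $7,000 maximum")
    general_portion = max(0, donation - PER_ELECTION_LIMIT)
    return 0 if advanced_to_general else general_portion


print(refund_due(7_000, advanced_to_general=False))  # 3500: missed the top two
print(refund_due(7_000, advanced_to_general=True))   # 0: made the general, then lost
```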

Does that mean that donors should donate $3,500, rather than $7,000? I don’t think so, because the top-two system means that Wiener is likely to have a competitive general election in November 2026. I haven’t done the math, but someone I trust said that the last $3,500 looks a tiny bit better than the first $3,500. (By contrast, for Bores, the last $3,500 looked a lot better than the first $3,500.)

How do I donate?

You can donate through ActBlue at this link!

If you are donating on October 22nd: donations via ActBlue are preferred (credit card, PayPal, Venmo, Google Pay). While ActBlue donations are instant, checks and wire transfers made directly to the campaign’s bank account may take more than 24 hours to receive.

If you are donating after October 22nd: feel free to donate via ActBlue, or consider doing a bank transfer. That’s because ActBlue (the vendor that donations go through) charges a 4% fee: if you donate $7,000, the Wiener campaign will receive $280 less than that. I don’t currently know how to do this, but feel free to reach out and I’ll try to figure it out!

Will my donation be public? What are the career implications of donating?

Yes: donations are a matter of public record. You should take this into account when making your decision. The current administration has been illegally using ideological/loyalty litmus tests for hiring for some roles (e.g. rejecting people who say that Biden won the 2020 election). My best guess is that they are also discriminating on the basis of political donations, at least for some roles. So if you think you might want a role in the federal government, and you don’t already have a recent history of donating to Democrats, you should take this into account.

You can read my discussion of career capital considerations here, and see here for another good post on the matter.

If you’ve already donated to Alex Bores, I think there isn’t much cause for concern, with one exception. If you have the sort of federal policy role where donating against an incumbent might hurt your career, or think there’s a good chance that you will be in such a role in the near future, consider waiting to donate until Nancy Pelosi, the incumbent, announces retirement.

[Edit: Additionally, I’ve only just learned of some potential concerns regarding personal information being revealed via political donations. I expect that this is not a concern for the vast majority of people; however, if you are particularly concerned about privacy, feel free to reach out and we can chat.]

Potential concerns

Are these requests for donations just going to continue?

I really doubt it. I specifically called out Scott Wiener in my Bores post, saying that he was the one other AI safety champion that I was super excited about. I can’t think of anyone who comes close and who might run for Congress in the 2026 cycle. (I know another AI safety champion who’s currently running for Congress, but I’m only recommending that as a second-tier donation.)

In my Alex Bores post, I said that I thought Bores was a once-in-two-years opportunity and Scott Wiener was a once-a-year opportunity. I stand by that – and in particular, I wouldn’t be surprised if I end up recommending two or three more donations in 2027-28 about as strongly as I’m recommending Wiener/Bores.

I’d be kind of surprised if I recommended any donation this strongly in 2025; nothing’s on my radar. Maybe something will come up in 2026? Who knows. But I definitely don’t anticipate regularly suggesting donations with this level of strength.

(One possible way that I could turn out to be wrong: if there’s a Republican AI safety champion – one that isn’t quite as aligned on AI safety, but who would be super valuable to have in Congress because they’re a Republican – I think that might end up being an amazing donation opportunity. I’m not aware of any such person who might run in 2026, but it wouldn’t completely shock me. 2028 seems more plausible.)

Other concerns

See “Potential concerns” from my Bores post, including:

What if Bores loses?
What about the press coverage?
Feeling rushed?

There are nuances, but I think these same concerns are at play here, to about the same degree.

Appendix: Cost-effectiveness analysis details

Baseline unit: comparison to Alex Bores

The target audience of this doc is people who have already been convinced to give to Alex Bores, and are considering whether to also donate to Scott Wiener. As such, my units here will be Bores-day-1-dollars per dollar (B$/$): so for example, if donating to Wiener on day 1 is 50% as altruistically good as donating to Bores on day 1, that would mean that donating to Wiener on day 1 is a 0.5 B$/$ opportunity.

(As a reminder: I estimated that an extra $85,000 donated to Bores on day 1 would increase his chances of winning by about 1%.)

Factors that I’ll consider

Here are the various mechanisms by which donating to Scott Wiener, or holding off until Pelosi makes her decision, could have positive effects:

Effect of fundraising numbers on whether Wiener wins the race (conditioning on Pelosi retiring or not).
Effect of fundraising numbers on whether Pelosi retires.
Effect of fundraising numbers on strengthening ties between Wiener and AI safety advocates.
[Only relevant when considering the option of waiting to see whether Pelosi retires] Counterfactual usefulness of donors’ money: holding off on donating gives donors the ability to use their money in a different way if Pelosi stays in.

I will use the following numbers:

Probability Pelosi retires: 60%
How good it would be to elect Wiener, compared to electing Bores: 87.5%. This comes from three numbers:
- Above I said that “all things considered, my guess is that having Wiener in Congress would be somewhat less valuable than having Bores in Congress when AI isn’t super salient, and about as valuable when AI is really salient.” Overall, I’d guess that having Wiener in Congress is about 70% as good as having Bores in Congress, for AI safety.
- My guess that about half of Bores’ impact comes from him being a representative, compared to potentially holding higher office in the future
- My guess that Wiener is somewhat (1.5x?) more likely to be elected to higher office in the future
- The calculation here is .5*.7 + .5*1.5*.7 = .875
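A minimal Python sketch of the 87.5% figure, assuming (as the formula above implies) that the half of impact not tied to the House seat comes from possible higher office and scales with the 1.5x multiplier. The constant and function names are hypothetical; the inputs are the guesses listed above.

```python
# Hypothetical names; the inputs are the post's rough guesses.
WIENER_REP_VALUE = 0.7          # Wiener in Congress ~70% as good as Bores in Congress
HOUSE_SEAT_SHARE = 0.5          # ~half of Bores' impact comes from being a representative
HIGHER_OFFICE_MULTIPLIER = 1.5  # Wiener ~1.5x as likely to reach higher office


def value_vs_bores(rep_value: float, house_share: float, higher_office_mult: float) -> float:
    """Value of electing Wiener, in units of 'value of electing Bores'."""
    from_house_seat = house_share * rep_value
    # The remaining share of impact comes from possible higher office.
    from_higher_office = (1 - house_share) * higher_office_mult * rep_value
    return from_house_seat + from_higher_office


value_ratio = value_vs_bores(WIENER_REP_VALUE, HOUSE_SEAT_SHARE, HIGHER_OFFICE_MULTIPLIER)
print(f"{value_ratio:.3f}")  # 0.875
```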

Donating on launch day

1. Effect of fundraising numbers on whether Wiener wins the race

I estimated that raising $85k for Bores would change his probability of winning by about 1%.[7] How does that compare to Wiener?

Some factors that make money given to Wiener look better:

Conditional on Pelosi retiring, Wiener’s probability of winning is closer to 50% (my best guess is 45%, compared to Bores’ 20%). If you think of money as pushing Wiener further to the right on the logistic S-curve of win probability, that makes a greater difference in win probability than it does for Bores.
SF is a cheaper place to run ads than New York, so a dollar goes further.

Some factors that make money given to Wiener look worse:

Conditional on Pelosi not retiring, Wiener’s probability is farther from 50% (my guess is 10%), so donations have a smaller effect on P(win): maybe about a third as far as if she retires.
Because Wiener is an excellent fundraiser and already has $1 million, and because SF is a cheaper place to run ads than New York, I think that we are further along the diminishing-returns curve.
I think that endorsements matter less in this race, because Wiener is really well-known in the district. Additionally, dollars have less effect on endorsements, because party elites’ opinions of Wiener are more set in stone.
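The factor of about three between the retire and no-retire worlds can be checked with a few lines of Python. This sketch assumes, as a simplification, that each dollar shifts the input of a standard logistic curve by the same amount, so its effect on P(win) is proportional to the curve's slope, p(1 - p), at the current win probability. Function and variable names are hypothetical.

```python
def logistic_slope(p: float) -> float:
    """Slope of the standard logistic curve at the point where its output equals p."""
    return p * (1 - p)


slope_if_retires = logistic_slope(0.45)  # Wiener ~45% to win if Pelosi retires
slope_if_runs = logistic_slope(0.10)     # ~10% if Pelosi runs
slope_bores = logistic_slope(0.20)       # Bores ~20%

retire_vs_run = slope_if_retires / slope_if_runs
wiener_vs_bores = slope_if_retires / slope_bores
print(f"Wiener, Pelosi retires vs. runs: {retire_vs_run:.2f}x")      # 2.75x, i.e. about 3x
print(f"Wiener (Pelosi retires) vs. Bores: {wiener_vs_bores:.2f}x")  # 1.55x
```

On slope alone, a dollar to Wiener in the retire world moves P(win) about 1.5x as much as a dollar to Bores; the cheaper-ads and diminishing-returns factors listed above then pull that back down.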

Overall, I think these effects roughly cancel out, with maybe money to Wiener looking just slightly worse. I’ll say that money to Wiener goes about 90% as far as money to Bores if Pelosi retires, and 30% as far if she doesn’t.[8] And so the calculation looks like:

[In the 60% of worlds where she retires] 0.9 * 0.875
[In the 40% of worlds where she doesn’t retire] 0.3 * 0.875

This gives us a total of 0.58 B$/$ just from this consideration.
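The same probability-weighted average as a short Python sketch (the variable names are hypothetical; the numbers are the ones above):

```python
P_RETIRE = 0.60          # probability Pelosi retires
VALUE_RATIO = 0.875      # electing Wiener vs. electing Bores
REACH_IF_RETIRES = 0.9   # a dollar goes ~90% as far as a day-1 dollar to Bores
REACH_IF_RUNS = 0.3      # ~30% as far if Pelosi runs

# Probability-weighted reach across the two worlds, scaled by the value of electing Wiener.
term1_now = (P_RETIRE * REACH_IF_RETIRES + (1 - P_RETIRE) * REACH_IF_RUNS) * VALUE_RATIO
print(f"{term1_now:.4f} B$/$")  # 0.5775, which rounds to 0.58
```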

2. Effect of fundraising numbers on whether Pelosi retires

Different people I asked had pretty wildly diverging opinions here, but my all-things-considered guess is that Pelosi is 2 percentage points more likely to retire if Wiener raises $1M more. The mechanism is that if Wiener raises more, Pelosi will be faced with the prospect of a more bruising, competitive primary that she’d rather avoid. But I don’t think this effect is all that strong, because Pelosi is already facing a fairly formidable opponent in Saikat Chakrabarti. Plus, there’s a consideration in the other direction: if Pelosi wants to hold her seat until conditions look favorable for her daughter Christine to win a primary, then Wiener putting up a strong showing would increase the chances that Pelosi would hold the seat for another two years. But I think this is a weaker effect than the positive effect mentioned above.

Given that Wiener is 45% to win if Pelosi retires and 10% to win if she doesn’t, $1M increases Wiener’s probability of winning by (45% – 10%)*2% = 0.7% via this mechanism. By comparison, $1M to Bores raises his probability of winning by about 11%, per my blog post. So that’s another 0.7*0.875/11 = 0.06 B$/$.
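A Python sketch of this term, assuming (as the text does) that the bump in Pelosi's retirement probability is linear in dollars raised over this range. Names are hypothetical.

```python
P_WIN_IF_RETIRES = 0.45
P_WIN_IF_RUNS = 0.10
RETIRE_BUMP_PER_MILLION = 0.02  # +2 points P(Pelosi retires) per extra $1M raised by Wiener
BORES_GAIN_PER_MILLION = 0.11   # $1M to Bores on day 1 -> ~+11 points P(win)
VALUE_RATIO = 0.875

# An extra $1M moves 2% of probability from the "runs" world to the "retires" world.
wiener_gain_per_million = (P_WIN_IF_RETIRES - P_WIN_IF_RUNS) * RETIRE_BUMP_PER_MILLION
term2_now = wiener_gain_per_million * VALUE_RATIO / BORES_GAIN_PER_MILLION
print(f"P(Wiener wins) gain per $1M: {wiener_gain_per_million:.2%}")  # 0.70%
print(f"term 2: {term2_now:.3f} B$/$")  # 0.056, rounded to 0.06 in the text
```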

3. Effect of fundraising numbers on strengthening ties to AI safety advocates

Scott Wiener used a lot of political capital on SB 1047, and I think it’s important to show him that we have his back. I think that elected representatives are somewhat less likely to pick similarly bruising fights if they don’t know that the communities who care about those issues will stand behind them and support them like this.

Feel free to reach out if you want details on my thinking here, but I don’t think this ends up being a huge consideration, relative to the effect of helping Wiener win. My estimate is 0.11 B$/$.

And therefore…

Adding these numbers up, we get 0.75 B$/$.

In other words: I think that giving to Scott Wiener on Wednesday, October 22nd (day 1) is about ¾ as good as it was to give to Alex Bores on Monday (day 1 of his campaign).
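As a sanity check, here is the sum in Python with the unrounded values of the first two terms carried through, rather than the two-decimal figures used above. The ties-to-advocates term (0.11) is a direct judgment call rather than something derived, so it enters as a constant.

```python
# Day-1 value of a dollar to Wiener, in B$/$, from the three terms above.
term1 = (0.6 * 0.9 + 0.4 * 0.3) * 0.875        # whether Wiener wins the race
term2 = (0.45 - 0.10) * 0.02 * 0.875 / 0.11    # whether Pelosi retires
term3 = 0.11                                   # ties with AI safety advocates (judgment call)

total_now = term1 + term2 + term3
print(f"{total_now:.3f} B$/$")  # 0.743
```

The unrounded total is about 0.74 rather than 0.75; the gap comes only from rounding the individual terms and is small relative to the uncertainty in the inputs.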

Donating if Pelosi retires

Pelosi is 60% to retire and donations are better in worlds where she retires, so one option is to wait for her to decide: if she retires, donate; if she runs, don’t.

Conditional on Pelosi retiring, donating later is slightly worse than donating now for boosting Wiener’s chances and slightly worse for boosting how much he feels supported by the AI safety community (and thus, how likely he is to work with us in the future). Plus, donating now makes Pelosi more likely to retire.

On the other hand, conditional on Pelosi not retiring, your money could plausibly be better spent.[9] So how does the math shake out?

The analysis below examines the strategy of waiting, then either donating to Wiener if Pelosi retires (60%) or using the money in the best way if Pelosi doesn’t retire (40%). So the calculation will look like “60% * (some of the terms discussed above) + 40% * (next best use of money)”.

1. Effect of fundraising numbers on whether Wiener wins the race

If Pelosi retires, my guess is that it would be in December or so, but I’m pretty uncertain: it could be in November, or in early 2026.

My guess is that money given when Pelosi retires is somewhat worse than money given now (see this section of my Alex Bores post for why). It would range from a little worse (10% worse) if she retires soon, to a lot worse (40% worse) if she retires late.[10] In expectation, I think that conditional on Pelosi retiring eventually, giving money to Wiener when she retires looks about 20% worse than giving money to Wiener now.

So above I wrote:

[In the 60% of worlds where she retires] 0.9 * 0.875

And now that’s:

[In the 60% of worlds where she retires] 0.9 * 0.8 * 0.875 = 0.63

And so, this effect contributes 60% * 0.63 = 0.38 B$/$.
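In Python, the waiting version of this term adds the late-money discount and drops the "Pelosi runs" branch. The single 0.8 factor stands in for the 10% to 40% range described above, collapsed to its expected value; names are hypothetical.

```python
P_RETIRE = 0.60
VALUE_RATIO = 0.875
REACH_IF_RETIRES = 0.9
LATE_MONEY_DISCOUNT = 0.8  # money given when Pelosi retires is ~20% worse than money now

# When waiting, only the "retires" world contributes to this term.
value_if_retires = REACH_IF_RETIRES * LATE_MONEY_DISCOUNT * VALUE_RATIO
term1_wait = P_RETIRE * value_if_retires
print(f"{value_if_retires:.2f}")  # 0.63
print(f"{term1_wait:.3f} B$/$")   # 0.378, rounded to 0.38 in the text
```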

2. Effect of fundraising numbers on whether Pelosi retires

If you don’t give until Pelosi retires, the effect here is 0 B$/$.

3. Effect of fundraising numbers on strengthening ties to AI safety advocates

I think that raising money later would have a somewhat weaker effect, because Wiener would probably not appreciate it quite as much. I’ll call the effect size here 0.08 B$/$ (compared to 0.11 B$/$ above).

4. Counterfactual usefulness of your money

If you wait and only donate if Pelosi retires, then, in the 40% of worlds where Pelosi doesn’t retire, you’ll get to spend your money on other things you value! How good the counterfactual use of your money is depends on your values and beliefs, but let’s denominate it in terms of “fraction as good as donating to Alex Bores”. Let’s call this fraction c (so c = 0 means you don’t value your money at all, and c = 1 means you could use a dollar that you’re thinking of giving to Scott Wiener in order to do something that you value as much as you value giving a dollar to Bores).

For what it’s worth, in my view:

If you’re plugged into efforts to build relationships between members of Congress and AI safety advocates, and would give your money to that instead, then I think that using c = 0.5 is about right.
In my view, pretty much all ways to give away money are much worse than giving to Bores (c < 0.2). One exception to this might be efforts to preserve democracy in the United States, which I haven’t evaluated closely but which look pretty good to me.
If this is funging against money that you would have spent on personal stuff or savings, I’ll leave it up to you to decide how to weigh that!

And therefore…

Adding these numbers together, we get 0.46 + 0.4c B$/$.

So, you should hold off on donating if 0.46 + 0.4c > 0.75, which happens if c > 0.72.

(But note that if you defer to my analysis but your personal c is greater than 0.77, then you shouldn’t be donating to Scott Wiener at all! You should be using your money for whatever thing is more than 77% as good as day-1 donations to Bores.)

This means that my analysis suggests that giving to Wiener now pretty much dominates the strategy of waiting and giving to Wiener if Pelosi retires.
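Both thresholds come from setting two linear expressions equal. A minimal sketch, using the rounded figures from the text (0.75 for donating now; 0.46 + 0.4c for waiting); the names are hypothetical.

```python
DONATE_NOW = 0.75      # B$/$ from giving to Wiener on day 1
WAIT_FIXED = 0.46      # terms 1-3 of the wait-and-see strategy (0.38 + 0 + 0.08)
P_PELOSI_RUNS = 0.40   # share of worlds where your money goes to its next-best use


def wait_value(c: float) -> float:
    """Value of waiting, in B$/$, given the counterfactual value c of your money."""
    return WAIT_FIXED + P_PELOSI_RUNS * c


# Waiting beats donating now when wait_value(c) > DONATE_NOW.
c_wait_beats_now = (DONATE_NOW - WAIT_FIXED) / P_PELOSI_RUNS
# Waiting beats simply using the money elsewhere only while wait_value(c) > c.
c_skip_wiener = WAIT_FIXED / (1 - P_PELOSI_RUNS)

print(f"wait instead of donating now if c > {c_wait_beats_now:.3f}")  # 0.725
print(f"skip Wiener entirely if c > {c_skip_wiener:.3f}")             # 0.767
```

Waiting only comes out ahead for c between roughly 0.72 and 0.77, a narrow window, which is what drives the conclusion that donating now dominates.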

Plugging in your own Pelosi retirement probability

As indicated above, my recommendation is pretty sensitive to the probability that Pelosi retires. I chose 60% as my all-things-considered estimate, but I’ll let you plug in your own estimate if you have an opinion that differs strongly from mine.

If your probability that Pelosi will retire is p, then donating now comes out to 0.67p + 0.36 B$/$,[11] whereas waiting and donating if Pelosi retires comes out to 0.78p + c(1 – p) B$/$.

If you think that Pelosi is only 40% to retire, then you should wait to donate if your c is greater than 0.53. If you think that Pelosi is only 20% to retire, then you should wait to donate if your c is greater than 0.42.
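A sketch that reproduces these thresholds from the two linear formulas above (which, per the footnote, only hold for p not too close to 0 or 1). The function names are hypothetical.

```python
def donate_now(p: float) -> float:
    """B$/$ from giving to Wiener now, as a function of P(Pelosi retires)."""
    return 0.67 * p + 0.36


def wait_then_decide(p: float, c: float) -> float:
    """B$/$ from waiting: donate if Pelosi retires, else use the money elsewhere (worth c)."""
    return 0.78 * p + c * (1 - p)


def c_threshold(p: float) -> float:
    """Counterfactual value c above which waiting beats donating now."""
    # Solve donate_now(p) = 0.78 * p + c * (1 - p) for c.
    return (donate_now(p) - 0.78 * p) / (1 - p)


for p in (0.4, 0.2):
    print(f"P(retire) = {p:.0%}: wait if c > {c_threshold(p):.2f}")
```

This prints `P(retire) = 40%: wait if c > 0.53` and `P(retire) = 20%: wait if c > 0.42`, matching the figures above.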

[1] As of this writing, there is an atrociously worded Kalshi market called “Will Nancy Pelosi retire before the midterms?” that’s pretty illiquid and is trading at 22%. The question is whether this market is about Pelosi announcing before the election that she will retire/resign before the end of her term or if it’s about Pelosi announcing before the election that she will retire before or at the end of her term. As far as I can tell, a plain reading of the rules summary suggests the latter (which is what we care about). However, I asked my friend Jesse, who is one of the top traders on prediction markets, and he told me his intuition was that only the former counts (so it would only resolve YES if Pelosi left her seat early). He then asked two people who he thought were particularly good at reading Kalshi rules, and they agreed. He pointed me to a similar market on Mitch McConnell, where the rules are exactly the same (they point to the same document), but the rules summary is clear (“resign” cannot refer to retirement) – plus it’s clear from the market price, since McConnell has announced that he won’t seek reelection.

Another instructive market is this one, which is really illiquid but asks whether Pelosi will come in the top two in the June primary (which is basically equivalent to her deciding to run). This market technically implies that Pelosi is at least 85% to retire, though I basically wouldn’t read anything into that: it’s extremely illiquid. That said, there are arbitrages worth tens of dollars under an interpretation of the first market that would imply that Pelosi is 78% to seek reelection. I think this is some more evidence that people are interpreting the first market in the way that Jesse did.
[2] In fairness, some of that vote was Republican anti-Pelosi vote, and Wiener would need a lot of backing from Democrats to beat Pelosi. But someone I trust looked at the numbers in more detail and estimated that the challenger got 17% of voters who voted for Biden in that election.
[3] I think this despite California being a bigger state than New York.
[4] By contrast, if we found out today that Pelosi will retire, then day-1 donations to Wiener look almost exactly as good as day-1 donations to Bores. (My point estimate is 97% as good.)
[5] That’s the solution to the equation 46% + 40%*c = c, i.e. the indifference point between donating later and doing whatever you were going to use your money on instead.
[6] Some green card holders have expressed concern to me that they might have problems with immigration/naturalization if they donate. The Trump administration has really shocked me with its lawbreaking; nonetheless, I think that the Trump administration is currently pretty far from being so brazen as to systematically deny people citizenship because of their political donations. I can’t completely rule it out, though; I think that donating to Democrats could have a 1-2% chance of causing someone to be unable to become a citizen. In most of those worlds, the U.S. looks much more like a dictatorship than it does now.
[7] There’s been a little less coverage of Bores’ fundraising number than I’d have expected – though I’ve heard that the number has been mentioned in New York Politico – so my current guess is $90k/1% rather than $85k/1%.
[8] This factor of 3 difference is based on the slope of the logistic curve at y = 10% vs. y = 45%.
[9] Although this isn’t super clear, and I might end up recommending that people donate in a world where Pelosi doesn’t retire but the race ends up looking close.
[10] If you read my post on Bores closely, you might notice that the numbers I used there would have implied a 20%-60% range here, rather than 10%-40%. The main reason for the lower numbers is that I expect a larger fraction of the usefulness of Wiener’s money to come from advertising rather than signaling campaign strength, so early money is still better, but by less.
[11] Assuming that p is not too close to 0 or 1. The “effect of donations on probability that Pelosi retires” term acts a little weirdly at the extremes.


Consider donating to Alex Bores, author of the RAISE Act

Eric Neyman Oct 20, 2025


Written in my personal capacity. The views expressed here are my own. Thanks to Zach Stein-Perlman, Jesse Richardson, and many others for comments.

[Link to donate here — please use this link rather than going to his website — but please read at least the first few paragraphs!]

Over the last several years, I’ve written a bunch of posts about politics and political donations. In this post, I’ll tell you about one of the best donation opportunities that I’ve ever encountered: donating to Alex Bores, who announced his campaign for Congress today.

If you’re potentially interested in donating to Bores, my suggestion would be to:

Read this post to understand the case for donating to Alex Bores.
Understand that political donations are a matter of public record, and that this may have career implications. Decide if you are willing to donate to Alex Bores anyway.
If you would like to donate to Alex Bores: donations today, Monday, Oct 20th, are especially valuable. You can donate at this link. (Please use this link, rather than going to Bores’ website to donate – that lets him know that the donation is from someone who cares about AI safety!)

Or if you’re just curious, read whatever parts of the post you find most interesting!

Introduction

In June, Zvi Mowshowitz wrote a post about the New York RAISE Act. I’d encourage you to read Zvi’s whole post (and I’ll include a brief summary below), but Zvi’s bottom line was:

The bill is insufficient on its own but an important improvement upon the status quo. I strongly support this bill.

The RAISE Act is one of only a handful of bills that are specifically aimed at mitigating catastrophic and existential risks from AI. (The bill passed the New York legislature, and will likely be signed by the governor in the coming months.[1])

The bill’s sponsor is a state legislator named Alex Bores (a Democrat). After seeing California SB 1047 get vetoed, Bores decided to make AI safety his first priority. He sponsored the RAISE Act and fought tooth and nail to get it through the legislature, spending a huge amount of political capital along the way (see below).

Today (Monday, October 20), Alex Bores announced that he’s running for the U.S. House of Representatives. Quoting the New York Times:

Mr. Bores, a former software engineer, said the rapid pace of technological advancement and its implications for American democracy compelled him to run. He pointed to President Trump’s close relationships with wealthy tech executives as well as the omnipresence of artificial intelligence and other software that can distort reality, saying he felt that most Democratic leaders were ill-equipped to contend with such challenges.

Although he isn’t considered the favorite, I think he has a decent chance of winning. For reasons that I go into below, I’m really excited about his run for Congress, and think that electing Bores would be really beneficial for having sensible AI regulation in the United States. I also think that giving to Bores’ campaign is one of the best opportunities I’ve seen so far for mitigating existential risk from AI. I plan to donate $7,000 to his campaign (the legal maximum), and I think many readers of this post should too.

Note that giving to Bores on Monday, October 20th, is significantly more valuable than giving later. This is because campaigns often come out with a press release announcing how much they raised on day 1. This helps the candidate secure endorsements. (See more details here.)

(If you’re hearing about Bores for the first time today, October 20th, and are feeling rushed to make a donation: yup, I’m sorry about that. See this section of the post for some thoughts about that.)

If you’re already convinced to donate to Alex Bores, you can donate here (and thank you so much!). My donation suggestion is $7,000 (the maximum), if you can afford it. (But note: before deciding to donate, please be aware that political donations are a matter of public record! If you’re potentially interested in working in the federal government, this could have career implications for you. I discuss those implications below.)

[Edit 4/9/2026: I now think that the correct donation size is $3,500, in light of the fact that the last quarterly reporting deadline before Bores’ primary election has passed.]

If you want to know more about the case for donating, read the rest of this post.

(Note: my goal is to provide an unbiased best-guess assessment of this donation opportunity, and I’m worried about people being more reluctant to share arguments against donating than arguments in favor of donating. If you think there are substantial downsides to donating that I haven’t touched on, please leave a comment, or email me.)

Before doing an explicit cost-effectiveness analysis, I want to mention some things that I particularly like about Alex Bores.

Things I like about Alex Bores

The most important thing I like about Alex Bores is that he has a great track record on AI safety. Concretely:

He sponsored the RAISE Act. In a nutshell, the RAISE Act requires large model developers (ones with >$100 million training runs) to:
- Report “safety incidents” to the attorney general and Division of Homeland Security and Emergency Services. (This is similar to how companies must report cybersecurity breaches.)
- Have and follow a safety plan to prevent severe risks.
- Not release models that would create an unreasonable risk of “critical harm” (roughly speaking: the death or serious injury of one hundred or more people, or at least one billion dollars of damages).
He fought tooth and nail to get the RAISE Act through the legislature, spending a huge amount of political capital along the way.
- In the New York legislature, not all bills get voted on. A bill only gets voted on if the NY Senate majority leader and the Speaker of the NY Assembly decide to put it up for a vote. In this case, putting the bill up for a vote was politically costly: that’s because the lobbying arms of big-tech corporations were spending huge amounts of money to sink the bill, with an implicit threat to spend money against legislators who would help the bill pass. Ordinarily, such opposition would have doomed the bill, even though the bill had overwhelming support. However, particularly effective legislators (and Alex Bores is one of them[2]) can build up enough political capital with the legislative leadership that they can occasionally spend that capital on bills that they particularly care about. Bores did that with the RAISE Act.
He understood the ins and outs of the bill really well, as demonstrated in his handling of adversarial questions on the Assembly floor and his appearance on the Cognitive Revolution podcast.
- Note that this is relatively uncommon: most legislators don’t understand the AI bills that they sponsor particularly well.
I have talked to several AI policy experts who have worked with Bores. They have been consistently impressed with his understanding of the technical details of AI governance, and his dedication to making sure that advanced AI is developed safely.

While this is the most important consideration for me, there are other things I like about him, too:

I recently met him, and he struck me as an unusually honest person. I’ve chatted with several high-profile politicians, and while none of them were dishonest with me, they sometimes tried to modulate their language to try to please me. I didn’t get that sense at all from Alex. He told me exactly what he thought on every issue I asked him about, in a way that I really appreciated.
One of his top priorities (in addition to AI safety) is safeguarding American democracy and preventing democratic backsliding. He plans to do that by trying to pass legislation that limits the power of the presidency. I’m really glad that he plans to prioritize this.
He cares about animal welfare. For example, he led a budget letter[3] to establish an alternative protein research center in NY. He is a vegetarian.
He co-sponsored a bill to institute a pilot program for implementing a land value tax in New York. He also just introduced a bill to prevent sportsbooks from limiting winning players (which I consider to be a good idea for reasons described here). This makes me think that Bores is more generally a technocrat who will support important, carefully considered policies.
He has a master’s degree in computer science. (In fact, he is the first New York state legislator to have any degree in computer science!) I think that his technical background is pretty useful for evaluating tech regulation bills.

Are there any things about Bores that give me pause?

Alex Bores worked at Palantir for a few years. I think that many of Palantir’s activities are pretty evil. That said, as far as I know, Bores did not work on any such projects while at Palantir, and he says that he left Palantir over concerns about its collaboration with ICE.

Bores also highlights his ties to unions and support for organized labor. I generally have mixed feelings about unions, and support for organized labor is often a sign that a politician is pretty left-wing (whereas I’m center-left). That said, my overall impression is that Bores is center-left, rather than left-wing.

So: maybe some really minor things? But, pretty much no.

Cost-effectiveness analysis

This section is broken down into two subsections:

How does an extra $1k affect Alex Bores’ chances of winning his race?
How good is it if Alex Bores wins his race?

My estimate of #1 will be much more precise, because it’s a pretty straightforward statistical modeling question, but the two parts are equally important to the analysis.

How does an extra $1k affect Alex Bores’ chances of winning?

My all-things-considered guess is that Alex Bores has a 20% chance of winning. My best guess is that a marginal $85,000 donated on Monday, October 20th raises his chances of winning by 1%. (My 50% confidence interval is something like [$40k, $170k].) I also estimate that a marginal $105,000 donated sometime in 2025 but after October 20th raises Bores’ chances of winning by 1%. This breaks down as follows:

I estimate a 1.6% chance that Bores loses his primary election, but by fewer than 1000 votes.
I estimate that the Bores campaign could spend a marginal $300 on advertising in order to attract the support of one more voter.[4] (Note: low confidence; I wouldn’t be surprised if this is off by a factor of 2 in either direction.)
I estimate that money given in 2025 is 2x more valuable than money given in 2026, because showing strength in early fundraising numbers consolidates party support (or prevents opponents from consolidating support), and that money given on Monday, October 20th is 2.5x more valuable than money given in 2026.
- Erik Bottcher, one of Alex Bores’ primary opponents, raised $683k on his first day (a New York state record), which generated positive headlines and established him as a serious candidate. The buzz around Alex’s candidacy announcement will be qualitatively different if he can beat Bottcher’s number, especially if he beats it by a lot. I would guess that Bores will beat Bottcher’s number, but I don’t think it’s a slam-dunk.
I adjust my cost-effectiveness estimate downward by 10% to take into account the possibility that the big tech super PAC might get involved (or spend more) if Bores raises a lot of money.

Multiplying these numbers together gives the estimate above. I’m not sure how interested readers will be in the details, so I’ve put my reasoning in an appendix.
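The appendix with the detailed reasoning isn't reproduced here, but the listed numbers do multiply out to the headline figures under one straightforward reading. The sketch below is that reading, not necessarily the exact model behind the estimate: it assumes the 1.6% close-loss probability is spread evenly over the 1,000-vote window (so each extra vote flips the result with probability 1.6%/1000), treats the $300-per-vote cost as 2026 money, and applies the super PAC adjustment as a 10% cut in effectiveness. All names are hypothetical.

```python
# Hypothetical names; the inputs are the estimates listed above.
P_CLOSE_LOSS = 0.016       # chance Bores loses the primary by fewer than 1000 votes
CLOSE_LOSS_WINDOW = 1000   # width of that margin, in votes
DOLLARS_PER_VOTE = 300     # marginal ad spending to win over one more voter (2026 money)
MULT_DAY_ONE = 2.5         # value of launch-day money relative to 2026 money
MULT_LATER_2025 = 2.0      # value of other 2025 money relative to 2026 money
SUPER_PAC_FACTOR = 0.9     # 10% cut for a possible big-tech super PAC response


def dollars_per_point(timing_mult: float) -> float:
    """Dollars needed to raise Bores' win probability by one percentage point."""
    # Assumption: close losses are spread evenly over the 1000-vote window,
    # so one extra vote flips the outcome with probability 1.6% / 1000.
    p_flip_per_vote = P_CLOSE_LOSS / CLOSE_LOSS_WINDOW
    p_win_per_dollar = p_flip_per_vote / DOLLARS_PER_VOTE * timing_mult * SUPER_PAC_FACTOR
    return 0.01 / p_win_per_dollar


print(f"launch day: ${dollars_per_point(MULT_DAY_ONE):,.0f} per point")
print(f"later 2025: ${dollars_per_point(MULT_LATER_2025):,.0f} per point")
```

This gives about $83,000 per percentage point for launch-day money and about $104,000 for later 2025 money, close to the $85,000 and $105,000 quoted above.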

How good is it if Alex Bores wins?

So, suppose that you believe my estimate that a marginal $85,000 given on Alex Bores’ launch day increases his chances of winning by 1%. How effective is such a donation?

We need to think about how much impact we might expect Bores to have, compared to alternative ways that we could use $8.5 million (100 times the $85,000 that buys one percentage point).[5] To do that, I’ll start by listing what I consider to be the most important outcomes from Bores winning this race.

Direct influence on legislation

I see three main ways that Alex Bores could come to directly influence legislation in a way that benefits AI safety.

The first of these is influencing big spending bills on the margin, to direct more government money toward AI safety. Every year, the government passes a $7 trillion budget; in many years, it additionally passes one other multi-trillion-dollar spending bill. These bills are big enough that individual representatives can often influence how a small fraction (but still huge absolute amount) of the money is spent. A representative whose first priority is AI safety could plausibly cause a line item allocating $1 billion to AI safety to be included in the budget.

The second of these is influencing AI legislation: either by including safety-related provisions in sweeping AI legislation, or by making safety-specific legislation more sensible/targeted at catastrophic risks. If Bores wins, he might be the only member of Congress who chooses to prioritize AI safety, and who has any amount of expertise on the matter. This definitely doesn’t mean that he’ll get his way on AI safety issues, but I think that he will be listened to, and will have some influence.

The third is sponsoring major AI safety legislation. While we are not currently in a political environment in which major AI safety legislation can pass, this could easily change, either because people become more scared of AI, or because we get a presidential administration that is more favorable to AI regulation (e.g. in 2029). If that happens, Alex Bores could potentially sponsor and advocate for a major piece of AI safety legislation (analogous to California’s SB 1047) at the federal level. Bores says that he will pursue AI safety policies federally, if elected.

The House is a first step toward even more influential positions

Alex Bores is running for a seat in the House of Representatives, but many House members later go on to serve in even higher-ranking roles.

I’d guess that the conditional probabilities of Bores eventually taking these roles (if he gets elected to the House) are something like:

Senator: 4%
Cabinet-level official: 4%
Governor: 2%
President: 0.2%

I think this matters a lot. If Bores becomes a cabinet-level official (e.g. Secretary of Commerce), it’ll likely be a role where he would have considerable influence over the presidential administration’s AI policy.

Encouraging more action in this space

OpenAI president Greg Brockman and a16z co-founder Marc Andreessen recently created a Super PAC to fight against AI regulation, with a $100 million war chest. I expect that big tech will spend millions of dollars against Alex Bores in the primary election. If Bores can demonstrate that it’s possible to win his race even while standing up to big tech, I think that would encourage other legislators to sponsor bills like SB 1047 and the RAISE Act, despite opposition from big tech.

I also think that Bores’ election would encourage more people in the AI safety space to get involved in politics, either by contributing to efforts to promote AI safety via political giving, or by running for office.

How does this compare to other AI safety donation opportunities?

I’ll break other AI safety donation opportunities into three categories: technical AI safety; non-politics AI governance; and politics.

Comparison to technical AI safety

I think donating to Alex Bores compares very favorably to all technical AI safety opportunities that I’m aware of. I believe this for two reasons:

The billionaires have it covered. I think a lot of the low-hanging fruit of funding the best technical AI safety agendas has been picked.
I think the expected amount of counterfactual funding for AI x-risk focused technical safety work is quite large: on the order of $50 million/year if Bores is elected (see “Influencing big spending bills” above). I think these grants won’t be as good as the marginal technical AI safety grants funded by members of this community, but would still be pretty good. And that’s just one of several ways in which electing Bores would be beneficial to AI safety.

Comparison to non-politics AI governance

I think that one of the best interventions in AI governance is skilling up people who are interested in working on AI safety in the federal government (e.g. as a Congressional staffer).

I don’t currently know of great ways to fund such efforts, but I think they are instructive as a comparison point. My admittedly wild guess is that it’s possible to create a counterfactual career in x-risk-focused AI governance for about $500k-$1M. (This assumes that a low double-digit percentage of fellows in such programs counterfactually end up with such a career because of their fellowship.) So $8.5 million could be spent to create about ten AI governance careers. These are marginal AI governance careers, rather than the best ones; I’d guess that ten such careers are probably substantially worse in expectation than having Alex Bores in Congress (though maybe by less than a factor of 10).
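As a quick check on the arithmetic, here is a short Python sketch that divides the $8.5 million figure by the guessed cost range of $500k-$1M per counterfactual career. The cost range is my wild guess from the paragraph above, not data, and the variable names are just illustrative.

```python
# Rough count of counterfactual AI governance careers that $8.5M could create,
# using the guessed cost range of $500k-$1M per career (a guess, not data).
BUDGET = 8_500_000
COST_PER_CAREER_LOW = 500_000
COST_PER_CAREER_HIGH = 1_000_000

def careers_for_budget(budget: float, cost_per_career: float) -> float:
    """Number of careers a budget buys at a given cost per career."""
    return budget / cost_per_career

most = careers_for_budget(BUDGET, COST_PER_CAREER_LOW)     # cheapest case
fewest = careers_for_budget(BUDGET, COST_PER_CAREER_HIGH)  # most expensive case
print(f"between {fewest:.1f} and {most:.1f} careers")
```

The range comes out between roughly 8 and 17 careers, which is where "about ten" comes from.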

Comparison to other political opportunities

Unfortunately there’s only so much I can say here, because most of those opportunities are sensitive. However, I believe that this is the best currently-existing political donation opportunity, by a factor of 2-3 or so.

I think the next-best donation opportunities (after Bores) are ones aimed at relationship-building with existing members of Congress. There are also a couple of other people running for office right now who are plausibly worth donating to, but I think those donation opportunities are worse by a factor of 2-10 or so.

I expect that big opportunities that are as exciting as this one will come up very rarely (maybe once every couple of years). One comparison point: it has been reported that Scott Wiener will run for Congress. I’m super excited about this, and think that we’ll only get big donation opportunities this good once a year or so. Nevertheless, I think that donating to Bores looks a little better, mostly because it’s clearer to me that Bores will continue to prioritize AI safety.[6]

If you’re looking for the best way to spend your charitable donation budget this year, I’m not aware of a better opportunity than this one, and I don’t expect a better one to come up.

Comparison to non-AI safety opportunities

I think that electing Alex Bores to Congress would decrease x-risk by 1 in 5 thousand or so. Of course, this is sensitive to my particular views on AI, but I’ll spell out my calculations so that you can plug in your own numbers.

The US government’s decisions in the lead-up to AGI will have huge effects on the trajectory of AI. The difference in my subjective probability of x-risk depending on whether the US government’s decisions on AI are generally wise or generally unwise is about 5%.
Congress is a big part of that: its decisions don’t matter as much as those of the executive branch, but I think its actions are maybe 10% of the actions of the US government as a whole, weighted by importance.
Having Bores in Congress would plausibly make Congress 2% better on AI, via some of the pathways discussed above.
Plus, I think a lot of Bores’ expected impact (maybe half) would happen through routes other than being in Congress, e.g. being a cabinet secretary. (Thus, 2x multiplier.)

Multiplying these numbers together gives 1 in 5 thousand, which is really quite a lot. I think this donation opportunity compares very favorably with non-AI safety opportunities, even without taking into account future lives.
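Here is the same multiplication written out in Python so you can swap in your own numbers. Each constant is one of my subjective guesses from the list above; the variable names are only labels I chose for this sketch.

```python
# Back-of-the-envelope estimate of the x-risk reduction from electing Bores.
# Each factor is a subjective guess from the list above; plug in your own.
wise_vs_unwise_gap = 0.05    # x-risk gap between wise and unwise US AI decisions
congress_share = 0.10        # Congress's share of US government AI actions, by importance
bores_improvement = 0.02     # how much better Bores would make Congress on AI
non_congress_multiplier = 2  # roughly half his impact comes via other roles

risk_reduction = (
    wise_vs_unwise_gap * congress_share * bores_improvement * non_congress_multiplier
)
print(f"x-risk reduction: {risk_reduction:.4%} (about 1 in {1 / risk_reduction:,.0f})")
```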

Logistics and details of donating

Who can donate?

Any US citizen or permanent resident (e.g. any green card holder[7]) can donate.

How much can I donate?

You can give up to $7,000: $3,500 for the primary election and $3,500 for the general election. If Alex Bores loses the primary election, you will get back any money you donate beyond $3,500.

Note that even though Bores is guaranteed to win the general election conditioned on winning the primary, that second $3,500 is still useful to him, for a couple of reasons. The first is that (as I discussed above) a large fraction of the value of a donation is that it helps Bores signal strength and consolidate support. The second is that representatives who raise more money tend to get better committee assignments, and it seems really useful for Bores to be on e.g. the Committee on Science, Space, and Technology.

In fact, since (by my estimate) Bores is only 20% likely to win the primary, there is an 80% chance that you will get back any money you donate beyond the first $3,500. This means that with probability 80%, the impact of those dollars is “free” (at least if you ignore the time value of money). That impact is smaller than the impact of primary-election dollars, but above I estimated that 50-60% of the value of a donation comes from signaling campaign strength,[8] and that signaling happens whether or not the money is later refunded. For this reason, I believe that, per dollar you expect to actually part with, dollars beyond the first $3,500 are over 2x as valuable as the first $3,500. If you’re donating $3,500 and can afford to donate $7,000, I think it’s likely that you should.

(Note also that this logic still applies if you think that e.g. only 30% of the value of a donation is in signaling campaign strength.)
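To make the "over 2x" claim concrete, here is a minimal Python sketch under simplifying assumptions of my own: a primary-election dollar is worth 1 unit and is never refunded, while a general-election dollar is refunded unless Bores wins the primary and is credited only with its signaling share s. Any value it would add after a primary win is ignored, so this is a lower bound. The function names are hypothetical, chosen for illustration.

```python
# Lower-bound model of how valuable the second $3,500 (general-election money)
# is per dollar you expect to part with, relative to primary money.
# Assumptions (illustrative): a primary dollar is worth 1 unit and is never
# refunded; a general-election dollar is refunded if Bores loses the primary,
# and only its signaling share s is counted as value.

P_WIN_PRIMARY = 0.20  # my estimate of Bores' chance of winning the primary

def signaling_share(early_multiplier: float) -> float:
    """Signaling share implied by an early-vs-late multiplier m: 1 - 1/m."""
    return 1 - 1 / early_multiplier

def relative_value_of_general_dollars(s: float, p_win: float = P_WIN_PRIMARY) -> float:
    """Value per expected dollar of general money, relative to primary money."""
    expected_cost = p_win  # the campaign keeps the money only if he wins the primary
    return s / expected_cost

for s in (signaling_share(2.0), signaling_share(2.5), 0.30):
    print(f"s = {s:.0%}: {relative_value_of_general_dollars(s):.1f}x")
```

With s between 50% and 60% the ratio is 2.5x to 3x, and even at s = 30% it stays above 1x, which is why the logic survives a lower signaling share.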

How do I donate?

You can donate through ActBlue at this link!

If you are donating on October 20th: donations via ActBlue are preferred (credit card, PayPal, Venmo, Google Pay). While ActBlue donations are instant, checks and wire transfers made directly to the campaign’s bank account may take more than 24 hours to receive.

If you are donating after October 20th: feel free to donate via ActBlue, or consider doing a bank transfer. That’s because ActBlue (the vendor that donations go through) charges a 4% fee: if you donate $7,000, the Bores campaign will receive $280 less than that. Don’t hesitate to shoot me an email if you want details on how to do this!

Will my donation be public? What are the career implications of donating?

How big are the risks here? My current understanding (which could be wrong) is that:

If you already have a big or recent donation history to Democrats, you have already paid most of the cost.
If you don’t have a big or recent donation history to Democrats, are strongly considering a role in the federal government in the next few years, have a realistic shot at getting that role, and think that you would be much better than whoever they would hire instead of you, then you should probably refrain from donating.

For most AI safety or (especially) AI policy researchers who haven’t donated to Democrats in the last few years, small donations (e.g. $250) are not worth the cost in career capital. So if you work on AI safety, only donate if you’re donating a lot, or if you already have recent donations to Democrats on record.

Is donating worth the career capital costs in your case?

If you think it’s a close call, here are a few ways to try to figure out what you should do:

Check out the “donor profiles” section below, which gives some examples of potential donor profiles, and my subjective opinion on whether or not it makes sense for them to donate.
Check out this post, which has a questionnaire with items like “Do you have a security clearance? +4” and “Have you worked for at least 2 years in a big AI lab? +1”. If your total score is at least 11, the post’s recommendation is to avoid donating. (Note: I haven’t vetted the post super closely, but overall it seems reasonable to me.)
Check out the “more quantitative cost-benefit analysis” section below. (I think most readers of this post should skip this one.)

Some examples of potential donor profiles

In this section, I’ll give some hypothetical examples of potential donors to Alex Bores, and give my subjective assessment about whether – from an altruistic standpoint – it would make sense for them to donate $7k to Bores, or to forgo a donation to preserve career capital. (Note: these are my best guesses, and are subject to revision.)

(I’ve bolded “from an altruistic standpoint”, because there are of course personal costs to not being able to get a job. I’ll leave it up to readers of this post to price that in as appropriate.)

Technical AI safety researchers

Example profile 1a: Alice is a technical AI safety researcher at an AI company or AI safety nonprofit. She isn’t the head of a safety team. She thinks it isn’t all that likely that she’ll end up in government, but thinks there’s some chance it could happen (e.g. if the lab is nationalized). She hasn’t donated to Democrats before.
- My guess is that it would be altruistically optimal for Alice to donate to Alex Bores and similar opportunities.
Example profile 1b: same as #1a, except Alice is considering quitting her current job and moving to Washington, D.C. to work on AI policy.
- If Alice is only vaguely considering this option: probably she should donate.
- If Alice is strongly considering this option, and is specifically looking at potential jobs in the federal government: probably she should not donate.
Example profile 1c: same as #1a, except Alice is the head of an important safety team.
- I think this case is borderline and probably depends on Alice’s particular circumstances.
Example profile 1d: same as #1c, except that Alice made a $1,000 donation to a Democratic candidate for Congress in 2022.
- I think Alice should probably donate: my guess is that Alice’s 2022 donation means that she has already paid a substantial fraction of the career-capital cost of being a fairly consistent Democratic donor.

AI policy researchers

Example profile 2a: Bob is a junior AI policy researcher in D.C. He’s a couple of years out of college, and just got a job at a nonpartisan think tank, where he researches AI policy. He thinks that AI policy is a good fit for his skills, and expects to continue working on AI policy. He isn’t specifically thinking about roles in the federal government, but thinks there’s a real possibility that he’ll want to work in the federal government in the next few years. Bob isn’t a strong partisan, and hasn’t donated to political candidates before.
- I think Bob should probably not donate.
Example profile 2b: same as #2a, except Bob strongly objects to the Trump administration’s actions, and would find it challenging to work in an environment where lots of people are in favor of the actions that Bob strongly opposes.
- I think Bob probably should donate to Alex Bores. That’s because it’s unlikely that Bob would be a good fit for a policy role in the federal government under Trump.
Example profile 2c: same as #2a, except Bob has some small, recent donations to Democrats on record.
- I think this case is borderline, as Bob has paid some but not all of the cost of donating.
Example profile 2d: same as #2c (recent small donations to Democrats), except that Bob already has several years of experience, is pretty senior at his think tank, and sees a realistic prospect of applying to work in the federal government in the next few years.
- I think Bob should not donate.

A more quantitative cost-benefit analysis

This cost-benefit analysis below will let you decide two parameters:

p, the probability that you will want to try to get a job in the federal government.
r, the benefit to the world of having you in government, as a fraction of the benefit of Alex Bores being elected. (I realize that estimating r may be aversive to some readers; if so, I recommend that you base your decision off the “donor profiles” above instead, or on this post.)

Let’s say that you don’t have any recent donations to Democrats. Then I think that from an altruistic standpoint, it’s worth donating $7,000 to Bores (and donating to similar opportunities in the future) if:

p * r < 2%, for technical AI safety researchers
p * r < 1%, for AI governance researchers

The lower threshold for governance researchers reflects the fact that I expect more discrimination on the basis of donations for policy positions than for technical work.

(If you’re a smaller donor, the threshold should scale proportionally: e.g. an AI safety researcher donating half as much should only donate if p * r < 1%.)

My guess is that for a typical technical AI safety researcher, p is around 5-10% and r is around 1-3%. Assuming the numbers on the higher end, we still only get p * r = 0.3%. (And so, from an altruistic standpoint, I think that donating to Alex Bores is probably the right call for most technical AI safety researchers.) Meanwhile, for AI governance researchers, I think it will generally be a closer call.
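The decision rule above is simple enough to write as a function. This is a minimal Python sketch: the thresholds are my subjective estimates from the text, the proportional scaling with donation size follows the parenthetical above, and the function name `should_donate` is hypothetical.

```python
# Decision rule from the text: donate (and keep donating to similar
# opportunities) if p * r is below a threshold that depends on your field
# and scales proportionally with donation size. Thresholds are my guesses.
BASE_DONATION = 7_000
BASE_THRESHOLDS = {"technical": 0.02, "governance": 0.01}

def should_donate(p: float, r: float, field: str, donation: float = BASE_DONATION) -> bool:
    """p: chance you'll seek a federal job; r: your value in government,
    as a fraction of the value of Bores being elected."""
    threshold = BASE_THRESHOLDS[field] * donation / BASE_DONATION
    return p * r < threshold

# Typical technical researcher, higher end of my ranges: p = 10%, r = 3%.
print(f"p * r = {0.10 * 0.03:.1%}")
print(should_donate(0.10, 0.03, "technical"))
print(should_donate(0.10, 0.03, "technical", donation=3_500))
```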

I’ve put the details of my calculation in this appendix. I suggest that you read my calculation and see if it makes sense. I’m really uncertain of it, and I would really hate for my analysis to have adverse consequences because I made bad assumptions.

Potential concerns

What if Bores loses?

As I mentioned above, Bores isn’t currently considered the favorite in this race, though I do think he has a decent chance of winning. I want to make sure that donors are mindful of this, not because it directly affects the cost-effectiveness of their donation, but because of how donors may feel about donating $7,000, only to watch Bores lose.

And so I want to emphasize to potential donors that Bores is in fact considered the underdog. I think it’s worth donating to him despite that. But if you’re likely to feel burned by making a large donation and seeing him lose – in a way that might affect your likelihood of giving to similar campaigns in the future – then I think that’s worth taking into consideration.

What about the press coverage?

Overall, I expect that if Bores raises a lot, this will generate substantial positive press coverage. This is not to say that all of the coverage will be positive; I want to make sure that potential donors are aware of that.

I think it’s likely that there will be some unwarranted attacks made by the media on Bores’ connections to AI safety. While there are some tail risks here, my guess is that this will be basically fine for him. While fighting for the RAISE Act, Bores adeptly handled public battles with big tech, without shying away from his support for AI x-risk mitigations. I believe that he’s prepared for potentially unfair media coverage, and that he will behave similarly. Overall, these considerations don’t change my bottom line.

Feeling rushed?

If you are learning about Alex Bores for the first time today, Monday, October 20th, then you might be feeling rushed into making a donation without really understanding the case and doing your due diligence.

I’m sorry about that. I understand that asking people to decide to make a $7,000 donation on the same day that they encounter the donation opportunity is a big, unfair ask.

If you feel equipped to take your time today to figure out if you want to make a donation, I encourage you to do so. And if you’d like to chat to me (or someone else) before donating, feel free to email me, and I’ll try to find time to chat with you today!

But if you don’t feel equipped to adequately assess this opportunity today, then I recommend waiting to donate. Yes, I think your donation will be about 20% less effective. However, I think it’s really important to preserve the norm of letting people think carefully and think for themselves about their donations. And I think that donations made after October 20th are still really valuable and are still the best donation opportunity that I’m aware of.

Appendix

Details of the cost-effectiveness analysis of donating to Bores

Probability that Bores loses by fewer than 1000 votes

Bores is running in New York’s 12th congressional district. This is a safe Democratic seat, which means that Bores will win the election if and only if he wins the Democratic primary. The probability that Bores loses the primary by fewer than 1000 votes basically depends on how close the primary is, so we need to understand the dynamics of the race.

Assemblymember Micah Lasher and city council member Erik Bottcher are also running for the seat. I think it’s plausible that the race will attract other prominent candidates, such as Jack Schlossberg (John F. Kennedy’s grandson). This would lower Bores’ chances of winning, but I think it doesn’t substantially affect the cost-effectiveness analysis,[9] so I’ll be assuming that these are the only three candidates.

My guess is that Lasher is more likely than Bores to win the seat, because he was previously the policy director for New York governor Kathy Hochul and an aide to representative Jerry Nadler, who currently represents the district (but is retiring). According to the New York Times, Nadler is likely to endorse Lasher. These would be important endorsements that voters will likely care about. (I think that Bores has about a 20% chance of winning this race, to Lasher’s 50-60%.)

On the other hand, I expect Alex Bores to pick up a substantial number of endorsements of his own. Bores is considered an unusually effective legislator and has earned the respect of his colleagues in the legislature. He also has close ties to labor unions, whose endorsements matter a lot to voters.

Bottcher seems to be a relatively weaker candidate on paper (though I’d expect endorsements from some LGBT groups), but I wouldn’t count him out.

If you forced me to guess what percent of the vote (not probability of winning!) each candidate will get, I’d guess: Lasher 35%, Bores 25%, Bottcher 15%, with 25% scattered among other candidates. But of course there’s huge uncertainty. In such a race, it isn’t surprising if the winner wins by less than 1%, and it also isn’t surprising if the winner wins by 30%. A very naive model (just to get a ballpark estimate) might be something like: the margin of victory is uniform between 0% and 30%; conditional on the margin of victory being less than 1%, there’s a ⅔ chance that Alex Bores is one of the top two candidates; and conditional on that, there’s a 50% chance that he just barely loses (as opposed to just barely wins). That would come out to a 1/90 chance that Bores loses by less than 1%.
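The naive model is just three probabilities multiplied together. Here it is with exact fractions in Python, so the 1/90 comes out without rounding; the variable names are mine.

```python
from fractions import Fraction

# The "very naive model" above, computed with exact fractions.
p_margin_under_1pct = Fraction(1, 30)  # margin uniform on [0%, 30%]
p_bores_in_top_two = Fraction(2, 3)    # given a margin under 1%
p_bores_is_loser = Fraction(1, 2)      # given he's one of the top two

p_close_loss = p_margin_under_1pct * p_bores_in_top_two * p_bores_is_loser
print(p_close_loss)  # 1/90
```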

I did some slightly more sophisticated (but still sketchy) statistical modeling that suggests a 1.45% chance that Bores loses by less than 1% (details in footnote [10]); 1.45% is my all-things-considered guess.
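For readers who want to rerun the more sophisticated model with their own parameters, here is a minimal Monte Carlo sketch using only the Python standard library. It draws vote shares from a Dirichlet distribution with the parameters in footnote 10 (a Dirichlet sample is a vector of independent Gamma(alpha_i, 1) draws normalized to sum to 1). One assumption is mine: the "scattered among other candidates" bucket is treated as not being a single candidate, so it can never be the winner. Treat this as a way to plug in your own numbers rather than as an exact reproduction of the 1.45% figure; the function names are hypothetical.

```python
import random

# Monte Carlo sketch of the Dirichlet vote-share model (stdlib only).
# A Dirichlet sample = independent Gamma(alpha_i, 1) draws, normalized.
ALPHAS = {"Lasher": 3.5, "Bores": 2.5, "Bottcher": 1.5, "other": 2.5}
# Assumption: "other" is votes scattered among minor candidates, so it can't win.
CANDIDATES = ("Lasher", "Bores", "Bottcher")

def sample_shares(rng: random.Random) -> dict:
    """Draw one set of vote shares from the Dirichlet distribution."""
    draws = {name: rng.gammavariate(a, 1.0) for name, a in ALPHAS.items()}
    total = sum(draws.values())
    return {name: d / total for name, d in draws.items()}

def p_bores_narrow_loss(n: int = 50_000, margin: float = 0.01, seed: int = 0) -> float:
    """Fraction of simulated races in which Bores loses by less than `margin`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        s = sample_shares(rng)
        winner_share = max(s[c] for c in CANDIDATES if c != "Bores")
        if 0 < winner_share - s["Bores"] < margin:
            hits += 1
    return hits / n

print(f"{p_bores_narrow_loss():.2%}")
```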

Based on historical data, I’m expecting about 90,000 votes in the primary. This means that 1,000 votes is about 1.1%; so this gives a probability of about 1.6% for the event that Bores loses by less than 1,000 votes.

How much marginal funding would net Bores an extra vote?

This is the hardest part to estimate. But the heuristic I usually use is that in a primary election that doesn’t get a huge amount of media attention, campaigns can spend a marginal $100 on ads and flyers[11] in order to attract the support of one more voter.[12] For this race, I’m tripling that to $300, for a couple of reasons:

I expect this to be a better-funded race than most, because Bores’ constituents are some of the richest people in the United States (residents of the Upper East Side of Manhattan). A campaign’s 3-millionth dollar is somewhat less valuable than a campaign’s 1-millionth dollar.
New York is an unusually expensive place to run TV ads,[13] so each dollar buys less.

Where does the $100 number come from? Sorry, I don’t have a good source. Most online sources give numbers that are very obviously too optimistic. The $100 number comes from some combination of:

Thinking about how much money tends to be spent on Congressional primaries and using some intuitions I’ve gathered from watching politics about how much a campaign benefits from advertising.
Deferring to numbers thrown around casually by experts.

This means that you shouldn’t trust my numbers very much. But I have sanity-checked this number with some people I trust, and I would guess that most people reading this won’t have a better source to defer to.

This means that an extra $300,000 would better position the campaign such that Alex Bores would be able to net an extra 1000 votes in expectation, which (as per my earlier estimate) has a 1.6% chance of counterfactually winning him the election. That would translate to $190,000 for a 1% increase in his chance of winning… but earlier I claimed that $75,000 donated on launch day would translate to a 1% increase in his chance of winning. Why the discrepancy? This is because a lot of the value of campaign contributions is about signaling strength to consolidate support – as I describe below.
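The chain from "1.45% chance of losing by under 1%" to "dollars per 1% increase in his chance of winning" can be written as a few lines of Python. One assumption is worth flagging: rescaling from a 1% margin to a 1,000-vote margin linearly treats the probability of a narrow loss as roughly proportional to the margin width, as the text implicitly does. The constant names are mine.

```python
# From "chance of a narrow loss" to "dollars per 1% increase in Bores'
# chance of winning", using the rough estimates in the text.
P_LOSS_WITHIN_1PCT = 0.0145  # from the Dirichlet model
EXPECTED_VOTES = 90_000      # expected primary turnout
DOLLARS_PER_VOTE = 300       # $100 heuristic, tripled for this race
VOTE_MARGIN = 1_000

# 1,000 votes is about 1.1% of turnout; scale the narrow-loss probability
# linearly (assumes its density is roughly flat near a zero margin).
p_loss_within_margin = P_LOSS_WITHIN_1PCT * (VOTE_MARGIN / EXPECTED_VOTES) / 0.01

# $300k buys about 1,000 extra votes, which flips exactly those close losses.
cost_of_margin = DOLLARS_PER_VOTE * VOTE_MARGIN
dollars_per_pct = cost_of_margin / (p_loss_within_margin * 100)

print(f"P(lose by < {VOTE_MARGIN:,} votes) = {p_loss_within_margin:.1%}")
print(f"about ${dollars_per_pct:,.0f} per 1% increase in chance of winning")
```

Without intermediate rounding this lands a little under $190,000; rounding the probability to 1.6% first gives $187,500, which the text rounds to $190,000.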

Early donations help consolidate support

One of the most important things for winning a primary election is getting endorsements from people and groups that voters trust, such as members of Congress, state legislators, and unions. Raising lots of money is really helpful for this goal: politicians and unions are much more likely to endorse a candidate that they perceive as in the running, especially if they’re likely to win. That’s because endorsements are “wasted” if they’re given to candidates who don’t have a chance of winning, and also it’s because it’s useful to have good relationships with elected officials (so endorsing someone who then goes on to win is valuable). Additionally, if a candidate raises a large amount of money early, other potential candidates might decide not to run.

From talking to political operatives, my sense is that money donated early on is considered much more valuable than money donated later, for this reason. It’s hard for me to put a number on how much more valuable it is, but my method was basically:

Taking the typical numbers I’ve heard from political operatives (typically in the 2x-4x range).
Going with the lower end of that range (because political operatives have an incentive to try to convince people that it’s urgent to donate their money now rather than later).

I think that money donated on launch day is somewhat more useful than money donated later, because campaigns typically come out with a press release announcing how much they raised on day 1. My sense is that these press releases are actually pretty important. So I’ve decided on a 2x multiplier for money donated before the end of 2025[14] and 2.5x for money donated on day 1.

So we need to divide $190,000 by 2.5 and 2, respectively, to estimate the cost-effectiveness of donating on launch day and later in 2025. This gives us about $75k and $95k, respectively.
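In code, this step starts from the roughly $190,000-per-1% figure for money that only buys ads, then divides by the early-money multipliers chosen above (2.5x for launch day, 2x for later in 2025). The multipliers are my judgment calls, not measured quantities.

```python
# Convert the ad-spending-only estimate into early-donation estimates by
# dividing by how much more valuable early money is (my chosen multipliers).
AD_ONLY_DOLLARS_PER_PCT = 190_000
EARLY_MULTIPLIERS = {"launch day": 2.5, "later in 2025": 2.0}

early_dollars_per_pct = {
    when: AD_ONLY_DOLLARS_PER_PCT / m for when, m in EARLY_MULTIPLIERS.items()
}
for when, dollars in early_dollars_per_pct.items():
    print(f"{when}: about ${dollars:,.0f} per 1%")
```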

One last adjustment: the big tech super PAC

The more Alex Bores raises, the more likely the big tech super PAC is to get involved, and the more we should expect it to spend trying to defeat him.

I don’t have a great understanding of the dynamics here, but I would guess that there’s an:

80% chance that the super PAC gets involved, even if he raises very little.
5% extra chance that the super PAC gets involved if Bores raises an extra $1M on launch day.

In those 80% of worlds, the super PAC probably spends a few hundred thousand dollars more in response to an extra $1M raised on launch day (let’s say $300k), though note that super PAC spending is less effective (maybe by a factor of 4) than early donations, because there is less signaling value in super PAC spending (for getting endorsements etc.), and because super PACs can’t spend in coordination with campaigns. Overall, this consideration decreases the effectiveness of giving to Bores by about 6%.

In those 5% of worlds, the super PAC probably spends $2M or so on the race. This estimate is based on how much money Fairshake, the pro-crypto super PAC that this super PAC is modeled after, spent in Congressional races in 2024. Again, we should divide by a factor of 4 or so, because this spending is less effective than hard dollars. Overall, this consideration decreases the effectiveness of giving to Bores by about 2.5%, for a total of 8.5%.
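Here is a minimal Python sketch of where the 6% and 2.5% come from. I'm reading the $300k and $2M figures as responses to an extra $1M that Bores raises on launch day, with super PAC dollars counted at a quarter of the value of early donations; under that reading the percentages come out as stated. The scenario list and function name are illustrative.

```python
# Super PAC adjustment, per extra $1M that Bores raises on launch day.
# All inputs are my rough guesses from the text.
RAISED = 1_000_000
SUPER_PAC_DISCOUNT = 4  # super PAC dollars ~4x less effective than early donations

scenarios = [
    # (probability, extra super PAC spending in that scenario)
    (0.80, 300_000),    # the super PAC is involved anyway and spends a bit more
    (0.05, 2_000_000),  # Bores' haul is what draws the super PAC in
]

def effectiveness_loss(prob: float, spend: float) -> float:
    """Expected offsetting spend, in early-donation-equivalent dollars, as a share of RAISED."""
    return prob * (spend / SUPER_PAC_DISCOUNT) / RAISED

losses = [effectiveness_loss(p, s) for p, s in scenarios]
print([f"{x:.1%}" for x in losses], f"total {sum(losses):.1%}")
```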

A couple of other considerations:

Spending on Bores funges against other super PAC spending opportunities, which suggests that the 8.5% number above is a (slight) overestimate.
But also, if Bores loses the race, then this will look good for the super PAC (some people will take it as evidence that they should be scared of super PAC spending). Though if Bores wins the race, this looks really bad for the super PAC, because Bores is considered an underdog (Micah Lasher is considered the favorite in the race, because he’s likely to be backed by the outgoing, powerful Congressman Jerry Nadler). Overall, I’d guess that this is a weak-to-moderate consideration in the direction of suggesting that the 8.5% number above is an underestimate.

Taking all of these considerations into account, I would guess that the big tech super PAC makes donations to Alex Bores about 10% less effective.

Taking our previous numbers and multiplying by 10/9 gives us about $85k and $105k, respectively – these are my final estimates.
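The final adjustment in Python: if the super PAC makes donations about 10% less effective, each 1% increase in Bores' chances costs 1 / 0.9 = 10/9 times as much. The launch-day figure comes out near $83k, which I round to about $85k.

```python
# Final adjustment: donations ~10% less effective means each 1% increase
# in Bores' chances costs 1 / (1 - 0.10) = 10/9 times as much.
pre_adjustment = {"launch day": 75_000, "later in 2025": 95_000}
SUPER_PAC_PENALTY = 0.10

final = {when: d / (1 - SUPER_PAC_PENALTY) for when, d in pre_adjustment.items()}
for when, dollars in final.items():
    print(f"{when}: about ${dollars:,.0f} per 1%")
```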

Cost-benefit analysis of donating to Bores vs. adverse career effects

I denominate everything in milliBoreses (mB), which is an amount of good equal to a 0.1% increase in the probability that Bores gets elected.

The philanthropic benefit of donating

If you plan to donate $7k on launch day, that has an effect of +0.7 mB, according to the above numbers.
- But note that I expect there to be occasional (if rare) similar opportunities in the future. If your bar for political donations is roughly at Bores’ level or a little lower, I expect the total impact of donating to all such opportunities to be roughly 3x the impact of donating only to Bores.
- And so, if you follow the policy of donating a lot to safety-aligned political candidates, but only the very best ones, that policy will have an effect of +2 mB.
(Since I think that non-political donation opportunities are much worse than political ones, I think it’s fair to round the counterfactual use of this money to 0-ish.)

The altruistic cost of donating

What’s the probability that someone seriously pursuing a job in government in the future will be counterfactually denied the job because of political donations?
- “The future” could refer to 2025-2028, or after 2028 (when we’ll have a different presidential administration). Overall, I’ll say that there’s a 50% chance that there will be a presidential administration that discriminates on the basis of political donations.
- Conditional on that, I’ll say that there’s a 20% chance that a donation will counterfactually prevent you from being hired for technical researchers and a 40% chance for governance researchers. (There’s a chance you would have been turned away anyway, e.g. for not being sufficiently qualified or for political social media posts; and there’s a chance that your donations will be overlooked and you’ll be hired anyway.)
- So overall that’s 10% for technical researchers. If you’re a governance researcher, we’ll multiply by 2 at the end.
What’s the probability that you’ll want to pursue a government position? That depends on the person; we’ll call this p. (I think that for a typical safety researcher, p is around 5% or 10%??)
How good will it be to have you in government, compared to having Alex Bores elected? We’ll call this r. (So, having you in government is 1000*r mB.) (I think that for a typical safety researcher, r is around 2%??)
Multiplying these together, we find that the career cost of your donation is 10% * p * 1000r mB = 100pr mB.

Cost-benefit analysis

And so, it’s philanthropically good to donate if p*r < 2% if you’re a technical researcher. I think this is true for most (but not all) technical AI safety researchers. As indicated above, I think that for a typical AI safety researcher, p*r is more like 0.1% or 0.2%.

On the other hand, the threshold for governance researchers is p*r < 1%, and p is probably much higher for a typical governance researcher. So I think that for a typical AI governance researcher, donating is a pretty close call. Consider using this post to weigh the costs and benefits.
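The milliBores bookkeeping above can be written as a short Python sketch. The inputs are the guesses from the list (a 50% chance of an administration that discriminates on donations, a 20% or 40% chance that a donation counterfactually costs you the job, 1000·r mB for having you in government) and the rounded +2 mB benefit; the function names are hypothetical. It recovers the thresholds of 2% and 1%.

```python
# milliBores (mB) bookkeeping: 1 mB = a 0.1 percentage-point increase in
# Bores' chance of being elected. All inputs are my subjective guesses.
BENEFIT_MB = 2.0  # ~0.7 mB for $7k on launch day, times ~3x for similar future opportunities
P_DISCRIMINATING_ADMIN = 0.5
P_DONATION_COSTS_JOB = {"technical": 0.2, "governance": 0.4}
MB_PER_UNIT_R = 1000  # having you in government is worth 1000 * r mB

def career_cost_mb(p: float, r: float, field: str) -> float:
    """Expected altruistic cost of donating, in mB (100 * p * r for technical)."""
    return P_DISCRIMINATING_ADMIN * P_DONATION_COSTS_JOB[field] * p * MB_PER_UNIT_R * r

def pr_threshold(field: str, benefit_mb: float = BENEFIT_MB) -> float:
    """Largest p * r at which the benefit of donating still exceeds the cost."""
    return benefit_mb / career_cost_mb(1.0, 1.0, field)

for field in P_DONATION_COSTS_JOB:
    print(f"{field}: donate if p * r < {pr_threshold(field):.0%}")

# Typical technical researcher from the text: p = 10%, r = 2%.
print(f"cost {career_cost_mb(0.10, 0.02, 'technical'):.2f} mB vs benefit {BENEFIT_MB} mB")
```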

Caveats

If you’re donating much less than $7k (e.g. if you’re only donating $500), the cost-benefit analysis may go the other way! That’s because the philanthropic benefit of your donation will be proportionally less.
Conversely: there are good political donation opportunities that can absorb hundreds of thousands of dollars per year from each person. (Message/email me privately if you want to know more about these.) If you’re interested in becoming a really big donor, then the career costs of donating are worth it (from a philanthropic standpoint) for all but a few people, in my opinion.
How does a possible AGI Manhattan project affect things? Well, I think that if there’s an AGI Manhattan project, then p goes up a lot for most researchers, but r is really low because there will be tons of people in the Manhattan project and steering things will be hard. So I don’t think it substantially changes the calculation. (I also think that an AGI Manhattan project is unlikely before 2029, though some people disagree.)

It’s possible that the bill will ultimately be weakened before it’s signed, after negotiations between the governor and the legislature. I don’t have much insight into that process, though. ︎
Alex Bores passed more bills in his first year than any first-year legislator, and more bills in his second year than any second-year legislator. ︎
A budget letter is the process by which legislators officially request money be added to the state budget. ︎
In practice, this would mostly look like: $300-350 to persuade someone to vote for Bores instead of a different candidate (which is twice as valuable, if the other candidate is the one who beats Bores by fewer than 1000 votes). ︎
To be clear, I’m not making any claim like “giving the Bores campaign $8.5 million will make him win the election”. I’m just talking about the effect of a marginal dollar. And similarly, if there’s a different good thing you can do for $85,000 that doesn’t scale, we still need to be comparing the good done by 100*(that thing) to the good done by electing Bores. ︎
Note that I wrote a version of this paragraph before I found out that Scott Wiener might run! So I don’t think I’m biased by the recency of this news; I think that, by coincidence, we’ll be getting two amazing donation opportunities only a few days/weeks apart. ︎
Some green card holders have expressed concern to me that they might have problems with immigration/naturalization if they donate. The Trump administration has really shocked me with its lawbreaking; nonetheless, I think that the Trump administration is currently pretty far from being so brazen as to systematically deny people citizenship because of their political donations. I can’t completely rule it out, though; I think that donating to Democrats could have a 1-2% chance of causing someone to be unable to become a citizen. In most of those worlds, the U.S. looks much more like a dictatorship than it does now.
This is the same number as my “early donations are 2-2.5x more valuable than late donations that can only be spent on advertising” number: 2x implies that 50% of the value is signaling and 2.5x implies that 60% is signaling.
On the one hand, this would increase the probability that the top two candidates are separated by a small number of votes. On the other hand, it decreases the probability that Bores would be one of the top two candidates. I think these roughly cancel out.
I used a Dirichlet distribution with parameters (3.5, 2.5, 1.5, 2.5) for the vote shares of (Lasher, Bores, Bottcher, “scattered among other candidates”), respectively. These were the values that gave the same mean vote percentages that I gave above and that seemed to me to have the correct amount of uncertainty in vote percentage. I then checked what fraction of the time Bores lost but by less than 1%; this happened about 1.45% of the time. This wasn’t super sensitive to varying the parameters of the model (within reason).
While there are more effective uses of money for the first $100k that a campaign raises, I expect Bores to raise a lot more than that, and that his marginal dollars will go toward ads.
In practice, this looks more like “convincing 0.6 people to vote for you instead of someone else”, because not many people will counterfactually turn out to vote just because they saw your ad. But this is about as good as getting one extra person to vote for you (instead of no one) because you’re probably taking a vote from whoever has the most votes besides you.
A quick Google search suggested that TV ads in New York are 10x as expensive. However, my guess is that in practice this means that the most effective use of money is other kinds of advertising (not TV ads). So 2x is my overall best guess about the right factor to multiply by, but this could be significantly off.
Donating on December 31, 2025 is much better than donating on January 1, 2026, because campaigns report how much they raise in each quarter, and are judged a lot based on how much they raise in their first quarter.
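The Dirichlet check described in the footnote on vote shares (parameters 3.5, 2.5, 1.5, 2.5) can be reproduced roughly as follows. This is a minimal sketch, not the author's code. It assumes the Python standard library only, samples a Dirichlet by normalizing independent Gamma draws, treats the "scattered among other candidates" share as never producing a single winner, and reads "Bores lost but by less than 1%" as "the winning named candidate beat Bores by under one percentage point of the total vote". The sample count and seed are arbitrary.

```python
import random

# Dirichlet concentration parameters from the footnote, in the order
# (Lasher, Bores, Bottcher, "scattered among other candidates").
ALPHAS = (3.5, 2.5, 1.5, 2.5)


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def prob_bores_narrow_loss(n_samples=100_000, margin=0.01, seed=0):
    """Estimate P(Bores loses, but by less than `margin` of the total vote).

    Assumption: only the three named candidates can win; the "scattered"
    share is split among many minor candidates and never forms one winner.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        lasher, bores, bottcher, _scattered = sample_dirichlet(ALPHAS, rng)
        winner_share = max(lasher, bottcher)
        # Bores lost (someone beat him) and the gap is under `margin`.
        if 0 < winner_share - bores < margin:
            hits += 1
    return hits / n_samples


if __name__ == "__main__":
    print(f"P(narrow Bores loss) is roughly {prob_bores_narrow_loss():.4f}")
```

With these assumptions the estimate should land in the same ballpark as the roughly 1.45% in the footnote, though the exact value depends on the seed and on how a "narrow loss" is defined.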

http://ericneyman.wordpress.com/?p=2734

Will Jesus Christ return in an election year?

Eric Neyman Mar 24, 2025

Thanks to Jesse Richardson for discussion.

Polymarket asks: will Jesus Christ return in 2025?

In the three days since the market opened, traders have wagered over $100,000 on this question. The market traded as high as 5%, and is now stably trading at 3%. Right now, if you wanted to, you could place a bet that Jesus Christ will not return this year, and earn over $13,000 if you’re right.

There are two mysteries here: an easy one, and a harder one.

The easy mystery is: if people are willing to bet $13,000 on “Yes”, why isn’t anyone taking them up?

The answer is that, if you wanted to do that, you’d have to put down over $1 million of your own money, locking it up inside Polymarket through the end of the year. At the end of the year, you’d earn roughly a 1% return on your investment. And you can do so much better on the stock market, or even in U.S. Treasury bonds.

So that’s why no one is buying the market down to 1%. But the real mystery is: why is anyone participating in the market on the “Yes” side? Like, who is betting that Jesus will return this year, and why?

Here are a few answers I came up with:

[True Believers] Maybe these people really believe that there’s a 3% chance that Christ will return this year!
[Incorrect Resolution] Maybe the “Yes” people are betting that the market will be resolved incorrectly (that there’s a 3% chance that the market will resolve “Yes” even though Christ will not return in 2025).
[The Memes] Maybe the “Yes” people are buying “Yes” for the lulz. It’s kinda fun to tell people that you bet that Jesus Christ would return this year!

But none of these hypotheses ring true to me:

The True Believers hypothesis rings false because that would be a frankly ridiculous belief to hold. Sometimes people profess ridiculous things, but very few of them put their money where their mouth is on prediction markets.1
The Incorrect Resolution hypothesis rings false because, while there’s some chance of an incorrect resolution, it’s really unlikely to be as high as 3%. Polymarket has a lot of reputation to lose by incorrectly resolving this market, and would almost certainly override its consensus-based resolution mechanism if it came to that.
The Memes hypothesis is more plausible, but I think ultimately false. Several people have spent hundreds of dollars betting yes, which is a lot of money to spend for the memes.

So I asked my friend Jesse, who trades on Polymarket, and he had a pretty interesting theory:

[Time Value of Money] The Yes people are betting that, later this year, their counterparties (the No bettors) will want cash (to bet on other markets), and so will be willing to sell their No positions cheaply, pushing up the price of Yes.

In other words: right now, there’s not much interesting stuff happening on Polymarket. People are spending a lot of money betting on sports, but not much else. But at some point in 2025, other markets will get a lot of attention. The New York mayoral election is happening this year. Pope Francis is in poor health, so there may be a new pope this year. God forbid China invade Taiwan, but such an invasion would result in many interesting markets. Right now, all these markets have mere single-digit millions of dollars in trading volume, but that could very easily change.

And if it changes, some of the people betting No on Christ’s return will want to unlock that money — that is, sell their “No” shares — so that they can use it to bet on other markets. If enough people want to sell their “No” shares, the “Yes” holders may be able to sell out at an elevated price, like 6%, potentially getting a 2x return on their investment!

The Time Value of Money hypothesis posits that the Yes bettors are more sophisticated than they look. In finance, time value of money is the idea that a dollar today is worth more than a dollar tomorrow, because you can do things with that dollar, such as making bets. The Yes traders are betting that the time value of Polymarket cash will go up unexpectedly: that other traders will be short on cash to place bets with, and will at some point be willing to pay a premium to free up the cash that they spent betting against Jesus.

Has this galaxy-brained trade ever gone well? Yes! In late October of last year, a week before the election, Kamala Harris was trading around 0.3% in safe red states like Kentucky, while Donald Trump was trading around 0.3% in safe blue states like Massachusetts. On election day, these prices skyrocketed to about 1.5%, because “No” bettors desperately needed cash to place other bets on the election. Traders who bought “Yes” for 0.3% in late October and sold at 1.5% on election day quintupled their money! This means that even though Harris only had a 0.1% chance2 of winning Kentucky, the “correct” price for the Kentucky market to trade at was more like 1.5%.
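The arithmetic behind this trade is simple enough to spell out. The sketch below compares reselling a cheap "Yes" share when cash gets tight with holding it to resolution, using the prices quoted above and the roughly 0.1% true probability; the function names are illustrative, and the 3% to 6% Jesus-market resale is the hypothetical scenario from earlier in the post.

```python
def resale_multiple(buy_price: float, sell_price: float) -> float:
    """Money multiple from buying a share at buy_price and reselling it at sell_price."""
    return sell_price / buy_price


def hold_to_resolution_multiple(buy_price: float, true_prob: float) -> float:
    """Expected money multiple from holding a $1-payout share until the market resolves."""
    return true_prob / buy_price


if __name__ == "__main__":
    # Kentucky, 2024: bought "Harris Yes" at 0.3%, resold at 1.5% on election day.
    print(resale_multiple(0.003, 0.015))              # about 5x
    # Holding the same share to resolution with a ~0.1% true chance of winning.
    print(hold_to_resolution_multiple(0.003, 0.001))  # about 0.33x in expectation
    # Hypothetical Jesus-market resale: bought at 3%, sold at 6%.
    print(resale_multiple(0.03, 0.06))                # about 2x
```

The contrast is the point: the same share loses about two-thirds of its value in expectation if held, but returns about 5x if someone later needs cash badly enough to buy it back.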

This means that the Jesus Christ market is quite interesting! You could make it even more interesting by replacing it with “This Market Will Resolve No At The End Of 2025”: then it would be purely a market on how much Polymarket traders will want money later in the year.3 As long as there is disagreement about the future time value of Polymarket cash, there will be trades, and the trading price will be above zero. The more that traders expect to want cash later in the year, the higher the market will trade.

(If Polymarket cash were completely fungible with regular cash, you’d expect the Jesus market to reflect the overall interest rate of the economy. In practice, though, getting money into Polymarket is kind of annoying (you need crypto) and illegal for Americans. Plus, it takes a few days, and trade opportunities often evaporate in a matter of minutes or hours! And that’s not to mention the regulatory uncertainty: maybe the US government will freeze Polymarket’s assets and traders won’t be able to get their money out?)

What kinds of years see a high time value of Polymarket cash? Election years. As late as June, Polymarket had Democrats with a 6.5% chance of winning Kentucky (and this was typical of the safe states), even though the actual probability was more like 1%.4 This means that traders were forgoing a relatively safe 16% annualized return, just so that they could have cash now to make other bets with! If there had been a “will Christ return in 2024” market, I bet it would have traded higher than 3% around this time last year: maybe more like 5%.
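The forgone-return reasoning above can be made concrete. This is a sketch under stated assumptions: a "No" share costs one minus the "Yes" price and pays $1 if the event does not happen; the holding period (an assumed 0.4 years from June to the November election) and compound annualization are choices made for illustration, so the result approximates rather than reproduces the 16% figure.

```python
def no_share_expected_return(yes_price: float, true_prob_yes: float = 0.0) -> float:
    """Expected return from buying one "No" share and holding it to resolution.

    A "No" share costs (1 - yes_price) and pays $1 if the event does NOT happen.
    """
    cost = 1.0 - yes_price
    expected_payout = 1.0 - true_prob_yes
    return expected_payout / cost - 1.0


def annualize(period_return: float, years: float) -> float:
    """Convert a return earned over `years` into a compound annual rate."""
    return (1.0 + period_return) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # June 2024 Kentucky market: Democrats priced at 6.5%, true chance ~1%.
    r = no_share_expected_return(0.065, 0.01)
    # Assumed holding period: ~0.4 years until the November election.
    print(f"period return {r:.1%}, annualized {annualize(r, 0.4):.1%}")
```

Under these assumptions the annualized figure comes out in the mid-teens, close to the text's roughly 16%; a slightly shorter holding period pushes it higher.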

And so, you heard it here first: Jesus Christ will probably return in an election year (at least if you believe the prediction markets)!

I’ve seen some pretty mispriced markets. At one point in 2019, PredictIt had Andrew Yang at 16% to win the Democratic presidential primary. And in 2020, Donald Trump was about 16% to become president even after he had lost the election. But the sorts of people who bet on prediction markets are not the sorts of fundamentalist Christians who think that Jesus Christ has a high chance of returning this year.
So says Nate Silver’s model in late October, and I agree.
Jesse points out that “the Jesus market should trade really low” is potentially a really good metric for evaluating the efficiency of prediction markets, and that prediction markets should aim to structure their mechanisms in a way that makes markets like this one trade really low. Manifold Markets has experimented with giving out loans for basically this purpose, although this seems much safer to do with fake money than real money.
After the election, Polymarket changed the labels from “Democrat” and “Republican” to “Harris” and “Trump”, but the labels said “Democrat” and “Republican” at the time.

http://ericneyman.wordpress.com/?p=2670

Seven lessons I didn’t learn from election day

Eric Neyman Nov 14, 2024

I spent most of my election day — 3pm to 11pm Pacific time — trading on Manifold Markets. That went about as well as it could have gone. I doubled the money I was trading with, jumping to 10th place on Manifold’s all-time leaderboard. Spending my time trading instead of just nervously watching results come …

It’s been a week now, and people seem to be in a mood for learning lessons, for grand takeaways. There is, of course, virtue in learning the right lessons. But there is an equal amount of virtue in not learning the wrong lessons. People seem to over-learn lessons from dramatic events. And so this blog post is intended as a kind of push-back: “Here are some lessons that people seem to be learning, and here is why those lessons are wrong.”

The seven most important things I didn’t learn

1. No, Kamala Harris did not run a bad campaign

The most important fact about politics in 2024 is that across the world, it’s a terrible time to be an incumbent. This year, for the first time since at least World War II, the incumbent party did worse than it had in the previous election in every election in the developed world. This happened most dramatically in the United Kingdom, where the Labour Party won a landslide victory, ending 14 years of Conservative rule. But the same thing played out all over the world, in places like India, France, Japan, Austria, South Korea, South Africa, Portugal, Belgium, and Botswana.

Why? The answer is probably inflation: inflation rates were unusually high throughout the world, and voters really don’t like it when prices go up.

The fact that this phenomenon is global shows that we can’t infer much about Kamala Harris’ quality as a candidate, or about her campaign, just from the fact that she lost. Indeed, as the chart shows, she fared unusually well compared to other incumbents (though I wouldn’t read too much into that either2).

Honestly, I don’t think that Kamala Harris was a good candidate, electorally speaking. According to a New York Times poll, 47 percent of voters thought that Harris was too progressive (compare: only 32 percent thought that Trump was too conservative). This is perhaps because she expressed some fairly unpopular progressive views in 2019-2020, including praising the “defund the police” movement and supporting a ban on fracking.

But I think that her 2024 campaign was pretty good. She had a good speech at the Democratic National Convention, did well in the presidential debate, and mostly avoided taking unpopular positions. By most accounts, her ground game in swing states was superior to Trump’s. Plus, Harris was backed by a really impressive Super PAC called Future Forward, which did rigorous ad testing to figure out which political ads were most persuasive to swing voters. Harris’ campaign wasn’t perfect — she should have picked Josh Shapiro as her running mate and gone on Joe Rogan’s show — but I have no major complaints.

Whatever the cause, I think there’s evidence that Harris’ campaign was more effective than Trump’s in the most important states. Here is a map of state-by-state swings in vote from 2020 to 2024, relative to the national popular vote. In other words, while literally every state swung rightward relative to 2020, I’ve colored red the states that swung rightward more than the nation as a whole, and blue the states that swung less.3

Each state’s swing relative to the nation as a whole. For example, New York is red on the map because Trump gained more in New York (11.5%) than he did nationally (6%). Oklahoma is blue because Trump gained less in Oklahoma (1%) than he did nationally.
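The coloring rule behind the map is a single subtraction. A minimal sketch using only the three figures quoted in the caption (a 6-point national swing toward Trump, 11.5 points in New York, 1 point in Oklahoma); the names are illustrative, and a state that exactly matches the national swing is arbitrarily colored blue here.

```python
NATIONAL_SWING = 6.0  # Trump's national gain from 2020 to 2024, in points (from the caption)

# State-level swings toward Trump, in points, as quoted in the caption.
STATE_SWINGS = {"New York": 11.5, "Oklahoma": 1.0}


def relative_swing(state_swing: float, national_swing: float = NATIONAL_SWING) -> float:
    """How much more (positive) or less (negative) a state swung toward Trump than the nation."""
    return state_swing - national_swing


def map_color(state_swing: float) -> str:
    """Red if the state swung toward Trump more than the nation did, otherwise blue."""
    return "red" if relative_swing(state_swing) > 0 else "blue"


if __name__ == "__main__":
    for state, swing in STATE_SWINGS.items():
        print(state, relative_swing(swing), map_color(swing))
```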

Notice that all seven swing states — Nevada, Arizona, Georgia, North Carolina, Pennsylvania, Michigan, and Wisconsin — swung right less than the nation. Georgia and North Carolina particularly stand out, having swung the least of any states in the Southeast. And despite Harris’ massive losses among Hispanic voters (visible on the map in states like California, Texas, Florida, and New York), she did okay in the heavily-Hispanic swing states of Arizona and Nevada. (See this Twitter thread by Dan Rosenheck for a more rigorous county-by-county regression analysis that agrees with this conclusion.4)

The result? It looks like Harris has pretty much erased the bias that massively benefited Trump in the electoral college in 2020. That year, Biden won the popular vote by 4.5%, but came really close to losing the electoral college. He needed to win the national popular vote by 4% (!) to win the electoral college, a historic disadvantage.

In 2024, it looks like Harris would have needed to win the national popular vote by about 0.3% to win the election. In other words, the last election’s bias against Democrats — a bias that was unprecedented in recent history — got almost entirely erased.
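Figures like "needed to win the national popular vote by 4%" are typically derived with a uniform-swing, tipping-point calculation. The sketch below shows that calculation on made-up toy states; the state data is hypothetical (not 2020 or 2024 results), and this is an assumption about how such numbers are usually computed rather than the author's exact method.

```python
from typing import Iterable, Tuple

# (state name, electoral votes, Democratic margin in points; negative = Republican lead)
State = Tuple[str, int, float]


def required_national_margin(states: Iterable[State], national_margin: float) -> float:
    """National margin the Democrat needs for a bare Electoral College majority.

    Assumes uniform swing: every state shifts by the same amount as the nation.
    """
    states = list(states)
    total_ev = sum(ev for _, ev, _ in states)
    majority = total_ev // 2 + 1
    # Each state's lean relative to the nation; under uniform swing, the state's
    # margin at a national margin N is (lean + N).
    leans = sorted(((margin - national_margin, ev) for _, ev, margin in states), reverse=True)
    won_ev = 0
    for lean, ev in leans:
        won_ev += ev
        if won_ev >= majority:
            # Tipping-point state: the Democrat carries it once N exceeds -lean.
            return -lean
    raise ValueError("no electoral majority is reachable")


if __name__ == "__main__":
    toy = [("A", 3, 10.0), ("B", 2, -2.0), ("C", 4, -6.0)]  # hypothetical states
    print(required_national_margin(toy, national_margin=0.0))  # Democrat needs to win by 2
```

A positive result means the Electoral College is biased against the Democrat; a negative one means the Democrat could lose the popular vote and still win.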

This analysis is not definitive, but it looks to me that the Harris campaign did a basically good job under highly unfavorable circumstances.

2. No, polls aren’t useless. They were pretty good, actually.

I’ve actually seen surprisingly few people complain about the polls this time around. But people will inevitably complain about this year’s polls the next time they’re interested in dismissing poll results. So it’s worth stating for the record that the polls this year were pretty good. I’d maybe give them a B grade.

FiveThirtyEight compares 2024 polls to polls from previous presidential elections, finding that polls had low error and medium bias. Below, “statistical error” refers to the average absolute value of how much the polling average differed from the final result, across relatively close states. Meanwhile, “statistical bias” looks at whether polls were persistently wrong in the same direction.
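To make the distinction concrete: error averages the size of each miss, while bias averages the signed miss, so misses in opposite directions cancel out in bias but not in error. This is a minimal sketch of the two definitions as described above, not FiveThirtyEight's code; the sign convention (Republican minus Democrat, so positive bias means polls underestimated the Republican) is an assumption.

```python
from statistics import mean
from typing import Sequence, Tuple


def polling_error_and_bias(poll_margins: Sequence[float],
                           result_margins: Sequence[float]) -> Tuple[float, float]:
    """Return (statistical error, statistical bias) across a set of races.

    Margins are Republican minus Democrat, in points.
    """
    misses = [result - poll for poll, result in zip(poll_margins, result_margins)]
    statistical_error = mean(abs(m) for m in misses)  # average size of the miss
    statistical_bias = mean(misses)                   # average signed miss
    return statistical_error, statistical_bias


if __name__ == "__main__":
    # Two hypothetical states that miss by 3 points in opposite directions.
    print(polling_error_and_bias([0.0, 0.0], [3.0, -3.0]))  # error 3, bias 0
    # National numbers from the text: polls Harris +1 (R-D = -1), result Trump +1.5.
    print(polling_error_and_bias([-1.0], [1.5]))            # error 2.5, bias 2.5
```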

National polls were similarly biased. It looks like Trump will win the popular vote by about 1.5%, whereas the polling average had Harris winning the popular vote by 1%. That’s a 2.5% bias, which is in line with the historical average:

In 2016 and 2020, we saw a massive polling bias in the Midwest; this year, polls had a modest bias. We only saw a large polling bias in Florida and Iowa, neither of which was a swing state.

Why were the polls so biased in 2016 and 2020? For 2016, I think we have a reasonably good explanation: most polls did not weight respondents by education level. This ended up biasing polls because low-education voters swung massively against Democrats, and also were much less likely to respond to polls.

I think it’s less clear what happened in 2020. The best explanation I’ve heard is that the polling bias was a one-off miss caused by COVID: Democrats were much more likely to stay home, and thus much more likely to respond to polls. I don’t find this explanation fully satisfying, because I would think that weighting by party registration would have mostly eliminated this source of bias.

Some people went into the 2024 election fearing that pollsters had not adequately corrected for the sources of bias that had plagued them in 2016 and 2020. I was ambivalent on this point. On the one hand, I think we didn’t end up getting a great explanation for what caused polls to be biased in 2020. On the other hand, pollsters have pretty strong incentives to produce accurate numbers. I hope we end up with a good explanation of this year’s (more modest) polling bias; so far, I haven’t seen one.

Should we expect polls to once again be biased against Republicans in 2028? I don’t know. The polls did not exhibit a bias in the last two midterm elections, but they exhibited a persistent bias in the last three presidential elections. One could posit a few theories:

[The default hypothesis] Every year has idiosyncratic reasons for polling bias. It just so happened that polls were biased against Trump in all of the last three elections: the coin happened to land heads all three times. (After all, if you flip a coin three times, there’s a 25% chance that it will land the same way every time!) Under this theory, we shouldn’t expect a bias in 2028.
[The efficient polling hypothesis] Pollsters have a reputational incentive to be accurate. Although they were unsuccessful at adjusting their methodologies to get rid of their Democratic bias in 2020 and 2024, in 2028 they will finally succeed in doing so. Under this theory, we shouldn’t expect a bias in 2028.
[The Trump hypothesis] There is a Trump-specific factor that biases polls against him. The most likely reason for this is that some of Trump’s voters (a) only turn out to vote for Trump and (b) are really hard to reach, in a way that isn’t easily fixed by weighting poll respondents differently. Under this theory, we shouldn’t expect a bias in 2028, since Trump isn’t running again.
[The low-propensity voter hypothesis] Some voters only turn out every four years, to vote for president. While a decade ago, a majority of these voters were Democrats, now a majority are Republicans. (This hypothesis is similar to the previous one, except that it doesn’t posit a Trump-specific phenomenon.) Under this theory, to the extent that polls have trouble picking up on those voters (because they’re unlikely to respond to polls), maybe we should expect the bias to continue.
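The coin-flip figure in the default hypothesis can be checked by brute-force enumeration. This sketch assumes, as that hypothesis does, that the direction of each election's polling bias is an independent fair coin; it is an illustration of the arithmetic, not evidence for the hypothesis.

```python
from fractions import Fraction
from itertools import product


def prob_all_same(n_flips: int) -> Fraction:
    """Exact probability that n fair coin flips all land the same way."""
    outcomes = list(product("HT", repeat=n_flips))
    all_same = sum(1 for outcome in outcomes if len(set(outcome)) == 1)
    return Fraction(all_same, len(outcomes))


if __name__ == "__main__":
    print(prob_all_same(3))  # 1/4
```

Requiring three heads specifically, rather than three of the same face, would give 1/8 instead.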

I think these hypotheses are about equally likely. So, should we expect a bias favoring Democrats in 2028? My tentative answer: probably not, but maybe.

There’s one caveat I’d like to make, which concerns Selzer & Co.’s Iowa poll. Ann Selzer is a highly regarded pollster (see this glowing profile from FiveThirtyEight) whose polls have been accurate again and again. But this year, her final poll showed Harris up 3 in Iowa; meanwhile, Trump won by 13 points: a huge miss.

So, what happened? Most likely, the error was a result of non-response bias: Democrats were more likely than Republicans to respond to her poll. I couldn’t find the poll methodology online, but Selzer is famous for taking a “hands-off” approach to polling, doing minimal weighting. According to Elliott Morris at FiveThirtyEight, Selzer “only weights by age and sex”, basically meaning that she re-weights respondents to ensure a correct number of men vs. women and young vs. middle-aged vs. old respondents, but doesn’t do any other weighting.

This means that if Iowa has an equal number of registered Democrats and registered Republicans, but Democrats are twice as likely to pick up the phone as Republicans (after controlling for age and sex), then her polls will show that twice as many Democrats as Republicans will show up to vote.5 By contrast, most pollsters weight respondents by party registration or recalled vote to try to get an unbiased sample of registered voters.
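A toy version of this example shows both the distortion and what weighting by party registration does about it. This is a minimal sketch under simplifying assumptions: parties differ only in response rate, everyone votes their registration, turnout is equal (as the footnote notes), and the weighting is a simple post-stratification on party; the response rates are hypothetical.

```python
from typing import Dict, Tuple


def raw_and_weighted_share(population: Dict[str, float],
                           response_rate: Dict[str, float],
                           party: str = "D") -> Tuple[float, float]:
    """Return (unweighted share, party-weighted share) of respondents from `party`.

    population:    each party's share of registered voters (sums to 1)
    response_rate: probability that a member of each party answers the poll
    """
    # Expected respondents from each party, per registered voter.
    respondents = {p: population[p] * response_rate[p] for p in population}
    total = sum(respondents.values())
    raw_share = respondents[party] / total

    # Weight each party's respondents by (population share / sample share),
    # so the weighted sample matches the known registration split.
    weights = {p: population[p] / (respondents[p] / total) for p in population}
    weighted_total = sum(respondents[p] * weights[p] for p in population)
    weighted_share = respondents[party] * weights[party] / weighted_total
    return raw_share, weighted_share


if __name__ == "__main__":
    # Equal registration; Democrats (hypothetically) twice as likely to respond.
    print(raw_and_weighted_share({"D": 0.5, "R": 0.5}, {"D": 0.02, "R": 0.01}))
```

The unweighted sample is two-thirds Democratic (a 2:1 ratio), while the party-weighted sample is back to an even split. Weighting only by age and sex cannot make this correction, because in this model the response gap is between parties, not between demographic groups.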

As far as I can tell, this sort of “hands-off” methodology hasn’t been tenable since 2016 and won’t be tenable going forward. My guess is that luck was a major factor in the accuracy of Selzer’s 2016 and 2020 polls. I probably won’t place much stock in Selzer & Co. polls in future years, though I’m open to being persuaded otherwise.

3. No, Theo the French Whale did not have edge

Theo the French Whale, also known as Fredi9999, was one of the more fun characters of the 2024 presidential election.

Theo the French Whale is actually a human. But in gambling-speak, a whale is a gambler who wagers a really large amount of money. And that kind of whale, Theo was.

About a month before the election, the prediction market site Polymarket started becoming more and more confident that Donald Trump would win the election. While Polymarket had previously been in relatively close agreement with forecast models like Nate Silver’s, Trump’s odds on Polymarket started going up, eventually reaching as high as 66%. Meanwhile, most forecasters thought the race was a tossup.

Traders noticed that the price increase was driven mostly by a single trader with the username Fredi9999, who was buying tens of millions of dollars of Trump shares. A different trader named Domer did some snooping and figured out that Fredi9999 was a Frenchman. The two briefly chatted before Fredi9999 got mad at Domer for disagreeing with him. You can read Domer’s account of it all here.

In all, Theo wagered close to $100 million on Trump winning the election. People posited many theories about Theo’s motivations, but the most straightforward theory always seemed likeliest to me: Theo was betting on Trump because he thought that Trump was likely to win the election.

After the election, a Wall Street Journal report (paywalled; see here for some quotes) revealed Theo’s reasoning: Theo believed that polls were yet again biased against Trump, so he commissioned his own private polling that used a nonstandard methodology called the “neighbor method”.

The idea of the neighbor method is that, instead of asking people who they support, you ask them who their neighbors support. This is supposed to reduce the bias that results from Trump supporters being disproportionately unwilling to tell pollsters that they’re voting for Trump (so-called “shy Trump voters”). According to the WSJ article, Theo’s “neighbor method” polls “showed Harris’s support was several percentage points lower when respondents were asked who their neighbors would vote for, compared with the result that came from directly asking which candidate they supported.”

Many people saw the WSJ report as a vindication of prediction markets. Prediction market proponents argue that we should expect prediction markets to be more accurate than other forecasting methods, because holders of private information are incentivized to reveal that information by betting on the markets. And in this case, Theo even did his own novel research, in order to acquire private information, so that he could reveal that information through his bets! A dream come true for prediction market enthusiasts.

Except, as far as I can tell, the neighbor method is total nonsense. This is for a few reasons.

The first reason has to do with the geographic distribution of Democrats and Republicans. Cities are very heavily Democratic, while rural areas are only moderately Republican. As a simple model, imagine that Pennsylvania is split 50/50 between Harris voters and Trump voters, and that in particular:

25% of voters live in Philadelphia, which supports Harris 80-20.
75% of voters live in rural areas, which support Trump 60-40.

If you ask people who they’re voting for, 50% will say they’re voting for Harris. But if you ask them who most of their neighbors are voting for, only 25% will say Harris and 75% will say Trump! It’s no wonder that Theo’s neighbor polls found “more support” for Trump.

(This is just an illustrative example. The actual distribution of voters isn’t as dramatic, but the point still stands: while Trump won 51% of the two-party vote, 55% of Pennsylvanians live in counties won by Trump. This lines up with the shift of “several percentage points” in Theo’s polls!)
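The Pennsylvania toy model can be written out directly. A minimal sketch, assuming (as the model does) that every respondent correctly reports which candidate wins their own area, and counting an exactly 50-50 area as not a Harris majority; the `Region` type is illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    name: str
    population_share: float  # fraction of the state's voters living here
    harris_share: float      # fraction of this region's voters backing Harris


def direct_estimate(regions: List[Region]) -> float:
    """Harris share if everyone reports their own vote."""
    return sum(r.population_share * r.harris_share for r in regions)


def neighbor_estimate(regions: List[Region]) -> float:
    """Harris share if everyone reports their neighbors' majority choice."""
    return sum(r.population_share for r in regions if r.harris_share > 0.5)


if __name__ == "__main__":
    pennsylvania = [
        Region("Philadelphia", 0.25, 0.80),
        Region("Rural areas", 0.75, 0.40),
    ]
    print(direct_estimate(pennsylvania))    # about 0.5
    print(neighbor_estimate(pennsylvania))  # 0.25
```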

The second reason that I don’t trust the neighbor method is that people just… aren’t good at knowing who a majority of their neighbors are voting for. In many cases it’s obvious (if over 70% of your neighbors support one candidate or the other, you’ll probably know). But if it’s 55-45, you probably don’t know which direction it’s 55-45 in.

On the other hand, I could have given you a really good idea of what percentage of voters in every neighborhood would vote for Trump: I’d look at the New York Times’ Extremely Detailed Map of the 2020 Election and maybe make a minor adjustment based on polling. My guess would be within 5% of the right answer most of the time.

So… the neighbor method is supposed to elicit voters’ fuzzy impressions of whether most of their neighbors are voting for Trump, when I could easily out-predict almost all of them? That doesn’t sound like a good methodology.

And the final reason is that the neighbor method’s track record is… short and bad. I’m aware of one serious, publicly available attempt at the neighbor method: in 2022, NPR’s Planet Money asked Marist College (which does polling for NPR) to poll voters on the following question:

Think of all the people in your life, your friends, your family, your coworkers. Who are they going to vote for?

While the main polling question (“Who will you vote for?”) found a 3-point advantage for Republicans (spot on!), the “friends and family” question found a whopping 16-point advantage (which was way off).

(Also, how are you even supposed to answer that question?? “Well, Aunt Sally is voting for the Democrats, while Uncle Greg is voting for the Republicans. Meanwhile, my best friend Joe is planning to vote for the Democrat in the House but the Republican in the Senate. My coworkers seem to be split 50/50 though I don’t talk to them about politics much…”)

So, barring further evidence, I will continue to be dismissive of the neighbor method. Theo did a lot of work, but it was bad work, and he got lucky.

4. No, we didn’t learn which campaign strategies worked

The Kamala Harris campaign is getting a lot of flak for spending millions on swing-state concerts by celebrities like Katy Perry and Lady Gaga. Had Harris won, the media would probably be praising her youth-savvy strategy.

By contrast, Elon Musk’s efforts to turn out voters for Trump in swing states, which the media previously covered skeptically, are increasingly viewed as effective, just because Trump won.

Were the concerts a good use of money? I don’t know. Did Musk’s $200 million get spent wisely? I also don’t know. In both cases, my guess is: probably not. But the fact that Trump won and Harris lost provides very little evidence, just because there are so many factors at play in determining who wins or loses an election.

5. No, Donald Trump isn’t a good candidate

Trump has now gone two-for-three in presidential elections. This year was just the second time that a Republican won the popular vote in the last nine presidential elections (the other being George W. Bush in 2004). It’s tempting to conclude that Trump is an above-replacement Republican, when it comes to electability. I think that would be the wrong conclusion.

In my opinion, Trump mostly got lucky in the two general elections that he won. In 2016, he barely beat Hillary Clinton, who was deeply unpopular at the time of the election.

And in 2024, he was running against a quasi-incumbent during an unprecedentedly bad time to be an incumbent.

I’m actually not sure that Trump is unusually bad for a Republican. For example, hypothetical Harris vs. Vance polls showed Harris doing about 9 points better against Vance than against Trump. On the other hand, during this year’s Republican primary, most polls showed Haley doing better than Trump against Biden (see e.g. this New York Times poll). Overall, I’d guess that Trump is about average for a Republican in terms of electability.

6. No, spending money on political campaigns isn’t useless

I’ve seen a few people jump to this sort of conclusion based on the fact that Harris significantly outraised Trump and still lost.

But again, the relevant question is how much she would have lost by if she hadn’t outraised Trump. My guess is that she would have lost by more, particularly in the swing states (where most of her ad spending went, and where she overperformed — see above).

Natural experiments show that campaign spending helps win votes. I think that while donating to the Harris campaign is only moderately effective, some efforts such as Swap Your Vote were able to get Harris additional swing state votes at a cost of about $200/vote. As an altruistic intervention, I think this is pretty good, given that the outcome of the presidential election affects how trillions of dollars get spent. (See here for some more of my thoughts about this.)

For my part, I didn’t donate to Harris, but I donated a substantial amount to my favorite state legislative candidate, in what looked to be a really close race. He ended up losing by about 10%, but I think my decision to donate was well-informed, and I would do it again.

7. No, my opinion of the American people didn’t change

I expected 50% of voters to vote for Harris and 48.5% to vote for Trump. Instead, 48.5% voted for Harris and 50% voted for Trump. The 1.5% of Americans who voted for Trump, and who I thought would vote for Harris, were incredibly important for the outcome of the election and for the future of the country, but they are only 1.5% of the population.

If you (like me) supported Harris, then perhaps you think that 1.5% of Americans have worse judgment than you expected. If you supported Trump, then perhaps you think that 1.5% of Americans have better judgment than you expected. So maybe election day should have very slightly raised or lowered your esteem of the American people — but certainly not very much.

I think that instead, most of the evidence you got came earlier. In my case, it was 2015-2017, when Trump first ran for president and got elected, and then again in 2021, when his popularity didn’t go down very much despite attempting to steal the 2020 presidential election.

One thing I did learn

Just for fun, I wanted to highlight the most interesting thing I did learn from election night:

Foreign-born Americans shifted toward Trump

Dan Rosenheck points out a really strong relationship between the percentage of a county’s population that was foreign born and how much the county shifted toward Trump. Indeed, the r-squared is 0.51, meaning that foreign-born percentage explains 51% (!) of the variance in how much different counties shifted toward Trump. Every 6% increase in foreign-born population was associated with a 1% increase in swing toward Trump.
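The two statistics quoted here, the slope and the r-squared, come from an ordinary least-squares fit of county swing on foreign-born percentage. A minimal standard-library sketch with made-up data (not Rosenheck's) shows how they are computed; for a one-variable regression, r-squared is the squared Pearson correlation.

```python
from statistics import fmean
from typing import Sequence, Tuple


def slope_and_r_squared(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares slope of y on x, plus r-squared."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    r_squared = sxy * sxy / (sxx * syy)  # squared Pearson correlation
    return slope, r_squared


if __name__ == "__main__":
    foreign_born_pct = [0, 6, 12, 18]  # hypothetical counties
    # A perfect "1 point of swing per 6 points of foreign-born share" line.
    print(slope_and_r_squared(foreign_born_pct, [0, 1, 2, 3]))
    # Noisier hypothetical swings: same upward trend, lower r-squared.
    print(slope_and_r_squared(foreign_born_pct, [0, 2, 1, 3]))
```

On the perfect line the slope is 1/6 (the "every 6% ... 1%" relationship) and r-squared is 1; the noisier data gives a smaller slope and an r-squared of 0.64.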

(This data is arguably consistent with the theory that foreign-born Americans didn’t swing toward Trump, but rather that native-born Americans who live in immigrant communities swung toward Trump. However, I find this explanation a little less likely, because of some other evidence we have, such as swings toward Trump among Hispanic voters, who are disproportionately foreign-born.6)

I haven’t checked, but I suspect that this is reversion to pre-Trump voting patterns. Maybe otherwise-conservative immigrants voted against Trump in 2016 and 2020, but decided to vote for Trump in this election. If someone wanted to check, I’d be interested in seeing the results!

I was in a room with six or so of my friends. We were all rooting for Harris, but the chatter was all about what trades we should be making. Is the market overreacting to Trump’s crushing victory in Florida? Why is “Trump wins the popular vote” stubbornly staying below 30%? Wisconsin seems bluer than Pennsylvania — should we be buying Trump shares in Pennsylvania and selling them in Wisconsin? Going into election day, I expected that my day would be ruined if Trump won, that I would give up on trading and go cry in a corner. Instead, the opposite happened: the exhilaration of trading — the constant decision-making — displaced most of the grief that I would have felt that day. I think it’s really bad that Trump won, and that the world will be a much worse place because of it. But just on an emotional level, things turned out okay for me.
For one thing, the U.S. economy has generally fared better than most other economies in the developed world. For another, Joe Biden wasn’t running for re-election, which may have dulled the anti-incumbent effect. For a third, the U.S. is unusually polarized, so one should expect smaller swings from election to election. Also, the sample of countries in the chart just isn’t that large.
These numbers are subject to change, mostly because California has so far only counted about two-thirds of its votes.
Rosenheck found that including a variable for “is the county in a swing state?” significantly improves the regression: being in a swing state is associated with a higher vote share for Harris. However, it’s possible that this effect is entirely due to an idiosyncratic pro-Trump effect in a few states like New York, New Jersey, and Florida, which happen to not be swing states.
This is assuming that the Democratic respondents and Republican respondents are equally likely to end up voting. Selzer’s poll is a “likely voter” poll, meaning that she weights respondents by how likely they are to vote.
Once we have precinct-level data, this hypothesis could be tested by comparing very heavily foreign-born precincts with surrounding ones. If precincts with tons of immigrants swung more toward Trump than surrounding ones, it would be evidence for my explanation. If they swung less, that would be evidence that native-born Americans in immigrant communities are partly responsible for the swing.

My puzzles for the 2024 LessOnline puzzle hunt

Eric Neyman Jun 22, 2024

A couple weeks ago, the LessWrong team threw a 400-person festival called LessOnline for rationalist-adjacent bloggers and their readers. The festival featured a puzzle hunt, which I helped write!

The puzzle hunt had two rounds with a somewhat unusual structure: all participants who completed the first round were thrown onto one team to solve the (much harder) puzzles in the second round. There was also a really well-crafted and elaborate plot line (that I played no part in creating), which you can read about here.

I wrote two puzzles for the hunt: one for the first round and one for the second. Each of these puzzles opened a 5-letter lock, which means that the answer to both puzzles is 5 letters long (but isn’t necessarily a word). Standard puzzle hunt rules apply: feel free to Google things, ask ChatGPT, etc.

The first puzzle

On the subjective difficulty scale I’ve used before on this blog, the first puzzle is a 3/10. If you’re an experienced puzzle solver, you can probably solve it in about 15 minutes. If you’re not, it might take a while. The puzzle is below; I wrote up the solution here.

A blind game of Projective Set

The second puzzle

On my subjective difficulty scale, this one’s a 6/10. It would probably take an experienced puzzle solver a few hours. The puzzle consists of 50 plastic tiles, but unfortunately I can only provide a soft copy here. If you want to work on this puzzle, I recommend printing and cutting out the pieces. The puzzle is below, and also available as a PDF here. I wrote up the solution here.

Odd ones out

4 (/6)
1 (/6)
8 (/8)
1 (/6)
2 (/10)

Algorithmic Bayesian Epistemology

Eric Neyman May 9, 2024

In January, I defended my PhD thesis. My thesis is called Algorithmic Bayesian Epistemology, and it’s about predicting the future.

In many ways, the last five years of my life have been unpredictable. I did not predict that a novel bat virus would ravage the world, causing me to leave New York for a year. I did not predict that, within months of coming back, I would leave for another year — this time of my own free will, to figure out what I wanted to do after graduating. And I did not predict that I would rush to graduate in just seven semesters so I could go work on the AI alignment problem.

But the topic of my thesis? That was the most predictable thing ever.

It was predictable from the fact that, when I was six, I made a list of who I might be when I grow up, and then attached probabilities to each option. Math teacher? 30%. Computer programmer? 25%. Auto mechanic? 2%. (My grandma informed me that she was taking the under on “auto mechanic”.)

It was predictable from my life-long obsession with forecasting all sorts of things, from hurricanes to elections to marble races.

It was predictable from that time in high school when I was deciding whether to tell my friend that I had a crush on her, so I predicted a probability distribution over how she would respond, estimated how good each outcome would be, and calculated the expected utility.

And it was predictable from the fact that like half of my blog posts are about predicting the future or reasoning about uncertainty using probabilities.

So it’s no surprise that, after a year of trying some other things (mainly auction theory), I decided to write my thesis about predicting the future.

If you’re looking for practical advice for predicting the future, you won’t find it in my thesis. I have tremendous respect for groups like Epoch and Samotsvety: expert forecasters with stellar track records whose thorough research lets them make some of the best forecasts about some of the world’s most important questions. But I am a theorist at heart, and my thesis is about the theory of forecasting. This means that I’m interested in questions like:

How do I pay Epoch and Samotsvety for their forecasts in a way that incentivizes them to tell me their true beliefs?
If Epoch and Samotsvety give me different forecasts, how should I combine them into a single forecast?
Under what theoretical conditions can Epoch and Samotsvety reconcile a disagreement by talking to each other?
What’s the best way for me to update how much I trust Epoch relative to Samotsvety over time, based on the quality of their predictions?

If these sorts of questions sound interesting, then you may enjoy consuming my thesis in some form or another. If reading a 373-page technical manuscript is your cup of tea — well then, you’re really weird, but here you go!

If reading a 373-page technical manuscript is not your cup of tea, you could look at my thesis defense slides (PowerPoint, PDF),1 or my short summary on LessWrong.

On the other hand, if you’re looking for a somewhat longer summary, this post is for you! If you’re looking to skip ahead to the highlights, I’ve put a * next to the chapters I’m most proud of (5, 7, 9).

Chapter 0: Preface

I don’t actually have anything to say about the preface, except to show off my dependency diagram.

Figure 0.1: Solid arrows mean “required”; dashed arrows mean “recommended”.

(I never learned how to make diagrams in LaTeX. You can usually do almost as well in Microsoft Word, with way less effort!)

Chapter 1: Introduction

“Algorithmic Bayesian epistemology” (the title of the thesis, a.k.a. ABE) just means “reasoning about uncertainty under constraints”. You might’ve seen math problems that look like this:

0.1% of people have a disease. You get tested using a test that’s ten times more likely to come up positive for people who have the disease than for people who don’t. If your test comes up positive, what’s the probability that you have the disease?
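Since the rest of the thesis builds on this kind of calculation, here is the textbook version worked out as a short Python sketch. It uses Bayes' rule in odds form: the prior odds of 1 to 999 get multiplied by the test's likelihood ratio of 10.

```python
def posterior_given_positive(prior, likelihood_ratio):
    """Posterior P(disease | positive test) via Bayes' rule in odds form.

    likelihood_ratio = P(positive | disease) / P(positive | no disease).
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


p = posterior_given_positive(prior=0.001, likelihood_ratio=10)
print(f"{p:.4f}")  # prints 0.0099, i.e. just under 1%
```

So even after a positive test, the chance of having the disease is just under 1% (exactly 10/1009), because the disease is so rare to begin with.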

But the real world is rarely so simple: maybe there’s not one test but five. Test B is more likely to be a false positive in cases where Test A is a false positive. Tests B and C test for different sub-types of the disease, so they complement each other. Tests D and E are brand new and it’s unclear how correlated they are with the other tests. How do you form beliefs in that sort of information landscape?

Here’s another example. A month ago, I was deciding whether to change my solar eclipse travel plans from Mazatlán, Mexico to Montreal, Canada, on account of the weather forecasts. The American model told me that there was a 70% chance that it would be cloudy in Mazatlán; meanwhile, the Canadian model forecast a mere 20% chance. How was I to reconcile these sharply conflicting probabilities?2

I was facing an informational constraint. Had I known more about the processes by which the models arrived at their probabilities and what caused them to diverge, I would have been able to produce an informed aggregate probability. But I didn’t have that information. All I knew was that it’s cloudy in Mazatlán 25% of the time during this part of the year, and that one source predicted a 20% chance of clouds while another predicted a 70% chance. Given just this information, what should my all-things-considered probability have been?

(If you’re interested in this specific kind of question, check out Chapter 7!)

But informational constraints aren’t the only challenge. You can face computational constraints (you could in theory figure out the right probability, but doing so would take too long), or communicational constraints (figuring out the right probability involves talking to an expert with a really detailed understanding of the problem, but they only have an hour to chat), or strategic constraints (the information you need is held by people with their own incentives who will decide what to tell you based on their own strategic considerations).

So that’s the unifying theme of my thesis: reasoning about uncertainty under a variety of constraints.3

I don’t talk about computational constraints very much in my thesis. Although that topic is really important, it’s been studied to death, and making meaningful progress is really difficult. On the other hand, some of the other kinds of constraints are really underexplored! For example, there’s almost no work on preventing strategic experts from colluding (Chapter 4), very little theory on how best to aggregate experts’ forecasts (Chapters 5, 6, 7), and almost no work on communicational constraints (Chapter 8). In no small part, I chose which topics to study based on where I expected to find low-hanging fruit.

Chapter 2: Preliminaries

This is a great chapter to read if you want to get a sense of what sort of stuff my thesis is about. It describes the foundational notions and results that the rest of my thesis builds on. Contents include:

Proper scoring rules: suppose I want to know the probability that OpenAI will release GPT-5 this year. I could pay my friend Jaime at Epoch AI for a forecast. But how do I make sure that the forecast he gives me reflects his true belief? One approach is to ask Jaime for a forecast, wait to see if GPT-5 is released this year, and then pay him based on the accuracy of his forecast. Such a payment scheme is called a scoring rule, and we say that a scoring rule is proper if it actually incentivizes Jaime to report his true belief (assuming that he wants to maximize the expected value of his score). (I’ve written about proper scoring rules before on this blog! Reading that post might be helpful for understanding the rest of this one.)
Forecast aggregation methods: now let’s say that Jaime thinks there’s a 40% chance that GPT-5 will be released this year, while his colleagues Ege and Tamay think there’s a 50% and 90% chance, respectively. What’s the right way for them to aggregate their probabilities into a single consensus forecast? One natural approach is to just take the average, but it turns out that there are significantly better approaches.
Information structures: if some experts are interested in forecasting a quantity, an information structure is a way to formally express all of the pieces of information known by at least one of the experts, and how those pieces of information interact/overlap. I also discuss some “nice” properties that information structures can have, which make them easier to work with.
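Properness can be checked numerically. The Python sketch below assumes a yes/no event and uses the two standard rules in unnormalized form: the quadratic score (1 minus the squared error) and the log score (the log of the probability assigned to the realized outcome). It asks which report maximizes an expert's expected score when their true belief is 40%; the grid search simply stands in for calculus and is only for illustration.

```python
import math


def quadratic_score(report, outcome):
    """1 minus squared error; outcome is 1 for YES, 0 for NO."""
    return 1 - (outcome - report) ** 2


def log_score(report, outcome):
    """Log of the probability the report assigned to the realized outcome."""
    return math.log(report if outcome == 1 else 1 - report)


def expected_score(rule, report, belief):
    """Expected score of `report` for an expert who believes P(YES) = belief."""
    return belief * rule(report, 1) + (1 - belief) * rule(report, 0)


def best_report(rule, belief, grid_size=999):
    """Report on a grid of interior probabilities with the highest expected score."""
    grid = [k / (grid_size + 1) for k in range(1, grid_size + 1)]
    return max(grid, key=lambda r: expected_score(rule, r, belief))


for rule in (quadratic_score, log_score):
    print(rule.__name__, best_report(rule, belief=0.4))  # both print 0.4
```

Under both rules the best report is the expert's true belief, which is exactly what "proper" means.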

Chapter 3: Incentivizing precise forecasts

(Joint work with George Noarov and Matt Weinberg.)

I’ve actually written about this chapter of my thesis on my blog, so I’ll keep this summary brief!

In the previous section, I mentioned proper scoring rules: methods of paying an expert for a probabilistic forecast (depending on the forecast and the eventual outcome) in a way that incentivizes the expert to tell you their true probability.

The two most commonly used ones are the quadratic scoring rule (you pay the expert some fixed amount, and then subtract from that payment based on the expert’s squared error) and the logarithmic scoring rule (you pay the expert the log of the probability that they assign to the eventual outcome). (See this post or Chapter 2.1 of my thesis for a more thorough exposition.) But there are also infinitely many other proper scoring rules. How do you choose which one to use?

All proper scoring rules incentivize an expert to give a truthful forecast (by definition). In this chapter, I explore the question of which proper scoring rule most incentivizes an expert to give a precise forecast — that is, to do the most research before giving their forecast. It turns out that the logarithmic scoring rule is very good at this (99% of optimal), but you can do even better!

(Click here for my old blog post summarizing this chapter!)

Chapter 4: Arbitrage-free contract functions

(Joint work with my PhD advisor, Tim Roughgarden.)

Now let’s say that you’re eliciting forecasts from multiple experts. We can revisit the example I gave earlier: Jaime, Ege, and Tamay think there’s a 40%, 50%, and 90% chance that GPT-5 will be released this year. (These numbers are made up.)

Let’s say that I want to pay Jaime, Ege, and Tamay for their forecasts using the quadratic scoring rule. To elaborate on what this means, the formula I’ll use is: $\$100 \cdot (1 - (\text{forecasting error})^2)$ . For example, Jaime forecast a 40% chance. If GPT-5 is released this year, then the “perfect” forecast would be 100%, which means that his “forecasting error” would be 0.6. Thus, I would pay Jaime $\$100(1 - 0.6^2) = \$64$ . On the other hand, if GPT-5 is not released, then his forecasting error would be 0.4, so I would pay Jaime $\$100(1 - 0.4^2) = \$84$ .

To summarize all these numbers in a chart:

| Expert | Forecast | Payment if YES | Payment if NO |
|---|---|---|---|
| Jaime | 40% | $64 | $84 |
| Ege | 50% | $75 | $75 |
| Tamay | 90% | $99 | $19 |
| Total payment | | $238 | $178 |

Table 4.1: How much I owe to each expert under the YES outcome (GPT-5 is released this year) and the NO outcome (it’s not released this year).

But now, suppose that Jaime, Ege, and Tamay talk to each other and decide to all report the average of their forecasts, which in this case is 60%.

| Expert | Forecast | Payment if YES | Payment if NO |
|---|---|---|---|
| Jaime | 60% | $84 | $64 |
| Ege | 60% | $84 | $64 |
| Tamay | 60% | $84 | $64 |
| Total payment | | $252 | $192 |

Table 4.2: How much I owe to each expert under the YES and NO outcomes, if all three experts collude to say the average of their true beliefs.

In this case, I will owe more total dollars to them, no matter the outcome! They know this, and it gives them an opportunity to collude:

Step 1: They all report the average of their beliefs (60%).
Step 2: They agree to redistribute their total winnings in a way that leaves each of them better off than if they hadn’t colluded. (For example, they could agree that if YES happens, they’ll redistribute the $252 so that Jaime gets $68, Ege gets $80, and Tamay gets $104, and if NO happens, they’ll redistribute the $192 so that Jaime gets $88, Ege gets $80, and Tamay gets $24.)
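The arithmetic behind Tables 4.1 and 4.2 and the side deal in Step 2 can be checked with a few lines of Python. This sketch only recomputes the numbers from the text: each expert is paid 100 dollars times (1 minus the squared forecasting error), and the loop confirms that the proposed split spends exactly the colluders' pot while giving every expert more than honest reporting would, under both outcomes.

```python
def quadratic_payment(forecast, outcome, scale=100):
    """Pay scale * (1 - error^2); outcome is 1 for YES, 0 for NO."""
    return scale * (1 - (outcome - forecast) ** 2)


def total_payment(reports, outcome):
    return sum(quadratic_payment(p, outcome) for p in reports.values())


honest = {"Jaime": 0.4, "Ege": 0.5, "Tamay": 0.9}
average = sum(honest.values()) / len(honest)        # 0.6
colluding = {name: average for name in honest}      # everyone reports 60%

# The Step 2 side deal: how the colluders split the pot under each outcome.
side_deal = {1: {"Jaime": 68, "Ege": 80, "Tamay": 104},
             0: {"Jaime": 88, "Ege": 80, "Tamay": 24}}

for outcome, split in side_deal.items():
    honest_total = total_payment(honest, outcome)
    colluding_total = total_payment(colluding, outcome)
    print("YES" if outcome else "NO", round(honest_total), round(colluding_total))
    # The split spends exactly the colluders' pot...
    assert abs(sum(split.values()) - colluding_total) < 1e-9
    # ...and leaves every expert strictly better off than reporting honestly.
    assert all(split[n] > quadratic_payment(p, outcome) for n, p in honest.items())
```

The printed totals match the tables: $238 versus $252 if YES, and $178 versus $192 if NO.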

The collusion benefits them no matter what! Naturally, if I want an accurate sense of what each of them believes, I want to figure out how to pay them so that there’s no opportunity for them to collude like that.

And so there’s a natural question: is it possible to pay each expert in a way that incentivizes each expert to report their true belief and that prevents any opportunity for collusion? This question was asked in 2011 by Chun & Shachter.

In this chapter, I resolve Chun & Shachter’s question: yes, preventing Jaime, Ege, and Tamay from colluding is possible.

Why should this be possible? It’s because I can pit Jaime, Ege, and Tamay against each other. If there were only one expert, I could only reward the expert as a function of their own forecast. But if there are three experts, I can reward Jaime based on how much better his forecast was than Ege’s and Tamay’s. That’s the basic idea; if you want the details, go read Chapter 4!
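To make the "pit them against each other" idea concrete, here is one simple payment scheme in that spirit, offered as an illustrative Python sketch rather than the construction in Chapter 4: pay each expert 100 times their own quadratic score minus the average quadratic score of the other experts. An expert's report enters their own payment only through their own proper score, so truthful reporting stays optimal (holding the others' reports fixed). And because the payments always sum to zero, no coordinated report can raise the group's total. One drawback this sketch ignores is that some experts are paid negative amounts.

```python
def quadratic_score(report, outcome):
    """1 minus squared error; outcome is 1 for YES, 0 for NO."""
    return 1 - (outcome - report) ** 2


def competitive_payments(reports, outcome, scale=100):
    """Pay each expert scale * (own score - average score of the other experts)."""
    scores = {name: quadratic_score(p, outcome) for name, p in reports.items()}
    total, k = sum(scores.values()), len(scores)
    return {name: scale * (s - (total - s) / (k - 1)) for name, s in scores.items()}


honest = {"Jaime": 0.4, "Ege": 0.5, "Tamay": 0.9}
colluding = {name: 0.6 for name in honest}
for reports in (honest, colluding):
    for outcome in (1, 0):
        payments = competitive_payments(reports, outcome)
        # Whatever the reports, the payments sum to zero (up to float rounding),
        # so coordinating on a report cannot raise the experts' combined take.
        assert abs(sum(payments.values())) < 1e-9
print({name: round(x, 2) for name, x in competitive_payments(honest, 1).items()})
```

With the honest forecasts and a YES outcome, Tamay (who said 90%) receives about $29.50 while Jaime and Ege pay in; if all three report 60%, everyone's payment is zero.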

* Chapter 5: Quasi-arithmetic pooling

(Joint work with my PhD advisor, Tim Roughgarden.)

As before, let’s say that I elicit probabilistic forecasts from Jaime, Ege, and Tamay using a proper scoring rule.4 How should I combine their numbers into a single, all-things-considered forecast?

In this chapter, I make the case that the answer should depend on the scoring rule that you used to elicit their forecasts.

To see why, consider for comparison the quadratic and logarithmic scoring rules. Here’s a plot of the score of an expert as a function of the probability they report, if the event ends up happening.

Figure 5.1: If the YES outcome happens, an expert’s score under the quadratic and logarithmic scoring rules, as a function of the expert’s reported probability. (The scoring rules are normalized so as to be comparable.)

If Jaime says that there’s a 50% chance that GPT-5 comes out this year, and it does come out, he’ll get a score of 0.75 regardless of whether I use the quadratic or the log score. But if Jaime says that there’s a 1% chance that GPT-5 comes out this year, and it does come out, then he’ll get a score of 0.02 if I use the quadratic score, but will get a score of -0.66 if I use the log score.

(The scoring rules are symmetric: for example, Jaime’s score if he predicts 30% and GPT-5 doesn’t come out is the same as if he had predicted 70% and it did come out.)

This means that Jaime cares which outcome happens a different amount depending on which scoring rule I use. Below is a plot of how much higher a score Jaime would get if GPT-5 did come out compared to if it didn’t, as a function of the probability that he reports.

Figure 5.2: How much higher an expert’s score is under a YES outcome than under a NO outcome, as a function of the expert’s reported probability, for the quadratic and logarithmic scoring rules. In other words, how invested is the expert in getting a YES outcome instead of a NO outcome?

Suppose that Jaime reports an extreme probability, like 1% or 99%. This plot shows that Jaime cares much more about the outcome if I use the log scoring rule to reward him than if I use the quadratic score. This makes sense, since the log scoring rule strongly punishes assigning a really low probability to the eventual outcome. But conversely, if Jaime reports a non-extreme probability, like 25% or 75%, he actually cares more about the outcome if I use the quadratic score than if I use the log score.
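The numbers in this discussion can be reproduced with a short Python sketch. The normalization is my assumption: rescaling the log score to 1 + log2(p)/4 makes it agree with the quadratic score at p = 0.5 and p = 1, and it reproduces the 0.75, 0.02, and -0.66 quoted earlier. "Investment" here is the score if YES minus the score if NO, which by the symmetry noted earlier equals score(p) - score(1 - p).

```python
import math


def quadratic(p):
    """Quadratic score if YES happens, for reported probability p."""
    return 1 - (1 - p) ** 2


def logarithmic(p):
    """Log score if YES happens, rescaled (an assumption) to 1 + log2(p) / 4."""
    return 1 + math.log2(p) / 4


def investment(score, p):
    """Score if YES minus score if NO; by symmetry the NO score is score(1 - p)."""
    return score(p) - score(1 - p)


for p in (0.5, 0.01):
    print(p, round(quadratic(p), 2), round(logarithmic(p), 2))  # 0.75 0.75, then 0.02 -0.66
for p in (0.01, 0.25):
    print(p, round(investment(quadratic, p), 2), round(investment(logarithmic, p), 2))
```

At 1%, the log score's investment has the larger magnitude (about -1.66 versus -0.98); at 25%, the quadratic score's does (-0.5 versus about -0.40), in line with the comparison above.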

Intuitively, this means that if I use the log score, then Jaime cares a lot more about making his forecasts precise when they’re near the extremes. He cares about the difference between a 1% chance and a 0.1% chance, to a much greater degree than if I used the quadratic score. Jaime will think carefully and make his forecast extra precise before reporting a probability like 0.1%.

And so, if I use the log score and Jaime tells me 0.1% anyway, it makes sense for me to take that forecast seriously. If a different expert tells me 50%, it doesn’t make much sense for me to just take the average — 25.05% — because Jaime’s 0.1% forecast likely reflects a more informed, precise understanding.
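To see how far apart the two approaches land, here is a small Python sketch comparing the plain average with the logarithmic pool (averaging log-odds, which for two forecasts is the same as taking the geometric mean of the odds) for forecasts of 0.1% and 50%.

```python
import math


def linear_pool(forecasts):
    """Plain average of the probabilities."""
    return sum(forecasts) / len(forecasts)


def log_pool(forecasts):
    """Average the log-odds (geometric mean of the odds), then convert back to a probability."""
    mean_log_odds = sum(math.log(p / (1 - p)) for p in forecasts) / len(forecasts)
    return 1 / (1 + math.exp(-mean_log_odds))


forecasts = [0.001, 0.5]
print(f"{linear_pool(forecasts):.4f}")  # 0.2505
print(f"{log_pool(forecasts):.4f}")     # about 0.0307
```

The log pool lands near 3%, much closer to Jaime's 0.1% than the 25.05% average.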

To formalize this intuition, I came up with a method of aggregating forecasts that I called quasi-arithmetic pooling (QA pooling) with respect to the scoring rule being used for elicitation. Roughly speaking, instead of averaging the forecasted probabilities, QA pooling averages the kinds of numbers represented in Figure 5.2: each expert’s “amount of investment” in the possible outcomes. I was able to prove a bunch of cool properties of QA pooling:

QA pooling with respect to the quadratic scoring rule just means taking the average of the forecasts (this is called linear pooling). QA pooling with respect to the logarithmic scoring rule involves treating the forecasts as odds instead of probabilities, and then taking their geometric mean (this is called logarithmic pooling). Logarithmic pooling is the second most well-studied forecast aggregation technique (after linear pooling), and it works very well in practice. Thus, QA pooling maps the two most widely-used proper scoring rules to the two most well-studied forecast aggregation techniques!
Suppose that you receive a bunch of forecasts from different experts. You don’t know the eventual outcome, but your goal is to beat the average of all the experts’ scores no matter which outcome happens, and by as much as possible in the worst case. The way to do that is to use QA pooling with respect to the scoring rule.
- There’s a natural interpretation of this fact in terms of the concept of the wisdom of crowds. Suppose a bunch of people (the crowd) report forecasts. Is it possible to do better than a single random crowd member — that is, to guarantee yourself a better score than the average person in the crowd? The answer is yes! And the way to beat the crowd by the largest possible amount is to QA-pool the forecasts. In that sense, the QA pool is the correct way to aggregate the crowd (with respect to whichever scoring rule you care about). On this view, “wisdom of the crowds” is not just an empirical fact, but a mathematical one!
You can also do QA pooling with different weights for different experts (just like you can take a weighted average of numbers instead of a simple average). This is useful if you trust some experts more than others. But how can you decide how much to trust each expert? It turns out that so long as the scoring rule is bounded (e.g. quadratic, but not log), you can learn weights for experts over time based on the experts’ performance, and you’ll do almost as well as if you had known the best possible weights from the get-go. (In the field of online learning, this is called a no-regret algorithm.)
QA pooling can be used to define what it means to be over- or under-confident. This notion of overconfidence turns out to be equivalent to another natural notion of overconfidence (one that I first came up with in order to analyze the results of my pseudorandomness contest).
When coming up with a method of aggregating forecasts, there are some axioms/desiderata that you might want your aggregation method to satisfy. It turns out that for a certain natural set of axioms, the class of aggregation methods that comply with those axioms is precisely the class of all QA pooling methods.
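Here is a minimal Python sketch of weighted QA pooling for a yes/no event. It assumes the weights are nonnegative and sum to 1, and uses the "investment" map g(p) = score(p if YES) - score(p if NO): for the quadratic score that is 2p - 1, which makes the pool a plain weighted average, and for the log score it is the log-odds, which makes it the logarithmic pool. The helper `beats_weighted_crowd` (a hypothetical name) checks a weaker property than the worst-case optimality result stated earlier: only that the pool scores at least as well as the weighted-average expert under both outcomes.

```python
import math

# For a yes/no event: each rule's investment map g and its inverse.
# Affine rescalings of g give the same pool, so the natural log is fine here.
QA_MAPS = {
    "quadratic": (lambda p: 2 * p - 1, lambda x: (x + 1) / 2),                  # linear pool
    "log": (lambda p: math.log(p / (1 - p)), lambda x: 1 / (1 + math.exp(-x))),  # log pool
}
SCORES = {
    "quadratic": lambda p, y: 1 - (y - p) ** 2,
    "log": lambda p, y: math.log(p if y == 1 else 1 - p),
}


def qa_pool(forecasts, weights, rule):
    """Map each forecast through g, take the weighted average, and map back."""
    g, g_inverse = QA_MAPS[rule]
    return g_inverse(sum(w * g(p) for p, w in zip(forecasts, weights)))


def beats_weighted_crowd(forecasts, weights, rule):
    """True if the pool scores at least the weighted-average expert score under BOTH outcomes."""
    pooled = qa_pool(forecasts, weights, rule)
    score = SCORES[rule]
    return all(
        score(pooled, y) >= sum(w * score(p, y) for p, w in zip(forecasts, weights)) - 1e-12
        for y in (0, 1)
    )


forecasts, weights = [0.4, 0.5, 0.9], [1 / 3, 1 / 3, 1 / 3]
for rule in QA_MAPS:
    pooled = qa_pool(forecasts, weights, rule)
    print(rule, round(pooled, 3), beats_weighted_crowd(forecasts, weights, rule))
```

For the 40%, 50%, and 90% forecasts from Chapter 2 with equal weights, the quadratic pool is 60% and the log pool is about 64.5%, and both pass the check.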

In all of these senses, QA pooling seems like a really natural way to aggregate forecasts. I’m really excited to see QA pooling investigated further!

Chapter 6: Learning weights for logarithmic pooling

(Joint work with my PhD advisor, Tim Roughgarden.)

In my description of Chapter 5, I said:

It turns out that so long as the scoring rule is bounded (e.g. quadratic, but not log), you can learn weights for experts over time based on the experts’ performance, and you’ll do almost as well as if you had known the best possible weights from the get-go.

That is fair enough, but many natural proper scoring rules (such as the log score) are in fact unbounded. It would be nice to have results in those cases as well.

Unfortunately, if the scoring rule is unbounded, there is no way to get any result like this unconditionally. In particular, if your experts are horribly miscalibrated (e.g. if 10% of the time, they say 0.00000001% and then the event happens anyway), there’s no strategy for putting weights on the experts that can be guaranteed to work well.

But what if you assume that the experts are actually calibrated? In many cases, that’s a pretty reasonable assumption: for example, state-of-the-art machine learning systems are calibrated. So if you have a bunch of probability estimates from different AIs and you want to aggregate those estimates into a single number (this is called “ensembling”), it’s pretty reasonable to make the assumption that the AIs are giving you calibrated probabilities.

In this chapter, I prove that at least for the log scoring rule, you can learn weights for experts over time in a way that’s guaranteed to perform well on average, assuming that the experts are calibrated. (For readers familiar with online learning: the algorithm is similar to online mirror descent with a Tsallis entropy regularizer.)
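For readers who want something to experiment with, here is a Python sketch of learning weights for a logarithmic pool online. It is not the chapter's algorithm: it swaps in plain exponentiated-gradient (multiplicative-weights) updates on the log loss, with a hypothetical step size of 0.1, and it carries no regret guarantee here. The data are also hypothetical: expert A reports the true chance of the event and expert B always says 50%, so both are calibrated but only A is informative. All function names are illustrative.

```python
import math
import random


def log_pool(forecasts, weights):
    """Weighted logarithmic pool for a yes/no event: weighted average of log-odds."""
    z = sum(w * math.log(p / (1 - p)) for p, w in zip(forecasts, weights))
    return 1 / (1 + math.exp(-z))


def log_loss_gradient(forecasts, weights, outcome):
    """Gradient, with respect to the weights, of -log(pooled probability of the outcome)."""
    q = log_pool(forecasts, weights)  # pooled P(YES)
    grad = []
    for p in forecasts:
        realized = math.log(p if outcome == 1 else 1 - p)
        expected_under_pool = q * math.log(p) + (1 - q) * math.log(1 - p)
        grad.append(expected_under_pool - realized)
    return grad


def learn_weights(rounds, n_experts, step=0.1):
    """Exponentiated-gradient updates on the probability simplex (a swapped-in method)."""
    weights = [1 / n_experts] * n_experts
    for forecasts, outcome in rounds:
        grad = log_loss_gradient(forecasts, weights, outcome)
        weights = [w * math.exp(-step * g) for w, g in zip(weights, grad)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights


def synthetic_rounds(n, seed=0):
    """Hypothetical data: expert A reports the true chance, expert B always says 50%."""
    rng = random.Random(seed)
    for _ in range(n):
        chance = rng.uniform(0.1, 0.9)
        yield [chance, 0.5], int(rng.random() < chance)


weights = learn_weights(synthetic_rounds(2000), n_experts=2)
print(weights)
```

Because B's log-odds are always zero, its gradient is zero, and the updates shift weight toward the informative expert A.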

* Chapter 7: Robust aggregation of substitutable signals

(Joint work with my PhD advisor, Tim Roughgarden.)

Let’s say that it rains on 30% of days. You look at two (calibrated) weather forecasts: Website A says there’s a 60% chance that it’ll rain tomorrow, while Website B says there’s a 70% chance. Given this information, what’s your all-things-considered estimate of how likely it is to rain tomorrow?

The straightforward answer to this question is that I haven’t given you enough information. If Website A is strictly more informed than Website B, you should say 60%. If Website B is strictly more informed than Website A, you should say 70%. If the websites have non-overlapping information, you should say something different than if their information is heavily overlapping. But I haven’t told you that, so I haven’t given you the information you need in order to produce the correct aggregate forecast.

In my opinion, that’s not a good excuse, because often you lack this information in practice. You don’t know which website is more informed and by how much, or how much their information overlaps. Despite all that, you still want an all-things-considered guess about how likely it is to rain. But is there even a theoretically principled way to make such a guess?

In this chapter, I argue that there is a principled way to combine forecasts in the absence of this knowledge, namely by using whatever method works as well as possible under worst-case assumptions about how the experts’ information sets overlap. This is a quintessentially theoretical CS-y way of looking at the problem: when you lack relevant information, you pick a strategy that’ll do well robustly, i.e. no matter what that information happens to be. In other words: you want to guard as well as possible against nasty surprises. This sort of work has been explored before under the name of robust forecast aggregation — but most of that work has had to make some pretty strong assumptions about the forecasters’ information overlap (for example, that there are two experts, one of whom is strictly more informed than the other, but you don’t know which).

By contrast, in this chapter I make a much weaker assumption: roughly speaking, all I assume is that the experts’ information is substitutable, in the economic sense of the word. This means that there are diminishing marginal returns to learning additional experts’ information. This is a natural assumption that holds pretty often: for example, suppose that Website A knows tomorrow’s temperature and cloud cover, whereas Website B knows tomorrow’s temperature and humidity. Since their information overlaps (they both know the temperature), Website B’s information is less valuable if you already know Website A’s information, and vice versa.

The chapter has many results: both positive ones (“if you use this strategy, you’re guaranteed to do somewhat well”) and negative ones (“on the other hand, no strategy is guaranteed to do very well in the worst case”). Here I’ll highlight the most interesting positive result, which I would summarize as: average, then extremize.

In the leading example, I gave two pieces of information:

Each expert’s forecast (60% and 70%)
The prior — that is, the forecast that someone with no special information would give (30%)

A simple heuristic you might use is to average the experts’ forecasts, ignoring the prior altogether: after all, the experts know that it rains on 30% of days, and they just have some additional information.

Yet, the fact that the experts updated from the prior in the same direction is kind of noteworthy. To see what I mean, let’s consider a toy example. Suppose that I have a coin, and I have chosen the coin’s bias (i.e. probability of coming up heads) uniformly between 0% and 100%. You’re interested in forecasting the bias of the coin. Since I’ve chosen the bias uniformly, your best guess (without any additional information) is 50%.

Now, suppose that two forecasters each see an independent flip of the coin. If you do the math, you’ll find that if a forecaster sees heads, they should update their guess for the bias to 2/3, and if they see tails, they should update to 1/3. Let’s say that both forecasters tell you that their guess for the bias of the coin is 2/3 — so you know that they both saw heads. What should your guess be about the bias of the coin?

Well, you now have more information than either forecaster: you know that the coin came up heads both times it was flipped! And so you should actually say 3/4, rather than 2/3. That is, because the two forecasters saw independent evidence that pointed in the same direction, you should update even more in that direction. This move — updating further away from the prior after aggregating the forecasts you have available — is called extremization.
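The coin numbers follow from Laplace's rule of succession: with a uniform prior on the bias, after seeing h heads in n flips, the posterior mean of the bias is (h + 1)/(n + 2). A short Python check with exact fractions:

```python
from fractions import Fraction


def posterior_mean_bias(heads, flips):
    """Posterior mean of a coin's bias under a uniform prior (Laplace's rule of succession)."""
    return Fraction(heads + 1, flips + 2)


print(posterior_mean_bias(0, 0))  # 1/2: no information
print(posterior_mean_bias(1, 1))  # 2/3: one forecaster who saw heads
print(posterior_mean_bias(0, 1))  # 1/3: one forecaster who saw tails
print(posterior_mean_bias(2, 2))  # 3/4: pooling both forecasters' heads
```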

Now, generally speaking, experts’ forecasts won’t be based on completely independent information, and so you won’t want to extremize quite as much as you would if you assumed independence. But as long as there’s some non-overlap in the experts’ information, it does make sense to extremize at least a little.
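As a purely illustrative Python sketch of "average, then extremize" (the linear extremization rule and the strength parameter `factor` are hypothetical choices, not the specific technique from the chapter), here is the simplest version: average the forecasts, then move the average further from the prior by a constant factor, clipping to [0, 1].

```python
def average_then_extremize(forecasts, prior, factor):
    """Average the forecasts, then move that average further from the prior.

    `factor` is a hypothetical extremization strength: 1.0 means plain averaging,
    and values above 1 push the aggregate past the average, away from the prior.
    The result is clipped to [0, 1] so it stays a probability.
    """
    average = sum(forecasts) / len(forecasts)
    return min(1.0, max(0.0, prior + factor * (average - prior)))


for factor in (1.0, 1.3):
    print(factor, round(average_then_extremize([0.6, 0.7], prior=0.3, factor=factor), 3))
```

With the rain example (prior 30%, forecasts 60% and 70%), factor = 1 gives the plain 65% average, while factor = 1.3 pushes it to about 75.5%. How much to extremize in practice depends on how much the experts' information overlaps.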

The benefits to extremization aren’t just theoretical: Satopää et al. found that extremization improves aggregate forecasts, and Jaime Sevilla found that the extremization technique I suggest in this chapter works well on data from the forecast aggregator Metaculus.

Beyond giving a theoretical grounding to some empirical results in forecast aggregation, I’m excited about the work in this chapter because it opens up a whole bunch of new directions for exploration. Ultimately, in this chapter I made progress on a pretty narrow question. I won’t define all these terms, but here’s the precise question I answered:

What approximation ratio can be achieved by an aggregator who learns expected value estimates of a real-valued quantity Y from m truthful experts whose signals are drawn from an information structure that satisfies projective substitutes, if the aggregator’s loss is their squared error and the aggregator knows nothing about the information structure or only knows the prior?

Each of the bolded clauses can be varied. Relative to what baseline do we want to measure the aggregator’s performance? What sort of information does the aggregator get from the experts? Are the experts truthful or strategic? What assumptions are we making about the interactions between the experts’ information? What scoring rule are we using to evaluate the forecasts? In all, there are tons of different questions you can ask within the framework of robust forecast aggregation. I sometimes imagine this area as a playground with a bunch of neat problems that people have only just started exploring. I’m excited!

Chapter 8: When does agreement imply accuracy?

(Joint work with Raf Frongillo and Bo Waggoner.)

In 2005, Scott Aaronson wrote one of my favorite papers ever: The Complexity of Agreement. (Aaronson’s blog post summarizing the paper, which I read in 2015, was a huge inspiration and may have been counterfactually responsible for my thesis!) Here’s how I summarize Aaronson’s main result in my thesis:

Suppose that Alice and Bob are honest, rational Bayesians who wish to estimate some quantity — say, the unemployment rate one year from now. Alice is an expert on historical macroeconomic trends, while Bob is an expert on contemporary monetary policy. They convene to discuss and share their knowledge with each other until they reach an agreement about the expected value of the future unemployment rate. Alice and Bob could reach agreement by sharing everything they had ever learned, at which point they would have the same information, but the process would take years. How, then, should they proceed?

In the seminal work “Agreeing to Disagree,” Aumann (1976) observed that Alice and Bob can reach agreement simply by taking turns sharing their current expected value for the quantity[…] A remarkable result by Aaronson (2005) shows that if Alice and Bob follow certain protocols of this form, they will agree to within $\epsilon$ with probability $1-\delta$ by communicating $O \left( \frac{1}{\delta \epsilon^2} \right)$ bits [of information…] Notably, this bound only depends on the error Alice and Bob are willing to tolerate, and not on the amount of information available to them.

In other words: imagine that Alice and Bob — both experts with deep but distinct knowledge — have strongly divergent opinions on some topic, leading them to make different predictions. You may have thought that Alice and Bob would need to have a really long conversation to hash out their differences — but no! At least if we model Alice and Bob as truth-seeking Bayesians, they can reach agreement quite quickly, simply by repeatedly exchanging their best guesses: first, Alice tells Bob her estimate. Then, Bob updates his estimate in light of the estimate he just heard from Alice, and responds with his new estimate. Then, Alice updates her estimate in light of the estimate she just heard from Bob, and responds with her new estimate. And so on. After only a small number of iterations, Alice and Bob are very likely to reach agreement!5
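Here is a small simulation of this back-and-forth on a hypothetical finite probability space. It uses a few equally likely states, a quantity f defined on them, and a partition for each person describing which states they can tell apart. It exchanges exact estimates rather than the few-bit messages that Aaronson's bounds concern, so it illustrates the Aumann-style dynamics, not Aaronson's communication-efficient protocol. All function and variable names are illustrative.

```python
from fractions import Fraction


def estimate(f, partition, public, state):
    """Speaker's expected value of f at `state`: the average of f over the public
    states that the speaker cannot distinguish from `state`."""
    cell = [w for w in public if partition[w] == partition[state]]
    return Fraction(sum(f[w] for w in cell), len(cell))


def exchange_estimates(f, alice, bob, true_state, max_rounds=100):
    """Alternate announcements (Alice first) until two in a row reveal nothing new."""
    public = list(range(len(f)))  # states consistent with everything said so far
    history, quiet_rounds = [], 0
    for t in range(max_rounds):
        speaker = alice if t % 2 == 0 else bob
        announced = estimate(f, speaker, public, true_state)
        history.append(announced)
        # Listeners rule out every state in which the speaker would have announced something else.
        refined = [w for w in public if estimate(f, speaker, public, w) == announced]
        quiet_rounds = quiet_rounds + 1 if len(refined) == len(public) else 0
        public = refined
        if quiet_rounds >= 2:
            break
    return history


# Four equally likely states w = 2a + b, where a is Alice's bit and b is Bob's bit.
alice_partition = [0, 0, 1, 1]  # Alice observes a
bob_partition = [0, 1, 0, 1]    # Bob observes b
f_and = [0, 0, 0, 1]            # quantity of interest: a AND b
print(exchange_estimates(f_and, alice_partition, bob_partition, true_state=3))
# [Fraction(1, 2), Fraction(1, 1), Fraction(1, 1), Fraction(1, 1)]
```

With both bits equal to 1, Alice's announcement of 1/2 reveals her bit. Bob then knows the answer exactly, and the pair settles on 1.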

However, while Aaronson’s paper shows that Alice and Bob agree, there’s no guarantee that the estimate that they agree on is accurate. In other words, you may have hoped that by following Aaronson’s protocol (i.e. repeatedly exchanging estimates until agreement is reached), the agreed-upon estimate would be similar to the estimate that Alice and Bob would have reached if they had exchanged all of their information. Unfortunately, no such accuracy guarantee is possible in general.

As a toy example, suppose that Alice and Bob each receive a random bit (0 or 1) and are interested in estimating the XOR of their bits (that is, the sum of their bits modulo 2).

| | Bob’s bit = 0 | Bob’s bit = 1 |
|---|---|---|
| Alice’s bit = 0 | XOR = 0 | XOR = 1 |
| Alice’s bit = 1 | XOR = 1 | XOR = 0 |

Table 8.2: XOR

Since Alice knows nothing about Bob’s bit, she thinks there’s a 50% chance that his bit is the same as hers and a 50% chance that his bit is different from hers. This means that her estimate of the XOR is 0.5 from the get-go. And that’s also Bob’s estimate — which means that they agree from the start, and no communication is necessary to reach agreement. Alas, 0.5 is very far from the true value of the XOR, which is either 0 or 1.
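A brute-force check of the XOR example: whatever the bits are, each person's estimate is 0.5. They therefore agree immediately, yet the agreed-upon estimate is off by 0.5 (squared error 0.25) in every case.

```python
def estimate_of_xor(own_bit: int) -> float:
    """Expected XOR given only your own bit, with the other bit uniform on {0, 1}."""
    return sum(own_bit ^ other_bit for other_bit in (0, 1)) / 2


for alice_bit in (0, 1):
    for bob_bit in (0, 1):
        alice_est, bob_est = estimate_of_xor(alice_bit), estimate_of_xor(bob_bit)
        truth = alice_bit ^ bob_bit
        # Columns: Alice's bit, Bob's bit, both estimates, true XOR, squared error.
        print(alice_bit, bob_bit, alice_est, bob_est, truth, (alice_est - truth) ** 2)
```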

In this example, even though Alice and Bob agreed from the start, their agreement was superficial: it was based on ignorance. They merely agreed because the information they had was useless in isolation, and only informative when combined together. Put otherwise, to an external observer, finding out Bob’s bit is totally useless without knowing Alice’s bit, but extremely useful if they already know Alice’s bit. Alice and Bob’s pieces of information are complements rather than substitutes. (Recall also that the notion of informational substitutes came up in Chapter 7!)

This observation raises a natural question: what if we assume that Alice and Bob’s information is substitutable — that is, an external observer gets less mileage from learning Bob’s information if they already know Alice’s information, and vice versa? In that case, are Alice and Bob guaranteed to have an accurate estimate as soon as they’ve reached agreement?

In this chapter, I show that the answer is yes! There are a bunch of ways to define informational substitutes, but I give a particular (admittedly strong) definition under which agreement does imply accuracy.

I’m excited about this result for a couple reasons. First, it provides another example of substitutes-like conditions on information being useful (on top of the discussion in Chapter 7). Second, the result can be interpreted in the context of prediction markets. In a prediction market, participants don’t share information directly; rather, they buy and sell shares, thus partially sharing their beliefs about the expected value of the quantity of interest. Thus, this chapter’s main result might also shed light on the question of market efficiency: under what conditions does a market’s price successfully incorporate all traders’ information? This chapter’s suggested answer: when the traders’ pieces of information are substitutable, rather than complementary.6

I generally think that the topic of agreement — and more generally, communication-constrained truth-seeking — is really neglected relative to how interesting it is, and I’d be really excited to see more work in this direction.

Chapter 9: Deductive circuit estimation

(Joint work at the Alignment Research Center with Paul Christiano, Jacob Hilton, Václav Rozhoň, and Mark Xu.)

This chapter is definitely the weirdest of the bunch. It may also be my favorite.

A boolean circuit is a simple kind of input-output machine. You feed it a bunch of bits (zeros and ones) as input, it performs a bunch of operations (ANDs, ORs, NOTs, and so forth), and outputs — for the purposes of this chapter — a single bit, 0 or 1. Boolean circuits are the building blocks that computers are made of.

Let’s say that I give you a boolean circuit C. How would you go about estimating the fraction of inputs on which C will output 1? (I call this quantity C’s acceptance probability, or p(C).) The most straightforward answer is to sample a bunch of random inputs and then just check what fraction of them cause C to output 1.
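The sampling approach takes only a few lines. The 3-input circuit below is a hypothetical toy example, written as a Python function rather than a gate list. For such a tiny circuit, we can also compute p(C) exactly by enumerating all inputs and compare it with the sampled estimate.

```python
import random
from itertools import product


def toy_circuit(x):
    """Hypothetical example circuit: x0 AND (x1 OR x2)."""
    return x[0] & (x[1] | x[2])


def sampled_acceptance(circuit, n_inputs, samples=10_000, seed=0):
    """Estimate p(C) as the fraction of uniformly random inputs on which C outputs 1."""
    rng = random.Random(seed)
    hits = sum(circuit([rng.randint(0, 1) for _ in range(n_inputs)]) for _ in range(samples))
    return hits / samples


def exact_acceptance(circuit, n_inputs):
    """Exact p(C) by enumerating all 2^n inputs (only feasible for tiny circuits)."""
    return sum(circuit(x) for x in product((0, 1), repeat=n_inputs)) / 2 ** n_inputs


print(exact_acceptance(toy_circuit, 3))    # 0.375
print(sampled_acceptance(toy_circuit, 3))  # close to 0.375
```

Neither function says anything about why the circuit accepts 3/8 of its inputs. That gap is the one deductive estimation aims to fill.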

This is very effective and all, but it has a downside: you’ve learned nothing about why C outputs 1 as often as it does. If you want to understand why a circuit outputs 1 on 99% of inputs, you can’t just look at the input-output behavior: you have to look inside the circuit and examine its structure. I call this process deductive circuit estimation, because it uses deductive reasoning, as opposed to sampling-based estimation (which uses inductive reasoning). Deductive reasoning of this kind is based on “deductive arguments”, which point out something about the structure of a circuit in order to argue about the circuit’s acceptance probability.

Here are a few examples, paraphrased from the thesis:

Suppose that a circuit C takes as input a triple (a, b, c) of positive integers (written down in binary). It computes max(a, b) and max(b, c), and outputs 1 if they are equal. A deductive argument about C might point out that if b is the largest of the three integers, then max(a, b) = b = max(b, c), and so C will output 1, and that this happens with probability roughly 1/3.

This argument points out that C outputs 1 whenever b is the largest of the three integers. The argument does not point out that C also outputs 1 when a and c are both larger than b and happen to be equal. In this way, deductive arguments about circuits can help distinguish between different “reasons why” a circuit might output 1. (More on this later.)
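The two "reasons why" can be separated by brute force on a small version of this circuit. As an illustrative assumption, a, b and c are drawn uniformly from the n-bit positive integers 1, ..., 2^n − 1. The sketch counts separately the inputs accepted because b is the largest (ties included) and the inputs accepted because a = c > b. The first share is slightly above 1/3 and tends to 1/3 as n grows, while the second tends to 0.

```python
from fractions import Fraction
from itertools import product


def circuit(a, b, c):
    return max(a, b) == max(b, c)


def acceptance_breakdown(n_bits):
    values = range(1, 2 ** n_bits)  # positive n-bit integers (an illustrative choice)
    total = b_largest = a_equals_c = 0
    for a, b, c in product(values, repeat=3):
        total += 1
        if circuit(a, b, c):
            if b >= a and b >= c:
                b_largest += 1   # the reason the argument points out: b is the largest
            else:
                a_equals_c += 1  # the only other way to accept: a == c > b
    return Fraction(b_largest, total), Fraction(a_equals_c, total)


for n in (2, 4, 6):
    reason_1, reason_2 = acceptance_breakdown(n)
    print(n, float(reason_1), float(reason_2), float(reason_1 + reason_2))
```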

The next example makes use of SHA-256, which is a famous hash function: the purpose of SHA-256 is to produce “random-looking” outputs that are extremely hard to predict.

Suppose that C(x) computes SHA-256(x) (the output of SHA-256 is a 256-bit string) and outputs 1 if the first 128 bits (interpreted as an integer) are larger than the last 128 bits. One can make a deductive argument about p(C) by making repeated use of the presumption of independence. In particular, the SHA-256 circuit consists of components that produce uniformly random outputs on independent, uniformly random inputs. Thus, a deductive argument that repeatedly presumes that the inputs to each component are independent concludes that the output of SHA-256 consists of independent, uniformly random bits. It would then follow that the probability that the first 128 bits of the output are larger than the last 128 bits is essentially 1/2.
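For comparison, here is the inductive (sampling-based) estimate of the same quantity, using Python's `hashlib`. The text doesn't fix an input length, so this sketch assumes 32-byte inputs. Under the presumption of independence, the exact deductive estimate is (1 − 2^−128)/2, since the two halves tie with probability 2^−128. Sampling should land near 1/2. That agreement is what the presumption predicts, but it is not a proof.

```python
import hashlib
import random


def sha_circuit(x: bytes) -> int:
    """Output 1 if the first 128 bits of SHA-256(x), read as an integer, exceed the last 128 bits."""
    digest = hashlib.sha256(x).digest()  # 32 bytes = 256 bits
    first_half = int.from_bytes(digest[:16], "big")
    second_half = int.from_bytes(digest[16:], "big")
    return int(first_half > second_half)


def sampled_acceptance(samples=20_000, input_bytes=32, seed=0):
    """Fraction of random fixed-length inputs on which sha_circuit outputs 1."""
    rng = random.Random(seed)
    hits = sum(
        sha_circuit(rng.getrandbits(8 * input_bytes).to_bytes(input_bytes, "big"))
        for _ in range(samples)
    )
    return hits / samples


# Presuming independent, uniform output bits: P(first half > second half) = (1 - 2**-128) / 2.
deductive_estimate = (1 - 2 ** -128) / 2  # indistinguishable from 0.5 in floating point
print(deductive_estimate, sampled_acceptance())
```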

The third example is about a circuit that checks for twin primes. This example points out that deductive arguments ought to be defeasible: a deductive argument can lead you to an incorrect estimate of p(C), but in that case there ought to be a further argument about C that will improve your estimate.

Suppose that C takes as input a random integer k between $e^{100}$ and $e^{101}$ and accepts if k and k + 2 are both prime. A deductive argument about p(C) might point out that the density of primes in this range is roughly 1%, so if we presume that the event that k is prime and the event that k + 2 is prime are independent, then we get an estimate of p(C) = 0.01%. A more sophisticated argument might take this one step further by pointing out that if k is prime, then k is odd, so k + 2 is odd, which makes k + 2 more likely to be prime (by a factor of 2), suggesting a revised estimate of p(C) = 0.02%. A yet more sophisticated argument might point out that additionally, if k is prime, then k is not divisible by 3, which makes k + 2 more likely to be divisible by 3, which reduces the chance that k + 2 is prime.
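Carrying this argument further, one prime at a time, produces a sequence of ever-more-refined estimates. The sketch assumes a prime density of exactly 1/100 near $e^{100}$ (since 1/ln k ≈ 1/100 there). For each prime p, it replaces the naive chance (p − 1)/p that p does not divide k + 2 with the conditional chance given that k is prime: 1 for p = 2, and (p − 2)/(p − 1) for odd p. The running product is the usual heuristic behind the Hardy–Littlewood twin prime conjecture. With primes up to 10,000, it settles at about 1.32 times the naive 0.01%.

```python
def primes_up_to(n):
    """Sieve of Eratosthenes (assumes n >= 1)."""
    is_prime = bytearray([1]) * (n + 1)
    is_prime[0:2] = b"\x00\x00"
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            is_prime[i * i :: i] = bytes(len(range(i * i, n + 1, i)))
    return [i for i, flag in enumerate(is_prime) if flag]


def twin_prime_estimates(density=0.01, max_prime=10_000):
    """Successive estimates of p(C), correcting the independence presumption prime by prime."""
    estimate = density ** 2  # presume "k prime" and "k + 2 prime" are independent
    history = [estimate]
    for p in primes_up_to(max_prime):
        if p == 2:
            factor = 2.0  # k prime => k odd => k + 2 odd (certain, vs. the naive 1/2)
        else:
            factor = p * (p - 2) / (p - 1) ** 2  # ((p - 2)/(p - 1)) / ((p - 1)/p)
        estimate *= factor
        history.append(estimate)
    return history


h = twin_prime_estimates()
print([f"{e:.4%}" for e in h[:3]], f"{h[-1]:.4%}")
```

The first three entries are the 0.01%, 0.02% and (after the factor of 3/4 from p = 3) 0.015% estimates from the argument above. Each later prime nudges the estimate by a smaller amount.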

In this chapter, I ask the following question: is there a general-purpose deductive circuit estimation algorithm, which takes as input a boolean circuit C and a list of deductive arguments about C, and outputs a reasonable estimate of p(C)? You can think of such an algorithm as being analogous to a program that verifies formal proofs. Much as a proof verifier takes as input a mathematical statement and a purported formal proof, and accepts if the proof actually proves the statement, a deductive circuit estimator takes as input a circuit together with observations about the circuit, and outputs a “best guess” about the circuit’s acceptance probability. A comparison table from the thesis:

| Deductive circuit estimation | Formal proof verification |
|---|---|
| Deductive estimation algorithm | Proof verifier |
| Boolean circuit | Formal mathematical statement |
| List of deductive arguments | Alleged proof of statement |
| Formal language for deductive arguments | Formal language for proofs |
| Desiderata for estimation algorithm | Soundness and completeness |
| Algorithm’s estimate of circuit’s acceptance probability | Proof verifier’s output (accept or reject) |

Table 9.1: We are interested in developing a deductive estimation algorithm for boolean circuits. There are similarities between this task and the (solved) task of developing an algorithm for verifying formal proofs of mathematical statements. This table illustrates the analogy. Importantly, the purpose of a deductive estimation algorithm is to incorporate the deductive arguments that it has been given as input, rather than to generate its own arguments. The output of a deductive estimation algorithm is only as sophisticated as the arguments that it has been given.

Designing a deductive estimation algorithm requires you to do three things:

Come up with a formal language in which deductive arguments like the ones in the above examples can be expressed.
Come up with a list of desiderata (i.e. “reasonableness properties”) that the deductive estimation algorithm ought to satisfy.
Find an algorithm that satisfies those desiderata.

In this chapter, I investigate a few desiderata:

Linearity: given a circuit C with input bits $x_1, \ldots, x_n$, define $C[x_i = 0]$ to be the circuit that you get when you “force” $x_i$ to be 0. (The resulting circuit now has n – 1 inputs instead of n.) Define $C[x_i = 1]$ analogously. The deductive estimator’s estimate of p(C) should be equal to the average of its estimates of $p(C[x_i = 0])$ and $p(C[x_i = 1])$.
Respect for proofs: a formal proof that bounds the value of p(C) can be given to the deductive estimator as an argument, and forces the deductive estimator to output an estimate that’s within that bound.
0-1 boundedness: the deductive estimator’s estimate of the acceptance probability of any circuit is always between 0 and 1.
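For intuition, the exact acceptance probability, computed by brute-force enumeration, satisfies linearity and 0-1 boundedness automatically. The hard part is finding an efficient estimator that keeps these properties when exact computation is out of reach. This sketch checks linearity on a hypothetical 3-input circuit, with circuits modeled as Python functions on bit tuples.

```python
from fractions import Fraction
from itertools import product


def acceptance(circuit, n):
    """Exact p(C): fraction of all 2^n inputs on which C outputs 1."""
    return Fraction(sum(circuit(x) for x in product((0, 1), repeat=n)), 2 ** n)


def restrict(circuit, i, value):
    """C[x_i = value]: the (n - 1)-input circuit obtained by forcing input i to `value`."""
    def restricted(x):
        return circuit(tuple(x[:i]) + (value,) + tuple(x[i:]))
    return restricted


def example_circuit(x):
    """Hypothetical circuit: x0 XOR (x1 AND x2)."""
    return x[0] ^ (x[1] & x[2])


n = 3
p = acceptance(example_circuit, n)
for i in range(n):
    p0 = acceptance(restrict(example_circuit, i, 0), n - 1)
    p1 = acceptance(restrict(example_circuit, i, 1), n - 1)
    assert p == (p0 + p1) / 2  # linearity holds for the exact value
    print(i, p0, p1, p)
```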

In this chapter, I give an efficient algorithm that satisfies the first two of these properties. The algorithm is pretty cool, but I argue that ultimately it isn’t what we’re looking for, because it doesn’t satisfy a different (informal) desirable property called independence of irrelevant arguments. That is, the algorithm I give produces estimates that can be easily influenced by irrelevant information.

Does any efficient algorithm satisfy all three of linearity, respect for proofs, and 0-1 boundedness? Unfortunately, the answer is no (under standard assumptions from complexity theory). However, I argue that 0-1 boundedness isn’t actually that important to satisfy, and that instead we should be aiming to satisfy the first two properties along with some other desiderata. I discuss what those desiderata may look like, but ultimately leave the question wide open.

Even though this chapter doesn’t get close to actually providing a good algorithm for deductive circuit estimation, I’m really excited about it, for two reasons.

The first reason is that I think this problem is objectively really cool and arguably fundamental. Just as mathematicians formalized the notion of a mathematical proof a century ago, perhaps this line of work will lead to a formalization of a much broader class of deductive arguments.

The second reason for my excitement is the potential for applications to AI safety. When we train an AI, we train it to produce outputs that look good to us. But one of the central difficulties of building safe advanced AI systems is that we can’t always tell whether an AI output looks good because it is good or because it’s bad in a way we don’t notice. A particularly pernicious failure mode is when the AI intentionally tricks us into thinking that its output was good.

(Consider a financial assistant AI that takes actions like buying and selling stocks, transferring money between bank accounts, and paying taxes, and suppose we train the AI to turn a profit, subject to passing some basic checks for legal compliance. If the AI finds a way to circumvent the compliance checks — e.g. by doing some sophisticated, hard-to-notice money laundering — then it could trick its overseers into thinking that it’s doing an amazing job, despite taking actions that the overseers would strongly disapprove of if they knew about them.)

How does this relate to deductive circuit estimation? Earlier I mentioned that deductive arguments can let you distinguish between different reasons why a circuit might exhibit some behavior (like outputting 1). Similarly, if we can formally explain the reasons why an AI exhibits a particular behavior (like getting a high reward during training), then we can hope to distinguish between benign reasons for that behavior (it did what we wanted) and malign reasons (it tricked us).

This is, of course, a very surface-level explanation (see here for a slightly more in-depth one), and there’s a long way to go before the theory in this chapter can be put into practice. But I think that this line of research is one of the most promising for addressing some of the most pernicious ways in which AIs could end up being unsafe.

(I am now employed at the Alignment Research Center, and am really excited about the work that we’ve been doing — along these lines and others — to understand neural network behavior!)

Epilogue

As you can probably tell, I’m really excited about algorithmic Bayesian epistemology as a research direction. Partly, that’s because I think I solved a bunch of cool problems in some really under-explored areas. But I’m equally excited by the many questions I didn’t answer and areas I didn’t explore. In the epilogue, I discuss some of the questions that I’m most excited about:

Bayesian justifications for generalized QA pooling: In Chapter 5, I defined QA pooling as a particular way to aggregate forecasts that’s sensitive to the scoring rule that was used to elicit the forecasts. One natural generalization of QA pooling allows experts to have arbitrary weights that don’t need to add to 1. It turns out that for the quadratic and logarithmic scoring rules, this generalization has natural “Bayesian justifications”. This means that in some information environments, generalized linear and logarithmic pooling is the best possible way to aggregate experts’ forecasts. (See Section 2.4 for details.) I’m really curious whether there’s a Bayesian justification for generalized QA pooling with respect to every proper scoring rule.
Directions in robust forecast aggregation: In Chapter 7, I discussed robust forecast aggregation as a theoretically principled, “worst-case optimal” approach to aggregating forecasts. There are a whole bunch of directions in which one could try to generalize my results. For example, the work I did in that chapter makes the most sense in the context of real-valued forecasts (which don’t have to be between 0 and 1), and I’d love to see work along similar lines in the context of aggregating probabilities, with KL divergence used as the notion of error instead of squared distance.
Finding a good deductive estimator: In Chapter 9, I set out to find a deductive circuit estimation algorithm that could handle a large class of deductive arguments in a reasonable way. Ultimately I didn’t get close to finding such an algorithm, and I would love to see more progress on this.
Sophisticated Bayesian models for forecast aggregation: While several of the chapters of my thesis were about forecast aggregation, none of them took the straightforwardly Bayesian approach of making a model of the experts’ information overlap. I have some ideas for what a good Bayesian model could look like, and I’d love to see some empirical work on how well the model would work in practice. (If this sounds up your alley, shoot me an email!)
Wagering mechanisms that produce good aggregate forecasts: Wagering mechanisms are alternatives to prediction markets. In a wagering mechanism, forecasters place wagers in addition to making predictions, and those wagers get redistributed according to how well the forecasters did. These mechanisms haven’t been studied very much, and — as far as I know — have never been used in practice. That said, I think wagering mechanisms are pretty promising and merit a lot more study. In part, that’s because wagering mechanisms give an obvious answer to the question of “how much should you weigh each forecaster’s prediction”: proportionally to their wagers! But as far as I know, there’s no theorem saying this results in good aggregate forecasts. I would love to see a wagering mechanism and a model of information for which you could prove that equilibrium wagers result in good aggregate forecasts.
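A sketch of the last idea, for a binary event. It pairs the wager-weighted average of forecasts with one standard payout rule from the wagering-mechanism literature, the weighted-score wagering mechanism, using a quadratic score rescaled to lie in [0, 1]. Both choices are assumptions for illustration. Nothing here shows that the wager-weighted aggregate is a good forecast; that is exactly the open question.

```python
def quadratic_score(p: float, outcome: int) -> float:
    """Quadratic (Brier-style) score rescaled to [0, 1]; higher is better."""
    return 1 - (outcome - p) ** 2


def wager_weighted_forecast(wagers, forecasts):
    """Aggregate forecast: each forecaster weighted in proportion to their wager."""
    return sum(m * p for m, p in zip(wagers, forecasts)) / sum(wagers)


def weighted_score_payouts(wagers, forecasts, outcome):
    """Redistribute wagers: payout_i = m_i * (1 + s_i - wager-weighted mean score)."""
    scores = [quadratic_score(p, outcome) for p in forecasts]
    mean_score = sum(m * s for m, s in zip(wagers, scores)) / sum(wagers)
    return [m * (1 + s - mean_score) for m, s in zip(wagers, scores)]


wagers, forecasts = [1.0, 3.0], [0.2, 0.6]
print(wager_weighted_forecast(wagers, forecasts))  # about 0.5
print(weighted_score_payouts(wagers, forecasts, outcome=1))
```

Because scores lie in [0, 1], every payout is nonnegative, and the payouts always sum to the total amount wagered. Forecasters who scored above the wager-weighted average gain at the expense of those who scored below it.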

My thesis is called Algorithmic Bayesian Epistemology, and I’m proud of it.

Thanks so much to my thesis advisor, Tim Roughgarden. He was really supportive throughout my time in grad school, and was happy to let me explore whatever I wanted to explore, even if it wasn’t inside his area of expertise. That said, even though algorithmic Bayesian epistemology isn’t Tim’s focus area, his advice was still really helpful. Tim has a really expansive knowledge of essentially all of theoretical computer science, which means he was able to see connections and make suggestions that I wouldn’t have come up with myself.

1 I don’t want to make the video of my defense public, but email me if you want to see it!

2 The right answer, as far as I can tell, is to defer to the NWS’ National Blend of Models. But that just raises the question: how does the National Blend of Models reconcile disagreeing probabilities?

3 How did the name “Algorithmic Bayesian Epistemology” come about? “Bayesian epistemology” basically just means using probabilities to reason about uncertainty. “Algorithmic” is more of a term of art, which in this case means looking for satisfactory solutions that adhere to real-world constraints, as opposed to solutions that would be optimal if you ignored those constraints. See here for a longer explanation.

4 Our discussion of collusion was confined to Chapter 4 — now we’re assuming the experts can’t collude and instead just tell me their true beliefs.

5 Unfortunately, this protocol is efficient only in terms of communication. To actually update their estimates, Alice and Bob may need to do a very large amount of computation at each step.

6 Interestingly, Chen and Waggoner (2017) showed that under a (different) informational substitutes condition, traders in a prediction market are incentivized to reveal all of their information right away by trading. This question of incentives is different from the question addressed in my thesis chapter: my chapter can be interpreted as making the assumption that traders will trade on their information, and asking whether the market price will end up reflecting all traders’ information. Taken together, these two results suggest that market dynamics may be quite nice indeed when experts have substitutable information!

http://ericneyman.wordpress.com/?p=2266


My hour of memoryless lucidity

Eric Neyman May 4, 2024


Yesterday, I had a coronectomy: the top halves of my bottom wisdom teeth were surgically removed. It was my first time being sedated, and I didn’t know what to expect. While I was unconscious during the surgery, the hour after surgery turned out to be a fascinating experience, because I was completely lucid but had almost zero short-term memory.

My girlfriend, who had kindly agreed to accompany me to the surgery, was with me during that hour. And so — apparently against the advice of the nurses — I spent that whole hour talking to her and asking her questions.

The biggest reason I find my experience fascinating is that it has mostly answered a question that I’ve had about myself for quite a long time: how deterministic am I?

In computer science, we say that an algorithm is deterministic if it’s not random: if it always behaves the same way when it’s in the same state. In this case, my “state” was my environment (lying drugged on a bed with my IV in and my girlfriend sitting next to me) plus the contents of my memory. Normally, I don’t ask the same question over and over again because the contents of my memory change when I ask the question the first time: after I get an answer, the answer is in my memory, so I don’t need to ask the question again. But for that hour, the information I processed came in one ear and out the other in a matter of minutes. And so it was a natural test of whether my memory is the only thing keeping me from saying the same things on loop forever, or whether I’m more random/spontaneous than that.1

And as it turns out, I’m pretty deterministic! According to my girlfriend, I spent a lot of that hour cycling between the same few questions on loop: “How did the surgery go?” (it went well), “Did they just do a coronectomy or did they take out my whole teeth?” (just a coronectomy), “Is my IV still in?” (yes), “how long was the surgery?” (an hour and a half), “what time is it?”, and “how long have you been here?”. (The length of that cycle is also interesting, because it gives an estimate of how long I was able to retain memories for — apparently about two minutes.)

(Toward the end of that hour, I remember asking, “I know I’ve already asked this twice, but did they just do a coronectomy?” (The answer: “actually you’ve asked that much more than twice, and yes, it was just a coronectomy.”))

Those weren’t my only questions, though. About five minutes into that hour, I apparently asked my girlfriend for two 2-digit numbers to multiply, to check how cognitively impaired I was. She gave me 27*69, and said that I had no trouble doing the multiplication in the obvious way (27*7*10 – 27), except that I kept having to ask her to remind me what the numbers were.

Interestingly, I asked her for two 2-digit numbers again toward the end of that hour, having no memory that I had already done this. She told me that she had already given me two numbers, and asked whether I wanted the same numbers again. I said yes (so I could compare my performance). The second time, I was able to do the multiplication pretty quickly without needing to ask for the numbers to be repeated.

Also, about 20 minutes into the hour, I asked my girlfriend to give me the letters to that day’s New York Times Spelling Bee, which is a puzzle where you’re given seven letters and try to form words using the letters. (The letters were W, A, M, O, R, T, and Y.) I found the pangram — the word that uses every letter at least once2 — in about 30 seconds, which is about average for me, except that yesterday I was holding the letters in my head instead of looking at them on a screen. I also got most of the way to the “genius” rank — a little better than I normally do — and my girlfriend got us the rest of the way there.

A couple hours later, when I was home, I could remember two of the Spelling Bee letters — W and O.3 I asked my girlfriend to give me the letters again, and this time it took me longer to get the pangram: about two minutes.

Overall, this suggests that I was not cognitively impaired at all during that hour (except, of course, for my memory). I find this really interesting, because I would have expected that whatever mechanism knocked me out and severely impaired my memory would also give me like a 50-point IQ drop. But apparently not!

The surgery went pretty well, but there was a strangely-textured fluid by my bottom-right wisdom tooth. If it turns out to be a scary sort of fluid4 (I’ll find out on Tuesday), then I may need a second operation. On the one hand, that would suck. But on the other hand, now that I have a better sense of my post-operative condition, I can plan some more experiments!

My friend Drake suggested an experiment that I’d be really excited to try: my girlfriend (or whoever’s with me) could repeatedly ask me to generate a random number between 1 and 100. Yesterday showed that I’m pretty deterministic; but am I so deterministic that I would say the same number every time I was asked? My guess is no, but also that I wouldn’t be great at generating random numbers. It turns out that if you ask ChatGPT for a number between 1 and 100, it’ll say “42” 10% of the time, “47” and “57” about 7% of the time, and some numbers (like 30, or anything below 15) pretty much never. Would I be better or worse than ChatGPT at producing uniformly random numbers? Who knows!

I could also be repeatedly asked a question that I don’t have a cached answer to (like “What’s your favorite geological formation?”) and see if I produce the same answer every time.

More ambitiously, I could be given a streaming problem to solve. Streaming algorithms are algorithms for computers that have extremely limited memory. For example, in the count-distinct problem, you’re given a long list of numbers and are asked to count the number of distinct entries in the list. If you can remember every number that has appeared so far in the list, then this problem is easy. If you can’t remember the numbers, then you can’t reliably count the exact number of distinct entries, but there are clever schemes for getting close to the right answer! I think this is cool because it lets you overcome a deficiency (lack of memory) with cleverness. I don’t know how fun it would be to try this particular problem, but there’s probably some streaming algorithm that would be fun to implement!
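As an example of such a scheme, here is a sketch of the k-minimum-values (KMV) estimator for count-distinct. This is my choice of illustration, not something the post specifies; HyperLogLog is a better-known relative. Each item is hashed to a pseudo-random number in [0, 1), and only the k smallest distinct hash values are kept. If D distinct items have appeared, the k-th smallest of D uniform values is typically about k/D, so (k − 1) divided by it estimates D. Memory stays at k numbers no matter how long the stream is, and the typical relative error is roughly 1/√k (about 6% for k = 256).

```python
import hashlib
import heapq


def hash_to_unit(item) -> float:
    """Map an item to a pseudo-random number in [0, 1) using 53 bits of its SHA-256 hash."""
    digest = hashlib.sha256(repr(item).encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2 ** 53


class KMinValues:
    """Count-distinct sketch that stores only the k smallest hash values seen."""

    def __init__(self, k: int = 256):
        self.k = k
        self._heap = []     # negated hashes, so -heap[0] is the largest kept hash
        self._kept = set()  # the same hashes, for fast duplicate checks

    def add(self, item) -> None:
        h = hash_to_unit(item)
        if h in self._kept:
            return  # repeated item: its hash is already stored
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._kept.add(h)
        elif h < -self._heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._kept.discard(evicted)
            self._kept.add(h)

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct items: the count is exact
        return (self.k - 1) / (-self._heap[0])


sketch = KMinValues(k=256)
for _ in range(3):  # every item appears three times
    for item in range(10_000):
        sketch.add(item)
print(round(sketch.estimate()))  # an estimate near 10,000
```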

Drake (who suggested the random number experiment) also told me about a truly wild experiment that he conducted while getting his wisdom teeth removed:

Several years ago, I was thinking about worthwhile precautions to take against strange scenarios and wanted a way to defend against erasure of my short-term memory, e.g. by the CIA or alien abductions. I’d heard the factoid that every time you think through a memory you end up overwriting it, which suggested a loophole: build up a long-term memory ahead of time, with a designated piece of the memory to fill in as necessary. Then, when you want to send a message through the barrier of your future amnesia, you think through the old memory with your target message inserted, thereby writing directly into your brain’s long-term storage and bypassing the cache that would otherwise get erased. I kept up a habit of occasionally thinking through a false ‘memory’ of coming downstairs on Christmas morning and opening up a large present, but ending the scene before I saw what was in the box.

I had the opportunity to try this out while getting my wisdom teeth removed, when I was put under the influence of a relaxation and memory-inhibiting drug. When I left the operating room, I consulted my memory and found that the box contained a Martian landscape with a drill boring into the bottom right corner. Judging by the other bits of context that felt associated with the scene, I think my drug-addled brain was trying to use the Martian landscape as a metaphor for lacking proprioception in my mouth and being uncertain of its internal topology, while the drill was a metaphor for a drill. I don’t remember anything else between 30 seconds after they put the IV in to when I walked out of the operating room.

But Drake notes:

While I’m confident that the memory really was a result of thoughts I had under the influence of amnesiac drugs, I’m only around 50% sure that this strategy worked via the intended mechanism, and it’s plausible that I just thought about this scene hard enough during the surgery to overcome the effects of the drug (and anyone focusing on a particular image for a couple minutes with the intent of remembering it in that scenario would have succeeded).

To test whether Drake’s circumvention of his short-term memory loss worked via the intended mechanism, I could ask my girlfriend in advance to prompt me once — and only once — to complete the long-term memory scene that I had been practicing. Then I could see if I have a memory of the scene after I fully regain my memory.

I would love to hear suggestions for other things I could try. If you have any, let me know in a comment!

[Edit 5/8: I just found out that the strangely-textured fluid is a dentigerous cyst, which was the best possible outcome. I won’t be needing a second surgery after all!]

1 Of course, my actions are fully determined by my entire brain state. But it seems plausible that low-level effects that change unpredictably (like which particular neurons happen to be firing) would affect my words and actions, and also plausible that these low-level changes wouldn’t affect my words and actions.

2 The pangram was MOTORWAY. (This was in fact the only pangram.)

3 My theory is that we had been playing Spelling Bee for enough time that the letters somewhat made it into my long-term memory.

4 The surgeon said he was “100% sure” it wasn’t cancer. But it could be a benign tumor that would need to be dealt with anyway.

http://ericneyman.wordpress.com/?p=2358


How much do you believe your results?

Eric Neyman May 5, 2023



[Note: images may not load if you’re using the WordPress app. Try opening this post in a browser, or reading it on LessWrong.]

Thanks to Drake Thomas for feedback.

I.

Here’s a fun scatter plot. It has two thousand points, which I generated as follows: first, I drew two thousand x-values from a normal distribution with mean 0 and standard deviation 1. Then, I chose the y-value of each point by taking the x-value and then adding noise to it. The noise is also normally distributed, with mean 0 and standard deviation 1.

Notice that there’s more spread along the y-axis than along the x-axis. That’s because each y-coordinate is a sum of two independently drawn numbers from the standard normal distribution. Because variances add, the y-values have variance 2 (standard deviation 1.41), not 1.
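A minimal sketch of the generating procedure just described, assuming NumPy; the helper name `make_scatter` and the seed are illustrative choices, not from the post. It checks that the sample variance comes out near 1 for x and near 2 for y.

```python
import numpy as np

def make_scatter(n=2000, seed=0):
    """Draw x ~ N(0, 1), then set y = x + noise with noise ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, size=n)
    y = x + rng.normal(0.0, 1.0, size=n)
    return x, y

x, y = make_scatter()
# Variances of independent draws add, so Var(y) should be close to 1 + 1 = 2.
print(f"var(x) = {x.var():.2f}, var(y) = {y.var():.2f}, sd(y) = {y.std():.2f}")
```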

Statisticians often talk about data forming an “elliptical cloud”. You can see how the data forms into an elliptical shape. To put a finer point on it:

Why an ellipse — what’s the mathematical significance of this shape? The answer pops out if you look at a plot of how likely different points on the plane are to be selected by the random generation procedure that I used.

The highest density of points is near (0, 0), and as you get farther from the origin the density decreases. The green ellipse on the scatter plot is a level set of the probability density: if you were to select a datapoint using my procedure, you’d be more likely to land in any square millimeter inside the ellipse than in any square millimeter outside the ellipse — and you’d be equally likely to land in any location on the ellipse as on any other location on the ellipse.
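To see why the level sets are ellipses, here is a short derivation, assuming only the generating procedure above (x and the noise are independent standard normals, and y = x + noise). The joint density is the density of x times the density of the noise, y - x:

$p(x, y) = \varphi(x)\,\varphi(y - x) = \frac{1}{2\pi} \exp\left(-\tfrac{1}{2}\left[x^2 + (y - x)^2\right]\right) = \frac{1}{2\pi} \exp\left(-\tfrac{1}{2}\left[2x^2 - 2xy + y^2\right]\right),$

where $\varphi$ is the standard normal density. The density is therefore constant exactly where $2x^2 - 2xy + y^2 = c$ for some constant $c > 0$. This quadratic form has matrix $\begin{pmatrix} 2 & -1 \\ -1 & 1 \end{pmatrix}$, the inverse of the covariance matrix $\begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}$; since it has positive diagonal entries and determinant 1, it is positive definite, so every level set is an ellipse centered at the origin.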

The line of best fit is a statistical tool for answering the following question: given an x-value, what is your best guess about the y-value?

What is the line of best fit for this data? Here’s one line of reasoning: since the y-values were generated by taking the x-values and adding random noise, our best guess for y should just be x. So the line of best fit is y = x.

Huh, weird… this line is weirdly “askew” of the ellipse, and it doesn’t reflect the fact that the y-values are more dispersed than the x-values. Maybe the line of best fit instead passes from the bottom-left to the top-right of the ellipse, along its major axis. It sure looks like the points are on average closer to this line than to the previous one.

Which line is the line of best fit, and what’s wrong with the other line? I recommend pondering this for a bit before reading on.

The answer is that the first line, y = x, is the line of best fit. The problem with the second line is that it doesn’t try to predict y given x. I mean, scroll back up and take a look at how low the line is at x = -2: it’s way below almost all of the points whose x-value is near -2! This line is instead doing a different, important thing: it indicates the axis of maximum variation of the data. It’s the line with the property that, if you project the data onto the line, the data will be maximally dispersed. This line is called the first principal component of the data, but it is not the line of best fit.
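A rough numerical check of the two lines, assuming NumPy and the same generating procedure as above (the larger sample size and the seed are arbitrary choices). The least-squares slope for predicting y from x should come out near 1, while the first principal component, the top eigenvector of the covariance matrix, is steeper: for the population covariance matrix [[1, 1], [1, 2]] its slope works out to (1 + √5)/2 ≈ 1.62.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)
y = x + rng.normal(size=n)

# Line of best fit for predicting y from x: OLS slope = Cov(x, y) / Var(x).
cov = np.cov(x, y)               # 2x2 sample covariance matrix
ols_slope = cov[0, 1] / cov[0, 0]

# First principal component: eigenvector of the covariance matrix with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
pc1 = eigvecs[:, -1]
pc1_slope = pc1[1] / pc1[0]              # ratio is unaffected by the eigenvector's sign

print(f"OLS slope ≈ {ols_slope:.3f}")    # close to 1
print(f"PC1 slope ≈ {pc1_slope:.3f}")    # close to (1 + sqrt(5)) / 2 ≈ 1.618
```

The two numbers answer different questions: the first minimizes vertical prediction error, the second maximizes the spread of the projected data.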

Instead of going from the bottom-left to the top-right of the ellipse, the line of best fit goes from the left of the ellipse to the right. This is the line that has as much of the ellipse above it as below it, at every x-coordinate. This is what you want, because you want the true y-value to be below your prediction as often as it is above your prediction.1

(Huh, what a weird asymmetry! I wonder why the line doesn’t instead go from the bottom of the ellipse to the top…)

II.

You are the director of a giant government research program that’s conducting randomized controlled trials (RCTs) on two thousand health interventions, so that you can pick out the most cost-effective ones and promote them among the general population.

The quality of the two thousand interventions follows a normal distribution, centered at zero (no harm or benefit) and with standard deviation 1. (Pick whatever units you like — maybe one quality-adjusted life-year per ten thousand dollars of spending, or something in that ballpark.)

Unfortunately, you don’t know exactly how good each intervention is — after all, then you wouldn’t be doing this job. All you can do is get a noisy measurement of intervention quality using an RCT. We’ll call this measurement the intervention’s performance in your RCT.

You’re really good at your job, so your RCTs are unbiased: if an intervention has quality 0.7 and you were to repeat your RCT a million times, on average the intervention’s performance will be 0.7. But because you can’t run your RCTs on large populations, they are noisy: if an intervention has quality Q, its performance will be drawn from a normal distribution with mean Q and standard deviation 1.

After many years of hard work, your team has conducted all two thousand RCTs. As you expected, the performance numbers you got back are normally distributed, with variance 2 (1 coming from the difference in intervention qualities, and 1 coming from the noise in your RCTs).

I have two questions for you:

1. True or false: the intervention with the highest expected quality, given the information you have from your RCTs, is the intervention with the highest performance.
2. True or false: the expected quality of an intervention with performance P is equal to P.

Consider these questions before reading on.

Secretly, the two thousand data points in the scatter plots above represent the quality (x) and performance (y) of your interventions. And I do mean secretly, because you do not know the quality of any intervention, only its performance. So while I, the omniscient narrator, see this —

— you see this:

You know the distribution of the y-values. You even know the shape of the overall distribution of the scatter plot. You just don’t know where individual interventions fall along the x-axis. The best you can do is guess.

But how do you guess quality from performance? Do you use the best fit line from earlier?

This would be a mistake. The line says that the expected performance of an intervention with quality q is also q: $\mathbb{E}[\text{Performance} | \text{Quality} = q] = q$ . That would be useful if you were guessing performance based on quality. But you know performance and don’t know quality. So while this red line has the property that for every x-value, there’s as much of the ellipse above it as below it, what you want is a line with the property that for every y-value, there’s as much of the ellipse to the left of it as to the right of it.

You want this line:

If you want, you can imagine flipping the axes, so that performance is horizontal and quality is vertical; then the line of best fit would run from the left to the right, vertically cutting the ellipse in half. If you did that, the line would have slope 0.5, not 1. The message of this line is:

$\mathbb{E}[\text{Quality}] = 0.5 \cdot \text{Performance}.$

(Why 0.5? Remember that performance is a sum of two random variables with standard deviation 1: the quality of the intervention and the noise of the trial. So when you see a performance number like 4, in expectation the quality of the intervention is 2 and the contribution from the noise of the trial (i.e. how lucky you got in the RCT) is also 2.)
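A quick simulation of this RCT setup, assuming NumPy (sample size and seed are arbitrary). It regresses quality on performance, the direction you actually need, and recovers a slope near 0.5 = Var(Quality) / Var(Performance) = 1 / (1 + 1), while the reverse regression has slope near 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
quality = rng.normal(0.0, 1.0, size=n)                  # true quality ~ N(0, 1)
performance = quality + rng.normal(0.0, 1.0, size=n)    # unbiased RCT with noise sd 1

cov = np.cov(quality, performance)
slope_q_on_p = cov[0, 1] / cov[1, 1]    # Cov(Q, P) / Var(P): predicts quality from performance
slope_p_on_q = cov[0, 1] / cov[0, 0]    # Cov(Q, P) / Var(Q): predicts performance from quality

print(f"E[Q | P] slope ≈ {slope_q_on_p:.3f}  (theory: 1 / (1 + 1) = 0.5)")
print(f"E[P | Q] slope ≈ {slope_p_on_q:.3f}  (theory: 1)")

# Among interventions whose performance landed near 4, average true quality should be about 2.
near_four = np.abs(performance - 4.0) < 0.1
print(f"mean quality given P ≈ 4: {quality[near_four].mean():.2f}")
```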

Let’s return to our questions:

1. True or false: the intervention with the highest expected quality, given the information you have from your RCTs, is the intervention with the highest performance.

The answer to this is true. The better the performance, the better the expected quality. This is obvious, but I think some people are confused by it because the top of the ellipse isn’t in the same place as the rightmost point of the ellipse. But that doesn’t matter: if I select a point from the ellipse and tell you its y-value, then the larger the y-value is, the larger your best guess about the x-value will be (and in particular, your best guess will be based on that purple line).

2. True or false: the expected quality of an intervention with performance P is P.

This one’s false. The expected quality of an intervention with performance P is 0.5 times P.

Ponder this for a bit, and internalize it, if you haven’t already. You did an RCT. Your RCT was unbiased: for an intervention of quality Q, your methodology will on average give you an estimate (performance) of Q. And yet, when you see an intervention with performance 4, your best guess is that the quality of the intervention is only 2.

So when Xavier Becerra, the U.S. Secretary of Health and Human Services, looks at your results and says “oh wow, with this intervention we can give people four healthy years of their life back for just ten thousand dollars,” you politely temper his excitement and tell him that despite the results, you only expect the intervention to give people two healthy years of their life back per ten thousand dollars spent.

As briefly mentioned earlier, this is because performance is a sum of two independent variables: quality and noise. And when you see a large number like 4, you think the intervention is good, but you also think you got lucky, in equal amounts.

(This is true for all of the studies: it’s not a consequence of bias from selecting the best studies. Though the absolute amount by which you need to discount your results — in this case, 2 — is larger for interventions with better performances.)

Hence the title of this post: “How much do you believe your results?” If the HHS Secretary asks you how much you believe the results of your RCTs, the correct answer is “fifty percent”.

III.

Impressed by both the quality of your trials and your honesty, Secretary Becerra appoints you to lead a new megaproject: two thousand more RCTs. This time, though, your job is trickier. While one thousand of the RCTs will be as noisy as before — normally distributed noise with standard deviation 1 — the other thousand will be much noisier. That’s because the health interventions are more involved and you won’t be able to get as large of a sample. These thousand RCTs will have noise with standard deviation 3.

As before, you do your RCTs and get back performance scores for every intervention. You don’t know the quality of any intervention, of course, but if you did, your performance versus quality scatter plot would look like this:

(We will call the interventions whose RCTs have noise 1 blue interventions, and will call the interventions whose RCTs have noise 3 red interventions.)

Of course all the interventions with the best performance are the red ones — you predicted that at the outset! It’s not that those interventions were systematically better or higher-variance: both sets of interventions have qualities that are normally distributed with mean 0 and standard deviation 1. It’s just that the best-performing interventions are the ones where you get lucky during the RCT, and there’s a ton of luck in the results of the noisy RCTs.

And so, the same question once more: how much do you believe your results? For the blue interventions we already have our answer: 50% — that is, the expected value of quality is 0.5 times the performance. Or in terms of that line from earlier — the one running from the bottom of the blue ellipse to the top, predicting quality from performance — its slope is 2. Every two units of performance increase correspond to one unit of increase in quality.

What about the red interventions? What’s the slope of that line?

Bear with me as we do a bit of math. We are interested in finding the constant $\beta$ such that $\mathbb{E}[\text{Quality}] = \beta \cdot \text{Performance}$ . To do so, we’re going to look at the expected value of quality times performance in two different ways. Abbreviating quality as Q and performance as P, we have

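$\mathbb{E}[Q \cdot P] = \mathbb{E}[Q(Q + \text{noise})] = \mathbb{E}[Q^2] + \mathbb{E}[Q]\,\mathbb{E}[\text{noise}] = \mathbb{E}[Q^2] = \text{Var}(Q) = 1.$

(The middle step uses the fact that the noise is independent of quality and has mean 0; the step after it uses the fact that quality has mean 0.)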

On the other hand, we also have

$\mathbb{E}[Q \cdot P] = \mathbb{E}[P \cdot \mathbb{E}[Q \mid P]] = \mathbb{E}[P \cdot \beta P] = \beta \mathbb{E}[P^2] = 10\beta,$

where the 10 comes from the fact that performance is quality (variance 1) plus noise (variance 9), and variances add. Therefore, $\beta = 0.1$ .
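The same argument works for any noise level: with quality variance 1 and noise standard deviation σ, it gives β = 1 / (1 + σ²). A small sketch, assuming NumPy; the function names `shrinkage` and `simulated_shrinkage` are hypothetical. σ = 1 gives 0.5, σ = 3 gives 0.1, and σ = 10 gives 1/101, and a Monte Carlo estimate of E[Q·P] / E[P²] should agree.

```python
import numpy as np

def shrinkage(noise_sd, quality_var=1.0):
    """beta in E[Q | P] = beta * P, for Q ~ N(0, quality_var) and P = Q + N(0, noise_sd**2)."""
    return quality_var / (quality_var + noise_sd ** 2)

def simulated_shrinkage(noise_sd, n=400_000, seed=3):
    """Estimate beta = E[Q * P] / E[P^2] from simulated draws with quality ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    q = rng.normal(0.0, 1.0, size=n)
    p = q + rng.normal(0.0, noise_sd, size=n)
    return np.mean(q * p) / np.mean(p * p)

for sd in (1.0, 3.0, 10.0):
    print(f"noise sd {sd:>4}: beta = {shrinkage(sd):.4f}, simulated ≈ {simulated_shrinkage(sd):.4f}")
```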

So, how much do you believe your noisy RCT results? The answer is: just 10 percent! The best-fit line for predicting quality from performance has slope 10. And correspondingly, a performance result of 10 — absolutely stellar! you expect just one of those in your entire study! — makes you think that the intervention is… kinda good. One standard deviation above average. 84th percentile.

You come back to Secretary Becerra to report your results. He’s impressed: there are more than 20 interventions whose performance was more than 6 — way better than last time! You caution him that the RCTs behind those performances are noisy and that he shouldn’t believe the results very much.
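These counts follow from normal tail probabilities. A sketch using only the standard library's `math.erfc`; the helper name `expected_count_above` is hypothetical, and it assumes 1,000 interventions per group with performance standard deviation √2 for blue and √10 for red.

```python
import math

def expected_count_above(threshold, perf_sd, n=1000):
    """Expected number of n interventions whose performance ~ N(0, perf_sd**2) exceeds threshold."""
    tail = 0.5 * math.erfc(threshold / (perf_sd * math.sqrt(2)))   # P(N(0, sd^2) > threshold)
    return n * tail

red_sd, blue_sd = math.sqrt(10), math.sqrt(2)
print(f"red above 10: {expected_count_above(10, red_sd):.2f}")   # roughly one
print(f"red above 6:  {expected_count_above(6, red_sd):.1f}")    # comfortably more than 20
print(f"blue above 6: {expected_count_above(6, blue_sd):.3f}")   # essentially none
```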

Becerra thanks you for your hard work and tells you that the HHS has enough funding to promote ten interventions — and that it will be up to you to decide which ones will get promoted. The rest of the studies will be shelved, as per government policy.

You wish you had known this at the outset. Then you wouldn’t have bothered running the noisy RCTs at all! (Or at least you would have worked very hard to make them less noisy.) Here’s why:

The performances of the blue interventions are normally distributed with mean 0 and standard deviation $\sqrt{2}$ . Since expected quality is 50% of performance, your best guesses about the qualities of the blue interventions after seeing the RCT results are distributed with mean 0 and standard deviation $1/\sqrt{2}$ , which is about 0.71.
The performances of the red interventions are normally distributed with mean 0 and standard deviation $\sqrt{10}$ . Since expected quality is 10% of performance for these interventions, your best guesses for the qualities of the red interventions after seeing the RCT results are distributed with mean 0 and standard deviation $1/\sqrt{10}$ , which is about 0.32.

If you draw a thousand samples from a normal distribution with mean 0 and standard deviation 0.71, and another thousand from a normal distribution with mean 0 and standard deviation 0.32, it is almost guaranteed that the top ten draws will be from the first distribution. There was essentially no chance that any of the red interventions would be in your top 10 list, after you take care to ask yourself how much you believe your results.
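A quick Monte Carlo version of this argument, assuming NumPy (trial count and seed are arbitrary; `top_ten_red_count` is a hypothetical helper). It ranks interventions by expected quality given performance (0.5 × performance for blue, 0.1 × performance for red) and counts how many reds make the top ten, compared with naively ranking by raw performance.

```python
import numpy as np

def top_ten_red_count(rng, n=1000, shrink=True):
    """Return how many red interventions land in the top ten of one simulated study."""
    blue_perf = rng.normal(0.0, 1.0, n) + rng.normal(0.0, 1.0, n)   # noise sd 1
    red_perf = rng.normal(0.0, 1.0, n) + rng.normal(0.0, 3.0, n)    # noise sd 3
    if shrink:
        blue_score, red_score = 0.5 * blue_perf, 0.1 * red_perf      # E[Q | P] for each group
    else:
        blue_score, red_score = blue_perf, red_perf                  # naive: trust performance
    scores = np.concatenate([blue_score, red_score])
    is_red = np.concatenate([np.zeros(n, bool), np.ones(n, bool)])
    top_ten = np.argsort(scores)[-10:]                               # indices of the ten best scores
    return int(is_red[top_ten].sum())

rng = np.random.default_rng(4)
careful = [top_ten_red_count(rng, shrink=True) for _ in range(200)]
naive = [top_ten_red_count(rng, shrink=False) for _ in range(200)]
print(f"average reds in top ten, ranking by E[Q | P]: {np.mean(careful):.2f}")
print(f"average reds in top ten, ranking by performance: {np.mean(naive):.2f}")
```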

(On the other hand, if you were a less careful scientist who didn’t ask themself this question, your top ten list would all be red interventions, all of which would likely be much worse than you were expecting them to be.)

It gets worse. Suppose that the red interventions are systematically more effective than the blue ones, by an entire standard deviation. That is, the red interventions’ qualities are distributed with standard deviation 1 and mean 1. This means that the average red intervention is as effective as an 84th percentile blue intervention. (This seems pretty realistic, e.g. because the lowest-hanging fruit for easy-to-assess interventions has already been picked.)

Now, all red interventions’ qualities and performances are 1 unit larger than before, so the red ellipse and line from before are translated one unit up and to the right:

You still believe your results 10%, but this 10% now has a slightly different interpretation: if an intervention’s performance is better than average by some amount x, then your best guess is that this intervention’s quality is better than average by 0.1*x. Or as an equation:

$\mathbb{E}[\text{Quality}] = 1 + 0.1(\text{Performance} - 1).$

Because performance is normally distributed with mean 1 and standard deviation $\sqrt{10}$ , the overall distribution of your best guesses about the qualities of the red interventions is like before, but translated to the right by one unit:

The typical red intervention comes out looking much better than the typical blue intervention (of course), but we care about the very best interventions. Zooming in on the right tail of the graphs:

It turns out that the best-looking of a thousand blue interventions is still very likely to look better to you than the best-looking of a thousand red interventions!
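A sketch of this comparison under the same assumptions as the text (blue quality N(0, 1) with noise sd 1; red quality N(1, 1) with noise sd 3), assuming NumPy with an arbitrary trial count and seed. It estimates how often the best-looking blue intervention looks better than the best-looking red one, using the expected-quality formulas for each group.

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials = 1000, 2000
blue_wins = 0
for _ in range(trials):
    blue_perf = rng.normal(0.0, 1.0, n) + rng.normal(0.0, 1.0, n)
    red_perf = rng.normal(1.0, 1.0, n) + rng.normal(0.0, 3.0, n)
    blue_best = np.max(0.5 * blue_perf)                 # E[Q | P] = 0.5 P for blue
    red_best = np.max(1.0 + 0.1 * (red_perf - 1.0))     # E[Q | P] = 1 + 0.1 (P - 1) for red
    blue_wins += int(blue_best > red_best)
print(f"best-looking blue beats best-looking red in {blue_wins / trials:.0%} of trials")
```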

IV.

You wake up from a dream. In the dream you had this really cool job as the leader of a giant megaproject of health intervention RCTs run by the HHS.

Ha, if only. You were considered for the job a decade ago, but were ultimately passed over in favor of a different academic.

You’ve been thinking about those studies recently, because the government published that second batch of RCT results — two thousand of them! (You had dreamed that government studies don’t get released, but luckily it was only a dream, if a terrifying one.) You decide to dig into the results.

It would be really nice if the studies explicitly addressed the age-old question — “How much do you believe your results?” — but of course they don’t. You only see the topline “performance” numbers and have to do the inference yourself.

If you spent a whole bunch of time on a single study, you could get some vague sense of how noisy it was. I mean, you can look at the sample size to get some sort of preliminary guess, but the real world is way more complicated than that and actually most of the noise comes from other methodological choices and real-world circumstances that might not even have made it into the papers. And there are two thousand of them. What are you gonna do, spend the rest of your life inferring quality from performance?

Conveniently, you’ve just woken up from a dream where you learned that half the RCTs had noise 1 and the other half had noise 3. What a convenient fact to know if you want to infer quality from performance!

And so you get to work. You come up with a plan:

1. For each health intervention, you will take its performance P and use Bayes’ rule to figure out the probability that its RCT had noise 1 versus noise 3.
2. Let r be the probability that the intervention had noise 1, which you calculated in Step 1. Then with probability r, the expected quality of the intervention is 0.5*P. And with probability 1 – r, the expected quality is 0.1*P. So the overall expected quality is

$\mathbb{E}[\text{Quality}] = r(0.5P) + (1-r)(0.1P).$

How do you do Step 1 (calculate r)? Well, remember that the interventions with noise-1 RCTs have performance scores distributed normally with mean 0 and variance 2, whereas the noise-3 RCTs have performance scores distributed normally with mean 0 and variance 10. So — using the formula for a normal distribution — the probability that an intervention with performance P came from a noise-1 RCT is

$r=\frac{\frac{1}{\sqrt{2}} \exp(-P^2/4)}{\frac{1}{\sqrt{2}} \exp(-P^2/4) + \frac{1}{\sqrt{10}} \exp(-P^2/20)}.$

You plug this into the formula for expected quality as a function of performance that you derived in Step 2, and…

…whoa.

Expected quality drops in the middle of the graph, before going back up? Weird.

The above plot shows $r(0.5P) + (1 - r)(0.1P)$ as a function of P. What if we just look at r, the probability that the intervention had an RCT with noise 1, as a function of P?

Between performance 2 and 4, the probability that the intervention came from a noise-1 RCT drops dramatically. You believe the results of the study much less if it has performance 4 than if it has performance 2, in a way that trades off against the increase in performance. This explains the drop in expected quality.

(And then things pick back up again: for performance above 6, you’re basically guaranteed that the intervention had noise 3 — but once the performance is large enough, even dividing by 10 gives an impressive result.)

(Well, not that impressive. A performance of 10 — which is about the highest number you see among all the RCTs — means an expected quality of 1, which means you guess that the intervention is around the 84th percentile.)

(Which is kind of depressing. They did this massive RCT, you look at it and you’re like, “oh I guess this one intervention is probably kinda good, but also if I picked seven other interventions at random probably one of them would be better”.)

(You entertain yourself by making the plot of expected quality versus performance if the noisier RCTs had had noise 10 instead of 3.

That’s a kind of ridiculous chart, but it makes sense in light of the above. The noise-10 RCTs are totally useless — the correct amount to discount the results of a noise-10 RCT is by a factor of 101 — so your assessment of the quality of an intervention is pretty much just 0.5 times its performance times the probability it had a noise-1 RCT.)
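A minimal sketch of Steps 1 and 2 for the noise-1-versus-noise-3 case, using only the standard library; `prob_noise_one` and `expected_quality` are hypothetical names. It assumes the dream’s setup of a 50/50 prior over the two noise levels, so the prior cancels in Bayes’ rule and r is exactly the formula for r given earlier. The printed values should show expected quality dipping between performance 2.5 and 4 and then rising again.

```python
import math

def prob_noise_one(p):
    """Step 1: posterior probability that performance p came from a noise-1 RCT."""
    like_1 = math.exp(-p ** 2 / 4) / math.sqrt(2)     # N(0, 2) density, shared 1/sqrt(2*pi) dropped
    like_3 = math.exp(-p ** 2 / 20) / math.sqrt(10)   # N(0, 10) density, same constant dropped
    return like_1 / (like_1 + like_3)

def expected_quality(p):
    """Step 2: mix the two shrinkage rules (0.5 P and 0.1 P) with weight r."""
    r = prob_noise_one(p)
    return r * 0.5 * p + (1 - r) * 0.1 * p

for p in (1, 2, 2.5, 3, 4, 5, 6, 8, 10):
    print(f"P = {p:>4}: r = {prob_noise_one(p):.3f}, E[Quality] = {expected_quality(p):.3f}")
```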

***

It’s now an hour since you woke up. You’re now a little more awake and are feeling kind-of silly for taking your dream too literally. You dreamed that the distribution over the noise of the RCTs was a 50% point mass at 1 and another 50% point mass at 3, which is pretty unrealistic.

You make a more reasonable model: each RCT has an unknown amount of noise, and you decide that your prior over the amount of noise follows a log-normal distribution. So most RCTs have noise between 1 and 3, but some have more and some have less.

This is log-normal with parameters (0.7, 0.7), meaning that log(noise) is normally distributed with mean 0.7 and standard deviation 0.7.

You use the same procedure as before, using Bayes’ rule to compute a posterior distribution over the noise of each RCT (i.e. posterior to updating on the RCT result), and then forming an all-things-considered expectation about the quality of each intervention. As before, you plot this all-things-considered quality estimate as a function of performance:

Wow — the graph goes down (as before), but now it doesn’t ever go back up.

This makes sense: whereas before you were assuming that no RCT could have a noise larger than 3, now seeing a ridiculously large performance number will just make you think that the RCT had a ridiculous amount of noise, and you’ll just dismiss the result. When you see a result that looks too good to be true, it probably is too good to be true.

The most convincing performance number you could see is about 2.8. If you see that number, you guess that the intervention’s quality is 0.57 — about 72nd percentile. This means that no possible RCT result number can convince you that an intervention is in the top quartile. If you try four interventions at random, one of them will probably be better than the intervention that looks best to you after looking at all of the RCT results.2
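A rough numerical version of the log-normal model, assuming NumPy: put a fine grid over log(noise), weight each grid point by the log-normal prior (log-noise mean 0.7, standard deviation 0.7) times the likelihood of the observed performance, and average the per-noise estimate P / (1 + noise²). The grid bounds and resolution are arbitrary choices, so the peak it reports should land near the figures above but need not match them exactly.

```python
import numpy as np

MU_LOG, SD_LOG = 0.7, 0.7                    # log(noise) ~ N(0.7, 0.7^2), as in the post
LOG_NOISE = np.linspace(-4.0, 6.0, 4001)     # grid over log(noise); bounds are arbitrary
NOISE = np.exp(LOG_NOISE)

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def expected_quality(p):
    """E[Quality | Performance = p], with quality ~ N(0, 1) and a log-normal prior on noise."""
    var_p = 1.0 + NOISE ** 2                          # Var(P | noise)
    prior = normal_pdf(LOG_NOISE, MU_LOG, SD_LOG)     # prior density over log(noise)
    likelihood = normal_pdf(p, 0.0, np.sqrt(var_p))   # P | noise ~ N(0, 1 + noise^2)
    weights = prior * likelihood
    weights /= weights.sum()                          # grid is uniform, so spacing cancels
    return float(np.sum(weights * p / var_p))         # E[Q | P, noise] = P / (1 + noise^2)

ps = np.linspace(0.0, 10.0, 1001)
eq = np.array([expected_quality(p) for p in ps])
best = ps[np.argmax(eq)]
print(f"most convincing performance ≈ {best:.2f}, expected quality there ≈ {eq.max():.2f}")
for p in (3.0, 10.0, 30.0):
    print(f"P = {p:>4}: E[Quality] ≈ {expected_quality(p):.3f}")
```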

I’ll end this post with a few takeaways, and a few questions to ponder. Here are my takeaways, roughly in order of importance:

When you encounter a study, always ask yourself how much you believe their results. In Bayesian terms, this means thinking about the correct amount for the study to update you away from your priors. For a noisy study, the answer may well be “pretty much not at all”!
1. You should interpret the words “encounter a study” very broadly. Informal experimental results — such as noticing that over the past month you’ve felt better on days when you ate broccoli — count as encountering a study, for this purpose.
Working hard to reduce the amount of noise in your measurements is super important for getting useful results — certainly more important than I would have naïvely guessed. Similarly, paying attention to how noisy a study is — including but not limited to its sample size — is super important and probably underrated.
If there’s only been one attempt to estimate the effectiveness of some intervention, you probably shouldn’t put much stock into it, unless it’s really well-done.

And here are some questions to ponder:

How robust are the conclusions of the previous sections to alternative modeling choices?
1. Except for a brief digression in Part III, I assumed that the prior over the quality of an intervention is independent of the amount of noise in the intervention’s RCT. In practice, it’s reasonable to expect them to be dependent — and in particular, for interventions whose quality you’re most uncertain about on priors to also be the interventions whose quality is the most difficult to measure precisely. What happens if you take this into account?
2. Effective altruists argue that intervention quality is not normally distributed — that it has much fatter tails than that. Likewise, measurement noise likely follows a distribution with fatter tails than a log-normal distribution. What happens if you modify the distributions of both quality and noise to reflect this belief?
There is a longstanding debate in the effective altruist community between allocating resources toward super well-evidenced interventions (e.g. insecticidal malaria nets) and allocating resources toward super speculative interventions with a potentially huge upside (e.g. funding a researcher to work on some strategy for aligning AI that has some small chance of working but might also inadvertently advance AI capabilities). Those advocating for more speculative interventions point to calculations suggesting that the expected value of their interventions is extremely large. What implications, if any, does the question “How much do you believe your results?” have for this debate?
In this post I’ve talked about noisy, unbiased measurements of an underlying truth: whatever the true quality of an intervention is, your measurement process will stochastically produce a measurement whose expected value is equal to the true quality. You can instead consider noiseless, partial measurements — ones that only consider some of the effects of an intervention, without considering others. (For the unmeasured effects you just stick with your priors.) Such measurements are “unbiased” in a different, more Bayesian sense: whatever your measurement is, your best guess for the quality of an intervention is equal to your measurement.
1. Is it possible for a measurement to be unbiased in both senses?
2. Are real-world measurements more like the first kind of unbiased or the second kind, or are they both noisy and partial, or does it depend?
3. To what extent do the lessons of this post generalize to partial measurements?

I hope to write about some of these questions soon!

1 Under our assumptions about how the data was generated.

2 Interestingly, this model seems to imply that if people were rational, experimenters would hope that their RCT results would turn up kind of good but not very good.

http://ericneyman.wordpress.com/?p=2220


Grading my 2021 predictions

Eric Neyman Apr 17, 2022

In December 2020, I made 100 probabilistic predictions for 2021. As promised, I’ve come back to evaluate them on two criteria: calibration and personal optimism/pessimism. I also challenged readers to compete with me. More on this later, but first, here are my predictions, color-coded black if they happened and red if they didn’t.


I. US Politics

Jon Ossoff wins his election: 45%
Raphael Warnock wins his election: 60%
Ossoff and Warnock both win their elections: 42%
Democrats hold the Virginia State House: 61%
Andrew Yang is elected mayor of New York: 24%
The average Democratic overperformance in margin in congressional and state legislative elections, as calculated by FiveThirtyEight (see e.g. here), is at least 5%: 21%
…at least 0%: 38% (this resolved to -0.3%, so very close)
…at least -5%: 66%
Major* legislation not directly related to COVID (excluding international agreements) passes: 45%
Major* infrastructure legislation passes: 18%
Joe Biden signs an executive order authorizing a major cancellation of student debt: 59%
Biden is the president of the United States at the end of 2021: 94%
Donald Trump receives a presidential pardon (possibly from himself): 35%
Hunter Biden is charged with a crime: 15%
Donald Trump is charged with a crime: 28%
At least one member of the Senate stops caucusing with the party they are currently caucusing with: 23%
Donald Trump has a TV show or network at some time in 2021: 21%

* For legislation to be considered major, a substantial amount of effort/political capital needs to be spent on it. Major legislation passes on average once every 2-3 years. Examples include the 2009 stimulus bill, Obamacare, and Trump’s 2017 tax law.

II. COVID

I receive my first dose of a COVID vaccine by the end of March: 12%
…the end of April: 34%
…the end of May: 60%
…the end of June: 74%
…the end of July: 82%
…the end of August: 87%
…the end of 2021: 97%
At least 50% of people living in the U.S. receive at least one COVID vaccine dose by the end of 2021: 75%
At least 60%: 58%
At least 70%: 43%
At least 80%: 20%
At least 90%: 4%
Per official statistics, at least 100 thousand Americans die of COVID in 2021: 82%
…at least 200 thousand: 64%
…at least 500 thousand: 25% (this one was very close to being true)
…at least 1 million: 8%
I or one of the seven people I share 25% of my genes with tests positive for COVID: 30%
I test positive for COVID: 5%
I go to my office at Columbia at least once by the end of May: 32%
EC is held at least partially in Budapest: 36%
Canada/USA Mathcamp is held in person (I have no inside information on this): 25%
SPARC is held in person (I have no inside information on this): 38%

III. Miscellaneous

China is involved in an international (counting Taiwan and Hong Kong) conflict that has 1,000 casualties: 9%
A normalization of relationships between Israel and at least one majority-Muslim country is initiated in 2021 during the Biden administration: 48%
Putin is the president of Russia at the end of 2021: 88%
Benjamin Netanyahu is the Prime Minister of Israel at the end of 2021: 57%
Scott Alexander starts publishing again: 85%
Taylor Swift releases her tenth studio album: 65% (45b, won’t be graded — Taylor Swift releases her eleventh studio album: 13%) (ambiguous, since she released re-recordings with new songs, but my designated ambiguous prediction resolver resolved this one positively)
“Folklore” wins a Grammy for Album of the Year: 61%
The third book in the Kingkiller Chronicle has a publication date set by the end of 2021 (the date doesn’t have to be in 2021): 13%
Roger Federer wins a grand slam tournament in 2021: 26%
Someone besides Djokovic, Nadal, and Federer wins a men’s singles grand slam tournament in 2021: 59%
Serena Williams wins a grand slam tournament in 2021: 32%
All women’s singles grand slam tournaments in 2021 are won by different people: 68%
P vs. NP is widely considered resolved by the end of 2021: 1%
A (non-trivial) update on GPT-3 is released: 62%

IV. Personal

A. Academic

I summarize for my blog, or review for a journal, at least 20 papers: 75% (oops)
…at least 30 papers: 65%
…at least 40 papers: 48%
…at least 50 papers: 20%
I attend EC (counts if I go to at least five talks): 74%
The paper I’m writing on aggregating predictions is accepted to a conference or journal: 73%
I write and submit a paper on prediction aggregation and online learning (this is a different one from the one in the previous prediction): 77% (this happened in February 2022, which was too late)
…and that paper is accepted: 47%
I resolve the “preventing arbitrage from collusion” scoring rules problem: 40%
My scoring rules paper from a while ago finally gets accepted somewhere: 55%
I publish, or begin writing with the intention to publish, a paper following up directly on “No-Regret and Incentive Compatible Online Learning”: 45%
I publish a computer science paper in a conference held in 2021 or a journal edition issued in 2021: 85%
I publish a paper or note on Zipf’s law: 28%

B. Blog

I write 10 or more blog posts in 2021: 92%
I write 20 or more blog posts in 2021: 78%
I write 30 or more blog posts in 2021: 50%
I write 50 or more blog posts in 2021: 9%
The total number of views of my blog in 2021 is at least 5,000: 95%
The total number of views of my blog in 2021 is at least 10,000: 83%
The total number of views of my blog in 2021 is at least 20,000: 64%
The total number of views of my blog in 2021 is at least 50,000: 27%
The total number of views of my blog in 2021 is at least 100,000: 11%
(Intentionally vague to avoid spoilers) I write a blog post about big aliens: 42%
I publish a blog post on setting the right price: 36%
I publish a blog post about Pi: 33%
I publish a blog post about slowly converging series: 40%
I publish a blog post on Zipf’s law: 80%
I publish a blog post on Bayesian injustice: 25%

C. Other

I vote in the Democratic primary of the New York mayoral election: 93%
I rank Andrew Yang first in the Democratic primary of the New York mayoral election: 51% (I ranked Kathryn Garcia first)
I stick to my virtue points system, or some variation, through the end of 2021: 70%
I’m a SPARC staff member in 2021: 31%
I’m a Mathcamp mentor in 2021: 55%
I (co-)run some OBNYC (NYC rationalist) meetup in 2021: 47%
I consider myself a vegetarian at the end of 2021: 29%
I consider myself a vegan at the end of 2021: 4%
I make a donation of at least $500 to a third world poverty charity in 2021: 66%
I make a donation of at least $500 to existential risk/long-term in 2021: 79% (I donated in January 2022, so technically this didn’t happen)
I make a donation of at least $500 to animal welfare in 2021: 23%
I have a tentative plan to take a gap year (or I take a gap year): 24% (an exciting development that I’ll hopefully be writing about!)
I play squash on at least 10 days in 2021: 65%
I play squash on at least 20 days in 2021: 44%
I visit a country that is not Hungary: 23% (Iceland)
I publish a non-academic piece of writing in some publication in 2021: 33%
I read a book in 2021: 65% (The Scout Mindset)
I read at least two books in 2021: 44%
I read at least three books in 2021: 30%

Calibration

The orange line represents perfect calibration. The blue points represent how I did: for example, among predictions in the 60% to 70% bucket, 74% actually happened. The error bars represent a 95% interval around perfection given my sample size: that is, if I’m perfectly calibrated, each blue dot should be within the corresponding error bar 95% of the time.

So, this looks pretty good! Maybe this is slight evidence for underconfidence? But these results are definitely consistent with perfect calibration.
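The post doesn't say exactly how the buckets and error bars were computed. Here is a minimal sketch of one plausible approach, assuming each prediction is stored as an integer percentage plus a yes/no outcome. It groups predictions into 10-point buckets. For each bucket it computes the exact distribution of the number of hits under perfect calibration, and then reads off the 2.5% and 97.5% quantiles. That distribution is a Poisson binomial distribution, because the predictions in a bucket have different probabilities. The function names are hypothetical.

```python
from collections import defaultdict


def poisson_binomial_pmf(probs):
    """Distribution of the number of events that happen, assuming event i
    happens independently with probability probs[i]."""
    pmf = [1.0]  # before any predictions: zero hits with certainty
    for p in probs:
        new = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            new[k] += mass * (1 - p)  # this event doesn't happen
            new[k + 1] += mass * p    # this event happens
        pmf = new
    return pmf


def quantile(pmf, q):
    """Smallest hit count k with P(hits <= k) >= q."""
    total = 0.0
    for k, mass in enumerate(pmf):
        total += mass
        if total >= q:
            return k
    return len(pmf) - 1


def calibration_table(predictions, level=0.95):
    """predictions: list of (percent, happened), e.g. (74, True).
    Returns {bucket_start: (n, observed_fraction, low, high)}, where the
    observed fraction lands in [low, high] at least about `level` of the
    time if every prediction in the bucket is perfectly calibrated."""
    buckets = defaultdict(list)
    for percent, happened in predictions:
        # 10-point buckets: 60-69 -> 60; a 100% prediction joins the 90 bucket
        buckets[min(percent // 10, 9) * 10].append((percent, happened))
    tail = (1 - level) / 2
    table = {}
    for start, items in sorted(buckets.items()):
        n = len(items)
        pmf = poisson_binomial_pmf([pct / 100 for pct, _ in items])
        hits = sum(1 for _, happened in items if happened)
        table[start] = (n, hits / n,
                        quantile(pmf, tail) / n, quantile(pmf, 1 - tail) / n)
    return table
```

For example, `calibration_table([(65, True), (62, True), (68, False)])` puts all three predictions in the 60 bucket with an observed fraction of 2/3. Using the exact distribution instead of a normal approximation matters for small buckets. With only a handful of predictions, a normal approximation can produce bands that extend outside [0, 1].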

Personal optimism

For each of my predictions related to personal achievement, I assigned a (private) “importance” score. For example, “my prediction aggregation paper gets published” (#59) had an importance of 5, whereas “I attend EC” (#58) had an importance of 1. My total score for the year would then be the sum of the importance scores of all predictions that came true. If my predictions were calibrated in terms of optimism (neither optimistic nor pessimistic about what I’d accomplish), the expected value of my score would be 49.6.
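The expected score described above is a weighted sum: each prediction contributes its importance times its probability. By linearity of expectation, this holds even though many of the predictions are correlated. Below is a minimal sketch with hypothetical function names. The importance scores were private, so only the two mentioned above are real: 5 for the aggregation paper (73%) and 1 for attending EC (74%). Any other weights you plug in are made up.

```python
def expected_score(predictions):
    """predictions: list of (probability, importance) pairs.

    Linearity of expectation: E[sum of w_i * 1{event i happens}] equals
    sum of w_i * p_i, no matter how the events are correlated."""
    return sum(prob * importance for prob, importance in predictions)


def realized_score(predictions, outcomes):
    """Sum of the importance scores of the predictions that came true.
    outcomes: list of booleans, parallel to predictions."""
    return sum(importance
               for (_, importance), happened in zip(predictions, outcomes)
               if happened)


# The two importance scores the post reveals; the rest are private.
known = [(0.73, 5),  # prediction aggregation paper accepted
         (0.74, 1)]  # attend EC
```

Here `expected_score(known)` is 5 × 0.73 + 1 × 0.74 = 4.39. Those two predictions therefore account for 4.39 of the 49.6 expected points.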

My actual score ended up being 35. This is primarily attributable to not blogging as much as I expected: I only published 12 posts (I predicted 30), and in particular I didn’t post any academic paper summaries (I predicted that I would summarize about 40 papers).

Was I incorrectly calibrated (overoptimistic) or did I get “unlucky”? Or to put it another way: if I am correctly calibrated, how surprised should I be about having only accrued 35 points? This is hard to model, because many of my predictions were correlated. But I made some assumptions, did a simulation, and found that if I were perfectly calibrated, I would get a score of 35 or less in 6% of worlds.

Modeled distribution of scores if I were perfectly calibrated in terms of optimism, compared with my actual score (35).
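The post doesn't spell out the assumptions behind the simulation. One common way to simulate correlated yes/no events while keeping each event's probability fixed is a one-factor Gaussian copula, assumed here purely for illustration. Every prediction shares a common "how the year went" shock plus its own noise. Prediction i comes true when its latent value falls below the standard normal quantile of p_i. The correlation parameter `rho` and the function names are hypothetical; they are not the author's values.

```python
import math
import random
from statistics import NormalDist


def simulate_scores(predictions, rho=0.3, n_worlds=100_000, seed=0):
    """predictions: list of (probability, importance), with 0 < probability < 1.

    Each world draws one shared shock plus independent noise per prediction.
    The latent values are standard normal, so prediction i still comes true
    with probability p_i; rho sets how strongly outcomes move together."""
    rng = random.Random(seed)
    cutoffs = [NormalDist().inv_cdf(p) for p, _ in predictions]
    shared, own = math.sqrt(rho), math.sqrt(1 - rho)
    scores = []
    for _ in range(n_worlds):
        common = rng.gauss(0.0, 1.0)  # e.g. a busy year affects everything
        score = 0
        for (_, importance), cutoff in zip(predictions, cutoffs):
            latent = shared * common + own * rng.gauss(0.0, 1.0)
            if latent < cutoff:
                score += importance
        scores.append(score)
    return scores


def fraction_at_or_below(scores, actual):
    """Share of simulated worlds scoring no more than the actual score."""
    return sum(1 for s in scores if s <= actual) / len(scores)
```

With the real (private) list of predictions and importances, `fraction_at_or_below(simulate_scores(preds), 35)` would be the analogue of the 6% figure. The answer depends on `rho`: stronger correlation widens the score distribution and puts more mass in its tails.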

It’s possible that I got unlucky, but my best guess is that I was in fact improperly calibrated. In retrospect, I really should have predicted that I’d end up blogging less than I planned to.

Mini-contest

In my predictions post, I wrote:

You can email me with your predictions for a subset of my predictions with your own prediction. I’ll judge your predictions against mine using the logarithmic scoring rule. I have the advantage of having chosen the questions, but you have the advantage of having seen my predictions. In theory this gives you the upper hand: for example, you could put down my probabilities for every event except a few where you think I clearly messed up. Consider this an exercise in second-order knowledge: figuring out how much weight to put on my probabilities despite not knowing my reasoning behind them.
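Under the logarithmic scoring rule, a forecaster who assigned probability p to an event earns ln(p) if it happens and ln(1 - p) if it doesn't. Higher (less negative) is better. Below is a minimal sketch with hypothetical function names. It assumes a contestant's margin is the sum of score differences over the questions they submitted; questions where the two forecasts match contribute nothing either way.

```python
import math


def log_score(prob_yes, happened):
    """Logarithmic score (natural log) for a forecast of prob_yes on an event.
    Higher is better; a perfect forecast scores 0."""
    return math.log(prob_yes if happened else 1 - prob_yes)


def contest_margin(entries):
    """entries: list of (their_prob, my_prob, happened) for the questions a
    contestant chose to answer. A positive result means the contestant won."""
    return sum(log_score(theirs, happened) - log_score(mine, happened)
               for theirs, mine, happened in entries)
```

For the contest's biggest disagreement (60% versus my 75% that I'd summarize at least 20 papers, which didn't happen), `contest_margin([(0.60, 0.75, False)])` is ln(0.4) - ln(0.25) = ln(1.6) ≈ 0.47 in the contestant's favor.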

Stephen Malina emailed me with five predictions that differed from mine, and he outperformed me. Our biggest disagreement was on the probability that I would summarize at least 20 academic papers: he said 60% to my 75%, and since I reviewed zero papers, his prediction was better. I’d say it was legitimately better, as opposed to him getting lucky. Congrats to Stephen!

Mike Winer got back to me with thirteen predictions. Most of our disagreement came from Mike being more pessimistic than me about Covid, both in terms of when I would get vaccinated and how many Americans would die. In fact I got vaccinated pretty early and fewer than 500,000 Americans died of Covid, so I ended up doing better. Better luck next time!

Where are my 2022 predictions?

I think there isn’t a lot of value for me in continuing to check my calibration; at this point I’m pretty confident that in general when I say 80%, I really do mean 80%. Other potential biases such as optimism are more interesting, and at some point I may make some personal predictions to further examine whether I have this bias. But I think most of the value of forecasting comes from having informative forecasts. To paraphrase George Washington, “Calibration is easy, young man. Classification is harder.”

And to gauge informativeness, I’d need to compete against other forecasters. I think Metaculus is pretty good at this; I’ve now made some forecasts there, though not many. I’m a bit more excited about prediction markets, where you can gauge how good you are based on how much money you make. I’m particularly excited about Manifold Markets, which I hope to use a lot over the coming years. Hopefully I’ll have a post about that soon!

http://ericneyman.wordpress.com/?p=2166


Introducing EA-nasir

Eric Neyman Apr 1, 2022



[Originally posted on the EA Forum]

When I ask effective altruists who their hero is, it’s always the same names. Peter Singer. Stanislav Petrov. Jonas Salk.

No one ever mentions Ea-nasir, the ancient Sumerian coppersmith and businessman. Which is a shame, really. I guess it makes sense. Most people haven’t heard of him, and those who have only know Nanni’s side of the story:

Sorry, I suppose some of you aren’t literate in Akkadian cuneiform. Here’s a translation.

When you came, you said to me as follows : “I will give Gimil-Sin (when he comes) fine quality copper ingots.” You left then but you did not do what you promised me. You put ingots which were not good before my messenger (Sit-Sin)…
What do you take me for, that you treat somebody like me with such contempt?… You have treated me with contempt by sending them back to me empty-handed several times, and that through enemy territory… You alone treat my messenger with contempt!…
Take cognizance that (from now on) I will not accept here any copper from you that is not of fine quality. I shall (from now on) select and take the ingots individually in my own yard, and I shall exercise against you my right of rejection because you have treated me with contempt.
Nanni, to Ea-nasir

Ea-nasir kept this tablet — and other complaints like it — proudly displayed in his home. I suppose I shouldn’t blame people for concluding that he defrauded his customers with glee.

It’s time to set the record straight. Nanni didn’t say it in his message, but he would have used the copper ingots to make swords to use against his enemies. It was a pointless local skirmish, and Ea-nasir wasn’t willing to be complicit in Nanni’s murders.

These complaints that he displayed in his home — each of them was a precious token of a life saved.

How do I know all this? You see, I am part of a clandestine organization dedicated to the life and work of Ea-nasir. We’ve gone by different names throughout the years. The Ea-nasir Force. Ea-nasir’s Dog Will Open This One (don’t ask, some sort of Sumerian pun). Liberté, Ea-nasirté, Fraternité. Until recently, Friends of Ea-nasir. We’ve existed ever since his time, our causes and methods changing as the world changed around us. The only common thread — besides our desire to do good — is our distinctive approach: the fine art of shoddy craftsmanship.

For nearly four thousand years, we’ve been pulling at the strings of history, attempting to better the world. We haven’t always been successful, but we’ve never stopped trying:

The Leaning Tower of Pisa? We designed the plans for it, anticipating that it might aid in physics experiments that could lead to the discovery of what we now call Newton’s laws of motion.
We designed the winter apparel for Napoleon’s army to have tin buttons, anticipating that the buttons would deteriorate in the cold when the army tried to conquer Russia. This led to Napoleon’s eventual defeat.
During World War II we infiltrated a Czech munitions factory that was making weapons for Nazi Germany. We proceeded to make bombs as usual, with just one minor modification: we left out the explosive charges. This ended up saving hundreds of lives.
The Enigma was broken in part because it never mapped a character to itself. We introduced that flaw into the design.
To mitigate the threat of nuclear annihilation, we made sure that American warplanes had flawed circuitry, so that they couldn’t make it all the way to Russia. Unfortunately this backfired in 1958, when a B-47 malfunctioned and dropped a nuclear weapon over Mars Bluff, South Carolina. Fortunately there was no nuclear explosion: it was one of the few bombs whose fissile core we had managed to steal just a few years earlier.
To mitigate the potential for further disaster, we aggressively introduced flaws into most of the American nuclear stockpile. This saved the town of Goldsboro, this time in North Carolina, when a similar incident occurred a few years later.
You can thank us for the fact that the Watergate break-in was discovered. We made sure the wiretaps were defective, and although Nixon’s people ended up realizing this, they got caught replacing them.
In the 1970s we became concerned with the threat of climate change. We infiltrated your local DMV and made it as bureaucratic and inefficient as possible, in an attempt to reduce car use. The studies we did on whether this worked came back inconclusive. Sorry, our bad.
Using extremely sophisticated techniques, we were able to predict that the 2000 U.S. presidential election between Al Gore and George Bush would come down to a few hundred votes in Florida. We did a cost-benefit analysis which found that Gore was the better candidate, so we stacked the odds a little bit in his favor by making the Florida punch card ballots a bit thicker near Bush’s name. Unfortunately the order of the candidates was swapped at the last minute so we ended up disadvantaging Gore. That’s on us.

We’ve always tried to figure out how to do good as effectively as possible. So when we discovered effective altruism a few months ago, it felt like finding a soulmate. We read all the 80,000 Hours articles and decided to rebrand around this wonderful philosophy, even changing our name to EA-nasir.

We have since developed a number of plans and started executing on some of them. I’m excited to share these with the broader EA community and hope to receive some useful feedback in the comments!

1. Treacherous turn

In a treacherous turn scenario, a superintelligent AI acts normally while gathering power behind the scenes, until we are powerless to stop it. It then wipes out humanity, creates nanobot factories, and turns the universe into paperclips. The AI safety community is racing to figure out how to prevent this. Unfortunately they might be running out of time, and we might need to buy them more.

This is where EA-nasir steps in. Over the coming years, we intend to place EA-aligned personnel inside the world’s largest semiconductor manufacturers. Our workers will do their jobs admirably, working hard to improve chip designs, earning trust, rising through the ranks. Then — once the singularity is near — they will use their influence to botch the chip designs: make them slow, unreliable, and easily breakable. Chip shortages unlike anything seen before will bring the AI industry to a screeching halt.

At EA-nasir, we are the treacherous turn.

2. Gene drives for wild animals

Defective mosquitoes are among the most exciting technological developments of the last few years. In short, scientists were able to create mosquitoes with a gene that (1) spreads to all offspring (rather than the usual one-half), and (2) renders female mosquitoes infertile. If these mosquitoes were released into the wild, the gene would spread exponentially and the mosquito population would collapse. This technique — called the gene drive — could potentially eradicate malaria.

*Taken from here. Red mosquitoes are the defective ones.*

While admirable, the scientists and advocates behind mosquito gene drives are not ambitious enough. Wild animals have lives that are full of suffering: net negative, not worth living. At EA-nasir, we are working to use this same gene drive technology to create defective wild animals of all species. Once these genes spread throughout the population, wild animal suffering will be over once and for all.

3. Good for the greatest number

At EA-nasir, we believe in bringing the greatest good to the greatest number. While global health charities have done an admirable job on the “greatest good” front, the “greatest number” part has been sorely neglected, with birth rates plummeting across the world. Governments have tried implementing pronatalist policies, with little or no success. We need a different approach.

This is just the job for us. At EA-nasir, we have begun the process to manufacture millions of ineffective contraceptive aids: permeable condoms, placebo Plan B pills, you name it. We have also initiated talks with government officials in pronatalist countries such as Hungary and China to distribute our contraceptives as widely as possible.

4. Operations mismanagement in low-impact organizations

80,000 Hours ranks operations management in high-impact organizations as the 6th most effective career path. Unfortunately, many altruists who could make the world a much better place are stuck working in low-impact organizations.

EA-nasir is training operations mismanagers, whom we will be placing in various ineffective organizations. We are using the Simple Sabotage Field Manual, originally written by the Office of Strategic Services (the CIA’s predecessor) as a guide to undermining Nazi Germany’s operations in occupied territory. A small sample of its myriad invaluable tips:

* When possible, refer all matters to committees, for “further study and consideration.” Attempt to make the committee as large as possible — never less than five.
* Haggle over precise wordings of communications, minutes, resolutions.
* Refer back to matters decided upon at the last meeting and attempt to re-open the question of the advisability of that decision.
* Advocate “caution.” Be “reasonable” and urge your fellow-conferees to be “reasonable” and avoid haste which might result in embarrassments or difficulties later on.
* In making work assignments, always sign out the unimportant jobs first. See that important jobs are assigned to inefficient workers.
– Simple Sabotage Field Manual, Office of Strategic Services, 1944

By making work at ineffective organizations as intolerable as possible, EA-nasir aims to cause the most effective workers to quit their jobs in favor of more effective organizations.

You might ask: why are we breaking our silence? The truth is, for recruitment purposes we’ve been trying to break our silence for a while now! At first we thought the world would pick up on our existence: how can so much be broken in a world where so few are evil? Instead of taking this observation to its natural conclusion, you all decided to name your blogs after it. Then we tried distributing leaflets, but unfortunately our printers kept breaking.

This post, though, will leave no doubt. In fact, we were extra careful, scheduling this post for March 31st so as to avoid April Fools’ Day.

*Thanks to Lizka Vaintrob for the logo, and to Ea-nasir for supplying the copper filament.*

Thanks to Emily Ryu, Rachel Wonnacott, Lizka Vaintrob, Mattie Y, and (especially) Mike Winer for some of the ideas that made it into this post.

http://ericneyman.wordpress.com/?p=2085
