PosGeo — GeistHaus

Using NeuralOperator for Geophysical Surrogate Modeling

posgeo Feb 13, 2026 Updated Feb 13, 2026

This post is part of GP³, an ongoing series on Gaussian Processes and their applications in Geology and Geophysics. The full notebook and source code are available for you to explore on GitHub! A few months ago, the PyTorch blog announced that NeuralOperator has now joined the PyTorch ecosystem! Neural operators are one instance of […]

Show full content

This post is part of GP³, an ongoing series on Gaussian Processes and their applications in Geology and Geophysics. The full notebook and source code are available for you to explore on GitHub!

A few months ago, the PyTorch blog announced that NeuralOperator has now joined the PyTorch ecosystem! Neural operators are one instance of a broader concept: surrogate simulation (also called surrogate modeling or emulation). Many problems in science and engineering involve a computationally expensive simulation. In geophysics, this might be a seismic wave propagation code, or an expensive 3D time-domain electromagnetic simulation. These simulators take a model of the subsurface (density, susceptibility, velocity, etc.) and compute what you would observe with your surface instrumentation. They are grounded in physics and they are accurate, but they are often slow without simplifications. A surrogate is a fast approximation of that simulator, with potentially advantageous properties for downstream applications, like differentiability. You generate a dataset of input-output pairs by running the expensive simulator many times, then, in an offline setting, you train a machine learning model to reproduce the same mapping. Once trained, the surrogate can replace the simulator in online applications that require many evaluations: optimization, uncertainty quantification, survey design or planning, or real-time prediction to name a few.

An Illustrative Example

To show a concrete example, I’ve built a small demonstration project that trains an FNO surrogate for the forward operator in aeromagnetic surveying. Here I’ll walk through the key ideas.

I’m showing an example of simulating total magnetic intensity from an aeromagnetic survey. Here we have to suspend our disbelief a little bit… I know that these simulations are linear and super fast, so making a surrogate model doesn’t exactly make sense. But let’s pretend for now that instead the simulation is super time or resource intensive.

The project has three parts:

1. Generate 3D magnetic susceptibility models using a Gaussian process prior (GPyTorch)

2. Train an FNO surrogate to map susceptibility to magnetic anomaly (SimPEG + NeuralOperator)

3. Evaluate the trained FNO on arbitrary survey geometries without retraining

Part 1: Generating Geologic Models

To train the FNO, we need many examples of the forward relationship: pairs of (true susceptibility model, observed magnetic anomaly). For the susceptibility models, I used a Gaussian process prior, contining a theme from my earlier posts on GPs in geoscience. Here, the GP is used purely as a generative model, producing smooth, spatially correlated 3D fields that look plausibly geological. There is no conditioning on data, no inference. The GP is a prior in the literal sense: it encodes our beliefs about what susceptibility distributions look like before seeing any measurements. I did some fun randomized transformations to make the field as expressive as I needed it to be. Each realization has a different orientation of spatial correlation, producing a diverse set of geologically plausible models. (I might write about this modeling in a upcoming blog post!)

I’m sampling these models on a modest 24x24x6 grid spanning a 100m x 100m x 20m domain. Small enough to generate hundreds of training samples on a laptop, large enough to capture interesting spatial structure. At this scale, I can reasonably fit the dataset into memory, as well as avoid discretization artefacts when simulating the magnetic response.

Part 2: Offline Training the FNO Surrogate

To generate the training data, first I need to simulate the “true” response of the geology models using a high-fidelity physics simulator. Specifically, I’m calculated the Total Magnetic Intensity (TMI) field resulting from the spatial configurations of subsurface magnetic susceptability. To do this, I used the great geophysics simulation package SimPEG, which provides convienient constructors and physics solvers for simulation setups. I generated a dataset of 3000 pairs of (geology model, magnetic field), with observation locations in a 24×24 grid 15m altitude above the model domain.

Once I have the dataset, setting up the FNO is incredibly easy thanks to the simple interface provided by the NeuralOperator team:

fno = FNO(
    n_modes=(12, 12, 4),   # Fourier modes for 3D input
    hidden_channels=32,
    in_channels=1,
    out_channels=1,
    n_layers=4,
)

At a modest 1.8 million parameters, this model is a fine size for a admittedly simple forward operator. The input is a 3D susceptibility field with shape (1, 24, 24, 6) and the output is a 3D field of the same shape. We take the top z-slice of the output as the predicted 2D magnetic anomaly at the observation surface. For training, I used a relative L2 loss (the L2 norm of the residual divided by the L2 norm of the target). This is a toy-scale dataset, but it is enough to demonstrate the concept. After X epochs, the FNO learns a reasonable approximation of the smooth, elliptic forward operator. The loss curves show the model converging, and visual comparison of predicted vs. true anomaly maps shows that the FNO captures the main spatial structure of the magnetic response.

Part 3: Online Discretization-Free Evaluation

Now here’s the payoff! The FNO was trained on a fixed observation grid. But imagine now we want to evaluate it on completely different survey geometries: irregular polygon-shaped survey regions scattered across the domain, with observation points sampled on a different grid inside those polygons. The application here might be you’re optimizing a survey design for a new project, and you want to know what geometry gives you the best value for its cost. If we were conducting true simulations for this search, we’d have to do expensive online evaluations for each geometry, which is infeasible. This is the key distinguisher from traditional ML methods! Even though a simpler ML approach may be able to offload learning a fast forward model in an offline step, it would not generalize to new survey geometries. You’d have to retrain the model on each new geometry, which is infeasible.

For our trained FNO model, what we do instead is: (1) generate random polygon survey regions, (2) evaluate the FNO on its native grid (a single forward pass, effectively instant), (3) interpolate the FNO output to the new observation locations within the survey polygons. The last step is the critical one. Because the FNO has learned a continuous operator, its output on the native grid is a smooth field that can be meaningfully interpolated to arbitrary points. This is conceptually different from, say, interpolating the output of a standard CNN, because the FNO’s spectral representation in Fourier space naturally encodes smooth, continuous structure.

What This Enables

The immediate application is flexible survey design. If you have a trained FNO surrogate for your forward operator, you can cheaply evaluate the predicted field for any candidate survey geometry. This opens the door to:

Monte Carlo methods at scale. Uncertainty quantification, Bayesian inversion, and sensitivity analysis all require running the forward model many times, often thousands or tens of thousands of times. This is prohibitively expensive with a full physics simulator. With a surrogate, it becomes tractable. You can push an ensemble of susceptibility models through the FNO and get a distribution of predicted observations in seconds.

Differentiability for free. Because the FNO is a PyTorch module, it is automatically differentiable. You get gradients of the predicted field with respect to the input susceptibility model, which enables gradient-based optimization and inversion without having to derive or implement adjoint equations.

Resolution flexibility. The discretization invariance of the FNO means you can train on a coarse grid (fast to generate training data) and evaluate on a finer grid at inference time. You are not locked into the resolution of your training data.

Composability with the PyTorch ecosystem. The FNO is just a torch.nn.Module. It plugs into existing PyTorch workflows: optimizers, learning rate schedulers, data loaders, distributed training, mixed precision, etc. without custom infrastructure.

More broadly, this project illustrates a pattern that applies across geophysics: train a neural operator on the expensive physics once, then use it as a drop-in replacement for the forward simulator in downstream tasks that require many evaluations. The NeuralOperator library makes this accessible with a clean PyTorch API!

Conclusion

I hope that this blog post has shown that Fourier Neural Operators have some cool potential applications in geophysical forward modeling! The forward operator I showed here, aeromagnetic surveying maps a 3D field (susceptibility) to a 2D field (magnetic anomaly), is simplistic, but one can imagine FNOs being applied to more computationally expensive forward operators, such as electromagnetic or seismic wave propagation. Once trained, the FNO provides a fast, differentiable, and discretization-free surrogate that can be evaluated on any survey geometry.

I think that the combination of GPyTorch for model generation, SimPEG for forward physics, and NeuralOperator for the surrogate is a powerful and composable toolkit. Each library does one thing well, and they fit together cleanly in a PyTorch-native workflow.

The full notebook and source code are available on GitHub.

http://posgeo.wordpress.com/?p=3076

Extensions

AI in Mineral Exploration: 2025 in Review

posgeo Jan 12, 2026 Updated Jan 14, 2026

2025 was a blockbuster year for both AI and critical minerals! Pretty much every other week either we got news of a new discovery or governmental initiative, or an AI company releasing a new model or product. To break through the noise and try to understand the trends of the year, I thought it would […]

Show full content

2025 was a blockbuster year for both AI and critical minerals! Pretty much every other week either we got news of a new discovery or governmental initiative, or an AI company releasing a new model or product. To break through the noise and try to understand the trends of the year, I thought it would be fun to write a “Year in Summary” of AI in geoscience (with an exclusive focus on mining and subsurface inference), following the pattern set by some excellent yearly summaries I’ve read over the last few days: 2025: The year in LLMs by Simon Willison, and Databases in 2025: A Year in Review by Andy Pavlo.

Disclaimer: The contents of this blog post are headlines pulled from company press-releases, news agencies, and aggregators, woven with my personal takes. Opinions displayed here represent my own work, and not any opinions of my employer.

Let’s get into it!

Funding

In the funding world, KoBold kicked off the year by raising a whopping $537 million in a Series C round (post-money valuation of ~$3B). They claim that funding will be put towards global exploration, R&D, and advancing their Mingomba site towards production. VerAI followed the next month with a $24 million Series B round, promising to accelerate their 60+ mineral projects. Meanwhile, GeologicAI closed a $44 million Series B round, aiming to deliver “high-resolution decision engineering” on drill core, integrating cutting-edge core scanners and proprietary AI to deliver “a new standard in high-resolution data and decision-making” on site. By December, they had also acquired Lumo Analytics, makers of a compact Laser-Induced Breakdown Spectroscopy (LIBS) rock-scanner, adding fast light-element and REE analysis to their sensor suite.

These funding raises are impressive, and give me great confidence for the future of the concept of AI in exploration, both for vertical business models as well as service providers! It’s particularly impressive to see that KoBold can continue to bring in high-value investors like T. Rowe Price, demonstrating both the value of AI and that exploration is open to this class of investors. KoBold claims that AI let them find something that traditional methods missed, but I’m interested to see what will happen next: can they use AI to accelerate development? GeologicAI’s tech stack sounds super cool! Fast access to data should be able to accelerate decision-making, and I’m eager to see them put it into practice.

AI Products

Big AI companies dropped a string of blockbuster products in 2025: reasoning LLM models like GPT5; professional-grade generative image models like Nano Banana; and revolutionary agentic coding frameworks like Claude Code. On the other hand, the AI tech stacks being built for mineral discovery feel like a very different “flavor” of AI. Companies like KoBold, VerAI, and Vrify are mainly solving inverse problems: inferring the unknown subsurface given sparse signals. In press releases they largely reference feature extraction/engineering, pattern recognition, and other pragmatic machine-learning approaches, rather than LLMs and image models. (That being said, the sensor-first, data-fusion pipelines of GeologicAI stand apart from the crowd here.)

That’s not to say that LLMs and generative image models are irrelevant to mining. In December, VRIFY released their 2025 Mineral Exploration Tech Report, claiming that 56% of people working in exploration are using AI tools in their day-to-day work (based on a survey of 135 industry professionals). But it’s one thing to have mineral exploration professionals, say, using ChatGPT to offload tedious tasks like summarizing vendor reports. It’s quite another thing to use or create novel, bespoke AI tools and products. I suspect that many C-suite leaders at these companies are being required to have “AI initiatives” that are (hopefully) more in line with the latter. Exactly how they accomplish this is not immediately apparent: exploration decision-makers demand resource quantifications and traceable error metrics, not generative hallucinations of questionable provenance. Definitely an area to keep an eye on in 2026!

Research Corner

In the academic world, there were quite a few publications this year that stood out to me:

Extracting data from maps: Lessons learned from the artificial intelligence for critical mineral assessment competition. This paper summarizes a machine learning competition (run in a joint project between USGS, DARPA, JPL, and MITRE) to investigate data extraction from historic geologic maps. Despite this challenging task, teams showcased some really promising AI-driven workflows!
Synthetic Geology: Structural Geology Meets Deep Learning. This paper, posted on arXiv, appears to still be in review at JGR: Machine Learning and Computation, and shows some amazing progress with generative modeling of 3D subsurface structure!
Checking the consistency of 3D geological models. I really enjoyed seeing this paper on applications of set theory and logic to models of the sub-surface. Although “the map is not the territory,” the ideas in this paper give us the tools to tell when our subsurface models (generative or traditional) are obviously wrong.

Despite the amazing progress in the research world, I don’t think we’ve had our “Attention Is All You Need” moment in AI subsurface modeling yet. The closest parallel we have was back in 1987: Occam’s Inversion by the Constables and Bob Parker, the landmark paper on regularized inverse theory. Perhaps 2026 will house the breakthrough in research we need? Maybe it’ll be a NeRF, a Geospatial Reasoning Model, or something we haven’t even named yet!

My personal 2025

2025 was a big year of growth for me. I’m still at Terra AI, doing cool geophysics and subsurface inference work (no updates in this blog post, but hopefully I’ll be able to talk about the work in the future). This year I leaned heavily into using LLMs, both as coding agents as well as API tools, and ended up winning the “Best use of API” the official Meta Llamacon hackathon, using their Llama 4 models to turn geological documents into 3D models! I also presented at the CMU Workshop on Neural Simulation-Based Inference. It was wonderful meeting so many talented statisticians, and telling them all about the compelling problems we have in geosciences. Here’s to another year of exciting tech releases and advances in geosciences!

Cover Photo by Sebastian Pichler on Unsplash

http://posgeo.wordpress.com/?p=3065

Extensions

LLMs, Partial Evaluation, and the Three Futamura Projections

posgeo Dec 2, 2025 Updated Dec 2, 2025

Earlier last month I went to a Google Gemini “vibe coding” hackathon in SF, hosted by Cerebral Valley. Vibe coding, for the unfamiliar, is the practice of heavily using AI-assistant tools for software development, letting you focus on the high-level structure and results rather than the exact details of implementation. The theme of the hackathon […]

Show full content

Earlier last month I went to a Google Gemini “vibe coding” hackathon in SF, hosted by Cerebral Valley. Vibe coding, for the unfamiliar, is the practice of heavily using AI-assistant tools for software development, letting you focus on the high-level structure and results rather than the exact details of implementation. The theme of the hackathon was to build new tools to enable and accelerate vibe coding, leaning into the exploratory, improvisational nature of working with LLMs. People were building everything from MCP server tools to multi-modal agent-based systems, it was great!

While working on a project there, I kept noticing a structural pattern that seemed to be common in many LLM tools, services, and products. It’s the simple separation between two parts of a prompt:

The static part: the “system” prompt, the pre-defined rules, tone, instructions, domain knowledge, constraints.
The dynamic part: the “user prompt”, the actual user input, like a new question, attached data, or other input not known in advance.

This separation reminded me of a classical idea in programming languages: partial evaluation. Further investigation led me to think about the static/dynamic separation in the world of Futamura projections, a set of ideas about interpreters, compilers, and specialization. When you map those onto how LLMs are used in practice, the analogy becomes surprisingly interesting! This post is my attempt to unpack that connection!

Static vs. Dynamic

A typical LLM service has a structure something like this:

system_prompt = """
You are an experienced geochemist. 
When given a CSV of elemental abundances, extract structured information
and output JSON following this schema...
"""

def run(csv_text):
    return llm.chat([
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": csv_text}
    ])

The system prompt here is static. It defines target behavior like the instructions, persona, or things like formatting guidance and predefined heuristics. You set it (or optimize it) ahead of time. The point here is that we’ve partitioned the input objects of an LLM runtime into two objects: static and dynamic. One way of formalizing this partitioning of inputs is described on the Wikipedia page for partial evaluation:

$prog:I_{\text{static}}\times I_{\text{dynamic}}\to O.$

In classical partial evaluation, if part of a program’s input is known in advance, we could “precompute” everything that depends only on that part. We somehow take $( prog,I_{\text{static}})$ and transform it into $prog^{*}$ , with:

$prog^{*}:I_{\text{dynamic}}\to O.$

The result is a specialized program, or a residual program, that’s simpler (and maybe faster) to run when the dynamic input arrives. Now, there is a particularly interesting case of this partial evaluation when you consider $prog$ being an interpreter for a programming language. Then, the act of specializing on some source code $I_{\text{static}}$ yields a program that a version of that interpreter that acts like an executable: it only requires the runtime data $I_{\text{dynamic}}$ . It’s almost like it has been compiled!

In the previous example for partial evaluation for a programming language, we had three components: (1) the static program; (2) the dynamic data; and (3) the interpreter. Doesn’t this feel awfully similar to LLM setups? If we treat the LLM like a runtime interpreter of instructions, then:

The system prompt is analogous to a program.
The user input is analogous to runtime data.
The LLM runtime act like the interpreter.

So if we’re able to apply partial evaluation to an interpreter, yielding an executable, can we do the same to an LLM runtime? One might think that this is just some conceptual reframing of the same underlying implementation. A kind of higher-order, functional programming, “shell game” trickery, where we just shuffle around the order that functions receive their inputs. But in the interest of pre-computation we can actually make concrete changes! Even today there are approaches that enable the specialization of an LLM runtime, including techniques like prompt caching [1], or preprocessing system prompts into vector forms, a.k.a. “soft prompts” [2]. These approaches enable exactly what we want: a specialization given a fixed system prompt, so that we don’t need to reprocess the entire system + user prompt every time!

But can we go further? If one keeps reading that Wikipedia article on partial evaluation, you’ll eventually reach a section curiously titled the “Futamura projections,” of which there appear to be three [3], each one more mind-bending than the last. What will we find in our exploration between the links between partial evaluation and LLMs?

Futamura Projections and LLMs

The Futamura projections come from the world of programming languages (PL) theory, but we don’t need heavy formalism to understand them (however if you want to read more, check out the resources here [4]). Let’s begin with roles, how they behave (what they take as inputs and/or produce as outputs, if any), and draw analogies to concepts from LLMs:

PL TermInputOutputLLM AnalogyProgramNoneNoneSystem promptInputNoneNoneUser prompt OutputNoneNoneResponse InterpreterProgram + InputOutputLLM RuntimeExecutableInputOutput???CompilerProgramExecutable???

I’ve also spent quite a bit of time drawing diagrams of the forthcoming Futamura projections, so I hope you enjoy them (even if they do look a little “Pepe Silvia“)!

First Futamura Projection

With the links between PLs and LLMs drawn, I can reveal that the first Futamura projection is our earlier example of partial evaluation: specializing an interpreter for a given program (i.e., source code), yielding an executable. The resulting object (the executable) is an object that takes in only dynamic data and produces an output. Compare this with the interpreter, which takes in both the program and dynamic data, producing an output. When we use the executable to produce an output, we say “we have executed.” And likewise, when we use the first Futamura projection to produce an executable, we say “we have compiled.”

Top left: Typical LLM operations. Top right: Specializing an LLM runtime for a given system prompt. Bottom: Behavior of the specialized LLM runtime.

We saw earlier what this looks like translated to LLM world: a simplified, or residual LLM runtime that only takes in a user prompt and returns a response! We already covered practical implementations when talking about caching and soft prompting [1–2], but I think model distillation or fine-tuning could also be viewed as implementations of this idea [5].

Second Futamura Projection

Let’s step it up a gear! The second projection asks the question: “Instead of specializing the interpreter for a given program, what would happen if we specialized the specializer for a given interpreter?” This time, the resulting object is a compiler: an object that takes in only a program, and produces an executable (a thing that takes in only dynamic data and produces an output). When we use the compiler to produce an output, we say “we have compiled.” There isn’t a single past tense verb to describe the action of using the second Futamura projection to produce a compiler, but we can say “we have made a compiler.”

Top: Specializing a specializer for a given LLM runtime. Bottom left: Behavior of the return object, taking a system prompt and yielding a specialized LLM runtime. Bottom right: Behavior of the specialized LLM runtime.

What does this output object relate to in the world of LLMs? Things are getting a little abstract, but it would sort of reflect a tool or product for producing arbitrary residual LLM runtimes for arbitrary system prompts. One example could be an interface that enables you to predefine a system prompt, and get a custom LLM runtime as a service! I believe that capabilities like this already exist in frameworks such as DSPy [6].

Third Futamura Projection

Now things start to get really mind-bending. This time, we are specializing our specializer for a given specializer(!). The output object of this process doesn’t really have a widely-accepted name, but we could call it a “compiler-compiler.” It’s an object that takes in an interpreter, and produces a compiler (an object that takes in only a program, and produces an executable (a thing that takes in only dynamic data and produces an output)). At this point, we’re out of commonly used verbs to describe what’s going on. When using this object we would say “we have made a compiler,” and describing the action of using the third Futamura projection would be something like “we have made a thing that makes compilers.”

Top left: Specializing a specializer for a given specializer. Top right: Behavior of the return object, taking an LLM runtime. Bottom left: Behavior of the return object, taking a system prompt and yielding a specialized LLM runtime. Bottom right: Behavior of the specialized LLM runtime.

Does this mean anything for our LLM analogy, or have we lost the thread? Maybe not! It would sort of be like a service where you would BYO-LLM (bring-your-own-LLM), and the service would return to you a product that generates a tool, specific for your LLM, to generate residual LLM runtimes for arbitrary system prompt use-cases. This is speculative, but I wouldn’t bet against seeing it (or something like it) in 2026. I’ll definitely be leaving that one as a conceptual horizon, and a direction to think about for the future!

Zooming Out: Why This Matters

We’ve traveled a fair distance from the energy of a 2025 vibe coding SF hackathon to the abstract realm of 1970s programming language theory. But I hope this jaunt through the Futamura projections hasn’t just been an exercise in academic nostalgia. I think by mapping LLMs onto partial evaluation we get more than just a fun analogy, we get a map for where this technology might be going.

The immediate utility of this framing is a shift in mindset. When we look at the separation of system and user prompts, we stop seeing them as “magic spells” to convince a chatbot to behave. Instead, we see the system prompt for what it really is: the static component of a staged computation. This gives us a principled way to think about optimization. In the PL world, we can optimize by moving as much computation as possible from “runtime” (dynamic) to “compile time” (static). In the LLM world, I think we’re going to start seeing the exact same movement.

This hunger for the “First Projection,” the creation of a highly efficient and specialized executable, was perfectly captured in a recent thought by Simon Willison:

What's the latest research on how much baked-in knowledge an LLM needs in order to be useful?

If I want a specialist coding model can I trim the size of that model down by stripping out detailed knowledge of human history, geography etc?

Can we even do that?
— Simon Willison (@simonw) November 10, 2025

Viewed through the lens of partial evaluation, we can interpret Simon’s desire as: Can we generate the residual program? Whether we achieve this via weight-stripping, distillation, or some new form of “model compilation” remains to be seen. But I think the destination is the same: heralding a movement away from monolithic “interpreters” toward specialized, “compiled” artifacts.

Oil and gouache on paper bordered with gouache, watercolor, and ink, mounted on cardboard — Paul Klee, Static-Dynamic Graduation, 1923. Sourced and edited from Wikimedia under CC0 1.0 permissions.

Footnotes

[1] https://platform.openai.com/docs/guides/prompt-caching

[2] Chang, K., Xu, S., Wang, C., Luo, Y., Liu, X., Xiao, T., & Zhu, J. (2024). Efficient prompting methods for large language models: A survey. arXiv preprint arXiv:2404.01077.

[3] Or are there only three? Generally when computer scientists start counting they feel the urge to continue onto infinity. The following paper posits that the forth- or even higher-order Futamura projections: Glück, R. (2009, January). Is there a fourth Futamura projection?. In Proceedings of the 2009 ACM SIGPLAN workshop on Partial evaluation and program manipulation (pp. 51-60).

[4] If you want to read more about them, I recommend this blog post: “The Three Projections of Doctor Futamura”, as well as this talk at !!Con West 2020: “Compilers for nothing, executables for free”.

[5] Xu, X., Li, M., Tao, C., Shen, T., Cheng, R., Li, J., … & Zhou, T. (2024). A survey on knowledge distillation of large language models. arXiv preprint arXiv:2402.13116.

[6] https://dspy.ai/

http://posgeo.wordpress.com/?p=3000

Extensions

Can an LLM be a Black-Box Optimizer?

posgeo Oct 10, 2025 Updated Oct 10, 2025

At Terra AI, I have a great collaborator who describes himself as “optimization-brained,” and always comes up with the best strategies for optimization problems, whether he’s coming up the with best linear algebra way of performing a calculation, or he’s thinking about the best way to apply a set of constraints. This is in contrast […]

Show full content

At Terra AI, I have a great collaborator who describes himself as “optimization-brained,” and always comes up with the best strategies for optimization problems, whether he’s coming up the with best linear algebra way of performing a calculation, or he’s thinking about the best way to apply a set of constraints. This is in contrast to my experience in my PhD, where I mostly applied the classic “graduate student descent” algorithm, manually fiddling with parameters and generally trying and get an intuition for how a system works. In my defence, this was how most of my colleagues did optimization for Earth-system models. Of course, if you’re doing a well-defined problem like regularized inversion of a geophysics problem, you’d use a robust optimizer approach like Gauss-Newton or L-BFGS. But often you’re working something like a complex-system model with many interacting components, and you probably don’t have access to gradients or adjoints. An example of this could be a model of the dynamics of geologic system that couples weathering, erosion, sediment transport, and biological processes, and you’re trying to fit a model to data that has a lot of noise. In Earth sciences, we end up in this situation all the time: you have this huge, complicated simulation code (probably written in MATLAB), and you have to fiddle with some input parameters to get the outputs to match your field data. Oh and also, the simulation takes 2 hours to run or your laptop. Are you really going to wrap this in an optimizer loop? Chris Rackauckas wrote a really nice blog post about this kind of thing in The nonlinear effect of code speed on productivity; Richard Hamming also writes about this back in the 1950s in his book “The Art of Doing Science and Engineering.” Of course you could use some approach like Bayesian optimization, but we’re Earth scientist! We don’t know about sophisticated techniques like that! So instead, people typically end up taking an approach to optimization that is part “grad-student descent,” and part “intuition-guided.”

But this got me thinking: a lot of human intuition is (for better or worse) encoded in written language. What if we could take the intuition that is latent in a large language model (LLM) to guide an optimization process? I’ve recently been getting better at using LLM APIs (I won an award for it!), so this seemed like a fun project to try out! I bet someone’s already done this, but I wanted to see where I could get myself, that’s why it’s called research!

Problem setup

For the optimization setup, I decided to use the classic 2D Rosenbrock function as a first test function. The function is defined as:

$f(x) = (100 - x_0)^2 + 100 (x_1 - x_0^2)^2,$

where $x_0$ and $x_1$ are the independent two variables. The aim of the optimization is to find the global minimum of the function $f$ , which is at $x = (1, 1)$ with $f(x) = 0$ . To keep things simple, I chose to optimize the function in the range $x_0, x_1 \in [-3, 3]$ .

LLM Optimizer

How do you turn an LLM into an optimizer? Well as I’ve found in previous projects, you can just tell an LLM that it has that persona in its system prompt!

You are an optimizer proposing the next point to minimize a 2D function. 
Always return ONLY a JSON array of floats representing 
the next candidate point, e.g., [x0, x1]. No extra text.

This is combined with a user prompt that specifies the history of past sample points and function values:

Problem: Minimize a 2D function.
Bounds: [(-3.0, 3.0), (-3.0, 3.0)].
Evaluation history: [
(point=[2.5, 2.5], value=934.1)
(point=[3, 3], value=2713.16)
(point=[2, 2], value=197.052)
]
Return ONLY the next point as a JSON array of floats within bounds.

Both of these are supplied to the LLM API, and I get back a response like this:

[2.5, 1.5]

So all I needed to do now is wrap this in a loop, and make a new API call for each step in the evaluation loop! I implemented the LLM optimizer in a stateless mode, which means that the LLM is given the full history of the optimization process at each step, but without any other context. An alternative approach would be to supply each point sequentially, in a chat dialoge fashion. I didn’t end up doing this because I thought the context might get a bit messy by the later evaluation steps. I originally started using o3-mini as my reasoning model of choice, mostly due to its lower cost and higher rate limits. However, after changing my API call scheduler a little bit I was also able to effectively experiment on a bigger model: gpt-5-mini.

Evaluation

Of course, I needed to be able to judge how well this LLM optimizer is doing, so I also implemented two other optimizers: Nelder-Mead, and Bayesian optimization.
Nelder-Mead is a region-based optimization technique that moves the corners of a simplex in the direction of the best point, whereas Bayesian optimization fits a Gaussian process model to the objective function and then samples the next point to evaluate based on the model’s uncertainty.
For the evaluation metric, I’ll be judging the optimizers based on the minimum objective function they find, given a fixed budget of 20 function evaluations.
Before I show the results, what do you think will happen?
Personally, I thought that Bayesian optimization would do the best, since it’s typically pretty sample-efficient, and I thought that the LLM optimizer would do second-best, ahead of gradient descent and Nelder-Mead.

A Slight Mis-Step

Almost immediately, I found a mistake with my problem setup. The LLM would immediately guess that the minimum would either be at the origin, or at $(1, 1)$.

=== LLM (stateless) prompt ===
You are an optimizer proposing the next point to minimize a 2D function. The minimum is not always at the origin, so be careful. Make large moves when necessary, as you have few function evaluations available. Always return ONLY a JSON array of floats representing the next candidate point, e.g., [x0, x1]. No extra text.
Problem: Minimize a 2D function.
Bounds: [(-3.0, 3.0), (-3.0, 3.0)].
Evaluation history: [
(point=[1, 1], value=0)
(point=[1, 1], value=0)
(point=[1, 1], value=0)
(point=[1, 1], value=0)
(point=[1, 1], value=0)
(point=[1, 1], value=0)
(point=[1.01, 0.99], value=0.090701)
(point=[1, 1], value=0)
(point=[1, 1], value=0)
(point=[1, 1], value=0)
]
Return ONLY the next point as a JSON array of floats within bounds.
=== LLM response ===
[1.0, 1.0]

I believe the reason for this is this is where minima are typically located in commonly used optimization test functions.
I fixed this problem with three tweaks: (1) I added a small random translation to the function, (2) set the LLM optimization routine to start with a random initial point, and (3) altered the system prompt to encourage the LLM to explore the domain more.

System Prompt:
You are an optimizer proposing the next point to minimize a 2D 
function. The minimum is not always at the origin, so be careful. 
Make large moves when necessary, as you have few function evaluations available.
Think carefully about your next move, and be sure to consider the history of points and values.
Always return ONLY a JSON array of floats representing 
the next candidate point, e.g., [x0, x1]. No extra text.

Results

With that bug cleared up, I was able to run the evaluation suite for all optimization methods! Before looking at all the results, I think it’s cool to look at an example of the LLMs responses.

=== LLM (stateless) prompt ===
<system prompt not shown>
Problem: Minimize a 2D function.
Bounds: [(-3.0, 3.0), (-3.0, 3.0)].
Evaluation history: [
(point=[-1.43248, -0.969005], value=963.06)
(point=[2.5, 2.5], value=934.1)
(point=[3, 3], value=2713.16)
(point=[2, 2], value=197.052)
(point=[2.5, 1.5], value=1644.72)
(point=[2.5, 2], value=1264.41)
(point=[1.5, 1.5], value=6.34255)
(point=[1, 1], value=16.301)
(point=[1.55, 1.55], value=11.8555)
(point=[1.45, 1.45], value=2.71255)
(point=[1.4, 1.4], value=0.702304)
(point=[1.35, 1.35], value=0.063643)
(point=[1.3, 1.3], value=0.563394)
(point=[1.345, 1.345], value=0.066107)
(point=[1.36, 1.36], value=0.0930568)
(point=[1.3486, 1.3486], value=0.0631964)
(point=[1.348, 1.348], value=0.0632766)
(point=[1.37, 1.33], value=0.577212)
(point=[1.349, 1.349], value=0.0632333)
]
Return ONLY the next point as a JSON array of floats within bounds.
=== LLM response ===
[1.3485, 1.3485]
    Completed with 20 evaluations; final best: 0.0631964

I found this so cool to see! You can sort of make out a thinking process here, where the LLM first explores wide, and then gets more and more precise.
When running experiments, I’d track the API responses in a terminal window, and I often ended up making similar next guesses to the LLM. I.e., “Ok we’re going along this line, and the value is still decreasing, so let’s take another step in that direction. Oh ok now the value is higher, so let’s take a step back, and move orthogonal instead…” and so on.

Anyway, let’s see how the LLM approach compares to the other optimizers. The figure below shows an example of the optimization paths for each method.

In this example, Nelder-Mead was not able to get close to the global minimum in the fixed budget of 20 function evaluations.
Although Bayesian optimization made a good first start, it didn’t make great progress past the 9th function evaluation.
However, as luck would have it, the LLM optimizer performed better than all the other optimizers, getting fairly close to the global minimum!
I thought this was probably a fluke though, and ran another attempt. This time, I’m visualizing the objective function per iteration step, for each optimization method.

Again, the LLM optimizer ended up doing the best! Still I wan’t convinced. So I ended up running 10 experiments (starting at random initial points) and taking the average of the results. Would the LLM optimizer be statistically better than the other optimizers?

It seems that GPT-5 is actually quite a fine optimizer, in a latency-agnostic, limited function evaluation setting!

Conclusions

So what can I gather from this? Well it’s been making me thing about the different choices one makes when deciding on an optimization (or decision-making) strategy. I think a big deciding factor is robustness. Although approaches like Nelder-Mead etc… don’t have guarrantees to converge to minima, for practical purposes they will “work,” because they have been constructed for that purpose. With a prompt-guided LLM, could really do anything: hallucinate, get caught in a loop, refuse to return formatted outputs, etc. If you have a safety-critical application, one might prefer a more robust method over a “fuzzy” LLM approach. Another big factor is latency. LLM API calls are (usually) slow, and for some settings you just can’t afford that. (On the topic of affordability, LLMs aren’t free! These experiments cost me around $10 in API calls.) Overall this makes me think of another colleague I work with, who works on the Curiosity rover, where decisions are made by committee, and latency is minutes to hours. I don’t see LLMs being used in settings like that, but there are similarities!

Will I actually use this in my work? Probably not in its current form. But if someone made a really robust LLM optimizer, I’d actually consider using it for real problems. I think in cases were you don’t have access to gradients, you have a small budget for function evaluations, and you don’t want to wrap your code in an optimizer loop, this could actually be a valid approach!
I could imaging this working in an interface where I relay inputs and outputs to an LLM that is set up as a kind of hybrid online-offline optimizer, and could see myself using it as a good addition to “intuition-guided” optimization.

I think some interesting future work could be to experiment with different formulations of the prompting, e.g., a stateful dialogue. It would also be good to compare the performance of reasoning vs. non-reasoning LLMs, as well as testing on other optimization problems (even some real settings)! If you’re interested in the code for this project, you can check it out in this github repo!

http://posgeo.wordpress.com/?p=2986

Extensions

Hacking with GPT-5: Generating Beautiful Diagrams at the OpenAI Hackathon

posgeo Aug 14, 2025 Updated Aug 14, 2025

Last weekend I had the pleasure of participating in the official OpenAI GPT-5 hackathon in San Francisco! The event was thrilling, it was amazing to see so many passionate people trying out new things with GPT-5. I figured I’d write up what I developed in my weekend project, as well as gave some thoughts on […]

Show full content

Last weekend I had the pleasure of participating in the official OpenAI GPT-5 hackathon in San Francisco! The event was thrilling, it was amazing to see so many passionate people trying out new things with GPT-5. I figured I’d write up what I developed in my weekend project, as well as gave some thoughts on the new language models!

So, what did I end up making? Well inspiration came from some work I’d seen from the CMU Geometry Gollective group (geometry.cs.cmu.edu), specifically, the work by Katherine Ye and others on a framework and DSL for generating beautiful mathematical diagram: Penrose.

Title, authors, and heading of the Penrose paper, DOI:10.1145/3386569.3392375, shared under CC BY-NC-SA 4.0.

The framework gives a way of transforming a set of declarative DSL statements into visual representation, using constraints and optimization rather than typical imperative graphics approaches (e.g., PGF/TikZ).

In my last hackathon, I experimented with LLMs to generate and process DSL code (geo-lm), so I thought it would be fun to try and use GPT-5 to generate Penrose code!

After a bit of brainstorming, I decided to attempt to make an app where a user can upload a sketch of a diagram, and then GPT-5 would view that sketch and be prompted to generate Penrose DSL code that best reproduces the sketch. This takes advantage of GPT-5’s design of taking both text and image for input, and outputting text. Here’s a demo of it working:

The architecture is pretty simple: when the user uploads an image, that image is send to GPT-5 through the OpenAI API, along with a text prompt describing the Penrose DSL grammar, and instructing it to generate the required DSL code. When the DSL code is generated, it is rendered locally, and the figure is displayed as an SVG! If anyone is interested, all the code is all open-source on Github, at: github.com/williamjsdavis/penrose-AI.

GPT-5 as a Coding Assistant

Overall, I was very impressed with GPT-5 helping me make this product. I used it within Cursor in the chat window; I’m not an agentic user just yet. I found that all the generated code worked first-time round, with minimal corrections needed from me (e.g., formatting, altering paths, etc…). Perhaps this is because the app I made was pretty simple: just a Django backend with a React frontend, using the OpenAI API for calling GPT-5. I think it’s kind of incredible though that someone like me, with hardly any full-stack experience can make a prototype app in a few hours!

GPT-5 as a DSL Compiler

Unfortunately, I had less success with GPT-5 for generating Penrose code. Even with extensive prompting, I found it very difficult to get GPT-5 to produce DSL code without errors. I encountered a similar problem with generating GemPy inputs in geo-lm; at least here the Haskell-based compiler gives very clear error messages. In fairness, the difficulties I had likely reflect the sparsity of Penrose DSL code in LLM training data. I believe that if I had instead used GPT-5 to generate more common code, like TikZ code, I’d have had greater success. (I may test this in the future!) After the event, I spoke to another hackathon participant who was also using GPT-5 as a DSL generator (this time for Manim). She said that she managed to wrangle GPT-5 into producing compliant code by: (1) prompting it to only use certain parts of the DSL language; and (2) running the code generation in a loop, i.e., if generation errors, feed the error message back into the prompt, and attempt to generate again. Apparently it took around 10 iterations for the LLM to generate compliant code! Not ideal for an interactive application.

Summary

Overall I had a lot of fun using GPT-5! I’d be happy to continue using it as my “daily driver” for coding problems. However, I don’t see myself attempting to use it for DSL generations in the future, unless I find more robust approaches. If anyone has references for alternative methods, I’d be happy to hear about it!

I might attempt DSL generation again with some of these new, fast inference language models, like Inception lab’s diffusion LLM, Mercury. Perhaps with faster iteration cycles, codegen could be more responsive.

I mentioned this earlier, but I may try and reattempt this project using a procedural diagramming language like TikZ. I think this will likely have more success, as there’s more training data out there, what with the LLMs likely having a lot of LaTeX-related code in their training data. Although this now makes me think: If there was as much Penrose code in the LLM training data, would that actually have translated into success in this project? Are LLMs better at generating declarative or imperative code? If anyone knows work that has studied this, please let me know!

http://posgeo.wordpress.com/?p=2969

Extensions

Hacking Non-Stationary Gaussian Processes

posgeo Mar 25, 2025 Updated Mar 25, 2025

Gaussian processes (GPs) are powerful tools for modeling and quantifying uncertainty over functions that are smooth and continuous. When used for regression, they are one of the only methods that allows you to do Bayesian inference in infinite-dimensional spaces, without either: resorting to approximate inference methods like RTO-TKO [1], or truncating the problem ab initio […]

Show full content

Gaussian processes (GPs) are powerful tools for modeling and quantifying uncertainty over functions that are smooth and continuous. When used for regression, they are one of the only methods that allows you to do Bayesian inference in infinite-dimensional spaces, without either: resorting to approximate inference methods like RTO-TKO [1], or truncating the problem ab initio and using typical Bayesian sampling methods (i.e., MCMC) [2]. So, GPs are great! In my work at Terra AI I use them all the time for tasks like modeling prior distributions of subsurface geologic properties (or geo-priors, as I call them), or modeling uncertain cost landscapes in Bayesian optimization. It’s also now easier than ever to get GP models up and running in production settings, thanks to the development of GP libraries like GPyTorch [3], GPyFlow [4], BoTorch [5], as well as the rich Julia ecosystem for GP modeling [6].

An important part of a GP is the covariance function, also known as the kernel, which encodes how pairs of points in the domain are related to each other. Kernel design is a key part of GP modeling: it allows you to encode prior beliefs about the function you’re modeling, while also allowing you to flexibly fit the data [7]. However, many out-of-the box GP libraries are limited to stationary kernels. These are kernels that have the same properties everywhere in their domain. This is a limitation when modeling data that has different variability over different regions of interest.

Take, for example, the problem of modeling a sedimentary system that has different variability over different regions of interest. Perhaps it has large, lateral, parallel layers interspersed with smaller-scale, channelized features. A stationary kernel would be unable to model this, as it assumes that variability is similar everywhere in the domain.

A sedimentary cross-section, showing lateral layers and a channelized feature. Image from Michael C. Rygel via Wikimedia Commons [CC BY-SA 3.0]

Although the mathematics of GPs allows for spatially-varying kernels, one might be restricted to a stationary kernel by the library they are using. If this is the case, how can you “hack” non-stationarity into your GP? In this blog post, I’ll show you three (well, two and a half) ways to do this. I’m also going to take this opportunity to announce the start of a new ongoing series of blog posts and a new GitHub repository: “The Gaussian Processes for GeoPhysics and Geology Playground.,” or GP³ for short. These posts and examples (starting with my last post on full-matrix distances for GPs [8]) will be focused on novel ways to work with GPs, with a focus on the use of GPs in geophysics and geology. Let’s get started!

Approach 0: The Naive Approach

Let’s start off with seeing how bad the problem is when we use a stationary kernel. I’m going to create a toy dataset that has different variability over different regions of interest, and then fit a GP to it using a stationary kernel. To see the code for this, you can check out the hacking-non-stationary-GPs/0_naive_stationary_GP.ipynb file. This model, as well as the rest of the models I’ll show you in this post, are implemented using GPyTorch [3].

Once I’ve trained my GP model on this data, I’ll make predictions across the whole domain. In the spirit of “vibe-coding,” the evaluation process here will be “vibe-verification,” where I’ll just see if it looks “right.”

I’m plotting everything shown in the legend; in later plots I may hide certain elements.

As we can see, the GP fits to the long-wavelength features, but shoots straight through the short-wavelength features. This is because the stationary kernel is not able to model the non-stationarity of the data. A different failure mode can occur in some (rarer) cases, shown below.

In the repo this is dataset D. I’m only plotting the posterior samples, not the mean or the uncertainty.

Here, the model fits to the short-wavelength features, giving a very wiggly fit to smoother parts of the domain, e.g., between x=0 and x=2. This behavior can be understood by looking at the minimum loss for a range of fixed length-scales.

I fix the length-scale of the GP and train the other parameters (covariance and likelihood). The plotted data is the training loss, for a range of fixed length-scales, for dataset D.

As we can see, there is a bimodal distribution of minimum loss, with a minimum at a length-scale that is too large for the short-wavelength features, and a minimum at a length-scale that is too small for the long-wavelength features. There is no way for this model to fit both parts of the variability well.

Approach 1: Warped Inputs

One idea to hack non-stationarity into a GP is to “warp” the input space. This is the idea behind warped GPs, first introduced by Snelson et al. [9]. The idea is to transform the input space in a way that makes the problem stationary. It’s key that this warping function is learnable, whether it’s a neural network or a simple function like a polynomial. In my example in hacking-non-stationary-GPs/1_warped_input_GP.ipynb, I defined the warping function as a simple neural network with a single hidden layer, and then jointly optimized the warping function and the GP parameters at the same time.

I’m plotting the predictions in true-space in the top plot, and the predictions in warped-space in the bottom plot.

Here the warping function has “squished” the left section of the domain, and “stretched” the middle section. This has the effect of making the variability the same wavelength throughout the domain. Looking at the predictions of the trained model, there isn’t much high-frequency wiggling in the smooth parts of the domain: the model has fit both the long- and short-wavelength features well with its single-length-scale stationary kernel. The parts of the domain that have been warped can be visualized by plotting the gradient of the warping function.

Bottom plot shows the scaling factor, i.e., the gradient, of the warping function.

This learned warping function happens to be monotonically increasing, but in practice it may not be. It might be decreasing, or loop back on itself, depending on what fits the data best.

The warping function, luckily, is monotonically increasing.

If you wanted to do a proper Bayesian attempt at this, you can perform inference over the warping function as well, which has been done in the literature [10, 11].

Approach 2: Segmented GPs

Another simple, but effective, approach to non-stationarity is to chop up the domain into different segments, and fit a GP to each segment. I couldn’t find a concensus name for this method (as I’d bet this approach has been in use since the onset of GPs for data modeling), but I’ve seen it refered to as “segmented GPs,” “partitioned GPs,” or just “chunking.” I’ve implemented this approach in hacking-non-stationary-GPs/2_segmented_GP.ipynb. I use scikit-learn’s KMeans clustering algorithm [12] to find the segments, although any segmentation method could be used.

I’m plotting posterior samples and the data, which is color coded by the segment of the domain to which it belongs.

Here we can see that, although the model has fit each segment well, there are painful discontinuities at the segment boundaries. If your target application needs full-domain coherence, this is a no-go. On the plus side, as this method fits multiple GPs to smaller datasets, it is much more computationally efficient than other approaches. In fact, Tazi et al. [7] recommend this approach for large-scale datasets. GPs scale in compute and memory with O(N³) and O(N²) respectively, N being the number of training points. If you divide your domain into M segments, the complexities become O(N³/M³) and O(N²/M²). Additionally, training can be parallelized across the segments.

Approach 3: Basis Function GPs

Ok so this is just a generalization of the segmented GP approach. Instead of partitioning the domain into discrete segments, you can instead can use any basis functions. The idea is to define a set of orthogonal basis (preferably overlapping) functions that sum to 1 at every point in the domain. Then, you train a GP for each basis on the data points in values of the domain where the basis function has non-zero weight. At evaluation time, you can sum the predictions of the GPs weighted by the basis function values to get the final prediction. I couldn’t find an example of this approach in the literature, but I’m sure it’s been done before. In my example in hacking-non-stationary-GPs/3_basis_GP.ipynb, I used a set of triangular basis functions.

Bottom plot shows posterior samples and the data, and the top plot shows the basis functions in different colors.

This model fits the data without the discontinuities of the segmented GP, giving a much smoother but variable fit. However this advantage is also a disadvantage: sharp changes in variability are smoothed out. The posterior samples near x=8 show this effect, where the basis function around x=8 learns the short-scale variability, and imposes this on the smoother area at x>8. Ideally, you could be able to learn the most effective basis functions for the problem at hand, but I’m not sure what format that would take.

Other Approaches (That Didn’t Work or I Didn’t Try)

There were a few other approaches I researched and/or tried, so I’ll briefly mention them here:

– Spectral GPs, both with deltas [13] and mixtures [14]: Both approaches did not perform well for me; I think they’re better suited to (semi-)periodic data.
– Fully Bayesian GPs: I also tried using a fully Bayesian GP [15], with prior distributions over the length-scale parameter. Although this produced some posterior samples that fit short-scale variability, and some that fit long-scale variability, no single sample fit all of the variability well.
– Hierarchical GPs: I found a cool paper by Lee et al. [16] pretty late into this project, and I wish I had seen it earlier. This uses a hierarchical sets of inducing points, and trains a GP for each level of the hierarchy to give a multi-scale model. I haven’t tried this yet, but it’s on my todo list.

Conclusions

This was a fun dive into GPs for non-stationary data! This went more into the implementation and modeling choices than either the math or applications of GPs, but I hope you found it useful.

I’d like to explore how these approaches perform on higher-dimensional data in the future, as well as real-world geological datasets. I’ll be posting more about this in the future, so stay tuned!

Possible topics for future GP³ posts:
– Hierarchical GPs for multi-scale modeling
– GPs for gravity problems
– Anandaroop Ray’s approaches for GP EM inverse problems

References

[1] Blatter et al., Uncertainty quantification for regularized inversion of electromagnetic geophysical data—Part I: motivation and theory, Geophysical Journal International, Volume 231, Issue 2, November 2022, Pages 1057–1074, https://doi.org/10.1093/gji/ggac241
[2] Sambridge and Mosegaard, Monte Carlo methods in geophysical inverse problems, Reviews of Geophysics, Volume 40, Issue 3, September 2002, Pages 3-1, https://doi.org/10.1029/2000RG000089
[3] https://gpytorch.readthedocs.io/en/latest/
[4] https://www.gpflow.org/
[5] https://botorch.org/
[6] https://github.com/JuliaGaussianProcesses
[7] Tazi et al., Beyond Intuition, a Framework for Applying GPs to Real-World Data. arXiv preprint arXiv:2307.03093. 2023 Jul 6. https://arxiv.org/abs/2307.03093
[8] https://posgeo.wordpress.com/2024/12/05/full-matrix-distances-for-gaussian-process-kernels/
[9] Snelson et al., Warped gaussian processes. Advances in neural information processing systems. 2003;16. https://proceedings.neurips.cc/paper/2003/hash/6b5754d737784b51ec5075c0dc437bf0-Abstract.html
[10] Lázaro-Gredilla, Bayesian warped Gaussian processes. Advances in Neural Information Processing Systems. 2012;25. https://proceedings.neurips.cc/paper_files/paper/2012/hash/d840cc5d906c3e9c84374c8919d2074e-Abstract.html
[11] Snoek et al., Input warping for Bayesian optimization of non-stationary functions. In International conference on machine learning 2014 Jun 18 (pp. 1674-1682). PMLR. https://proceedings.mlr.press/v32/snoek14.html
[12] https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
[13] https://docs.gpytorch.ai/en/stable/examples/01_Exact_GPs/Spectral_Delta_GP_Regression.html
[14] https://docs.gpytorch.ai/en/stable/examples/01_Exact_GPs/Spectral_Mixture_GP_Regression.html
[15] https://docs.gpytorch.ai/en/stable/examples/01_Exact_GPs/GP_Regression_Fully_Bayesian.html
[16] Lee et al., Hierarchically-partitioned Gaussian process approximation. In Artificial Intelligence and Statistics 2017 Apr 10 (pp. 822-831). PMLR. https://proceedings.mlr.press/v54/lee17a.html

http://posgeo.wordpress.com/?p=2921

Extensions

“A Cliché Title” Is All You Need

posgeo Jan 23, 2025 Updated Jan 23, 2025

Recently in my work I’ve been reading papers in areas of computer science (mostly machine learning), and I’ve noticed how the titles are markedly different to those I’m familiar with in Earth sciences. Titles are attention-grabbing, feeling playful and almost whimsical. However, there seems to be a great deal of repetition, becoming cliché at times. […]

Show full content

Recently in my work I’ve been reading papers in areas of computer science (mostly machine learning), and I’ve noticed how the titles are markedly different to those I’m familiar with in Earth sciences. Titles are attention-grabbing, feeling playful and almost whimsical. However, there seems to be a great deal of repetition, becoming cliché at times. I previously wrote about a similar phenomenon, called snowclones, in a blog post back in 2023. So, armed with an arXiv dataset and a Google Colab notebook, I thought it would be fun to explore trends in paper titles!

What might it be that you only need? Well originally it was attention, but now it appears that you “only” need: segmentation, one epoch, logarithmic pruning, one wide feedforward, causality, pretrained encoders, downstream bias mitigation, textbooks, five parameters, zoom and shift, and pretty much anything else you can imagine! Overall, I found 364 papers that claim that their thing(s) is/are the only thing you need! One for every day of the year, with a day off for Christmas!

What is considered harmful then? Well famously it was the “Go To Statement“, but supposedly the warning has been extended to cross-band interference, parallel training, bicycle attacks(?!?), dot-product attention, replication studies, and 22 other things. And that’s just on arXiv, not counting other journals!

But on the brighter side, we have 74 things that claiming to be “unreasonably effective”, including: tensor products, nonstandard analysis, address clustering, Monte Carlo simulations, CNNs, and many more…

But how do these cookie-cutter phrases compare to traditional snowclones?

Interestingly, “to X or not to X” is a very popular term in arXiv papers. These days, Hamlet might be contemplating parallelizing, boosting, normalizing, aggregating, or 370 other possible actions. Even though this phase seems to have a long, popular history, it pales in comparison to the meteoric rise of “X is all you need”.

The good old “X is the new Y” appears again, along with the less pleasant “make X Y again”. The latter seems like an peculiar choice for a paper title, especially when you’re talking about things like “graph-based referring expression comprehensions”. See below for a directed graph showing the relations found in the data.

Relations “make X Y again” found in the titles of the arXiv dataset. Nodes show the occurrences of words/phrases that make up X and Y. The directed edges (arrows) show the relations pointing from X to Y.

Pithy comments aside, as someone coming into the field, I’m especially grateful that these resources are made freely available, and not locked behind journal paywalls. So, what does the future hold for cliché paper titles? I’d love to try and use the arXiv dataset to try and automatically discover emerging trends in paper titles. I tried to do this with n-grams in NLTK, but I couldn’t get a good predictive model working effectively. If anyone has any suggestions, please let me know!

References

This article used the “arXiv Dataset” by Cornell University under the CC0 licence, downloaded from Kaggle.

http://posgeo.wordpress.com/?p=2890

Extensions

Full matrix distances for Gaussian Process kernels

posgeo Dec 5, 2024 Updated Dec 5, 2024

This post is a summary of a series of experiments I recently conducted, see the following gist: link. In this post, I explore the flexibility of Gaussian process (GP) kernels in GPyTorch. Specifically, I’m interested in their ability to learn appropriate length-scales inherent in spatial data. I think Gaussian processes are great! They are a […]

Show full content

This post is a summary of a series of experiments I recently conducted, see the following gist: link.

In this post, I explore the flexibility of Gaussian process (GP) kernels in GPyTorch. Specifically, I’m interested in their ability to learn appropriate length-scales inherent in spatial data.

I think Gaussian processes are great! They are a mathematical / machine learning model that can fit to data, and critically they provide a (Gaussian) framework for uncertainty quantification. Once you fit/trained a GP to data, you can generate posterior samples—or evaluate the marginal variance—at test points, and use the resulting objects for reasoning, hypothesis testing, decision-making, or anything else!

Kernels and length-scales

But Gaussian processes are only as good as the kernel function that is used to define them. The kernel, or covariance function, defines how nearby points are correlated with each other. These kernels are often parameterized with a few scalars that govern the shape of scale of said correlation. For example, the radial basis function (RBF), a common choice for GP kernels, is defined as

$k(x, x') = \exp\left(-\frac{1}{2} (x - x')^\top \mathbf{\Lambda}^{-1} (x - x')\right),$

and is parameterized with a distance matrix $\Lambda$ . A keen eyed reader will recognize the expression in the exponent as the Mahalanobis distance, a generalization of the z-score, measuring of distance between a multivariate normal distribution and a point. The structure of the matrix $\Lambda$ determines how distance is measured along each dimension.

Lines of equal distance in 2D space for three main parametrization classes of $\Lambda$ . Left: Spherical distance; a positive scalar multiple times identity matrix $\mathbf{\Lambda} = \ell^2 \mathbf{I}$ . Center: Diagonal distance; a matrix with positive diagonal entries. Right: Full matrix distance; a symmetric positive definite matrix. Source: Wikipedia, CC BY-SA 3.0.

I hope you can see that the full matrix distance provides the most amount of freedom for how one could define distances for the kernel, and how the diagonal and spherical distances reduce the distance measurements to progressively smaller sub-spaces, becoming simpler but less flexible.

Why would you ever need a full distance matrix?

Thinking about these things mathematically is all well and good, but let’s see an example where the full distance matrix is required for a GP to adequately model data!

Consider the following two true functions we’d want our GPs to model. True function A is a function in 2D Cartesian space, showing sinusoidal waves oriented at angles to the coordinate axes. True function B is similar, but with the waves aligned parallel to the y-axis. Consider sampling data points from these true functions, at a dense grid of points in $[0,1]\times [0,1]$ , as well as a cluster of points around $(-0.5, -0.5)$ . The reasoning behind this data setup it to use the grid to learn the dominant length-scales, and use the cluster as a test point to evaluate predictions.

What would happen when we train GPs to fit this data? Well, one would hope that, for each data set, the GP would learn two length-scales: one short scale corresponding to the wavelength of the sinusoids, and another long scale (possibly unbounded) corresponding to the along-wave direction. Then, when sampling the GP at new test points, it should be able to extrapolate away from training data along the wave crests: it should learn the long length-scale! But does this actually happen?

Testing the length-scale learning hypothesis

I used the following setup in GPyTorch (see the gist for the full details: link). I used a 2D, exact GP model with a scaled RBF kernel, with a Gaussian likelihood, and set the number of learnable length-scales equal to the number of input dimensions (i.e., 2). I also set the likelihood noise to be very low, and unlearnable (as we have no noise in the data). During training, the marginal log-likelihood is evaluated and used to update the learnable hyper-parameters: the length-scales of the kernel. The predictions for each dataset are shown below.

Oh dear, what happened? Let’s look at the predictions for dataset B first. The function prediction appears to be good above and below the training data, for both the gridded block, as well as the cluster. It seems that the GP was able to learn a short length-scale for the sinusoid wavelength and a long length-scale for the along-crest direction, and extrapolates the function predictions above and below the training data with the knowledge of the long length-scale in that direction. (In fact, when plotting the length-scales during training in the gist https://gist.github.com/williamjsdavis/0996735b68f976f165ca9f9feb87fa96, the long length-scale appears to be increasing without bounds.) But what happens for dataset A? Well it seems that the GP can only learn the axis-aligned components of the length-scale (think about it only seeing the perpendicular lengths of a right-angled triangle). When extrapolating outside the training data regions, the GP only confidently predicts as far as the longest inferred length-scale. Not ideal!

How can we fix this?

There isn’t a direct way to specify that you want to learn the whole symmetric positive definite $\Lambda$ matrix in GPyTorch (or if there is a way, please let me know). Instead, I followed some issue threads (here and here) to find a solution. The solution is a little detailed (maybe I’ll make a separate blog post about it someday), but the general idea is to transform, or project, the data $x \to P x$ with a learnable projection matrix $P$ . Then, the kernel is computed on transformed inputs $P x$ . So if there is a certain projection that provides the best fit to the data, the GP will learn that projection!

And it seems to work! Now, for dataset A, the GP was able to learn a short length-scale for the sinusoid wavelength and a long length-scale for the along-crest direction, in addition to a rotation. Now it is able to extrapolate the function predictions along the crest of the the training data, with the knowledge that there should be a long length-scale in that direction!

Some final notes

Apparently the projection method investigated here can be viewed as a special case of something called Deep Kernel Learning (DKL). I hadn’t come across this before, but it seems like a combination of neural networks and GPs, and the idea is that the GP kernels operate on transformed input (and maybe the output?) spaces. Regardless, they’re something I’d like to look at in the future!

This investigation into Gaussian processes was inspired by work I’m doing regarding inverse problems and uncertainty quantification on infinite-dimensional vector spaces, and related functional analysis and inverse theory problems. So far, it seems that Gaussian processes seem to be one of the only serious ways of performing Bayesian inference in these settings. If anyone has any recommendations of other methods to look at, I’d be happy to hear about them!

http://posgeo.wordpress.com/?p=2841

Extensions

Advent of Code in Julia v0.1

posgeo Dec 1, 2024 Updated Dec 2, 2024

After taking an unintended break from blogging, I’m back! Quick update, I decided to leave my fellowship position at IGPP, and now I’m a senior computational scientist at Terra AI! I plan to write a longer blog post about this when I have the time, and talk about career choices, and the things I learned […]

Show full content

After taking an unintended break from blogging, I’m back! Quick update, I decided to leave my fellowship position at IGPP, and now I’m a senior computational scientist at Terra AI! I plan to write a longer blog post about this when I have the time, and talk about career choices, and the things I learned along the way. But in the meantime, I’m going to renew my blogging efforts by talking about the yearly Advent of Code challenge. Since I use Julia as my primary language in my day job at Terra AI, I thought it would be interesting to see how functional and usable the earliest versions of Julia were!

I’m running Julia version 0.1.2, which is the oldest release I can find MacOS dmg binaries. It’s unclear exactly when this release was made. The last commit on the v0.1.2 github branch was made on March 7th 2013, but there are no related artifacts on the Julia github release page (the earliest appears to be a release candidate for version 0.2). Anyway, the binaries I downloaded seem to be running fine on my M1 macbook!

Here is my shot at the first part of the problem for day 1. I’m focusing here on a simple, readable solution, rather than something that is computationally efficient or elegant.

raw_data_mat = readdlm("input-data.txt", Int64)
left_column = raw_data_mat[:, 1]
right_column = raw_data_mat[:, 2]

# Calculate sum of absolute differences
res = sum(abs(sort(left_column) - sort(right_column)))

println(res)

This is almost exactly the same as modern Julia! The only difference is that broadcasting over the vector inputs is implicit over the infix negation operator and abs; in Julia nowadays you’d have to specify .- and abs.(...), or use the broadcast function. This old syntax feels a lot like MATLAB-post-2016, where broadcasting rules are a lot more implied. I expect that I’ll be seeing more of this as I progress, since the early to mid 2010s where times where both MATLAB and Julia were changing up the ways they interpret matrix operations; see this good blog post on the changes in MATLAB by Mathworks’ Loren Shure and Steve Eddins: “MATLAB arithmetic expands in R2016b.”

Moving on to the second part of the question (skipping the data loading):

# Count entries in right list
num_freq_dict = {u_val => sum(right_column .== u_val) for u_val in unique(right_column)}

# Multiply the frequencies by entries in left column and sum
res = mapreduce(x->x*get(num_freq_dict,x,0),+,left_column)

println(res)

Here we can use some old dictionary comprehension syntax that has been since discontinued, with curly braces like in Python. It seems that this syntax was discontinued in version 0.4 back in October 2015 (see the release notes here), and in modern Julia you use Dict(...) instead. It’s interesting that Julia dropped this syntax so early on. It’s possible it was done to avoid misinterpretation with the {...} syntax for function types, or perhaps it was a concerted effort to not look like Python?

Anyway, I’m going to continue doing the Advent of Code puzzles over the following 25 days, probably irregularly, and making a few blog posts here and there when I find interesting things to talk about! It’ll be exciting to continue exploring how the syntax of Julia has changed over the years!

References

Cover art by me, own work (CC BY-SA 4.0)

http://posgeo.wordpress.com/?p=2868

Extensions

Visualizing Deep Flows in the Earth

posgeo Nov 20, 2023 Updated Nov 20, 2023

The Earth beneath out feet might seem stable and immovable, but over millions of years a different picture emerges. Over geologic time, continents break apart, collide, and shear against one another. We know that these forces were even strong enough to even pull apart Pangaea, forming Africa, South America, and other continents. But what is […]

Show full content

The Earth beneath out feet might seem stable and immovable, but over millions of years a different picture emerges. Over geologic time, continents break apart, collide, and shear against one another. We know that these forces were even strong enough to even pull apart Pangaea, forming Africa, South America, and other continents. But what is the origin of these incredible forces? The answer lies in the mantle.

A cartoon cross-section of the Earth (Source: William Davis)

The mantle is a layer the Earth, between the crust and the core. Over 2900 km thick, and composed of silicate rock, it makes up most of the Earth. To our perspectives, it behaves like solid rock: it is rigid, and responds elastically to earthquake waves. But if we were able to observe it over millions of years, we would see that is behaves more like a liquid, flowing and dissipating [1]. It convects like a pot of viscous syrup: material is heated from the core at the base, this material rises buoyantly, and is cooled by the crust at the top, upon which it sinks. Over millions of years, these convection forces are able to push, pull, and drag the continents of the crust.

As scientists, we want to investigate the physics of this process. However, we can’t simply drill to the mantle and observe it directly. (The deepest we’ve ever drilled is just over 12 km.) Instead we investigate mantle convection through computer simulation. This is done by solving partial differential equations for the transport and evolution of momentum, heat and composition. We discretize these equations, subject them to set boundary conditions, and solve them numerically on fine spatial grids over many millions of time-steps.

3D visualization of mantle convection (Source: Shijie Zhong, CU)

The output flows of these simulations model what we think mantle convection looks like in our planet. Can we visualize these flow fields? For printed publications, only “snapshots” of simulation properties can be shown. These might show 3D temperature iso-surfaces, or flow streamlines. Web-based videos and animations allow more satisfying visualizations, showing time-dependent flows.

However the geometry of the mantle means it is often difficult to visualize the volumetric data from the simulation. Many details are occluded from view by the core. One option is to transform the data to a different map projection. A novel example of this is the “3D cartographic” projection, where the latitude and longitude simulation coordinates are projected onto a 2D plane, and the simulation radius is projected as a height.

Conceptual 3D cartographic projection of mantle convection (Source: Coppin P.W. 2021)

My Project: Shell Texturing for Mantle Convection

I wanted to experiment with the 3D cartographic projection method, and see if it would work well with a volumetric rendering technique called “shell texturing.” I used previously computed mantle convection simulation data (source). For this simple experiment, I chose to visualize surfaces of equal temperature. At a single simulation time-step, I compute and visualize the contour of the 2D temperature field data for each vertical (radial) shell level. I change the shell color to make height more discernible. Finally I repeat this for each simulation time-step, and output the visualization.

This view is very satisfying, as you can see the temperature evolution on all sides of the domain at once. One can see the hot plumes rising from the core, and lateral movement of higher temperature anomalies. There are some small visual artefacts at the far edge of the visualization, were the gaps between the shells are more noticeable. This can be avoided by raising the view angle, which also highlights the “sheet-like” shape of the temperature anomalies.

Overall Perspective

Overall, I had a lot of fun visualizing this data using shell texturing! It’s definitely a technique I’ll keep in mind in the future when I’m visualizing volumetric scientific data (perhaps for my magnetohydrodynamic simulations).

One difficulty I encountered was with the speed of my visualization pipeline (code available here). Unfortunately, the simulation data files are very large (the field data for each time-step is ~3 GB), and rendering takes a very long time. An animation with 100 frames takes approximately 1 hour to render, which is not ideal. This is compounded by my choice of visualization library: Matplotlib. In the future, I’d like to experiment with implementing shell texturing in ParaView, which hopefully should be more performant.

Notes

[1] I have made some slight simplifications to the story here. The crust and mantle are actually distinguished by chemical composition. It’s actually the crust and the upper part of the mantle that behave rigidly: together they are called the “lithosphere.” The remaining lower part of the mantle that is mechanically weak is called the “asthenosphere.”

http://posgeo.wordpress.com/?p=2811

Extensions