Alexia Jolicoeur-Martineau, Ph.D.

Any-Property-Conditional Molecule 🧪 Generation with Self-Criticism 👩‍🏫 using Spanning Trees (STGG+)

Alexia Jolicoeur-Martineau Jul 15, 2024

Update 2025-03-12: We have since improved STGG+ and added active learning (STGG+AL). It beats RL methods at generating molecules with complex properties. The molecules we get are much nicer than the ones from the original paper. Molecule synthesizability can be improved simply by adding constraints such as max-ring-size ≤ 6 and removing molecules that are too large (since STGG+ already takes care of enforcing proper valency rules). See below for an example of a molecule made by STGG+AL.

——————————————————————————————————–
Paper / Code

Twitter (sorry, 𝕏) is obsessed with Large Language Models (LLMs) lately, so we hear very little about other cool applications of generative AI. Molecule generation is an exciting area of generative AI since it can serve to generate new drugs or materials (such as organic LEDs, the material used in the screens of smartphones and newer TVs). In this work, we describe a powerful new method for property-conditional molecule generation with self-criticism using the lesser-known Spanning Tree-based Graph Generation (STGG) method. We derive several exciting techniques, such as self-filtering through property prediction and random classifier-free guidance.

The molecules we love

Molecules generated by our STGG+ model. Left: conditioned on logP=-13.6292; Right: conditioned on logP=28.6915. logP measures how strongly a molecule prefers an oily phase (octanol) over water, so the left one (very negative logP) dissolves easily in water, while the right one (very high logP) is very hard to dissolve in water.

[O-LED molecules with different colours for your smartphone screen]

What we care about for real-world molecule generation

Generating valid instead of invalid molecules: Generative models trained on standard molecule representations (SMILES, 2D graphs) often produce many invalid molecules, especially when the molecules are large. SELFIES is a well-known method to prevent invalid molecules through a specific grammar, but it often leads to worse performance. STGG is a lesser-known way of preventing invalid molecules through specific if/else rules during next-token sampling, which mask invalid tokens. We start from STGG as a base approach since it’s one of the best-performing molecule representations (along with GEEL) for unconditional generation.

Property-conditional generation: Most of the research on molecule generation focuses on unconditional generation, which has little practical use. In the real world, we care about generating molecules with some desired properties.

Any-property-conditioning: We also want to consider any combination of desirable properties without retraining the model every time. We make this possible by masking random combinations of properties during training.

A powerful, fast, and modern architecture: We improve on the STGG base Transformer using the tricks common in modern LLMs: RMSNorm, projection weight initialization, no bias terms, rotary embeddings, Flash-Attention-2, SwiGLU, and changes in hyperparameters.

Self-criticism 👩‍🏫 : Synthesizing a molecule can take days, weeks, or even months. Thus, we cannot expect overworked chemists to synthesize and measure the properties of all our generated molecules! We need a way to filter out the molecules we provide to chemists. We propose the following idea: have the generative model predict the properties of its own molecules! We give the model the ability to predict properties and thus self-criticize its own generated molecules, allowing it to automatically filter out molecules with incorrect properties.

Classifier-free guidance: Classifier-free guidance (CFG) is a technique for improving the performance of diffusion models. It has also been shown to be useful for language models. We also found it to improve molecule autoregressive generation.

Out-of-distribution properties: We may seek to generate novel molecules with out-of-distribution properties that have never been observed before in order to expand the range of our molecular knowledge. These properties generally involve an extreme range of values, sometimes leading to worse performance when using classifier-free guidance with large guidance (w>1). We propose to randomly sample a guidance w ∼ U(0.5, 2) for each sample, ensuring a mix of low (w<1) and high (w>1) guidance. Then, our method selects the best-out-of-k molecule from the molecules generated at different guidance levels, indirectly allowing the model to determine by itself which guidance is best for each sample.
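To make the guidance step concrete, here is a minimal sketch of classifier-free guidance applied to next-token logits with a randomly drawn guidance weight. It assumes the common logit-space formulation (unconditional logits plus w times the difference to the conditional logits) and assumes that the validity mask is applied after guidance; the function names and toy numbers are hypothetical and are not taken from the STGG+ code.

```python
import math
import random

def guided_logits(cond_logits, uncond_logits, w):
    """Classifier-free guidance in logit space: uncond + w * (cond - uncond)."""
    return [u + w * (c - u) for c, u in zip(cond_logits, uncond_logits)]

def masked_softmax(logits, valid):
    """Softmax restricted to the tokens allowed by the validity mask."""
    m = max(l for l, ok in zip(logits, valid) if ok)
    exps = [math.exp(l - m) if ok else 0.0 for l, ok in zip(logits, valid)]
    total = sum(exps)
    return [e / total for e in exps]

def sample_guidance(rng, low=0.5, high=2.0):
    """Draw one guidance weight per generated sample, w ~ U(0.5, 2)."""
    return rng.uniform(low, high)

rng = random.Random(0)
cond = [2.0, 0.5, -1.0]      # toy logits given the desired properties
uncond = [1.0, 1.0, 0.0]     # toy logits with all properties masked
valid = [True, True, False]  # toy validity mask (third token is invalid)

w = sample_guidance(rng)
probs = masked_softmax(guided_logits(cond, uncond, w), valid)
```

With w=1 this reduces to the plain conditional model, w<1 pulls the distribution toward the unconditional model, and w>1 pushes it further toward the condition.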

The resulting method (STGG+)

Figure 1: Our STGG+ architecture. The molecule is tokenized and embedded. The number of started rings and embeddings of continuous and categorical properties are added, and the output is passed to the Transformer. The Transformer output is then split to produce 1) the predicted property and 2) the token predictions (masked to prevent invalid tokens).
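The any-property conditioning described earlier relies on hiding random combinations of properties during training. A minimal sketch of that idea follows; the independent per-property keep probability, the 0.0 placeholder, and the function name are all assumptions made for illustration, and the paper's exact masking scheme may differ.

```python
import random

def mask_properties(props, rng, keep_prob=0.5):
    """Randomly hide properties; returns (values, is_known) lists.

    Hidden properties get a placeholder value of 0.0 and is_known=False, so
    a model trained this way sees every subset of properties, including none.
    """
    values, is_known = [], []
    for v in props:
        keep = rng.random() < keep_prob
        values.append(v if keep else 0.0)
        is_known.append(keep)
    return values, is_known

rng = random.Random(0)
props = [1.2, -0.7, 3.4]  # toy property vector for one molecule
values, is_known = mask_properties(props, rng)
```

At generation time, the same mechanism lets a user specify only the properties they care about and leave the rest masked.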

Figure 2: Generation and self-prediction using STGG+. We autoregressively generate K molecules conditional on desired properties using classifier-free guidance. The unconditional model predicts the properties of the K molecules, and the molecule assumed to be closest to the desired properties is returned.
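The best-out-of-K selection in Figure 2 can be sketched as follows. This is only an illustration: the squared-error distance is an assumption, and the "molecules" and predicted values below are toy stand-ins (not real model outputs or measurements) for the model's own property predictions.

```python
def best_of_k(candidates, predict_properties, target):
    """Return the candidate whose self-predicted properties are closest to target."""
    def distance(mol):
        pred = predict_properties(mol)
        return sum((p - t) ** 2 for p, t in zip(pred, target))
    return min(candidates, key=distance)

# Toy stand-ins: molecules are strings, and the predictor is a lookup table
# playing the role of the model predicting properties with conditioning masked.
toy_predictions = {"CCO": [0.1], "CCCCCC": [3.2], "c1ccccc1": [2.0]}
best = best_of_k(list(toy_predictions), toy_predictions.get, target=[3.0])
```

Because each of the K candidates can be generated with a different guidance weight, this selection step is also what lets the model pick the guidance level that worked best for a given target.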

Some results

Final words

Check out the paper for more details! We hope this gets you more interested in molecule generation. This field has many exciting applications. This work paves the way toward real-world applications. As the next step, at Samsung, we will apply this method to search for novel materials. Stay tuned!

http://ajolicoeur.ca/?p=675

Extensions

Fashion repeats itself: Generating tabular data via Diffusion and XGBoost 🌲

Alexia Jolicoeur-Martineau Sep 19, 2023

Paper / Code

Since AlexNet showed the world the power of deep learning, the field of AI has rapidly switched to focusing almost exclusively on deep learning. Some of the main justifications are that 1) neural networks are Universal Function Approximators (UFA, not UFO 🛸), 2) deep learning generally works the best, and 3) it is highly scalable through SGD and GPUs. However, when you look a bit below the surface, you see that 1) simple methods such as Decision Trees are also UFAs, 2) fancy tree-based methods such as Gradient-Boosted Trees (GBTs) actually work better than deep learning on tabular data, and 3) tabular data tend to be small, but GBTs can optionally be trained with GPUs and iterated over small data chunks for scalability to large datasets. At least for the tabular data case, deep learning is not all you need.

In this collaboration with Kilian Fatras and Tal Kachman at the Samsung SAIT AI Lab, we show that you can combine the magic of diffusion (and its deterministic sibling, conditional-flow-matching (CFM)) with XGBoost, a popular GBT method, to get state-of-the-art tabular data generation and diverse data imputations.

Figure: Comparing Forest-flow (our method) to real data and deep-learning diffusion methods on the Iris dataset

Score-based diffusion models are powerful techniques to generate data; they work by transforming real data into noise through a forward stochastic process and learning to reverse the process from noise to data. Conditional-flow-matching (CFM) methods work similarly but do so in a deterministic fashion (moving deterministically from both data to noise and noise to data).

Left: VP-diffusion, Right: Conditional flow matching
(from https://github.com/atong01/conditional-flow-matching)

For both flow and diffusion models, the objective function is a least-squares loss function conditional on time (t=0 is real data, t=1 is pure noise), which is summed over each variable/feature (since it’s a vector field). We train XGBoost regression models as replacements for neural networks to minimize these losses. We train one model per variable/feature (p) and time t (t = 0, 1/n_t, 2/n_t, …, 1; for a total of n_t=50 different time values). For categorical data, we treat them as dummy variables and round them to the nearest class after generation.

Diffusion and flow-based models usually rely on mini-batch training with deep learning. This means that random Gaussian noise is sampled during training of the same size as the real data, and it is used to calculate the noisy data (moving from real data to noise) at time t. Since XGBoost needs the full data, we cannot rely on mini-batches. In order to associate multiple different noise samples per real data sample, we duplicate the rows of the real data K times (going from size [n,p] to size [nK,p]) and then generate noise data of the same shape. Then, we compute the forward diffusion/flow step for each time t. See the paper for more details on the algorithm.
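The row duplication and per-time training sets can be sketched in plain Python for the flow case. This is a sketch under stated assumptions: a linear path from data (t=0) to noise (t=1), a regression target equal to noise minus data, and an evenly spaced grid of n_t times. The function name is hypothetical, the actual library may differ in all of these details, and fitting the regressors is left out to keep the example dependency-free.

```python
import random

def build_flow_training_sets(X, K, n_t, rng):
    """Duplicate rows K times, pair each with Gaussian noise, and build an
    (input, target) set for each of n_t evenly spaced times in [0, 1]."""
    X_dup = [list(row) for row in X for _ in range(K)]         # size [n*K, p]
    Z = [[rng.gauss(0.0, 1.0) for _ in row] for row in X_dup]  # same shape
    # Assumed regression target: time derivative of the path, noise minus data.
    Y = [[z - x for x, z in zip(xr, zr)] for xr, zr in zip(X_dup, Z)]
    sets = {}
    for i in range(n_t):
        t = i / (n_t - 1)
        # Assumed linear path from data (t=0) to noise (t=1).
        X_t = [[(1 - t) * x + t * z for x, z in zip(xr, zr)]
               for xr, zr in zip(X_dup, Z)]
        sets[t] = (X_t, Y)
    return X_dup, sets

rng = random.Random(0)
X = [[5.1, 3.5], [6.2, 2.9]]  # toy data with n=2 rows and p=2 features
X_dup, sets = build_flow_training_sets(X, K=3, n_t=5, rng=rng)
# One regressor per (time t, feature j) would then be fit on
# inputs sets[t][0] and targets [row[j] for row in sets[t][1]].
```

Since every time level uses the full duplicated dataset, the models for different times and features are independent of each other and can be trained in parallel.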

Many tabular data generation/imputation papers focus on only one or two machine learning metrics. We take a broader approach by building a very thorough and difficult benchmark using 24 datasets and tackling four quadrants of metrics: closeness in distribution, diversity, prediction, and statistical inference.

The main results for generation are shown below (our methods are Forest-VP and Forest-Flow):

Figure: Tabular data generation with complete data (24 datasets, 3 experiments per dataset); mean (standard-error)

As can be seen, our method obtains incredible performance across all metrics. See the paper for more experiments and explanation of the different metrics.

Missing data

Amazingly, XGBoost is naturally able to handle missing values through careful splitting. Thus, our method can be used to generate new samples (with no missing values) while trained on data with missing values! We can also use our method for imputing missing values. See the paper for more details.

Choice of tree-based method

Our method can be used with any type of tree-based method. In practice, we found XGBoost and LightGBM to perform best, but XGBoost is much faster due to its efficient parallelization, so we use it exclusively.

Figure 2: Different choices of tree-based methods when training Forest-Flow on the Iris dataset

Paper with code

Check our paper for more details!

To make it accessible to everyone (not just AI researchers but also statisticians, econometricians, physicists, data scientists, etc.), we made the code available through a Python library (on PyPI) and an R package (on CRAN). See our GitHub for more information. [Note: The R code will be released soon]

http://ajolicoeur.wordpress.com/?p=629

Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation

Alexia Jolicoeur-Martineau May 22, 2022

In this joint work with Vikram Voleti and Christopher Pal, we show that a single diffusion model can solve many video tasks: 1) interpolation, 2) forward/reverse prediction, and 3) unconditional generation through a well-designed masking scheme 🧙‍♂️.

See our website, which contains many videos: https://mask-cond-video-diffusion.github.io. The paper can be found here. The code is available here: https://github.com/voletiv/mcvd-pytorch.

A lot of the existing video models have poor quality (especially on long videos), require enormous amounts of GPUs/TPUs, and can only solve one specific task at a time (only prediction, only generation, or only interpolation). We aimed to improve on all these problems. We do so through a Masked Conditional Video Diffusion (MCVD) approach.

Using score-based diffusion, we get very high quality and diverse results that retain their quality better over time (as shown in Figure 1 below) due to the Gaussian noise injection of such models.

Figure 1: Comparing future prediction methods on Cityscapes: SVG-LP (Top Row), Hier-vRNNs (Second Row), Our Method (Third Row), Ground Truth (Bottom Row). Frame 2, a ground-truth conditioning frame, is shown in the first column, followed by frames 3, 5, 10, and 20 generated by each method, with the ground truth at the bottom.

To allow our models to solve more than a single task, we devise a masking approach. We condition on past and future frames to predict current frames. During training, we independently mask all the past frames or all the future frames. The magic here is that during testing, we can mask past and/or future frames to do interpolation, prediction, or generation! See Figure 2 below for more details.

future/past prediction: when only future/past frames are masked
unconditional generation: when both past and future frames are masked
interpolation: when neither past nor future frames are masked.
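The mapping from masks to tasks listed above can be written down directly. The sketch below is illustrative only: the masking probability p_mask=0.5, the choice to zero out masked frames, and all function names are assumptions rather than details confirmed by this post.

```python
import random

def task_for_masks(mask_past, mask_future):
    """Map a masking choice to the video task it corresponds to."""
    if mask_past and mask_future:
        return "unconditional generation"
    if mask_future:
        return "future prediction"   # only past frames are visible
    if mask_past:
        return "past prediction"     # only future frames are visible
    return "interpolation"           # both past and future are visible

def sample_training_masks(rng, p_mask=0.5):
    """Independently decide whether to mask all past / all future frames."""
    return rng.random() < p_mask, rng.random() < p_mask

def apply_masks(past, future, mask_past, mask_future):
    """Blank out masked conditioning frames (frames are flat lists of floats)."""
    def blank(frames):
        return [[0.0] * len(f) for f in frames]
    return (blank(past) if mask_past else past,
            blank(future) if mask_future else future)

rng = random.Random(0)
mask_past, mask_future = sample_training_masks(rng)
past, future = apply_masks([[0.3, 0.9]], [[0.5, 0.1]], mask_past, mask_future)
```

Because the two masks are sampled independently, a single training run exposes the model to all four conditioning patterns.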

Figure 2: One model can solve many tasks through masking

This means that a single general model can solve multiple tasks! To further be able to generate long sequences, we use autoregressive block-wise predictions to generate more frames than the number of current frames. See Figure 3 below; conditioning on 2 previous frames, we generate 5 current frames and use the last two generated ones to generate another 5 frames.

Figure 3: Autoregressive blockwise prediction
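The block-wise loop from Figure 3 can be sketched as below. The diffusion sampler is replaced by a toy stand-in (frames are integers that simply keep counting), and the function names are hypothetical; only the control flow of conditioning on the last 2 frames to generate the next 5 reflects the description above.

```python
def generate_long_video(start_frames, generate_block, n_blocks, n_cond=2):
    """Block-wise autoregression: each new block is conditioned on the last
    n_cond frames produced so far."""
    video = list(start_frames)
    for _ in range(n_blocks):
        context = video[-n_cond:]
        video.extend(generate_block(context))
    return video

# Toy stand-in for the diffusion sampler: each "generated" frame just
# continues counting from the last context frame.
def toy_block(context, block_size=5):
    last = context[-1]
    return [last + i + 1 for i in range(block_size)]

video = generate_long_video([0, 1], toy_block, n_blocks=2)
```

Here 2 starting frames and 2 blocks of 5 give 12 frames in total; generating a longer video only requires more blocks, not a larger model.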

With MCVD, we obtain state-of-the-art performance on many datasets on tasks such as interpolation and prediction without training specialized models made for these specific tasks! We highlight some of our most exciting results below (for the actual videos, see the website):

Figure 4: Cityscapes prediction, 2 starting frames to predict 28 frames

Figure 5: Video interpolation. Given p past + f future frames, interpolate k frames; smaller p+f and bigger k is harder.

In practice, given enough capacity, we observe that our more general models (trained with past and future frames with masking) actually perform better than our specialized models (trained only for one task)! This means that training Generalists is more beneficial than training Specialists! See Figure 6 below for some examples of this. Note that we are still running some general models, as the paper contains fewer models with future-frame masking; we will keep everyone up-to-date as new results arrive.

Figure 6: SMMNIST and BAIR prediction; general (past/future masking) models do better

Now, this is all great and impressive, but are there limitations to this approach? The main limitation of our work is limited compute. We are from academia, and thus we were limited by Compute Canada infrastructure, which is 1 to 4 GPUs (with extremely long queue times for 4-GPU models and lots of shutdowns at the worst possible times). Given our GPU constraint, our number of parameters was limited, and thus we could not reach the state-of-the-art on unconditional generation for very hard datasets such as UCF-101. See Figure 7 below.

Furthermore, to ensure fair comparisons to existing results on prediction baselines (on SMMNIST, BAIR, Cityscapes), we used the same number of starting frames as conditioning frames (5 for SMMNIST, 1-2 for BAIR, 2 for Cityscapes) as other papers used. Although we obtain high quality even on long-term videos, limiting the number of previous frames in such a way limits our long-term consistency. For example, in SMMNIST, when two digits overlap during 5 frames, a model conditioning on 5 previous frames will have to guess what those numbers were before overlapping, so they may change randomly. See an example below.

Figure 8: Left is real video, Right is fake video, both starting from 5 frames. The 0 and 5 become a 0 and 6 after some time because the 0 and 5 overlap over a small time window; thus, a model conditioned on a small number of frames must guess what those numbers were before the overlap.

Nevertheless, despite our compute limitations, our models have excellent quality/diversity, and they improve a lot through scaling! Thus, we encourage users with access to massive resources to scale our method beyond our 4-GPU limit. The masking approach is extremely powerful; it’s just a matter of significantly increasing the number of conditioning (past and future) frames and the number of channels/layers! STACK MORE LAYERS.

Contrary to some of our competitors, our code and checkpoints will be fully open-source (we are finalizing the code; we will release it by the end of the week)!

Website with lots of video samples

Code

Paper

http://ajolicoeur.wordpress.com/?p=466

Alternative losses for Relativistic GANs

Alexia Jolicoeur-Martineau Oct 1, 2018

Further investigation needs to be done, but I suspect some variants of Relativistic average GANs (RaGANs) might be more sensible than the ones I proposed in my paper. If you are using Relativistic GANs, you might be interested in also trying out variant 3, which is the most promising.

For simplicity, let’s assume we use the non-saturating loss and that we have symmetry, i.e., f1(-y)=f2(y). (This is true in HingeGAN, LSGAN with -1/1 labels, Standard GAN with sigmoid activation).

1) This is the RaGAN formula proposed in the paper.
loss_v1

2) This variant works as well as the original RaGAN. I know this because I used it by mistake before and it made no difference in the results. The generator loss doesn’t make much sense, but as discussed in GANs beyond divergence minimization, the generator can minimize pretty much anything related to the divergence estimated (the loss function of the discriminator) and it will likely still work. GANs don’t actually minimize the divergence.
loss_v2

3) This variant is the most promising, but I did not have the time to test it. It follows the same divergence as the one above since it uses the same loss function for the discriminator. The difference is that now the generator wants every fake sample to be a little better than the average of the real samples, which is more sensible.
loss_v3
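For reference, variant 1 can be sketched in code under the assumptions stated above (non-saturating loss and symmetry f1(-y)=f2(y)). This sketch follows my reading of the RaGAN formula from the paper, instantiated with the standard GAN choice f1(y) = -log(sigmoid(y)); it does not reproduce variants 2 and 3, whose formulas are not spelled out in the text here. The function names are hypothetical.

```python
import math

def softplus(y):
    """Numerically stable log(1 + exp(y))."""
    return max(y, 0.0) + math.log1p(math.exp(-abs(y)))

def f1(y):
    """Standard GAN loss for 'should look real': -log(sigmoid(y))."""
    return softplus(-y)

def mean(xs):
    return sum(xs) / len(xs)

def ragan_losses(c_real, c_fake):
    """RaGAN (variant 1) discriminator and generator losses from critic outputs.

    Uses the symmetry f2(y) = f1(-y), so only f1 is needed.
    """
    mr, mf = mean(c_real), mean(c_fake)
    # Discriminator: real above average fake, fake below average real.
    d_loss = mean([f1(c - mf) for c in c_real]) + mean([f1(mr - c) for c in c_fake])
    # Generator (non-saturating): the same terms with the roles swapped.
    g_loss = mean([f1(c - mr) for c in c_fake]) + mean([f1(mf - c) for c in c_real])
    return d_loss, g_loss

d_loss, g_loss = ragan_losses(c_real=[5.0, 5.0], c_fake=[-5.0, -5.0])
```

When the critic cleanly separates real from fake, as in the toy call above, the discriminator loss is near zero and the generator loss is large; when real and fake outputs are identical, both losses equal 2 log 2.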

http://ajolicoeur.wordpress.com/?p=267
