Stage II: Deep Generative Model
The molecules recommended by the classical recommendation methods in Stage I are limited to those that already appear in the MolRec data. In Stage II, we therefore use a Variational Autoencoder (VAE) to generate novel molecules that do not appear in the MolRec data. To align the generated molecules with researcher interests, we feed the molecules recommended in Stage I into the VAE.
The VAE architecture we use is JTVAE (Junction Tree VAE), a graph neural network VAE that takes a two-stage approach to molecule generation. In the first stage, a coarse view of the molecule is generated by assembling a tree representation in which each node is a chemical substructure and an edge indicates that at least one atom in one substructure shares a bond with an atom in the other. These substructures are drawn from a vocabulary of valid chemical fragments, which ensures that the overall generated molecule is chemically valid. In the second stage, the finer details of the molecule are resolved by generating the bonds between atoms in neighboring substructures. The figure below, taken from the JTVAE paper, illustrates this framework.
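To make the first stage more concrete, here is a minimal sketch of a coarse junction-tree-style decomposition using RDKit. This is a simplification for illustration only: the actual JTVAE algorithm merges overlapping rings and handles shared atoms more carefully.

```python
from rdkit import Chem

def coarse_clusters(smiles):
    """Simplified junction-tree-style decomposition of a molecule."""
    mol = Chem.MolFromSmiles(smiles)
    # Clusters: each ring is one node, and each bond outside a ring is one node.
    clusters = [set(ring) for ring in Chem.GetSymmSSSR(mol)]
    for bond in mol.GetBonds():
        if not bond.IsInRing():
            clusters.append({bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()})
    # Tree edges: two clusters are neighbors if they share at least one atom,
    # i.e. some atom in one cluster bonds into the other.
    edges = [(i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))
             if clusters[i] & clusters[j]]
    return clusters, edges

# Example on a small test molecule (2-phenylethanol).
clusters, edges = coarse_clusters("c1ccccc1CCO")
print(len(clusters), "clusters;", len(edges), "tree edges")
```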
Since generation proceeds in two stages, each molecule has two latent vectors: one for the tree structure and one for the bonds. Given these two vectors for each reference molecule, we obtain each latent vector for the new molecule by averaging the corresponding vectors over the reference molecules. We then generate the recommended molecule by passing the averaged latent vectors to the VAE's decoder.
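The sketch below shows this averaging step. The method names encode_to_latents and decode_from_latents are hypothetical stand-ins for the corresponding calls in whichever JTVAE implementation is used; the structure of the computation is the point.

```python
import torch

def recommend(model, reference_smiles):
    """Average the reference molecules' latents and decode a new molecule."""
    tree_vecs, mol_vecs = [], []
    for smiles in reference_smiles:
        # Each molecule yields two latents: one for the junction tree,
        # one for the atom-level graph (bonds).
        z_tree, z_mol = model.encode_to_latents(smiles)  # hypothetical API
        tree_vecs.append(z_tree)
        mol_vecs.append(z_mol)
    # Average each latent separately across the reference molecules.
    z_tree_avg = torch.stack(tree_vecs).mean(dim=0)
    z_mol_avg = torch.stack(mol_vecs).mean(dim=0)
    # Decode the averaged latents into a novel molecule (a SMILES string).
    return model.decode_from_latents(z_tree_avg, z_mol_avg)  # hypothetical API
```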
We trained JTVAE for 20 epochs on ZINC250K, a dataset of 250k molecules. Below, various training metrics are plotted on the y-axis against epoch on the x-axis: beta is the weight of the KL regularizer that drives latent vectors toward an approximately normal distribution, KL is the Kullback-Leibler divergence, and assm/topo/word are reconstruction accuracies. The green curve shows a model trained with the KL weight fixed at 0, making it a plain autoencoder, while the black curve anneals the weight up to 0.01. As can be seen, increasing the strength of the regularizer decreases the divergence as the latents tend toward Gaussianity, but it reduces reconstruction accuracy. We use the model shown in black for recommendation.
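For reference, here is a minimal sketch of the KL-annealed objective. The linear warm-up schedule is an assumption for illustration; setting beta to 0 throughout recovers the plain autoencoder shown in green.

```python
def beta_schedule(epoch, warmup_epochs=10, beta_max=0.01):
    # Linearly anneal the KL weight from 0 up to beta_max (assumed schedule).
    return min(beta_max, beta_max * epoch / warmup_epochs)

def vae_loss(recon_loss, kl_div, epoch):
    # A larger beta pushes the latents toward the Gaussian prior (lower KL)
    # at the cost of reconstruction accuracy, as seen in the curves above.
    return recon_loss + beta_schedule(epoch) * kl_div
```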
We next show several reconstructions from the test set. Further reconstructions are available on GitHub and in the reconstruction gallery below.
Finally, we show recommendations conditioned on two reference molecules. To obtain a recommendation, we sample latent vectors for each reference molecule from the encoder, set the latents for the recommended molecule to the average of these vectors, and generate the recommended molecule by passing the averaged latents through the decoder. Further recommendations are available on GitHub and in the recommendation gallery below.
We open-source the weights of our trained JTVAE models on GitHub. Please see mol_rec.ipynb for a demonstration of how to load and use the trained models.
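Loading the weights typically looks like the sketch below. The import path, hyperparameters, and file names here are assumptions based on common JTVAE codebases; mol_rec.ipynb contains the exact invocation for our checkpoints.

```python
import torch
from jtnn import Vocab, JTNNVAE  # module path depends on the JTVAE repo used

# Vocabulary of chemical substructures used during training (assumed file name).
vocab = Vocab([line.strip() for line in open("vocab.txt")])

# Hyperparameters must match those used at training time (values here are
# illustrative defaults, not necessarily ours).
model = JTNNVAE(vocab, 450, 56, 20, 3)
model.load_state_dict(torch.load("jtvae_weights.pt", map_location="cpu"))
model.eval()
```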