Summary on Fourier Neural Operator for Parametric Partial Differential Equations by Ferdinand
The paper develops a method for learning the solution operator of parametric PDEs, giving the solution without having to fix the exact parameter values, boundary conditions or discretization in advance. This is achieved by exploiting the fact that a kernel integral operator with a translation-invariant kernel is a convolution, which becomes a pointwise multiplication in Fourier space (combined with linear transformations). The parameters are learnt in Fourier space on a fixed number of modes, meaning the model can be projected up or down to any discretization, including much higher resolutions than those used in training. One caveat is that the accuracy is still significantly lower than that of traditional discretized methods (which the authors chose not to compare FNO with). This may be excused considering the low training time and an inference time roughly 3 orders of magnitude lower than the computation time for the 2D Navier-Stokes equations.
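To make the resolution independence concrete, here is a minimal NumPy sketch of a single-channel 1D spectral convolution. It is not the authors' implementation: the function name `spectral_conv_1d`, the random `weights` and the test signal `f` are assumptions for illustration, and the pointwise linear path and nonlinearity of a full FNO layer are left out.

```python
import numpy as np

def spectral_conv_1d(v, weights):
    """Multiply the lowest Fourier modes of v by complex weights.

    v:       real signal on n equispaced points, with n > 2 * len(weights)
    weights: complex array of shape (modes,), a hypothetical stand-in for
             the learned parameters of one channel
    """
    modes = weights.shape[0]
    # norm="forward" makes the coefficients independent of the grid size n
    v_hat = np.fft.rfft(v, norm="forward")
    out_hat = np.zeros_like(v_hat)
    out_hat[:modes] = weights * v_hat[:modes]   # truncate and reweight
    return np.fft.irfft(out_hat, n=v.shape[0], norm="forward")

def f(x):
    """Made-up band-limited input function on [0, 1)."""
    return np.sin(2 * np.pi * x) + 0.5 * np.cos(2 * np.pi * 3 * x)

rng = np.random.default_rng(0)
weights = rng.normal(size=4) + 1j * rng.normal(size=4)

# The same weights are applied on a 64-point and on a 256-point grid.
coarse = spectral_conv_1d(f(np.arange(64) / 64), weights)
fine = spectral_conv_1d(f(np.arange(256) / 256), weights)
```

Because the weights only touch the lowest four modes, `fine[::4]` agrees with `coarse` up to floating-point error for this band-limited input, which is the sense in which the same parameters can be reused across discretizations.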
Summary on Message Passing Neural PDE Solvers by Nils
The paper aims to learn the solution to various kinds of partial differential equations via an autoregressive model. The authors deviate from the more typical "LLM-style" autoregressive approach in two main ways, which they call the pushforward trick and the temporal bundling trick. The necessity of the pushforward trick is best motivated, as in the paper, by the distribution shift problem. Each time the model makes a prediction for the next timestep(s), the prediction will have a slight error. During training, this is not a big problem, as we have the correct label for the next timestep and can just use that for further predictions. At test time (or when actually using the model), however, we don't have labels, so we need to use the model's previous predictions to make a new one, meaning the test-time input distribution looks very different as errors in the model predictions accumulate. The paper uses an adversarial stability loss, implemented via a shortened backward pass in the neural net, to try to combat this problem (this is what they call the pushforward trick). The temporal bundling trick is conceptually easier. It simply means predicting more than one time step into the future from the current state. This naturally also helps with distribution shift, as the model is simply called less often. For the actual architecture they use a relatively simple GNN with message passing, motivated by the fact that the finite difference method, the finite volume method and WENO can all be seen as specific forms of message passing. This architecture is then evaluated on three different types of equations: Burgers’ equation without diffusion, Burgers’ equation with variable diffusion and a mixed scenario. Here the model's performance is compared to WENO and two different Fourier Neural Operators (FNO). It seems to outperform all of them in both speed and minimum error (except for the FNO on Burgers' equation without diffusion).
Alternate title: Message Passing GNNs are all you need for solving PDEs
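The effect of temporal bundling on the number of model calls can be sketched in a few lines. This is only an illustration of the rollout logic described in the summary: `toy_model` is a hypothetical stand-in for the learned solver with made-up damped dynamics, and no training or pushforward loss is shown.

```python
import numpy as np

def toy_model(u, k):
    """Hypothetical stand-in for the learned solver: returns the next k states.

    Each state is the previous one damped by 0.9 (made-up dynamics).
    """
    factors = 0.9 ** np.arange(1, k + 1)
    return factors[:, None] * u[None, :]        # shape (k, n_points)

def rollout(u0, n_steps, k):
    """Autoregressive rollout with temporal bundling of size k."""
    states, calls, u = [], 0, u0
    while len(states) < n_steps:
        bundle = toy_model(u, k)                # one call predicts k future steps
        calls += 1
        states.extend(bundle)                   # append the k predicted states
        u = bundle[-1]                          # last prediction seeds the next call
    return np.stack(states[:n_steps]), calls

u0 = np.ones(8)
traj_1, calls_1 = rollout(u0, n_steps=100, k=1)     # no bundling
traj_25, calls_25 = rollout(u0, n_steps=100, k=25)  # bundles of 25 steps
```

With a bundle size of 25, the 100-step trajectory needs 4 model calls instead of 100, so the model is fed its own (imperfect) output far less often.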
Summary on GraphCast: Learning skillful medium-range global weather forecasting by Nils
GraphCast rivals more traditional numerical medium-range weather forecasting methods using a GNN approach. It takes the world's weather data on a 0.25 degree grid across the Earth as its input. It then autoregressively learns to predict the weather in 6-hour time intervals via a message passing GNN architecture. The most distinctive part of this architecture is that the grid points are mapped to a so-called multi-mesh via the encoder and mapped back from the multi-mesh for the actual predictions. The multi-mesh can be thought of as an icosahedron for which each face is repeatedly split into smaller triangles (in this case 6 times), resulting in a graph with over 40,000 nodes. However, this graph maintains all the edges from the intermediate splits, i.e. all the edges of the original icosahedron, the first split and so on until the sixth and final split. This allows for more complex and longer-distance message passing. The model is trained on weather data up until 2017 and then evaluated on newer data, i.e. 2018 and onwards, against ECMWF’s High RESolution forecast (HRES). GraphCast mostly outperforms HRES, with the exception of low-pressure areas in the stratosphere. While not specifically trained for extreme weather events, the paper also shows that GraphCast can help predict things like tropical cyclones and extreme heat and cold decently well.
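The size of the multi-mesh follows from simple counting, sketched below. The function names are hypothetical, edges are counted as undirected, and the sketch assumes that each refinement splits every triangular face into four, so that the edges of different refinement levels never coincide.

```python
def icosahedron_counts(level):
    """Nodes, edges and faces of an icosahedron after `level` refinements.

    Each refinement splits every triangular face into four smaller ones.
    """
    faces = 20 * 4 ** level
    edges = 30 * 4 ** level
    nodes = edges - faces + 2          # Euler's formula: V - E + F = 2
    return nodes, edges, faces

def multi_mesh_counts(max_level):
    """Multi-mesh: nodes of the finest level plus the edges of every level."""
    nodes, _, _ = icosahedron_counts(max_level)
    edges = sum(icosahedron_counts(level)[1] for level in range(max_level + 1))
    return nodes, edges

nodes, edges = multi_mesh_counts(6)
print(nodes, edges)   # 40962 163830
```

Six refinements give 40,962 nodes, matching the "over 40,000 nodes" above, while keeping the coarse edges adds 40,950 long-range undirected edges on top of the 122,880 edges of the finest mesh.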
Summary on Improved protein structure prediction using potentials from deep learning by Ferdinand
A novel method to predict protein structures is presented, which, contrary to most previous approaches, never relies on templates of homologous proteins, making it more useful for predicting unknown proteins. The prediction of a specific protein structure is performed by gradient descent on a potential built from pairwise distances of protein residues, whilst a CNN is trained to predict these distances in the form of a histogram of pairwise distances. The training data is based on the Protein Data Bank, with HHblits used to construct/extract information about the relation between different protein sequences and structural constraints, and the network is trained on this constructed data. The CNN consists of 220 residual blocks that cycle through dilation rates for the convolutions to propagate information quickly, where each block contains batch norm, projection, convolution, and nonlinear layers. Using the obtained distances, a potential of a protein is constructed and fed into a differentiable geometry model that finds a geometry minimizing the potential using gradient descent. As the method requires many steps of pre-processing the training data and post-processing the model output, even performing gradient descent on a function of the model output to obtain the desired structure, this is very much a hybrid model. Whilst very promising, it is also prone to weaknesses in the model assumptions made in the pre- and post-processing steps.
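The idea of recovering a geometry by gradient descent on a distance-based potential can be sketched with a toy example. Everything here is an assumption for illustration: the quadratic potential, the direct optimization of Cartesian coordinates, the random "native" structure and the step size are stand-ins, not the potential or geometry model used in the paper.

```python
import numpy as np

def potential_and_grad(x, target):
    """Potential sum_{i<j} (d_ij - target_ij)^2 and its gradient w.r.t. x."""
    diff = x[:, None, :] - x[None, :, :]            # diff[i, j] = x_i - x_j
    d = np.sqrt((diff ** 2).sum(-1) + 1e-12)        # small epsilon avoids 0/0
    err = d - target
    np.fill_diagonal(err, 0.0)                      # ignore self-distances
    potential = 0.5 * (err ** 2).sum()              # every pair appears twice
    grad = 2.0 * ((err / d)[:, :, None] * diff).sum(axis=1)
    return potential, grad

rng = np.random.default_rng(0)
true_x = rng.normal(size=(10, 3))                   # made-up "native" structure
# Stand-in for the predicted distances: the exact distances of true_x.
target = np.linalg.norm(true_x[:, None] - true_x[None], axis=-1)

x = rng.normal(size=(10, 3))                        # random initial geometry
initial_potential, _ = potential_and_grad(x, target)
for _ in range(2000):
    _, grad = potential_and_grad(x, target)
    x -= 0.01 * grad                                # plain gradient descent step
final_potential, _ = potential_and_grad(x, target)
```

Since only distances enter the potential, the recovered geometry is at best determined up to rotation, translation and mirroring, which is one reason additional terms are needed in practice.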
Summary on Highly accurate protein structure prediction with AlphaFold by Sarthak
Let's start the journey. We assume modelling long-range dependencies over MSAs with vanilla transformers would suffice, since that would be the correct inductive bias for reasoning over a protein sequence. It doesn't, so they implement row and column attention (to capture the mutational information). Since that'd be computationally too expensive, they implement cropping during training, which we assume would defeat the idea of capturing long-range dependencies. It doesn't. Now we assume that by doing this 48 times we could predict the protein by manipulating the 2D distogram. We can't: we have to do an end-to-end prediction, and transformers, it turns out, are not the right inductive bias for graphs, since they don't represent the edges. Directly going from 1D to 3D doesn't work.
Since they need to represent edges, they add a bias term to the row attention for the MSA, which encodes the pairwise relationships (edges); they search structure data for this. We may think doing so once suffices. It doesn't: it needs to be repeated 48 times, in both directions, so that better edge representations can also be learned in parallel. We may want to fold right after, but this is not enough, since these 2D maps are mathematically inconsistent and don't obey the triangle inequality. We may think that we have to hardcode that, but it would hurt the end-to-end differentiability, so they do it *informationally* and get the violations down to an acceptable minimum.
Now we may think that picking the top row and the 2D pairwise representation would suffice for building the 3D representation, since we now already have the positions and orientations of the residues. Not true: they do away with the peptide bonds and only get a bag of triangles, in order to capture long-range dependencies and de-emphasise local peptide bonds, which is analogous to what transformers do for words. We might think this would be enough. Not quite: the feature map for triangles would have to be SE(3)-equivariant, which leads them to add Invariant Point Attention (IPA) to the original attention, making it geometrically aware. The (final) backbone is then obtained only by refining the orientations through eight blocks in the structure module, each with its own penalty and subsequent updates.
This suffices for the backbone, but what about the side chains?
They take the first row of the MSA to predict torsion angles, which, combined with the backbone through local transformations, gets them the side chains. We now have a final representation and a ground truth with a loss function. We think a simple loss function would of course suffice. Not true: they include FAPE loss (a chirality-aware loss), auxiliary loss, distribution loss, MSA loss and confidence loss. We assume doing so once would suffice. It doesn't: they recycle it three times.
Alternate title: Unfolding AlphaFold2
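The remark about 2D maps violating the triangle inequality can be made concrete with a small consistency check. This is a hypothetical diagnostic written for illustration (the function name and tolerance are assumptions), not something the model itself computes.

```python
import numpy as np

def triangle_violations(d, tol=1e-9):
    """Count ordered triples (i, j, k) with d[i, k] > d[i, j] + d[j, k] + tol."""
    via = d[:, :, None] + d[None, :, :]     # via[i, j, k] = d[i, j] + d[j, k]
    direct = d[:, None, :]                  # direct[i, _, k] = d[i, k]
    return int((direct > via + tol).sum())

rng = np.random.default_rng(0)
points = rng.normal(size=(5, 3))            # made-up residue positions
consistent = np.linalg.norm(points[:, None] - points[None], axis=-1)

# Corrupt a single pairwise distance so that no 3D structure can realise it.
inconsistent = consistent.copy()
inconsistent[0, 1] = inconsistent[1, 0] = 100.0

n_ok = triangle_violations(consistent)
n_bad = triangle_violations(inconsistent)
```

A distance map computed from real coordinates has no violations, while a map predicted entry by entry can contain them, which is why the pairwise representation has to be made consistent before folding.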
Summary on PINNs by Niklas
In one sentence, the main application of a physics-informed neural network (PINN) is to learn the solution of a PDE by creating custom loss and activation functions that reflect physical constraints on the solution. That way, multiple loss terms such as initial or boundary conditions, or even special properties like a divergence of zero, are all calculated separately and added up into a final loss term. The most important loss term, which always needs to be present, is the one measuring how well the neural network's approximation satisfies the PDE. In the paper two cases are discussed. The data-driven solution approach tries to predict the hidden state of the system given fixed parameters of the PDE, while the data-driven discovery approach tries to discover the underlying PDE and its parameters that best match certain data. For both approaches, continuous and discrete cases are discussed as well. Generally the authors only tested their methods with simple MLPs, but the ideas can be extended to more advanced architectures. Since virtual points can easily be created within the boundaries, the methods generally do not need large amounts of real data during training, which helps especially when only sparse measurements are available. Another main advantage over previous work is that PINNs are not limited to linear PDEs and do not have to linearize nonlinear ones first, which often leads to far better results when dealing with nonlinear PDEs. Additionally, no prior assumptions are necessary, and the paper showed that PINNs are somewhat robust to noise. However, one thing that users need to be very careful about is the weighting of the added loss terms, which can require extensive testing or hyperparameter tuning. Sometimes the inference time can be quite slow, and it might be better to use a numerical solver.
All in all, the paper introduced a simple and very effective way to enforce physical properties on the approximations of a neural network, which forces the network to not hallucinate as much and to come up with realistic solutions. Thereby the authors made a significant contribution to the field of deep learning for the natural sciences.
Alternate title: How to make your neural network start caring about physics and stop hallucinating.
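The structure of the summed loss can be sketched on a toy problem. The ODE u'(x) = -u(x) with u(0) = 1, the two-parameter model u(x) = a * exp(b * x), the hand-written derivative (instead of automatic differentiation through a network) and the weights `w_pde` and `w_bc` are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

def pinn_loss(params, x_colloc, w_pde=1.0, w_bc=1.0):
    """Weighted sum of an equation-residual loss and a boundary-condition loss.

    Toy problem (assumed for illustration): u'(x) = -u(x), u(0) = 1.
    Toy model: u(x) = a * exp(b * x), differentiated by hand.
    """
    a, b = params
    u = a * np.exp(b * x_colloc)
    du_dx = a * b * np.exp(b * x_colloc)
    loss_pde = np.mean((du_dx + u) ** 2)    # residual of u' + u = 0
    loss_bc = (a - 1.0) ** 2                # u(0) = a should equal 1
    return w_pde * loss_pde + w_bc * loss_bc, loss_pde, loss_bc

# "Virtual" collocation points: no measured data is needed to evaluate the residual.
rng = np.random.default_rng(0)
x_colloc = rng.uniform(0.0, 1.0, size=50)

exact = pinn_loss((1.0, -1.0), x_colloc)            # true solution exp(-x)
wrong_equation = pinn_loss((1.0, 0.0), x_colloc)    # satisfies u(0) = 1 only
wrong_boundary = pinn_loss((2.0, -1.0), x_colloc, w_bc=10.0)  # satisfies the ODE only
```

The last call shows why the weighting matters: a candidate that satisfies the equation but misses the boundary condition is penalized purely through `w_bc`, so changing that weight changes which errors the optimizer prioritizes.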
Summary on A Self-Attention Ansatz for Ab-initio Quantum Chemistry by Leon
The authors introduce the Psiformer, which is designed to solve the many-electron Schrödinger equation using self-attention.
In quantum chemistry, the Schrödinger equation defines the physical behaviour, with the Hamiltonian describing the details of the system. Variational Monte Carlo (VMC) methods optimize a parametric wave function to find the ground state solution. The wave function is built from Slater determinants so that it satisfies physical constraints such as antisymmetry under the exchange of electrons. To deal with the infinite potential energies that arise when particles overlap (the Kato cusp conditions), a Jastrow factor can be introduced.
The main advancements over existing neural networks like FermiNet, PauliNet and SchNet lie in the use of self-attention with a sequence of multi-head self-attention layers. The motivation behind it is that the electron-electron dependence in the Hamiltonian introduces a complex dependence in the wavefunction, which can be captured by self-attention.
In contrast to FermiNet, the Psiformer uses only electron-nuclear features to compute the self-attention and follows each attention layer with an MLP, as opposed to a concatenation with electron-electron features after an MLP. In the Psiformer, the electron-electron features are instead used to compute a Jastrow factor, which enforces the Kato cusp conditions.
The Psiformer was evaluated against FermiNet with SchNet convolutions for small and large molecules. It outperformed them for small molecules, despite having fewer parameters. The differences become more pronounced for larger molecules, where the Psiformer not only outperforms FermiNet by a significant margin but also performs better than the best conventional methods (DMC energy).
One limitation of the paper is that it does not show the inference times for the Psiformer. In particular, since multi-head self-attention layers scale quadratically with the number of inputs, a comparison across the different molecule sizes would have been insightful. Additionally, an ablation against the scaling of conventional non-DDP methods and previous methods like FermiNet would have been interesting.
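The VMC principle mentioned above (sample from the squared wave function, average the local energy, pick the parameters with the lowest energy) can be sketched on the simplest possible system. The hydrogen atom, the one-parameter trial function psi(r) = exp(-alpha * r) and all sampler settings are assumptions for illustration and have nothing to do with the Psiformer architecture itself.

```python
import numpy as np

def local_energy(r, alpha):
    """Local energy of psi(r) = exp(-alpha * r) for hydrogen (atomic units)."""
    return -0.5 * alpha ** 2 + (alpha - 1.0) / r

def vmc_energy(alpha, n_walkers=2000, n_steps=500, burn_in=200, step=1.0, seed=0):
    """Estimate the energy by Metropolis sampling of |psi|^2."""
    rng = np.random.default_rng(seed)
    pos = rng.normal(size=(n_walkers, 3))           # initial electron positions
    energies = []
    for it in range(n_steps):
        proposal = pos + step * rng.normal(size=pos.shape)
        r_old = np.linalg.norm(pos, axis=1)
        r_new = np.linalg.norm(proposal, axis=1)
        # Metropolis ratio |psi(new)|^2 / |psi(old)|^2
        accept = rng.random(n_walkers) < np.exp(-2.0 * alpha * (r_new - r_old))
        pos[accept] = proposal[accept]
        if it >= burn_in:                           # discard equilibration steps
            r = np.linalg.norm(pos, axis=1)
            energies.append(local_energy(r, alpha).mean())
    return float(np.mean(energies))

e_exact = vmc_energy(alpha=1.0)   # exact ground state, local energy is constant
e_trial = vmc_energy(alpha=0.5)   # poorer trial function, analytic value -0.375
```

For alpha = 1 the trial function is the exact ground state, so every sample has local energy -0.5 Hartree, while any other alpha gives a higher average energy. Neural wave functions such as FermiNet or the Psiformer replace the single parameter alpha by the weights of a network.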
Summary on Ab-Initio Potential Energy Surfaces by Pairing GNNs with Neural Wave Functions by Ferdinand
A novel approach for predicting molecular properties is presented, which allows for quantum chemistry predictions of molecules without separate training passes for each geometry on a potential energy surface. This is achieved by leveraging FermiNet, a previous model used to model wave functions for single geometries, but supplying inputs from a GNN rather than hand-crafting the input based on a specific molecular geometry. With the wave function model trained on a wide range of possible input geometries, the GNN learns specific combinations of inputs that lead to wave functions similar to the ones belonging to the desired molecular geometry. To achieve this, all training is done on distances only, whilst directionality is preserved by transforming geometries and coordinates into an equivariant coordinate system based on PCA. The resulting model required shorter training times than previous NN-based approaches such as PauliNet and FermiNet, while also being less susceptible to producing false results (e.g. from wrongly assumed spherical symmetry) and achieving higher accuracy. However, it still requires training for each different molecule, and e.g. cyclobutadiene required almost 20x the time of a simple hydrogen chain due to having more complex energy surfaces.
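The PCA-based coordinate system can be illustrated with a short sketch. This is a simplified assumption of how such a frame might be built (the function name and the random nuclei are made up), and it leaves the sign of each principal axis undetermined, an ambiguity that a full method has to resolve.

```python
import numpy as np

def pca_frame_coords(nuclei):
    """Express nuclear coordinates in the frame of their principal axes."""
    centered = nuclei - nuclei.mean(axis=0)
    cov = centered.T @ centered
    _, axes = np.linalg.eigh(cov)           # columns are the principal axes
    return centered @ axes

rng = np.random.default_rng(0)
nuclei = rng.normal(size=(6, 3))            # made-up molecular geometry

# Apply a random orthogonal transform and a translation to the same molecule.
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
moved = nuclei @ q.T + np.array([1.0, -2.0, 0.5])

coords_a = pca_frame_coords(nuclei)
coords_b = pca_frame_coords(moved)
```

Up to the sign of each axis, `coords_a` and `coords_b` coincide, so features computed in this frame do not depend on how the molecule happens to be oriented in space.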