Protein Design Games (Winter 2024)

Team #1
Daniella Pretorius, Felipe Engelberger, Ramith Hettiarachchi, Yehlin Cho, Yo Akiyama
dp5117@ic.ac.uk, felipeengelberger@gmail.com, ramithhettiarachchi@fas.harvard.edu, yehlin@mit.edu, yo_aki@mit.edu

Design Strategy Overview

Our team aimed to enhance the cleavage activity of the Tobacco Etch Virus (TEV) protease by incorporating mutations that improve thermal stability and solubility. We employed a dual approach, using both the ThermoMPNN (Dieckhaus et al.) and PROSS (Goldenzweig et al.) workflows to identify potentially optimizing mutations. A recent study used ProteinMPNN (Dauparas et al.) to increase TEV expression, stability, and function (Sumida et al.). As our starting point, we chose the designed sequence from that study with the highest activity (kcat/Km), which the authors named HyperTEV60. All designs include the S219V mutation for reduced self-cleavage and an N-terminal 6xHis tag for purification. In addition to optimizing HyperTEV60, we include a de-novo TEV protease sequence and structure designed with RFdiffusion and ProteinMPNN.

Design Methods

ThermoMPNN optimization pipeline

To further optimize the HyperTEV60 sequence, we employed ThermoMPNN. ThermoMPNN extracts features from the pre-trained ProteinMPNN model and uses them to train a stability prediction module. The model is trained on a mega-scale stability dataset (Tsuboyama et al.) to predict the change in thermodynamic stability (ΔΔG) caused by point mutations, given the structure and sequence. We reasoned that this additional fine-tuning would propose mutations in a manner semi-orthogonal to ProteinMPNN. We optimize the TEV sequence using an iterative workflow with ThermoMPNN (Fig. 1). The workflow takes a starting sequence and structure as input. At each iteration, ThermoMPNN returns the predicted ΔΔG for all possible point mutations (i.e., in silico site-saturation mutagenesis) (Fig. 2). Our workflow selects the single mutation with the most negative ΔΔG and updates the sequence (Fig. 3).
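
The greedy loop described above can be sketched as follows. This is a minimal illustration, not our actual pipeline code: `predict_ddg` is a hypothetical stand-in for a ThermoMPNN call, `toy_predictor` is an invented scoring function used only to make the example runnable, and the early stop when no mutation is predicted to be stabilizing is an assumption.

```python
from typing import Callable, Dict, Set, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical predictor signature: sequence -> {(position, new_residue): predicted ddG}.
# In the real workflow this role is played by ThermoMPNN; here it is only a stand-in.
DdgPredictor = Callable[[str], Dict[Tuple[int, str], float]]


def greedy_stabilize(seq: str, predict_ddg: DdgPredictor,
                     fixed: Set[int], n_iter: int):
    """Apply the single most negative predicted ddG mutation at each iteration."""
    history = []
    for _ in range(n_iter):
        scores = predict_ddg(seq)
        # Keep only designable positions and genuine substitutions.
        candidates = {
            (pos, aa): ddg for (pos, aa), ddg in scores.items()
            if pos not in fixed and seq[pos] != aa
        }
        if not candidates:
            break
        (pos, aa), ddg = min(candidates.items(), key=lambda kv: kv[1])
        if ddg >= 0:  # assumption: stop when nothing is predicted to stabilize
            break
        history.append((pos, seq[pos], aa, ddg))
        seq = seq[:pos] + aa + seq[pos + 1:]
    return seq, history


def toy_predictor(seq: str):
    """Invented scores: mutating to 'W' is favorable, everything else is not."""
    return {(i, aa): (-1.0 - 0.1 * i if aa == "W" else 0.5)
            for i in range(len(seq)) for aa in AMINO_ACIDS}


# Two greedy iterations on a short toy sequence with the last position fixed.
final_seq, steps = greedy_stabilize("GESLFK", toy_predictor, fixed={5}, n_iter=2)
```

In practice each proposed mutation would also be checked against the other validation metrics before being accepted, which this sketch omits.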

The design process starts with the HyperTEV60 sequence and its AlphaFold2-predicted structure (Jumper et al.) as input to ThermoMPNN. At each iteration, we analyze the other metrics described under Validation methods and decide which mutations to include. The mutated sequence is then used as input to ThermoMPNN in the following iteration. Like Sumida et al., we preserve the amino acid identities of the top 50% most evolutionarily conserved residues and of residues within 7 Å of the active site.
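
One way to assemble such a set of fixed positions is sketched below. The function name, the use of Cα coordinates for the distance cutoff, and the per-residue conservation scores are all assumptions made for illustration; the text does not specify which atoms define the 7 Å shell.

```python
import numpy as np


def fixed_positions(ca_coords, active_site_idx, conservation,
                    radius=7.0, top_fraction=0.5):
    """Return indices that are near the active site or highly conserved.

    ca_coords:       (N, 3) array of C-alpha coordinates (assumed atom choice).
    active_site_idx: list of residue indices that define the active site.
    conservation:    length-N array, higher means more conserved (assumed scale).
    """
    ca_coords = np.asarray(ca_coords, dtype=float)
    conservation = np.asarray(conservation, dtype=float)

    # Distance from every residue to its nearest active-site residue.
    site = ca_coords[active_site_idx]
    dists = np.linalg.norm(ca_coords[:, None, :] - site[None, :, :], axis=-1)
    near_site = set(np.flatnonzero(dists.min(axis=1) <= radius).tolist())

    # Top fraction of residues ranked by conservation score.
    n_top = int(np.ceil(top_fraction * len(conservation)))
    conserved = set(np.argsort(-conservation)[:n_top].tolist())

    return near_site | conserved
```

All remaining positions are then available to the design step.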


Figure 1: ThermoMPNN-based design pipeline

Figure 2: Predicted thermodynamic stability for the top mutation at every iteration

Figure 3. Structure representation of ThermoHyperTEV60. In blue, we highlight the active site residues fixed during the design pipeline; the catalytic triad is shown as spheres. In yellow are the 50% most conserved residues defined in Sumida et al. 2023. In green are the mutations introduced into HyperTEV60 by Sumida et al. 2023, which we kept fixed. In red, we highlight the positions that were not fixed in HyperTEV60 but were left unchanged by ProteinMPNN. In purple, we highlight the eight mutations suggested by our pipeline #X, giving rise to ThermoHyperTEV60. Finally, we superimpose the substrate analog from PDB 1LVM, shown in pink ball-and-stick representation.

PROSS optimization pipeline

We used the PROSS (Protein Repair One Stop Shop) method (Goldenzweig et al., Weinstein et al.), which has previously been shown to be effective at generating variants with a large net stabilizing effect and increased solubility without negatively affecting function. By using the workflow defined below, we use evolutionary information and computational modeling to guide the design of potentially more stable and soluble HyperTEV60 variants (each number corresponds to the workflow segment in Fig. 4):

1. Use Natural Sequence Diversity

2. Apply User Determined Constraints

3. Rosetta Mutational Scanning

4. Rosetta Combinatorial Sequence Design

Figure 4: PROSS-based design pipeline

De-novo design using RFdiffusion and ProteinMPNN

For our final design, we used RFdiffusion to construct a de-novo scaffold for the TEV protease catalytic domain (Watson et al.). Here, we fixed the positions and amino acid identities of any residues within 7 Å of the substrate in a ligand-bound complex, and performed partial diffusion (25 noising steps) on all other residues (Figure 5). Partial rather than full diffusion allows us to use HyperTEV60 as a prior for the de-novo design.

We then used ProteinMPNN to design sequences for the generated structure (Dauparas et al.). We designed eight sequences using a sampling temperature of 0.1 and excluding cysteines. Importantly, the amino acids near the active site remain unchanged.
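
To illustrate what a low sampling temperature and an excluded residue type mean for a single position, the sketch below applies temperature-scaled softmax sampling to a vector of per-residue logits. It shows the general idea only and is not ProteinMPNN's internal code; `sample_residue` and the logits passed to it are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sample_residue(logits, temperature=0.1, excluded="C", rng=None):
    """Sample one residue from temperature-scaled logits, never picking `excluded`."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float).copy()

    # Excluded residue types get zero probability.
    for aa in excluded:
        logits[AMINO_ACIDS.index(aa)] = -np.inf

    # Lower temperature sharpens the distribution toward the top-scoring residue.
    scaled = logits / temperature
    scaled -= scaled.max()  # numerical stability before exponentiating
    probs = np.exp(scaled)
    probs /= probs.sum()

    choice = rng.choice(len(AMINO_ACIDS), p=probs)
    return AMINO_ACIDS[choice], probs
```

At a temperature of 0.1 the sampled residue is almost always the highest-scoring allowed one, so the eight designed sequences are expected to differ only at positions where several residues score similarly.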

Using ESMFold, we measure self-consistency with the RFdiffusion-generated backbone for all 8 ProteinMPNN sequences and select the sequence that results in high pLDDT and low RMSD to the HyperTEV60 catalytic domain.

Figure 5. De-novo design superimposed against HyperTEV60. In grey, we highlight the active site residues fixed during the design pipeline. Pink represents the ESMFold-predicted structure of HyperTEV60 from Sumida et al. 2023, while green represents the predicted structure of our de-novo design. For the de-novo design, the ESMFold-predicted pLDDT is 83.2, the RMSD to HyperTEV60 is 5.3 Å, and the catalytic domain RMSD is 2.5 Å.

Figure 6: De-novo design pipeline. Starting from HyperTEV60, we fixed the active site both in the structure generation step (partial diffusion) and in the sequence design step (ProteinMPNN).

Validation methods

ESM-2 & ESMFold

To validate our designed sequences, we used language-model pseudo-perplexity scoring and designability tests as proxies for their stability and their fidelity to the original structure.

ESM-2 is a protein language model that provides insight into the effects of mutations on protein fold and function (Lin et al.). ESM-2 pseudo-perplexity, the exponential of the negative pseudo-log-likelihood of a sequence averaged over its length, estimates how well the model predicts each residue when that residue is masked, and enables the prediction of mutation effects. Here, we compared the pseudo-perplexity of the designed sequences with that of the wild-type sequence; a lower pseudo-perplexity indicates that ESM-2 finds the sequence more favorable.
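
As a small worked example of the metric, the sketch below computes pseudo-perplexity from per-position masked log-probabilities. It assumes those log-probabilities have already been obtained from a masked language model such as ESM-2 (one forward pass per masked position); the function name and input format are illustrative.

```python
import math


def pseudo_perplexity(masked_log_probs):
    """Exponential of the negative mean masked log-probability.

    masked_log_probs[i] is log p(true residue at i | sequence with position i masked),
    using natural logarithms.
    """
    if not masked_log_probs:
        raise ValueError("need at least one position")
    mean_nll = -sum(masked_log_probs) / len(masked_log_probs)
    return math.exp(mean_nll)
```

A model that guesses uniformly among the 20 amino acids would give a pseudo-perplexity of 20, while a model that predicts every masked residue with certainty would give 1.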

In the designability test, we predicted the structures of the designed sequences with ESMFold, a fully end-to-end single-sequence structure predictor built by training a folding head on top of ESM-2, and compared them with the original structure. We measured pLDDT (predicted Local Distance Difference Test), total RMSD (root-mean-square deviation), and active-site RMSD for each sequence. Our aim is a high pLDDT together with low total and active-site RMSD, indicating similarity to the wild-type structure (Fig. 7).
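
The RMSD values can be computed after optimal superposition, for example with the Kabsch algorithm. The sketch below assumes matched Cα coordinate arrays of equal length for the two structures. Computing the active-site RMSD by superposing only the active-site residues is one possible convention and an assumption here, as are the function names.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal rigid superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centering both point sets.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)

    # Optimal rotation from the SVD of the covariance matrix.
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against improper rotations
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    P_rot = P @ R.T
    return float(np.sqrt(((P_rot - Q) ** 2).sum(axis=1).mean()))


def subset_rmsd(P, Q, idx):
    """RMSD over a subset of residues (e.g. the active site), superposed on that subset."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    return kabsch_rmsd(P[idx], Q[idx])
```

Calling `kabsch_rmsd` on all residues gives the total RMSD, and `subset_rmsd` with the active-site indices gives the active-site RMSD.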

Figure 7: ESM-2 and ESMFold-based validation

Rosetta

We employed the Rosetta (Leman et al. 2020) Energy Breakdown (EB) analysis via the ENDURE pipeline (Engelberger et al. 2023) as an orthogonal validation method for our primary design strategies. This approach allowed us to dissect the energetic contributions of mutations in our designed protein variants. We started by predicting the structures of our designed sequences with AlphaFold2. We then ran the PDB files of our designs and the reference TEV protease structure through the ENDURE pipeline, which systematically prepares the proteins for analysis and executes the EB calculations. This process involved structure cleaning, relaxation to a low-energy state using RosettaScripts, and mutation identification. The Rosetta EB analysis then provided a detailed view of the energetic landscape of our designs by breaking down the pairwise contributions of each residue. This granularity enabled us to scrutinize sidechain and backbone interactions and to identify how specific mutations may influence the energetic stability of the TEV protease variants.

We used the energy breakdown protocol's ability to categorize changes in pairwise interactions to understand how mutations affected interactions within the protein, including indirect effects on non-mutated residues. This analysis was pivotal in identifying mutations that either supported or detracted from our design goals. Additionally, the built-in residue depth analysis offered insight into the spatial distribution of energetic changes, helping to distinguish surface mutations from buried ones. Together, these analyses guided the selection of mutations that improved stability without compromising the protein's function (Tables 1 and 2).
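
As an illustration of how a per-position summary of pairwise energy changes can be formed, the sketch below sums, for each mutated position, the change in every pairwise interaction energy that involves that position. This is one plausible reading of a summed pairwise value and is not the ENDURE implementation; the function name and the dictionary input format are hypothetical.

```python
from collections import defaultdict


def summed_pairwise_change(ref_pairs, design_pairs, positions):
    """Sum the design-minus-reference change of all pairwise energies per position.

    ref_pairs / design_pairs: {(res_i, res_j): pairwise energy} (assumed format).
    positions: mutated residue numbers to summarize.
    """
    wanted = set(positions)
    totals = defaultdict(float)
    for pair in set(ref_pairs) | set(design_pairs):
        # Interactions missing from one structure are treated as zero energy.
        delta = design_pairs.get(pair, 0.0) - ref_pairs.get(pair, 0.0)
        for pos in pair:
            if pos in wanted:
                totals[pos] += delta
    return {pos: totals[pos] for pos in positions}
```

Note that an interaction between two mutated positions contributes to both of their totals under this convention. A negative total suggests that the mutation improves the local interaction energy.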

Table 1: ThermoMPNN pairwise mutant summary

| Position: mutation | Sum Pairwise Rosetta Energy Breakdown |
|---|---|
| 39: P → N | -0.688 |
| 49: R → W | -7.234 |
| 58: T → I | 0.786 |
| 75: H → Y | -3.265 |
| 86: E → R | -1.812 |
| 114: R → V | -3.429 |
| 121: P → W | 1.306 |
| 131: T → V | -1.826 |
| SUM | -16.162 |

Table 2: PROSS pairwise mutant summary

| Position: mutation | Sum Pairwise Rosetta Energy Breakdown |
|---|---|
| 15: S → A | -0.767 |
| 19: V → C | 1.632 |
| 28: Y → H | -4.575 |
| 65: T → V | -0.973 |
| 67: S → K | 0.058 |
| 71: T → Q | -0.328 |
| 78: E → P | -1.73 |
| 82: L → M | -2.671 |
| 83: V → I | -0.04 |
| 86: E → Q | -3.306 |
| 96: T → Q | -1.926 |
| 97: N → K | 0.096 |
| 99: V → K | 1.211 |
| 102: E → Q | 0.036 |
| 104: V → K | -1.554 |
| 105: V → K | 1.406 |
| 108: E → R | -2.27 |
| 110: V → C | 2.984 |
| 128: V → T | 0.79 |
| 159: T → H | 1.388 |
| 162: S → C | -0.207 |
| 185: D → N | -0.353 |
| 188: R → E | 4.698 |
| 195: L → N | 0.199 |
| 197: K → E | 0.989 |
| SUM | -5.213 |


                                                                     

Design Details

All of the following sequences have mutations suggested by the respective pipeline methods (black) and the previously characterized S219V mutation (purple) (Kapust et al.).

  1. Seq1: ThermoMPNN Refined Mutations
  2. Seq2: PROSS Enhanced Specificity Mutations
  3. Seq3: Combined ThermoMPNN and PROSS Mutations
  4. Seq4: Top 4 ThermoMPNN + top 4 PROSS
  5. Seq5: De-novo design using RFdiffusion and ProteinMPNN

References

K.H. Sumida et al. Improving protein expression, stability, and function with ProteinMPNN. Journal of the American Chemical Society 146(3) (2024).

J. Dauparas et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science 378, 49–56 (2022).

H. Dieckhaus et al. Transfer learning to leverage larger datasets for improved prediction of protein stability changes. Proceedings of the National Academy of Sciences 121(6) (2024).

K. Tsuboyama et al. Mega-scale experimental analysis of protein folding stability in biology and design. Nature 620, 434–444 (2023).

A. Goldenzweig et al. Automated Structure- and Sequence-Based Design of Proteins for High Bacterial Expression and Stability. Mol. Cell (2016).

Z. Lin et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv (2022).

J.J. Weinstein et al. PROSS 2: a new server for the design of stable and highly expressed protein variants. Bioinformatics 37(1) (2021).

J. Jumper et al. Highly accurate protein structure prediction with AlphaFold. Nature 596(7873) (2021).

R.B. Kapust et al. The P1′ specificity of tobacco etch virus protease. Biochem Biophys Res Commun. (2002).

J.L. Watson et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023).