Jonathan Ash - ja961@scarletmail.rutgers.edu

Ali Mobedi - Mobedia@stanford.edu

Liberum Bio TEV Protease Design Competition

Team 5 Writeup

Introduction

The Khare Lab at Rutgers University has been working with the Tobacco Etch Virus (TEV) protease to determine its specificity profile using the Protein Graph Convolutional Network (PGCN). In combination with other physics-based and deep learning techniques, the model's predictions can inform our design process, narrowing the search space to find TEV designs that greatly enhance native catalytic activity.

Thermostabilization

To begin, the native TEV protease structure was relaxed with constraints placed between the catalytic residues H46, D81, and C151, as well as S308, the P1 residue of the peptide substrate. After relaxation, to promote high expression across all designs, the established thermostabilization protocol from the Baker lab was employed [1]. HHblits was used to conduct 4 iterative searches against UniRef30 at e-value cutoffs of 1e-4, 1e-10, 1e-30, and 1e-50 [2]. The identified sequences were combined, aligned, and filtered to remove redundancy above 90% sequence identity. All sequences needed at least 50% query coverage and 30% identity to the native TEV protease to be included in the final alignment. Departing from the Baker lab protocol, percent identity was calculated at each position of the native TEV sequence with respect to the other sequences in the alignment. The positions were then sorted from most to least conserved. The 30%, 50%, and 70% least conserved positions were designed using ProteinMPNN in separate runs, with the rest of the protein held fixed [3]. The interface and all catalytic residues were also kept constant in these designs. Sequence generation was performed at temperatures of 0.1, 0.2, and 0.3 for all position sets. Twenty sequences were generated for each of the 9 combinations of position set and temperature, yielding 180 total candidates. These designs were grafted onto the native backbone using PyRosetta FastRelax with the aforementioned enzyme constraints and backbone coordinate constraints [4]. For each design, the sums of the atom-pair, angle, dihedral, and coordinate constraint scores were recorded. Interaction energies were calculated for each residue along the substrate. All values were then compared to those of the native, and designs were filtered according to the differences. Passing designs could not have any constraint sum or interaction energy more than 1 Rosetta energy unit (REU) above the corresponding native value, ensuring that no contacts with the substrate were compromised during thermostabilization. Fourteen designs passed these criteria.
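The conservation ranking described above can be sketched as follows. This is a minimal illustration rather than the code used for the competition: the function names, the toy sequences, and the choice to remove fixed positions after taking the least conserved fraction are all assumptions.

```python
from itertools import product


def percent_identity_per_position(query, aligned):
    """Fraction of aligned sequences that match the query at each query position.

    Assumes every sequence in `aligned` is already aligned to `query` and has
    the same length, with no gap columns in the query (an assumption).
    """
    counts = [0] * len(query)
    for seq in aligned:
        for i, (q, s) in enumerate(zip(query, seq)):
            if q == s:
                counts[i] += 1
    return [c / len(aligned) for c in counts]


def least_conserved_positions(identities, fraction, fixed=()):
    """Return the `fraction` least conserved 0-based positions, minus `fixed`."""
    ranked = sorted(range(len(identities)), key=lambda i: identities[i])
    n_design = round(fraction * len(identities))
    fixed = set(fixed)
    return sorted(i for i in ranked[:n_design] if i not in fixed)


# Toy query and alignment (made up, not TEV sequences).
query = "ACDEFGHIKL"
aligned = ["ACDEYGHIKV", "ACNEYGHIRV", "GCNEYGHLRV", "ACDEFGHIRL"]
identities = percent_identity_per_position(query, aligned)

# One design run per (fraction, temperature) pair: 3 x 3 = 9 runs.
runs = list(product((0.3, 0.5, 0.7), (0.1, 0.2, 0.3)))
for fraction, temperature in runs:
    # Position 4 stands in for a hypothetical catalytic or interface residue.
    designable = least_conserved_positions(identities, fraction, fixed={4})
    print(fraction, temperature, designable)

print(len(runs) * 20)  # 20 sequences per run
```

With 20 sequences per run, the 9 runs account for the 180 candidates reported in the text.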

S219 Mutations

Multiple point mutations at S219 have been reported to improve the TEV protease. S219V was found to inhibit self-cleavage, thereby improving solubility and catalytic activity. Additionally, S219N was reported to increase catalysis nearly twofold over S219V [5]. Because none of the passing designs had been altered at position 219, each of the two point mutations was installed in each of the 14 designs, yielding 28 total starting structures for the next round.
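Installing the two substitutions on every passing design can be sketched at the sequence level as shown here. The helper names, the placeholder sequences, and the assumption that sequence numbering starts at residue 1 are illustrative only. The structural step (mutating and relaxing in PyRosetta) is not shown.

```python
def install_point_mutation(sequence, wild_type, position, mutant, first_residue_number=1):
    """Return `sequence` with the residue at `position` changed to `mutant`.

    `position` uses the protein's own numbering. `first_residue_number` is the
    number of the first residue in `sequence` (assumed to be 1 here).
    """
    index = position - first_residue_number
    if sequence[index] != wild_type:
        raise ValueError(
            f"expected {wild_type} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + mutant + sequence[index + 1:]


def expand_with_mutations(designs, mutations):
    """Apply every mutation to every design, one mutation per new sequence."""
    expanded = {}
    for name, seq in designs.items():
        for wild_type, position, mutant in mutations:
            label = f"{name}_{wild_type}{position}{mutant}"
            expanded[label] = install_point_mutation(seq, wild_type, position, mutant)
    return expanded


# Placeholder sequences with S at position 219 (not real TEV designs).
toy_designs = {f"design_{i:02d}": "G" * 218 + "S" + "M" * 10 for i in range(14)}
s219_mutations = [("S", 219, "V"), ("S", 219, "N")]

starting_sequences = expand_with_mutations(toy_designs, s219_mutations)
print(len(starting_sequences))
```

Checking the wild-type residue before substituting guards against numbering offsets between the sequence and the structure.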

Interface Design

ProteinMPNN was then used to sample sequences at the interface for the 28 thermostabilized proteases. To determine which residues should be kept constant, the Protein Graph Convolutional Network (PGCN) training data was examined. PGCN was trained and validated on a variety of protease structures, including the TEV protease. During the training process, several residues along the interface stood out as being particularly crucial in governing protease activity and specificity. These residues were N171, N177, N176, W211, T30, V209, S170, M218, G149, H214, K45, D148, and K215 [6]. Sampled interface sequences either kept the 3 most important residues constant (N171, N177, and N176), kept all of the above residues constant, or kept none of them fixed, thereby designing the entire interface. No design was allowed to alter the catalytic residues or V/N219. Temperatures of 0.1, 0.2, and 0.3 were used in combination with the different fixed residue sets. Twenty sequences were generated per run, yielding 4654 unique designs. All sequences were grafted onto their corresponding starting structure. As before, constraint and interaction energy differences relative to the native were computed for all designs. Interface designs were required to have enzyme constraint differences under 1 REU, coordinate constraint differences under 2 REU, and interaction energy differences at p5 and p6 under 1 REU. All other interaction energy differences needed to be under 0 REU. Finally, to expand the set of feasible protease candidates, each sequence was allowed to fail at most one of these checks. Twenty-four interface designs passed these criteria.
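The threshold filter with one tolerated failure can be sketched as below. The dictionary layout, key names, and example values are hypothetical, and counting each constraint term and each substrate position as a separate check is an assumption about how "one of these checks" was tallied.

```python
ENZYME_CST_LIMIT = 1.0       # REU, from the text
COORDINATE_CST_LIMIT = 2.0   # REU, from the text
RELAXED_POSITIONS = ("p5", "p6")  # interaction energy limit of 1 REU
RELAXED_LIMIT = 1.0
DEFAULT_LIMIT = 0.0          # all other substrate positions


def count_failed_checks(diff):
    """Count how many thresholds a design's differences to the native exceed.

    `diff` holds design-minus-native values in REU (hypothetical layout):
    {"enzyme_cst": float, "coordinate_cst": float, "interaction": {pos: float}}
    """
    failures = 0
    if diff["enzyme_cst"] >= ENZYME_CST_LIMIT:
        failures += 1
    if diff["coordinate_cst"] >= COORDINATE_CST_LIMIT:
        failures += 1
    for position, value in diff["interaction"].items():
        limit = RELAXED_LIMIT if position in RELAXED_POSITIONS else DEFAULT_LIMIT
        if value >= limit:
            failures += 1
    return failures


def passes_interface_filter(diff, max_failures=1):
    """A design passes if it fails at most `max_failures` checks."""
    return count_failed_checks(diff) <= max_failures


# Made-up example: only the p3 interaction energy misses its threshold.
example = {
    "enzyme_cst": 0.4,
    "coordinate_cst": 1.2,
    "interaction": {"p6": 0.5, "p5": -0.3, "p4": -1.0, "p3": 0.2, "p2": -0.8, "p1": -0.1},
}
print(count_failed_checks(example), passes_interface_filter(example))
```

Under the sampling scheme in the text, 28 starting structures, 3 fixed-residue sets, 3 temperatures, and 20 sequences per run give 5040 sampled sequences, so the 4654 unique designs reported suggest that some sequences were sampled more than once.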

Comparison to Literature

In addition to their thermostabilization protocol, the Baker lab also provided 6 full TEV design sequences that demonstrated greatly enhanced expression. Furthermore, the Ting lab performed directed evolution on the TEV protease, discovering a number of point mutations that improved catalytic activity. Three variants were reported to be successful. Their mutations were S153N in uTEV1, S153N and T30A in uTEV2, and S153N, T30A, and I138T in uTEV3 [7]. Residues outside of the interface on the 24 interface designs were mutated to match each of the 6 Baker designs and 3 Ting sequences. In this fashion, 192 unique new designs were generated from the 24 previous proteases and subsequently filtered by the same criteria applied to the initial interface designs, yielding 70 new passing TEV sequences. Combined with the original filtered designs, 94 total TEV proteases were created. All candidate designs were re-folded with AlphaFold2 for validation [8]. All sequences had average pLDDTs above 90, and the RMSDs between the Rosetta and AlphaFold2 models were around 1 Å. Given the excellent consistency of these metrics across the design set, neither the pLDDTs nor the RMSDs were used to filter further.
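Matching the non-interface residues to a literature sequence while preserving each design's interface can be sketched as follows. This is an illustration at the sequence level only. It assumes the design and reference sequences are the same length and already aligned, and that `interface_positions` (a hypothetical name) lists every 0-based position to preserve.

```python
def graft_outside_interface(design, reference, interface_positions):
    """Keep the design's residues at the interface, take the reference elsewhere."""
    if len(design) != len(reference):
        raise ValueError("design and reference must have the same length")
    interface = set(interface_positions)
    return "".join(
        d if i in interface else r
        for i, (d, r) in enumerate(zip(design, reference))
    )


def combine_with_references(designs, references, interface_positions):
    """Graft every reference onto every design and drop duplicate sequences.

    Returns a dict mapping each unique sequence to the first label that produced it.
    """
    unique = {}
    for design_name, design_seq in designs.items():
        for ref_name, ref_seq in references.items():
            seq = graft_outside_interface(design_seq, ref_seq, interface_positions)
            unique.setdefault(seq, f"{design_name}+{ref_name}")
    return unique


# Toy sequences (not TEV). Positions 2 and 3 play the role of the interface.
toy_designs = {"d1": "AAGGAA", "d2": "TTGGTT", "d3": "AAHHAA"}
toy_references = {"ref1": "CCCCCC", "ref2": "DDDDDD"}

combined = combine_with_references(toy_designs, toy_references, interface_positions=[2, 3])
print(len(combined), sorted(combined))
```

Designs that share an interface collapse to the same sequence after grafting, which is one way 24 designs and 9 references (216 combinations) could produce the 192 unique sequences reported.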

Sorting and Scoring

The PGCN TEV training data was again consulted to inform the sorting procedure. Along with the identified key interface residues, the network was also able to determine which substrate positions are most vital in governing specificity. The three most important positions were p6, p5, and p2, so the interaction energy differences at these three locations were prioritized. Any design that did not possess more favorable interactions than the native at two or more of the three positions was discarded. Additionally, all passing designs had to have interaction energy differences within 1 REU at all three positions, to avoid extremely unfavorable values at any key location. Finally, the average interaction energy difference across p6, p5, and p2 was computed for each candidate, and filtered designs were required to have an average below 0. In total, 6 designs passed the above filters. Three of them had been mutated to match the Baker lab sequences outside of the interface, two had the Ting point mutations installed, and just one belonged to the original interface design set. All 6 sequences carried the S219N point mutation. These designs were sorted by average interaction energy difference, and the lowest 4 were taken as the final TEV protease design set.
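The final filter and ranking can be sketched as below. The design names and energy differences are made up, and reading "within 1 REU" as a design-minus-native difference below 1 REU is an assumption.

```python
KEY_POSITIONS = ("p6", "p5", "p2")


def mean_key_difference(diffs):
    """Average design-minus-native interaction energy over the key positions."""
    return sum(diffs[p] for p in KEY_POSITIONS) / len(KEY_POSITIONS)


def passes_key_position_filters(diffs):
    """Apply the three filters on the key substrate positions.

    1. More favorable than native (difference < 0) at two or more positions.
    2. No position more than 1 REU worse than native (assumed: difference < 1).
    3. Average difference across the three positions below 0.
    """
    values = [diffs[p] for p in KEY_POSITIONS]
    improved = sum(v < 0 for v in values)
    return improved >= 2 and all(v < 1.0 for v in values) and mean_key_difference(diffs) < 0


def select_final_designs(designs, n_final=4):
    """Filter, sort by average difference (lowest first), and keep the top n."""
    passing = [name for name, diffs in designs.items() if passes_key_position_filters(diffs)]
    passing.sort(key=lambda name: mean_key_difference(designs[name]))
    return passing[:n_final]


# Made-up interaction energy differences in REU.
toy = {
    "a": {"p6": -1.0, "p5": -0.5, "p2": 0.2},   # passes
    "b": {"p6": -2.0, "p5": -1.0, "p2": 1.5},   # fails filter 2
    "c": {"p6": -0.1, "p5": 0.3, "p2": 0.4},    # fails filter 1
    "d": {"p6": -0.2, "p5": -0.1, "p2": 0.9},   # fails filter 3
    "e": {"p6": -1.0, "p5": -1.0, "p2": -1.0},  # passes
}
print(select_final_designs(toy))
```

Applying the 1 REU cap separately from the average prevents a design with one very unfavorable key position from being rescued by two strongly favorable ones.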

Supplementary Materials

All code and materials used for this competition will be uploaded to the GitHub repository here: https://github.com/JonathanEAsh/Team-5-TEV-Protease-Design

References

[1] Sumida, K. H., Núñez-Franco, R., Kalvet, I., Pellock, S. J., Wicky, B. I. M., Milles, L. F., Dauparas, J., Wang, J., Kipnis, Y., Jameson, N., Kang, A., De La Cruz, J., Sankaran, B., Bera, A. K., Jiménez-Osés, G., & Baker, D. (2024). Improving Protein Expression, Stability, and Function with ProteinMPNN. Journal of the American Chemical Society, 146(3), 2054–2061.

[2] Remmert, M., Biegert, A., Hauser, A., & Söding, J. (2011). HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nature Methods, 9(2), 173–175.

[3] Dauparas, J., Anishchenko, I., Bennett, N., Bai, H., Ragotte, R. J., Milles, L. F., Wicky, B. I. M., Courbet, A., de Haas, R. J., Bethel, N., Leung, P. J. Y., Huddy, T. F., Pellock, S., Tischer, D., Chan, F., Koepnick, B., Nguyen, H., Kang, A., Sankaran, B., Bera, A. K., … Baker, D. (2022). Robust deep learning-based protein sequence design using ProteinMPNN. Science, 378(6615), 49–56.

[4] Chaudhury, S., Lyskov, S., & Gray, J. J. (2010). PyRosetta: a script-based interface for implementing molecular modeling algorithms using Rosetta. Bioinformatics, 26(5), 689–691.

[5] Nam, H., Hwang, B. J., Choi, D. Y., Shin, S., & Choi, M. (2020). Tobacco etch virus (TEV) protease with multiple mutations to improve solubility and reduce self-cleavage exhibits enhanced enzymatic activity. FEBS Open Bio, 10(4), 619–626.

[6] Lu, C., Sarma, V., Stentz, S. Z., Wang, G., Wang, S., & Khare, S. D. (2023). Prediction and design of protease enzyme specificity using a structure-aware graph convolutional network. Proceedings of the National Academy of Sciences, 120(39).

[7] Sanchez, M. I., & Ting, A. Y. (2020). Directed evolution improves the catalytic efficiency of TEV protease. Nature Methods, 17(2), 167–174.

[8] Jumper, J., Evans, R., Pritzel, A., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589.