Protein Design of TEV with (i)ncreased and (l)egendary properties - the TEVil wears ProDe

Submission from Team ‘TEVil wears ProDe’ to Protein Design Games (Winter 2024)

Matt Cummins matthew.cummins@chem.ethz.ch
Seulhoo Lee dgg0808@knu.ac.kr

2024-03-28

Introduction

Recently, researchers have developed powerful machine-learning methods for designing better enzymes. However, these methods still require testing many designed sequences before a successful variant is identified. We therefore combined machine learning with rational design to pick the most promising sequences from the machine-learning outputs. Below is a summary of our methods.

TEVil 0 – hypertev60 with backmutations

Since we are limited to only 5 sequence submissions, we decided to use a safe method for designing our first sequence: it consists only of literature-validated mutations. First, we had to select a starting sequence as a template. We chose hypertev60 (Sumida et al., "Improving Protein Expression, Stability, and Function with ProteinMPNN", JACS 2024, https://doi.org/10.1021/jacs.3c10941) because it outperformed TEV variants commonly used in labs. Despite the success of hypertev60, Sumida and colleagues showed that many sequences from ProteinMPNN were inactive. Thus, we hypothesized that even hypertev60 may carry detrimental mutations that reduce activity or stability.

Looking at the crystal structure of TEV (PDB: 1LVM, https://doi.org/10.2210/pdb1LVM/pdb), we identified several mutations in hypertev60 that either eliminated interactions between residues, introduced hydrophilic residues that make few contacts with neighboring residues, or reduced hydrophobic clustering in the protein’s core regions. Based on this analysis, we reverted those residues of hypertev60 back to wild-type. Last, we deleted the first seven residues at the N-terminus because they have no density in the crystal structure. While this truncation is not verified in the literature, the missing density indicates that these residues are highly flexible and likely unimportant.

Additionally, because the C-terminal region is near the active site and partially interacts with the relatively long substrate (ENLYFQS or ENLYFQG, at the N-terminal end of the protein), we decided to attach the His-tag to the N-terminus, which is farther from the active site. Our final sequence is submitted as TEVil 0.

MHHHHHHGSPRDYNPISDTIVLLTNTSDGYSTSLYGIGFGPFIITNAHLFRRNNGTLTITSKHGTFTISNTTTLKLHLIEGRDLVLIEMPKDFPPFPTNLVFREPQVGERIVLVTRNFQTKSMSSEVSDTSTTYPSSDGIFWKHWIPTKDGQCGSPLVSTRDGSIVGIHSASNFTNTNNYFTAVPPNFMRLLTDPSAQKWVSGWSLNSDSVEWGGHKVFMDKP

TEVil 1 – literature mutations only

While using a literature-validated mutant is safe, it limits our potential. Thus, we took our TEVil 0 sequence and rationally introduced mutations from the literature that were reported to increase stability or activity. Based on the structure, these literature mutations are predicted neither to interfere with the interactions of critical residues in hypertev60 nor to affect its secondary structure. Specifically, they do not clash with the mutation sites of hypertev60; where a literature mutation coincides with a hypertev60 mutation site, we introduced the literature mutation that has been proven effective as a single-point mutation.

MHHHHHHGSPRDYNPISDSIVLLTNTSDGYSTSLYGIGFGPFIITNAHLFRRNNGTLVITSKHGTFTISDTTTLKLHLVEGRDLVLIEMPKDFPPFPTNLVFREPQVGDRIVLVTRNFQTKSMSSEVSATSTTYPSFDGTFWKHWIPTKDGQCGNPLVSTRDGSIVGIHSASNFTNTINYFAAVPPNFMRLLTDPSAQKWVSGWQLNSDSVEWGGHEVFMNKP
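
The substitutions that separate two submitted sequences can be listed by comparing them position by position. The sketch below is only an illustration: the function name is hypothetical, it assumes equal-length sequences with no insertions or deletions, and positions are numbered along the submitted construct (His-tag included), not in the PDB numbering of 1LVM. It is run here on the first 30 residues of TEVil 0 and TEVil 1.

```python
def point_mutations(reference: str, variant: str) -> list[str]:
    """List substitutions such as 'T19S', numbered 1-based along the construct.

    Hypothetical helper: only substitutions are handled, so both
    sequences must have the same length.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must have the same length (no indels handled)")
    return [
        f"{ref}{pos}{var}"
        for pos, (ref, var) in enumerate(zip(reference, variant), start=1)
        if ref != var
    ]


# First 30 residues of TEVil 0 and TEVil 1 as submitted (His-tag included).
tevil0_start = "MHHHHHHGSPRDYNPISDTIVLLTNTSDGY"
tevil1_start = "MHHHHHHGSPRDYNPISDSIVLLTNTSDGY"

print(point_mutations(tevil0_start, tevil1_start))  # ['T19S']
```

Passing the two full sequences to the same function lists every difference between TEVil 0 and TEVil 1.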

TEVil 2 – rational design mutations

Using literature mutations is a safe method for engineering a design that will likely work. However, that method limits our potential for a variant that could be more stable or active than current variants. Thus, we wanted to introduce new mutations using rational design. Our strategy was to enhance stability without compromising catalysis, so we focused on regions unlikely to affect the catalytic site and on positions where a mutation can stabilize the protein effectively, such as those connecting flexible loops to rigid secondary structures. Specifically, we: 1) increased hydrophobic interactions within the protein core to boost cohesion, 2) stabilized secondary structures by removing proline residues, and 3) introduced salt bridges and hydrogen bonds to stabilize predicted flexible loops.

Our final sequence is submitted as TEVil 2.

MHHHHHHGSPRDYNPISDVIVLLTNTSDGESTSLYGIGFGPFIIVNAHLFRRNNGTLVITIKHGTFTISDTTTLKLFLVEGRDLVLIEMPSDFPPFPTNLVFREPEVGDRIVLVTRNFQTKSDSSEVSADSTTYPSSDGTFWKHWIPTKDGQCGNVLVSTRDGSIVGIHSASNFTNTINYFADVPPNFMRLLTDGSELKWVSGWQLNSDSVEWGGHEVFMNKP

TEVil 3 – ProteinMPNN design on rational positions

Sumida and colleagues showed that mutating more than 30% of TEV resulted in completely inactive variants. We therefore hypothesized that ProteinMPNN is most powerful when it is restricted to the right residue positions, which makes identifying those positions an important step before running it. We ran ProteinMPNN on the same residue positions we mutated in TEVil 2.

Furthermore, it has been shown that the residues that undergo the largest flexibility changes upon substrate binding are positions with high potential for directed evolution (Bhattacharya et al., "NMR-guided directed evolution", Nature 2022, https://doi.org/10.1038/s41586-022-05278-9). Thus, we used the structures of TEV protease bound to substrate (PDB: 1LVB, https://doi.org/10.2210/pdb1LVB/pdb) or product (PDB: 1LVM) to select positions with B-factor changes of 80% or more. Additionally, we used AlphaFold predictions of the apo and substrate-bound complexes and selected residues with pLDDT changes of 10% or more. ProteinMPNN was allowed to mutate these positions on the wild-type structure 1LVM. We selected the sequence with the greatest sequence similarity to wild-type.
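
The B-factor comparison can be sketched as below. This is an illustration under stated assumptions, not the exact procedure we ran: the text above does not fix how the "change" is defined, so the sketch assumes the relative difference in C-alpha B-factors, taking the first structure as the reference and applying no normalization between crystals. All function names are hypothetical, and the toy records stand in for the real 1LVM and 1LVB files. The same thresholding idea would apply to per-residue pLDDT values with a cutoff of 0.1.

```python
def ca_bfactors(pdb_text: str, chain: str) -> dict[int, float]:
    """Map residue number -> C-alpha B-factor for one chain of a PDB-format file."""
    bfactors = {}
    for line in pdb_text.splitlines():
        # Fixed PDB columns: atom name 13-16, chain 22, resSeq 23-26, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA" and line[21] == chain:
            bfactors[int(line[22:26])] = float(line[60:66])
    return bfactors


def flexible_positions(b_ref: dict[int, float], b_other: dict[int, float],
                       min_change: float = 0.8) -> list[int]:
    """Residues whose B-factor differs by >= min_change relative to b_ref.

    ASSUMPTION: 'change' means |B_other - B_ref| / B_ref on residues
    present in both structures.
    """
    shared = sorted(b_ref.keys() & b_other.keys())
    return [r for r in shared
            if b_ref[r] > 0 and abs(b_other[r] - b_ref[r]) / b_ref[r] >= min_change]


def toy_ca_line(res: int, b: float) -> str:
    """Build one PDB-formatted C-alpha ATOM record (toy data for the demo only)."""
    return (f"ATOM  {res:5d}  CA  ALA A{res:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{b:6.2f}")


# Toy stand-ins for the product-bound and substrate-bound structures.
product_pdb = "\n".join(toy_ca_line(r, b) for r, b in [(10, 20.0), (11, 20.0), (12, 50.0)])
substrate_pdb = "\n".join(toy_ca_line(r, b) for r, b in [(10, 38.0), (11, 25.0), (12, 5.0)])

selected = flexible_positions(ca_bfactors(product_pdb, "A"), ca_bfactors(substrate_pdb, "A"))
print(selected)  # [10, 12]
```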

MHHHHHHGSPQDYTPISENIVHLENESDGETTSLYGIGYGPYIITNKHLFRRNNGTLTVKSVHGVFKIKDITTLQQHLIDGRDMVIIRMPEWFPPFNQKLKFREPKREERVVLVTTNFQTPTPSSMVSGTSCTFPSGDGTFWKHWIQTKDGQCGAPLVSVEDGEVVGIHSASNFTNTNNYFTAIPKNFMELLTNQALQQWVSGWHLNSDSVTWGGHKVFMDKP
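
Picking the design closest to wild-type can be sketched as follows. The sketch assumes that "sequence similarity" means fractional identity over equal-length sequences, which is one possible reading of the selection rule above, and it uses short made-up sequences, not real ProteinMPNN output. The function names are hypothetical. The fraction of mutated positions (one minus the identity) can also be compared against the 30% limit reported by Sumida and colleagues.

```python
def identity(reference: str, candidate: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    return sum(a == b for a, b in zip(reference, candidate)) / len(reference)


def most_similar(reference: str, candidates: list[str]) -> str:
    """Return the candidate with the highest identity to the reference."""
    return max(candidates, key=lambda seq: identity(reference, seq))


# Toy example: short made-up sequences standing in for wild-type and designs.
wild_type = "MKTAYIAK"
designs = ["MRSAYLAK", "MKTAYIAR", "AKTAYIGR"]

best = most_similar(wild_type, designs)
print(best, identity(wild_type, best))      # MKTAYIAR 0.875
print(1 - identity(wild_type, best) > 0.3)  # False: fewer than 30% of positions mutated
```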

TEVil 4 – shorter loops and dimerization

Based on the crystal structure of TEV bound to substrate (PDB: 1LVM), TEV forms a dimer. However, this dimer is likely a crystal artifact, since studies show TEV is a monomer (Phan et al., "Structural Basis for the Substrate Specificity of Tobacco Etch Virus Protease", JBC 2002, https://doi.org/10.1074/jbc.M207224200). Nevertheless, we hypothesized that dimerization would yield a more stable TEV enzyme. We selected positions with sidechains within 6 Å of the other protomer in the crystal structure 1LVM and redesigned them using RFdiffusion. The resulting mutations gave a confident dimer in AlphaFold predictions of the wild-type sequence. We then implemented those mutations in TEVil 1, which also resulted in a confident AlphaFold dimer prediction.
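
Selecting interface positions by a distance cutoff can be sketched as below. This is a minimal illustration with hypothetical function names and toy coordinates. It assumes "sidechain" means any atom other than N, CA, C and O, measures distance to any atom of the partner chain, and ignores hydrogens and alternate conformations. Under these assumptions glycine can never be selected.

```python
import math

BACKBONE = {"N", "CA", "C", "O"}


def parse_atoms(pdb_text: str):
    """Yield (chain, residue number, atom name, (x, y, z)) for each ATOM record."""
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            yield line[21], int(line[22:26]), line[12:16].strip(), xyz


def interface_residues(pdb_text: str, chain: str, partner: str,
                       cutoff: float = 6.0) -> list[int]:
    """Residues of `chain` with a sidechain atom within `cutoff` Å of `partner`."""
    atoms = list(parse_atoms(pdb_text))
    partner_xyz = [xyz for ch, _, _, xyz in atoms if ch == partner]
    hits = set()
    for ch, res, name, xyz in atoms:
        if ch != chain or name in BACKBONE:
            continue  # skip other chains and backbone atoms
        if any(math.dist(xyz, other) <= cutoff for other in partner_xyz):
            hits.add(res)
    return sorted(hits)


def toy_atom(chain: str, res: int, name: str, x: float) -> str:
    """Build one PDB-formatted ATOM record on the x axis (toy data only)."""
    return (f"ATOM  {1:5d} {name:^4} ALA {chain}{res:4d}    "
            f"{x:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{20.0:6.2f}")


toy_dimer = "\n".join([
    toy_atom("A", 5, "CB", 0.0),   # 4 Å from chain B: selected
    toy_atom("A", 6, "CB", 20.0),  # 16 Å from chain B: too far
    toy_atom("A", 7, "CA", 1.0),   # close, but backbone only: ignored
    toy_atom("B", 5, "CB", 4.0),
])

print(interface_residues(toy_dimer, chain="A", partner="B"))  # [5]
```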

Lastly, we hypothesized that shorter loops would increase the stability of the enzyme. We selected the loop that connects β7 and β8 because it is long and does not form any contact with the core of the enzyme. Using RFdiffusion, we shortened the loop by one residue. This mutation was also implemented in the TEVil 1 dimer, which yielded the final sequence for TEVil 4.

MHHHHHHGSERPYRLISASIVLLTNTSDGYSTSLYGIGFGPFIITNAHLFRRNNGTLVITSVLETFTISDTTTLKLHLVEGRDLVLIEMTXGWPPFPTNLVFREPQVGDRIVLVTRNFQPTSWHGSVSGTSTTYPSFDGTFWKHWIPTKDGQAGNPLVSTRDGSIVGIHSASNFTNTINYFAAVPPNFMRLLTDPSAQKWVSGWQLNSDSVEWGGHEVFMSKP

Methods

batch.sh for designing TEVil 3

#!/bin/bash
#SBATCH -p gpu.4h
#SBATCH -c 2
#SBATCH --output=example_1.out

# Input folder with the PDB file(s) and output folder for all results.
folder_with_pdbs="./"
output_dir="../outputs"
mkdir -p "$output_dir"

# Location of the ProteinMPNN scripts on the cluster.
mpnn_dir="/cluster/home/mcummin/proteinmpnn-main/vanilla_proteinmpnn"

path_for_parsed_chains="$output_dir/parsed_pdbs.jsonl"
path_for_assigned_chains="$output_dir/assigned_pdbs.jsonl"
path_for_fixed_positions="$output_dir/fixed_pdbs.jsonl"
chains_to_design="A"
# Positions on the designed chain that ProteinMPNN must keep unchanged.
fixed_positions="1 2 3 4 5 6 7 15 20 25 26 29 30 31 32 33 39 44 45 46 49 50 57 59 61 62 63 65 67 70 71 72 73 75 76 78 79 80 81 82 86 92 94 96 97 98 99 103 105 106 108 111 113 114 115 116 117 118 122 123 124 125 126 128 130 131 132 133 134 136 137 139 140 141 142 143 144 145 146 147 148 149 150 151 156 165 167 168 169 170 171 172 173 174 175 176 177 178 179 184 185 191 192 193 197 199 200 201 202 204 205 207 208 209 211 213 214 215 216 217 218 219 220 221"

# Parse the input PDB files into the JSONL format ProteinMPNN reads.
python "$mpnn_dir/helper_scripts/parse_multiple_chains.py" --input_path="$folder_with_pdbs" --output_path="$path_for_parsed_chains"

# Mark which chains are designed.
python "$mpnn_dir/helper_scripts/assign_fixed_chains.py" --input_path="$path_for_parsed_chains" --output_path="$path_for_assigned_chains" --chain_list "$chains_to_design"

# Write the fixed positions for the designed chain.
python "$mpnn_dir/helper_scripts/make_fixed_positions_dict.py" --input_path="$path_for_parsed_chains" --output_path="$path_for_fixed_positions" --chain_list "$chains_to_design" --position_list "$fixed_positions"

# Sample 200 sequences per target at three temperatures, never introducing cysteine.
python "$mpnn_dir/protein_mpnn_run.py" \
    --out_folder "$output_dir" \
    --num_seq_per_target 200 \
    --sampling_temp "0.1 0.2 0.3" \
    --batch_size 1 \
    --omit_AAs 'C' \
    --fixed_positions_jsonl "$path_for_fixed_positions" \
    --jsonl_path "$path_for_parsed_chains" \
    --chain_id_jsonl "$path_for_assigned_chains"
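
Because the script lists the positions to hold fixed, the positions left open to design are the complement of that list. The sketch below recovers them and reports the designable fraction, which can be compared with the 30% limit discussed for TEVil 3. The chain length of 221 is an assumption, taken from the largest position in the list and not read from the parsed PDB. The numbering is assumed to be 1-based along the parsed chain, as ProteinMPNN's helper script expects.

```python
# Copied from batch.sh: positions ProteinMPNN must keep unchanged.
fixed_positions = "1 2 3 4 5 6 7 15 20 25 26 29 30 31 32 33 39 44 45 46 49 50 57 59 61 62 63 65 67 70 71 72 73 75 76 78 79 80 81 82 86 92 94 96 97 98 99 103 105 106 108 111 113 114 115 116 117 118 122 123 124 125 126 128 130 131 132 133 134 136 137 139 140 141 142 143 144 145 146 147 148 149 150 151 156 165 167 168 169 170 171 172 173 174 175 176 177 178 179 184 185 191 192 193 197 199 200 201 202 204 205 207 208 209 211 213 214 215 216 217 218 219 220 221"

# ASSUMPTION: the designed chain has 221 residues (the largest listed position).
CHAIN_LENGTH = 221

fixed = {int(p) for p in fixed_positions.split()}
designable = [p for p in range(1, CHAIN_LENGTH + 1) if p not in fixed]

print("designable positions:", designable)
print(f"{len(designable)} of {CHAIN_LENGTH} positions "
      f"({len(designable) / CHAIN_LENGTH:.0%}) are open to design")
```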

RFdiffusion for designing TEVil 4 dimer

contigs='C/D/A9,A11,A14-15,A17-59,A64-91,A93-115,A117,A126,A128-221/B9,B11,B14-15,B17-59,B64-91,B93-115,B117,B126,B128-221'
pdb='1LVB'
symmetry='auto'

RFdiffusion for designing TEVil 4 shorter loop

contigs='A1-86/4/A92-221'
pdb='1LVM'
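
The loop contig can be checked with simple arithmetic: it keeps A1-86 and A92-221 and rebuilds the gap between them (residues 87-91, five residues) as four new residues, a net change of one residue. The sketch below does this bookkeeping. It is a simplified reader written for illustration, not RFdiffusion's own parser: it assumes a single chain, fixed insert lengths (no ranges such as 4-6), and counts only new residues placed between two kept segments.

```python
def contig_length_change(contig: str) -> int:
    """Net length change implied by a contig such as 'A1-86/4/A92-221'.

    Simplified, hypothetical reader: segments starting with a chain letter
    are kept residue ranges; purely numeric segments are numbers of newly
    built residues.
    """
    change = 0
    previous_end = None  # last kept residue seen so far
    new_residues = 0     # new residues since the last kept segment
    for segment in contig.split("/"):
        if segment[0].isalpha():
            bounds = segment[1:].split("-")
            start, end = int(bounds[0]), int(bounds[-1])
            if previous_end is not None:
                removed = start - previous_end - 1  # residues skipped in the gap
                change += new_residues - removed
            previous_end, new_residues = end, 0
        else:
            new_residues += int(segment)
    return change


print(contig_length_change("A1-86/4/A92-221"))  # -1: the loop is one residue shorter
```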