Introduction to Deep learning for Protein Sequencing
Over the past decades, advances in science and technology have let us explore molecular biology in greater depth: what happens within an organism, and which primary biological components make us who we are. Among them are proteins. They are essential to every life form we know of, and most of a cell’s important functions are carried out by proteins.
A protein is a primary molecule of the cell that carries out many functions, among them digestion, repairing damaged tissue, providing cellular structure, strengthening the immune system, catalyzing metabolic reactions, and replicating DNA, to mention a few. Proteins are building blocks of the human body, and they make up a large share of the dry mass of the human body (or of any other organism, for that matter).
Proteins have different structures, and the structure depends on the arrangement of amino acids. The process by which a chain of amino acids takes on a 3D structure is known as protein folding. The way the amino acids interact with each other dictates the shape of the protein, and the protein’s function depends on that shape.
If we could understand the interactions of these amino acids, we might be able to understand the workings of every biological species, both micro and macro. This could open new doors to improving life as we know it and to developing drugs and treatments for SARS-CoV-2, HIV, Alzheimer’s disease, et cetera.
Deciphering protein folding holds answers to many complex problems that we currently face, especially the coronavirus pandemic, and to problems that have not yet come to light.
So how many different protein foldings are there in the human body?
To answer that question, we first need to ask how many different amino acids there are in the human body. As it turns out, human proteins are built from 20 standard amino acids (up to 22 if rarer ones are counted). To estimate the number of different proteins in the human body, we need to know the different combinations in which the amino acids can be arranged to form a protein fold, which by some estimates is approximately 1 billion.
“The number of different proteins comprising the human proteome is a core proteomics issue. Researchers propose numbers between 10,000 and several billion different protein species.”
— The Size of the Human Proteome: The Width and Depth
“Currently, there are around 200 million known proteins (in the world), with another 30 million found every year.”— Excerpt from Deepmind’s AlphaFold research.
So far, a figure of about 20,000 proteins is generally accepted as being present in the human body. Figuring out the exact structure and sequence of a protein remains an expensive and tedious task, even with state-of-the-art microscopy techniques.
A less expensive way to explore and learn about protein structures is to use artificial intelligence techniques such as deep learning. The premise is that if a large amount of data is fed to a deep learning system, it will be able to find correlations between the amino acids and learn the underlying rules that govern their interactions.
On 30 November 2020, Google’s DeepMind announced that AlphaFold, a deep learning model trained on protein data containing more than 100,000 known amino acid sequences and their corresponding protein structures, could predict a protein’s structure from its sequence of amino acids. The result drew attention around the globe and strengthened the case that deep learning can tackle very complex scientific problems.
What is Protein Sequencing
Proteins are made of amino acid sequences, and the arrangement of these sequences varies, which leads each protein to take a particular structure. The amino acid sequence is a linear chain of amino acids, which are organic compounds containing an amino group (-NH2) and a carboxyl group (-COOH). A protein is usually described at five levels of structure:
- Primary
- Secondary
- Tertiary
- Quaternary
- Supramolecular
Source: Harvey Lodish, Arnold Berk, Paul Matsudaira, Chris A. Kaiser, Monty Krieger, Matthew P. Scott, Lawrence Zipursky, James Darnell – Molecular Cell Biology-W. H. Freeman (2008)
The primary structure is the first level: the linear arrangement of amino acids, or the amino acid sequence. At the secondary level and the levels that follow, the protein starts to take on a 3D structure as amino acids within the sequence interact with each other, and then the side chains start to interact with other side chains to create more complex protein structures.
In a nutshell, protein sequencing as used in this article is a method of finding the arrangement of amino acids from protein structures, or protein folds. This may help us study the interactions between amino acids and design new sequences, which can eventually help us determine their functionality.
As a matter of fact, there are approximately 100,000 known protein structures, and the effort to determine the remaining ones is still ongoing.
Types of Protein Sequencing
There are two methods of protein sequencing:
- Mass spectrometry
- Edman degradation using a protein sequenator
Both of these methods are slow. A Nature article titled “AI protein-folding algorithms solve structures faster than ever” states that protein-folding algorithms solve the problem faster than any other approach and that, given enough data, error rates fall significantly.
Challenges of Protein Sequencing
Protein sequencing is difficult because of the complexity that arises when amino acids start to interact and the chain takes on a complex 3D structure. The goal here is to create a deep learning model that can model the 3D structure of the protein and decompose it into strings, or sequences, of amino acids.
The conventional methods are slow and expensive, so we turn to computational processes that can find sequences much faster through quick iterations over different combinations. But even though deep learning is fast, challenges remain in verifying the results, running model inference, benchmarking, and transferring the gained knowledge to other areas of biotechnology.
When to use DL for Protein Sequencing
Computer technology has advanced greatly, with faster chips and more memory. Computational methods have been driving progress on many complex problems in molecular biology, and software and algorithms are emerging to solve these problems. Even though such software reduces research cost compared to mass spectrometry, it is still not that accurate, and writing these algorithms can be very tedious because they are rule-based.
Deep learning algorithms, on the other hand, are not rule-based systems; they instead learn the rules that govern the distribution of the given data. These algorithms are faster and can produce accurate results if enough data is available. Deep learning systems do need computing power, but given sufficient data and compute they can produce good predictions in a matter of days or hours.
Deep learning belongs to the computational methods for predicting protein sequences, an area known as protein design. Protein design aims to predict protein sequences, i.e. the amino acid sequence that will fold into a structure with a particular protein function.
These deep learning algorithms and architectures can not only save time by predicting new sequences but can also learn to decompose protein structures in a way that may help to design new and unseen sequences, which could accelerate drug design and discovery.
DL Architectures
Deep learning algorithms have evolved at a staggering pace, especially in the last 5 years, and the deep learning community has produced architectures and algorithms that have revolutionised the way we look at things. Among all the architectures available, five have so far played a vital role in protein design research and, as a matter of fact, in most DL domains:
- Convolutional Neural Networks
- Variational Autoencoder
- Generative Adversarial Networks
- Recurrent Neural Network
- The Transformers
Convolutional neural networks (CNNs) are good at finding patterns and representations in spatial data such as images and videos, and they are widely used in computer vision. Variational autoencoders and GANs that are used to generate images mostly use CNNs to produce high-quality images that closely resemble real ones.
Recurrent neural networks and transformers are used for sequential data such as text, audio, or any time series. It is not a strict rule that you have to use these architectures in their respective domains; in fact, there are transformer architectures that can generate images and CNN architectures that can generate sequences.
When it comes to protein sequencing or design, you may use both kinds of architecture together. It is all about getting the correct results.
How does Protein Sequencing with DL work
The process is the same as with any other deep learning technique. You feed the data into the deep learning algorithm, and it yields output based on how you design the architecture.
For instance, you can create a model that can:
- Learn the structure or sequence of the protein and predict the functionality.
- Learn the amino-acid sequence and predict the 3d structure of the protein.
- Learn the functionality or structure and predict the sequence.
The image above shows three Major Tasks in Protein Modeling
For the sake of this article we will do the last of these: learning the protein structure and predicting the sequence.
Key concepts of Protein Sequencing with DL
Some concepts of protein sequencing or protein design that you should be familiar with are:
- Fundamentals of proteins
Understanding the fundamentals of molecular biology is important. You need to learn why proteins are important and how amino acids are formed. You need to understand protein folding and energy functions as well. This helps you build an understanding of how proteins transform from the primary stage to a supramolecular structure.
- Monte Carlo method
Since protein length can vary, it is important to understand how to evaluate whether a protein sequence or structure is correct. Every output we obtain from our model needs to be tested with a software tool called Rosetta Abinitio, which takes a sequence and tries to fold it into a 3D structure using the Monte Carlo method. So it is essential to know how the Monte Carlo algorithm works.
- Energy function
Whenever we create a protein structure, we generally assume that the protein transforms from a primary structure to a supramolecular structure. During the transformation, it tries to reach the lowest energy state. Being familiar with energy functions can help you better understand the transformation process and build models that find a global minimum.
- Root mean square deviation
The root mean square deviation (RMSD) measures how far apart the backbone atoms of two protein structures are, for example a predicted structure and a reference one. The backbone usually refers to the repeated amide N, alpha carbon Cα, and carbonyl C atoms of each amino acid residue.
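As an illustration of the RMSD idea, here is a minimal sketch in Python. It assumes NumPy is available and that the two structures list the same atoms in the same order and have already been superimposed; the function name `backbone_rmsd` is made up for this example and is not part of any library.

```python
import numpy as np

def backbone_rmsd(coords_a, coords_b):
    """RMSD between two (N, 3) arrays of matching atoms.

    Assumption: both structures are already superimposed (aligned),
    so no rotation or translation is fitted here.
    """
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both structures must list the same atoms")
    sq_dist = np.sum((a - b) ** 2, axis=1)   # squared distance per atom
    return float(np.sqrt(sq_dist.mean()))    # root of the mean squared distance

# Toy data: the second structure is the first one shifted by (3, 4, 0),
# so every atom moves by exactly 5 units.
reference = np.zeros((4, 3))
shifted = reference + np.array([3.0, 4.0, 0.0])
print(backbone_rmsd(reference, shifted))
```

A lower RMSD means the two backbones are closer; real comparisons usually fit an optimal superposition first, which this sketch leaves out.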
How to get started with Protein Sequencing with DL
In this section I will mention a few steps that can help you get started with protein sequencing with deep learning. First and foremost you need data.
Getting Data
The Protein Data Bank offers a huge amount of data. You can download it in FASTA format and start playing with it. The website is listed in the Resources section at the end of this article.
Another way to get access to data is by reading research papers in this domain. It is worth noting that googling “protein sequencing with deep learning” will not give accurate results; most of the information will point towards “protein structure prediction with deep learning”, which is fine if you want to predict protein structure. If you are looking for deep learning models that can predict protein sequences, google “protein design with deep learning” instead. You will find a lot of research papers on the subject, and most of them link to the dataset and the GitHub repo. Make sure to find these links; they are mostly at the end of the paper or in a footnote of the conclusion section.
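Once a FASTA file is downloaded, reading it takes only a few lines. The sketch below is a hand-written parser that relies only on the general FASTA convention (a header line starting with `>` followed by one or more sequence lines); the record name in the example is hypothetical and not a real database entry.

```python
def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # drop the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # a sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}

# Hypothetical record, used only to exercise the parser.
example = """>demo_chain_A (hypothetical record)
VLSPADKTNV
KAAWGKVGAH
"""
records = parse_fasta(example)
print(records)
```

For real projects a dedicated library is usually a better choice than a hand-rolled parser; this version only shows what the format contains.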
For instance, check out the images below.
Protein Sequencing with DL Techniques
As discussed earlier, protein sequencing here means designing and predicting, via computational methods, new protein sequences that can fold and have functionality. Since we are dealing with sequences, we will use techniques similar to those of natural language processing (NLP), because both NLP and protein design work with sequential data.
For instance, in NLP we use words and sentences as sequences. Similarly, in protein modeling we work with sequences that look something like this:
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY
This sequence belongs to a structure that looks like this:
In our case, the protein structure that you see in the image above will be our input x and the ground truth will be the sequence mentioned previously.
In such cases, algorithms like RNNs are extremely useful because they model each element of a sequence based on its position. An RNN has a hidden state in which information about the sequence is stored, with respect to position, at each iteration. An RNN can thus capture:
- Structure of the sequence
- Order of the sequence in terms of short term memory
But RNNs fail to capture these two properties when the sequence is long, so they can be replaced with transformers. This architecture not only captures order and structure but also captures long-term dependencies, which RNNs and LSTMs struggle with. Hence this article will focus on the transformer architecture.
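Before a sequence like the one shown above can be fed to an RNN or a transformer, each residue has to be mapped to an integer token, much as words are mapped to vocabulary indices in NLP. The sketch below is one possible encoding, assuming the 20 standard one-letter amino acid codes in alphabetical order; the ordering and the names `encode` and `decode` are choices made for this example.

```python
# The 20 standard amino acids as one-letter codes (the ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_OF = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode(seq):
    """Map a protein sequence to a list of integer tokens."""
    return [TOKEN_OF[aa] for aa in seq]

def decode(tokens):
    """Map integer tokens back to a protein sequence."""
    return "".join(AMINO_ACIDS[t] for t in tokens)

tokens = encode("VLSPADKTNV")   # first ten residues of the example sequence
print(tokens)
print(decode(tokens))
```

A real vocabulary would typically also reserve tokens for padding and for unknown or non-standard residues.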
Protein Sequencing frameworks & Tools
Here are some frameworks and tools that I would like to mention that you can use for protein sequencing, protein design, protein structure prediction. Starting with frameworks we have:
When it comes to the tools there are some amazing compiling and visualisation tools like:
Remember that Rosetta Abinitio is a structure prediction tool which is used to check whether the predicted sequence is valid or not.
Protein Sequencing with DL
So far we have covered the theory needed to get started with protein sequencing. This section focuses on the practical use of deep learning algorithms for protein sequencing. We will use the transformer architecture to predict amino acid sequences. The methods involve a lot of information, so the section is broken into subsections for easier reading.
How to build better models for Protein Sequencing
Protein data, or any other data that represents a molecule or, more interestingly, a graph or a network, cannot be represented or processed in the conventional fashion, i.e. representing it as a plain vector will not work. An image, for instance, can be represented as a vector because all the pixels in an image have a fixed order; there is no alternative permutation in which to arrange them. A molecule, by contrast, can be written down in different permutations, and the number of permutations depends on the length of the molecule sequence.
The image above shows a caffeine molecule and its representation in the form of a vector. But the same molecule can be arranged in a different permutation. (See the image below.)
When we work with data that represents a graph or a network, how can we process and model it such that the data:
- Retains its structure
- Does not lose any information
The easiest way to process such information is to represent the structure as a graph.
In protein sequencing or design, we represent these molecules as graphs, with node features describing each residue (i.e. each amino acid) and edges capturing the relationships between residues. Representing structures as graphs yields computational efficiency, inductive bias, and representational flexibility.
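One common way to turn residue coordinates into graph edges is to connect each residue to its k nearest neighbors in space. The sketch below assumes NumPy and one 3D coordinate per residue (for instance the alpha carbon); the function name `knn_edges` and the default `k=3` are illustrative choices, not values taken from a specific paper.

```python
import numpy as np

def knn_edges(ca_coords, k=3):
    """For each residue, return the indices of and distances to its k nearest residues."""
    x = np.asarray(ca_coords, dtype=float)
    diff = x[:, None, :] - x[None, :, :]        # pairwise coordinate differences
    dist = np.linalg.norm(diff, axis=-1)        # (L, L) distance matrix
    np.fill_diagonal(dist, np.inf)              # a residue is not its own neighbor
    k = min(k, len(x) - 1)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    return neighbors, np.take_along_axis(dist, neighbors, axis=1)

# Toy coordinates: four residues on a line at x = 0, 1, 2 and 10.
toy = np.array([[0.0, 0, 0], [1.0, 0, 0], [2.0, 0, 0], [10.0, 0, 0]])
neighbors, distances = knn_edges(toy, k=2)
print(neighbors)
```

The full distance matrix is fine for single chains of a few hundred residues; much larger structures may call for a spatial index instead.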
Preparing your data
Preparing data is an important step in any deep learning project. All the features of the data should be structured properly so that the model can extract useful information and make predictions. In our case the model needs to decompose protein structures, so we need to provide it with the coordinates of the backbone atoms of each amino acid, along with the name of the structure, the number of chains, and the primary sequence, which will be our target variable.
Below is an example of how the data should be preprocessed and structured.
{
    'CATH': ['3.20.20', '2.40.37'],
    'coords': {
        'C': array([[ nan, nan, nan],
                    [ nan, nan, nan],
                    [ nan, nan, nan],
                    ...,
                    [ -2.712, -68.057, -9.848],
                    [ 0.188, -67.072, -8.56 ],
                    [ 2.881, -68.829, -7.052]]),
        'CA': array([[ nan, nan, nan],
                     [ nan, nan, nan],
                     [ nan, nan, nan],
                     ...,
                     [ -3.759, -67.915, -10.924],
                     [ -1.303, -66.893, -8.27 ],
                     [ 2.391, -67.537, -7.582]]),
        'N': array([[ nan, nan, nan],
                    [ nan, nan, nan],
                    [ nan, nan, nan],
                    ...,
                    [ -3.532, -68.985, -11.899],
                    [ -2.183, -66.935, -9.411],
                    [ 0.915, -67.495, -7.528]]),
        'O': array([[ nan, nan, nan],
                    [ nan, nan, nan],
                    [ nan, nan, nan],
                    ...,
                    [ -2.45 , -69.148, -9.351],
                    [ 0.619, -66.816, -9.675],
                    [ 2.141, -69.607, -6.384]])},
    'name': '3wqc.A',
    'num_chains': 16,
    'seq': 'GHHHHHHAMSMQDTLLTLDTPAAVIDLDRMQRNIARMQQRMDAQGVRLRPHVKTSKSVPVAAAQRAAGASGITVSTLKEAEQFFAAGTTDILYAVSMAPHRLPQALQLRRRGCDLKLIVDSVAAAQAIAAFGREQGEAFEVWIEIDTDGHRSGVGADDTPLLLAIGRTLHDGGMRLGGVLTHAGSSYELDTPEALQALAERERAGCVQAAEALRAAGLPCPVVSVGSTPTALAASRLDGVTEVRAGVYVFFDLVMRNIGVCAAEDVALSVLATVIGHQADKGWAIVDAGWMAMSRDRGTARQKQDFGYGQVCDLQGRVMPGFVLTGANQEHGILARADGAAEADIATRFPLGTRLRILPNHACATGAQFPAYQALAADGSVQTWERLHGW',
}
Once the data is arranged properly, we can convert the NumPy arrays to PyTorch tensors before feeding them into the network.
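A rough sketch of that conversion step follows. It assumes an entry laid out like the dictionary above (a `coords` mapping with `N`, `CA`, `C` and `O` arrays), and it handles the `nan` rows by building a validity mask, which is one possible choice rather than the only one. The helper names `stack_backbone` and `to_tensors` are made up for this example; `to_tensors` needs PyTorch installed, while `stack_backbone` needs only NumPy.

```python
import numpy as np

def stack_backbone(entry):
    """Stack N, CA, C, O coordinates into (L, 4, 3) plus a validity mask of shape (L,)."""
    coords = np.stack([entry["coords"][atom] for atom in ("N", "CA", "C", "O")], axis=1)
    mask = np.isfinite(coords).all(axis=(1, 2))   # False where any atom is nan
    coords = np.nan_to_num(coords, nan=0.0)       # replace nan so tensors stay finite
    return coords.astype(np.float32), mask

def to_tensors(entry):
    """Convert the stacked arrays to PyTorch tensors (requires PyTorch)."""
    import torch
    coords, mask = stack_backbone(entry)
    return torch.from_numpy(coords), torch.from_numpy(mask)

# Toy entry with two residues; the first one has missing coordinates.
toy_entry = {
    "coords": {a: np.array([[np.nan] * 3, [1.0, 2.0, 3.0]]) for a in ("N", "CA", "C", "O")},
    "seq": "GH",
}
coords, mask = stack_backbone(toy_entry)
print(coords.shape, mask)
```

The mask can later be used to exclude residues with missing coordinates from the loss.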
Building the model
Our transformer model will have two important sub-networks:
- Encoder
- Decoder
Both the encoder and the decoder contain a self-attention mechanism, which is a scaled dot-product operation, and a position-wise feedforward network. Self-attention, the main component, lets the transformer retain important pieces of information that can be used as context for prediction, and it aggregates neighborhood information, while the feedforward network processes local information.
Multi-head attention, which the transformer computes in parallel, allows the model’s encoder to decompose the 3D structure by developing a sequence-independent representation, which the decoder then uses to predict the amino acid sequence. The prediction of each amino acid depends on the structure x as well as on the preceding amino acids.
The image above is a high-level representation of what the model looks like. The node embeddings, which represent the amino acid residues, are fed only to the encoder, while the edge embeddings, which represent the relationships between those residues, are fed to both the encoder and the decoder. This is the graph representation.
Node embeddings are created by calculating the k-nearest neighbors of each residue, followed by calculating the 3 dihedral angles of the protein backbone (φ, ψ, ω) and embedding them with sine and cosine.
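To make the attention step concrete, here is a minimal single-head sketch of scaled dot-product attention in NumPy, with an optional causal mask so that a position can only look at itself and earlier positions, as a decoder that conditions on preceding amino acids would. It omits the learned projections, multiple heads, and graph edges of the model described here, and the function name is chosen for this example only.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, causal=False):
    """q, k, v: (L, d) arrays. Returns (output, attention weights)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                           # scaled dot products
    if causal:
        # Position i may only attend to positions <= i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)          # softmax over keys
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))        # 5 positions, 8 features (toy values)
out, w = scaled_dot_product_attention(x, x, x, causal=True)
print(out.shape)
```

Each row of `w` sums to one, and with `causal=True` the entries above the diagonal are zero.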
Edge embeddings are created by concatenating the structural encoding with positional encoding.
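The dihedral part of the node features can be sketched as follows. This is an illustrative NumPy version, assuming per-residue N, CA and C coordinates of shape (L, 3) and the usual definitions of the three backbone dihedrals (commonly called phi, psi and omega); angles that are undefined at the chain ends are set to zero here, which is a simplification made for this example.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) defined by four points."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    v = b0 - np.dot(b0, b1) * b1          # component of b0 perpendicular to b1
    w = b2 - np.dot(b2, b1) * b1          # component of b2 perpendicular to b1
    return np.arctan2(np.dot(np.cross(b1, v), w), np.dot(v, w))

def dihedral_features(n, ca, c):
    """Return (L, 6) features: sin and cos of the three backbone dihedrals."""
    L = len(ca)
    angles = np.zeros((L, 3))             # undefined angles at chain ends stay 0
    for i in range(L):
        if i > 0:
            angles[i, 0] = dihedral(c[i - 1], n[i], ca[i], c[i])        # phi
        if i < L - 1:
            angles[i, 1] = dihedral(n[i], ca[i], c[i], n[i + 1])        # psi
            angles[i, 2] = dihedral(ca[i], c[i], n[i + 1], ca[i + 1])   # omega
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

rng = np.random.default_rng(0)
n, ca, c = (rng.normal(size=(4, 3)) for _ in range(3))   # toy coordinates
features = dihedral_features(n, ca, c)
print(features.shape)
```

Using sine and cosine instead of the raw angle avoids the jump between -π and π, since both ends of that range map to the same feature values.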
How to train a model for Protein Sequencing
Once your data and model are ready, you can start training. It is always advisable to train on GPUs. Initially, train a smaller model for 2 epochs so that the data passes through the whole pipeline. This confirms that the model works properly, after which you can increase its complexity.
When defining the training steps, make sure that you print important details such as the number of epochs completed and the number remaining. It is advisable to run the training and validation steps in the same loop and print their losses so that you can tell whether the model is underfitting or overfitting. You can also save the model’s output at certain epochs so that you can interpret what the model is doing.
Lastly, make sure to save the model’s parameters. You can save all the parameters, or only the best ones, i.e. those with the lowest validation error.
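The loop described above can be sketched in a framework-agnostic way. In the example below, `train_step` and `validate` are hypothetical callables standing in for a real model’s training and validation code, and the toy problem (fitting a single number) exists only so the loop can run end to end; in a real project the saved parameters would be written to disk.

```python
import copy

def fit(params, train_step, validate, epochs):
    """Run training and validation in one loop and keep the best parameters.

    train_step(params) -> (new_params, train_loss) and validate(params) -> val_loss
    are placeholders for a real model's code.
    """
    best = {"val_loss": float("inf"), "params": None, "epoch": None}
    history = []
    for epoch in range(1, epochs + 1):
        params, train_loss = train_step(params)
        val_loss = validate(params)
        history.append((train_loss, val_loss))
        print(f"epoch {epoch}/{epochs} ({epochs - epoch} remaining) "
              f"train={train_loss:.4f} val={val_loss:.4f}")
        if val_loss < best["val_loss"]:            # keep the best parameters so far
            best = {"val_loss": val_loss,
                    "params": copy.deepcopy(params),
                    "epoch": epoch}
    return best, history

# Toy problem: move w towards 3.0 by gradient descent on (w - 3)^2.
def toy_train_step(w, lr=0.1):
    loss = (w - 3.0) ** 2
    return w - lr * 2 * (w - 3.0), loss

def toy_validate(w):
    return (w - 3.0) ** 2

best, history = fit(0.0, toy_train_step, toy_validate, epochs=5)
```

Comparing the two columns of `history` is a quick way to spot overfitting: the training loss keeps falling while the validation loss starts to rise.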
How to interpret protein sequencing
Deep learning algorithms are hard to interpret. Bear in mind that a neural network is a powerful regression model that approximates a relationship between the input and target variables, and the parameters it stores hold a lot of information about the model’s behaviour.
Although interpretability is still an area of active research, we can find valuable insights by visualizing the latent variables and exploring the embeddings, since they have learnable parameters. As mentioned before, saving a few outputs during training can help you understand what the model is doing. Also make sure to plot graphs of training and validation error and accuracy.
You should also train the model on various datasets from the same domain. This exposes the model to different distributions and lets you better understand its behaviour.
How to implement Protein Sequencing
The basic steps to implement a deep learning model for protein sequencing are:
- Data collection
- Data collection matters in any DL project, and so does the source from which you collect the data. Unlike other data, protein data is not readily available, which makes it difficult to obtain. But if you know the right place to look, you can obtain reliable and well-curated data. The Protein Data Bank is the best place to start. Also keep in mind to separate the data into training, validation and testing sets before performing any sort of feature engineering; this prevents data leakage.
- Feature engineering
- Feature engineering is a very important step, and it can consume 60% to 80% of the time in a project. Feature engineering is all about structuring the data so that your DL model can process it easily, which includes extracting patterns, mapping inputs to outputs, and being able to visualize the data. If the data is not collected from a good source, feature engineering can be a tedious task. So make sure to collect reliable data and perform feature engineering based on the problem statement and the output you want your model to yield.
- Building the model
- Building the model can be critical. Your experiment can go wrong if the correct evaluation measures are not used. It is always good to build a small model first and then add complexity. In this case, start with a single multi-head attention layer and then increase the complexity.
- Training and validation
- Some of these models can be computationally expensive, so try to train them on GPUs. Make sure to measure the loss and save the best model parameters, so that if the loss starts to increase during training you will still have the best parameters to use for testing and inference.
- Testing or inference
- Testing is validating the performance of the model on unseen data. If the model is not performing well, you can iterate over the whole process again, starting from the data distribution.
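The split recommended under data collection can be done before any feature engineering with a few lines of standard-library Python. The sketch below shuffles with a fixed seed for reproducibility; the fractions, the seed, and the identifiers are illustrative. It does not keep related proteins in the same split, which a real protein dataset may need in order to avoid leakage between similar structures.

```python
import random

def split_dataset(items, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once with a fixed seed, then cut into train/validation/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

# Hypothetical identifiers standing in for protein chains.
ids = [f"chain_{i}" for i in range(100)]
train, val, test = split_dataset(ids)
print(len(train), len(val), len(test))
```

Any statistics used for feature engineering, such as normalization constants, should then be computed on the training list only.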
Best practices of Protein Sequencing
Some best practices for protein sequencing are:
- Read at least three research papers. A good place to find papers and related code is Papers with Code, where you can find the latest papers and their code. This step will help you isolate the problem you are working on, so that you do not wander around the internet looking for starter code.
- Break the problem into components. In the previous section I showed you five stages of implementation; use them. That will let you focus on one problem at a time and allocate time and resources for your project.
- Take time to understand the code. Most of the time we tend to just copy code from a GitHub repo without understanding it. Understanding the code helps you figure out how the process works and makes it easier to explain. Not to mention that you can simplify the code or combine techniques and libraries to create a new approach that is not yet available on the internet.
- Focus on visualization. Most engineers don’t spend time creating visualization functions. Visualization can help you interpret both the data and the model.
- Create different modules for different tasks. This improves readability and makes the code easier to navigate. Also make sure to document everything well.
- Use a Jupyter notebook for experimentation.
- Start small and then gradually increase complexity.
- Play with hyperparameters. It is always good to try different hyperparameters, especially the learning rate and the number of heads or the depth of the model.
- Save the model. Never forget to save your model. If you forget, you have to train the model again, which can be annoying.
Real-life applications of Protein Sequencing
Protein sequencing is urgently needed in the medical field. With each passing day new proteins are being discovered, some of which belong to viruses and bacteria. With deep learning methods such as this one, you could predict primary amino acid sequences that can be used to:
- Discover new drugs
- Understand diseases
- Cure diseases like HIV, Alzheimer’s disease, Parkinson’s disease, type 2 diabetes, SARS-CoV-2 infection, et cetera.
- Create responses to future pandemics
- Transfer the finding to the area of DNA and RNA studies
Final Thoughts
Protein sequencing is a fascinating and interesting area to explore with deep learning. In the current era, education is available on the internet and you can learn and build almost anything. All you need is to understand the problem and come up with a solution. The solution can start as a small idea, but it will push you to research, broaden your perspective, and iterate on your idea.
The deep learning and data bank communities have open-sourced the resources we need to build powerful AI algorithms for hard real-world problems. This article was intended to show you what protein sequencing is and how you can use deep learning to predict amino acid sequences.
I hope you got a chance to learn something new. The code is available on my GitHub repo; you can either download the whole repo or use Google Colab to execute the code. Try to understand the model and come up with your own.
Happy learning!!!
Resources
- Harvey Lodish, Arnold Berk, Paul Matsudaira, Chris A. Kaiser, Monty Krieger, Matthew P. Scott, Lawrence Zipursky, James Darnell – Molecular Cell Biology-W. H. Freeman (2008)
- Protein Data Bank
- CATH DB
- AlphaFold: a solution to a 50-year-old grand challenge in biology
- AlphaFold
- Deep Learning in Protein Structural Modeling and Design
- Generative models for graph-based protein design
- Biological Structure And Function Emerge From Scaling Unsupervised Learning To 250 Million Protein Sequences
- Attention Is All You Need
- Geometric foundations of Deep Learning
- Papers with Code: Protein Design
- The Size of the Human Proteome: The Width and Depth
- UniProtKB/Trembl Protein Database Release 2021_02 Statistics