Prediction of the next mutation in Hemagglutinin protein of Influenza-A virus using the variation pattern analysis

Ravi Kant Pathak. Every year certain treatment strategies are developed to combat influenza that too only after the epidemic hits the population. The major setback in designing the treatment strategy is due to the variation in the surface antigenic determinants (Hemagglutinin and Neuraminidase) of the virus. In this work, the position specific contribution of an amino acid in the variation of the Hemagglutinin protein has been derived. Multiple sequence alignment of non-redundant sequences of Hemagglutinin from different strains has been used to derive a position specific weighted probability score matrix. The next-in-line variation in the subtype of the Hemagglutinin protein of the influenza A virus has been predicted using the calculated score matrix. Although the prediction has been accomplished with an average accuracy of 60%, the accuracy can still be improved. This strategy may be proven to be useful to design a drug before the outburst of the disease.


Ravi Kant Pathak.
Every year certain treatment strategies are developed to combat influenza that too only after the epidemic hits the population. The major setback in designing the treatment strategy is due to the variation in the surface antigenic determinants (Hemagglutinin and Neuraminidase) of the virus. In this work, the position specific contribution of an amino acid in the variation of the Hemagglutinin protein has been derived. Multiple sequence alignment of non-redundant sequences of Hemagglutinin from different strains has been used to derive a position specific weighted probability score matrix. The next-in-line variation in the subtype of the Hemagglutinin protein of the influenza A virus has been predicted using the calculated score matrix. Although the prediction has been accomplished with an average accuracy of 60%, the accuracy can still be improved. This strategy may be proven to be useful to design a drug before the outburst of the disease.

Introduction:-
Influenza A virus has been found to cause the most severe disease in human and have been reported to cause pandemics when crossed the species barrier (Klenk et. al., 2008). Since its first appearance in Spain during 1918-19, it has challenged human population in USA during 1957-58, Hong kong in 1968 (NIAID, NIH, 2011)and over 74 countries in 2009 (WHO, 2013). This virus is empowered with a unique structure and variation in the expression of surface proteins, which imparts a perfect self-defense mechanism. The viral envelop, mainly constituted of 2 types of glycoproteins, Hemagglutinin (HA) and Neuraminidase (NA), protect the core RNA genome which is segmented in nature (Bouvier et. al., 2008). There are 18 different types of HA reported till date (Tong et. al., 2013). HA has been reported to function in recognition of target host cells, and to facilitate the entry of the viral genome into the target cells (Whiteet al.,1997). This makes it the most important and primary target of neutralizing antibodies (Throsby et. al., 2008, Ekiert et. al., 2009, Sui et. al., 2009and Corti et. al., 2011. Drift from one strain to another probably depends on point mutation, which might change antigenic determinants, while the region which does not play significant role in triggering the immune response remains highly conserved between different subtypes. These single base substitutions give rise to the diversity in the pathogen (Willy et. al., 1980). Among the various strategies developed till date to solve this problem, tert-butylhydroquinone (TBHQ) (Russell et. al., 2008), neutralizing human antibodies (nABs) (Ekiert et. al., 2009), peptides (Xintian et. al., 2013) and vaccines (Chen et. al., 2011, Anne et. al., 2013 are some. However due to the high level of variability from one season to another season in the strain, these strategies have gained limited success (Ekiert et. al., 2009). The major challenge is to know which strain is going to be prevailing in the coming season, to be prepared with the defense strategy. Current work intends to address this problem and to device a method to predict the next variation in the protein.

Material and Methods:-
Data collection:-Available protein sequences of all the types of HA were taken from PDB (Berman et. al. 2000) with the keyword Hemagglutinin. Further refinement has been done with experimental method-X-RAY AND taxonomy-VIRUS AND release date between 01-01-2010 up to 31-07-2015. The results were downloaded in the form of fasta files.
Cluster Analysis and Redundancy Check:-Redundancy in the data obtained from PDB has been removed using CD-HIT (Li et. al., 2006). CD-HIT clustered the input protein sequences based on the identity of the characters. The value for identity percentage has been kept as 100 to cluster the duplicate entries. One representative sequence from each cluster has been derived for further analysis. This has been done to reduce the possibility of biasness of the analysis towards any specific type.

Multiple Sequence Alignment and Block Identification:-
Multiple Sequence Alignment (MSA) (Sievers et. al., 2011) was carried out on the representative sequences derived from CD-HIT. The results were visualized in Jalview ( Waterhouse et. al., 2009). The consensus sequence observed from MSA has been stored for further reference. An un-gapped block of all the protein sequences has been identified from the MSA.

Formulation of a Weighted Probability Score Matrix:-
A position specific 2-D weighted probability score matrix was formulated based on a score value which is obtained by the product of probability of occurrence and weight of each amino acid at that position in the un-gapped block derived from MSA. The method for calculation of this matrix is devised based on the concept of sequence logo (Crooks et. al., 2004). A stack is calculated for each of the positions in the alignment data of proteins. The frequency of occurrence of the amino acids in that position is taken into consideration. Procedure: 1. In every row I (or the y axis) the number of distinct amino acid, n j (distinct{aa}) and their frequencies, f(aa i ), was counted. 2. In every column J (or the x axis) there is the increasing order of positions of the block. 3. Probability is calculated as the ratio of number of occurrence of a particular amino acid over the total number of amino acids in jth column.
………………….…(1) P is the probability of occurrence of amino acid 4. Weight is calculated as inverse of the number of distinct amino acids in the jth column …………(2) W is the weight of each position. 5. In every cell (i,j) following formula is applied: Based on the above procedure, a 20 X (number of positions in the identified block) 2-D score matrix was calculated, it was then used a base in the prediction algorithm.

Identification of critical positions for prediction:-
Global pairwise alignment (Needleman and Wunsch, 1970) using EMBOSS NEEDLE (Rice et. al., 2000) was performed on the sequence for which next prediction is to be made and the consensus sequence that has been obtained during the MSA of all the non-redundant protein sequences of HA. The purpose of performing a pairwise alignment is to identify the significant positions having similarity in terms of function, structure or evolution.

Statistically Predicted Output:-
For each of the positions obtained after the pairwise alignment of the input sequence with the consensus sequence, the corresponding amino acid for that position is stored in the database. The values in the score matrix are updated for every position being predicted. Out of the updated values, the amino acid with maximum chance of occurrence is identified based on the calculated score. The amino acid showing a greater chance of occurrence is concluded to the amino acids that will occur next in the chronology. Weighted probability score matrix:-For every position of the un-gapped block a weighted probability score has been calculated with respect to every amino acid. The same procedure is followed for each of the 164 positions and the complete 20 X 164 score matrix is obtained. Example of the score matrix for 1 st position of the conserved block has been shown in the following table: . Since the H5 subtype has been expressed earlier than H7, sequence of H5 has been chosen to be the input to the methodology and the predicted output is expected to be of strain H7.
Predicted output:-Pairwise alignment of input sequence and consensus sequence was carried out using EMBOSS-NEEDLE with default parameters. The aligned positions were then processed through the prediction algorithm and the predicted sequence was obtained.

Accuracy:-
The method to check the accuracy of the prediction has been chosen such that a global pairwise alignment of the predicted sequence is performed with the actual sequence P' in the phylogeny of the representative sequences. P' represents that sequence in the phylogeny that stands next to the input sequence. The prediction algorithm works with the accuracy as shown in the table below: Using this method, upon calculation of an average of 10 completely random protein sequences a similarity of 60% and an identity of 53% percent was observed.

Conclusion:-
79 non redundant representative sequences have been used to perform MSA and based on the position specific weighted probability score, which represents the variation effect in the due course of evolution, a methodology has been designed to predict the next in line subtype. Although the accuracy of the method has been calculated as 60% (based on similarity), scope of improvement still lies open. The accuracy if could be increased further, it can be implemented to other viral diseases also, in which the viral pathogen adopts the same strategy of variation to bypass the immune system without getting identified. This includes HIV-AIDS, SIV and other diseases. Future endeavour would be to increase the accuracy and develop a prediction tool based on the developed methodology. The output of the prediction tool shall be helpful in designing drugs/vaccines which can be effective against any subtype.