Essential Protein Identification from Protein Networks Using Topological and Biological Properties

Essential proteins play a decisive role in the survival of cellular organisms. The need to identify essential proteins is growing because of their large contribution to drug analysis and synthetic biology. Centrality methods were the first to be used for identifying essential proteins. Because these methods are highly sensitive to network accuracy, more efficient methods that include biological properties were developed. Cellular localization, biological process, gene expression and domains are some of the biological properties studied along with the network properties to predict essential proteins. This survey studies the various methods used to predict essential proteins and compares their performance.

The human body can be considered a building made of bricks, where the bricks could be cells, bones, nutrients or even proteins. Proteins are large biomolecules made up of long chains of amino acids. Proteins can be classified as essential or non-essential, and this study focuses on identifying essential proteins. Essential proteins are those that are indispensable for the survival of the organism. In synthetic biology, their discovery is also helpful in drug design (Clatworthy et al) and in identifying anomalies that cause various diseases (Furney et al., 2006).
The existing experimental approaches for identifying essential proteins are time consuming and expensive compared to computational methods. Experiments that biologists have used include single gene knockouts (Giaever et al., 2002), RNA interference (Cullen et al., 2005) and conditional knockouts (Roemer et al., 2003). Thanks to high-throughput technologies, which have generated huge amounts of PPI data, essential proteins can now be analyzed at the network level. Jeong et al. (2006) were the first to predict the essentiality of a protein from the lethality caused by disrupting the links between highly connected proteins. Most later studies relied on this observation to identify essential proteins, which paved the way for the concept of centralities. Inspired by this, other researchers added biological information to the topological information.
Since there is a large body of work on identifying essential proteins, it is time to analyze and tabulate it for future work. A comparison study is needed to determine the best among the methods. This paper is arranged as follows: Section 1 gives the introduction and the motivation behind this survey. Section 2 discusses the various methods used and compares them. Sections 3 and 4 conclude the survey and discuss possible future work.

Literature Survey: -
Developments in high-throughput techniques have generated large amounts of PPI data. Because of spurious and missing interactions, PPI data are not highly reliable. Both network and biological properties can be used to predict the essentiality of proteins. In the early stages only network properties were used, and the problem was that such predictions are highly sensitive to network accuracy. To obtain more accurate predictions, researchers started to include biological properties. The prediction methods can therefore be classified by the topological and biological properties they use.

Centrality Methods and Essential Proteins: -
A protein interaction network can be represented as a graph G(V, E), where V represents the set of proteins and E represents the set of edges between pairs of proteins (Wang et al., 2014). An edge between two vertices u and v can be represented as e(u, v). Figure 1(B) displays an example of a yeast protein-protein interaction network (Gursoy et al., 2008), and a small subgraph illustrating a hub protein (node A in Figure 1A). In graph theory, centrality measures identify the most important vertices in a network. By the same reasoning, knocking out a central node of a protein network is expected to have a lethal effect. This defines the concept of the "Centrality-Lethality rule" (Xionglei et al., 2006). These central nodes can act as triggers for signaling pathways in many diseases (Abedi et al., 2015). In (Jeong et al., 2001) the evolutionary rate of an organism was studied using the gene essentiality concept, and central nodes were found to have a slower evolutionary rate. A growing body of research has focused on predicting gene essentiality using network properties and biological features, and consequently many computational methods have been developed.
The simplest of all centrality methods is Degree Centrality. It uses the basic concept of the degree of a node: the number of proteins that interact with a given protein.
The basic procedure for predicting essential proteins is the same for all the centrality methods. In the case of degree centrality, first calculate for every protein the number of proteins that interact with it. Then rank the proteins by degree, from highest to lowest. Using some sampling method, sample the dataset and predict the results. Whatever metric is used to predict essentiality, the top n percent of the ranked proteins are assumed to be essential and the remaining ones non-essential. Degree centrality is calculated as
$$DC(u) = \sum_{v} a_{u,v}$$
where $a_{u,v}$ is 1 if there is an edge connecting node $u$ and node $v$, and 0 otherwise.
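The ranking procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in any of the surveyed works: the toy edge list, the function names (`build_adjacency`, `degree_centrality`, `top_fraction`) and the 20 percent cut-off are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_adjacency(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from a list of interactions."""
    adj: Dict[str, Set[str]] = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-interactions in this sketch
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_centrality(adj: Dict[str, Set[str]]) -> Dict[str, int]:
    """DC(u) = number of proteins that interact with u."""
    return {u: len(neighbours) for u, neighbours in adj.items()}


def top_fraction(scores: Dict[str, float], fraction: float) -> List[str]:
    """Rank proteins from highest to lowest score and keep the top fraction."""
    ranked = sorted(scores, key=lambda p: (-scores[p], p))  # ties broken by name
    k = max(1, round(len(ranked) * fraction))
    return ranked[:k]


# Hypothetical toy network: A is a hub, E hangs off D.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
adj = build_adjacency(edges)
dc = degree_centrality(adj)
predicted_essential = top_fraction(dc, 0.2)  # top 20 percent assumed essential
```

The same `top_fraction` step applies unchanged to any other centrality score, which is why the surveyed methods differ only in how the score itself is computed.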
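Another centrality method is Betweenness Centrality. For a vertex $v$, it sums, over all pairs of other vertices, the fraction of shortest paths between the pair that pass through $v$:
$$BC(v) = \sum_{s \neq v \neq t} \frac{\sigma(s,t \mid v)}{\sigma(s,t)}$$
where $\sigma(s,t)$ is the number of shortest $(s,t)$-paths and $\sigma(s,t \mid v)$ is the number of shortest $(s,t)$-paths passing through a vertex $v$ other than $s$ and $t$.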
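As an illustration of how the shortest-path counts σ(s, t) and σ(s, t | v) can be accumulated, the sketch below uses Brandes' accumulation scheme on an unweighted, undirected graph. The choice of algorithm, the toy adjacency map and the convention of counting each unordered pair {s, t} once are assumptions of this example, not details taken from the surveyed papers.

```python
from collections import deque
from typing import Dict, List, Set


def betweenness_centrality(adj: Dict[str, Set[str]]) -> Dict[str, float]:
    """Betweenness for an unweighted, undirected graph (Brandes-style)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds: Dict[str, List[str]] = {v: [] for v in adj}
        order: List[str] = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies from the farthest vertices.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair was visited from both ends, so halve the totals.
    return {v: score / 2.0 for v, score in bc.items()}


# Hypothetical toy network (same shape as a hub A with a tail D-E).
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}
bc = betweenness_centrality(adj)
```

In this toy network D has a low degree but a high betweenness, because every path to E runs through it.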
Closeness Centrality is more popular than the other two methods because it can predict more essential proteins. While Betweenness Centrality measures the influence a protein has on the communication between protein pairs, Closeness Centrality is based on the number of links in the shortest paths between a protein and the other proteins. It can be defined as
$$CC(i) = \frac{N - 1}{\sum_{j} d(i,j)}$$
where $N$ is the total number of proteins and $d(i,j)$ is the distance between protein $i$ and protein $j$. Sometimes proteins that have high betweenness but low connectivity form essential links in the network. Such HBLC (High Betweenness Low Connectivity) proteins have been predicted, and the effect of betweenness on evolutionary rate has been studied. However, that study could not clearly separate the effect of HBLC and non-HBLC proteins on the evolutionary rate, because the number of HBLC proteins is too small.
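A short sketch of the closeness computation follows, assuming the common normalisation CC(i) = (N - 1) / Σ d(i, j), an unweighted network, and a connected graph (unreachable proteins are simply ignored here, which is a simplification). The toy network and function names are hypothetical.

```python
from collections import deque
from typing import Dict, Set


def bfs_distances(adj: Dict[str, Set[str]], source: str) -> Dict[str, int]:
    """Shortest-path length (in links) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def closeness_centrality(adj: Dict[str, Set[str]]) -> Dict[str, float]:
    """CC(i) = (N - 1) / sum of distances from i to the other proteins."""
    n = len(adj)
    scores = {}
    for i in adj:
        total = sum(bfs_distances(adj, i).values())
        scores[i] = (n - 1) / total if total > 0 else 0.0
    return scores


# Hypothetical toy network.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}
cc = closeness_centrality(adj)
most_central = max(cc, key=cc.get)
```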
Eigenvector centrality is not restricted to shortest-path calculations. The network is represented as an adjacency matrix corresponding to the connected subgraphs, together with its eigenvector values. This helps to portray the effect of each node on its neighbors. Since the matrix captures the effect of a protein on all the other proteins in the network, eigenvector centrality can be considered an extended centrality measure. If $R$ is the adjacency matrix, $\mathbf{x}$ is the eigenvector and $\beta$ is the eigenvalue, then Eigenvector Centrality is defined by
$$R\,\mathbf{x} = \beta\,\mathbf{x}$$
Among the methods developed so far using network properties, the reported results suggest that Edge Clustering Coefficient Centrality (NC) is the best one. First the edge weight is calculated as the product of the parameters used for evaluating the relationship between two proteins. In NC the parameters used were GO functional similarity (GE) (FastSemSim) and co-expression levels among genes (PCC). The edge clustering coefficient of an edge $(u, v)$ is
$$ECC(u,v) = \frac{z_{u,v}}{\min(d_u - 1,\; d_v - 1)}$$
where $z_{u,v}$ is the number of triangles in the network that include the edge $(u,v)$, and $d_u$ and $d_v$ are the degrees of node $u$ and node $v$. $NC(u)$ is defined as the sum of the ECC values of the directly interacting neighbours of node $u$:
$$NC(u) = \sum_{v \in N_u} ECC(u,v)$$
where $N_u$ is the set of neighbours of $u$.
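The edge clustering coefficient and the NC score can be illustrated with the sketch below. It assumes the usual form ECC(u, v) = z / min(d_u - 1, d_v - 1), and it sets ECC to 0 when that denominator is 0 (an edge to a degree-one protein), which is a convention chosen for this example. The toy network and function names are hypothetical.

```python
from typing import Dict, Set


def edge_clustering_coefficient(adj: Dict[str, Set[str]], u: str, v: str) -> float:
    """ECC(u, v): triangles through edge (u, v) over the maximum possible."""
    z = len(adj[u] & adj[v])  # each common neighbour closes one triangle
    denom = min(len(adj[u]) - 1, len(adj[v]) - 1)
    return z / denom if denom > 0 else 0.0  # assumed convention for denom == 0


def nc_centrality(adj: Dict[str, Set[str]]) -> Dict[str, float]:
    """NC(u) = sum of ECC(u, v) over the direct neighbours v of u."""
    return {
        u: sum(edge_clustering_coefficient(adj, u, v) for v in adj[u])
        for u in adj
    }


# Hypothetical toy network: triangle A-B-C plus a tail A-D-E.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}
nc = nc_centrality(adj)
```

On this toy network A has the highest degree, yet A, B and C receive the same NC score, because NC rewards membership in densely connected triangles rather than the raw number of partners.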

Drawback of centrality methods: -
All the centrality measures take network properties as their input. The problem with any PPI network, however, is that it is neither complete nor accurate: PPI data contain missing and spurious interactions (Mering et al., 2002). Reported estimates put the missing interactions at 43 to 71 percent for Y2H and 15 to 50 percent for TAP-MS, and the spurious interactions at 64 and 77 percent respectively (Edwards et al., 2002). From this it is clear that the reliability of PPI networks is not adequate. To overcome this problem, researchers included biological information along with the topological information, which led to the second category of methods listed in Table 1.

Essential Protein and Biological Properties: -
The drawbacks of topological properties led to the integration of biological properties into the prediction method.
After Hart and colleagues (Hart et al., 2007) pointed out the special connection between protein complexes and essential proteins, (Ren et al., 2011) used this concept, combined with network topology, to predict essential proteins. Essential proteins are said to be more conserved than non-essential proteins, and using that concept (Peng et al., 2012) developed an iterative method named ION that combines orthology with the PPI network. Their prediction results showed higher performance than the centrality methods. However, they did not compare their performance against other methods that use biological properties.
Whereas each of these methods considered only one biological property, a machine-learning approach relying on network topological features, cellular localization and biological process information was developed (Gustafson et al., 2006) for predicting essential genes. They used the J48 algorithm to generate a decision tree to prove the importance of their parameters in predicting essential genes. More importantly, they could use this decision tree to generate cellular rules governing essentiality.
Among the methods that use biological properties, the best results so far were obtained by two algorithms: PeC (Zhang et al., 2012) and UDoNC (Peng et al., 2015).
In PeC, gene expression profiles are used along with the edge clustering coefficient to predict the essentiality of a protein. Here the biological property is the gene expression profile and the topological property is the edge clustering coefficient. To measure performance, the authors considered only proteins from the DIP database and showed better performance than the other centrality methods. To measure the similarity of gene expression profiles they used the Pearson correlation coefficient (PCC).
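The way PeC combines the two properties, summing the product ECC(u, v) × PCC(u, v) over the neighbours v of a protein u, can be sketched as follows. The network, the expression profiles and all function names are made up for illustration, and setting ECC or PCC to 0 when their denominators vanish is an assumption of this sketch.

```python
import math
from typing import Dict, List, Set


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return cov / denom if denom > 0 else 0.0  # assumed value for flat profiles


def ecc(adj: Dict[str, Set[str]], u: str, v: str) -> float:
    """Edge clustering coefficient; 0 when no triangle is possible."""
    denom = min(len(adj[u]) - 1, len(adj[v]) - 1)
    return len(adj[u] & adj[v]) / denom if denom > 0 else 0.0


def pec(adj: Dict[str, Set[str]], expr: Dict[str, List[float]]) -> Dict[str, float]:
    """PeC(u) = sum over neighbours v of ECC(u, v) * PCC(u, v)."""
    return {
        u: sum(ecc(adj, u, v) * pearson(expr[u], expr[v]) for v in adj[u])
        for u in adj
    }


# Hypothetical network: triangle A-B-C with D attached to A.
adj = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
# Hypothetical expression profiles over four time points.
expr = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],  # perfectly co-expressed with A
    "C": [4.0, 3.0, 2.0, 1.0],  # anti-correlated with A and B
    "D": [1.0, 3.0, 2.0, 4.0],
}
scores = pec(adj, expr)
```

In this toy case A, B and C sit in the same triangle, but C is anti-correlated with both partners and therefore scores lowest, which shows how expression data can separate proteins that topology alone treats as equal.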
The new centrality measure $PeC(u)$ is defined as the sum of the products of ECC and PCC over the neighbours of $u$:
$$PeC(u) = \sum_{v \in N_u} ECC(u,v) \times PCC(u,v)$$
where $N_u$ is the set of neighbours of $u$.
UDoNC is the most recent method developed to predict essential genes, and it combines topological properties with protein domains. The protein domain is the basic building block of protein structure. A domain is associated with a particular function of a protein, or it can contribute to the protein's evolution. Sometimes similar domains perform different functions in different proteins, which means that one protein domain type can be present in more than one protein. Based on this fact, the UDoNC algorithm predicts essential proteins from the PPI data.
An example of a protein that contains multiple SH3 domains is the cytoplasmic protein Nck. Nck belongs to the adaptor family of proteins and is involved in transducing signals from growth factor receptor tyrosine kinases to downstream signal recipients. The domain composition of Nck is illustrated in Figure 2.1 below.

Figure 2.1: -Domain composition of Nck
According to UDoNC, a protein is considered essential if it consists of domains that rarely occur in other proteins, and non-essential if it consists of frequently occurring domains. The essentiality of a protein was defined in terms of the number of its protein domains and their frequency. The probability of protein u being essential was defined in Equation (9) in terms of NDT and SFD, where NDT and SFD are the number of domain types and the sum of the frequencies of the domains, respectively. Finally, UDoNC was calculated as the sum of the products of ECC and the weight of each edge. From their results, the authors of UDoNC concluded that their method is more efficient than the other prediction methods. However, there is still room for improvement.
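The two quantities NDT and SFD can be illustrated with a small counting sketch. It does not reproduce Equation (9) or the UDoNC score itself. It only assumes that the frequency of a domain type is the number of proteins in which it occurs, and the domain annotations and function name are hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def domain_statistics(
    domains: Dict[str, Set[str]]
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (NDT, SFD) per protein from a protein -> domain-types map."""
    # Assumed definition: frequency of a domain type = number of proteins
    # that contain it.
    frequency = Counter(d for types in domains.values() for d in types)
    ndt = {p: len(types) for p, types in domains.items()}
    sfd = {p: sum(frequency[d] for d in types) for p, types in domains.items()}
    return ndt, sfd


# Hypothetical annotations: SH3 is common, RareDom occurs only once.
domains = {
    "P1": {"SH3", "SH2"},
    "P2": {"SH3"},
    "P3": {"SH3", "Kinase"},
    "P4": {"RareDom"},
}
ndt, sfd = domain_statistics(domains)
```

Here P4 has the smallest SFD because its only domain is unique to it, which matches the intuition that proteins built from rarely occurring domains are more likely to be essential.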

Conclusion: -
This survey studied essential gene prediction methods. The works reviewed show that centrality methods, which use only topological properties, perform worse than methods that combine biological properties with topological properties. The large number of works in this field indicates its importance for human disease analysis and drug design.

Table 1: Essential Protein Prediction Methods

Sl. No   Topological Properties                    Biological Properties
1        Degree Centrality                         Gene Expression (PeC)
2        Betweenness Centrality                    Protein Domain (UDoNC)
3        Closeness Centrality                      Integration of cellular localization and biological process information
4        Eigenvector Centrality
5        Subgraph Centrality
6        Edge Clustering Coefficient Centrality

Future Work: -