Amino acid substitution model for rotavirus

Abstract iii

Acknowledgements iv

Declaration v

Table of Contents vii

Acronyms viii

List of Figures ix

List of Tables x

Introduction 1

1 Background 5

1.1 Sequence evolution . . . . . . . . . . . . . . . . . . . . . . . . 5

1.1.1 Heredity materials . . . . . . . . . . . . . . . . . . . . . 5

1.1.2 Evolution and homologous sequences . . . . . . . . . . . 6

1.2 Modeling sequence evolution . . . . . . . . . . . . . . . . . . . 9

1.2.1 Sequence alignment . . . . . . . . . . . . . . . . . . . . 9

1.2.2 General time-reversible substitution model . . . . . . . . 11

1.2.3 Model of rate heterogeneity . . . . . . . . . . . . . . . . 15

1.2.4 Available amino acid substitution models . . . . . . . . . 16

1.3 Phylogenetic trees . . . . . . . . . . . . . . . . . . . . . . . . . 17

1.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 17

56 trang | Chia sẻ: honganh20 | Lượt xem: 607 | Lượt tải: 2

Bạn đang xem trước 20 trang tài liệu Amino acid substitution model for rotavirus, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên

Q = (qxy) [19]. For sequences with a character set A, Q is a |A| × |A| matrix in which the value qxy at row x column y represents the instan- taneous rate character x is changed into character y for x 6= y. The value qxx is calculated such that the sum of every row equals 0. For the nucleotide substitution model, the Q matrix is as follow: Q =  −∑X 6=A qAX qAC qAG qAT qCA − ∑ X 6=C qCX qCG qCT qGA qGC − ∑ X 6=G qGX qGT qTA qTC qTG − ∑ X 6=T qTX  (1.2) Substitution models are usually assumed to be stationary, which means the fre- quency Π = (pix) of every available state does not change with time. The fre- quency vector Π and the instantaneous substitution rate matrix Q are dependent: ΠQ = 0 (1.3) In practice, the Q matrix is typically normalized such that the expected number of substitutions per site is one − ∑ X∈A piXqXX = 1 (1.4) Substitution models often employ the time-reversibility assumption, that is the relative substitution rates that x changes to y and that y changes to x are the same: pixqxy = piyqyx (1.5) Under the time-reversibility assumption, the Q matrix can be decomposed into two components: a relative substitution rate matrix R = (rxy) and a frequency vector Π = (pix) 11 qxy = piyrxy if x 6= y−∑u6=xQux if x = y (1.6) The R matrix is also called exchangeability coefficient matrix where the rela- tive substitution rate between x and y, or the coefficient for the substitution be- tween x and y is rxy = qxy piy = qyx pix = ryx (1.7) A so-called general time-reversible (GTR) substitution model is the model that employs all these following assumptions: • Markovian assumption: The evolution of character x is independent of its previous states. • Time-continuity: The substitution between states can occur at any time during the evolutionary process. • Time-homogeneity: The substitution rates do not change with time. • Stationarity: The frequency of all states are stable. • Time-reversibility: If two states have the same frequency, the rate x changes to y and the rate y changes to x are the same. More specifically, a GTR subtitution model for nucleotides can be described by the instantaneous substitution rate matrix Q Q =  −∑X 6=A qAX apiC bpiG cpiT apiA − ∑ X 6= CqCX dpiG epiT bpiA dpiC − ∑ X 6= GqGX fpiT cpiA epiC fpiG − ∑ X 6= TqTX  (1.8) where the exchangeability coefficient matrix is R =  − a b c a − d e b d − f c e f −  (1.9) 12 The GTR substitution model for nucleotides has 8 free parameters (3 parame- ters for the frequency vector Π due to its elements summing up to 1 and 5 param- eters for the exchangeability coefficient matrix R due to equation 1.4). The GTR substitution model for amino acids can be described in similar manner, however, with much more free parameters (208 free parameters to be precise). Calculate transition probability From the instantaneous substitution rate matrix Q, one can derive a transition probability matrix P(t) = {Pxy(t)} where Pxy(t) is the probability that character x changes to character y during the evolutionary time t. In case Q is normalized as equation 1.4, t is the duration an expected number of t substitutions occurs per site. The probability matrix P(t) is useful to construct phylogenetic trees using maximum likelihood method or to generate sequences for evolution simulation. P(t) is calculated as follow: P(t) = eQt = ∞∑ n=0 Qntn n! (1.10) The Q matrix for GTR model is diagonalizable [20], hence, P(t) can be calcu- lated efficiently using the decomposition of Q. Specifically, Pij(t) = |A|∑ v=1 Uvi × eλvt × U−1jv (1.11) where • |A| is the length of the character set (the number of possible states). • Λ = diag{λ1, λ2, . . . , λ|A|} is the |A| × |A| diagonal matrix corresponding to the eigenvalues λ1, λ2, . . . , λ|A| of Q • U = {u1, u2, . . . , u|A|} is the matrix of corresponding eigenvectors of Q and U−1 is its inverse. Estimate genetic distances The subsitution model is also useful to estimate genetic distance between homol- ogous sequences. The genetic distance d(X,Y) between two sequences X = 13 {x1, x2, . . . , xm} and Y = {y1, y2, . . . , ym} is the number of substitutions which have occured between X and Y. The naive approach to estimate genetic distance is to observe the difference between two sequences d(X,Y) = ∑m i=1 δ(xi, yi) m where δ(xi, yi) = 0 if xi = yi1 otherwise (1.12) The observed distance only reasonably represents the genetic distance when there is a small number of substitutions. When the substitution rate is high and the evolution period is long, the genetic distance can not be estimated properly from the observed distance due to hidden substitutions: • Multiple subsitutions: Two or more substitutions have occured at one position but at most one substitution is observed. For example: A changes to C, then to T but only one substitution from A to T is found by observation. If T is then changed to A, no substitution is observed (back substitution). • Parallel subsitutions: The substitution occurs in both sequences and no sub- stitution is observed. For example, the ancestral nucleotide is A and A is changed to C in both descendent sequences. The observed distance between descendants is 0 while the actual distance is 2. Having been modeled by the substitution rate matrix Q, the genetic distance d of two sequences can be estimated using the maximum likelihood (ML) approach with the following likelihood function [19] L(d) = m∏ i=1 pixiPxiyi(d) (1.13) The genetic distance d is estimated by the value d∗ that maximizes the above likelihood function L(d) d∗ = argmaxd≥0{L(d)} (1.14) 14 1.2.3 Model of rate heterogeneity The GTR substitution model as described above represents the evolution of one site in the analysed sequences. It is unrealistic that every site evolves according to the same model. For example, the third position in a codon usually has much faster substitution rates than the other two positions [21]. A model that has varying substitution rates among sites is called model of rate heterogeneity. Using different substitution models for different sites may cause overfit of the model to the data. Thus, many models of heterogeneous substitution rates have been proposed [22] [23]. A common type of rate heterogeneity model [22] is a two-state model that categorizes sequence sites as either invariable or variable. For every site i, there is a substitution rate scaling factor ri = 0 if site i is invariable1 otherwise (1.15) That is, some sites in the sequences never change while the other sites evolves according to one substitution model. Differentiating whether sites are variable or invarible is impossible because it is possible that variable sites in the dataset are happen to be unvaried or some sites undergo back substitutions. One more parameter θ is introduced to represent the percentage that sites are invariable. Another widely-used model of rate heterogeneity is Gamma-distributed rate heterogeneity model [23]. In this type of model, a Γ-distribution with expectation 1.0 and variance 1/α, α > 0 is used from which substitution rate scaling factors across sites are drawn f(r) = ααrα−1 exp(αr)Γ(α) where, Γ(α) = ∫ ∞ 0 e−ttα−1dt (1.16) One can adjust the rate heterogeneity by adjusting the α parameter. α is called the shape parameter of the Γ-distribution. Generally speaking, the smaller the 15 shape parameter is, the stronger the rate heterogeneity among sites becomes. For example, setting α = 0.5 means that most sites hardly undergo substitutions but some other sites have high substitution rates. In contrast, if α = 5, sites have highly similar substitution rates (the scaling factor of most sites are approximately 1.0). A discrete Γ-function can be used instead of its continuous version. In dis- crete Γ-distributed rate heterogeneity model, a parameter c which is the number of substitution rate scaling factor categories r1, r2, . . . , rc is introduced [24]. A hybrid model which integrates the two-state model and the Γ-distribution model is also a common model used in practice. In this type of model, some sites are invariable according to the parameter θ while the other sites are variable with the substitution rate scaling factors drawn from the discrete Γ-function with the shape parameter α and the number of rate categories c. The parameter c is typically set in advance while the invariant percentage θ and the shape parameter α are estimated from data. The estimation of these extra parameters has been implemented in some phylogenetic packages such as PHYML [25] or IQ-TREE [26]. 1.2.4 Available amino acid substitution models Amino acid substitution models have much more free parameters in comparison with nucleotide counterparts. Parameters are usually estimated from data, there- fore, substitution models for amino acids are called empirical substitution models. Different methods are proposed to estimate amino acid substitution models. Dayhoff model [5] was the first model describing the process of amino acid substitutions. The model was estimated using a counting method with 71 sets of closely related proteins and an observation of 1572 substitutions. The counting method was pioneering at the early stage of substitution model research. JTT model was estimated using the same counting method [6]. JTT, however, used a larger protein datasets in the estimation process. BLOSUM62 model [27] also employed counting approach but only alignments that were similar at least 62% were used to count amino acid substitutions. The drawback of counting method is that it is only applicable for highly related 16 sequences. A later research [28] introduced the resolvent method and estimated a VT model from various types of protein sequences. The maximum likelihood method was used to estimate mtREV model [7] based on 20 complete vertebrate mtDNA-encoded protein sequences. WAG model [8] was estimated using an approximate maximum likelihood approach which was based on 3,905 globular protein sequences from 182 protein families. LG model [4] incorpated the rate heterogeneity across sites into the estimation process using protein sequences from the Pfam database. The LG model was shown to be better than the previous models in terms of contructing maximum likelihood phyloge- netic trees. Although many models have been estimated from large and diverse databases of protein sequences from a wide range of species (these models are often referred to as general models), recent studies have showed that they might be inappropriate to represent the evolutionary process of some specific species. A number of species- specific amino acid substitution models have been introduced such as the rtREV model [9] which are used for inference of retrovirus and HIV-specific models [10] (HIVb, HIVw) estimated from HIV data. Recently, FLU model [11] was estimated for Influenza virus which is highly different from widely used existing models. Experiments showed that species-specific models better characterizes the evolutionary process of the corresponding species sequences. 1.3 Phylogenetic trees 1.3.1 Overview Phylogenetic trees are trees that represent the evolutionary relationships among sequences. Phylogenetic analysis typically use binary trees, either rooted or un- rooted, to represent phylogeny. The sequences of interest are the leaf nodes while the internal nodes represent divergence events, which means an ancestral sequence is splitted apart resulting in 2 descendent sequences. Relationships between se- quences are described as branches of the tree. Two sequences are called more related if they have more recent common ancestor. Figure 1.1 is an example of a phylogenetic tree for 5 sequences A, B, C, D, E. In this example, A and B are 17 more related than A and C because the common ancestor of A and B is more re- cent than the common ancestor of A and C. Each branch of a phylogenetic tree can be assigned a value, called branch length, to represent the genetic distance between connected nodes. Figure 1.1: A sample phylogenetic tree of 5 sequences. There are many different binary trees to represent the evolution of the same set of sequences. Without regard for branch lengths, the numbers of possible rooted and unrooted binary trees for a set of n sequences increase exponentially with n. The number of binary unrooted trees B(n) for n sequences is B(n) = n∏ i=3 (2i− 5) (1.17) The number of binary rooted trees Br(n) for n sequences is Br(n) = n∏ i=3 (2i− 3) (1.18) For n = 3 the number of unrooted trees is 1. For n = 10 that number increases to 2, 027, 025. 1.3.2 Phylogenetic tree reconstruction Three optimization criteria are often used to reconstruct phylogenetic trees from homologous sequences: minimum evolution (ME), maximum parsimony (MP) and maximum likelihood (ML). 18 The minimum evolution (ME) method is used under the assumption that the tree with the smallest total branch length value is most likely to be the true tree. There has been a mathematical foundation supporting this assumption [29]. The ME method is a distance-based method that utilizes a pairwise distance matrix D = (duv) where duv represents the distance between two sequences u and v. Typically, genetic distances are used. Since finding the best tree according to the ME criterion is a NP-complete problem [2], various approximate algorithms have been proposed. Neighbor-Joining method [30] is one popular strategy to reconstruct ME phylogenetic tree. The maximum parsimony (MP) method objective is to find the tree that has the smallest number of substitutions. The MP method is a character-based method that involves two steps: (i) reconstructing the evolutionary events such that the total branch length of the tree is minimized and (ii) searching the tree space for the tree having the minimum score found in the first step. The minimum number of substitutions can be calculated efficiently by various methods [31] [32] [33]. Identifying the MP tree, however, is an NP-complete problem [34]. Heuristic tech- niques have been proposed to reduce computational burden such as hill-climbing searches based on local tree rearrangement operations [35]. The maximum likelihood (ML) method is a character-based method that finds the most likely tree that can explain the data. Let H = (T,M) be a hypothe- sis that explains the observed multiple sequence alignment of n sequences D = {D1, D2, . . . , Dn} where T is the phylogenetic tree and model M describes the sequence evolution. The function to maximize is the likelihood function of hy- pothesis H: L(H) = Pr(D|H). (1.19) Because of the assumption that sites evolves independently, equation 1.19 be- comes: L(H) = m∏ i=1 Pr(Ai|H) (1.20) where Ai is the column i in the multiple sequence alignment. The tree in the hypothesisH∗ that maximizes the likelihood function is the best estimate of true tree according to maximum likelihood criterion. 19 H∗ = argmaxHL(H) (1.21) Finding the maximum likelihood hypothesis H∗ is an NP-hard problem [36]. Many heuristic approaches have been proposed and are implemented in phyloge- netic packages such as PHYML [25] and IQ-TREE [26]. 1.3.3 Robinson-Foulds distance The Robinson-Foulds (RF) distance is a popular metric to compare topology of two trees [37]. The Robinson-Foulds distance between two trees is based on the concept of bipartitions. When an interior edge of a tree T with a leaf set of L is re- moved, two disjoint leaf subsets LA and LB are obtained and is called a bipartition of a tree T , denoted as LA|LB. If T is unrooted, there are n− 3 interior edges in total, hence, n−3 bipartitions. For example, by removing interior nodes in tree T1 in figure 1.2 we obtain bipartitions {1, 2}|{3, 4, 5} and {1, 2, 3}|{4, 5}. Similarly, tree T2 in Figure 1.2 has two bipartitions {1, 3}|{2, 4, 5} and {1, 2, 3}|{4, 5}. The Robinson-Foulds distance between two trees is defined as the number of bipartitions that implied in one of the two trees but not the other. For n sequences, RF distance is usually standardized by dividing by 2n−6 which is the total number of possible bipartitions. The standardized RF distance between two trees ranges from 0.0 to 1.0. The smaller value of standardized RF distance indicates the more similarity between the topologies of the two trees. More specifically, when two trees are identical, the standardized RF distance is 0 and when the two trees do not share any bipartitions, that distance is 1. 1.3.4 Phylogenetic hypothesis testing The frequentist hypothesis testing can be used to test if two phylogenetic trees are significantly different. The null hypothesis for this test is: H0: If the number of data is infinite, the two trees T1 and T2 would explain the data equally well. The test statistic is the difference between log-likelihood of two trees: δ(T1, T2|X) = 2[lnL(T1|X)− lnL(T2|X)] (1.22) 20 Figure 1.2: Two trees T1 and T2 describe the same set of 5 sequences {1, 2, 3, 4, 5} but have different topologies. Tree T1 has two bipartitions {1, 2}|{3, 4, 5} and {1, 2, 3}|{4, 5} while tree T2 has two bipartitions {1, 3}|{2, 4, 5} and {1, 2, 3}|{4, 5}. Bipartition {1, 2}|{3, 4, 5} is present only in T1 whereas bipar- tition {1, 3}|{2, 4, 5} is present only in T2. Bipartition {1, 2, 3}|{4, 5} occurs in both T1 and T2. The standardized Robinson and Foulds distance between T1 and T2 is 2/4 and the expectation under null hypothesis is: EH0[δ(T1, T2|X)] = 0 (1.23) If we computed δˆ as the empirical difference in log-likelihood between two trees and assume α as the significance level of the test, then the two trees are con- sidered significantly differennt when the probability to observe this result under the null hypothesis (often called p-value) is smaller than the significance level: Pr(|δ(T1, T2|X)| ≥ δˆ|H0 is true) < α (1.24) Kishino and Hasegawa [38] suggested an approximate test, known as the K- H test, for comparing two candidate phylogenetic trees. The difference in log- likelihood for site i is δ(T1, T2|Xi) and the test statistic is the total difference between the two trees across all sites: 21 δ(T1, T2|X) = n∑ i=1 δ(T1, T2|Xi) (1.25) The variance of δ(T1, T2|X) can be calculated by applying a normal approxi- mation to the estimated log-likelihoods or by using bootstrap method which is to generate a lot of replicates from the original data and look at their variance. The K-H test is only valid if the two compared models or trees are specified in advance. The more common approach is to estimate the ML tree from the data and then test every other tree against the ML tree. However K-H test has the tendency to cause false rejections of non-ML trees [39]. In this case, the test is said to suffer from a selection bias, since the ML tree to be tested is selected or identified from the data, which are then used to conduct the test. Shimodaira and Hasegawa [39] developed another test, known as the S-H test, that corrects for the selection bias. The test is constructed by considering the worst-case scenario, that is, under the null hypothesis all members of the candi- date set have the same expected score. Consider Tˆ is the ML tree in the set, for each other tree Ti, the difference in log-likelihoods of the ML tree and the tree i δ(Tˆ , T |X) is calculated. Bootstrap method is often used to generate replicates of the data. For each tree i, the mean log-likelihood over all bootstrap replicates lnL(Ti|X∗) is used to center the bootstrapped collection of log-likelihoods: cji = lnL(Ti|Xj)− lnL(Ti|X∗) (1.26) For each bootstrap replicate j, the highest value from the centered distributions is kept: mj = max i cji (1.27) One sample for each tree and replicate is generated: δji = m j − cji (1.28) The p-value for tree Ti is approximated by the proportions of bootstrap replicates for which: δji ≤ δ(Tˆ , Ti|X) (1.29) The evolution of sequences, its characteristics that enables it to be modeled as an amino acid substitution model and the methods for phylogenetic tree recon- struction and comparison that have been addressed in this chapter are the basis of 22 our experiments that we will provide an thorough investigation in the next chap- ters. 23 Chapter 2 Method This chapters gives a detailed explanation on the method we used to model the evolution of rotavirus as well as the process to estimate rotavirus amino acid sub- stitution from rotavirus protein data. 2.1 Modeling method We use a general time-reversible model to represent the evolution of rotavirus. Such a model is characterized by a 20× 20 instantaneous substitution rate matrix Q = (qxy) where qxy (x 6= y) is the substitution rate from amino acid x to amino acid y and qxx is calculated such that the sum of every row equals zero. The model assumes the following properties of the evolution process: • Sites evolve independently and their substitution rates are independent of their past. • The process is time continuous, time homogeneous, time reversible and sta- tionary. Therefore, the instantaneous substitution rate matrix Q can be inferred from two components: a frequency vector of amino acids Π = (pixy) where pix is the frequency of amino acid x and a symmetric exchangeability coefficient matrix R = (rxy) where rxy is the relative substitution rate between two amino acids x and y 24 qxy = piyrxy, x 6= y qxx = − ∑ y 6=x qxy (2.1) The matrix Q is usually normalized so that it represents the instantaneous sub- stitution rates among amino acids when the expected number of substitutions per site is 1. The normalized version Q˙ of Q is obtained by dividing the matrix by a normalization term µ: Q˙ = 1 µ Q = ( qxy µ ) µ = − ∑ x pixqxx (2.2) Because sites evolve independently, the likelihood of the dataD for a given tree T and model Q is L(T,Q;D) = ∏ i L(T,Q;Di) (2.3) We also incorporate the rate heterogeneity into the model. More specifically, we allow the hybrid form of a two-state model with the invariant site percentage parameter θ and a discrete gamma-distributed rate model with the shape parameter α. Each site belongs to a rate category c ∈ 1, 2, . . . , C with rate ρc. There is one special category with zero rate for invariable sites. The likelihood of the data D for a given tree T and rate heterogeneity model is L(T,Q, α, θ;D) = ∏ i [θL(invariant;Di) + (1− θ ∑ 1≤c≤C 1 C L(T, ρcQ;Di)] (2.4) where L(Invariant;Di) is the likelihood of site i assuming the invariant model for Di. The goal is to maximize the above likelihood. The parameter θ and α are estimated from the data. 25 2.2 Model estimation process This section describes the process to estimate the substitution model from a set of mulitple alignments. With a set of protein multiple alignments, denoted A = {Da} where Da is a multiple alignment, the Q matrix can be estimated using a ML approach. Different alignments are considered independent of each other, hence, the likelihood of A is L(A) = ∏ a L(T a,Q;Da) (2.5) where T a is a phylogenetic tree relating the sequences in Da. Maximizing the above likelihood involves optimizing the matrix Q coefficients, branch lengths and the tree topologies, which results in a very large number of parameters. There- fore, finding both the best matrix Q and T a tree that maximize the above likeli- hood at the same time is a hard task. There have been observations that estimates of evolution model parameter do not change significantly for near-optimal tree topologies [8]. A simple approach can be used to relax the computation burden: an approximate tree T a is estimated first for each alignment Da and then these trees are used in equation 2.5. The likelihood of A then becomes L(A) = ∏ a L(Q;Da, T a) (2.6) that is, the likelihood of the data and of the T a trees, given replacement matrix Q. This greatly simplifies the optimization process since only Q coefficients have to be optimized. As long as the T a trees in equation 2.6 are similar enough to the optimal trees, the estimations of Q from equation 2.5 and from equation 2.6 should have subtle difference. The following estimation procedure is used to obtain rotavirus-specific substi- tution model: (1) First, the T a trees are inferred by ML with possible rates across sites. For each alignment Da, model selection algorithm (ModelFinder) in IQ-TREE pro- gram [40] is run with initial model sets of WAG, LG, JTT (all are general mod- els). ModelFinder method finds the best-fitting model for sequence evolution that explains the input alignments among all possible models from the model set. This 26 means comparing four types of models which a

Các file đính kèm theo tài liệu này:

amino_acid_substitution_model_for_rotavirus.pdf