GTR models of evolution

Probabilistic evolutionary models form the basis of sequence data analyses. Parameter inference, whether performed within the maximum likelihood (ML) or Bayesian inference paradigms, relies on an explicit definition of the substitution process, which may vary spatially (across the alignment sites) and temporally (across the branches of the phylogeny). Over the last 50 years, a plethora of evolutionary models has been developed, each relying on a different set of assumptions regarding the dynamics of nucleotide evolution. Such assumptions, quantified by several parameters, determine whether the substitution rates between all pairs of nucleotides are identical or independent, whether the stationary frequencies of the nucleotides within the analyzed data are equal or allowed to vary, whether a proportion of the sites are fully conserved, and whether heterogeneous rates of evolution are allowed across the alignment sites. Altogether, these produce varied alternatives that account for different processes of evolution [1, 2, 3, 4, 5, 6, 7, 8]. Accounting for more parameters grants a model the flexibility to fit different datasets and capture their complexity. However, the expected error of each estimate increases with the number of parameters, which is problematic mainly when data are scarce.

Selecting the most suitable model for describing the evolutionary process has been addressed under both the frequentist and Bayesian approaches, by proposing statistical criteria to compare the fit of competing models. Under the frequentist approach, the fit of the data to each substitution model, together with the model parameters, tree topology, and branch lengths, is assessed through iterative optimizations of the likelihood function. The estimated ML scores are then compared through one of several possible criteria.

For example, the hierarchical and dynamic likelihood ratio test criteria (hLRT and dLRT, respectively) perform a sequence of likelihood ratio tests between pairs of nested models, until a model that cannot be rejected is reached. While in hLRT the order in which parameters are added is defined a priori, in dLRT all models that differ in one parameter are compared in parallel and the hierarchy proceeds with the model that maximizes the log-likelihood difference. Thus, dLRT enables a different order of hypothesis testing for different datasets [9]. Other criteria compute the ML for all the candidate models, but assign different penalties according to the data size or the number of parameters included in the model. The most commonly used criteria are the Akaike information criterion (AIC) [10], the corrected AIC (AICc) [11], the Bayesian information criterion (BIC) [12], and the decision-theory criterion (DT) [13] (summarized in Table 1). Notably, the ML criteria described above handle uncertainty in model testing by accounting for the number of parameters assessed in the computation, but not for the type of processes they represent. For example, the penalty for a parameter that distinguishes between transitions and transversions would be identical to the penalty imposed for a parameter that assesses the proportion of invariant sites.

In contrast, under the Bayesian approach, model selection can be performed using the marginal likelihood, which is the probability of the data given the model, marginalized over the model parameters (Table 1). The magnitude of the Bayes factor (BF), namely, the ratio of the marginal likelihoods of two models, quantifies the strength of evidence that one model is more appropriate to describe the data than the other [14]. Because the marginal likelihood in phylogenetic inference is a high-dimensional integral over a range of values that cannot be enumerated, its computation is not always feasible. Several methods that estimate the Bayes factor or the marginal likelihood for model selection in phylogenetic analyses have been proposed, with varying trade-offs between computation time and accuracy [15, 16, 17, 18, 19, 20]. Obviously, no evolutionary model can fully capture the genuine complexity of the evolutionary process, such that even the most adequate one merely provides an approximation of reality [21].
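To make the model selection criteria discussed in this section concrete, the following is a minimal Python sketch of their standard textbook forms: AIC, AICc, BIC, the likelihood ratio test statistic, and the log Bayes factor. It is an illustration only, not the implementation of any particular model selection program. The model names, log-likelihood values, parameter counts, and alignment length are hypothetical numbers chosen for the example. Counting only substitution-model parameters (and not branch lengths) and using the alignment length as the sample size are simplifying assumptions.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik


def aicc(log_lik, k, n):
    """AIC with the small-sample correction term 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined when n <= k + 1")
    return aic(log_lik, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_lik, k, n):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_lik


def lrt_statistic(log_lik_null, log_lik_alt):
    """Likelihood ratio test statistic for a pair of nested models."""
    return 2 * (log_lik_alt - log_lik_null)


def log_bayes_factor(log_ml_1, log_ml_0):
    """Log Bayes factor: difference of two log marginal likelihoods."""
    return log_ml_1 - log_ml_0


# Hypothetical maximized log-likelihoods and free-parameter counts.
# Assumption: k counts substitution-model parameters only.
candidates = {
    "JC": (-5230.0, 0),
    "HKY": (-5105.0, 4),
    "GTR+I+G": (-5098.5, 10),
}
n_sites = 1000  # hypothetical alignment length, used as the sample size

aic_scores = {m: aic(ll, k) for m, (ll, k) in candidates.items()}
bic_scores = {m: bic(ll, k, n_sites) for m, (ll, k) in candidates.items()}

# Lower scores indicate the preferred model under each criterion.
best_aic = min(aic_scores, key=aic_scores.get)
best_bic = min(bic_scores, key=bic_scores.get)

print("AIC selects:", best_aic)
print("BIC selects:", best_bic)
print("LRT statistic, JC vs HKY:", lrt_statistic(-5230.0, -5105.0))
```

With these made-up numbers, AIC prefers the most parameter-rich model (GTR+I+G), whereas BIC, whose per-parameter penalty grows with the data size, prefers HKY. In both cases the penalty depends only on how many parameters a model has, not on which evolutionary process each parameter represents.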