Creating phylogenetic networks from clusters
More Info
expand_more
Abstract
In biology, phylogenetics is the study of the evolutionary history of and relations between e.g. species. Such data are often represented in trees. Remarkably, trees lack the representation of reticulation events, such as hybridization, while such events are believed to be important. One of the reasons why trees are widely used, is the enormous complexity of inferring a network directly from DNA data.
Therefore, indirect approaches are studied. One option is to construct a network from trees, which have been constructed from DNA data. Another option, which is studied in this work, is to create a network which represents all clusters from the trees. We are interested in finding a network having the lowest level possible, i.e. the lowest number of reticulations per biconnected component. The Cass algorithm is one of the best algorithms in this respect. However, it does not always give optimal results. In this thesis, an optimal algorithm is presented called STCass, of which many ideas are inspired from Cass. We lay
theoretical foundations on how to build an optimal network. Furthermore, (greedy) steps are proposed to solve the problem, which seem to work very well in practice. Based on several optimizations that are proposed, an improved algorithm called MSTCass is presented, which runs faster than STCass, in general. In order to speed up any Cass-based algorithm or derivatives of it, several lower bounds are presented on the reticulation number r(N) per biconnected component N of an optimal network. Besides, a new upper bound on r(N) is conjectured which can be evaluated in seconds, although e.g. an upper bound obtained by the HybridInterleave or PIRN algorithm is mostly lower and which can be evaluated often in twenty minutes for the tested instances.