Counting Empirical Cluster Sizes Of Identical COVID-19 Genetic Sequences

Abstract

This thesis aims to enhance existing models that infer parameters describing the spread of a virus by analyzing the distribution of empirical cluster sizes of identical genetic sequences. A recently popular approach assumes that each individual cluster can be modeled as a Bienaymé-Galton-Watson process, with the distribution of empirical cluster sizes being equal to the law of the final size $\widetilde{Y}_\infty$ of the branching process. By employing the theory of general branching processes counted by characteristics, we demonstrate that the empirical cluster size distribution $C^\alpha$ stochastically dominates $\widetilde{Y}_\infty$ due to the exponential growth of the branching process. Under the assumption that the underlying branching tree follows either a Bienaymé-Galton-Watson process or an age-dependent process, we show that the mean of the empirical cluster size distribution yields a (strongly) consistent estimator of the probability of mutation $\nu$. For both branching models, we compute $P(C^\alpha=n)$ for $n=1,2$. We conjecture that $P(C^\alpha=n)$ is independent of the underlying model and that it can be expressed as a function of the mean of the offspring distribution $X$ and the probability mass function of $\mathrm{Bin}(X, 1-\nu)$. We also consider an extension of the model in which the probability of mutation is sampled from a distribution $\nu$ for each cluster, and show that under this assumption the empirical mean of the cluster sizes estimates the quantity $\int \nu^{-1}(r) dr$. Finally, we show that $\nu$ can still be estimated by the empirical mean of the cluster sizes when the population is divided into a finite number of types with inhomogeneous offspring distributions.
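The link between the mean cluster size and $\nu$ can be illustrated with a small simulation. The sketch below is not taken from the thesis; it rests on several assumptions made only for illustration: a Bienaymé-Galton-Watson tree in discrete generations, each newborn independently carrying a new, never-before-seen mutation with probability $\nu$, and clusters counted over every individual born up to a fixed generation. The function names `simulate_cluster_sizes` and `estimate_nu` are hypothetical. In this toy setting every individual belongs to exactly one cluster and every mutation founds one, so the reciprocal of the empirical mean cluster size equals the number of clusters divided by the number of individuals, which is close to $\nu$ in a large population. Whether this is exactly the estimator studied in the thesis is not stated in the abstract.

```python
import random
from collections import Counter


def simulate_cluster_sizes(nu, sample_offspring, generations, rng):
    """Simulate a Bienaymé-Galton-Watson tree with mutation at birth.

    Assumption (illustrative only): each child independently starts a new
    cluster with probability nu, otherwise it joins its parent's cluster.
    Returns the sizes of all clusters, counted over every individual born
    up to and including the last simulated generation.
    """
    sizes = Counter({0: 1})  # cluster id -> members; the root founds cluster 0
    current = [0]            # cluster ids of individuals in the current generation
    next_id = 1
    for _generation in range(generations):
        children = []
        for parent_cluster in current:
            for _child in range(sample_offspring(rng)):
                if rng.random() < nu:
                    cluster = next_id  # mutation: a new sequence, a new cluster
                    next_id += 1
                else:
                    cluster = parent_cluster
                sizes[cluster] += 1
                children.append(cluster)
        current = children
        if not current:  # the population died out
            break
    return list(sizes.values())


def estimate_nu(cluster_sizes):
    """Reciprocal of the empirical mean cluster size."""
    mean_size = sum(cluster_sizes) / len(cluster_sizes)
    return 1.0 / mean_size


if __name__ == "__main__":
    rng = random.Random(2024)
    # Hypothetical offspring law: uniform on {1, 2, 3}, so the mean is 2.
    sizes = simulate_cluster_sizes(0.2, lambda r: r.randint(1, 3), 14, rng)
    print("individuals:", sum(sizes), "clusters:", len(sizes))
    print("estimated nu:", estimate_nu(sizes))
```

The offspring law is passed in as a function so that other distributions can be tried without touching the simulation loop; in this sketch the estimate depends on the offspring law only through the population size it produces.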