GoViral: A local viral haplotype reconstruction pipeline featuring a transformer-based classification model

Abstract

RNA viruses, characterized by high replication rates and the absence of proofreading mechanisms,
are susceptible to errors during replication. This characteristic allows them to form diverse
communities of genome mutants known as "viral quasispecies". Each individual genome
mutant is referred to as a haplotype. Analyzing viral populations involves reconstructing individual
haplotypes from sequencing reads. This process is challenging due to the unknown
number of haplotypes, their high similarity, and varied abundances. Common sequencing
methods add to this complexity. Next-generation sequencing provides short reads lacking
sufficient information for haplotype reconstruction, while third-generation sequencing (TGS)
offers longer but error-prone reads. Recent advancements in TGS, such as PacBio HiFi reads,
deliver long, accurate reads, providing new opportunities for haplotype assembly. However,
the majority of TGS viral haplotype assembly tools rely on reference sequences and utilize
alignment-based methods. Moreover, during outbreaks, suitable reference genomes might be
unavailable. In the last two decades, machine learning has enabled us to uncover patterns
within and between biological sequences, and to discover important biological attributes. In
this study, we introduce GoViral, a pipeline designed to reconstruct haplotypes specific to
genomic regions and estimate their abundances. Our pipeline features a transformer-based
classification model fine-tuned on a self-constructed dataset to classify read pairs, identifying
those originating from the same haplotype. Predictions are made irrespective of coverage and
abundance, removing reliance on alignment-based methods. In addition, our pipeline employs
a community detection algorithm to cluster and reconstruct region-specific haplotypes, estimating
their abundances. GoViral achieves high performance across real data and diverse
RNA viruses, including SARS-CoV-2, HIV-1, and HCV-1b, surpassing existing tools.
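
The clustering and abundance steps described above can be illustrated with a minimal sketch. It assumes the classifier has already produced a same-haplotype score for each read pair; the function names, the 0.5 threshold, and the toy scores are hypothetical and not taken from the thesis. The abstract does not name the community detection algorithm, so plain connected components (via union-find) is used here as a much simpler stand-in, and abundance is approximated as the fraction of reads in each cluster.

```python
from collections import Counter


def cluster_reads(num_reads, pair_scores, threshold=0.5):
    """Group reads into putative haplotypes from pairwise scores.

    pair_scores maps (i, j) read-index pairs to a hypothetical
    same-haplotype probability from the classifier. Pairs scoring at or
    above the threshold are linked; linked reads share a cluster.
    NOTE: connected components is a stand-in for community detection.
    """
    parent = list(range(num_reads))

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), score in pair_scores.items():
        if score >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    # One cluster label (the root index) per read.
    return [find(i) for i in range(num_reads)]


def estimate_abundances(labels):
    """Approximate each cluster's abundance as its share of all reads."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


# Toy input: five reads with made-up classifier scores.
pair_scores = {
    (0, 1): 0.95, (1, 2): 0.90, (0, 2): 0.85,  # reads 0-2 look alike
    (3, 4): 0.92,                              # reads 3-4 look alike
    (2, 3): 0.05, (0, 4): 0.10,                # cross-group pairs score low
}
labels = cluster_reads(5, pair_scores)
abundances = estimate_abundances(labels)
```

In this toy case, reads 0 to 2 fall into one cluster and reads 3 and 4 into another. A real community detection method would tolerate occasional false-positive links better than connected components, which merges two clusters on a single spurious edge.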

Files

Msc_Thesis_final.pdf
Unknown license

File under embargo until 15-09-2025