phylo2vec: a library for vector-based phylogenetic tree manipulation

Neil Scheidwasser, Ayush Nag, Matthew J Penn, Anthony MV Jakob, Frederik Mølkjær Andersen, Mark P Khurana, Landung Setiawan, David A Duchêne, Samir Bhatt

Published: 2025/6/24

Abstract

Phylogenetics is a fundamental component of evolutionary analysis frameworks in biology and linguistics. Recently, the advent of large-scale genomics and the SARS-CoV-2 pandemic have highlighted the need for phylogenetic software that can handle large datasets. While significant efforts have focused on scaling optimisation algorithms, visualisation, and lineage identification, an emerging body of research has been dedicated to efficient representations of data for genomes and phylogenetic trees. Compared to the traditional Newick format, which represents trees using strings of nested parentheses, modern tree representations use integer vectors to encode the tree topology. This approach offers several advantages, including easier manipulation, increased memory efficiency, and applicability to machine learning. Here, we present the latest release of phylo2vec (or Phylo2Vec), a high-performance software package for encoding, manipulating, and analysing binary phylogenetic trees. At its core, the package is based on the phylo2vec representation of binary trees, and is designed to enable fast sampling and tree comparison. This release, which supersedes the original release, features a core implementation in Rust for improved performance and memory efficiency, with wrappers in R and Python, making it accessible to a broad audience in the bioinformatics community.
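
To illustrate why an integer-vector encoding lends itself to fast sampling, the following is a minimal Python sketch that does not use the phylo2vec package itself. It assumes the constraint from the Phylo2Vec representation, which is not stated in this abstract: a tree with n leaves is encoded as a vector v of length n - 1 whose 0-indexed entry v[i] lies in {0, ..., 2i}. The function names are hypothetical and chosen for illustration only.

```python
import random
from math import prod


def sketch_sample_vector(n_leaves, rng=None):
    """Draw a vector uniformly at random under the assumed constraint.

    Assumption: entry i (0-indexed) may take any integer value in [0, 2*i].
    Each entry is drawn independently, so sampling is linear in n_leaves.
    """
    rng = rng or random.Random()
    return [rng.randint(0, 2 * i) for i in range(n_leaves - 1)]


def sketch_is_valid(v):
    """Check that every entry respects the assumed bound 0 <= v[i] <= 2*i."""
    return all(0 <= x <= 2 * i for i, x in enumerate(v))


def sketch_count_vectors(n_leaves):
    """Count the vectors allowed by the constraint: the product of (2*i + 1)."""
    return prod(2 * i + 1 for i in range(n_leaves - 1))


if __name__ == "__main__":
    v = sketch_sample_vector(5, random.Random(0))
    print(v, sketch_is_valid(v))
    print(sketch_count_vectors(4))
```

Under this assumption, the number of admissible vectors for n leaves is (2n - 3)!!, which matches the number of rooted binary trees on n labelled leaves (for example, 15 for four leaves). This is why drawing each entry independently can yield a uniformly sampled tree topology without building or parsing a nested-parentheses string.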