Prabakaran R
United States
Deciphering enzymatic potential in metagenomic reads through DNA language model
R Prabakaran1,2, Yana Bromberg1,2
1. Department of Biology, Emory University, Atlanta, GA
2. Department of Computer Science, Emory University, Atlanta, GA
Abstract
Background
The microbial world, home to an extensive array of bacterial species, plays a fundamental role in shaping Earth’s biosphere, steering processes such as carbon and nitrogen cycling, soil rejuvenation, and ecological fortification. Despite this significance, the overwhelming majority of microbes remain unstudied, and much of their diversity persists as uncharacterized metagenomic “dark matter”. Existing metagenome analysis methods depend largely on reference databases and therefore cannot discover truly novel microbial functionality.
Methods
To address this limitation, we present REBEAN, a nucleotide language model engineered for homology-independent metagenomic analysis. REBEAN is trained to interpret the DNA context of gene fragments and to predict their enzymatic functions, prioritizing functional annotation over traditional sequence similarity.
Results
This approach enables REBEAN to identify previously uncharacterized (orphan) sequences carrying out known enzymatic functions. REBEAN can also locate functionally critical regions within genes despite not being explicitly trained for this task.
Conclusions
Our comprehensive analysis of multiple datasets highlights REBEAN’s potential for metagenomic annotation and for unearthing novel enzymes. REBEAN can enrich our understanding of microbial communities and is available to the community as a standalone package and web service.