Zhang Lab @ VT

ProtAlign-ARG

The evolution and spread of antibiotic resistance pose a global health challenge. Whole genome and metagenomic sequencing offer a promising approach to monitoring the spread, but typical alignment-based approaches for antibiotic resistance gene (ARG) detection are inherently limited in the ability to detect new variants. Large protein language models could present a powerful alternative but are limited by databases available for training. Here we introduce ProtAlign-ARG, a novel hybrid model combining a pre-trained protein language model and an alignment scoring-based model to expand the capacity for ARG detection from DNA sequencing data. ProtAlign-ARG learns from vast unannotated protein sequences, utilizing raw protein language model embeddings to improve the accuracy of ARG classification. In instances where the model lacks confidence, ProtAlign-ARG employs an alignment-based scoring method, incorporating bit scores and e-values to classify ARGs according to their corresponding classes of antibiotics. ProtAlign-ARG demonstrated remarkable accuracy in identifying and classifying ARGs, particularly excelling in recall compared to existing ARG identification and classification tools. We also extended ProtAlign-ARG to predict the functionality and mobility of ARGs, highlighting the model’s robustness in various predictive tasks. A comprehensive comparison of ProtAlign-ARG with both the alignment-based scoring model and the pre-trained protein language model demonstrated the superior performance of ProtAlign-ARG. The source code of ProtAlign-ARG is available here.

[Website] [Paper]

ARGContextProfiler

Antibiotic resistance (AR) presents a global health challenge, necessitating an improved understanding of the ecology, evolution, and dissemination of antibiotic resistance genes (ARGs). Several tools, databases, and algorithms are now available to facilitate the identification of ARGs in metagenomic sequencing data; however, direct annotation of short-read data provides limited contextual information. Knowledge of whether an ARG is carried in the chromosome or on a specific mobile genetic element (MGE) is critical to understanding mobility, persistence, and potential for co-selection. Here we developed ARGContextProfiler, a pipeline designed to extract and visualize ARG genomic contexts. By leveraging the assembly graph for genomic neighborhood extraction and validating contexts through read mapping, ARGContextProfiler minimizes chimeric errors that are a common artifact of assembly outputs. Testing on real, synthetic, and semi-synthetic data, including long-read sequencing data from environmental samples, demonstrated that ARGContextProfiler offers superior accuracy, precision, and sensitivity compared to conventional assembly-based methods. ARGContextProfiler thus provides a powerful tool for uncovering the genomic context of ARGs in metagenomic sequencing data, which can be of value to both fundamental and applied research aimed at understanding and stemming the spread of AR. The source code of ARGContextProfiler is available here.

[Website] [Paper]

Virseqimprover

Despite the recent surge of viral metagenomic studies, it remains a significant challenge to recover complete virus genomes from metagenomic data. The majority of viral contigs generated from de novo assembly programs are highly fragmented, presenting significant challenges to downstream analysis and inference. To address this issue, we have developed Virseqimprover, a computational pipeline that can extend assembled contigs to complete or nearly complete genomes while maintaining extension quality. Virseqimprover first examines whether there is any chimeric sequence based on read coverage, breaks the sequence into segments if there is, then extends the longest segment with uniform depth of coverage, and repeats these procedures until the sequence cannot be extended. Finally, Virseqimprover annotates the gene content of the resulting sequence. Results show that Virseqimprover has good performances on correcting and extending viral contigs to their full lengths, hence can be a useful tool to improve the completeness and minimize the assembly errors of viral contigs. Both a web server and a conda package for Virseqimprover are provided to the research community free of charge. The source code of Virseqimprover is available here.

[Website] [Paper]

DeePSP-GIN

Phages are vital components of the microbial ecosystem, and their functions and roles are largely determined by their structural proteins. Accurately annotating phage structural proteins (PSPs) is essential for understanding phage biology and their interactions with bacterial hosts, which can pave the way for innovative strategies to combat bacterial infections and develop phage-based therapies. However, the sequence diversity of PSPs makes their identification and annotation challenging. While various computational methods are available for predicting PSPs, they currently lack the integration of protein structural information, an important aspect for understanding protein function. With the advent of deep learning models, protein structures can be predicted accurately and quickly from protein sequences, creating new opportunities for PSP prediction and analysis. We developed DeePSP-GIN, a graph isomorphism network (GIN) - based deep learning model leveraging predicted protein structures and protein language model for PSP identification and classification. To the best of our knowledge, DeePSP-GIN is the first method utilizing predicted protein structural information for PSP prediction tasks. It offers dual functionality of identifying PSP and non-PSP sequences and classifying PSPs into seven major classes. DeePSP-GIN converts predicted protein structures from PDB 3D coordinates into graphs and extracts node features from protein language model-generated embeddings. The GIN is then applied to the constructed graphs to learn the discriminating features. The experimental results show that DeePSP-GIN outperforms the state-of-the-art methods in both PSP identification and classification tasks in terms of F1-score. DeePSP-GIN achieves a 1.04% higher F1-score than the nearest competing method in the PSP identification task. Additionally, its overall F1-score in the PSP classification task is approximately 34.38% higher than that of the second-best method. The source code of DeePSP-GIN is available here.

[Website] [Paper]

MetaCompare 2.0

While numerous environmental factors contribute to the spread of antibiotic resistance genes (ARGs), quantifying their relative contributions remains a fundamental challenge. Similarly, it is important to differentiate acute human health risks from environmental exposure, versus broader ecological risk of ARG evolution and spread across microbial taxa. Recent studies have proposed various methods for achieving such aims. Here, we introduce MetaCompare 2.0, which improves upon original MetaCompare pipeline by differentiating indicators of human health resistome risk (potential for human pathogens of acute resistance concern to acquire ARGs) from ecological resistome risk (overall mobility of ARGs and potential for pathogen acquisition). The updated pipeline's sensitivity was demonstrated by analyzing diverse publicly-available metagenomes from wastewater, surface water, soil, sediment, human gut, and synthetic microbial communities. MetaCompare 2.0 provided distinct rankings of the metagenomes according to both human health resistome risk and ecological resistome risk, with both scores trending higher when influenced by anthropogenic impact or other stress. We evaluated the robustness of the pipeline to sequence assembly methods, sequencing depth, contig count, and metagenomic library coverage bias. The risk scores were remarkably consistent despite variations in these technological aspects. We packaged the improved pipeline into a publicly-available web service that provides an easy-to-use interface for computing resistome risk scores and visualizing results. MetaCompare 2.0 also provides a public API and database available here.

[Website] [Paper]

DeepMicroGen

The human microbiome, which is linked to various diseases by growing evidence, has a profound impact on human health. Since changes in the composition of the microbiome across time are associated with disease and clinical outcomes, microbiome analysis should be performed in a longitudinal study. However, due to limited sample sizes and differing numbers of timepoints for different subjects, a significant amount of data cannot be utilized, directly affecting the quality of analysis results. Deep generative models have been proposed to address this lack of data issue. Specifically, a generative adversarial network (GAN) has been successfully utilized for data augmentation to improve prediction tasks. Recent studies have also shown improved performance of GAN-based models for missing value imputation in a multivariate time series dataset compared with traditional imputation methods. This work proposes DeepMicroGen, a bidirectional recurrent neural network-based GAN model, trained on the temporal relationship between the observations, to impute the missing microbiome samples in longitudinal studies. DeepMicroGen outperforms standard baseline imputation methods, showing the lowest mean absolute error for both simulated and real datasets. Finally, the proposed model improved the predicted clinical outcome for allergies, by providing imputation for an incomplete longitudinal dataset used to train the classifier.

[Website] [Paper]

FastViromeExplorer-Novel

Despite the recent surge of viral metagenomic studies, recovering complete virus/phage genomes from metagenomic data is still extremely difficult and most viral contigs generated from de novo assembly programs are highly fragmented, posing serious challenges to downstream analysis and inference. In this study, we develop FastViromeExplorer (FVE)-novel, a computational pipeline for reconstructing complete or near-complete viral draft genomes from metagenomic data. The FVE-novel deploys FVE to efficiently map metagenomic reads to viral reference genomes, performs de novo assembly of the mapped reads to generate contigs, and extends the contigs through iterative assembly to produce final viral scaffolds. We applied FVE-novel to an ocean metagenomic sample and obtained 268 viral scaffolds that potentially come from novel viruses. Through manual examination and validation of the 10 longest scaffolds, we successfully recovered 4 complete viral genomes, 2 are novel as they cannot be found in the existing databases and the other 2 are related to known phages. This hybrid reference-based and de novo assembly approach used by FVE-novel represents a powerful new approach for uncovering near-complete viral genomes in metagenomic data.

[Website] [Paper]

DeepGeni

Recent studies revealed that gut microbiota modulates the response to cancer immunotherapy and fecal microbiota transplantation has clinical benefits in melanoma patients during treatment. Understanding how microbiota affects individual responses is crucial for precision oncology. However, it is challenging to identify key microbial taxa with limited data as statistical and machine learning models often lose their generalizability. In this study, DeepGeni, a deep generalized interpretable autoencoder, is proposed to improve the generalizability and interpretability of microbiome profiles by augmenting data and by introducing interpretable links in the autoencoder. DeepGeni-based machine learning classifier outperforms state-of-the-art classifier in the microbiome-driven prediction of responsiveness of melanoma patients treated with immune checkpoint inhibitors. Moreover, the interpretable links of DeepGeni elucidate the most informative microbiota associated with cancer immunotherapy response. DeepGeni not only improves microbiome-driven prediction of immune checkpoint inhibitor responsiveness but also suggests potential microbial targets for fecal microbiota transplant or probiotics improving the outcome of cancer immunotherapy.

[Website] [Paper]

HT-ARGfinder

Antibiotic resistance is a continually rising threat to global health. A primary driver of the evolution of new strains of resistant pathogens is the horizontal gene transfer (HGT) of antibiotic resistance genes (ARGs). However, identifying and quantifying ARGs subject to HGT remains a significant challenge. Here, we introduce HT-ARGfinder (horizontally transferred ARG finder), a pipeline that detects and enumerates horizontally transferred ARGs in metagenomic data while also estimating the directionality of transfer. To demonstrate the pipeline, we applied it to an array of publicly-available wastewater metagenomes, including hospital sewage. We compare the horizontally transferred ARGs detected across various sample types and estimate their directionality of transfer among donors and recipients. This study introduces a comprehensive tool to track mobile ARGs in wastewater and other aquatic environments.

[Website] [Paper]

MetaMLP

The functional profile of metagenomic samples enables improved understanding of microbial populations in the environment. Such analysis consists of assigning short sequencing reads to a particular functional category. Normally, manually curated databases are used for functional assignment, and genes are arranged into different classes. Sequence alignment has been widely used to profile metagenomic samples against curated databases. However, this method is time consuming and requires high computational resources. While several alignment-free methods based on k-mer composition have been developed in recent years, they still require large amounts of computer main memory. In this article, MetaMLP (Metagenomics Machine Learning Profiler), a machine learning method that represents sequences as numerical vectors (embeddings) and uses a simple one hidden layer neural network to profile functional categories, is developed. Unlike other methods, MetaMLP enables partial matching by using a reduced alphabet to build sequence embeddings from full and partial k-mers. MetaMLP is able to identify a slightly larger number of reads compared with DIAMOND (one of the fastest sequence alignment methods), as well as to perform accurate predictions with 0.99 precision and 0.99 recall. MetaMLP can process 100M reads in ∼10 minutes on a laptop computer, which is 50 times faster than DIAMOND. MetaMLP is publicly available here.

[Website] [Paper]

ARGminer

Curation of antibiotic resistance gene (ARG) databases is a labor-intensive process that requires expert knowledge to manually collect, correct, and/or annotate individual genes. Correspondingly, updates to existing databases tend to be infrequent, commonly requiring years for completion and often containing inconsistences. Further, because of limitations of manual curation, most existing ARG databases contain only a small proportion of known ARGs (~5k genes). A new approach is needed to achieve a truly comprehensive ARG database, while also maintaining a high level of accuracy. Here we propose a new web-based curation system, ARG-miner, which supports annotation of ARGs at multiple levels, including: gene name, antibiotic category, resistance mechanism, and evidence for mobility and occurrence in clinically-important bacterial strains. To overcome limitations of manual curation, we employ crowdsourcing as a novel strategy for expanding curation capacity towards achieving a truly comprehensive, up-to-date database. We develop and validate the approach by comparing performance of multiple cohorts of curators with varying levels of expertise, demonstrating that ARG-miner is more cost effective and less time-consuming relative to traditional expert curation. We further demonstrate the reliability of a trust validation filter for rejecting confounding input generated by spammers. Crowdsourcing was found to be as accurate as expert annotation, with an accuracy >90% for the annotation of a diverse test set of ARGs. ARG-miner provides a public API and database available here.

[Website] [Paper]

DeepMicro

Human microbiota plays a key role in human health and growing evidence supports the potential use of microbiome as a predictor of various diseases. However, the high-dimensionality of microbiome data, often in the order of hundreds of thousands, yet low sample sizes, poses great challenge for machine learning-based prediction algorithms. This imbalance induces the data to be highly sparse, preventing from learning a better prediction model. Also, there has been little work on deep learning applications to microbiome data with a rigorous evaluation scheme. To address these challenges, we propose DeepMicro, a deep representation learning framework allowing for an effective representation of microbiome profiles. DeepMicro successfully transforms high-dimensional microbiome data into a robust low-dimensional representation using various autoencoders and applies machine learning classification algorithms on the learned representation. In disease prediction, DeepMicro outperforms the current best approaches based on the strain-level marker profile in five different datasets. In addition, by significantly reducing the dimensionality of the marker profile, DeepMicro accelerates the model training and hyperparameter optimization procedure with 8X–30X speedup over the basic approach. DeepMicro is freely available here.

[Website] [Paper]

NanoARG

Direct and indirect selection pressures imposed by antibiotics and co-selective agents and horizontal gene transfer are fundamental drivers of the evolution and spread of antibiotic resistance. Therefore, effective environmental monitoring tools should ideally capture not only antibiotic resistance genes (ARGs), but also mobile genetic elements (MGEs) and indicators of co-selective forces, such as metal resistance genes (MRGs). A major challenge towards characterizing the potential human health risk of antibiotic resistance is the ability to identify ARG-carrying microorganisms, of which human pathogens are arguably of greatest risk. Historically, short reads produced by next-generation sequencing technologies have hampered confidence in assemblies for achieving these purposes. Here, we introduce NanoARG, an online computational resource that takes advantage of the long reads produced by nanopore sequencing technology. Specifically, long nanopore reads enable identification of ARGs in the context of relevant neighboring genes, thus providing valuable insight into mobility, co-selection, and pathogenicity. NanoARG was applied to study a variety of nanopore sequencing data to demonstrate its functionality. NanoARG was further validated through characterizing its ability to correctly identify ARGs in sequences of varying lengths and a range of sequencing error rates. NanoARG allows users to upload sequence data online and provides various means to analyze and visualize the data, including quantitative and simultaneous profiling of ARGs, MRGs, MGEs, and putative pathogens. A user-friendly interface allows users the analysis of long DNA sequences (including assembled contigs), facilitating data processing, analysis, and visualization. NanoARG is publicly available and freely accessible here.

[Website] [Paper]

MetaCompare

The spread of antibiotic resistance is a growing public health concern. While numerous studies have highlighted the importance of environmental sources and pathways of the spread of antibiotic resistance, a systematic means of comparing and prioritizing risks represented by various environmental compartments is lacking. Here, we introduce MetaCompare, a publicly available tool for ranking "resistome risk", which we define as the potential for antibiotic resistance genes (ARGs) to be associated with mobile genetic elements (MGEs) and mobilize to pathogens based on metagenomic data. A computational pipeline was developed in which each ARG is evaluated based on relative abundance, mobility, and presence within a pathogen. This is determined through the assembly of shotgun sequencing data and analysis of contigs containing ARGs to determine if they contain sequence similarity to MGEs or human pathogens. Based on the assembled metagenomes, samples are projected into a 3-dimensional hazard space and assigned resistome risk scores. To validate, we tested previously published metagenomic data derived from distinct aquatic environments. Based on unsupervised machine learning, the test samples clustered in the hazard space in a manner consistent with their origin. The derived scores produced a well-resolved ascending resistome risk ranking of: wastewater treatment plant effluent, dairy lagoon, and hospital sewage.

[Website] [Paper]

BisPin and BFAST-Gap

BisPin is a new multiprocess bisulfite-treated short DNA read mapper written in Python 2.7. It performs alignments using BFAST, leveraging its multithreading functionality and thorough hash-based indexing strategy. BisPin is feature rich and supports directional, nondirectional, PBAT, and hairpin construction strategies. BisPin approaches read mapping by converting the Cs to Ts and the Gs to As in both the reads and the reference genome. BisPin uses fast rescoring to disambiguate ambiguously aligned reads for a superior amount of uniquely mapped reads compared to other mappers. The performance of BisPin was evaluated on both real and simulated data in comparison to other read mappers. BFAST-Gap is a modified version of BFAST meant for Ion Torrent reads. It uses a parameterized logistic function to determine the weights of the gap open and extension penalties based on the homopolymer run length of the DNA read. This is because the Ion Torrent sequencing technology can overcall and undercall homopolymer runs. BisPin works with both BFAST-Gap and BFAST. BFAST-Gap is compatible with indexes built with BFAST. There are few mappers that specifically address Ion Torrent data. BFAST-Gap works with Illumina reads as well. Results: BisPin with BFAST consistently had a higher amount of uniquely mapped reads compared to other mappers on real data using a variety of construction strategies. Using a hairpin validation strategy, BisPin was superior using the maximum score, and it mapped 73% of reads correctly. BisPin with BFAST-Gap on Ion Torrent reads with a logistic gap open penalty function improved mapping accuracy with real and simulated data. On simulated bisulfite Ion Torrent data, the area under the curve was improved by approximately seven, and on one real data set, the uniquely mapped percent was improved by seven percent. BFAST-Gap performed better than TMAP on simulated regular Ion Torrent reads, and TMAP is designed for Ion Torrent reads. Other read mappers had worse performance. Conclusions: BisPin and BFAST-Gap have consistently good accuracy with a variety of data. BisPin is feature-rich. This makes BisPin and BFAST-Gap useful additions to read mapping software.

[Website] [Paper]

DeepARG

Growing concerns about increasing rates of antibiotic resistance call for expanded and comprehensive global monitoring. Advancing methods for monitoring of environmental media (e.g., wastewater, agricultural waste, food, and water) is especially needed for identifying potential resources of novel antibiotic resistance genes (ARGs), hot spots for gene exchange, and as pathways for the spread of ARGs and human exposure. Next-generation sequencing now enables direct access and profiling of the total metagenomic DNA pool, where ARGs are typically identified or predicted based on the “best hits” of sequence searches against existing databases. Unfortunately, this approach produces a high rate of false negatives. To address such limitations, we propose here a deep learning approach, taking into account a dissimilarity matrix created using all known categories of ARGs. Two deep learning models, DeepARG-SS and DeepARG-LS, were constructed for short read sequences and full gene length sequences, respectively. Evaluation of the deep learning models over 30 antibiotic resistance categories demonstrates that the DeepARG models can predict ARGs with both high precision (>0.97) and recall (>0.90). The models displayed an advantage over the typical best hit approach, yielding consistently lower false negative rates and thus higher overall recall (>0.90). As more data become available for under-represented ARG categories, the DeepARG models’ performance can be expected to be further enhanced due to the nature of the underlying neural networks. Our newly developed ARG database, DeepARG-DB, encompasses ARGs predicted with a high degree of confidence and extensive manual inspection, greatly expanding current ARG repositories. The deep learning models developed here offer more accurate antimicrobial resistance annotation relative to current bioinformatics practice. DeepARG does not require strict cutoffs, which enables identification of a much broader diversity of ARGs. The DeepARG models and database are available as a command line version and as a Web service here.

[Website] [Paper]

FastViromeExplorer

With the increase in the availability of metagenomic data generated by next generation sequencing, there is an urgent need for fast and accurate tools for identifying viruses in host-associated and environmental samples. In this paper, we developed a stand-alone pipeline called FastViromeExplorer for the detection and abundance quantification of viruses and phages in large metagenomic datasets by performing rapid searches of virus and phage sequence databases. Both simulated and real data from human microbiome and ocean environmental samples are used to validate FastViromeExplorer as a reliable tool to quickly and accurately identify viruses and their abundances in large datasets.

[Website] [Paper]

InfoTrim

Biological DNA reads are often trimmed before mapping, genome assembly, and other tasks to improve the quality of the results. Biological sequence complexity relates to alignment quality as low complexity regions can align poorly. There are many read trimmers, but many do not use sequence complexity for trimming. Alignment of reads generated from whole genome bisulfite sequencing is especially challenging since bisulfite treated reads tend to reduce sequence complexity. InfoTrim, a new read trimmer, was created to explore these issues. It is evaluated against five other trimmers using four read mappers on real and simulated bisulfite treated DNA data. InfoTrim has reasonable results that are consistent with other read mappers.

[Website] [Paper]

UPS-indel

Indels, though differing in allele sequence and position, are biologically equivalent when they lead to the same altered sequences. Storing biologically equivalent indels as distinct entries in databases causes data redundancy, and may mislead downstream analysis and interpretations. About 10% of the human indels stored in dbSNP are redundant. It is thus desirable to have a unified system for identifying and representing equivalent indels in publically available databases. Moreover, a unified system is also desirable to compare the indel calling results produced by different tools. This paper describes UPS-indel, a utility tool that creates a universal positioning system for indels so that equivalent indels can be uniquely determined by their coordinates in the new system, which also can be used to compare indel calling results produced by different tools. UPS-indel identifies nearly 15% indels in dbSNP (version 142) as redundant across all human chromosomes, higher than previously reported. When applied to COSMIC coding and noncoding indel datasets, UPS-indel identifies nearly 29% and 13% indels as redundant, respectively. Comparing the performance of UPS-indel with existing variant normalization tools vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants shows that UPS-indel is able to identify 456,352 more redundant indels in dbSNP; 2,118 more in COSMIC coding, and 553 more in COSMIC noncoding indel dataset in addition to the ones reported jointly by these tools. Moreover, comparing UPS-indel to other state-of-the-art approaches for indel call set comparison demonstrates that UPS-indel is clearly superior to other approaches in finding indels in common among call sets. UPS-indel is theoretically proven to find all equivalent indels, and is thus exhaustive. UPS-indel is written in C++ and the command line version is freely available here. The online version of UPS-indel is available here.

[Website] [Paper]

MetaStorm

Metagenomics is a trending research area, calling for the need to analyze large quantities of data generated from next generation DNA sequencing technologies. The need to store, retrieve, analyze, share, and visualize such data challenges current online computational systems. Interpretation and annotation of specific information is especially a challenge for metagenomic data sets derived from environmental samples, because current annotation systems only offer broad classification of microbial diversity and function. Moreover, existing resources are not configured to readily address common questions relevant to environmental systems. Here we developed a new online user-friendly metagenomic analysis server called MetaStorm, which facilitates customization of computational analysis for metagenomic data sets. Users can upload their own reference databases to tailor the metagenomics annotation to focus on various taxonomic and functional gene markers of interest. MetaStorm offers two major analysis pipelines: an assembly-based annotation pipeline and the standard read annotation pipeline used by existing web servers. These pipelines can be selected individually or together. Overall, MetaStorm provides enhanced interactive visualization to allow researchers to explore and manipulate taxonomy and functional annotation at various levels of resolution.

[Website] [Paper]

BAM-ABS

DNA methylation is an epigenetic modification critical for normal development and diseases. The determination of genome-wide DNA methylation at single-nucleotide resolution is made possible by sequencing bisulfite treated DNA with next generation high-throughput sequencing. However, aligning bisulfite short reads to a reference genome remains challenging as only a limited proportion of them (around 50–70%) can be aligned uniquely; a significant proportion, known as multireads, are mapped to multiple locations and thus discarded from downstream analyses, causing financial waste and biased methylation inference. To address this issue, we develop a Bayesian model that assigns multireads to their most likely locations based on the posterior probability derived from information hidden in uniquely aligned reads. Analyses of both simulated data and real hairpin bisulfite sequencing data show that our method can effectively assign approximately 70% of the multireads to their best locations with up to 90% accuracy, leading to a significant increase in the overall mapping efficiency. Moreover, the assignment model shows robust performance with low coverage depth, making it particularly attractive considering the prohibitive cost of bisulfite sequencing. Additionally, results show that longer reads help improve the performance of the assignment model. The assignment model is also robust to varying degrees of methylation and varying sequencing error rates. Finally, incorporating prior knowledge on mutation rate and context specific methylation level into the assignment model increases inference accuracy. The assignment model is implemented in the BAM-ABS package and freely available here.

[Website] [Paper]