From NEON blog

Want a Faster Way to Process Microbial DNA data? Try the neonMicrobe R Package

February 17, 2022

Field technician sampling soil microbes at the PUUM field site

Microbial DNA sequencing data from the NEON terrestrial field sites are freely available on the NEON Data Portal, but extracting useful information from all that data can be daunting for ecologists without a bioinformatics background. Thanks to the neonMicrobe R package, researchers can now automate much of the work of downloading, processing, and assembling microbial data from NEON terrestrial field sites.

Ph.D. candidate Clara Qin from the University of California – Santa Cruz (UCSC), who led development of the R package, says, “It can take a lot of effort to collect and process DNA data. This package takes away a lot of the heavy lifting, so the user can focus on conducting ecological analysis rather than just data wrangling.”

Lowering the Barriers to Entry for NEON Microbial Data Users

The neonMicrobe R package has its roots in a study led by Qin’s advisor, Dr. Kai Zhu, an associate professor of Environmental Studies at UCSC. He was studying the biogeography of soil fungi. The NEON soil microbial datasets offered a perfect opportunity to examine soil fungal diversity across the continent. Zhu tasked Qin with figuring out the best way to download, process, and analyze the NEON data. Qin, who is working on her Ph.D. in microbial macroecology at UCSC and holds an M.S. in Statistics from Stanford, at first thought the process would be fairly straightforward.

Qin says, “The neonUtilities package developed by NEON makes it easier to download data, and there are other R packages out there for sequence processing. But still, I felt there were gaps. The process was not intuitive for a first-time user. We wanted to build a ladder for other people who want to use this data in the future.”

The project really got off the ground after Zhu attended the NEON Science Summit in 2019 and connected with others interested in using the data. He explains, “The main users of the NEON microbial data are ecologists, not bioinformatics specialists who know a lot about gene sequencing and working with large sets of genetic data. We needed to make the data work not only for specialists but also for general ecologists. Because Clara was already working on the problem, we thought it would be great if we could do this as a group.”

Qin created the neonMicrobe R package, which is now freely available on GitHub, with contributions from Zoey Werbin, a graduate student at Boston University, and Dr. Lee Stanish, then a research scientist at NEON (she is now at the U.S. Geological Survey). Qin describes it as “a suite of functions for downloading, pre-processing and assembling data sets across NEON soil microbe genetic sequence data.” The package:

  • Helps to organize raw sequence data downloaded through neonUtilities.
  • Creates a directory structure on the user’s file system to make it easier to locate specific files.
  • Pre-processes the data using the dada2 R package.
  • Assembles data so it can be linked with other relevant data from the NEON Data Portal, such as soil abiotic data.
  • Makes it possible to run a sensitivity analysis based on various sequence quality filtering parameters.
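
To make the last item concrete, here is a minimal sketch of what a sensitivity analysis over quality filtering parameters can look like. It is written in plain Python rather than R and does not use neonMicrobe or dada2 functions; the names `trunc_len` and `max_ee`, the helper functions, and the toy reads are all assumptions made for illustration. The idea is simply to apply the same truncate-then-filter rule under several parameter combinations and compare how many reads survive each one.

```python
from itertools import product


def expected_errors(quals):
    """Expected number of errors in a read, computed from Phred quality scores."""
    return sum(10 ** (-q / 10) for q in quals)


def passes_filter(quals, trunc_len, max_ee):
    """Truncate a read to trunc_len bases, then keep it only if it was long
    enough and its expected errors do not exceed max_ee."""
    if len(quals) < trunc_len:
        return False
    return expected_errors(quals[:trunc_len]) <= max_ee


def sensitivity_grid(reads, trunc_lens, max_ees):
    """Fraction of reads retained for every (trunc_len, max_ee) combination."""
    return {
        (t, e): sum(passes_filter(r, t, e) for r in reads) / len(reads)
        for t, e in product(trunc_lens, max_ees)
    }


# Hypothetical toy reads, each a list of per-base Phred scores (not NEON data).
toy_reads = [
    [30] * 10,            # high quality throughout
    [30] * 6 + [10] * 4,  # quality drops off toward the end
    [20] * 8,             # medium quality, but shorter than 10 bases
    [10] * 10,            # low quality throughout
]

grid = sensitivity_grid(toy_reads, trunc_lens=[6, 10], max_ees=[0.1, 0.5])
for (t, e), frac in sorted(grid.items()):
    print(f"trunc_len={t:2d}  max_ee={e}:  {frac:.2f} of reads retained")
```

On these toy reads, the longer truncation length discards more reads because low-quality tails and short reads are no longer tolerated. That trade-off between read length and read retention is the kind of pattern a sensitivity analysis is meant to expose before a researcher commits to one set of parameters.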

The tool is designed to lower the barriers to entry for ecologists wishing to use raw gene sequencing data from the NEON program. “NEON already provides data products for microbial community composition,” Qin explains, “but they are several steps downstream of where we usually begin analysis. There are a lot of reasons that researchers may want to use the raw sequencing data: for example, to be able to compare with other datasets using consistent taxonomic units or to make new taxonomic assignments. This package lets you make decisions around how you process the data based on your research objectives and what you want to compare.” The neonMicrobe R package makes it easier for users to locate and organize the files they want, keep track of all the downloaded data, and prepare the data for analysis.

Qin, Zhu, and a number of coauthors published a paper describing the creation and validation of the R package in Ecosphere in November 2021: “From DNA sequences to microbial ecology: Wrangling NEON soil microbe data with the neonMicrobe R package.”

A Pipeline for Open Science and Interdisciplinary Research

Zhu and Qin hope that the new R package will serve as a data pipeline for microbial ecologists wishing to use NEON sequencing data. Using the R package will make it faster and easier to study questions around biodiversity, spatial and temporal dynamics, and the macroecological rules governing species presence and abundance.


Overview of the neonMicrobe package’s marker gene sequence processing pipeline. Blue rounded rectangles correspond to main vignettes in the R package; white parallelograms represent locally stored data or parameters. Credit: Clara Qin, et al. From DNA sequences to microbial ecology: Wrangling NEON soil microbe data with the neonMicrobe R package. Ecosphere, 12(11).

Zhu says, “When I go to conferences and talk about the package, a lot of people ask about it. This was not a trivial problem to solve; managing these large datasets and processing all the data takes time if you don’t have a tool. neonMicrobe will save people a lot of time and, ultimately, make science more reproducible, which is the goal.”

Their work is in the same spirit of open science that is the foundation for the NEON program. “The NEON program is about the data,” says Zhu. “This project is about the process—what we are helping people do with that data.”

Like other programs on GitHub, neonMicrobe is free to use, and the source code is fully transparent, so other programmers can continue to build on the work. Zhu explains that this transparency is critical to supporting reproducible and self-correcting science. “If you close the loop and you are the only people who know how the tool works, how can it be verified? If you have many eyes looking at the same thing, you will get more accuracy.”

Qin adds, “Open science reduces barriers to interdisciplinary collaboration and opens the doors to looking at environmental questions more broadly. In the case of my own research, as someone who focuses on statistical questions, a lot of the data I use comes from large open networks like the NEON program. Now, we’re creating tools to help more people take advantage of those resources. Bringing people together across disciplines makes science more holistic and widely applicable…and ultimately, that means better science.”

Scaling up Microbial Ecology

Zhu and his students are using the R package to further research into soil fungal communities using the sequencing data available from the NEON program. In 2019, Zhu received a National Science Foundation (NSF) grant to build models of soil fungal diversity across North America. NEON data have been critical to these efforts.

“For the questions we wanted to ask, NEON is the best possible data source,” says Zhu. “Data collection is standardized, which is really important if you want to make comparisons across regions or across time. If samples are collected and analyzed one way in California and another way in Nebraska, for example, it makes comparison very hard. In addition, the locations NEON samples are very diverse in terms of ecosystem type and climate. The 47 terrestrial sites represent 20 ecoclimate domains and many different types of environmental gradients across the U.S. This allows us to study patterns across different variables and make our conclusions more general and stronger.”

The nested spatial design of the NEON program also allows analysis across different spatial scales. In turn, this allows Zhu to investigate a broader range of questions about the drivers of microbial diversity both within and across sites. For example, soil pH may be more applicable at the local level when looking at differences in microbial communities within the same site, whereas climate and species’ dispersal limitations may come into play when looking across larger regions. neonMicrobe facilitates the analysis of NEON microbial data alongside other data products, such as soil chemistry and physical characteristics or climate data. Zhu hopes to use the NEON data to explore the geographic distribution of soil fungi, the drivers behind those distribution patterns, and how climate change may impact that distribution in the future.
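
As a rough illustration of what analyzing microbial data alongside other data products involves, the sketch below joins per-sample fungal richness to soil measurements by sample ID, then summarizes by site. It is plain Python with made-up values; the sample IDs, site names, field names, and helper functions are hypothetical and are not part of neonMicrobe or the NEON Data Portal. The sample-level rows support within-site comparisons, while the site means support comparisons across sites.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-sample fungal richness (not NEON data).
richness = {"S1": 120, "S2": 95, "S3": 210, "S4": 180}

# Hypothetical soil measurements keyed by the same sample IDs.
soil = {
    "S1": {"site": "SITE_A", "ph": 5.1},
    "S2": {"site": "SITE_A", "ph": 4.8},
    "S3": {"site": "SITE_B", "ph": 6.9},
    # S4 has no matching soil record, so it drops out of the join.
}


def link_by_sample(richness, soil):
    """Inner join on sample ID: keep only samples present in both tables."""
    return [
        {"sample": sample, "richness": value, **soil[sample]}
        for sample, value in richness.items()
        if sample in soil
    ]


def site_means(rows):
    """Average richness and pH per site, for across-site comparisons."""
    by_site = defaultdict(list)
    for row in rows:
        by_site[row["site"]].append(row)
    return {
        site: {
            "richness": mean(r["richness"] for r in group),
            "ph": mean(r["ph"] for r in group),
        }
        for site, group in by_site.items()
    }


linked = link_by_sample(richness, soil)  # sample-level rows (within-site scale)
per_site = site_means(linked)            # site-level summary (across-site scale)
print(linked)
print(per_site)
```

An inner join is used here so that every row has both a richness value and a soil measurement; a real analysis would also need to decide how to handle samples that lack a match.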

Qin and Zhu hope to see more researchers download neonMicrobe to explore the “big questions” around microbial species diversity, distribution, and dynamics. To help new users get started, Qin and her collaborators have created a series of tutorials available on the neonMicrobe GitHub page.

“We really learned a lot through the process,” says Qin. “At this stage, now we need more input from users to decide where to go next. We could expand the package to include NEON aquatic microbial data if there is interest from the user community. We could also continue to build out features to support the reproducibility of analysis. And, of course, we hope others will build on our code. The next steps really depend on what the community feels will be most productive.”