Microbiome research – don’t forget to look after your data!


Ragavi Shanmugam | Microbiome Research

Microorganisms (microbes) are found in every environment on earth, including air, water, soil, plants, and animals. A “microbiome” is the collection of microbes living in a specific environment, and it can include bacteria, viruses, fungi, and other microorganisms. The human gut microbiome, for example, refers to all the microorganisms living inside the human gut [1].

By studying a microbiome, scientists can better understand the impact of human activities or other external factors on that environment. For example, by investigating the microbes in soil and their effect on crop growth, scientists can better understand and improve crop yields, potentially reduce the need for pesticides, and perhaps create more sustainable farming practices. Furthermore, many studies have shown that changes in the human gut microbiome are associated with several diseases, such as inflammatory bowel disease, type 2 diabetes, and obesity [1][2].

In the area of human health, microbiome-related products are available that alter the microbiome in a given environment to achieve a desired outcome. With a better understanding of the impact of the microbiome on health, the demand for microbiome products such as probiotics and prebiotics has grown significantly over the last decade. Microbiome therapeutics are now evolving rapidly and have shown promising results in treating recurrent Clostridium difficile infection [3] and certain mental health disorders [4][5]. Applications of microbiome-based products can now be found in the food, agriculture, fermentation, cosmetic and wellbeing/fitness industries. Given the fast-moving nature of this research, the focus is now on how to effectively capture, analyze and leverage microbiome data to support research and product development.

Microbiome Data: Microbiome studies typically start with the genetic material (DNA or RNA) of the microbiome. The genetic material is extracted from the microbiome sample and sequenced. Several approaches are used to gather the molecular genetic data of a microbiome.

16S rRNA sequencing: This technique is commonly used to study microbiomes. It identifies and classifies bacteria based on their 16S rRNA gene sequence. It is a cost-effective and rapid method for characterizing a microbiome and allows different bacterial genera to be identified. However, it has limited taxonomic resolution and often cannot distinguish between closely related species or strains. It can also miss rare or low-abundance taxa that are important for understanding overall microbiome function, and it is prone to sequencing errors.

Shotgun metagenomics: This technique sequences the genomes of all the microbes in a sample. It provides more detailed information than 16S rRNA sequencing, but it is more expensive. Unlike traditional methods that focus on one organism at a time, metagenomic sequencing allows researchers to simultaneously analyze the DNA of thousands of microbial species and strains in a single run. Metagenomics is now an essential technique for understanding the complex interactions between different microorganisms and their roles in maintaining the health and stability of the ecosystems they inhabit.

Metatranscriptomics: This technique analyzes the RNA molecules being produced by the microbes at a given time. The transcription information shows not only which genes are being expressed but also how they are being regulated: up or down relative to an earlier time point or another condition. This reveals much more about the state of the microbiome at a point in time, including which pathways and genes are active in the microbial population. Compared with analyzing genomic DNA alone, the metatranscriptomics approach can tell scientists how the microbiome reacts to changes in its environment and to external factors such as pollutants or different types of food.
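To make the idea of up- and down-regulation concrete, here is a minimal Python sketch that computes a log2 fold change per gene between two time points. The gene names, read counts, pseudocount and threshold are all invented for illustration; a real metatranscriptomic analysis would normalize for sequencing depth and apply a proper statistical test rather than a fixed cut-off.

```python
import math


def log2_fold_change(before, after, pseudocount=1.0):
    """Return log2(after / before) per gene.

    A pseudocount is added to both values so that genes with zero
    reads at one time point do not cause a division by zero.
    """
    genes = sorted(set(before) | set(after))
    return {
        gene: math.log2(
            (after.get(gene, 0) + pseudocount) / (before.get(gene, 0) + pseudocount)
        )
        for gene in genes
    }


def classify(lfc, threshold=1.0):
    """Label a log2 fold change as up, down or unchanged (illustrative cut-off)."""
    if lfc >= threshold:
        return "up"
    if lfc <= -threshold:
        return "down"
    return "unchanged"


# Hypothetical read counts for three genes at two time points.
counts_t0 = {"geneA": 99, "geneB": 399, "geneC": 49}
counts_t1 = {"geneA": 399, "geneB": 99, "geneC": 49}

changes = log2_fold_change(counts_t0, counts_t1)
for gene, lfc in changes.items():
    print(gene, round(lfc, 2), classify(lfc))
```

With these made-up counts, geneA is reported as up, geneB as down and geneC as unchanged.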

All the above techniques require the same key steps:

  • Sample collection
  • Extracting the genetic material
  • Sequencing
  • Sequence data analysis
  • Interpretation

Once the data is gathered from the sequencer, the way the raw data is processed varies with the nature of the study. After the DNA/RNA has been examined to obtain the taxonomic composition, functional abundance or gene expression profile of the microbiome, the output is analyzed using various statistical or AI/ML methods to test a hypothesis.
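As a small illustration of the analysis step, the sketch below turns per-taxon read counts into a relative-abundance profile and a Shannon diversity index, two summaries that often feed into later statistical testing. The taxa and counts are hypothetical, and this is only one of many possible summaries.

```python
import math


def relative_abundance(counts):
    """Convert raw read counts per taxon into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no classified reads")
    return {taxon: n / total for taxon, n in counts.items()}


def shannon_index(proportions):
    """Shannon diversity H = -sum(p * ln p), skipping taxa with zero abundance."""
    return -sum(p * math.log(p) for p in proportions.values() if p > 0)


# Hypothetical read counts for one sample.
sample_counts = {"Bacteroides": 50, "Faecalibacterium": 25, "Escherichia": 25}

profile = relative_abundance(sample_counts)
diversity = shannon_index(profile)

print(profile)
print(round(diversity, 3))
```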

A proper microbiome sequence data analysis pipeline and robust data storage are crucial for ensuring the accuracy and reproducibility of a microbiome study. There are many challenges in the analysis and interpretation phase due to the huge amounts of data these studies and sequencing instruments produce. Figure 1 explains the challenges at each major step of a metagenomics study:

Data Challenges

As researchers increasingly turn to microbiome data to answer complex scientific questions, it's important to have a good data management and analysis strategy and infrastructure in place. Here are some key things to do and some considerations:

Computational Environment: Set up a computational environment and evaluate the need for cloud services.

This may include evaluating the cost and scalability of cloud services, as well as ensuring compliance with relevant regulations. Consider which cloud service provider to use, such as Google Cloud or AWS, and compare the cost and features of each.

Bioinformatics Processing: Identify and set up the data processing workflow so it is both reproducible and scalable.

It is common for bioinformaticians to rely on open-source workflows or pipelines that are published as GitHub repositories. These resources have advantages and disadvantages that researchers should be aware of before deciding to use them. The advantages include flexibility to customize, cost efficiency, collaboration with the community, transparency, and reproducibility. The disadvantages include lack of maintenance, compatibility issues with in-house applications or environments, unresolved bugs and security vulnerabilities.
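One small, practical step toward reproducibility is to record exactly what went into each run. The following Python sketch builds a run manifest containing input file checksums, tool versions and parameters. The tool name, version and parameter shown are hypothetical placeholders, and established workflow managers offer their own, more complete provenance reporting.

```python
import hashlib
import json
import platform
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(inputs, tool_versions, parameters):
    """Collect the details needed to repeat a processing run later."""
    return {
        "python_version": platform.python_version(),
        "tool_versions": dict(tool_versions),
        "parameters": dict(parameters),
        "inputs": {Path(p).name: sha256_of(p) for p in inputs},
    }


# Demo with a tiny stand-in FASTQ file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    fastq = Path(tmp) / "sample1.fastq"
    fastq.write_bytes(b"@read1\nACGT\n+\nIIII\n")
    manifest = build_manifest(
        inputs=[fastq],
        tool_versions={"example_assembler": "1.2.3"},  # hypothetical tool and version
        parameters={"min_contig_length": 1000},  # hypothetical parameter
    )
    manifest_json = json.dumps(manifest, indent=2, sort_keys=True)

print(manifest_json)
```

Storing a manifest like this next to each set of outputs makes it possible to check later whether two results came from the same inputs and settings.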

Data Storage: Define where you will store and manage data processing outputs and bioinformatics-specific file formats.

Bioinformatics processing involves various steps, including base calling, read counting, taxonomic classification, and functional annotation. The outputs require a strategy for efficiently modelling and storing large file formats such as FASTQ and BAM, as well as tab-delimited tables. Without proper storage solutions, the data will quickly become disorganized and expensive to store.
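A useful first step in planning storage is to know how much space each file type already takes up. The sketch below walks a project directory and totals bytes per file type, treating compressed names such as .fastq.gz as a single type. The directory layout and file names in the demo are invented, and the files are tiny placeholders rather than real sequencing data.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def file_kind(path):
    """Return a file-type label, keeping compound suffixes such as '.fastq.gz'."""
    suffixes = [s.lower() for s in path.suffixes]
    if len(suffixes) > 1 and suffixes[-1] == ".gz":
        return "".join(suffixes[-2:])
    return suffixes[-1] if suffixes else "(none)"


def storage_by_kind(root):
    """Sum file sizes in bytes under root, grouped by file type."""
    totals = defaultdict(int)
    for path in Path(root).rglob("*"):
        if path.is_file():
            totals[file_kind(path)] += path.stat().st_size
    return dict(totals)


# Demo on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "raw").mkdir()
    (root / "raw" / "s1.fastq.gz").write_bytes(b"x" * 20)
    (root / "raw" / "s2.fastq.gz").write_bytes(b"x" * 10)
    (root / "aligned.bam").write_bytes(b"x" * 5)
    (root / "taxonomy.tsv").write_bytes(b"x" * 7)
    totals = storage_by_kind(root)

# Largest file types first.
for kind, size in sorted(totals.items(), key=lambda item: -item[1]):
    print(kind, size)
```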

Data Integration: Ensure all data is in one place.

Define how you will integrate sample/subject metadata from LIMS or clinical systems with the outputs of the bioinformatics analysis pipeline. This is crucial for ensuring that the metadata is properly recorded and that relationships are established between the various data entities. Having the right approach to integrating processing outputs with public databases (e.g. KEGG) makes a wider range of analyses possible. Relational data modelling is also necessary for downstream work such as network analysis, graph databases, statistical modelling, or AI/ML.
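The relationship between sample metadata and pipeline outputs can be sketched with a very small relational model. The example below uses an in-memory SQLite database with two hypothetical tables, sample and taxon_abundance, linked by a sample identifier. The table names, columns and values are assumptions made for illustration, not a recommended schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the sample_id relationship

# Hypothetical schema: one table for metadata, one for pipeline output.
conn.executescript(
    """
    CREATE TABLE sample (
        sample_id  TEXT PRIMARY KEY,
        subject_id TEXT NOT NULL,
        body_site  TEXT NOT NULL
    );
    CREATE TABLE taxon_abundance (
        sample_id  TEXT NOT NULL REFERENCES sample(sample_id),
        taxon      TEXT NOT NULL,
        read_count INTEGER NOT NULL,
        PRIMARY KEY (sample_id, taxon)
    );
    """
)

# Metadata as it might arrive from a LIMS export (invented values).
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?)",
    [("S1", "P01", "gut"), ("S2", "P02", "gut"), ("S3", "P01", "skin")],
)

# Taxonomic counts as they might arrive from a pipeline (invented values).
conn.executemany(
    "INSERT INTO taxon_abundance VALUES (?, ?, ?)",
    [
        ("S1", "Bacteroides", 120),
        ("S1", "Escherichia", 30),
        ("S2", "Bacteroides", 80),
        ("S3", "Staphylococcus", 200),
    ],
)

# Join the two sources to summarize reads per taxon and body site.
rows = conn.execute(
    """
    SELECT s.body_site, t.taxon, SUM(t.read_count) AS total_reads
    FROM taxon_abundance AS t
    JOIN sample AS s ON s.sample_id = t.sample_id
    GROUP BY s.body_site, t.taxon
    ORDER BY s.body_site, t.taxon
    """
).fetchall()

for row in rows:
    print(row)
```

Because the abundance table references the sample table, a count for an unknown sample is rejected at insert time, which is one way to keep metadata and outputs consistent.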

Data Visualization: Understand how best to visualize your data.

Data visualization is highly subjective and specific to the use case, so you must decide whether to go with static plots, interactive visualizations, or other tooling. There are some good open-source genomics tools, such as IGV and the UCSC Genome Browser, that can be integrated with a database and embedded in your R Shiny or Python Dash applications.

Data Governance: Ensure the data is properly managed and curated over time.

Organizations must carefully consider how to store microbiome data long-term and have strategies for version control, data backup, and data archiving. Evaluate the need for a centralized data repository and, if one is needed, which repository to use. The specialist nature of data governance calls for roles such as data stewards with deep domain knowledge and experience of data-protection and data-sharing regulations such as GDPR and HIPAA.

So, we can see it is important to have all the right pieces in the right place to conduct effective microbiome research and development. Each step of the process matters, and a reproducible bioinformatics workflow is a core part of this. The challenge, though, is that you need knowledge in multiple domains (scientific, data, cloud, and analytics) to perform due diligence and make informed decisions about informatics support for your microbiome research and development.

Zifo and Boost Biomes are going to jointly expand on the above challenges, and on approaches to tackling them, in this webinar:

Case Study 1:

 Hunt for fit-for-purpose tools:

Metagenomics requires several tools for processing and analyzing raw sequencing data. There are many open-source workflows, such as SqueezeMeta, MetaVelvet, etc., and open-source data resources, such as HMP, KEGG, GTDB, etc. While this is helpful, it can also be overwhelming for scientists to determine which tools/resources to use and how they compare to each other.

A data management plan should include strategies to address the disadvantages or limitations of the open-source resources described above in the bioinformatics processing step.

Zifo was approached to help benchmark and choose fit-for-purpose tools. Our bioinformatics team recognized the need for benchmarking to compare the performance of different metagenomics data processing tools. Popular and commonly used tools were evaluated for each processing step, such as quality control, assembly, taxonomy, and functional annotation. The nf-core/mag pipeline was enhanced by including these tools to turn it into a benchmarking pipeline. The benchmarking pipeline is published on GitHub for use by the broader scientific community.
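To give a flavour of what comparing tools can involve, here is a minimal Python sketch that scores the taxa reported by two classifiers against a sample of known composition, using precision, recall and F1. This is not the method used in the benchmarking pipeline described above; the classifier names, the taxa and the choice of metrics are assumptions made purely for illustration.

```python
def score_against_truth(predicted, truth):
    """Compare predicted taxa with a known composition.

    Returns precision, recall and F1, treating each taxon as
    simply present or absent.
    """
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical known composition of a reference sample.
truth = {"Bacteroides", "Escherichia", "Lactobacillus", "Bifidobacterium"}

# Hypothetical outputs from two classifiers run on that sample.
predictions = {
    "classifier_a": {"Bacteroides", "Escherichia", "Lactobacillus", "Salmonella"},
    "classifier_b": {"Bacteroides", "Escherichia"},
}

scores = {tool: score_against_truth(taxa, truth) for tool, taxa in predictions.items()}
for tool, result in scores.items():
    print(tool, {name: round(value, 2) for name, value in result.items()})
```

In this invented example, classifier_b makes no false calls but misses half of the taxa, which is the kind of trade-off a benchmark is meant to expose.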

You can read more and try out this pipeline here:

Case Study 2:

 Stitch together your microbiome data:

Microbiome data comes in different shapes and sizes, and as technology advances, scientists need to combine various types of microbiome data to see the full picture. Our team was asked to design and develop a microbiome database combining various kinds of microbiome data, such as 16S, WGS and sample metadata. The main challenge was modelling the data and working out the relationships between different entities. The bioinformatics processing methods for 16S and WGS data were different, which posed its own challenges during integration. Our team also modelled the KEGG, RHEA and PATRIC public databases to make sure the annotation information was available in standardized formats for downstream analysis. The public sources hold plenty of important information, but the lack of a common standard genomic ontology makes the data difficult to digest. We overcame these challenges by carefully selecting only the data that supported the end goal.

We picked technologies that are both efficient and user-friendly. The database was successfully built and deployed, and now delivers a centralized platform for managing all microbiome sequencing data from multiple studies and projects. Scientists are now able to query their data using a simple search, and the database provides a working framework for conducting comparative analysis across different projects.

To find out more about how Zifo can help with your Microbiome informatics and analysis needs please email us directly at


1. Gomaa E. Z. (2020). Human gut microbiota/microbiome in health and diseases: a review. Antonie van Leeuwenhoek, 113(12), 2019–2040. https://doi.org/10.1007/s10482-020-01474-7.

2. Dominguez-Bello, M. G., Godoy-Vitorino, F., Knight, R., & Blaser, M. J. (2019). Role of the microbiome in human development. Gut, 68(6), 1108–1114. https://doi.org/10.1136/gutjnl-2018-317503.

3. Arbel, L. T., Hsu, E., & McNally, K. (2017). Cost-Effectiveness of Fecal Microbiota Transplantation in the Treatment of Recurrent Clostridium Difficile Infection: A Literature Review. Cureus, 9(8), e1599. https://doi.org/10.7759/cureus.1599.

4. Peirce, J. M., & Alviña, K. (2019). The role of inflammation and the gut microbiome in depression and anxiety. Journal of Neuroscience Research, 97(10), 1223–1241. https://doi.org/10.1002/jnr.24476.

5. Malan-Muller, S., Valles-Colomer, M., Raes, J., Lowry, C. A., Seedat, S., & Hemmings, S. M. J. (2018). The Gut Microbiome and Mental Health: Implications for Anxiety- and Trauma-Related Disorders. Omics : a journal of integrative biology, 22(2), 90–107. h.