When I began dabbling in single-cell RNA sequencing (scRNA-seq), the field was just emerging and few whole tissues had been analyzed in the literature. I was eager to take part, but I had little idea of the complexities that came with this decision. scRNA-seq is an incredibly powerful means of comparing whole tissues under normal and altered conditions – such as diseases, genetic modifications, or chemical insults – to provide unbiased discoveries in tissue population dynamics and gene expression that would never be found using conventional means. But for all its power, it is also fraught with potential pitfalls and incredible complexity. During my journey, I became firmly committed to single-cell bioinformatics to understand my data in the most fundamental and profound ways. Here are a few things I picked up along the way.
The requirements for good bioinformatic analysis start early.
Clearly, pristine starting material is essential for a robust analysis that will stand the test of time. Obtaining an end product that contains all tissue populations in good health requires rigorous testing and is not trivial. My colleagues and I performed trial runs over many months to establish enzyme concentrations for tissue dissociation, timing to reduce death of delicate populations, and filters to selectively remove only debris (but not cells). Subsequent samples appeared to provide robust results. Yet how do you actually measure the robustness of a sample preparation during your analysis? I will come back to this later.
Learning single-cell bioinformatics was a long journey.
Happily for me, the genetics core facility at my institute undertook both the sample runs and the pre-processing, using the 10x Genomics pipeline software Cell Ranger. I had no previous knowledge of computer languages to help me with the analysis, but I was fortunate to receive help from an experienced colleague and, coincidentally, the university bioinformatics department began a superb work-study course for interested scientists to help demystify the field. These resources, together with hard work, were enough to give me at least a rudimentary working knowledge of several analytical programs.
The biggest hurdle for me was learning how to code in R. Having no experience with any computer language, I found even the simplest commands a mystery, despite the very helpful RStudio environment. I spent a solid two months going over code again and again, endlessly searching online for questions from others who had encountered similar problems (hint: online forums like Biostars and GitHub are true lifesavers). Slowly, painfully, I learned Seurat: how to alter parameters to check and recheck my data for validity, and how to run downstream analyses to generate all those wonderful graphs. After two years of examining many different single-cell RNA-seq matrices (GEO will provide all you need for practice), I now feel quite comfortable with many R-based software packages, the Mac terminal has become my friend, and I am even learning a little Python.
Single-cell data analysis evolved quickly in just a few years.
To this day, scRNA-seq data analysis remains user-unfriendly for the typical lab scientist. The most popular tools for basic scRNA-seq analysis are Seurat and Scanpy (written in R and Python, respectively), and committing to either means a steep learning curve.
But times are changing. Two years ago, some of the major analysis pipelines like Seurat (R) were undergoing rapid code development and new versions seemed to be coming out weekly. Seurat, which always strove for well-written code and detailed explanation, was the most approachable and remains the gold standard for basic analysis today. I decided to focus on this analysis software for these reasons. However, Scanpy is a close second and is the tool of choice for those who prefer Python. Other established platforms have improved significantly over the last several years, now with detailed explanations, simplified code and visuals for newbies (like me!). Finally, there are current attempts to streamline and simplify analysis so that it requires no knowledge of a computer language.
The bioinformatics field is also becoming increasingly interrogative, with newly emerging pipelines that query single-cell RNA-seq data for information about signaling pathways, ligand-receptor pairing, and much more. BioRxiv is now packed with manuscripts from contenders proposing their own advanced analysis packages to improve existing analytical methods and/or provide new ways of looking at data.
How do I know if I ran a robust analysis?
Clearly, the quality of the input material is of paramount importance for generating useful, reproducible results that other labs can verify. Beyond the standard checks to perform during sample collection, there are multiple assessments available during analysis to determine if the sample quality remains high. These include:
- Reasonable cell and transcript numbers compared to your starting sample.
- Low mitochondrial content (a high fraction of mitochondrial transcripts per cell points to poor cell viability).
- Dataset comparison with published related datasets if available (e.g., in the GEO repository).
In general, avoid “garbage in – garbage out”: If the sample quality is poor – i.e., with high cell death and/or poor recovery of some representative tissue populations – the results will be of limited value after a great deal of expense and effort.
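To make the first two checks above concrete, here is a minimal sketch in plain Python of the per-cell numbers they rely on: total transcript counts, number of genes detected, and percentage of mitochondrial transcripts. Everything in it is an assumption for illustration: the toy counts, the cutoffs `MIN_GENES` and `MAX_PCT_MITO`, and the helper names `qc_metrics` and `passes_qc` are hypothetical and do not come from any real dataset or package. In practice, Seurat (`PercentageFeatureSet`) and Scanpy (`sc.pp.calculate_qc_metrics`) compute comparable metrics for you, and sensible cutoffs depend on the tissue and protocol.

```python
# Toy data: one dict per cell mapping gene symbol -> transcript (UMI) count.
# Gene names and counts are invented purely for illustration.
cells = {
    "cell_A": {"ACTB": 120, "KRT14": 80, "MT-CO1": 10, "MT-ND1": 5},
    "cell_B": {"ACTB": 15, "MT-CO1": 60, "MT-ND1": 40},
    "cell_C": {"ACTB": 200, "KRT14": 150, "COL1A1": 90, "MT-CO1": 20},
}

# Mitochondrial genes are recognized by symbol prefix: "MT-" in human,
# "mt-" in mouse. The comparison below is case-insensitive to cover both.
MITO_PREFIX = "MT-"

# Hypothetical cutoffs; appropriate values vary by tissue and protocol.
MIN_GENES = 3
MAX_PCT_MITO = 20.0


def qc_metrics(counts):
    """Return total counts, genes detected and % mitochondrial for one cell."""
    total = sum(counts.values())
    n_genes = sum(1 for c in counts.values() if c > 0)
    mito = sum(c for gene, c in counts.items()
               if gene.upper().startswith(MITO_PREFIX))
    pct_mito = 100.0 * mito / total if total else 0.0
    return {"total_counts": total, "n_genes": n_genes, "pct_mito": pct_mito}


def passes_qc(m):
    """Keep cells with enough detected genes and a low mitochondrial share."""
    return m["n_genes"] >= MIN_GENES and m["pct_mito"] <= MAX_PCT_MITO


metrics = {name: qc_metrics(counts) for name, counts in cells.items()}
kept = [name for name, m in metrics.items() if passes_qc(m)]

for name, m in metrics.items():
    status = "keep" if passes_qc(m) else "drop"
    print(f"{name}: {m['total_counts']} counts, {m['n_genes']} genes, "
          f"{m['pct_mito']:.1f}% mito -> {status}")
```

In this toy example, `cell_B` is dominated by mitochondrial transcripts and is flagged, which is the pattern expected from dying cells. Comparing the number of cells kept with the number loaded gives a rough handle on the first check in the list.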
How do I understand, control, and verify bioinformatics output?
It is important to understand that data analysis is a reflection of, not the cause of, poor results. Although different algorithms can be used to identify populations (cluster identification), comparisons will by and large reveal strong similarities in results, independent of the analysis pipeline chosen. I would suggest that scientists (emerging and established) undertake at least some basic training so they can work efficiently with bioinformaticians, who are trained to understand code but who may not have deep insight into the scientist's specific biological questions and interests.
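One way to put a number on how similar two sets of cluster assignments are is the adjusted Rand index (ARI), which is 1.0 when two clusterings group the cells identically (whatever the cluster names) and close to 0 when they agree no more than chance would predict. The sketch below is a plain-Python illustration under stated assumptions: the two label lists are made up and merely stand in for the output of two pipelines run on the same cells, and `adjusted_rand_index` is a hypothetical helper written for this example (scikit-learn offers `adjusted_rand_score` for real use).

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same cells")
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: nothing to disagree on

    # Pairs of cells placed together in both clusterings.
    together_both = sum(comb(c, 2)
                        for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of cells placed together within each clustering separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs  # chance agreement
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every cell in a single cluster
    return (together_both - expected) / (maximum - expected)


# Hypothetical cluster labels for the same six cells from two pipelines.
# Cluster names differ, but the groupings are the same.
pipeline_one = [0, 0, 0, 1, 1, 2]
pipeline_two = ["b", "b", "b", "a", "a", "c"]

print(adjusted_rand_index(pipeline_one, pipeline_two))
```

A high ARI between pipelines on a good sample supports the point above; a low one is a prompt to look at the input data and parameters before blaming either tool.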
Denise Gay (www.DLGbiologics.com) is a biologist with a background in immunology and regenerative biology. Her most recent work, in collaboration with Paul-Henri Romeo and his group (CEA 92265 Fontenay-aux-Roses, France), showed for the first time that macrophages can phagocytize signaling inhibitors to promote more regenerative healing rather than fibrotic scarring (Science Advances). Three years ago, she was given a unique opportunity to undertake single-cell RNA-seq on whole skin wound tissue under both conditions (regenerative and scarring) and has never looked back.
Biostars forum: https://www.biostars.org/
Github community: https://github.community/
GEO Gene Expression Omnibus: https://www.ncbi.nlm.nih.gov/geo/
BioRxiv journal: https://www.biorxiv.org/