Taking the leap into single-cell bioinformatics

 

When I began dabbling in single-cell RNA sequencing (scRNA-seq), the field was just emerging and few whole tissues had been analyzed in the literature. I was eager to take part, but I had little idea of the complexities that came with this decision. scRNA-seq is an incredibly powerful means of comparing whole tissues under normal and altered conditions (such as diseases, genetic modifications, or chemical insults), yielding unbiased discoveries in tissue population dynamics and gene expression that would be very difficult to make using conventional means. But with this enormous power come many potential pitfalls and considerable complexity. During my journey, I became firmly committed to single-cell bioinformatics to understand my data in the most fundamental and profound ways. Here are a few things I picked up along the way.

 

The requirements for good bioinformatic analysis start early.

Clearly, pristine starting material is essential for a robust analysis that will stand the test of time. Obtaining an end product that contains all tissue populations in good health requires rigorous testing and is not trivial. My colleagues and I performed trial runs over many months to establish enzyme concentrations for tissue dissociation, timing to reduce the death of delicate populations, and filters to selectively remove debris (but not cells). Subsequent samples appeared to provide robust results. Yet how do you actually measure the robustness of a sample preparation during your analysis? I will come back to this later.


Learning single-cell bioinformatics was a long journey.

Happily for me, the genetics core facility in my institute undertook both sample runs and pre-processing using the 10X pipeline software Cell Ranger. I had no previous knowledge of computer languages to help me with the analysis, but I was fortunate to receive help from an experienced colleague and, coincidentally, the university bioinformatics department began a superb work-study course for interested scientists to help demystify the field. These sources and hard work were enough to give me at least a rudimentary working knowledge of several analytical programs.

The biggest hurdle for me was learning how to code in R. Having no experience with any computer language, I found even the simplest commands a mystery, despite using the very helpful RStudio environment. I spent a solid two months going over code again and again, endlessly searching online for questions from people who had encountered similar problems (hint: online forums like Biostars and GitHub are true lifesavers). Slowly, painfully, I learned Seurat, how to alter parameters to check and recheck my data for validity, and how to run downstream analyses to generate all those wonderful graphs. After two years of examining many different single-cell RNA-seq matrices (GEO will provide all you need for practice), I now feel quite comfortable with many R-based software packages, the Mac terminal has become my friend, and I am even learning a little Python.

 

Single-cell data analysis evolved quickly in just a few years.

To this day, scRNA-seq data analysis remains user-unfriendly for the typical lab scientist. The most popular basic scRNA-seq analysis tools are Seurat and Scanpy (written in R and Python, respectively), and committing to either involves a steep learning curve.

But times are changing. Two years ago, some of the major analysis pipelines like Seurat (R) were undergoing rapid code development and new versions seemed to be coming out weekly. Seurat, which has always strived for well-written code and detailed explanations, was the most approachable and remains the gold standard for basic analysis today. I decided to focus on this analysis software for these reasons. However, Scanpy comes a close second and is the tool of choice for those who prefer Python. Other established platforms have improved significantly over the last several years, now with detailed explanations, simplified code, and visuals for newbies (like me!). Finally, there are ongoing attempts to streamline and simplify analysis so that it requires no knowledge of a computer language.

The bioinformatics field is also becoming increasingly interrogative, with newly emerging pipelines that query single-cell RNA-seq data for information about signaling pathways, ligand-receptor pairing, and much more. bioRxiv is now packed with manuscripts from contenders proposing their own advanced analysis packages to improve existing analytical methods and/or provide new ways of looking at data.

 

How do I know if I ran a robust analysis?

Clearly, the quality of the input material is of paramount importance for generating useful, reproducible results that other labs can verify. Beyond the standard checks to perform during sample collection, there are multiple assessments available during analysis to determine if the sample quality remains high. These include:

  • Reasonable cell and transcript numbers compared to your starting sample.
  • Low mitochondrial content (a high fraction of mitochondrial transcripts points to poor cell viability).
  • Dataset comparison with published related datasets if available (e.g., the GEO compendium).
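
As a rough illustration of the first two checks, here is a minimal sketch in plain Python that computes per-cell totals, detected genes, and the percentage of counts from mitochondrial genes on a toy matrix. The "MT-" gene-name prefix (the human naming convention), the 20% cutoff, the minimum gene count, and the function names are assumptions chosen for the example, not values from this article; sensible thresholds depend on your tissue and protocol.

```python
# Minimal sketch: per-cell QC metrics from a toy count matrix (pure Python).
# Assumptions for illustration only: the "MT-" prefix for mitochondrial genes,
# the 20% mitochondrial cutoff, and the minimum of 2 detected genes.

def qc_metrics(counts, gene_names, mito_prefix="MT-"):
    """Return one dict per cell: total counts, detected genes, % mitochondrial.

    counts: list of rows, one row per cell and one column per gene.
    """
    mito_cols = [i for i, gene in enumerate(gene_names)
                 if gene.upper().startswith(mito_prefix)]
    metrics = []
    for row in counts:
        total = sum(row)
        n_genes = sum(1 for c in row if c > 0)
        mito = sum(row[i] for i in mito_cols)
        pct_mito = 100.0 * mito / total if total else 0.0
        metrics.append(
            {"total_counts": total, "n_genes": n_genes, "pct_mito": pct_mito}
        )
    return metrics


def flag_low_quality(metrics, max_pct_mito=20.0, min_genes=2):
    """Flag cells with too much mitochondrial signal or too few genes."""
    return [m["pct_mito"] > max_pct_mito or m["n_genes"] < min_genes
            for m in metrics]


genes = ["ACTB", "GAPDH", "MT-CO1", "MT-ND1"]
toy_counts = [
    [50, 40, 5, 5],   # healthy-looking cell
    [5, 5, 60, 30],   # mostly mitochondrial counts, likely a dying cell
    [0, 3, 0, 0],     # almost nothing detected, likely debris
]
metrics = qc_metrics(toy_counts, genes)
flags = flag_low_quality(metrics)
for m, bad in zip(metrics, flags):
    print(m, "FLAGGED" if bad else "ok")
```

In practice, Seurat and Scanpy provide their own functions for these metrics; the point here is only to show what the numbers mean.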

In general, avoid “garbage in, garbage out”: if the sample quality is poor (i.e., with high cell death and/or poor recovery of some representative tissue populations), the results will be of limited value after a great deal of expense and effort.

 

How to understand/control/verify bioinformatics output?

It is important to understand that data analysis is a reflection of, not the cause of, poor results. Although different algorithms can be used to identify populations (cluster identification), comparisons will by and large reveal strong similarities in results regardless of the analysis pipeline chosen. I would suggest that scientists (emerging and established) undertake at least some basic training so they can work efficiently with bioinformaticians, who are trained to understand code but who may not have deep insight into the specific biological questions and interests of the scientist.

 

Denise Gay (www.DLGbiologics.com) is a biologist with a background in immunology and regenerative biology. Her most recent work, in collaboration with Paul-Henri Romeo and his group (CEA 92265 Fontenay-aux-Roses, France), showed for the first time that macrophages can phagocytize signaling inhibitors to promote more regenerative healing rather than fibrotic scarring (Science Advances). Three years ago, she was given a unique opportunity to undertake single-cell RNA-seq on whole skin wound tissue under both conditions (regenerative and scarring) and has never looked back.

 

Single-cell bibliography

Biostars forum: https://www.biostars.org/

GitHub community: https://github.community/

GEO Gene Expression Omnibus: https://www.ncbi.nlm.nih.gov/geo/

bioRxiv preprint server: https://www.biorxiv.org/

 

Interview with Dr. Marc Beyer: collaborating for optimal scRNA-seq bioinformatic analysis

Dr. Marc Beyer is a research group leader at the German Center for Neurodegenerative Diseases (DZNE) in Bonn, Germany. After initially studying medicine, he pursued a post-graduate degree in bioinformatics in 2000, when the field was just barely taking off. A decade later, he participated in the very early days of single-cell transcriptomics, leading him to work with the Platform foR SinglE Cell GenomIcS and Epigenomics (PRECISE) today. He agreed to meet us to discuss his journey and the best collaborations he witnessed between biology and bioinformatics regarding single-cell RNA sequencing (scRNA-seq).

Scipio: Good afternoon, Dr. Beyer. Thank you for sparing some time today to share your experience of the bioinformatic side of scRNA-seq projects. It can be a daunting prospect when scientists come from a pure biology background and must face such a new, complex, and fast-paced scientific field to complete their scRNA-seq studies. I understand you have been involved in single-cell bioinformatics almost since the very beginning, is that correct?

Dr. Marc Beyer: That’s right! When we started [bioinformatic analysis of scRNA-seq] back in 2013, there was nothing available. We had to gather the packages [modules to run specific computational programs] ourselves and try to assemble and run them the best we could. It is pretty impressive what we have been able to achieve in just the past few years: we are starting to have common pipeline standards in the field now, for primary data processing as well as downstream analysis.

S: I can imagine it must have been painstaking work if there was no user-friendly software or even interfaces at the time. Are there any lessons from those early days that still apply today, now that pipelines have started to emerge?

MB: Indeed, we learned early on that you have to “encapsulate” your analysis in a container, knowing the exact versions of the software and packages you used, so that if you need to run it again two years later, you can. Otherwise, you might be re-running your analysis two months later, but you will have different results because the packages have changed in the meantime. This still applies today with software versions from single-cell technology companies.
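
[Editor's illustration] A lightweight complement to a full container is to save the exact package versions next to your results. The sketch below is only an illustration and not the workflow used by Dr. Beyer's group: it relies on Python's standard importlib.metadata module, and the package list and the function name snapshot_environment are assumptions chosen for the example.

```python
# Minimal sketch: record interpreter and package versions alongside an analysis.
# Assumptions for illustration only: the package names listed below and the
# function name. This does not replace a container; it only documents versions.
import json
import platform
import sys
from importlib import metadata


def snapshot_environment(packages):
    """Return the Python version, platform, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "packages": versions,
    }


snapshot = snapshot_environment(["scanpy", "numpy", "pandas"])
# Save this text with your results so the run can be reproduced later.
print(json.dumps(snapshot, indent=2, sort_keys=True))
```

R users can get a comparable record with sessionInfo().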


S: Looking at the ever-growing list of tools available for single-cell analysis, it seems the pace is not slowing down. How do you keep up with these constant updates and new computational methods?

MB: Over time I have acquired some knowledge in basically analyzing transcriptome data. Not so much that I can program everything now by myself, simply because that is too huge a task, but I know how to interpret programs that people are writing. And I can explain to a bioinformatician how to analyze and how to interpret the data.

S: Coming from biology, it sounds pretty intimidating and quite a huge workload to commit to if you have to understand the concepts behind different programs and algorithms, even if you do not have to learn how to program directly! I might be tempted to just run my experiment and sequencing but leave the analysis to an experienced bioinformatician altogether.

MB: If you want somebody else to analyze your data, it’s just tough. I think we learned this the hard way. In principle, if people want to do something with us, we need to have an initial discussion about experimental design. What is their biological question? What do they want to address? To what level of detail do they want to go? Because depending on what biological questions you would like to answer, the technology is completely different. And the way to analyze the data will also be completely different.


S: So how do those discussions take place? How much of the biology behind the sample do you need to know before running an analysis?

MB: Well, when people approach us coming with cell types that we have no clue about, we simply say that it would be a very difficult project, because we have no idea about the biology. Therefore, I cannot tell you anything about how to set up the experiment, or how good the data will be because I do not know what to expect. On the other hand, I can use the algorithms that I normally run, which should give me valid results. But for the interpretation of the data, I have absolutely no idea what makes sense or not.

S: I see. When this is the case, how do you deal with such a gap in knowledge? Is there a good approach to solve this?

MB: I think this is a critical question for the whole field! I am in the luxurious position that I know what happens there. I see how people can struggle, and how they try to bridge this gap of knowledge. And I think there is probably no perfect solution. One of them is to say, well, I invest into this and I want to learn at least the basics myself, to be able to understand what the other side tells me. And that can be both sides, right? If you think about a bioinformatician having little knowledge about the biology, it’s the same thing if you want to understand what you’re actually doing with the data. You can do a lot of things with data, but in principle it is for the other side where you are coming with biological knowledge. You might think you have found the perfect way to analyze the data, but unfortunately find no significant gene expression in there, because that’s it, that’s your result. And the other side might say, “well, we do have a biological effect”. Now, how do we come together?

S: So which solutions have you come up with for those issues? How did your most fruitful collaborations take place?

MB: Ideally, you have to meet halfway between the biology and the bioinformatics. On the one hand, if you can, get somebody on your team who wants to commit and learn some bioinformatics to start bridging the gap in knowledge. On the other hand, we can provide somebody at this interface with some knowledge of biology to bridge the rest of the way. Our most successful collaborations were when a PhD student or a postdoc who is willing to spend time learning comes to our lab for three months or so. Then they go back, they analyze the data with what they have learned, they ask questions, and we have a productive back-and-forth to set up the best analysis and interpretation we can.

S: Having a team member in a lab motivated and committed to put in the time to learn the basics in bioinformatics for the whole group sounds like an ideal solution indeed. Would you have any advice for the cases where this is unfortunately not a possible solution?

MB: I would recommend to everybody that you find partners to collaborate with who are experienced in your domain, even if that means collaborating with people somewhere else on the globe. It is nice to have a lab nearby that might do single-cell technologies, but if they have a completely different domain of knowledge, then often it’s not helping a lot. I mean, you can learn the technology there, yes. But for your biological question, then I think it’s better to actually ask people who are more experienced in your field.

S: That sounds like a sensible plan indeed! Thank you very much Dr. Beyer.

Interview of Dr. Marc Beyer recorded on Jan 8th, 2021, by Wilko Duprez.

scRNA-seq: artifacts from sample preparation

Independently of the scRNA-seq methodology you plan to use, gene expression artifacts are bound to arise from the sample preparation preceding the mRNA capture step. Being aware of the stimuli that trigger spurious gene expression and alter the transcriptomic profile is necessary for optimal experimental design.

Typically, the tissue dissociation and cell isolation processes can take up to several hours, and in some cases days, during which the cells are removed from their normal environment. mRNAs are typically stabilized only once they are captured following cell lysis, so the vulnerable period runs from the initial sample collection all the way to cell lysis. During this time, cellular stress responses can change the behavior and morphology of the sample cells, and can even lead to cell death.

 

Effect of Sampling Time

The time elapsed between sample collection and processing directly affects the resulting transcriptomic profile of the different cell subtypes. In their investigation, Massoni-Badosa et al. collected blood cells from 5 patients and waited 0, 2, 8, 24, or 48 hours before processing the samples through various scRNA-seq technologies (Massoni-Badosa et al., Genome Biology, 2020). Principal component analysis (PCA) showed significant shifts across all cell subtypes that correlated with increasing sampling time. Digging further with differential expression analysis, they identified a time-dependent decrease in the number of detected genes in all their datasets and a global downregulation of gene expression. Overall, they identified between 1000 and 2000 differentially regulated genes (depending on the sample type) over the course of 48 hours.
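
To make the idea of a time-dependent drop in detected genes concrete, here is a minimal sketch that averages the number of detected genes per cell at each sampling time. The counts are made-up toy numbers, not data from the study, and the function name is an assumption for the example.

```python
# Minimal sketch: mean number of detected genes per cell at each sampling time.
# The toy counts below are invented for illustration; they are NOT data from
# Massoni-Badosa et al.
from collections import defaultdict
from statistics import mean


def detected_genes_by_time(cells):
    """cells: iterable of (hours_before_processing, counts_row) pairs."""
    per_time = defaultdict(list)
    for hours, row in cells:
        # A gene counts as "detected" if at least one read/UMI was observed.
        per_time[hours].append(sum(1 for c in row if c > 0))
    return {hours: mean(values) for hours, values in sorted(per_time.items())}


toy_cells = [
    (0, [3, 1, 2, 5, 1]),
    (0, [2, 0, 1, 4, 1]),
    (24, [2, 0, 1, 3, 0]),
    (24, [1, 0, 0, 2, 1]),
    (48, [1, 0, 0, 2, 0]),
    (48, [0, 0, 1, 1, 0]),
]
summary = detected_genes_by_time(toy_cells)
values = list(summary.values())
# True when every later time point has fewer detected genes than the previous.
steadily_decreasing = all(a > b for a, b in zip(values, values[1:]))
print(summary, steadily_decreasing)
```

Plotting this kind of summary for your own samples, grouped by time to processing, is one simple way to spot whether a similar trend is present.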

The gene expression signature of such a time-dependent bias could, and should, be identified and corrected during the bioinformatic analysis of the resulting scRNA-seq data. Nevertheless, the overall quality of the sample seems to decrease as the sampling time increases. It might therefore be worth investigating new ways of processing freshly harvested scRNA-seq samples more quickly, although this can be arduous in specific cases, for example when the collection occurs outside the normal hours of the sample processing facilities.

 

Effect of Digestion Time and Temperature

Standard sample preparation methods for solid tissues require enzymatic and/or mechanical dissociation and, depending on the tissue origin, density, disease state, elastin, or collagen content, this may require long enzymatic digestion and/or vigorous mechanical disruption. Transcriptional machinery remains active at 37 °C, and extended incubation at high temperatures may introduce gene expression artifacts, unrelated to the biological state at the time of harvest. Moreover, extended incubation at higher temperatures in the absence of nutrients or anchorage, or harsh dissociation, may induce apoptosis or anoikis, polluting the viable cell population or generating low-quality suspensions.

After discovering a set of 507 genes, some of them related to cell stress pathways, that were strongly affected by the digestion temperature, O’Flanagan et al. performed a time-course scRNA-seq experiment on breast cancer xenograft tissues. They sampled cells regularly over two hours of total digestion time, using either collagenase at 37 °C or a cold protease at 6 °C (O’Flanagan et al., Genome Biology, 2019). They found that digestion with collagenase substantially upregulated the expression of this core set of stress-related genes, with a subset expressed even more strongly as the digestion time increased. Applying differential expression analysis to their entire dataset, they found that 43% of the total 18,734 retained genes were differentially regulated after 2 hours of digestion compared to only 30 minutes.
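
To put that percentage in absolute terms, the short calculation below converts it into an approximate gene count. It is simple arithmetic on the two figures quoted above; because the 43% is itself rounded, the result is an estimate, and the function name is an assumption for the example.

```python
# Minimal sketch: convert the quoted percentage into an approximate gene count.
# Inputs are the two figures quoted in the text; the result is an estimate
# because the percentage is rounded.

def approx_gene_count(total_genes, fraction):
    """Approximate number of genes corresponding to a fraction of the total."""
    return round(total_genes * fraction)


affected = approx_gene_count(18_734, 0.43)
print(f"Roughly {affected} of 18,734 genes changed between 30 min and 2 h")
```

In other words, roughly eight thousand genes shifted with digestion time alone.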

This example highlights the importance of refining the digestion process from the outset to be as short, mild, and efficient as possible for your sample type, as it can affect the expression of a significant number of the detected genes.

 

Unusual sample types

Specific cell subtypes can be absent from a final scRNA-seq dataset because of peculiar characteristics. Some cell types are particularly hard to isolate (e.g., cardiomyocytes), meaning their population might be underrepresented in the resulting sample sent for scRNA-seq processing (Ackers-Johnson et al., Nature Communications, 2018). Unusual cell subtypes might have unusual shapes or sizes, preventing them from being successfully processed with some scRNA-seq technologies or instruments. Another example is cells undergoing anoikis after losing their anchorage to the extracellular matrix during tissue dissociation, a process believed to start only 3 hours after removal, which lowers the cell recovery rate and decreases the sample quality. An alternative that preserves the native physiological distribution of all cell subtypes and the genetic material of fragile cells is to extract the nuclei and sequence nuclear mRNA.

 

Conclusion

A successful scRNA-seq experiment starts at sample harvest: studies suggest that the quality and relevance of the data begin to deteriorate immediately, and that the extent of this deterioration depends on the length of the entire sample preparation phase, well before the single-cell mRNA capture step.

Even if the sampling and digestion times are well optimized, another factor that can extend the sample preparation time and affect sample quality is the wait to access a scRNA-seq platform or instrument where the mRNA will finally be captured. Until a new solution is developed to accelerate this process, you can look into current methods for freezing or fixing single cells to preserve your samples and the integrity of your future dataset.