Interview with Dr. Marc Beyer: collaborating for optimal scRNA-seq bioinformatic analysis

Dr. Marc Beyer is a research group leader at the German Center for Neurodegenerative Diseases (DZNE) in Bonn, Germany. After initially studying medicine, he pursued a post-graduate degree in bioinformatics in 2000, when the field was just barely taking off. A decade later, he participated in the very early days of single-cell transcriptomics, leading him to work with the Platform foR SinglE Cell GenomIcS and Epigenomics (PRECISE) today. He agreed to meet us to discuss his journey and the best collaborations he witnessed between biology and bioinformatics regarding single-cell RNA sequencing (scRNA-seq).

Scipio: Good afternoon Dr. Beyer. Thank you for sparing some time today to share your experience about the bioinformatic side of scRNA-seq projects. It can be a daunting prospect when scientists come from a pure biology background and must face such a new, complex, and fast-paced scientific field to complete their scRNA-seq studies. I understand you have been involved in single-cell bioinformatics almost since the very beginning, is that correct?

Dr. Marc Beyer: That’s right! When we started [bioinformatic analysis of scRNA-seq] back in 2013, there was nothing available. We had to gather the packages [modules to run specific computational programs] ourselves and try to assemble and run them the best we could. It is pretty impressive what we have been able to achieve in just the past few years: we are starting to have common pipeline standards in the field now, for primary data processing as well as downstream analysis.

S: I can imagine it must have been painstaking work, if there was no user-friendly software or even interfaces at the time. Are there any lessons from those early days that still apply today now that pipelines have started to emerge?

MB: Indeed, we learned early on that you have to “encapsulate” your analysis in a container, knowing the exact versions of the software and packages you used, so that if you need to run it again two years later, you can. Otherwise, you might be re-running your analysis two months later, but you will have different results because the packages have changed in the meantime. This still applies today with software versions from single-cell technology companies.
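
Editor's note: one lightweight way to act on this advice, short of building a full container image, is to record the exact package versions next to the results. The sketch below is not Dr. Beyer's pipeline; it is a minimal Python illustration that uses only the standard library, and the function names `snapshot_environment` and `changed_packages` are hypothetical helpers introduced here.

```python
import json
import platform
from importlib import metadata


def snapshot_environment():
    """Return the Python version and the version of every installed package."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata has no name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


def changed_packages(old, new):
    """Map package name -> (old version, new version) for every difference."""
    names = set(old["packages"]) | set(new["packages"])
    return {
        name: (old["packages"].get(name), new["packages"].get(name))
        for name in sorted(names)
        if old["packages"].get(name) != new["packages"].get(name)
    }


if __name__ == "__main__":
    # Print the snapshot; in practice it would be saved with the analysis outputs.
    print(json.dumps(snapshot_environment(), indent=2))
```

Comparing a snapshot saved with an old analysis against a fresh one shows which packages have drifted. A container goes further by also freezing system libraries, which this sketch does not attempt.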

“Depending on what biological questions you would like to answer, the technology is completely different.”

S: Looking at the ever-growing list of tools available for single-cell analysis, it seems the pace is not slowing down. How do you keep up with these constant updates and new computational methods?

MB: Over time I have acquired some knowledge in basically analyzing transcriptome data. Not so much that I can program everything now by myself, simply because that is too huge a task, but I know how to interpret programs that people are writing. And I can explain to a bioinformatician how to analyze and how to interpret the data.

S: Coming from biology, it sounds pretty intimidating and quite a huge workload to commit to if you have to understand the concepts behind different programs and algorithms, even if you do not have to learn how to program directly! I might be tempted to just run my experiment and sequencing but leave the analysis to an experienced bioinformatician altogether.

MB: If you want somebody else to analyze your data, it’s just tough. I think we learned this the hard way. In principle, if people want to do something with us, we need to have an initial discussion about experimental design. What is their biological question? What do they want to address? To which level of details do they want to go? Because depending on what biological questions you would like to answer, the technology is completely different. And the way to analyze the data will also be completely different.

“Our most successful collaborations were when a PhD student or a postdoc who is willing to spend time learning comes to our lab for three months or so.”

S: So how do those discussions take place? How much of the biology behind the sample do you need to know before running an analysis?

MB: Well, when people approach us coming with cell types that we have no clue about, we simply say that it would be a very difficult project, because we have no idea about the biology. Therefore, I cannot tell you anything about how to set up the experiment, or how good the data will be because I do not know what to expect. On the other hand, I can use the algorithms that I normally run, which should give me valid results. But for the interpretation of the data, I have absolutely no idea what makes sense or not.

S: I see. When this is the case, how do you deal with such a gap in knowledge? Is there a good approach to solve this?

MB: I think this is a critical question for the whole field! I am in the luxurious position that I know what happens there. I see how people can struggle, and how they try to bridge this gap of knowledge. And I think there is probably no perfect solution. One of them is to say, well, I invest into this and I want to learn at least the basics myself, to be able to understand what the other side tells me. And that can be both sides, right? If you think about a bioinformatician having little knowledge about the biology, it’s the same thing if you want to understand what you’re actually doing with the data. You can do a lot of things with data, but in principle it is for the other side where you are coming with biological knowledge. You might think you have found the perfect way to analyze the data, but unfortunately find no significant gene expression in there, because that’s it, that’s your result. And the other side might say, “well, we do have a biological effect”. Now, how do we come together?

S: So which solutions have you come up with for those issues? How did your most fruitful collaborations take place?

MB: Ideally, you have to meet half-way between the biology and the bioinformatics. If you can, on the one hand, get somebody in your team who wants to commit and learn some bioinformatics to start bridging the gap in knowledge. On the other hand, we can provide somebody at this interface with some knowledge in biology to bridge the rest of the way. Our most successful collaborations were when a PhD student or a postdoc who is willing to spend time learning comes to our lab for three months or so. Then they go back, they analyze the data with what they have learned, they ask questions, and we have a productive back-and-forth to set up the best analysis and interpretation we can.

S: Having a team member in a lab motivated and committed to put in the time to learn the basics in bioinformatics for the whole group sounds like an ideal solution indeed. Would you have any advice for the cases where this is unfortunately not a possible solution?

MB: I would recommend to everybody that you find partners to collaborate with who are experienced in your domain, even if that means collaborating with people somewhere else on the globe. It is nice to have a lab nearby that might do single-cell technologies, but if they have a completely different domain of knowledge, then often it’s not helping a lot. I mean, you can learn the technology there, yes. But for your biological question, then I think it’s better to actually ask people who are more experienced in your field.

S: That sounds like a sensible plan indeed! Thank you very much Dr. Beyer.

Interview of Dr. Marc Beyer recorded on Jan 8th, 2021, by Wilko Duprez.

scRNA-seq: artifacts from sample preparation

Regardless of the scRNA-seq methodology you plan to use, gene expression artifacts are bound to arise from the sample preparation preceding the mRNA capture step. Being aware of the stimuli that trigger spurious gene expression and alter the transcriptomic profile is necessary for optimal experimental design.

Typically, the tissue dissociation and cell isolation processes can take up to several hours (in some cases days), during which the cells are removed from their normal environment. mRNAs are typically stabilized only once they are captured after cell lysis, so the entire vulnerable period runs from the initial sample collection to cell lysis. During this time, cellular stress-related responses can lead to changes in the behavior and the morphology of the sample cells, and even to cell death.

Effect of Sampling Time

The time that elapses between sample harvest and processing directly impacts the resulting transcriptomic profile of the different cell subtypes. In their investigation, Massoni-Badosa et al. collected peripheral blood mononuclear cells (PBMCs) from 5 patients and waited respectively 0, 2, 8, 24 and 48 hours before processing the samples through various scRNA-seq technologies (Massoni-Badosa et al., Genome Biology, 2020). They found significant shifts in their principal component analysis (PCA) across all cell subtypes, correlating with the increase in sampling time. Digging further using differential expression analysis, they identified a time-dependent decrease in the number of detected genes in all their datasets and a global downregulation of gene expression. Overall, they identified between 1000 and 2000 differentially regulated genes (depending on the sample type) over the course of 48 hours.

The expression signature of such a time-dependent bias could, and should, be identified and corrected during the bioinformatic analysis of the resulting scRNA-seq data. Nevertheless, the overall quality of the sample seems to decrease as the sampling time increases. Therefore, it might be worth investigating new ways of processing freshly harvested scRNA-seq samples more quickly, although this might be arduous in specific cases, for example when the collection occurs outside the normal hours of the sample processing facilities.
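
To make the "number of detected genes" metric concrete, here is a minimal Python sketch that counts the genes with at least one read in each cell and summarizes that count per sampling time. It is not the analysis pipeline of Massoni-Badosa et al.: the data layout (a list of `(hours, counts)` pairs), the function names, and the toy numbers are all assumptions made for illustration.

```python
from statistics import median


def detected_genes(cell_counts):
    """Number of genes with at least one read in a single cell."""
    return sum(1 for count in cell_counts.values() if count > 0)


def median_detected_by_time(cells):
    """Median number of detected genes per sampling time.

    `cells` is an iterable of (hours_before_processing, {gene: count}) pairs.
    """
    by_time = {}
    for hours, counts in cells:
        by_time.setdefault(hours, []).append(detected_genes(counts))
    return {hours: median(values) for hours, values in sorted(by_time.items())}


if __name__ == "__main__":
    # Toy, made-up counts: two cells processed immediately, two after 48 h.
    toy_cells = [
        (0, {"GENE_A": 4, "GENE_B": 2, "GENE_C": 1}),
        (0, {"GENE_A": 3, "GENE_B": 1, "GENE_C": 0}),
        (48, {"GENE_A": 2, "GENE_B": 0, "GENE_C": 0}),
        (48, {"GENE_A": 1, "GENE_B": 0, "GENE_C": 0}),
    ]
    for hours, value in median_detected_by_time(toy_cells).items():
        print(f"{hours} h: median detected genes = {value}")
```

A steady drop of this median across time points would be one simple warning sign that the sampling time is leaving its mark on the dataset.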

Effect of Digestion Time and Temperature

Standard sample preparation methods for solid tissues require enzymatic and/or mechanical dissociation and, depending on the tissue origin, density, disease state, elastin, or collagen content, this may require long enzymatic digestion and/or vigorous mechanical disruption. Transcriptional machinery remains active at 37 °C, and extended incubation at high temperatures may introduce gene expression artifacts, unrelated to the biological state at the time of harvest. Moreover, extended incubation at higher temperatures in the absence of nutrients or anchorage, or harsh dissociation, may induce apoptosis or anoikis, polluting the viable cell population or generating low-quality suspensions.

After discovering a set of 507 genes, some of them related to cell stress pathways, that were strongly affected by the digestion temperature, O’Flanagan et al. performed a time-course scRNA-seq experiment on breast cancer xenograft tissues. They sampled cells regularly over two hours of total digestion time, using either collagenase at 37 °C or a cold protease at 6 °C (O’Flanagan et al., Genome Biology, 2019). They found that digestion with collagenase substantially upregulated the expression of this core set of stress-related genes, with a subset expressed even more strongly as the digestion time increased. Applying differential expression analysis to their entire dataset, they found that 43% of the total 18,734 retained genes were differentially regulated after 2 hours of digestion compared to only 30 minutes.

This example highlights the importance of initially refining the digestion process to be as short, mild, and efficient as possible for your sample type, as it can affect the expression of a significant number of the detected genes.
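
One simple way to watch for this kind of artifact in your own data is to compute, for each cell, the fraction of reads that come from a set of stress-related genes. The Python sketch below illustrates the idea under stated assumptions: the three gene names are only examples of commonly reported stress-response genes, not the 507-gene set from O’Flanagan et al., and the function name `stress_fraction` is hypothetical.

```python
# Illustrative placeholder set only; substitute a curated list for your tissue.
ILLUSTRATIVE_STRESS_GENES = frozenset({"FOS", "JUN", "HSPA1A"})


def stress_fraction(cell_counts, stress_genes=ILLUSTRATIVE_STRESS_GENES):
    """Fraction of a cell's reads that map to the given stress genes."""
    total = sum(cell_counts.values())
    if total == 0:
        return 0.0  # empty cell: no reads, so no measurable stress signal
    stress = sum(
        count for gene, count in cell_counts.items() if gene in stress_genes
    )
    return stress / total


if __name__ == "__main__":
    # Toy, made-up counts for two cells.
    mild = {"FOS": 1, "ACTB": 19}
    harsh = {"FOS": 6, "JUN": 4, "ACTB": 10}
    print("mild digestion :", stress_fraction(mild))
    print("harsh digestion:", stress_fraction(harsh))
```

Comparing the distribution of this fraction between digestion protocols, or between time points of the same protocol, gives a quick first indication of whether the dissociation step is altering the transcriptome.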

Unusual Sample Types

Specific cell subtypes can be absent from a final scRNA-seq dataset because of peculiar characteristics. Some cell types are particularly hard to isolate from their tissue (e.g. cardiomyocytes), meaning they might be underrepresented in the resulting sample sent for scRNA-seq processing (Ackers-Johnson et al., Nature Communications, 2018). Unusual cell subtypes might have unusual shapes or sizes, preventing them from being successfully processed with some scRNA-seq technologies or instruments. Another example is cells undergoing anoikis after losing their anchorage to the extracellular matrix during tissue dissociation (a process believed to start only 3 hours after removal), which lowers the cell recovery rate and decreases the sample quality. An alternative that preserves the native physiological distribution of cell subtypes and the genetic material of fragile cells is to extract the nuclei and sequence nuclear mRNA.

Conclusion

A successful scRNA-seq experiment starts at sample harvest: studies suggest that the quality and relevance of the data begin to deteriorate immediately, and that this deterioration depends on the length of the entire sample preparation phase, well before the single-cell mRNA capture step.

Even if the sampling and digestion times are well-optimized, another factor that can extend the sample preparation time and affect the sample quality is the waiting time to access a scRNA-seq platform or instrument where the mRNA will finally be extracted. Until a new solution is developed to accelerate this process, you can check updated methods to freeze or fix single cells to preserve your samples and the integrity of your future dataset.

Cell sample size vs sequencing depth: find your compromise.

With newer technologies enabling the screening of ever more cells at a lower cost, long gone are the times of intensive labor on a small number of cells. Today, scRNA-seq can potentially support a wide range of options, typically from 10^2 to 10^6 cells processed in parallel. However, in-depth sequencing of hundreds of thousands of separate cells would overload most sequencing platforms, incur huge overhead costs, and produce massive datasets to analyze.

A better variable to consider for your experiments is the number of reads per cell, which you can adapt depending on the biological purpose of your study. A smaller number of cells with a high sequencing depth should provide more robust transcriptomic data, filtering out the technical noise and providing a more reliable snapshot of the transcriptional state of each cell. Conversely, a larger number of cells at the cost of a low sequencing depth gives a better representation of a cell population, particularly in the presence of multiple subtypes or even potential rare cell types.

This Nature Communications paper suggests that “given a fixed [sequencing] budget, sequencing as many cells as possible at approximately one read per cell per gene is optimal, both theoretically and experimentally.” With this cue in mind, always keep sight of the aim of your research question. For example, if you are trying to identify novel cellular subtypes or quantify the number of rare cells in a biological sample, then plan for the highest cell number your scRNA-seq protocol can handle (10,000 to 50,000 reads per cell can suffice for this purpose [1][2]). If you are trying to characterize a cellular subtype through accurate gene expression estimation, then plan for greater sequencing depth.
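
The trade-off comes down to simple arithmetic: for a fixed number of reads, the depth per cell is the total divided by the number of cells. The Python sketch below works through it. The 400 million read budget and the figure of 20,000 genes (used to turn "one read per cell per gene" into a reads-per-cell target) are assumptions chosen for illustration, not values from the cited paper.

```python
def reads_per_cell(total_reads, n_cells):
    """Average sequencing depth per cell for a fixed read budget."""
    if n_cells <= 0:
        raise ValueError("n_cells must be positive")
    return total_reads / n_cells


def cells_for_target_depth(total_reads, target_reads_per_cell):
    """Largest number of cells that can be sequenced at the target depth."""
    if target_reads_per_cell <= 0:
        raise ValueError("target_reads_per_cell must be positive")
    return total_reads // target_reads_per_cell


if __name__ == "__main__":
    budget = 400_000_000       # hypothetical total reads for one run
    assumed_genes = 20_000     # assumption: roughly one read per gene per cell

    for n_cells in (100, 10_000, 1_000_000):
        depth = reads_per_cell(budget, n_cells)
        print(f"{n_cells:>9,} cells -> {depth:,.0f} reads per cell")

    for target in (assumed_genes, 50_000):
        n_cells = cells_for_target_depth(budget, target)
        print(f"{target:,} reads per cell -> up to {n_cells:,} cells")
```

Running the numbers for your own budget before the experiment makes it obvious when a planned cell count would push the depth per cell below what your biological question needs.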

Those are general guidelines, from which you need to make practical decisions depending on your sample and the limitations of the cell capture methods you have chosen or have access to (see below). Here are some example questions to consider:

  • Given the nature and the extraction method of your sample, how many biological and technical replicates can you afford in a single scRNA-seq run?
  • If you expect a heterogeneous tissue with numerous subtypes and rare cells, what is the minimum number of cells to process to have a high chance of identifying them? (Hint: you can play around with this tool for rough estimations).
  • How much RNA does your sample tissue typically yield? Organs such as the heart, the spleen, the liver or kidneys are usually bountiful (in humans), whereas muscles, bone and adipose tissues provide up to ten times fewer RNA molecules.
  • Are you aiming for 3′ sequencing (low depth with ≈ 50,000 reads per cell) or for full-transcript sequencing (high depth with ≈ 1,000,000 reads per cell)?
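
For the rare-cell question, a rough estimate can be obtained from the binomial distribution: if a cell type makes up a fraction f of the tissue, how many cells must be captured to see at least k of them with a given probability? The Python sketch below is one simple way to compute this and is not necessarily how the tool mentioned above works. It assumes cells are sampled independently and without capture bias, which real dissociation protocols rarely guarantee, and the function names are hypothetical.

```python
from math import comb


def prob_at_least(n_cells, min_count, frequency):
    """P(X >= min_count) for X ~ Binomial(n_cells, frequency)."""
    p_fewer = sum(
        comb(n_cells, i) * frequency**i * (1 - frequency) ** (n_cells - i)
        for i in range(min_count)
    )
    return 1.0 - p_fewer


def min_cells_needed(frequency, min_count, confidence):
    """Smallest number of cells giving at least `confidence` probability
    of capturing `min_count` or more cells of the rare type."""
    if not 0 < frequency < 1:
        raise ValueError("frequency must be strictly between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    n_cells = min_count
    # Simple linear search; fine for the small min_count values used here.
    while prob_at_least(n_cells, min_count, frequency) < confidence:
        n_cells += 1
    return n_cells


if __name__ == "__main__":
    # Hypothetical scenario: a subtype at 1% of the tissue, and we want
    # at least 10 such cells with 95% probability.
    print(min_cells_needed(frequency=0.01, min_count=10, confidence=0.95))
```

Treat the result as a lower bound: cell loss during preparation and quality filtering after sequencing both reduce the number of usable cells, so it is prudent to plan for more.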