Most publications and presentations cover the practical details of single-cell experiments and then jump straight to data interpretation with impressive data visualization tools. However, whether you perform a multi-modal study (such as multi-omics) or multiple instances of a unimodal experiment (e.g. scRNA-seq), you will need to go through data integration. And it’s far from easy. We had a chat with Dr. Alessandra Vigilante, Senior Lecturer at King’s College London, about her approach and the state of data integration techniques today.
Single-cell data integration is a crucial step that covers several concepts. The first is to integrate different types of data into a single matrix to uncover dependencies between dimensions and relationships between data points. This can be difficult because of the nature and format of the data, as well as the technical artifacts and background noise specific to each technique. The second is to make sense of the resulting huge amount of data, to unearth biological meaning and relate it to the initial biological question of the study.
“Let’s say I see in my dataset a different phenotype,” suggests Dr. Vigilante. “I’d see some outliers, for cell adhesion, for cell shape, anything. But why? Is that due to the fibronectin concentration? To gene expressions? To SNPs? That is what I need to investigate.”
Data integration: harmonizing datasets
The challenge of combining data from different studies was recently highlighted by the creation of the first single-cell atlases. Consortiums overseeing the generation of those databases must contend with data resulting from different workflows (microfluidic-based, Smart-seq, etc.) and different sequencing techniques. For example, one single-cell study might include spatial data while a second might not, but the resulting atlas would have to link the spatial data from the former to the latter.
As of today, there is a plethora of single-cell technologies, with newer methodologies being released almost weekly. An ideal situation for data integration would be for all researchers to share the same protocol and generate data in exactly the same format. However, reaching a consensus workflow is a lengthy process.
“We can draw a parallel with bulk RNA-seq,” reminisces Dr. Vigilante. “At the beginning of bulk RNA-seq, we had the same issue. People were all using slightly different protocols, they were performing the bioinformatic analysis at the end differently, with different fitting models and so on. But in the end, we are now at a stage where it’s all routine. We know what to do. If you look at the papers, experiments are done more or less the same way, they are reproducible and very easy to integrate. We are getting there with bulk RNA-seq, and eventually we will get there with single-cell RNA-seq too.”
On the bioinformatic side, many single-cell analysis pipelines exist and are constantly updated, adding to the complexity of harmonizing data. A single gold standard has yet to emerge: each software suite has its own way of processing data, and sometimes its very own definition of widely used parameters (doublet rates, purity, etc.). Even within the same software, constant versioning might make it difficult to integrate datasets from similar experiments performed a few months apart.
“With the early software, it used to be that if you changed a single parameter, you’d get completely different clustering results,” adds Dr. Vigilante. “But new tools were developed. We are getting close to gold standards for some analyses, but for others such as pseudo-time or trajectory analysis, we are not there yet. In terms of dataset integration, Seurat has made great progress. In our latest Nature paper, we did not use Seurat at the beginning. A reviewer then requested additional datapoints for our single-cell experiments, and data integration was a nightmare. Cells that were supposed to cluster together weren’t and formed separate subclusters. It wasn’t satisfying at all, until we turned to Seurat and it turned out great. The choice of the software you use really changes everything.”
Early versus late single-cell data integration
We can distinguish several approaches to current data integration. In a nutshell, “early” data integration first concatenates all the relevant datasets into a single matrix, then uses methods, for example based on machine learning, to discover dependencies and generate global clustering. It holds a certain advantage in terms of practicality, as you can run matrix-based analytical methods on your entire data straight away. The cons are that the data layers are not normalized, the matrix might still include a high number of dimensions, and the weight of the different sources of data might be skewed by their number of dimensions.
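To make the last drawback concrete, here is a minimal sketch of early integration on simulated data. The two toy matrices, their sizes and the `scale_block` rescaling are illustrative assumptions, not a method taken from the interview; block rescaling is just one possible way to rebalance the layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 200

# Hypothetical toy modalities measured on the same cells (rows = cells).
rna = rng.normal(size=(n_cells, 2000))    # e.g. 2000 genes
protein = rng.normal(size=(n_cells, 20))  # e.g. 20 surface proteins


def block_variance_share(blocks):
    """Fraction of the total variance contributed by each data layer."""
    totals = np.array([b.var(axis=0).sum() for b in blocks])
    return totals / totals.sum()


def scale_block(block):
    """Center a layer and rescale it so that its total variance equals 1."""
    centered = block - block.mean(axis=0)
    return centered / np.sqrt(centered.var(axis=0).sum())


# Early integration: concatenate all features into one cells x features matrix.
early = np.hstack([rna, protein])

# Without normalization, the layer with more dimensions dominates the matrix.
naive_share = block_variance_share([rna, protein])

# Assumed mitigation: give every layer the same total variance before
# concatenating, so that no layer wins on feature count alone.
balanced = np.hstack([scale_block(rna), scale_block(protein)])
balanced_share = block_variance_share([scale_block(rna), scale_block(protein)])

print("naive share (rna, protein):   ", naive_share.round(3))
print("balanced share (rna, protein):", balanced_share.round(3))
```

In this toy setup, the transcriptomic layer carries almost all of the variance of the naive matrix simply because it has a hundred times more columns, which is the skew described above.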
In contrast, “late” data integration aims to analyze each dataset separately, normalizing them and removing their individual background noise. Only then is a global consensus created and dependencies investigated.
Dr. Vigilante prefers the latter approach: “I think the safest right now is to analyze the different datasets individually using our most advanced techniques, and then integrate them. I think that each data set separately can give us very good insights. Each single dataset has something to tell us, and it’s a shame not to get the most out of it. And then integrate all of them to give the whole picture and link things together.”
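As a rough illustration of the late strategy, the sketch below processes two simulated datasets separately and only then merges them into a consensus. The simulated data, the `cell_similarity` helper and the choice of averaging cell-to-cell correlation matrices are assumptions made for this example; real pipelines use more elaborate consensus methods.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells = 60
groups = np.repeat([0, 1], n_cells // 2)  # two hypothetical cell states


def simulate(n_features):
    """Toy dataset: cells of state 1 share a feature-specific shift."""
    data = rng.normal(size=(n_cells, n_features))
    data[groups == 1] += rng.normal(size=n_features)
    return data


def cell_similarity(data):
    """Analyze one dataset on its own: standardize each feature, then
    compute the cell-by-cell Pearson correlation matrix."""
    z = (data - data.mean(axis=0)) / (data.std(axis=0) + 1e-8)
    return np.corrcoef(z)  # rows are cells, so the result is cells x cells


rna = simulate(500)     # many features
protein = simulate(15)  # few features

# Late integration: each dataset is reduced to a cells x cells matrix first,
# so each contributes equally to the consensus whatever its feature count.
similarities = [cell_similarity(rna), cell_similarity(protein)]
consensus = np.mean(similarities, axis=0)

# Sanity check: cells of the same state should look more alike on average.
same_state = groups[:, None] == groups[None, :]
off_diagonal = ~np.eye(n_cells, dtype=bool)
within = consensus[same_state & off_diagonal].mean()
between = consensus[~same_state].mean()
print(f"mean similarity within states: {within:.2f}, between states: {between:.2f}")
```

Because every dataset is summarized before merging, a layer with 15 features weighs as much as one with 500 in the consensus, and each per-dataset matrix can still be inspected on its own.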
The current trend in single-cell science is to combine multiple -omics technologies (genomics, epigenetics, transcriptomics, proteomics, etc.) to uncover dependencies and elucidate the causal links behind physiological mechanisms. It is then logical that newer technological platforms emerge to combine multiple experiments in a single pipeline. For data integration, it means that new tools now offer simultaneous multi-modal analysis rather than the sum of separate datasets.
“My gut feeling is that it is still best right now to have different datasets and then integrate them,” comments Dr. Vigilante. “But those new platforms are the way to go. I’m looking forward to actually trying them out!”
When planning your data integration, de-noising and dimensionality reduction are two necessary steps to make sense of your data. Removing the noise is necessary to single out actual biological signals, while dimensionality reduction is required to bring complexity down to a level that can be grasped. But the process remains a technical challenge, and it is better to request help from experienced collaborators.
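The two steps can be illustrated together with a truncated singular value decomposition, which is the core of principal component analysis. This is only a sketch on simulated data: the low-rank "signal", the noise level and the `truncated_svd` helper are assumptions for the example, and real single-cell noise is far less well behaved.

```python
import numpy as np


def truncated_svd(data, n_components):
    """Return a low-dimensional embedding and a denoised reconstruction
    that keeps only the strongest components of the centered data."""
    feature_means = data.mean(axis=0)
    centered = data - feature_means
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    embedding = u[:, :n_components] * s[:n_components]
    denoised = embedding @ vt[:n_components] + feature_means
    return embedding, denoised


rng = np.random.default_rng(2)

# Hypothetical data: 300 cells, 100 features driven by 3 hidden factors.
signal = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 100))
noisy = signal + rng.normal(scale=0.5, size=signal.shape)

embedding, denoised = truncated_svd(noisy, n_components=3)

# Compare both matrices against the known simulated signal.
error_before = np.mean((noisy - signal) ** 2)
error_after = np.mean((denoised - signal) ** 2)
print(f"embedding shape: {embedding.shape}")
print(f"mean squared error before: {error_before:.3f}, after: {error_after:.3f}")
```

Here the number of components is known because the data were simulated. On real data it has to be chosen, and keeping too few components can discard genuine biological signal along with the noise.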
“You have so many technical artifacts that you have to take into account in the noise that the biggest fear when I do the data integration is finding something maybe exciting that is not true,” confesses Dr. Vigilante. “It’s a field that is really in development. In the next few years, we’ll have many, many papers, and new algorithms to remove high levels of noise, we’ll get there eventually!”
If you’d like to hear from the bioinformatician’s side, take a look at our earlier interview with Dr. Marc Beyer at the German Center for Neurodegenerative Diseases!