Table of Links
Abstract and 1. Introduction
2. Background
2.1 Amortized Stochastic Variational Bayesian GPLVM
2.2 Encoding Domain Knowledge through Kernels
3. Our Model and 3.1 Pre-Processing and Likelihood
3.2 Encoder
4. Results and Discussion and 4.1 Each Component is Crucial to Modified Model Performance
4.2 Modified Model achieves Significant Improvements over Standard Bayesian GPLVM and is Comparable to SCVI
4.3 Consistency of Latent Space with Biological Factors
5. Conclusion, Acknowledgement, and References
A. Baseline Models
B. Experiment Details
C. Latent Space Metrics
D. Detailed Metrics
5 CONCLUSION
This paper identifies a misalignment in the generative model of current GPLVMs applied to single-cell data and proposes an amortized BGPLVM better adapted to dimensionality reduction for scRNA-seq data. Drawing on commonly used single-cell-specific methods, including scVI, LDVAE, and Splatter single-cell simulations, the proposed model addresses three main aspects of single-cell data by (1) accounting for count data with an approximate Poisson likelihood, (2) incorporating batch effect modelling in both the encoder and the GP kernel, and (3) normalizing the library size of the data in a pre-processing step. The results demonstrate the importance of aligning modelling choices with domain-specific knowledge: the model achieves performance comparable to scVI on both a simulated dataset and a real-world COVID-19 dataset, as judged by both UMAP visualizations and commonly used latent space metrics.
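As a rough illustration of two of the components summarized above, the sketch below rescales each cell to a common library size and evaluates a Poisson log-likelihood on count data. It is not the authors' implementation: the function names, the choice of the median library size as the default target, and the use of the exact (rather than the paper's approximate) Poisson likelihood are all assumptions made for illustration.

```python
import numpy as np
from math import lgamma


def normalize_library_size(counts, target=None):
    """Rescale each cell (row) so that its total count equals `target`.

    `target` defaults to the median library size across cells; this
    default is an assumption of the sketch, not taken from the paper.
    """
    counts = np.asarray(counts, dtype=float)
    library_size = counts.sum(axis=1, keepdims=True)
    if target is None:
        target = np.median(library_size)
    # Cells with zero total count are left as all zeros.
    safe_size = np.where(library_size > 0, library_size, 1.0)
    return counts / safe_size * target


def poisson_log_likelihood(counts, rate):
    """Exact Poisson log-likelihood, summed over all entries.

    Uses log p(y | rate) = y * log(rate) - rate - log(y!).
    """
    counts = np.asarray(counts, dtype=float)
    rate = np.asarray(rate, dtype=float)
    log_factorial = np.vectorize(lgamma)(counts + 1.0)
    return float(np.sum(counts * np.log(rate) - rate - log_factorial))


# Hypothetical toy data: 3 cells x 3 genes.
toy_counts = np.array([[1, 2, 3], [2, 4, 6], [5, 0, 5]])
normalized = normalize_library_size(toy_counts, target=10.0)
```

After normalization every non-empty cell has the same total count, so differences between cells reflect relative expression rather than sequencing depth.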
ACKNOWLEDGMENTS
The authors would like to thank Emma Dann, Natsuhiko Kumasaka, and the rest of the team at Sanger for their help and guidance with our initial project and for providing the data and code on which this study is based. AR is supported by the Accelerate Programme for Scientific Discovery. During the time of this work, SZ was supported by the Churchill Scholarship.
REFERENCES
Sumon Ahmed, Magnus Rattray, and Alexis Boukouvalas. Grandprix: scaling up the bayesian gplvm for single-cell data. Bioinformatics, 35(1):47–54, 2019.
Florian Buettner, Kedar N Natarajan, F Paolo Casale, Valentina Proserpio, Antonio Scialdone, Fabian J Theis, Sarah A Teichmann, John C Marioni, and Oliver Stegle. Computational analysis of cell-to-cell heterogeneity in single-cell rna-sequencing data reveals hidden subpopulations of cells. Nature biotechnology, 33(2):155–160, 2015.
Kieran Campbell and Christopher Yau. Bayesian gaussian process latent variable models for pseudotime inference in single-cell rna-seq data. bioRxiv, pp. 026872, 2015.
Junyue Cao, Malte Spielmann, Xiaojie Qiu, Xingfan Huang, Daniel M Ibrahim, Andrew J Hill, Fan Zhang, Stefan Mundlos, Lena Christiansen, Frank J Steemers, et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature, 566(7745):496–502, 2019.
Graham Heimberg, Rajat Bhatnagar, Hana El-Samad, and Matt Thomson. Low dimensionality in gene expression data enables the accurate extraction of transcriptional programs from shallow sequencing. Cell systems, 2(4):239–250, 2016.
James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.
Brian Hie, Joshua Peters, Sarah K Nyquist, Alex K Shalek, Bonnie Berger, and Bryan D Bryson. Computational methods for single-cell rna sequencing. Annual Review of Biomedical Data Science, 3:339–364, 2020.
Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 2013.
Dylan Kotliar, Adrian Veres, M Aurel Nagy, Shervin Tabrizi, Eran Hodis, Douglas A Melton, and Pardis C Sabeti. Identifying gene expression programs of cell-type identity and cellular activity with single-cell rna-seq. Elife, 8:e43803, 2019.
Natsuhiko Kumasaka, Raghd Rostom, Ni Huang, Krzysztof Polanski, Kerstin B Meyer, Sharad Patel, Rachel Boyd, Celine Gomez, Sam N Barnett, Nikolaos I Panousis, et al. Mapping interindividual dynamics of innate immune response at single-cell resolution. bioRxiv, pp. 2021–09, 2021.
Vidhi Lalchand, Aditya Ravuri, Emma Dann, Natsuhiko Kumasaka, Dinithi Sumanaweera, Rik GH Lindeboom, Shaista Madad, Sarah A Teichmann, and Neil D Lawrence. Modelling technical and biological effects in scrna-seq data with scalable gplvms. arXiv preprint arXiv:2209.06716, 2022a.
Vidhi Lalchand, Aditya Ravuri, and Neil D Lawrence. Generalised gplvm with stochastic variational inference. In International Conference on Artificial Intelligence and Statistics, pp. 7841–7864. PMLR, 2022b.
Neil D Lawrence. Gaussian process models for visualisation of high dimensional data. Advances in Neural Information Processing Systems, 2004.
Romain Lopez, Jeffrey Regier, Michael B Cole, Michael I Jordan, and Nir Yosef. Deep generative modeling for single-cell transcriptomics. Nature methods, 15(12):1053–1058, 2018.
Malte D Luecken and Fabian J Theis. Current best practices in single-cell rna-seq analysis: a tutorial. Molecular systems biology, 15(6):e8746, 2019.
Malte D Luecken, Maren Büttner, Kridsadakorn Chaichoompu, Anna Danese, Marta Interlandi, Michaela F Müller, Daniel C Strobl, Luke Zappia, Martin Dugas, Maria Colomé-Tatché, et al. Benchmarking atlas-level data integration in single-cell genomics. Nature methods, 19(1):41–50, 2022.
Aaron T.L. Lun, Karsten Bach, and John C. Marioni. Pooling across cells to normalize single-cell rna sequencing data with many zero counts. Genome Biology, 17(75), 2016. doi: https://doi.org/10.1186/s13059-016-0947-7.
Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
Daniel T Montoro, Adam L Haber, Moshe Biton, Vladimir Vinarsky, Brian Lin, Susan E Birket, Feng Yuan, Sijia Chen, Hui Min Leung, Jorge Villoria, et al. A revised airway epithelial hierarchy includes cftr-expressing ionocytes. Nature, 560(7718):319–324, 2018.
Lindsey W Plasschaert, Rapolas Zilionis, Rayman Choo-Wing, Virginia Savova, Judith Knehr, Guglielmo Roma, Allon M Klein, and Aron B Jaffe. A single-cell atlas of the airway epithelium reveals the cftr-rich pulmonary ionocyte. Nature, 560(7718):377–381, 2018.
Emily Stephenson, Gary Reynolds, Rachel A Botting, Fernando J Calero-Nieto, Michael D Morgan, Zewen Kelvin Tuong, Karsten Bach, Waradon Sungnak, Kaylee B Worlock, Masahiro Yoshida, et al. Single-cell multi-omics analysis of the immune response in covid-19. Nature medicine, 27(5):904–916, 2021.
Valentine Svensson, Roser Vento-Tormo, and Sarah A Teichmann. Exponential scaling of single-cell rna-seq in the past decade. Nature protocols, 13(4):599–604, 2018.
Valentine Svensson, Adam Gayoso, Nir Yosef, and Lior Pachter. Interpretable factor models of single-cell rna-seq via variational autoencoders. Bioinformatics, 36(11):3418–3421, 2020.
Amos Tanay and Aviv Regev. Scaling single-cell genomics from phenomenology to mechanism. Nature, 541(7637):331–338, 2017.
Vincent A Traag, Ludo Waltman, and Nees Jan Van Eck. From louvain to leiden: guaranteeing well-connected communities. Scientific reports, 9(1):5233, 2019.
Archit Verma and Barbara E Engelhardt. A robust nonlinear low-dimensional manifold for single cell rna-seq data. BMC bioinformatics, 21(1):1–15, 2020.
F Alexander Wolf, Philipp Angerer, and Fabian J Theis. Scanpy: large-scale single-cell gene expression data analysis. Genome biology, 19:1–5, 2018.
Luke Zappia, Belinda Phipson, and Alicia Oshlack. Splatter: simulation of single-cell rna sequencing data. Genome biology, 18(1):174, 2017.
Authors:
(1) Sarah Zhao, Department of Statistics, Stanford University;
(2) Aditya Ravuri, Department of Computer Science, University of Cambridge;
(3) Vidhi Lalchand, Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard;
(4) Neil D. Lawrence, Department of Computer Science, University of Cambridge.