Although single cell RNA sequencing (scRNA-seq) is a newly invented and promising technology, the gene expression obtained for each cell is hard to interpret because there is not enough information that labels individual cells. Because of this shortage of information, unsupervised clustering, e.g., tSNE and UMAP, is usually employed to obtain a low dimensional embedding that can help to understand cell-cell relationships. One possible drawback of this strategy is that the outcome is highly dependent upon the genes selected for clustering. To meet this requirement, there are many methods that perform unsupervised gene selection. In this study, tensor decomposition (TD) based unsupervised feature extraction (FE) was applied to the integration of two scRNA-seq expression profiles that measure human and mouse midbrain development. TD based unsupervised FE could select not only genes coincident between human and mouse, but also biologically reliable genes. Coincidence between the two species as well as the biological reliability of the selected genes is increased compared with principal component analysis (PCA) based FE applied to the same data set in the previous study. Since PCA based unsupervised FE outperformed three other popular unsupervised gene selection methods, highly variable genes, bimodal genes and dpFeature, TD based unsupervised FE can do so as well. In addition, ten transcription factors (TFs) that might regulate the selected genes and might contribute to midbrain development are identified. These ten TFs, BHLHE40, EGR1, GABPA, IRF3, PPARG, REST, RFX5, STAT3, TCF7L2, and ZBTB33, were previously reported to be related to brain functions and diseases. TD based unsupervised FE is a promising method to integrate two scRNA-seq profiles effectively.

Keywords: tensor decomposition, enrichment analysis, single cell RNA-sequencing, midbrain development, inter-species analysis
Y-h. Taguchi TD based FE to single-cell

INTRODUCTION

Single cell RNA sequencing (scRNA-seq) (17) is a newly invented technology that enables us to measure the amount of RNA on a single cell basis. In spite of its promising potential, the measurements are not easy to interpret. The primary reason for this difficulty is the lack of sufficient information that characterizes individual cells. In contrast to the huge number of cells measured, which is often as many as several thousands, the number of labels is limited, e.g., measurement conditions as well as the expression of key genes measured by fluorescence-activated cell sorting, whose number is typically as small as tens. This prevents us from selecting genes that characterize the individual cell properties.
In order to deal with samples without a suitable number of labels, unsupervised methods are frequently used, since they do not make use of labeling information directly. K-means clustering as well as hierarchical clustering are the popular methodology tha...