The recent rise of microarray and next-generation sequencing in genome-related fields has simplified obtaining gene expression data at the whole-genome level, and the biological interpretation of gene signatures related to life phenomena and diseases has become very important. However, conventional methods rely on numerical comparison of the overlap and distribution bias between gene signatures, pathways, and gene ontology (GO) terms, and they cannot compare the specificity and importance of the genes contained in gene signatures as humans do.
This study proposes the gene signature vector (GsVec), a unique method for interpreting gene signatures that clarifies the semantic relationship between gene signatures by incorporating a method of distributed document representation from natural language processing (NLP). In the proposed algorithm, a gene-topic vector is created by multiplying the feature vector based on the gene's distributed representation by the probability of the gene signature topic and by a weight reflecting how rarely the corresponding gene occurs across all gene signatures. These vectors are concatenated for the genes included in each gene signature to create a signature vector. The degrees of similarity between signature vectors are obtained from the cosine distances, which quantify the levels of relevance between gene signatures. Using the above algorithm, GsVec learned approximately 5,000 canonical pathway and GO biological process gene signatures published in the Molecular Signatures Database (MSigDB). Validation was then performed against the BioCarta pathway database, whose biological significance is known, and against actual gene expression data (differentially expressed genes); both yielded biologically valid results. In addition, compared with the conventional pathway enrichment analysis based on Fisher's exact test, GsVec identified signatures that were equally or more biologically valid.
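The vector construction described above can be illustrated with a minimal sketch. It is not the GsVec implementation: Python is used purely for illustration, and all gene embeddings, topic probabilities, and signatures below are made-up toy values. Several details are assumptions, not statements from this study: the rarity weight is taken to be a smoothed inverse document frequency, log(N / n_g) + 1; each gene-topic vector is formed by concatenating one scaled copy of the gene embedding per topic; and the gene-topic vectors of a signature are summed so that every signature vector has the same length and can be compared by cosine similarity. The function names are hypothetical.

```python
import math

# Toy inputs (hypothetical values, for illustration only).
embeddings = {            # distributed representation of each gene
    "TP53": [1.0, 0.1], "MDM2": [0.9, 0.2], "CDKN1A": [0.8, 0.1],
    "INS": [0.1, 1.0], "GCK": [0.2, 0.9],
}
topic_probs = {           # assumed probability of each of two topics per gene
    "TP53": [0.9, 0.1], "MDM2": [0.8, 0.2], "CDKN1A": [0.9, 0.1],
    "INS": [0.1, 0.9], "GCK": [0.2, 0.8],
}
signatures = {
    "p53_pathway": ["TP53", "MDM2", "CDKN1A"],
    "cell_cycle_arrest": ["TP53", "CDKN1A"],
    "glucose_metabolism": ["INS", "GCK"],
}

def inverse_frequency(sigs):
    """Assumed rarity weight: log(N / signatures containing the gene) + 1."""
    counts = {}
    for genes in sigs.values():
        for gene in set(genes):
            counts[gene] = counts.get(gene, 0) + 1
    n = len(sigs)
    return {gene: math.log(n / c) + 1.0 for gene, c in counts.items()}

def gene_topic_vector(embedding, probs, weight):
    """One copy of the embedding per topic, scaled by topic probability and rarity."""
    return [weight * p * x for p in probs for x in embedding]

def signature_vector(genes, weights):
    """Sum the gene-topic vectors of the genes in a signature (assumed aggregation)."""
    n_dims = len(next(iter(embeddings.values())))
    n_topics = len(next(iter(topic_probs.values())))
    total = [0.0] * (n_dims * n_topics)
    for gene in genes:
        vec = gene_topic_vector(embeddings[gene], topic_probs[gene], weights[gene])
        total = [a + b for a, b in zip(total, vec)]
    return total

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

weights = inverse_frequency(signatures)
vectors = {name: signature_vector(genes, weights) for name, genes in signatures.items()}
sim_related = cosine_similarity(vectors["p53_pathway"], vectors["cell_cycle_arrest"])
sim_unrelated = cosine_similarity(vectors["p53_pathway"], vectors["glucose_metabolism"])
print(sim_related, sim_unrelated)
```

In this toy setting, the two signatures that share p53-related genes receive a higher similarity than the unrelated metabolic signature. That behavior is a consequence of the invented inputs and says nothing about the performance of GsVec on real data.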
Furthermore, although NLP is generally developed in Python, GsVec can execute the entire process using only R, the main language of bioinformatics.
... and completeness of human knowledge. Therefore, interpretation is commonly performed by comparing the gene signature, such as differentially expressed genes or gene modules, against a database of biological gene signatures (such as pathways and GO) and identifying an objective association from a biological perspective [2].
Numerous methodologies for association with pathways have been proposed. Common examples include Fisher's exact test, which is a classical statistical test for the specific overlap of genes; over-representation analysis and gene set enrichment analysis [3], which statistically process the number of overlapping genes and the ranking bias by incorporating randomization; and modular enrichment analysis and EnrichNet, which use graph-based statistics of biological networks [4, 5]. However, these comparisons are numerical, and it is thus not p...