1975
DOI: 10.1145/361219.361220

A vector space model for automatic indexing

Abstract: In a document retrieval, or other pattern matching environment where stored entities (documents) are compared with each other or with incoming patterns (search requests), it appears that the best indexing (property) space is one where each entity lies as far away from the others as possible; in these circumstances the value of an indexing system may be expressible as a function of the density of the object space; in particular, retrieval performance may correlate inversely with space density. An approach based…
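The model the abstract describes represents each document as a vector over an indexing (term) space and compares entities by vector similarity. A minimal sketch, using cosine similarity and an illustrative three-term vocabulary and weights (not taken from the paper):

```python
import math

# Each document is a vector over the vocabulary
# ["retrieval", "indexing", "space"]; weights here are toy values.
def cosine(u, v):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc_a = [2, 1, 0]   # mostly about "retrieval" and "indexing"
doc_b = [0, 1, 3]   # mostly about "space"
query = [1, 1, 0]   # a search request is embedded in the same space

# doc_a shares more query terms, so it ranks higher than doc_b.
print(cosine(doc_a, query))
print(cosine(doc_b, query))
```

Ranking documents by their angle to the query vector is what makes the geometry of the indexing space matter: the farther apart the document vectors lie, the easier it is to separate relevant from non-relevant entities.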



Cited by 5,699 publications (2,878 citation statements). References 3 publications.
“…Documents and sentences were represented as vectors in a vector space model [26], each dimension corresponding to a term and measured along a number of metrics: term occurrence (TO, the number of occurrences of the term in the document); binary term occurrence (BTO, set to 1 only if TO>0, set to 0 otherwise); term frequency (TF, given by the TO divided by the total number of terms in the document) and term frequency-inverse document frequency (TF-IDF, given by the TF divided by the frequency of the term in the whole corpus).…”
Section: Methods (mentioning)
confidence: 99%
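The four weighting schemes defined in the citation statement above can be sketched as follows. The corpus is toy data, and the TF-IDF variant follows the quoted definition (TF divided by the term's frequency in the whole corpus) rather than the more common logarithmic IDF:

```python
from collections import Counter

# Toy corpus: each document is a list of tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]

def term_occurrence(term, doc):
    """TO: raw count of the term in the document."""
    return doc.count(term)

def binary_term_occurrence(term, doc):
    """BTO: 1 if the term appears in the document, else 0."""
    return 1 if term in doc else 0

def term_frequency(term, doc):
    """TF: TO divided by the total number of terms in the document."""
    return doc.count(term) / len(doc)

def tf_idf(term, doc, corpus):
    """TF-IDF as quoted: TF divided by the term's corpus frequency."""
    corpus_counts = Counter(t for d in corpus for t in d)
    total_terms = sum(corpus_counts.values())
    corpus_freq = corpus_counts[term] / total_terms
    return term_frequency(term, doc) / corpus_freq

doc = corpus[0]
print(term_occurrence("the", doc))         # 2
print(binary_term_occurrence("dog", doc))  # 0
print(term_frequency("the", doc))          # 2/6
```

Each metric maps a (term, document) pair to one coordinate of the document vector, so choosing a metric fixes how the documents are spread through the indexing space.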
“…After the feature extraction, we have represented each sample in the dataset using the Vector Space Model [36], which is a commonly used model in information retrieval.…”
Section: W2v (mentioning)
confidence: 99%
“…Considering all of these features, it is quite challenging to find a numerical counterpart for a word which preserves all of these properties and represents the same word in a numerical feature space. To this end, there are well-known models such as that of Salton et al. (1975) which try to transfer words and their syntactic and semantic information. Recently, NNs have become the established state-of-the-art for creating distributed representations of words (and also other textual units such as characters, etc.).…”
Section: Enriching Word Embeddings With Subword Information (mentioning)
confidence: 99%