Evaluation of Stopwords Removal on the Statistical Approach for Automatic Term Extraction

Braga, Igor

doi:10.1109/stil.2009.8

Cited by 6 publications

(6 citation statements)

References 6 publications


“…Stopwords are words, usually function words, that should not be considered in forming the text [4]. The removal process consists of identifying them in the texts, which can be done through a statistical approach, and removing them.…”

Section: B. Data Preprocessing (unclassified)

“…The removal process consists of identifying them in the texts, which can be done through a statistical approach, and removing them. According to [4], this technique benefits model construction because it reduces the number of inputs to the artificial neural networks, which facilitates the machine learning process.…”

Section: B. Data Preprocessing (unclassified)

“…The stopword removal technique [4] consists of eliminating words that do not contribute to the formation of a term, such as "de", "e", "a", among others. Removing them reduces the size of the texts and ensures that only the words most relevant to forming the meaning are evaluated by the algorithm.…”

Section: Introduction (unclassified)


Identificação de desvios de linguagem através de redes neurais artificiais

Pires¹,

Coelho²

2021

Anais Do 15. Congresso Brasileiro De Inteligência Computacional


This article proposes a method for identifying language deviations in sentences written in Portuguese using Artificial Neural Networks. Different neuron configurations and activation functions are explored to find the best accuracy. The model developed achieves a better hit rate than others proposed in the literature, with the advantage of a structure that is not tied to a fixed data dictionary. Thus, having extracted a generalization of the characteristics of language deviations, it is able to classify more examples and with better performance.


“…These words are removed to enhance computation, they don't actually relate to the information needs of the documents. Stop word removal improves performance when extracting bigram terms [3]. Stop words were removed by identifying a list of standard stop words, a table was created out of a static stop list, each token was matched against the table, hashing operation was done and the text were built into the lexical analyzer.…”

Section: Text Pre-processing (mentioning)

confidence: 99%
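
The stop-list matching described in the statement above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the stop list, the tokenizer, and the function names are all assumptions, and a Python `frozenset` stands in for the hashed table the statement mentions.

```python
import re

# Hypothetical static stop list; the cited study does not publish its own list.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "are"})


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token found in the hashed stop list (constant-time lookup)."""
    return [t for t in tokens if t not in stop_words]


def bigrams(tokens):
    """Pair each token with its successor to form candidate bigram terms."""
    return list(zip(tokens, tokens[1:]))


tokens = remove_stop_words(tokenize("The extraction of terms is a statistical task"))
print(tokens)
print(bigrams(tokens))
```

Filtering before pairing means that bigrams are built only from content words, which is one plausible reading of why stop word removal helps bigram term extraction.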

“…Vector space model otherwise known as term vector model is an algebraic model for representing text documents as vectors of identifiers, such as index terms. It is used in information filtering, information retrieval, indexing and relevancy rankings [3]. In this study, the vector space model was used to implement the text representation of essay-type marking scheme and essay-type student script.…”

Section: Vector Space Model (mentioning)

confidence: 99%

Evaluation of N-gram Text Representations for Automated Essay-Type Grading Systems

Esther Oduntan¹,

Adeyanju²,

Olabiyisi³

et al. 2015

IJAIS


Automated grading systems can reduce stress and time constraints faced by examiners, especially where large numbers of students are enrolled. Essay-type grading involves a comparison of the textual content of a student's script with the marking guide of the examiner. In this paper, we focus on analyzing the n-gram text representation used in automated essay-type grading systems. Each question answered in a student script or in the marking guide is viewed as a document in the document term matrix. Three n-gram representation schemes were used to denote a term: unigram (1-gram), bigram (2-gram), and both combined ("unigram + bigram"). A binary weighting scheme was used for each document vector, with cosine similarity to compare documents across the student scripts and marking guide. The final student score is computed as a weighted aggregate of documents' similarity scores as determined by marks allocated to each question in the marking guide. Our experiment compared the effectiveness of the three representation schemes using electronically transcribed handwritten students' scripts and a marking guide from a first-year computer science course of a Nigerian Polytechnic. The machine-generated scores were then compared with those provided by the Examiner for the same scripts using mean absolute error and Pearson correlation coefficient. Experimental results indicate that the "unigram + bigram" representation outperformed the other two representations, with a mean absolute error of 7.6 as opposed to 15.8 and 10.6 for unigram and bigram representations respectively. These results are reinforced by the correlation coefficient, with the "unigram + bigram" representation having 0.3 while unigram and bigram representations had 0.2 and 0.1 respectively. The weak but positive correlation indicates that the Examiner might have considered other issues not necessarily documented in the marking guide. We intend to test other datasets and apply techniques for reducing sparseness in our document term matrices to improve performance.
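
The representation and scoring steps in this abstract can be illustrated with a short sketch. It is not the authors' code: the whitespace tokenizer, the function names, and the aggregation rule (marks multiplied by similarity, summed over questions) are assumptions made for illustration. With binary weights, the cosine similarity of two documents reduces to the number of shared terms divided by the square root of the product of their term counts.

```python
import math


def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) that occur in the token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def term_set(text, scheme="unigram+bigram"):
    """Binary representation: the set of terms present under the chosen scheme."""
    tokens = text.lower().split()  # assumed tokenizer: whitespace split
    terms = set()
    if scheme in ("unigram", "unigram+bigram"):
        terms |= ngrams(tokens, 1)
    if scheme in ("bigram", "unigram+bigram"):
        terms |= ngrams(tokens, 2)
    return terms


def binary_cosine(a, b):
    """Cosine similarity of two binary vectors given as term sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def weighted_score(answers, guide, marks, scheme="unigram+bigram"):
    """Assumed aggregation: sum over questions of marks times similarity."""
    return sum(
        m * binary_cosine(term_set(s, scheme), term_set(g, scheme))
        for s, g, m in zip(answers, guide, marks)
    )


guide = ["a stack is a last in first out structure"]
answers = ["a stack is last in first out"]
print(weighted_score(answers, guide, marks=[10]))
```

Passing `scheme="unigram"` or `scheme="bigram"` to `weighted_score` gives the other two representations the abstract compares.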
