Cross-Lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC

Steinberger, Ralf; Pouliquen, Bruno; Hagman, Johan

doi:10.1007/3-540-45715-1_44

Cited by 73 publications

(54 citation statements)

References 2 publications


“…These systems map words or concepts, such as named entities, into a common representation space by means of a multilingual thesaurus (e.g. Eurovoc (Steinberger et al, 2002) or EuroWordnet (Vossen, 1998)). However, multilingual thesauri are not always available;…”

Section: Related Work (mentioning)

confidence: 99%

Methods for cross-language plagiarism detection

Barrón-Cedeño

Gupta

Rosso

2013

Knowledge-Based Systems


Three reasons make plagiarism across languages to be on the rise: (i) speakers of under-resourced languages often consult documentation in a foreign language, (ii) people immersed in a foreign country can still consult material written in their native language, and (iii) people are often interested in writing in a language different to their native one. Most efforts for automatically detecting cross-language plagiarism depend on a preliminary translation, which is not always available. In this paper we propose a freely available architecture for plagiarism detection across languages covering the entire process: heuristic retrieval, detailed analysis, and post-processing. On top of this architecture we explore the suitability of three cross-language similarity estimation models: Cross-Language Alignment-based Similarity Analysis (CL-ASA), Cross-Language Character n-Grams (CL-CNG), and Translation plus Monolingual Analysis (T+MA); three inherently different models in nature and required resources. The three models are tested extensively under the same conditions on the different plagiarism detection sub-tasks, something never done before. The experiments show that T+MA produces the best results, closely followed by CL-ASA. Still, CL-ASA obtains higher values of precision, an important factor in plagiarism detection when lesser user intervention is desired.


“…Leek et al (1999) use machine translation tools to help cross-language topic tracking. Steinberger et al (2002) represent European document contents using descriptor terms of a multilingual thesaurus EUROVOC and measure the semantic similarity based on the distance between the two documents' representations. Vu et al (2009) use Discrete Fourier Transform (DFT) score of a word's frequency chain to measure time distribution similarity as a feature to evaluate the bilingual news similarity.…”

Section: Related Work (mentioning)

confidence: 99%

“…Yang et al (1997) propose to use cross-lingual information retrieval techniques to identify the correspondence between multilingual texts. Steinberger et al (2002) calculate the semantic similarity of news in different languages using a multilingual thesaurus. Vu et al (2009) present a feature-based method to include useful information sources.…”

Section: Introduction (mentioning)

confidence: 99%

Minimum Error Rate Training for Bilingual News Alignment

Wang

Liu

Sun

2013

Lecture Notes in Computer Science


News articles in different languages on the same event are invaluable for analyzing standpoints and viewpoints in different countries. The major challenge to identify such closely related bilingual news articles is how to take full advantage of various information sources such as length, translation equivalence and publishing date. Accordingly, we propose a discriminative model for bilingual news alignment, which is capable of incorporating arbitrary information sources as features. Chinese word segmentation, Part-of-speech tagging and Named Entity Recognition technologies are used to calculate the semantic similarities between words or text as feature values. The feature weights are optimized using the minimum error rate training algorithm to directly correlate training objective to evaluation metric. Experiments on Chinese-English data show that our method significantly outperforms two strong baseline systems by 12.7% and 2.5%, respectively.


“…Eurovoc has previously been used for the identification of translated documents [9,10], in which, the Eurovoc concepts were enriched by a set of associative phrases extracted from a large manually (keywords) annotated corpus. The Eurovoc concepts are then assigned to the documents based on the similarity between the contents of the document and the enriched associative set.…”

Section: Related Work (mentioning)

confidence: 99%

Cross-Language High Similarity Search Using a Conceptual Thesaurus

Gupta

Barrón-Cedeño

Rosso

2012

Lecture Notes in Computer Science


This work addresses the issue of cross-language high similarity and near-duplicates search, where, for the given document, a highly similar one is to be identified from a large cross-language collection of documents. We propose a concept-based similarity model for the problem which is very light in computation and memory. We evaluate the model on three corpora of different nature and two language pairs English-German and English-Spanish using the Eurovoc conceptual thesaurus. Our model is compared with two state-of-the-art models and we find, though the proposed model is very generic, it produces competitive results and is significantly stable and consistent across the corpora.

