2003
DOI: 10.1162/089120103322711578

The Web as a Parallel Corpus

Abstract: Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalen…

Citations: cited by 366 publications (250 citation statements)
References: 21 publications
“…In fact, several researchers have shown that using the Web as a corpus is an effective way of addressing the typical data sparseness problem one encounters when working with corpora (compare [14], [21], [23], [25]). Actually, we subscribe to the principal idea by Markert et al [23] of exploiting the Google™ API.…”
Section: Finding Patterns (mentioning)
confidence: 99%
“…For instance developing a technique for mining the web to collect parallel corpora for low-density language pairs (Resnik and Smith, 2003), and running new SMT system for languages Catalan-English with no parallel corpus (Gispert and Mariño, 2006). In a research conducted by Utiyama and Isahara (2007), the use of pivot language through phrase translation and sentence translation are investigated.…”
Section: Previous Work (mentioning)
confidence: 99%
“…Resnik and Smith proposed a method to extract similar documents from Web based on the HTML [7]. Talvensaari et al proposed a method to find similar documents based on translated words in the source language to the target language by using the main keywords [4].…”
Section: Related Work (mentioning)
confidence: 99%