2006
DOI: 10.1007/11816508_29

Evaluation of Alignment Methods for HTML Parallel Text

Abstract: The Internet constitutes a potentially huge store of parallel text that may be collected and exploited by many applications, such as multilingual information retrieval and machine translation. These applications usually require bilingual text that is aligned at least at the sentence level. This paper presents new aligners designed to improve the performance of classical sentence-level aligners when aligning structured text such as HTML. The new aligners are compared with other well-known geometric aligners.

Citations: cited by 5 publications (7 citation statements)
References: references 6 publications (5 reference statements)
“…In terms of alignment quality, there is a complete study (Sanchez-Villamil et al, 2006) with results about this issue. The metrics used to evaluate Bitextor have been precision and recall.…”
Section: Results (mentioning)
confidence: 99%
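The statement above names precision and recall as the evaluation metrics. As a minimal sketch of how these are commonly computed for sentence alignment, the snippet below assumes each alignment is a set of (source index, target index) links and compares a proposed alignment against a reference one. The function name `precision_recall` and the sample data are hypothetical; this is not taken from the cited papers or from Bitextor itself.

```python
def precision_recall(proposed, reference):
    """Return (precision, recall) for two collections of alignment links.

    Each link is assumed to be a (source_index, target_index) pair.
    """
    proposed, reference = set(proposed), set(reference)
    correct = len(proposed & reference)  # links found in both alignments
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Illustrative data only: four reference links, three proposed links,
# two of which agree with the reference.
reference = {(0, 0), (1, 1), (2, 2), (3, 3)}
proposed = {(0, 0), (1, 1), (2, 3)}
p, r = precision_recall(proposed, reference)
print(f"precision={p:.3f} recall={r:.3f}")
```

Precision penalizes links the aligner proposes that are not in the reference, while recall penalizes reference links it misses, which is why the two are usually reported together.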
“…To assist in this task, another free/open-source application has been used: the TagAligner tool (Sanchez-Villamil et al, 2006), which both uses the tag structure in XML files and the length of the sentences in a pair of documents to align them (Brown et al, 1991;Gale and Church, 1994).…”
Section: Introduction (mentioning)
confidence: 99%
“…The WWW can be regarded as a huge corpus containing millions of texts of variable quality (Kilgarriff and Grefenstette, 2003) and, thus, a collection of bitexts (a bitext is composed of versions in two different languages of a given text) can be built by finding pairs of documents in the web which are mutual translations. In order to identify automatically pairs of Uniform Resource Locators (URL) whose contents are bitext candidates, some approaches (Resnik and Smith, 2003) take into account the similitude of the URLs, the textual content of the pages and, to some extent, the structure of the text provided by the HTML tags (Sánchez-Villamil et al, 2006). Here, we will explore a complementary approach: after a collection of texts in the source language is compiled, documents which are a possible translation of the source texts are sought for.…”
Section: Introduction (mentioning)
confidence: 99%
“…The relatively recent exploration of the web as a bilingual or multi-lingual corpus was made possible by the rapid growth in the number of web pages, and the availability of vast quantities of web-based translation texts involving many language pairs. Till now, the focus of most of the investigations in this field has been on the discovery and pairing of bilingual sites, domains, HTML documents and pages, although new research is emerging in processing and preparing HTML pages for the actual extraction of translation pairs when bilingual web pages are downloaded (Sanchez-Villamil et al, 2006). At the same time, extracting translations, whether from unannotated data resources or from meta-information-rich content, inevitably involves methods of aligning bilingual texts.…”
Section: Introduction (mentioning)
confidence: 99%