2010
DOI: 10.2478/v10108-010-0003-9

Combining Content-Based and URL-Based Heuristics to Harvest Aligned Bitexts from Multilingual Sites with Bitextor

Abstract: Nowadays, many websites on the Internet are multilingual and may be considered sources of parallel corpora. In this paper we describe the free/open-source tool Bitextor, created to harvest aligned bitexts from these multilingual websites, which may be used to train corpus-based machine translation systems. This tool builds on the work developed in previous approaches, with modifications and improvements, in order to obtain a tool as adaptable as possible to make it easier to process any kind of websites and work …

Cited by 21 publications
(11 citation statements)
References 5 publications
“…Movie subtitles are available in the OPUS corpus [4] and since the previous CzEng release, the collection has been significantly extended with Open-Subtitles. 5 Very recently, OPUS released yet another update 6 but it did not make it in time to be included in CzEng 1.6. Subtitles of educational videos can be obtained from other sources and represent a rather different genre than movie subtitles.…”
Section: Czeng 16 Data (mentioning)
confidence: 99%
“…8 Medical domain is of special interest of several European research and innovation projects. We try to extend CzEng in this direction by specifically crawling some parallel health-related web sites using Bitextor [5] and also by re-crawling EMEA (European Medicines Agency) 9 corpus because its OPUS version 10 suffers from tokenization issues (e.g., decimal numbers split) and it is probably smaller than what can be currently obtained from the database.…”
Section: Czeng 16 Data (mentioning)
confidence: 99%
“…As described earlier, we submitted two extra datasets resulting from trivial combinations of our aligner and baseline outputs. Due to lack of time, we didn't try more sophisticated forms of combining our content-based features with other kinds of feature, such as URL matching and HTML document structure as proposed in the Bitextor paper (Esplà-Gomis et al, 2010).…”
Section: Better Integration With Metadata-based Features (mentioning)
confidence: 99%
“…Similarly, Zhang et al (2006) adopted a naive aligner in order to estimate the content similarity of candidate parallel web pages. Esplà-Gomis and Forcada (2010) developed Bitextor, a system combining language identification with shallow features (file size, text length, tag structure, and list of numbers in a web page) to mine parallel pages from multilingual sites that have already been stored locally with the HTTrack website copier. Barbosa et al (2012) crawl the web and examine the HTML DOM tree of visited web pages with the purpose of detecting multilingual websites based on the collation of links that are very likely to point to in-site pages in different languages.…”
Section: State-of-the-art and Related Work (mentioning)
confidence: 99%
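
The shallow features named in the last citation statement (file size, text length, tag structure, and the list of numbers in a page) can be illustrated with a minimal Python sketch. This is not Bitextor's implementation: the function names `shallow_features` and `similarity`, the regular expressions, and the equal-weight averaging are all assumptions made for illustration, and language identification is left out entirely.

```python
import re
from difflib import SequenceMatcher

# Hypothetical patterns for this sketch; a real system would use an HTML parser.
TAG_RE = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)")
NUM_RE = re.compile(r"\d+")


def shallow_features(html):
    """Extract the four shallow features from an HTML string."""
    # Tag structure: ordered sequence of opening and closing tag names.
    tags = [m.group(1) + m.group(2).lower() for m in TAG_RE.finditer(html)]
    # Visible text: drop the tags and collapse whitespace.
    text = " ".join(re.sub(r"<[^>]*>", " ", html).split())
    return {
        "size": len(html.encode("utf-8")),  # file size in bytes
        "text_len": len(text),              # length of the visible text
        "tags": tags,                       # tag structure
        "numbers": NUM_RE.findall(text),    # numbers, in order of appearance
    }


def _ratio(a, b):
    """Closeness of two non-negative magnitudes, in [0, 1]."""
    if a == 0 and b == 0:
        return 1.0
    return min(a, b) / max(a, b)


def similarity(fa, fb):
    """Average the four per-feature scores (equal weights are an assumption)."""
    scores = [
        _ratio(fa["size"], fb["size"]),
        _ratio(fa["text_len"], fb["text_len"]),
        SequenceMatcher(None, fa["tags"], fb["tags"]).ratio(),
        SequenceMatcher(None, fa["numbers"], fb["numbers"]).ratio(),
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    en = "<html><body><h1>Prices 2010</h1><p>The tool costs 25 euros.</p></body></html>"
    es = "<html><body><h1>Precios 2010</h1><p>La herramienta cuesta 25 euros.</p></body></html>"
    other = "<html><body><ul><li>Contact</li><li>Phone 555 1234</li></ul></body></html>"
    f_en, f_es, f_other = map(shallow_features, (en, es, other))
    print("en vs es:   ", round(similarity(f_en, f_es), 3))
    print("en vs other:", round(similarity(f_en, f_other), 3))
```

In this toy example the English and Spanish pages share the same tag sequence and the same numbers, so they score higher than the unrelated page. A pair of pages would be accepted as a candidate bitext only above some threshold, which this sketch does not attempt to choose.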