2008
DOI: 10.1007/s10791-008-9058-8

Focused web crawling in the acquisition of comparable corpora

Abstract: CLIR resources, such as dictionaries and parallel corpora, are scarce for special domains. Obtaining comparable corpora automatically for such domains could be an answer to this problem. The Web, with its vast volumes of data, offers a natural source for this. We experimented with focused crawling as a means to acquire comparable corpora in the genomics domain. The acquired corpora were used to statistically translate domain-specific words. The same words were also translated using a high-quality, but non-genom…

Citations: cited by 37 publications (14 citation statements)
References: 18 publications
“…The distance between the documents in this approach can be measured by the degree of overlap between their keywords. Alternatively, instead of using more common words it is also possible to use Hapax Legomena (words occurring only once in each document) in order to identify potentially parallel documents [25,59]. […] as well as how the exact distance measure is defined (χ² in [46], cosine, Euclidean, etc.).…”
Section: Parallel Texts
Citation type: mentioning
confidence: 99%
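
The measures named in the statement above (keyword overlap, hapax legomena, cosine distance) can be illustrated with a minimal Python sketch. This is not the method of the citing paper or of the cited article: the tokenizer, the choice of Jaccard overlap as the "degree of overlap", and all function names are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphanumeric runs.
    return re.findall(r"\w+", text.lower())


def hapax_legomena(tokens):
    # Words that occur exactly once in the document.
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count == 1}


def overlap(set_a, set_b):
    # Jaccard overlap (assumed here as one way to quantify
    # "degree of overlap"): |A ∩ B| / |A ∪ B|.
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_distance(tokens_a, tokens_b):
    # 1 - cosine similarity of raw term-frequency vectors.
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def hapax_overlap(text_a, text_b):
    # Overlap computed only on each document's hapax legomena.
    return overlap(hapax_legomena(tokenize(text_a)),
                   hapax_legomena(tokenize(text_b)))
```

For documents in different languages, an overlap of this kind can only fire on tokens that survive unchanged across languages (names, numbers, identifiers), so in practice the keyword sets would first be mapped through a bilingual lexicon; that step is omitted here.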
“…In order to define which of the different methods and techniques mentioned in the literature best suited the task and languages we usually work with, we carried out various experiments with monolingual term candidate extraction and also with bilingual equivalents calculation (described in [58] and [59]). Implementing the methods…”
Section: Our Approach
Citation type: mentioning
confidence: 99%
“…However, in view of the fact that Chinese and Thai belong to two different languages, Internet information sharing and communication is a serious problem due to the language barriers. The application of cross-language information retrieval (CLIR) technology provides an effective way to solve this problem [1][2][3]. The task of CLIR attempts to bridge the mismatch between the source and target languages using the approaches such as query and the document translation.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Such collections usually have a quite limited domain covering a limited number of languages apart from the cost and the time consuming work of creating parallel corpora, and the lack of sufficient parallel data for various languages and domains is currently one of the major obstacles to further linguistic research. On the other hand, comparable corpora are generally obtained from news articles [2], [3], available research corpora such as CLEF collections [4] or by crawling the Web [5], [6]. Such corpora can be collected easily by downloading electronic copies of newspapers, journals, articles, etc., from the Web.…”
Section: Introduction
Citation type: mentioning
confidence: 99%