Abstract. This paper compares three metrics for measuring cross-linguistic variation in information volume between parallel segments of a bilingual corpus. The performance of each metric is evaluated against a human annotation of multiword expressions (MWEs) in each segment. The first metric counts the characters in the source and target segments and compares the variation, if any, with the expected character count ratio based on averages for the entire source and target texts. The second metric follows the same method, except that it counts graphical words (function and content words combined) in the source and target segments. The third metric relies on an analysis obtained via the content word precision (CWP) algorithm, coded in Python. The purpose of the comparison is to determine which metric is closest to the human annotation and is, therefore, the best indicator of a large spectrum of MWEs.
Keywords: cross-linguistic phraseology, detection of multiword expressions, information volume variation, content word precision algorithm.
1 I wish to thank the reviewers for their comments and suggestions. I also wish to thank my colleague Paul John from UQTR for his input in reading the final version of the paper. Of course, any remaining omissions or errors are mine.
Introduction
As a contribution to computational studies of multiword expressions (MWEs), this paper presents the results of a comparison, albeit small-scale, of three metrics for detecting MWEs in parallel segments of a bilingual corpus. The untested underlying hypothesis of the comparison is that information volume variation in parallel segments (as established by content word imbalance) correlates with the presence, in source or target, of a large spectrum of MWEs. Hence, MWEs should occur in parallel segments where there is a cross-linguistic information volume difference, as determined by differences in the number of content words in source and target. The study of MWEs in parallel segments is of the utmost importance to cross-linguistic phraseological studies as described in Colson (2008), and hence to translation studies and phraseological studies. Determining which metric is most accurate in detecting MWEs will also contribute to recent work in corpus-based phraseology such as Granger and Paquot (2008) and will allow the mining of a large spectrum of MWEs (such as clusters, lexical bundles, n-grams, recurrent sequences) or even new classes of MWEs. Like any corpus-based approach, our new bilingual approach is designed to complement traditional monolingual approaches such as those found in volumes 1 and 2 of the Oxford Dictionary of Current Idiomatic English (Cowie & Mackin, 1975; Cowie, Mackin & McCaig, 1983).
As a preliminary phase of a larger project, this paper presents the results of a small-scale analysis of the first 25 segments of the bilingual text selected. The text chosen for the evaluation of metric performance is the parallel English-French Inaugural Address of J.F. Kennedy (January 20, 1961). Th...