Recently, corpus comparison has been used by a number of researchers for extracting single-word terms (SWTs) from specialized corpora. It is viewed as a means to supplement multi-word term (MWT) extraction, which focuses on noun phrases. However, little is known about the value of this technique in a terminological setting. This paper examines two different methods for finding French SWTs in the field of computing. The first (M1) compares the specialized corpus to a corpus taken to reflect the language as a whole. The second (M2) breaks the specialized corpus down into six topical subcorpora that are compared in turn to the entire specialized corpus. The calculation relies on the standard normal distribution and is carried out by a program called TermoStat. The specific units produced by both methods are then evaluated by comparing them to the contents of two specialized dictionaries. We also compare the results yielded by the two methods. Results show that precision is fair (approximately 50% of the units extracted by both methods can be found in specialized dictionaries), whereas recall is lower for both methods. Results also reveal that, even though M1 yields better results than M2, both methods are useful for identifying SWTs and should be considered in terminological work.
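To make the corpus-comparison idea concrete, the following Python sketch scores each word by how much its frequency in the specialized corpus deviates from what the pooled corpora would predict, expressing the deviation as a standard-normal (z) score. The function name, the 3.09 cut-off, and the toy data are illustrative assumptions only; this is not TermoStat's actual implementation.

from collections import Counter
import math

def specificity_scores(specialized_tokens, reference_tokens, threshold=3.09):
    # Rank candidate single-word terms by how much more frequent they are
    # in the specialized corpus than the pooled corpora would predict,
    # using a normal approximation of the binomial (a z-score).
    spec_counts = Counter(specialized_tokens)
    ref_counts = Counter(reference_tokens)
    n_spec = sum(spec_counts.values())
    n_total = n_spec + sum(ref_counts.values())
    candidates = []
    for word, f_spec in spec_counts.items():
        p = (f_spec + ref_counts.get(word, 0)) / n_total  # pooled probability
        expected = n_spec * p
        variance = n_spec * p * (1 - p)
        if variance == 0:
            continue
        z = (f_spec - expected) / math.sqrt(variance)
        if z >= threshold:  # 3.09 corresponds roughly to one-tailed p < 0.001
            candidates.append((word, z))
    return sorted(candidates, key=lambda item: item[1], reverse=True)

# Toy usage; real corpora would be lemmatized and POS-filtered first, and the
# threshold is lowered here only because the toy corpora are tiny.
specialized = "le serveur exécute le processus puis le serveur redémarre".split()
reference = "le chat dort et le chien mange le repas du soir".split()
for word, score in specificity_scores(specialized, reference, threshold=1.0):
    print(word, round(score, 2))

In this sketch, M1 corresponds to using a general-language reference corpus, while M2 would reuse the same scoring with a topical subcorpus as the "specialized" side and the full specialized corpus as the reference.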
This paper examines evaluation criteria for term-extraction software. These tools have gained popularity over the past few years, but they vary widely in design, and their performance does not compare (qualitatively) with that of humans performing the same task: the lists obtained from an automated extraction must always be filtered by users. The evaluation form proposed here consists of a number of preprocessing criteria (such as the language analyzed by the software, the identification strategies used, etc.) and a postprocessing criterion (the performance of the software) that users must take into account before they start using such systems. Each criterion is defined and illustrated with examples. Commercial tools have also been tested.
Lexical knowledge patterns are effective tools for identifying knowledge in corpora. As refining pattern sets is a labour-intensive and time-consuming process, re-use of patterns in subsequent research is common. The portability of patterns from one domain or corpus to another has nevertheless been questioned, although not widely evaluated. This analysis focussed on occurrences of a set of thirty-seven verbal markers of cause–effect relations in three corpora of French texts in the fields of computing, medicine and psychology. These corpora also represented two text genres: specialised and popularised/didactic. We analysed the relative frequencies of these markers and the types of relationships they expressed (specifically, those that involved a cause–effect component versus those that did not). In so doing, we aimed to evaluate and compare the markers’ occurrences with a view to shedding light on the possibilities for using markers to semi-automatically identify knowledge-rich contexts in different domains and corpora. Results show that almost all of the markers are observed in all three corpora and both text genres, and that many markers quite reliably indicate some kind of cause–effect relation. However, the relative frequencies of the individual markers and the reliability with which some of them are associated with causal senses were observed to vary between domains and text genres, suggesting that these factors require further evaluation.
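As a rough illustration of the marker-based approach, the Python sketch below retrieves sentences containing causal markers and reports each marker's relative frequency. The four markers, the regular expressions, and the toy sentences are illustrative assumptions, not the study's actual thirty-seven-marker set or tooling.

import re
from collections import Counter

# Illustrative subset of verbal cause-effect markers (the study used 37);
# a real pipeline would match lemmatized forms rather than raw strings.
CAUSAL_MARKERS = ["provoquer", "entraîner", "causer", "déclencher"]

def knowledge_rich_contexts(sentences, markers=CAUSAL_MARKERS):
    # Return sentences containing at least one marker, plus per-marker counts,
    # so relative frequencies can be compared across corpora or genres.
    patterns = {m: re.compile(rf"\b{m}\b", re.IGNORECASE) for m in markers}
    hits, counts = [], Counter()
    for sent in sentences:
        matched = [m for m, pat in patterns.items() if pat.search(sent)]
        if matched:
            hits.append((sent, matched))
            counts.update(matched)
    return hits, counts

corpus = [
    "Une surcharge du serveur peut provoquer une panne générale.",
    "Le traitement peut entraîner des effets secondaires.",
    "Le logiciel est installé sur chaque poste de travail.",
]
contexts, counts = knowledge_rich_contexts(corpus)
n_tokens = sum(len(s.split()) for s in corpus)
print(len(contexts), "knowledge-rich contexts out of", len(corpus), "sentences")
for marker, n in counts.most_common():
    print(marker, n, "occurrence(s),", round(1000 * n / n_tokens, 1), "per 1,000 tokens")

Counting matches in this way gives the relative frequencies compared across domains and genres in the study; deciding whether each retrieved context actually expresses a cause–effect relation still requires manual analysis.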