Complex conjunctions and determiners are often treated as pretokenized units in parsing. This is not always realistic, since they can be ambiguous. We propose a model for joint dependency parsing and multiword expression identification, in which complex function words are represented as individual tokens linked with morphological dependencies. Our graph-based parser includes standard second-order features and verbal subcategorization features derived from a syntactic lexicon. We train it on a modified version of the French Treebank enriched with morphological dependencies. It recognizes 81.79% of ADV+que conjunctions with 91.57% precision, and 82.74% of de+DET determiners with 86.70% precision.
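The representation described in the abstract above, where a complex function word is kept as separate tokens joined by a morphological dependency, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Token` class, the `morph` arc label, the other arc labels, the choice of the first component as head, and the example sentence are hypothetical and are not taken from the paper or from the French Treebank annotation scheme.

```python
from dataclasses import dataclass

# Hypothetical label for a morphological dependency (not the paper's actual label).
MORPH = "morph"


@dataclass
class Token:
    index: int   # 1-based position in the sentence
    form: str    # surface form
    head: int    # index of the governor, 0 for the root
    label: str   # dependency label


def multiword_units(tokens):
    """Return the multiword units implied by MORPH arcs in a parsed sentence."""
    by_index = {t.index: t for t in tokens}

    def unit_head(token):
        # Follow MORPH arcs upward until reaching the head of the unit.
        while token.label == MORPH:
            token = by_index[token.head]
        return token.index

    groups = {}
    for token in tokens:
        groups.setdefault(unit_head(token), []).append(token)

    # Only groups with more than one token are multiword units.
    return [
        " ".join(t.form for t in sorted(group, key=lambda t: t.index))
        for group in groups.values()
        if len(group) > 1
    ]


# "Il dort alors que Marie travaille": "que" is attached to "alors" by a
# MORPH arc, so the parse itself marks "alors que" as a complex conjunction.
sentence = [
    Token(1, "Il", 2, "suj"),
    Token(2, "dort", 0, "root"),
    Token(3, "alors", 2, "mod"),
    Token(4, "que", 3, MORPH),
    Token(5, "Marie", 6, "suj"),
    Token(6, "travaille", 3, "obj"),
]

units = multiword_units(sentence)
```

Under this encoding, an ambiguous sequence is resolved by the parser: if the second token receives an ordinary syntactic label instead of `morph`, no unit is produced and the two words are read as independent tokens.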
Transcription systems that set out to reproduce certain phenomena of spoken language, such as false starts, hesitations, and repetitions, and that use no punctuation might be expected to cause great difficulties for the part-of-speech tagging of transcribed corpora. Developing taggers designed directly for spoken language is desirable, but can only be a long-term undertaking. In this article we report an experiment in tagging a spoken corpus with a tagger designed for written text, supplemented by suitable pre-editing and post-editing programs, which, against all expectations, yields excellent results on spoken data, almost comparable to those obtained on written text. These results make it possible to envisage the rapid construction of large tagged spoken corpora for French.
It has been observed that learners of French as a second language at different stages of the acquisition process tend to use forms and rules that are comparable to those of French-based creoles or pidginized French. The more advanced learners employ rules and forms akin to dialectal variants of French or to French as spoken in isolated areas such as Old Mines, Missouri. The learners produce non-standard forms considered unacceptable by the purist tradition of French grammarians. It has been noted that the observed similarities between interlanguage, regional dialects, etc., occur in given “sensitive” zones of French morphology and syntax, such as the use of verbs and auxiliaries, the morphology and placement of clitic pronouns, and the over-generalization of given prepositions: those very areas which are problematic in the acquisition of French as L1. Since the 17th century, these have been the object of a strict codification by purist grammarians who disregard actual usage in various dialects. It is hypothesized that such similarities between the interlanguage forms at various stages of development, French regional dialects, and areas of conflict over the elaboration of norms in standard French can be partly accounted for if one considers the dynamics of the target language. To explain the functioning of this process, we posit a “system” comprising the learner-speaker, the specific linguistic system itself (including pressure to conform to the norm), and the interactions with native speakers. Through self-regulation, this system devises solutions which perforce pertain to that common area which in any language is at the crossroads of variation, language change, and acquisition.
This hypothetical zone (called français zéro by Chaudenson, 1984) is the point of convergence of the self-regulating processes which are responsible for the formal and functional similarities between French-based interlanguages, language change, norm conflicts in the standardization of French, and the creolization process.