Language identification, the task of determining the language in which a given text is written, has progressed substantially in recent decades. However, three main issues remain unresolved: (i) distinguishing between similar languages, (ii) detecting multilingualism within a single document, and (iii) identifying the language of short texts. In this paper, we describe our work on the development of a benchmark to encourage further research in these three directions, set forth an evaluation framework suitable for the task, and make a dataset of annotated tweets publicly available for research purposes. We also describe the shared task we organized to validate and assess the evaluation framework and dataset with systems submitted by seven different participants, and analyze the performance of these systems. The evaluation of the results submitted by the participants of the shared task helped us shed light on the shortcomings of state-of-the-art language identification systems, and gave insight into the extent to which the brevity, multilingualism, and language similarity found in such texts degrade the performance of language identifiers. Our dataset of nearly 35,000 tweets and the evaluation framework provide researchers and practitioners with suitable resources for further study of these issues in language identification within a common setting that makes results directly comparable.
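To illustrate why short texts and similar languages are hard, here is a minimal sketch of a classic character n-gram baseline for language identification (the out-of-place ranking measure of Cavnar and Trenkle); the toy training strings, language labels, and parameter values are invented for illustration and are not part of the benchmark described above.

```python
from collections import Counter

def char_ngrams(text, n_min=1, n_max=3):
    """Yield character n-grams of lengths n_min..n_max, with word padding."""
    text = f" {text.lower()} "
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

def profile(text, top_k=300):
    """Map each of the top_k most frequent n-grams to its frequency rank."""
    counts = Counter(char_ngrams(text))
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_k))}

def out_of_place(doc_profile, lang_profile, penalty=300):
    """Cavnar-Trenkle distance: sum of rank differences, penalizing unseen n-grams."""
    return sum(abs(rank - lang_profile.get(gram, penalty))
               for gram, rank in doc_profile.items())

# Toy per-language training text; real profiles are built from large corpora.
training = {
    "es": "el gato duerme en la casa y no quiere salir porque llueve mucho",
    "pt": "o gato dorme na casa e nao quer sair porque chove muito",
    "en": "the cat sleeps in the house and does not want to go out in the rain",
}
profiles = {lang: profile(text) for lang, text in training.items()}

tweet = "no quiere salir de casa"  # short, informal input, as in the benchmark
doc = profile(tweet)
print(min(profiles, key=lambda lang: out_of_place(doc, profiles[lang])))
```

With so little evidence per tweet, the distance margin between closely related languages such as Spanish and Portuguese shrinks quickly, which is exactly the failure mode the benchmark is designed to expose.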
The language used in social media is often characterized by an abundance of informal and non-standard writing. Normalizing this non-standard language can be crucial to facilitate subsequent textual processing and, consequently, to help boost the performance of natural language processing tools applied to social media text. In this paper we present a benchmark for lexical normalization of social media posts, specifically for tweets written in Spanish. We describe the tweet normalization challenge we organized recently, analyze the performance achieved by the different systems submitted to the challenge, and delve into the characteristics of these systems to identify which features proved useful. The organization of this challenge has led to the production of a benchmark for lexical normalization of social media text, including an evaluation framework as well as an annotated corpus of Spanish tweets, TweetNorm_es, which we make publicly available. The creation of this benchmark and the evaluation have brought to light the types of words that the submitted systems handled best, and highlight the main shortcomings to be addressed in future work.
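To make the task concrete, the following is a minimal sketch of a common normalization baseline: a slang lookup table backed by an edit-distance-1 candidate search against an in-vocabulary lexicon. The tiny vocabulary, slang map, and example tokens are invented for illustration; real systems learn their resources from annotated pairs such as those in TweetNorm_es.

```python
# Toy resources; a real normalizer would use a full lexicon and a
# learned error model trained on annotated out-of-vocabulary pairs.
VOCAB = {"que", "porque", "también", "bueno", "mañana", "mucho", "quiero"}
SLANG = {"q": "que", "xq": "porque", "tb": "también", "bno": "bueno"}
ALPHABET = "abcdefghijklmnopqrstuvwxyzáéíóúñü"

def edits1(word):
    """All strings at edit distance 1 (deletes, substitutions, inserts)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    subs = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | subs | inserts

def normalize_token(tok):
    """Slang lookup first, then in-vocabulary check, then edit-distance-1 repair."""
    low = tok.lower()
    if low in SLANG:
        return SLANG[low]
    if low in VOCAB:
        return low
    candidates = edits1(low) & VOCAB
    return min(candidates) if candidates else tok  # deterministic tie-break

print([normalize_token(t) for t in "q bno tambien qiero".split()])
# -> ['que', 'bueno', 'también', 'quiero']
```

A lookup-plus-edit-distance pipeline of this kind handles abbreviations and single-character typos, but not phonetic spellings several edits away, which is where context-aware candidate ranking becomes necessary.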
ABSTRACT This dissertation reports on Ebaluatoia, a large-scale, crowd-based English-Basque machine translation evaluation campaign. The initiative aimed to compare translation quality across five machine translation systems: two statistical systems, a rule-based system and a hybrid system developed within the IXA group, and an external system, Google Translate. Based on the results, we have established a ranking of the systems under study and performed qualitative analyses to guide further research; in particular, we have carried out an initial evaluation of subsets of the evaluation collection, a structural analysis of the source sentences, and an error analysis of the translations, to help identify where future analysis effort should be placed. Keywords: machine translation, English, Basque, evaluation, pairwise comparison, error analysis
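The campaign's keywords mention pairwise comparison, the usual way crowd judgments are collected in such evaluations. Below is a minimal sketch of how a system ranking can be derived from pairwise preference judgments using a simple win-rate score; the judgment tuples and system labels are invented stand-ins for the campaign's actual data, and ties and annotator agreement are ignored.

```python
from collections import defaultdict

# Hypothetical (winner, loser) judgments from crowd annotators.
judgments = [
    ("SMT-1", "RBMT"), ("Hybrid", "SMT-2"), ("Hybrid", "RBMT"),
    ("Google", "SMT-1"), ("SMT-1", "SMT-2"), ("Hybrid", "Google"),
]

wins = defaultdict(int)
comparisons = defaultdict(int)
for winner, loser in judgments:
    wins[winner] += 1
    comparisons[winner] += 1
    comparisons[loser] += 1

# Win-rate score: fraction of the comparisons a system took part in that it won.
scores = {s: wins[s] / comparisons[s] for s in comparisons}
for system, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{system}: {score:.2f}")
```

Real campaigns refine this with statistical-significance clustering or a Bradley-Terry-style model, so that systems whose win rates are indistinguishable share a rank.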
Machine translation post-editing is becoming commonplace, and professional translators are often faced with this unfamiliar task with little training and support. Given the different translation processes involved in post-editing, research suggests that untrained translators do not necessarily make good post-editors. Moreover, post-editing activity is largely influenced by numerous aspects of the technology and texts used. Training material, therefore, needs to be tailored to the particular conditions under which post-editing will take place. In this work, we provide a first attempt to uncover the activity professional translators carry out when working from Spanish into Basque. Our initial analysis reveals that, when working with machine translation output of moderate quality, post-editing shifts from the task of identifying and fixing errors to one of "patchwork", where post-editors identify the machine-translated elements worth reusing and connect them with their own contributions. The data also reveal that post-editors primarily focus on correcting machine translation errors but often fail to refrain from editing correct structures. Both findings have clear implications for training and are a step toward tailoring sessions specifically for language combinations with moderate machine translation quality.
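One way to quantify the "patchwork" behaviour described above is to diff the raw machine translation output against its post-edited version and measure how much of the output was reused. The sketch below uses Python's difflib for a token-level comparison; the sentence pair is invented, and established metrics such as HTER perform this kind of measurement more rigorously.

```python
from difflib import SequenceMatcher

def edit_summary(mt_output, post_edited):
    """Token-level diff between MT output and its post-edited version."""
    mt, pe = mt_output.split(), post_edited.split()
    ops = SequenceMatcher(a=mt, b=pe).get_opcodes()
    kept = sum(i2 - i1 for tag, i1, i2, _, _ in ops if tag == "equal")
    return {
        "reused_mt_tokens": kept,
        "reuse_ratio": kept / len(mt),
        "edits": [(tag, " ".join(mt[i1:i2]), " ".join(pe[j1:j2]))
                  for tag, i1, i2, j1, j2 in ops if tag != "equal"],
    }

# Invented example pair (not taken from the study's data).
mt = "the cat black sleeps on the chair"
pe = "the black cat sleeps on the chair"
print(edit_summary(mt, pe))
```

A high reuse ratio alongside edits that touch already-correct spans would be one observable signature of the over-editing behaviour the study reports.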
Progress in machine translation (MT) in recent years has made its adoption by industry possible, not only because the quality of state-of-the-art systems has improved considerably, but also because companies in the sector have recognized that even imperfect MT can be useful for meeting the current demands of the translation market. We are witnessing a change of perspective with respect to the concept of quality: the standards and evaluation models that sought maximum translation quality are giving way to more flexible standards suited to the purpose of the texts. Translation is used in many domains, and readers' quality expectations vary accordingly. While manual translation is indispensable in certain fields, such as medicine or law, where a lack of precision can have disastrous consequences, or advertising, where creativity prevails, MT combined with lighter or heavier post-editing covers a large part of current demand and even helps increase translation volume. Keywords: machine translation, post-editing, productivity, quality.
ABSTRACT (Post-editing, productivity and quality) This article describes the new perspective on quality as a dynamic concept, which is propelling the industry's adoption of machine translation; combined with post-editing, as it usually is, MT offers the flexibility to meet different turnaround times and required levels of quality. Post-editing productivity is presented as an extended model that enables companies to decide whether or not to adopt machine translation. The information gathered is presented, and difficulties in implementation are acknowledged. Finally, the findings of various studies that ind...
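The productivity argument sketched in this abstract ultimately rests on a throughput comparison. Below is a minimal sketch of that calculation, comparing words per hour for translation from scratch versus post-editing; the figures are invented for illustration, and any adoption threshold is a business decision rather than something prescribed by the article.

```python
def productivity_gain(words, scratch_seconds, postedit_seconds):
    """Throughput (words/hour) for translation from scratch vs. post-editing,
    plus the relative gain of post-editing over from-scratch translation."""
    scratch_wph = words / scratch_seconds * 3600
    postedit_wph = words / postedit_seconds * 3600
    return scratch_wph, postedit_wph, (postedit_wph - scratch_wph) / scratch_wph

# Invented figures for illustration only.
scratch, pe, gain = productivity_gain(words=500,
                                      scratch_seconds=3600,
                                      postedit_seconds=2400)
print(f"from scratch: {scratch:.0f} w/h, post-editing: {pe:.0f} w/h, gain: {gain:.0%}")
```

In practice such measurements are averaged over many translators and text types, and weighed against the required quality level, before a company decides whether MT plus post-editing pays off.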