2018
DOI: 10.1093/llc/fqy025
Discourse lexicon induction for multiple languages and its use for gender profiling

Abstract: We propose a novel way to create categorized discourse lexicons for multiple languages. We combine information from the Penn Discourse Treebank with statistical machine translation techniques on the Europarl corpus. Using gender profiling as an application, we evaluate our approach by comparing it with an approach using features from a knowledge-based lexicon and with a Rhetorical Structure Theory (RST) discourse parser. Our experiments are performed on corpora for three languages (English, Dutch, and German)…

Cited by 5 publications (2 citation statements)
References 16 publications
“…1 I applied a logistic regression classifier since this is an algorithm that achieves high results when classifying texts with linguistic features (in comparison to decision trees, which tend to over-fit with many interval features) and that assigns weights to every feature (in comparison to support vector machines). Other researchers, such as Underwood, have used the coefficients of this algorithm as a metric for the importance of each feature for the classification of each genre (Underwood 2014), or for other tasks (Rahat and Talebpour 2018; Verhoeven and Daelemans 2018). The matrix of features contains those that showed better results in the previous Section 6.1.2: tokens, linguistic annotation, and TEI tags, all with their frequencies relative to the number of tokens per text and log transformed, selecting the 3,000 most frequent features (parameters that are in general the most successful, as will be shown in Section 6.1.4).…”
Section: Knowledge Extraction About Features
confidence: 99%
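The excerpt above describes a specific pipeline: feature frequencies relative to text length, log transformed, fed to a logistic regression classifier whose coefficients serve as feature-importance scores. The following is a minimal sketch of that idea on invented toy data; the feature names and counts are hypothetical, and a plain gradient-descent loop stands in for whatever library classifier the cited work actually used.

```python
import math

# Hypothetical toy data: each text is (feature counts, class label).
texts = [
    ({"hedge": 8, "connective": 2, "pronoun": 30}, 1),
    ({"hedge": 1, "connective": 9, "pronoun": 10}, 0),
    ({"hedge": 7, "connective": 1, "pronoun": 25}, 1),
    ({"hedge": 0, "connective": 8, "pronoun": 12}, 0),
]
features = ["hedge", "connective", "pronoun"]

def vectorize(counts):
    """Frequency relative to total tokens, then log transformed (log(1 + f))."""
    total = sum(counts.values())
    return [math.log1p(counts.get(f, 0) / total) for f in features]

X = [vectorize(counts) for counts, _ in texts]
y = [label for _, label in texts]

# Plain logistic regression trained by stochastic gradient descent.
w = [0.0] * len(features)
b = 0.0
lr = 1.0
for _ in range(500):
    for xi, yi in zip(X, y):
        z = b + sum(wj * xj for wj, xj in zip(w, xi))
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
        err = p - yi
        b -= lr * err
        w = [wj - lr * err * xj for wj, xj in zip(w, xi)]

# The learned coefficients act as per-feature importance scores:
# large positive weight -> feature pulls toward class 1, negative -> class 0.
ranked = sorted(zip(features, w), key=lambda t: abs(t[1]), reverse=True)
for name, coef in ranked:
    print(f"{name}: {coef:+.2f}")
```

On this separable toy set, "hedge" ends up with a positive weight and "connective" a negative one, mirroring how the coefficients are read off as a ranked importance list.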
“…They are being created in hundreds of languages (Zhao and Schütze, 2019) and are increasingly used to augment modern deep learning models (Li et al., 2020b; Hu et al., 2019). Both supervised (Irvine and Callison-Burch, 2013) and unsupervised (Artetxe et al., 2019; Zhang et al., 2017; Kanayama and Nasukawa, 2012) methods have been proposed, some with an emphasis on supporting interpretation (Verhoeven and Daelemans, 2018; Clos and Wiratunga, 2017; Misra et al., 2015).…”
Section: Introduction
confidence: 99%