In this paper, we survey the methods and concepts developed for the evaluation of dialogue systems. Evaluation is a crucial part of the development process. Dialogue systems are often evaluated by means of human evaluations and questionnaires, which tend to be very cost- and time-intensive. Thus, much work has been put into finding methods that reduce the involvement of human labour. In this survey, we present the main concepts and methods. To do so, we differentiate between the various classes of dialogue systems (task-oriented, conversational, and question-answering dialogue systems). We cover each class by introducing the main technologies developed for its dialogue systems and then presenting the evaluation methods for that class.
Background: Interval cancers are primary breast cancers diagnosed in women after a negative screening test and before the next screening invitation. Our aim was to evaluate risk factors for interval cancers and their subtypes and to compare the risk factors identified with those associated with incident screen-detected cancers. Methods: We analyzed data from 645,764 women participating in the Spanish breast cancer screening program from 2000 to 2006 and followed up until 2009. A total of 5,309 screen-detected and 1,653 interval cancers were diagnosed. Among the latter, 1,012 could be classified on the basis of findings in screening and diagnostic mammograms, consisting of 489 true interval cancers (48.2%), 235 false-negatives (23.2%), 172 minimal-signs (17.2%) and 114 occult tumors (11.3%). Information on the screening protocol and women's characteristics was obtained from the screening program registry. Cause-specific Cox regression models were used to estimate the hazard ratios (HR) of risk factors for interval cancer and incident screen-detected cancer. A multinomial regression model, using screen-detected tumors as the reference group, was used to assess the effect of breast density and other factors on the occurrence of interval cancer subtypes. Results: A previous false-positive was the main risk factor for interval cancer (HR = 2.71, 95%CI: 2.28–3.23); this risk was higher for false-negatives (HR = 8.79, 95%CI: 6.24–12.40) than for true interval cancers (HR = 2.26, 95%CI: 1.59–3.21). A family history of breast cancer was associated with true intervals (HR = 2.11, 95%CI: 1.60–2.78), and a previous benign biopsy with false-negatives (HR = 1.83, 95%CI: 1.23–2.71). High breast density was mainly associated with occult tumors (RRR = 4.92, 95%CI: 2.58–9.38), followed by true intervals (RRR = 1.67, 95%CI: 1.18–2.36) and false-negatives (RRR = 1.58, 95%CI: 1.00–2.49). Conclusion: The role of women's characteristics differs among interval cancer subtypes. This information could be useful for improving the effectiveness of breast cancer screening programmes and for better classifying subgroups of women with different risks of developing cancer.
Expansion of queries using related terms in the UMLS Metathesaurus beyond synonymy is an effective way to overcome the gap between query and document vocabularies when searching for patient cohorts.
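As a rough illustration of the idea, the sketch below expands a query with related terms drawn from a lookup table. The table `RELATED_TERMS`, the function `expand_query`, and the `max_per_term` cap are all hypothetical; the entries are hand-written examples, not data taken from the UMLS Metathesaurus, and no real UMLS API is used.

```python
from typing import Dict, List, Set

# Hypothetical, hand-written stand-in for a related-terms lookup.
# These entries are illustrative only and are NOT taken from UMLS.
RELATED_TERMS: Dict[str, Set[str]] = {
    "heart attack": {"myocardial infarction", "coronary artery disease"},
    "high blood pressure": {"hypertension", "antihypertensive agents"},
}


def expand_query(
    terms: List[str],
    related: Dict[str, Set[str]],
    max_per_term: int = 3,
) -> List[str]:
    """Return the original terms plus up to max_per_term related terms each.

    Terms are lowercased, order is preserved, and duplicates are dropped.
    """
    expanded: List[str] = []
    seen: Set[str] = set()
    for term in terms:
        key = term.lower()
        # Sort related terms so the output is deterministic.
        candidates = [key] + sorted(related.get(key, set()))[:max_per_term]
        for candidate in candidates:
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


if __name__ == "__main__":
    print(expand_query(["Heart attack", "smoking"], RELATED_TERMS))
```

Terms with no entry in the table pass through unchanged, so the expanded query always contains the original one. In a real system the cap per term would need tuning, since adding too many loosely related terms can hurt precision.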
We present a Question Answering (QA) system that won one of the tasks of the Kaggle CORD-19 Challenge, according to the qualitative evaluation of experts. The system combines an Information Retrieval module with a reading comprehension module that finds the answers in the retrieved passages. In this paper we present a quantitative and qualitative analysis of the system. The quantitative evaluation using manually annotated datasets contradicted some of our design choices, e.g. our belief that using QuAC for fine-tuning provided better answers than using SQuAD alone. We analyzed this mismatch with an additional A/B test, which showed that the system using QuAC was indeed preferred by users, confirming our intuition. Our analysis calls into question the suitability of automatic metrics and their correlation with user preferences. We also show that automatic metrics are highly dependent on the characteristics of the gold standard, such as the average length of the answers.
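The dependence on gold-answer length can be seen with a small sketch. The abstract does not name the metrics used, so the token-overlap F1 below (in the style commonly used for SQuAD evaluation, here with plain whitespace tokenization) is an assumption chosen for illustration, and the example strings are invented.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer.

    Simplified: lowercases and splits on whitespace, with no punctuation
    or article stripping.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented example: the same prediction scored against a short
    # and a long gold answer.
    prediction = "fever and dry cough are the most common symptoms"
    short_gold = "fever and dry cough"
    long_gold = "fever and dry cough are the most common symptoms"
    print(round(token_f1(prediction, short_gold), 3))
    print(round(token_f1(prediction, long_gold), 3))
```

Under these assumptions, the same prediction scores 8/13 (about 0.615) against the short gold answer and 1.0 against the long one, even though it contains the short answer in full. A system that tends to give longer answers is therefore penalized on a gold standard with short answers.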