2020
DOI: 10.1609/aaai.v34i05.6485
Causally Denoise Word Embeddings Using Half-Sibling Regression

Abstract: Distributional representations of words, also known as word vectors, have become crucial for modern natural language processing tasks due to their wide applications. Recently, a growing body of word vector postprocessing algorithms has emerged, aiming to render off-the-shelf word vectors even stronger. In line with these investigations, we introduce a novel word vector postprocessing scheme under a causal inference framework. Concretely, the postprocessing pipeline is realized by Half-Sibling Regression (HSR), …
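The abstract is truncated before the method details, but the core idea of Half-Sibling Regression is well established: variables that share a noise source but have no causal link to a target can be used to predict, and then subtract, the noise in that target. Below is a minimal sketch under the assumption that function-word vectors act as half-siblings of content-word vectors; the synthetic data, the ridge penalty, and the dimension-as-sample regression setup are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Illustrative sketch of Half-Sibling Regression (HSR) for word vector
# denoising. Assumption: function-word vectors carry mostly corpus noise,
# so whatever part of a content-word vector they can linearly predict is
# treated as noise and removed. All matrices here are synthetic.
rng = np.random.default_rng(0)
d, n_func, n_content = 300, 50, 1000

F = rng.normal(size=(n_func, d))     # function-word embeddings (half-siblings)
C = rng.normal(size=(n_content, d))  # content-word embeddings to denoise

# Each embedding dimension is one regression sample: predict the content
# words' values at that dimension from the function words' values there.
ridge = Ridge(alpha=1.0, fit_intercept=False)
ridge.fit(F.T, C.T)                  # X: (d, n_func), y: (d, n_content)
noise_estimate = ridge.predict(F.T)  # predicted noise, shape (d, n_content)

C_denoised = C - noise_estimate.T    # subtract the estimated noise component
```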

Cited by 5 publications (3 citation statements)
References 28 publications
“…After obtaining the cleaned word list, we represent each word by its word embedding and stack the embeddings together to obtain a matrix representation of the text. A word embedding represents a word as a vector of real numbers that preserves the semantic similarities between words, and it has been used in many downstream natural language processing tasks (Yang and Liu, 2020). In this study, we use FastText, a pre-trained 300-dimensional Chinese word embedding (Grave et al., 2018).…”
Section: Methods (mentioning)
confidence: 99%
“…Likewise, it is infeasible to take each particular word token (in some sentence) as a potential confounder, and it is also expensive to consider all high-level words, since there are about 30,000 words in the BERT vocabulary. In this paper, we choose nouns as potential confounders since 1) nouns are content words that carry meaning or semantic value [53]; 2) the role of nouns is similar to the role of objects in images, which might ease the inter-modality intervention. Specifically, we use the NLTK toolkit [6] to perform part-of-speech tagging and choose word tokens whose tags belong to ["NN", "NNS", "NNP", "NNPS"] as potential confounders.…”
Section: Intra- and Inter-modality Intervention (mentioning)
confidence: 99%
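The noun-selection step in this excerpt maps directly onto NLTK's part-of-speech tagger; a minimal sketch follows. The example sentence is a placeholder, and the tagger resources are assumed to be available via `nltk.download`.

```python
import nltk

# One-time downloads of the tokenizer and tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Penn Treebank noun tags, as listed in the cited excerpt.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def candidate_confounders(sentence: str) -> list[str]:
    """Return the noun tokens of a sentence, treated as potential confounders."""
    tokens = nltk.word_tokenize(sentence)
    return [tok for tok, tag in nltk.pos_tag(tokens) if tag in NOUN_TAGS]

print(candidate_confounders("A dog chases the ball across the park."))
# e.g. ['dog', 'ball', 'park']
```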
“…Specifically, we adopt the architecture of visiolinguistic BERT [26] and choose nouns in user queries as the confounders in the model to mitigate spurious correlations between words [45]. Since nouns are content words that carry meaning or semantic value [41, 45], noun keywords are more likely to form spurious correlations because they frequently co-occur in the same sentences. Also, as the role of nouns is similar to the role of objects in images, spurious correlations caused by nouns can be harmful to the correctness of the vision-language fusion.…”
Section: Introduction (mentioning)
confidence: 99%