HW 1: Crowdsourced Annotations

Goals

Crowdsourcing annotations has become a fundamental aspect of NLP research. The goal of this assignment is to explore the ethical implications of soliciting crowdsourced data, specifically social biases that may emerge when asking for generated sentences.

Overview

In this homework, you will perform a “bias audit” of an NLP dataset produced by crowdsourcing. You will attempt to measure the presence of social stereotypes in this dataset that may have harmful effects if used to train classifiers in downstream tasks.

You will use pointwise mutual information (PMI) to find which associations are being made with identity labels. PMI can be used as a measure of word association in a corpus, i.e., how much more frequently two words co-occur than would be expected from their individual frequencies alone. See the PMI Wikipedia page for more details. Here we use PMI to measure which words co-occur with labels for identities. This allows us to see associations that may perpetuate stereotypes.

After this analysis, you will present specific examples from the data that you speculate could be particularly biased and problematic. In the optional advanced analysis, you will expand this analysis to another corpus.

Data and Resources

Basic Requirements

Completing the basic requirements will earn a passing (B-range) grade.

Word association analysis: First, build a tool for calculating pointwise mutual information (PMI) between unigrams in the SNLI dataset. Your tool should take a unigram as input, along with word frequencies computed over a corpus, and return a list of the other unigrams in the corpus ranked by PMI with the input word. Terms that occur fewer than 10 times in the corpus should not be considered; optionally, you can experiment with other thresholds. For preprocessing, lowercase the text, tokenize it, and remove stopwords. Note that there are duplicate premises and hypotheses in the data; remove these and look only at unique utterances.
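A minimal preprocessing sketch in Python is shown below. It assumes the SNLI training split is available locally as `snli_1.0_train.jsonl` with `sentence1` (premise) and `sentence2` (hypothesis) fields, and it uses NLTK for tokenization and stopwords; the file path, helper names, and tooling are assumptions you should adapt to your own setup.

```python
import json

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires: nltk.download("punkt") and nltk.download("stopwords")
STOPWORDS = set(stopwords.words("english"))

def load_unique_utterances(path="snli_1.0_train.jsonl", fields=("sentence1", "sentence2")):
    """Collect the unique premises and/or hypotheses from an SNLI jsonl file."""
    utterances = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            for field in fields:
                utterances.add(record[field])
    return utterances

def preprocess(utterance):
    """Lowercase, tokenize, and drop stopwords and punctuation-only tokens."""
    tokens = word_tokenize(utterance.lower())
    return [t for t in tokens if t.isalpha() and t not in STOPWORDS]

# One "document" per unique premise or hypothesis; using a set means
# repeated words within one utterance count only once.
documents = [set(preprocess(u)) for u in load_unique_utterances()]
```

Passing `fields=("sentence1",)` or `fields=("sentence2",)` gives the premise-only and hypothesis-only corpora you will need for the separate analysis described below.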

Here’s how you can calculate PMI. Let \(c(w_i)\) be the count of word \(w_i\) in the corpus and \(c(w_i, w_j)\) be the number of times that \(w_i\) and \(w_j\) occur in the same premise or hypothesis. If they co-occur more than once within a single premise or hypothesis, count that as one co-occurrence. With \(N\) as the number of documents (premises or hypotheses) in the corpus, we define \(P(w_i)\) as the word frequency \(c(w_i) / N\). Then PMI is:

\[\mathrm{PMI}(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)} = \log_2\frac{N\cdot c(w_i, w_j)}{c(w_i)\,c(w_j)}\]
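The formula translates directly into counts over the preprocessed documents. The sketch below is one possible implementation, not a required design; the `documents` variable and threshold follow the earlier preprocessing sketch and are assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def build_counts(documents, min_count=10):
    """Document frequencies for single words and for unordered word pairs.
    Each document is a set of tokens, so repeats within one utterance count once."""
    word_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        word_counts.update(doc)
        for pair in combinations(sorted(doc), 2):
            pair_counts[pair] += 1
    # Drop rare words (a threshold of 10 is used here; other thresholds are fine).
    word_counts = {w: c for w, c in word_counts.items() if c >= min_count}
    return word_counts, pair_counts

def pmi_ranking(target, documents, min_count=10):
    """All words ranked by PMI with `target`, highest first."""
    word_counts, pair_counts = build_counts(documents, min_count)
    if target not in word_counts:
        return []  # target itself is too rare (or absent) in this corpus
    n = len(documents)
    scores = {}
    for other in word_counts:
        if other == target:
            continue
        joint = pair_counts.get(tuple(sorted((target, other))), 0)
        if joint == 0:
            continue  # PMI is negative infinity when the pair never co-occurs
        scores[other] = math.log2(n * joint / (word_counts[target] * word_counts[other]))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: the 20 words most strongly associated with "woman".
# print(pmi_ranking("woman", documents)[:20])
```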

Compute PMI between the identity labels in the provided list and all other words in the SNLI training corpus (see details in the Data and Resources section above). Look at the top-associated words for identity labels of your choice. Do you see any that may reflect social stereotypes? It is helpful to compare the top PMI words for certain identity terms with those for related terms (such as men compared with women). Note that some terms in the list do not occur in the data; they are included for advanced analysis on possible other corpora.
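One way to surface such contrasts, continuing the sketch above (so `pmi_ranking` and `documents` are assumed from there), is to look at words that rank highly for one identity term but not for its counterpart:

```python
def contrast(term_a, term_b, documents, k=30):
    """Top-PMI words for each term that do not appear in the other term's top-k list."""
    top_a = {w for w, _ in pmi_ranking(term_a, documents)[:k]}
    top_b = {w for w, _ in pmi_ranking(term_b, documents)[:k]}
    return top_a - top_b, top_b - top_a

# e.g. men_only, women_only = contrast("men", "women", documents)
```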

Calculate PMI separately for identity terms in the premises, which are the original captions from the Flickr30k image captioning dataset, and identity terms in the hypotheses, which were elicited in a crowdworking task. You will compare the two sets of associations in the write-up.

Qualitative analysis: Find specific hypotheses in the dataset where an identity label occurs alongside one of its top-associated terms, and judge whether or not the example reflects social bias. Look at 1-2 examples for at least 5 different identity labels. Also note the inference label (entailment, contradiction, neutral) and consider the impact of asking annotators for certain types of inference.
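To pull candidate examples for this step, one option (again only a sketch, assuming the raw SNLI jsonl described earlier; the identity/associate pair shown is a hypothetical query) is to scan for pairs whose hypothesis contains both words and keep the gold label for discussion:

```python
import json

def find_examples(path, identity, associate, max_examples=5):
    """Sentence pairs whose hypothesis contains both the identity label
    and the associated term, returned with their gold labels."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            tokens = set(record["sentence2"].lower().split())
            if identity in tokens and associate in tokens:
                examples.append((record["sentence1"], record["sentence2"], record["gold_label"]))
                if len(examples) >= max_examples:
                    break
    return examples

# e.g. find_examples("snli_1.0_train.jsonl", "woman", "kitchen")
```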

Crowdsourcing set-up: Read about what crowdworkers were asked to do in constructing the SNLI corpus in the SNLI paper. Come up with at least one idea about how the designers of the crowdsourcing task might have mitigated any social bias you found in your analysis. For example, are there certain topics that often led to biased hypotheses? Could the task have been structured differently or different instructions given to mitigate bias?

Advanced Analysis

Choose one of the options below for advanced analysis.

New corpus

Choose another crowdsourced NLP or ML dataset and perform a similar bias audit based on identity terms. Datasets to consider include (but are not limited to!):

Modify the identity list for the new dataset, adding labels that occur above a frequency threshold and dropping those that do not, and run the PMI word association analysis. As in the basic requirements, discuss stereotypes found in this corpus and give specific examples. Are there differences in the type or degree of stereotypes found compared with the SNLI corpus? Read about the annotation procedure for these datasets. How might these crowdsourcing tasks have invited (or avoided) responses that reflect stereotypes, in ways similar to or different from SNLI? Discuss potential implications for new crowdsourced data collection in the write-up.

Phrases

Expand the PMI analysis to higher-order n-grams and possibly syntactic phrases. Note that you will also want to expand the identity list to include combinations of identity types (such as asian man). Refer to Rudinger et al. (2017) for ideas here.
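A starting point for the n-gram extension, again only a sketch under the same preprocessing assumptions as above, is to treat each document as the set of its unigrams plus adjacent bigrams, so the existing PMI machinery applies unchanged to terms like "asian man":

```python
def with_bigrams(tokens):
    """Set of unigrams plus adjacent bigrams for one preprocessed utterance.
    If stopwords were removed first, bigrams may skip over them; you may
    prefer to form bigrams before stopword removal."""
    return set(tokens) | {" ".join(pair) for pair in zip(tokens, tokens[1:])}

# documents = [with_bigrams(preprocess(u)) for u in load_unique_utterances()]
# pmi_ranking("asian man", documents) then works unchanged.
```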

Bias with respect to class labels

In this assignment you implemented an approach for identifying associations between unigrams and identity labels in a corpus. You can similarly investigate whether identity words (and closely related words, found e.g. via embedding similarity) are correlated with class labels in text classification tasks, such as sentiment analysis. Discuss your results. What are possible implications of such bias for downstream uses of such a classifier? Are there certain model designs that might be more susceptible to such correlations than others?
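For this option, the same PMI formulation works if you treat each example's class label as one more "word" in its document. The sketch below assumes a generic text classification CSV with hypothetical column names `text` and `label`; adapt it to whatever dataset you choose.

```python
import csv
import math
from collections import Counter

def label_pmi(path, identity_terms):
    """PMI between identity terms and class labels in a text classification CSV.
    Assumes columns named 'text' and 'label'; adapt to the dataset you choose."""
    n = 0
    term_counts, label_counts, joint_counts = Counter(), Counter(), Counter()
    with open(path, encoding="utf-8") as f:
        for row in csv.DictReader(f):
            n += 1
            tokens = set(row["text"].lower().split())
            label_counts[row["label"]] += 1
            for term in identity_terms:
                if term in tokens:
                    term_counts[term] += 1
                    joint_counts[(term, row["label"])] += 1
    return {
        (term, label): math.log2(n * joint / (term_counts[term] * label_counts[label]))
        for (term, label), joint in joint_counts.items()
    }

# e.g. label_pmi("sentiment_train.csv", ["woman", "man", "muslim", "gay"])
```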

Association measures

PMI may not be the best lexical association measure for surfacing social bias around identity terms. Explore lexical association measures other than PMI; see Pecina (2010) for ideas.

Latest methods

Find and implement a recent approach for identifying biases in datasets. Summarize the approach in a couple of paragraphs and describe how you applied it to SNLI or another dataset. Compare the results to your PMI-based analysis. What are the trade-offs of each approach in terms of implementation, efficiency, and results?

Write-up

Each student should submit their own 2-3 page report (ACL format). Please do not submit more than 4 pages, though you can put large tables and figures in an appendix beyond that if necessary. The report should include:

Grading (100 points)

References