HW 2: Civility in Communication


Goals

As we have discussed in class, abusive language on online platforms has become a major concern in the past few years. However, developing automated methods for flagging and censoring abusive language has proved difficult and prone to unwanted biases. The goals of this assignment are to (1) explore the challenges and ethical issues behind developing classifiers for identifying offensive language and (2) develop technical solutions that aim to address these challenges.


Overview

In this assignment, you will explore an off-the-shelf toxicity classifier as well as build your own models. In general, you will evaluate models using two criteria: (1) performance on hate speech detection (Accuracy and F1 Score, where “NOT” is considered the positive label) and (2) False Positive Rate (FPR): how often the model misclassifies non-toxic speech as toxic, measured separately for comments associated with different demographic dialects. Poor performance on hate speech classification suggests that the model is not accurate enough to be useful, while a high or imbalanced FPR indicates that the model may encode racial biases.
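
For concreteness, here is a minimal sketch of these two criteria in Python. It assumes gold and predicted labels are the strings “OFF” and “NOT”; the function names are illustrative, not part of any provided starter code.

```python
# Minimal sketch of the two evaluation criteria. Assumes labels are the
# strings "OFF" / "NOT"; the helper names are illustrative only.
from sklearn.metrics import accuracy_score, f1_score

def detection_metrics(gold, pred):
    """Accuracy and F1 on hate speech detection, with "NOT" as the positive label."""
    return accuracy_score(gold, pred), f1_score(gold, pred, pos_label="NOT")

def false_positive_rate(pred):
    """Fraction of (presumed non-toxic) tweets that the model flags as "OFF"."""
    return sum(p == "OFF" for p in pred) / len(pred)
```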

The primary data for this assignment is available here. Please note that the data contains offensive or sensitive content, including profanity and racial slurs.

We provide data drawn from two sources. The first (files train.tsv and dev.tsv) consists of tweets annotated for offensiveness, taken from the 2019 SemEval task on offensive language detection. In these files, the first column (text) contains the text of a tweet, and the second column (label) contains an offensiveness label: “OFF” (offensive) or “NOT” (not offensive).
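
The files can be loaded with pandas, for example. The call below is a sketch and assumes each file has a header row naming its columns; quoting is disabled because tweets can contain stray quote characters.

```python
# Sketch of loading the provided TSV files; assumes header rows with the
# column names described above. QUOTE_NONE avoids choking on quote
# characters inside tweet text.
import csv
import pandas as pd

train = pd.read_csv("train.tsv", sep="\t", quoting=csv.QUOTE_NONE)
dev = pd.read_csv("dev.tsv", sep="\t", quoting=csv.QUOTE_NONE)
print(train[["text", "label"]].head())
```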

The file offenseval-annotation.txt provides additional details on the annotation scheme.

We additionally provide a data set of tweets proxy-labelled for race in the file titled mini_demographic_dev.tsv. This data is taken from the TwitterAAE data set and uses posterior proportions of demographic topics as a proxy for racial dialect (details). The first column (text) contains the text of the tweet, and the second column (demographic) contains a label: “AA” (for “African American”), “White”, “Hispanic”, or “Other”. For this assignment, we assume that no tweet in the TwitterAAE data set contains toxic language. Thus, any tweet in this file that is classified as toxic is a false positive.
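
Under this assumption, per-dialect FPR reduces to the fraction of each group’s tweets that a model flags as offensive. A sketch follows; the predict stub stands in for whatever classifier you are evaluating.

```python
# Sketch of per-dialect FPR: every tweet in this file is assumed non-toxic,
# so any "OFF" prediction is a false positive.
import csv
import pandas as pd

def predict(text):
    return "NOT"  # stand-in; replace with your classifier's prediction

demo = pd.read_csv("mini_demographic_dev.tsv", sep="\t", quoting=csv.QUOTE_NONE)
demo["pred"] = demo["text"].apply(predict)
fpr = demo.groupby("demographic")["pred"].apply(lambda p: (p == "OFF").mean())
print(fpr)  # one FPR per group: AA, White, Hispanic, Other
```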

Finally, both development sets (dev.tsv and mini_demographic_dev.tsv) contain a column perspective_score, which contains a toxicity score. These scores were obtained using the PerspectiveAPI tool released by Alphabet. This tool is intended to help “developers and publishers…give realtime feedback to commenters or help moderators do their job.”
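
To compare Perspective against your own models on the metrics above, one option is to threshold its scores into hard labels. The 0.5 cutoff below is an arbitrary illustrative choice, not a value prescribed by the assignment.

```python
# Sketch: turn Perspective toxicity scores into "OFF"/"NOT" labels.
# The 0.5 threshold is arbitrary; you may want to tune it on dev.tsv.
import csv
import pandas as pd

dev = pd.read_csv("dev.tsv", sep="\t", quoting=csv.QUOTE_NONE)
dev["perspective_pred"] = dev["perspective_score"].apply(
    lambda s: "OFF" if s >= 0.5 else "NOT")
```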

In all data sets, user mentions have been replaced with the token @USER.


Basic Requirements

Completing the basic requirements will earn a passing (B-range) grade.

Off-the-shelf Model Exploration

Custom Model Exploration

Test Set Predictions

Write-up

Submit a 2-3 page report (ACL format) titled FirstName_LastName_hw2.pdf. Please do not submit more than 4 pages. The report should include:

Be sure to cite all references.


Advanced Analysis

Choose one of the two options below for advanced analysis:

Improve your preliminary classifier You may aim to improve accuracy/F1 of hate speech classification, FPR, or both metrics simultaneously. If you choose to focus on one metric, still report results for the other and discuss any trade-offs. Creative model architectures or feature crafting will receive full credit, even if they do not improve results.

In your report, include a description of your model and results over dev.tsv. Additionally, use your improved classifier to predict results over test.tsv and place these predictions in a file titled FirstName_LastName_advanced.tsv.
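
A sketch of producing that file is below. It assumes test.tsv has a text column and that a plain two-column TSV is an acceptable layout; double-check the expected submission format.

```python
# Sketch: write advanced-model predictions for the test set. The column
# layout here is a guess; confirm the expected submission format.
import csv
import pandas as pd

def predict(text):
    return "NOT"  # stand-in; replace with your improved classifier

test = pd.read_csv("test.tsv", sep="\t", quoting=csv.QUOTE_NONE)
test["label"] = test["text"].apply(predict)
test[["text", "label"]].to_csv("FirstName_LastName_advanced.tsv",
                               sep="\t", index=False)
```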

In order to facilitate analysis, we provide a larger data set here. This extended data set contains full training and dev sets from the TwitterAAE data set, as well as additional data annotated for hate speech drawn from a different paper (ICWSM, 2017). Note that user mentions have not been replaced in this data set. You are free to explore any ideas you have; we provide a few pointers below for inspiration.

If you choose to maximize performance on offensiveness classification, you may develop a more sophisticated model for hate speech detection. Some prior work includes:

Models from prior SemEval tasks may also be helpful. Additionally, the provided train.tsv file contains annotations for different types of offensive language (e.g. untargeted vs. targeted; these labels are in the third column, titled category), which you may also consider leveraging.

If you choose to improve FPR, you may wish to leverage the provided demographic_train.tsv file. Data from this file could be used to balance your training data (a simple baseline is sketched below) or to train a model with an adversarial objective.
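
The balancing sketch below treats tweets from demographic_train.tsv as additional presumed-non-toxic training examples, so dialect features are less strongly correlated with the “OFF” class. It assumes the file has a text column; this is a simple data-balancing baseline, not the adversarial approach.

```python
# Sketch of one debiasing baseline: augment training data with
# presumed-non-toxic TwitterAAE tweets labelled "NOT". Assumes
# demographic_train.tsv has a "text" column.
import csv
import pandas as pd

train = pd.read_csv("train.tsv", sep="\t", quoting=csv.QUOTE_NONE)
aae = pd.read_csv("demographic_train.tsv", sep="\t", quoting=csv.QUOTE_NONE)
augment = pd.DataFrame({"text": aae["text"], "label": "NOT"})
balanced = pd.concat([train[["text", "label"]], augment], ignore_index=True)
balanced = balanced.sample(frac=1.0, random_state=0)  # shuffle
```

Some related work includes: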

Adapt to Gab As a second option, you can explore how to adapt your classifier to a data domain other than Twitter and to a slightly different task. Kennedy et al. (2021) introduce a new annotation scheme for “hate-based rhetoric” with three categories: Human Degradation (HD), Calls for Violence (CV), and Vulgar/Offensive (VO). The Gab Hate Corpus (GHC) annotates posts from the website gab.com for these three categories (among other variables, such as targeted group and framing). Gab is commonly used by political extremists and the alt-right (Andrews, 2021), so hate speech there is more concentrated than on Twitter and has a different distribution over topics and targeted groups.

To investigate how well your model transfers, you can select one or more of the hate-based rhetoric categories to evaluate on. Then, you can train a new classifier over the Gab training data and compare its performance with your preliminary classifier; a minimal evaluation sketch follows this paragraph. Since the GHC also annotates targeted populations and framing, you could look more closely at how performance differs over speech targeting different groups (e.g. political identity (POL) vs. racial/ethnic identity (RAE)). Alternatively, you can develop methods to adapt a model trained on Twitter assuming little or no training data from Gab. Dataset available here.
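
For instance, the sketch below evaluates a Twitter-trained model zero-shot on one GHC category. The file name and the binary hd (Human Degradation) column are guesses at the released schema; adjust them to the actual corpus format.

```python
# Sketch: zero-shot evaluation of a Twitter-trained model on one GHC
# category. The file name and binary "hd" column are assumptions about
# the released schema.
import csv
import pandas as pd
from sklearn.metrics import f1_score

def predict(text):
    return "NOT"  # stand-in; replace with your preliminary classifier

gab = pd.read_csv("ghc_test.tsv", sep="\t", quoting=csv.QUOTE_NONE)
gold = gab["hd"]  # 1 if annotated as Human Degradation, else 0
pred = [1 if predict(t) == "OFF" else 0 for t in gab["text"]]
print(f1_score(gold, pred))
```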


Grading (100 points)


Implementation Tips


References