How can you tell when an online conversation may get out of hand or go off the rails? Justine Zhang and Jonathan Chang, researchers at Cornell University, recently developed a framework for predicting whether an online discussion could devolve—and then applied that framework to conversations between Wikipedia editors.
Zhang and Chang will be presenting results from this research (part of a collaboration between Cornell University, Jigsaw, and the Wikimedia Foundation) at the upcoming Wikimedia Research Showcase on 18 June 2018 at 11:30 PT. A livestream will be available on YouTube. In advance, we asked them to reflect on some of their results and on where their research could go next.
1. Could you explain the methodology of the work and its limitations?
Online conversations have a reputation for getting out of hand (e.g., devolving into insults, toxicity, or personal attacks). Is it possible to predict whether a (currently civil) conversation will get out of hand sometime in the future? To explore this question, we develop a framework for capturing linguistic patterns used to start conversations, and analyze whether these patterns can indicate that the conversation will (or will not) devolve into personal attacks in the future. Applying our framework to conversations between Wikipedia editors, we find that it is feasible to detect early warning signs of conversations getting out of hand.
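To give a rough sense of what such a framework could look like in practice, here is a minimal, illustrative sketch; it is not the pipeline described in the paper, and the marker lists, feature choices, and function names (opening_features, train_derailment_model) are hypothetical. It counts a few simple markers in the first two comments of a conversation, of the kind discussed later in this interview, and fits an off-the-shelf logistic regression classifier.

```python
# Minimal, illustrative sketch only (not the authors' actual pipeline):
# count a few opening-comment markers of the kind discussed in this interview
# (direct questioning, second person, hedging, friendly greetings) and feed
# them to a standard logistic regression classifier. Marker lists and
# function names are hypothetical.
import re
from sklearn.linear_model import LogisticRegression

HEDGES = {"perhaps", "maybe", "seems", "apparently", "probably"}
GREETINGS = {"hi", "hello", "hey", "thanks", "thank"}
SECOND_PERSON = {"you", "your", "yours", "yourself"}

def opening_features(first_two_comments):
    """Turn the first two comments of a conversation into a small feature vector."""
    text = " ".join(first_two_comments).lower()
    tokens = re.findall(r"[a-z']+", text)
    return [
        text.count("?"),                          # direct questioning
        sum(t in SECOND_PERSON for t in tokens),  # second person
        sum(t in HEDGES for t in tokens),         # hedging
        sum(t in GREETINGS for t in tokens),      # friendly greetings
    ]

def train_derailment_model(conversations):
    """conversations: list of (first_two_comments, derailed) pairs, where
    derailed is 1 if the conversation later led to a personal attack."""
    X = [opening_features(comments) for comments, _ in conversations]
    y = [derailed for _, derailed in conversations]
    return LogisticRegression().fit(X, y)
```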
While the results we have obtained thus far are promising, we note that our approach has several limitations. One drawback is that our framework is not causal: we can pick up correlations between certain linguistic patterns and future personal attacks, but we have no way of concluding that the use of those patterns causes the future attacks. Future work could use randomized controlled trials to tackle the question of causality. Another drawback is our reliance on crowdsourced labels, which might exclude more subtle forms of personal attacks. Future work could consider other ways to label personal attacks.
2. Could you give a summary of the results of this research?
We evaluate our framework by applying it to the following “guessing game”: given a pair of conversations (both of which start civil), we show the first two comments of each conversation, and the goal is to predict which conversation will lead to a personal attack. Humans make the right guess about 72% of the time, indicating that they have an intuition for this task, but also highlighting that the task is far from trivial. (If you want to test your own intuitions on this task, we have set up an online version of the game at http://awry.infosci.cornell.edu/.) We find that our model can recover some of this intuition, making the right guess about 65% of the time—and 80% of the time when applied to the cases that humans were able to correctly guess.
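As a rough illustration of how such a paired guessing game can be scored automatically, the following sketch (again hypothetical, reusing the model and opening_features from the sketch above) guesses, for each pair, that the conversation with the higher predicted risk is the one that will derail, and reports the fraction of pairs guessed correctly.

```python
# Illustrative scoring of the paired "guessing game": each pair holds the
# opening comments of one conversation that later derailed and one that
# stayed civil; we guess the conversation the model considers riskier.
# Assumes the hypothetical `opening_features` and trained model above.
def guessing_game_accuracy(model, pairs):
    """pairs: list of (derailing_comments, civil_comments) tuples."""
    def risk(comments):
        # Predicted probability that the conversation leads to a personal attack.
        return model.predict_proba([opening_features(comments)])[0][1]
    correct = sum(risk(derailing) > risk(civil) for derailing, civil in pairs)
    return correct / len(pairs)
```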
We also reveal which ways of starting a conversation are more likely to lead to it derailing into a personal attack. We find that indicators of conversations getting out of hand include language that perhaps suggests a more direct, confrontational tone (e.g., repeated direct questioning, use of the second person), while indicators of staying on track reflect softer, potentially more congenial language (e.g., hedging, friendly greetings).
3. What are your thoughts about potential applications: we have this research, so what now? Can our volunteers already pay attention to some conversational prompts? How could such a system be used for flagging conversations that are predicted to deteriorate within a few steps?
The long-term goal of this line of work is to provide actionable knowledge to moderators at a time when the conversation can still be salvaged (as opposed to detecting attacks after the fact). While the current work is at a proof-of-concept stage, an idealized version of our framework could be deployed alongside existing moderation tools, and could lessen the load on human moderators by helping them prioritize which conversations to keep an eye on. Another potential application, directed at editors rather than moderators, would be to highlight parts of a drafted comment that are likely to be perceived as aggressive or to lead to personal attacks.
4. We are also curious about issues related to bias in the algorithm: what are your thoughts on bias, discrimination, and the disparate impact the algorithm may introduce, if it were to be used in production?
Right now, there are at least three sources of bias in the algorithm. The first comes from the fact that we use an automated machine learning tool to select potentially toxic conversations to send to crowdsourcing. This means that we inherit biases already present in such machine learning approaches. The second source of bias comes from the use of crowdsourcing to determine which conversations contain personal attacks: as with any use of crowdsourcing, this results in labels that reflect what the crowdsourced workers perceive to be personal attacks, a perception that may itself be biased. Finally, the third source of bias is that the model is currently trained only on English Wikipedia. Beyond the mere fact of introducing a language limitation, this choice also means the model has little, if any, knowledge of culture-specific forms of attacks.
All this being said, more research would be needed to understand what specific biases are being picked up. We are actively thinking of ways to reduce the risk of bias in the model; in particular, we are investigating possible alternative methods for pre-filtering conversations that are not based on machine learning.
5. Do you have anything else that you’d like to add?
Right now, our work focuses on the phenomenon of conversations getting out of hand. However, we have also observed cases of conversations that appear to be getting out of hand but later get back on track. We are very interested in studying this phenomenon, both because it is scientifically interesting and because understanding which behaviors or interventions might prevent conversations from fully derailing would have practical implications for designing better social systems.
Interview by Melody Kramer, Senior Audience Development Manager, Communications
Dario Taraborelli, Director, Head of Research
Wikimedia Foundation
You can read their paper and take an online quiz on Cornell’s website, and see their data and code on GitHub. This research is a formal collaboration with the Wikimedia Foundation, subject to our open access policy, and the code, data, and scientific output of this work are being released in the open to secure their accessibility and reusability.