Profanity on Twitter in the Nordics

Steven Coats
English Philology, University of Oulu
steven.coats@oulu.fi

5th SwiSca Symposium on Swearing
November 23rd, 2017

Background and research questions

  1. Explore extent of profanity use by gender
  2. Quantify results by country and gender and identify most characteristic words
  3. Use word embeddings to investigate semantic space by gender

Data collection | Twitter Streaming API

Data collection | Gender disambiguation

\[ P(\mathrm{name}_x \in \mathrm{male}) = \frac{\sum \mathrm{name}_x \in \mathrm{male}}{\sum \mathrm{name}_x}, \qquad P(\mathrm{name}_x \in \mathrm{female}) = \frac{\sum \mathrm{name}_x \in \mathrm{female}}{\sum \mathrm{name}_x} \]
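
The proportions above can be computed directly from name counts. The following is a minimal Python sketch, not the procedure used in the study: the function name, the count dictionaries, and all numbers are hypothetical, and it assumes that every occurrence of a name is recorded as either male or female.

```python
def gender_probabilities(male_counts, female_counts, name):
    """Return (P(name in male), P(name in female)) from raw name counts.

    Assumes the total count of a name is its male count plus its
    female count. Returns None if the name is in neither list.
    """
    m = male_counts.get(name, 0)
    f = female_counts.get(name, 0)
    total = m + f
    if total == 0:
        return None
    return m / total, f / total


# Hypothetical counts, for illustration only
male_counts = {"kim": 30, "mikko": 100}
female_counts = {"kim": 70, "anna": 120}

p_male, p_female = gender_probabilities(male_counts, female_counts, "kim")
```

A name attested for only one gender gets probability 1 for that gender, while an ambiguous name such as the invented "kim" entry is split in proportion to its counts.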

Data collection | Twitter REST API

Tweet density by language

Map polygons from Natural Earth, maps from OpenStreetMap

Some sample tweets containing profanity

Profanity

shit, fuck, cunt, whore, damn, nigger
The regex captures any item containing one of these strings, such as dumbshit, shithead, fucktard, etc.
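
Substring matching of this kind can be sketched in Python as follows. This is an illustrative assumption about how such a pattern might be built, not the regex used in the study: the names `BASE_STRINGS`, `PATTERN`, and `profane_tokens` are hypothetical, only a subset of the base strings is used, and tokens are split on whitespace.

```python
import re

# Illustrative subset of the base strings listed above
BASE_STRINGS = ["shit", "damn"]

# No word-boundary anchors, so any token containing a base string matches
PATTERN = re.compile("|".join(re.escape(s) for s in BASE_STRINGS), re.IGNORECASE)


def profane_tokens(text):
    """Return the whitespace-delimited tokens that contain a base string."""
    return [tok for tok in text.split() if PATTERN.search(tok)]


sample = "That dumbshit and his shithead friend missed the damned bus"
matches = profane_tokens(sample)
```

Because the pattern has no word-boundary anchors, compounds and derived forms are captured along with the base strings themselves.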

E.g. bollocks, faggot, fuq, heeb, hell, hillbilly, homo, honkey, hussy, jackass, jackoff, jigaboo, lameass, lardass, lesbo, lezzie, limey, limpdick, mcfagget, minge, moon cricket, nignog, paki, pansy, peckerhead, pikey, piss, pussy, spastic, snow nigger, twat, white trash, wtf, etc.

E.g. arnapalaaq, iteq, nipangerit, aumingi, böllur, djöfullinn, drusla, fífl, fíflingur, mogghøvd, skitni fæni, pupp, pæss, rass, rasshøl, rompe, ronk, runk, ræv, ræva, rævhøl, rævva, bøsserøv, fisse, fissehår, hestepik, klaphat, kussekryller, lort, fitta, fittig, för helvete, helvete, jävlä, jävlar, knulla, kuk, kuksås, huoraa, huorilta, huorilla, huorille, huorista, huorien, huoriin, huorissa, huorat, huoria, kusipäät, kusipäältä, kusipäällä, etc.

Sources of profanity

Noswearing.com scrape (347 terms)
List of 1,383 potentially offensive terms created at Carnegie Mellon University, Pittsburgh, USA

Online word lists from software tools created to filter user input on websites (for Norwegian, Finnish, Swedish and Danish)
Crowd-sourced Youswear dictionary for some terms in Icelandic, Faroese, and Greenlandic

Use of profanity by country and gender

\[ G = 2\sum_{i} {O_{i} \cdot \ln\left(\frac{O_i}{E_i}\right)} \]
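
As a worked illustration of the statistic above, the sketch below computes G for one word in two subcorpora. The slides do not state how the expected values were derived, so the `expected_counts` helper is an assumption: it takes the expected count to be what each subcorpus would contain if the word were equally frequent in both. All counts are invented.

```python
import math


def g_statistic(observed, expected):
    """G = 2 * sum(O_i * ln(O_i / E_i)); cells with O_i = 0 contribute 0."""
    return 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


def expected_counts(word_counts, corpus_sizes):
    """Expected counts if the word had the same relative frequency
    in every subcorpus (an assumption of this sketch)."""
    total_words = sum(corpus_sizes)
    total_hits = sum(word_counts)
    return [size * total_hits / total_words for size in corpus_sizes]


# Hypothetical counts of one word in the female and male subcorpora
observed = [30, 70]
sizes = [100_000, 100_000]

expected = expected_counts(observed, sizes)
g = g_statistic(observed, expected)
```

With these invented counts the expected value is 50 in each subcorpus, and G is about 16.46. Larger values of G indicate a stronger departure from equal relative frequency.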

Iceland (Nordic profanity)

Total Nordic-language profanity per 1k words: Females 0.204, males 0.224

Norway (Nordic profanity)

Total Nordic-language profanity per 1k words: Females 0.311, males 0.343

Denmark (Nordic profanity)

Total Nordic-language profanity per 1k words: Females 0.188, males 0.181

Sweden (Nordic profanity)

Total Nordic-language profanity per 1k words: Females 0.198, males 0.235

Finland (Nordic profanity)

Total Nordic-language profanity per 1k words: Females 0.478, males 0.602

Totals, all Nordics (Nordic profanity)

Total Nordic-language profanity per 1k words: Females 0.232, males 0.297 (includes Greenland, Faroes, and Åland)

What about English lexical items?

Iceland (English profanity)

Total English-language profanity per 1k words: Females 2.952, males 3.392

Norway (English profanity)

Total English-language profanity per 1k words: Females 2.33, males 1.182

Denmark (English profanity)

Total English-language profanity per 1k words: Females 2.907, males 2.169

Sweden (English profanity)

Total English-language profanity per 1k words: Females 1.842, males 1.475

Finland (English profanity)

Total English-language profanity per 1k words: Females 1.498, males 1.271

Totals, all Nordics (English profanity)

Total English-language profanity per 1k words: Females 2.028, males 1.627 (includes Greenland, Faroes, and Åland)

Profanity frequency by country and gender

Word embeddings

[Figure from Mikolov et al. 2013: 749]
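
The vector-offset idea described by Mikolov et al. (2013) can be illustrated with a small Python sketch. The two-dimensional vectors below are invented toy values, not trained embeddings, and the function names are hypothetical. The sketch only shows the arithmetic of finding the word closest to vec(b) - vec(a) + vec(c) by cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(vectors, a, b, c):
    """Return the word whose vector is closest to vec(b) - vec(a) + vec(c),
    excluding the three query words themselves."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy 2-dimensional vectors invented for illustration
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -2.0],
}

result = analogy(toy, "man", "woman", "king")
```

In this toy space the offset from "man" to "woman" equals the offset from "king" to "queen", so the query returns "queen".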


t-SNE to 2-dimensional space

t-SNE Visualization

Caveats and future outlook

Preliminary summary and conclusions

Acknowledgements

Also thanks to

References

Argamon, S., M. Koppel, J. W. Pennebaker and J. Schler. 2007. Mining the blogosphere: Age, gender, and the varieties of self-expression. First Monday 12/9. http://firstmonday.org/ojs/index.php/fm/article/view/2003/1878

Bamman, D., J. Eisenstein and T. Schnoebelen. 2014. Gender identity and lexical variation in social media. Journal of Sociolinguistics 18(2), 135–160. http://onlinelibrary.wiley.com/doi/10.1111/josl.12080/full

Coats, S. 2016. Grammatical feature frequencies of English on Twitter in Finland. In L. Squires (ed.), English in computer-mediated communication: Variation, representation, and change. Berlin: De Gruyter. 179–210.

Dewaele, J.-M. 2004. The emotional force of swearwords and taboo words in the speech of multilinguals. Journal of Multilingual and Multicultural Development 25(2–3), 204–222.

Firth, J.R. 1957. Papers in linguistics, 1934–1951. London: Oxford University Press.

Labov, W. 1990. The intersection of sex and social class in the course of linguistic change. Language Variation and Change 2, 205–254.

Lui, M. and T. Baldwin. 2012. Langid.py: An off-the-shelf language identification tool. Proceedings of the ACL 2012 System Demonstrations, 25–30. Stroudsburg, PA: ACL. http://dl.acm.org/citation.cfm?id=2390475

McEnery, T. 2006. Swearing in English: Bad language, purity and power from 1586 to the present. New York: Routledge.

Mehl, M. and J. Pennebaker. 2003. The sounds of social life: A psychometric analysis of students’ daily social environments and natural conversations. Journal of Personality and Social Psychology 84(4), 857–870.

Mikolov, T., W. Yih, and G. Zweig. 2013. Linguistic regularities in continuous space word representations. In: Proceedings of HLT-NAACL 13, 746–751. https://www.aclweb.org/anthology/N13-1090

Newman, M.L., C. Groom, L. Handelman, and J. Pennebaker. 2008. Gender differences in language use: An analysis of 14,000 text samples. Discourse Processes 45, 211–236. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.216.4267&rep=rep1&type=pdf

Roesslein, J. 2015. Tweepy. Python package [Computer software]. http://www.tweepy.org

Wang, W., L. Chen, K. Thirunarayan, and A. P. Sheth. 2014. Cursing in English on Twitter. In: Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing, 415–425.