class: center, middle, inverse, title-slide # A Few Twitter Language Research Projects ###
Steven Coats
English Philology, University of Oulu, Finland
steven.coats@oulu.fi
###
DHH Hackathon, Helsinki
May 16th, 2019
---
class: inverse, center, middle
background-image: url(http://cc.oulu.fi/~scoats/oululogoRedTransparent.png); background-repeat: no-repeat; background-size: 80px 57px; background-position:right top;
exclude: true

---
layout: true

<div class="my-header"><img border="0" alt="W3Schools" src="http://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats                 A Few Twitter Language Research Projects | DHH19</span></div>

---

## Outline

1. European language ecology, bi- and multilingualism
2. Profanity and gender in the Nordic countries
3. Skin tone emoji
4. Code-switching and borrowing: English-German
5. Anglicisms in German

.footnote[Slides for the presentation are on my homepage at https://cc.oulu.fi/~scoats]

---

<div class="my-header"><img border="0" alt="W3Schools" src="http://cc.oulu.fi/~scoats/NewLogoRussianPNG1.png" width="80" height="80"></div>

### European language ecology

- This project investigated bi- and multilingualism among European Twitter users
--
- Who is using which languages on social media? Absence of “reliable, quantitative measurement of online linguistic diversity” (Lee 2016)
--
- The sample consisted of tweets from users located in Europe who published tweets in more than one language
--
- What patterns are evident at the country level and the language level?
--
- Can tell us “where languages stand and where they are going in comparison with other languages of the world” (Haugen 1972)
--
- Possibly relevant techniques:
  - Place and language disambiguation
  - Use of Shannon entropy as a diversity measure
  - Choropleth mapping using Leaflet
  - Network analysis, visualization in R using visNetwork

---

### European language ecology .small[(Coats 2017)]

- Tweets with populated `place` attributes from 55 European countries/territories, collected from the Twitter Streaming API November 2016 – June 2017 ("seed" data: 153.2m tweets, 2.9m users)
--
- User timelines downloaded (up to 3250 tweets) for users with at least 100 tweets, of which at least 10 are in a second language, whose `tweet_source` is not a bot, and for whom > 90% of `place` values in the "seed" data and > 50% in the total data were from one country
--
- In total 50,268 unique authors, tweets from 2007 to 2017
--
- "Double checking" Twitter's language detection algorithm: only tweets with ≥ 10 words, required match with `cld2`

---

### Map
---

### European language ecology
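
The Shannon entropy diversity measure listed among the techniques could be computed roughly as in the sketch below, here over the language tags of a user's tweets. The function name and the sample data are hypothetical and are not taken from the project code; the same calculation could equally be applied per country or per language.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the distribution of language labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum over languages of p * log2(1 / p)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Hypothetical per-tweet language tags for two users
monolingual = ["fi"] * 10
bilingual = ["fi"] * 5 + ["en"] * 5

print(shannon_entropy(monolingual))
print(shannon_entropy(bilingual))
```

A user who tweets in only one language gets an entropy of zero, and the value grows as tweets are spread more evenly over more languages.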
---

### Profanity and gender in the Nordic countries

- Tweets from Greenland, Iceland, Faroes, Norway, Denmark, Sweden, Åland, Finland
--
- Profanities in English, Kalaallisut, Icelandic, Faroese, Norwegian, Danish, Swedish, Finnish

---

### Profanity and gender in the Nordic countries

- Males use more profanity than females, but for English-language profanities, the rates are more similar
--
- Males use more "traditional" profanities (diabolical invocations such as *fjandinn*, *helvete*, *fandeme*, *jävlar*, *saatana*, *perkele*)
--
- Males use more scatological profanities (*paskaa*, *skit*, *shit*)
--
- Females use more profanities that negatively denote females (*hora*, *tussa*, *druslur*, *bruttan*, *luder*, *kælling*, *ämmä*, *horo*)
--
- Possibly relevant techniques:
  - Word frequency analysis
  - Nordic Genderizer: data from Nordic statistical offices aggregated for 33k name types; m/f probabilities used to infer Twitter gender from the `user:name` value
  - Word embeddings (word2vec, gensim) and t-SNE

---

### Skin tone emoji .small[(Coats 2018a)]

Since Unicode 8.0 (June 17, 2015), skin tone characters have been part of Unicode
--
.pull-left30[
.small[[source](https://www.arpansa.gov.au/sites/g/files/net3086/f/legacy/pubs/RadiationProtection/FitzpatrickSkinType.pdf)]]
--
.pull-right70[
Emoji Modifier Fitzpatrick Type-1-2<br>
Emoji Modifier Fitzpatrick Type-3<br>
Emoji Modifier Fitzpatrick Type-4<br>
Emoji Modifier Fitzpatrick Type-5<br>
Emoji Modifier Fitzpatrick Type-6<br>
]

---

### Emoji sequences

Since Unicode 9.0 (late 2016), emoji sequences can also be used to indicate activities, professions, groups, etc. These can usually be combined with skin tone as well.
--
Sequences can utilize additional **zero-width joiner** and **variation selector** code points to show that the sequence is to be parsed as one character

🧖🏻‍♂️ = \U0001f9d6\U0001f3fb\U0000200d\U00002642\U0000fe0f
--
- Parsing and tokenization of emoji sequences can present difficulties

---

### Skin tone emoji

- Negative correlation between sentiment and darker skin tone
--
- Tweets from much of Asia and the Middle East contain mainly the lightest skin tones
--
- Possibly relevant techniques:
  - Correct Python tokenization of composed emoji
  - Emoji-based sentiment analysis dictionary (Kralj-Novak et al. 2015)
  - More maps (Leaflet, Google GeoCharts)
  - Emoji embeddings (word2vec, gensim) and t-SNE

---

### Global skin tone emoji summary statistics

.small[
]

---

### Median skin tone color
---

### t-SNE emoji semantic similarity
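
---

### Emoji sequences

To illustrate why composed emoji need care when tokenizing in Python, the sketch below groups code points so that skin tone modifiers, variation selectors, and zero-width-joiner sequences stay attached to their base character. This is a simplified heuristic with hypothetical names, not the tokenizer used in the study; a fuller solution would follow the Unicode grapheme cluster rules, for example via the `\X` pattern of the third-party `regex` module.

```python
ZWJ = "\u200d"   # zero-width joiner
VS16 = "\ufe0f"  # variation selector-16 (emoji presentation)
# Fitzpatrick skin tone modifiers U+1F3FB to U+1F3FF
SKIN_TONES = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}


def group_emoji_sequences(text):
    """Split text into code-point clusters, keeping emoji sequences whole."""
    tokens = []
    join_next = False  # True right after a ZWJ: next code point joins too
    for ch in text:
        if tokens and (join_next or ch in SKIN_TONES or ch in (ZWJ, VS16)):
            tokens[-1] += ch
        else:
            tokens.append(ch)
        join_next = ch == ZWJ
    return tokens


# The sequence from the slide: person in steamy room + light skin tone
# + ZWJ + male sign + variation selector
sequence = "\U0001f9d6\U0001f3fb\U0000200d\U00002642\U0000fe0f"
tokens = group_emoji_sequences("a" + sequence + "b")

print(len(sequence))  # number of code points in the single visible emoji
print(len(tokens))    # number of clusters after grouping
```

Iterating over a Python string yields one item per code point, so a naive character split would break this single visible emoji into five pieces.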
---

### Code-switching and borrowing: German-English

- Get word vectors in a large corpus of tweets with German-English code-switching
--
- Trace the semantic specialization of borrowed English words in German

---

### Anglicisms in German .small[(Coats 2018b)]

- Automatically generate potential German verbal Anglicisms from a long list of English verbal roots .small[(Davies 2004–, 2008–, 2015; Hanks 2013)]
--
- Get their frequencies
--
Some recent borrowings from English show partial assimilation to standard German orthography: they can retain the -*ed* of the English participle

- .best_studio[liken - geliked/gelikt] .small[(to like - liked)], .best_studio[crashen - gecrashed/gecrasht] .small[(to crash - crashed)], .best_studio[featuren - gefeatured/gefeaturt] .small[(to feature - featured)]
- To what extent are users on social media coining *new* verbal Anglicisms (i.e., ones not yet codified as German words or well established in German use)?

---

### Anglicisms in German: Attested types by frequency

.small[
]

---

### Anglicisms in German: Rank-frequency profile of attested types
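
As a minimal sketch of the two steps described earlier (generating candidate verb forms from English roots, then getting their frequencies), the code below builds an infinitive and both participle spellings for each root and ranks the attested forms by frequency. The function name, the rule for dropping a final *e*, and the sample tokens are assumptions for illustration only, not the procedure or data from Coats (2018b).

```python
from collections import Counter


def candidate_forms(root):
    """Return hypothetical German verb forms built on an English root."""
    # Assumed rule: drop a final "e" before adding German affixes
    stem = root[:-1] if root.endswith("e") else root
    return {
        "infinitive": stem + "en",
        "participle_ed": "ge" + stem + "ed",  # retains English -ed
        "participle_t": "ge" + stem + "t",    # assimilated German -t
    }


roots = ["like", "crash", "feature"]
# Map every candidate form back to its English root
lookup = {
    form: root for root in roots for form in candidate_forms(root).values()
}

# Hypothetical tokenized, lowercased tweet text
tokens = "hab das gelikt und geliked , der server ist gecrasht , gelikt".split()
counts = Counter(tok for tok in tokens if tok in lookup)

# Rank-frequency profile: attested types sorted by descending frequency
for rank, (form, freq) in enumerate(counts.most_common(), start=1):
    print(rank, form, lookup[form], freq)
```

For the three roots shown, the generated forms match the examples given above (liken, geliked/gelikt, and so on); real English roots would need more spelling rules than this.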
---

### Anglicisms in German

- Possibly relevant techniques:
  - Linguistic analysis of other blends/portmanteaus
  - Morphological productivity of neologism bases
  - Word vectors for "Brexit" in different languages: closest words

---

# Thank you!

---

### References

.small[
.hangingindent[
Coats, S. (2017). European language ecology and bilingualism with English on Twitter. In C. Wigham and E. Stemle (eds.), *Proceedings of the 5th conference on CMC and social media corpora for the humanities*, 35–38. Bozen/Bolzano: Eurac Research.

Coats, S. (2018a). Skin tone emoji and sentiment on Twitter. In E. Mäkelä and M. Tolonen (eds.), *Proceedings of the 3rd Digital Humanities in the Nordic Countries Conference*, Helsinki, Finland, March 7–9, 2018, 122–138. Aachen, Germany: CEUR.

Coats, S. (2018b). Variation of new German verbal Anglicisms in a social media corpus. In R. Vandekerckhove, D. Fišer and L. Hilte (eds.), *Proceedings of the 6th conference on CMC and social media corpora for the humanities*, 27–32. Antwerp, Belgium: University of Antwerp.

Davies, M. (2004–). *BYU-BNC (Based on the British National Corpus from Oxford University Press)*. [https://corpus.byu.edu/bnc](https://corpus.byu.edu/bnc).

Davies, M. (2008–). *The Corpus of Contemporary American English (COCA): 560 million words, 1990-present*. [https://corpus.byu.edu/coca/](https://corpus.byu.edu/coca/).

Davies, M. (2015). *The Wikipedia Corpus: 4.6 million articles, 1.9 billion words*. [https://corpus.byu.edu/wiki/](https://corpus.byu.edu/wiki/).

Hanks, P. (2013). *Lexical Analysis: Norms and Exploitations*. Cambridge, MA: MIT Press.

Haugen, E. (1972). The ecology of language. In E. Haugen and A. Dil (eds.), *The ecology of language*, 325–339. Stanford: Stanford University Press.

Kralj-Novak, P., Smailovic, J., Sluban, B., and Mozetic, I. (2015). [Sentiment of emojis](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144296). *PLoS ONE* 10(12).

Lee, C. (2016). Multilingual resources and practices in digital communication. In A. Georgakopoulou and T. Spilioti (eds.), *The Routledge handbook of language and digital communication*, 118–132.
]]