European Language Ecology and Bilingualism with English on Twitter

Steven Coats
English Philology, University of Oulu, Finland
steven.coats@oulu.fi

5th CMC-Corpora Conference, Bozen
October 3rd, 2017

Starting points

  • Language use online: Who is using what languages on social media? Absence of “reliable, quantitative measurement of online linguistic diversity” (Lee 2016: 129).
    • Survey data from the European Commission (2012) or country-level studies (e.g. Leppänen et al. 2011)
    • Sampling of language use on Twitter (Mocanu et al. 2013, Leetaru et al. 2013) or from the web (Sites 2013b)
  • Is English displacing other languages online? What is the status of European languages on Twitter?
  • Can we create a language network that shows the relative propensity of a European bilingual to use a particular language?
    • Networks based on connections between individual users (Eleta and Golbeck 2014, Hall 2014, Kim et al. 2014)
    • This project: A network based on connections between languages to get the larger picture
  • Attempt to characterize online language ecology, which can “tell us something about where languages stand and where they are going in comparison with other languages of the world” (Haugen 1972)

Data collection

Data collection | Language detection

Language identification and tweet length

Tweets identified as Albanian by cld2

##  [1] "rahhhhhh brenda didn't win"       
##  [2] "mdrrr j'en ai marre de lui"       
##  [3] "j'en ai marre je te jure mddr"    
##  [4] "ptdrrrr j'en ai marre de toi"     
##  [5] "j'en ai marre des gens cons ptn"  
##  [6] "j'en ai marre de mes tâches"      
##  [7] "j'en ai marre de ce gamin️"        
##  [8] "1010 j'en ai marre d'ce gamine"   
##  [9] "mdrrrr la flm j'te jure ."        
## [10] "robson il force à pas sortir kane"
##  [1] "kosova maqedonia unë të dua unë do të jetë i pranishëm në tiranë në prill"                                         
##  [2] "keni dokumente të leshuara nga institucione të huaja për t i legalizuar aplikoni online nëpërmjet platformës"      
##  [3] "mj takon ministrin rritja e b p ekonomik përforcimi i b p ndërkufitar në foku"                                     
##  [4] "mj siguron i ofron shqipërisë çdo asistencë të mundshme për emergjencën e zjarreve"                                
##  [5] "regjistrohu në app vetëm duke përdorur të dhënat emër mbiemër dhe adresën e postës elektronike"                    
##  [6] "mbahet samiti i kartës së adriatikut me zëvëndës presidentin amerikan mike pence"                                  
##  [7] "ministri mirëpret nënshkrimin e traktatit të miqësisë mes maqedonisë dhe bullgarisë"                               
##  [8] "a e dini se dje ishte dita e fundit e edicionit po që u grumbulluan 22k faleminderit"                              
##  [9] " kryesia bullgare do t ju mbështesë në marrjen e vendimit politik për hapjen e procesit të negociatave zv km mpj"  
## [10] " shpresojmë në 2018 të kemi një b p edhe më të fokusuar duke konsideruar prioritetin e zgjerimit të be të kryesisë"
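
The two lists above suggest that the shorter tweets labelled Albanian are mostly French, while the longer ones really are Albanian. A minimal sketch of one possible safeguard, a length filter applied before a label is accepted, follows. The threshold `min_chars=50`, the function `reliable_label` and the stand-in detector `toy_detect` are illustrative assumptions, not part of the original pipeline; a real detector such as cld2 would be passed in place of `toy_detect`.

```python
def reliable_label(text, detect, min_chars=50):
    """Return a language code, or None if the tweet is too short to trust.

    detect    -- any callable mapping a string to a language code
                 (hypothetical stand-in for a detector such as cld2)
    min_chars -- illustrative threshold, not a value from the slides
    """
    if len(text.strip()) < min_chars:
        return None
    return detect(text)


def toy_detect(text):
    # Hypothetical toy detector used only to make the sketch runnable:
    # it answers "sq" (Albanian) when the text contains the letter "ë".
    return "sq" if "ë" in text else "fr"


short_tweet = "mdrrr j'en ai marre de lui"
long_tweet = ("ministri mirëpret nënshkrimin e traktatit të miqësisë "
              "mes maqedonisë dhe bullgarisë")

print(reliable_label(short_tweet, toy_detect))  # None
print(reliable_label(long_tweet, toy_detect))   # sq
```

A character threshold trades coverage for precision: short tweets are discarded rather than risk a wrong label.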

English density

N = 20,180,940 tweets
Slides are on my homepage at https://cc.oulu.fi/~scoats if you want to check out the interactive elements!

Languages over time

Entropy

\[ H_{user}^{\prime} = -\sum _{i=1}^{n}{p_{i}\log_{2} ({p_{i}})} \]
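
As a rough illustration of the formula above, the sketch below computes the entropy of one user's tweets in bits. It assumes that \(p_{i}\) is the share of the user's tweets detected as language \(i\). The function name `user_entropy` and the sample users are hypothetical.

```python
import math
from collections import Counter


def user_entropy(tweet_languages):
    """Shannon entropy (in bits) of the language labels of one user's tweets."""
    counts = Counter(tweet_languages)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total              # p_i: share of tweets in language i
        entropy -= p * math.log2(p)
    return entropy


# Hypothetical users: one monolingual, one perfectly balanced bilingual.
monolingual = ["fi"] * 10
balanced = ["fi"] * 5 + ["en"] * 5

print(user_entropy(monolingual))  # 0.0
print(user_entropy(balanced))     # 1.0
```

A user who tweets in only one language has entropy 0. A user who splits tweets evenly between two languages has entropy 1 bit.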

Entropy by user

Mean entropy

\[ H_{country/language}^{\prime} = \frac{1}{N}\sum_{j=1}^{N} H_{user_{j}}^{\prime} \]
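
A minimal sketch of the averaging step, assuming the mean is taken over the per-user entropy values of all \(N\) users assigned to a country or language. The slides do not show how users are assigned to groups, so the `(group, entropy)` pairs and the function name `mean_entropy_by_group` are hypothetical.

```python
from collections import defaultdict


def mean_entropy_by_group(users):
    """Average per-user entropy within each group.

    users -- iterable of (group, entropy) pairs, where group is a country
             or language code and entropy is that user's entropy in bits
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, entropy in users:
        totals[group] += entropy
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical sample: two users in Finland, one in Italy.
sample = [("FI", 1.0), ("FI", 0.0), ("IT", 0.5)]
print(mean_entropy_by_group(sample))  # {'FI': 0.5, 'IT': 0.5}
```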

Mean entropy by country

Mean entropy by language

Creating a language network

\[
\begin{array}{l|cc|c}
 & language_{i} & \neg language_{i} & \\ \hline
language_{j} & O_{11} & O_{12} & = R_{1} \\
\neg language_{j} & O_{21} & O_{22} & = R_{2} \\ \hline
 & = C_{1} & = C_{2} & = N
\end{array}
\]
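
The cell counts of the table above could be obtained along the following lines. The sketch assumes that the observations are users and that a user counts as using a language if at least one of their tweets was detected in it. The slides do not spell out either choice. The function name `contingency` and the sample data are hypothetical.

```python
def contingency(user_languages, lang_i, lang_j):
    """Return (O11, O12, O21, O22) for a pair of languages.

    user_languages -- iterable of sets, one set of language codes per user
    Rows follow language_j, columns follow language_i, as in the table.
    """
    o11 = o12 = o21 = o22 = 0
    for langs in user_languages:
        has_i = lang_i in langs
        has_j = lang_j in langs
        if has_j and has_i:
            o11 += 1      # uses both languages
        elif has_j:
            o12 += 1      # uses language_j but not language_i
        elif has_i:
            o21 += 1      # uses language_i but not language_j
        else:
            o22 += 1      # uses neither
    return o11, o12, o21, o22


# Hypothetical users and the languages found in their tweets.
users = [{"en", "fi"}, {"en", "fi"}, {"fi"}, {"en"}, {"de"}]
print(contingency(users, "en", "fi"))  # (2, 1, 1, 1)
```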

\[ \phi_{ij}= \frac{(O_{11}O_{22}-O_{12}O_{21})}{\sqrt{R_{1}R_{2}C_{1}C_{2}}} \]
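
A small sketch of the \(\phi\) formula above, taking the four observed cell counts as input. The function name `phi_coefficient` and the decision to return NaN when a marginal total is zero are assumptions made for illustration.

```python
import math


def phi_coefficient(o11, o12, o21, o22):
    """Phi coefficient of a 2x2 contingency table."""
    r1, r2 = o11 + o12, o21 + o22      # row totals
    c1, c2 = o11 + o21, o12 + o22      # column totals
    denominator = math.sqrt(r1 * r2 * c1 * c2)
    if denominator == 0:
        # Undefined when any marginal total is zero (assumed handling).
        return float("nan")
    return (o11 * o22 - o12 * o21) / denominator


print(phi_coefficient(10, 0, 0, 10))  # 1.0
print(phi_coefficient(0, 10, 10, 0))  # -1.0
print(phi_coefficient(5, 5, 5, 5))    # 0.0
```

Positive values mean the two languages co-occur more often than expected under independence, and negative values mean less often. In a network these values could serve as edge weights between language nodes.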

Network visualization

Preliminary summary and conclusions, outlook

Acknowledgements

Thanks to

References

Coats, S. 2016. Grammatical feature frequencies of English on Twitter in Finland. In L. Squires (ed.), English in computer-mediated communication: Variation, representation, and change. Berlin: De Gruyter. 179–210.

European Commission. 2012. Europeans and their languages: Special Eurobarometer 386. http://ec.europa.eu/public_opinion/archives/ebs/ebs_386_sum_en.pdf

Haugen, E. 1972. The ecology of language. In E. Haugen and A. Dil (eds.), The Ecology of Language. Stanford: Stanford University Press. 325–339.

Lee, C. 2016. Multilingual resources and practices in digital communication. In A. Georgakopoulou and T. Spilioti (eds.), The Routledge handbook of language and digital communication, 118–132.

Leetaru, K., S. Wang, G. Cao, A. Padmanabhan and E. Shook. 2013. Mapping the global Twitter heartbeat: The geography of Twitter. First Monday 18(5–6). http://firstmonday.org/article/view/4366/3654

Leppänen, S. et al. 2011. National Survey on the English Language in Finland: Uses, meanings and attitudes (= Studies in Variation, Contacts and Change in English, Volume 5). Helsinki: Varieng.

Lui, M. and T. Baldwin. 2012. Langid.py: An off-the-shelf language identification tool. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 25–30. Stroudsburg, PA: ACL. http://dl.acm.org/citation.cfm?id=2390475

Mocanu, D., A. Baronchelli, N. Perra, B. Gonçalves, Q. Zhang and A. Vespignani. 2013. The Twitter of Babel: Mapping world languages through microblogging platforms. PLoS ONE 8(4). http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0061981

Roesslein, J. 2015. Tweepy. Python package [Computer software]. http://www.tweepy.org

Shannon, C. E. 1948. A mathematical theory of communication. Bell System Technical Journal 27, 379–423 and 623–656.

Sheskin, D. 2000. Handbook of parametric and non-parametric statistical procedures, 2nd ed. Boca Raton: Chapman and Hall.

Sites, D. 2013a. Compact language detector 2. https://github.com/CLD2Owners/cld2

Sites, D. 2013b. Language on the Web. Working paper. https://docs.google.com/document/d/14jBa2KmFMCqHGLnUR8k7Lj7K2s1vE6_yIG-3aXLdhUM/edit