Multilingual Clusters and Gender in Nordic Twitter

Steven Coats
English Philology, Faculty of Humanities
steven.coats@oulu.fi

CLARIN-PLUS Workshop “Creation and Use of Social Media Resources”
May 19th, 2017

Goals of research

  1. Explore extent of language use by gender on Twitter for the Nordic countries
  2. Investigate similarities or differences by gender among bi- and multilingual Nordic Twitter users

Data collection | Streaming API and gender disambiguation

\[ P(X=x_{m}) = \frac{\sum_{\omega: X(\omega)=x_m} 1}{\sum_{\omega: X(\omega)=x_m} 1+\sum_{\omega: X(\omega)=x_f} 1} \]
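
One way to read the ratio above is as the share of bearers of a given first name who are male, taken from a reference list of names. The sketch below illustrates that reading only. The name counts, the `assign_gender` helper and the 0.9 threshold are hypothetical and are not taken from the study.

```python
def male_probability(male_count, female_count):
    """Share of bearers of a name who are male: m / (m + f)."""
    total = male_count + female_count
    if total == 0:
        raise ValueError("name does not occur in the reference counts")
    return male_count / total


def assign_gender(name, counts, threshold=0.9):
    """Return 'm' or 'f' if the name is sufficiently unambiguous, else None.

    The threshold is an assumption made for this illustration.
    """
    male_count, female_count = counts[name]
    p_male = male_probability(male_count, female_count)
    if p_male >= threshold:
        return "m"
    if 1 - p_male >= threshold:
        return "f"
    return None


# Hypothetical (male, female) counts per first name, for illustration only.
name_counts = {"mikko": (950, 0), "kim": (30, 70)}

for name in name_counts:
    print(name, assign_gender(name, name_counts))
```

Names that fall below the threshold for both genders are left unassigned rather than guessed.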

Data collection | REST API and source filtering

Data collection | Language detection

Tweet density by language

Map polygons from Natural Earth, maps from OpenStreetMap

Use of language by country and gender

Use of language by country and gender: Iceland

Use of language by country and gender: Norway

Use of language by country and gender: Denmark

Use of language by country and gender: Sweden

Use of language by country and gender: Finland

Summary of language use by gender

Quantifying bi- and multilingualism

Creating networks of bi- and multilinguals

\[
\begin{array}{l|cc|c}
 & language_{i} & \sim language_{i} & \\
\hline
language_{j} & O_{11} & O_{12} & = R_{1} \\
\sim language_{j} & O_{21} & O_{22} & = R_{2} \\
\hline
 & = C_{1} & = C_{2} & = N
\end{array}
\]

\[ \phi_{ij}= \frac{(O_{11}O_{22}-O_{12}O_{21})}{\sqrt{R_{1}R_{2}C_{1}C_{2}}} \]
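
As a minimal sketch of the computation above, the following fills the 2×2 table for one language pair and derives \(\phi_{ij}\) from it. It assumes each user is represented by the set of languages detected in their tweets. The `contingency` helper and the sample users are hypothetical.

```python
from math import sqrt


def contingency(users, lang_i, lang_j):
    """Count users by whether they use lang_i and/or lang_j.

    Returns (O11, O12, O21, O22): rows are lang_j / not lang_j,
    columns are lang_i / not lang_i.
    """
    o11 = o12 = o21 = o22 = 0
    for langs in users:
        has_i, has_j = lang_i in langs, lang_j in langs
        if has_j and has_i:
            o11 += 1
        elif has_j:
            o12 += 1
        elif has_i:
            o21 += 1
        else:
            o22 += 1
    return o11, o12, o21, o22


def phi(o11, o12, o21, o22):
    """Phi coefficient of a 2x2 table from its cell counts."""
    r1, r2 = o11 + o12, o21 + o22  # row totals
    c1, c2 = o11 + o21, o12 + o22  # column totals
    denom = sqrt(r1 * r2 * c1 * c2)
    if denom == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (o11 * o22 - o12 * o21) / denom


# Hypothetical users, each given as the set of languages they tweet in.
users = [{"fi", "en"}, {"fi", "en"}, {"fi"}, {"sv"}, {"sv", "en"}]
table = contingency(users, "fi", "en")
print(table, phi(*table))
```

Computing \(\phi_{ij}\) for every language pair gives the edge weights from which a network of languages can be built.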

Bilinguals’ use of language by country and gender: Iceland

Bilinguals’ use of language by country and gender: Norway

Bilinguals’ use of language by country and gender: Denmark

Bilinguals’ use of language by country and gender: Sweden

Bilinguals’ use of language by country and gender: Finland

Clusters for all Nordic males and females

Significance of gender differences

\[ z_{(i,j)} = \frac{z_{\phi_{(i,j)_{m}}}-z_{\phi_{(i,j)_{f}}}}{\sqrt{\frac{1}{n_{(i,j)_{m}}-3}+{\frac{1}{n_{(i,j)_{f}}-3}}}} \]

(Sheskin 2000: 792)

For languages \((i,j)\):

\(z_{\phi_{(i,j)_{m}}}\) is the Fisher-transformed value of \(\phi\) for males

\(z_{\phi_{(i,j)_{f}}}\) is the Fisher-transformed value of \(\phi\) for females

\(n_{(i,j)_{m}}\) and \(n_{(i,j)_{f}}\) are the numbers of bilingual speakers in the male and female networks.
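
The test statistic can be sketched as follows, using the Fisher transformation \(z_{\phi} = \operatorname{artanh}(\phi)\). The two-sided p-value step assumes that \(z_{(i,j)}\) is compared against a standard normal distribution. The function names and input values are hypothetical.

```python
from math import atanh, erfc, sqrt


def z_difference(phi_m, n_m, phi_f, n_f):
    """z statistic for the difference between two phi coefficients."""
    if n_m <= 3 or n_f <= 3:
        raise ValueError("each group needs more than 3 observations")
    standard_error = sqrt(1 / (n_m - 3) + 1 / (n_f - 3))
    # atanh is the Fisher transformation of a correlation-type coefficient.
    return (atanh(phi_m) - atanh(phi_f)) / standard_error


def two_sided_p(z):
    """Two-sided p-value of z under a standard normal distribution."""
    return erfc(abs(z) / sqrt(2))


# Hypothetical phi values and group sizes for one language pair.
z = z_difference(phi_m=0.5, n_m=103, phi_f=0.0, n_f=103)
print(z, two_sided_p(z))
```

A positive \(z_{(i,j)}\) indicates a stronger association between the two languages in the male network, a negative value a stronger association in the female network.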

Summary of gender differences

Preliminary summary and conclusions

Acknowledgements

Thanks to

References

Argamon, S., M. Koppel, J. W. Pennebaker and J. Schler. 2007. Mining the blogosphere: Age, gender, and the varieties of self-expression. First Monday 12(9). http://pear.accc.uic.edu/ojs/index.php/fm/article/view/2003/1878

Bamman, D., J. Eisenstein and T. Schnoebelen. 2014. Gender identity and lexical variation in social media. Journal of Sociolinguistics 18(2), 135–160. http://onlinelibrary.wiley.com/doi/10.1111/josl.12080/full

Coats, S. 2016. Grammatical feature frequencies of English on Twitter in Finland. In L. Squires (ed.), English in computer-mediated communication: Variation, representation, and change. Berlin: De Gruyter. 179–210.

Labov, W. 1990. The intersection of sex and social class in the course of linguistic change. Language Variation and Change 2, 205–254.

Lui, M. and T. Baldwin. 2012. langid.py: An off-the-shelf language identification tool. Proceedings of the ACL 2012 System Demonstrations, 25–30. Stroudsburg, PA: ACL. http://dl.acm.org/citation.cfm?id=2390475

Roesslein, J. 2015. Tweepy. Python package [Computer software]. http://www.tweepy.org

Sheskin, D. 2000. Handbook of parametric and non-parametric statistical procedures, 2nd ed. Boca Raton: Chapman and Hall.

Sites, D. 2013. Compact language detector 2. https://github.com/CLD2Owners/cld2