Steven Coats

steven.coats (at) oulu.fi
University Lecturer, English Philology, University of Oulu, Finland



I'm a linguist interested in corpus linguistics, language variation, online language and social media, and computational approaches to language analysis, among other topics.

My research background is mostly in dialectology, sociolinguistics, and digital humanities. I've created the Corpus of North American Spoken English (CoNASE), the Corpus of British Isles Spoken English (CoBISE), the Corpus of Australian and New Zealand Spoken English (CoANZSE), and the Corpus of German Speech (CoGS). These are large corpora of geolocated YouTube transcripts. CoANZSE Audio is the searchable online version of CoANZSE; in addition to speech transcripts, it contains audio and alignments.

Professional experience

Publications

  1. Coats, Steven. (2024). Language of social media and online communication in Germanic. In Sebastian Kürschner and Antje Dammel (eds.), The Oxford Encyclopedia of Germanic Linguistics. Link
  2. Coats, Steven and Cameron Morin. (2024). Double modals beyond the Atlantic: New evidence from computational sociolinguistics. English Today. Link
  3. Coats, Steven. (2024). A development outlook for CLARIN’s northernmost center. In Vincent Vandeghinste and Thalassia Kontino (eds.), CLARIN Annual Conference Proceedings 2024, 85–89. Link
  4. Coats, Steven. (2024). Commenting on local politics: An analysis of YouTube video comments for local government videos. Research in Corpus Linguistics 13(1), 1–25. Link
  5. Coats, Steven. (2024). A framework for analysis of speech and chat content in YouTube and Twitch streams. In Céline Poudat and Mathilde Guernut (eds.), Proceedings of the 11th Conference on CMC and Social Media Corpora for the Humanities, 16–19. Nice, France: CORLI. Link Article
  6. Coats, Steven. (2024). Building a searchable online corpus of Australian and New Zealand aligned speech. Australian Journal of Linguistics. Link
  7. Coats, Steven. (2024). CoANZSE Audio: Creation of an online corpus for linguistic and phonetic analysis of Australian and New Zealand Englishes. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue (eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 3407–3412. Link
  8. Coats, Steven. (2024). Naturalistic double modals in North America. American Speech 99(1), 47–77. Link Article
  9. Coats, Steven and Veronika Laippala, eds. (2024). Linguistics across disciplinary borders: The march of data. London: Bloomsbury Academic. Link
  10. Coats, Steven and Veronika Laippala. (2024). Introduction. In Steven Coats and Veronika Laippala (eds.), Linguistics across disciplinary borders: The march of data, 1–16. London: Bloomsbury Academic. Link
  11. Coats, Steven. (2024). Noisy data: Using automatic speech recognition transcripts for linguistic research. In Steven Coats and Veronika Laippala (eds.), Linguistics across disciplinary borders: The march of data, 17–39. London: Bloomsbury Academic. Link
  12. Méli, Adrien, Steven Coats and Nicolas Ballier. (2023). Methods for phonetic scraping of Youtube videos. In Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023), 244–249. Link
  13. Coats, Steven. (2023). Raumgeographische Verteilung von Twitter-Hashtags im deutschen Sprachraum. Neuphilologische Mitteilungen 114(2), 97–126. Link
  14. Morin, Cameron and Steven Coats. (2023). Double modals in Australian and New Zealand English. World Englishes. Link
  15. Coats, Steven. (2023). A new corpus of geolocated ASR transcripts from Germany. Language Resources and Evaluation. Link
  16. Coats, Steven. (2023). A pipeline for the large-scale acoustic analysis of streamed content. In Louis Cotgrove, Laura Herzberg, Harald Lüngen, and Ines Pisetta (eds.), Proceedings of the 10th International Conference on CMC and Social Media Corpora for the Humanities (CMC-Corpora 2023), 51–54. Mannheim: Leibniz-Institut für Deutsche Sprache. Link Article
  17. Coats, Steven. (2023). Double modals in contemporary British and Irish speech. English Language and Linguistics 27(4), 693–718. Link Article
  18. Coats, Steven. (2023). Dialect corpora from YouTube. In Beatrix Busse, Nina Dumrukcic, and Ingo Kleiber (eds.), Language and linguistics in a complex world, 79–102. Berlin: de Gruyter. Link
  19. Coats, Steven. (2022). The Corpus of Australian and New Zealand Spoken English: A new resource of naturalistic speech transcripts. In Pradeesh Parameswaran, Jennifer Biggs, and David Powers (eds.), Proceedings of the 20th Annual Workshop of the Australasian Language Technology Association, 1–5. Adelaide, Australia: Australasian Language Technology Association. (Awarded "best short paper") Link
  20. Coats, Steven. (2022). A database of North American double modals and self-repairs from YouTube. Psychology of Language and Communication 26, 273–296. Link
  21. Coats, Steven. (2022). The Corpus of British Isles Spoken English (CoBISE): A new resource of contemporary British and Irish speech. In Karl Berglund, Matti La Mela, and Inge Zwart (eds.), Proceedings of the 6th Digital Humanities in the Nordic and Baltic Countries Conference, Uppsala, Sweden, March 15–18, 2022, 187–194. Aachen, Germany: CEUR. Link
  22. Coats, Steven. (2021). ZipfExplorer: A Tool for the Comparison of Shared Lexis. In Sanita Reinsone, Inguna Skadiņa, Anda Baklāne, and Jānis Daugavietis (eds.), Post-Proceedings of the 5th Digital Humanities in the Nordic Countries Conference, Riga, Latvia, October 21–23, 2020, 145–155. Aachen, Germany: CEUR. Article Tool
  23. Coats, Steven. (2021). 'Bad language' in the Nordics: profanity and gender in a social media corpus. Acta Linguistica Hafniensia 53(1), 22–57. Link Article
  24. Coats, Steven. (2020). Comparing word frequencies and lexical diversity with the ZipfExplorer tool. In Sanita Reinsone, Inguna Skadiņa, Anda Baklāne, and Jānis Daugavietis (eds.), Proceedings of the 5th Digital Humanities in the Nordic Countries Conference, Riga, Latvia, October 21–23, 2020, 219–225. Aachen, Germany: CEUR. Article Tool
  25. Coats, Steven. (2020). Anglicism diversity in hyphenated German compounds. In Julien Longhi and Claudia Marinica (eds.), CMC Corpora through the Prism of Digital Humanities, 75–92. Paris: L'Harmattan. Link
  26. Coats, Steven. (2020). Articulation rate in American English in a corpus of YouTube videos. Language and Speech 63(4), 799–831. Link Article
  27. Coats, Steven and Adrien Barbaresi. (2019). Productivity of anglicism bases in hyphenated German compounds. In Julien Longhi and Claudia Marinica (eds.), Proceedings of the 7th Conference on CMC and Social Media Corpora for the Humanities, 53–58. Cergy, France: Cergy-Pontoise University. Link Article
  28. Coats, Steven. (2019). Lexicon geupdated: New German anglicisms in a social media corpus. European Journal of Applied Linguistics 7(2), 255–280. Link Article
  29. Coats, Steven. (2019). Language choice and gender in a Nordic social media corpus. Nordic Journal of Linguistics 42(1), 31–55. Link Article
  30. Coats, Steven. (2019). Online language ecology: Twitter in Europe. In Egon Stemle and Ciara Wigham (eds.), Building computer-mediated communication corpora for sociolinguistic analysis, 73–96. Clermont-Ferrand: Presses universitaires Blaise Pascal. Link Article
  31. Coats, Steven. (2019). A Corpus of regional American language from YouTube. In Costanza Navarretta et al. (eds.), Proceedings of the 4th Digital Humanities in the Nordic Countries Conference, Copenhagen, Denmark, March 6–8, 2019, 79–91. Aachen, Germany: CEUR. Article
  32. Coats, Steven. (2018). Variation of new German verbal Anglicisms in a social media corpus. In Reinhild Vandekerckhove, Darja Fišer, and Lisa Hilte (eds.), Proceedings of the 6th conference on CMC and social media corpora for the humanities, 27–32. Antwerp, Belgium: University of Antwerp. Link Article Data
  33. Coats, Steven. (2018). Skin tone emoji and sentiment on Twitter. In Eetu Mäkelä and Mikko Tolonen (eds.), Proceedings of the 3rd Digital Humanities in the Nordic Countries Conference, Helsinki, Finland, March 7–9, 2018, 122–138. Aachen, Germany: CEUR. Link Article
  34. Coats, Steven. (2018). Collecting Twitter data. In Christine Mallinson, Becky Childs, and Gerard Van Herk (eds.), Data collection in sociolinguistics: Methods and applications (2nd Ed.), 248–251. London/New York: Routledge. Link
  35. Coats, Steven. (2017). Gender and lexical type frequencies in Finland Twitter English. In Turo Hiltunen, Joe McVeigh, and Tanja Säily (eds.), Big and rich data in English corpus linguistics: Methods and explorations (= Studies in Variation, Contacts and Change in English 19). Helsinki, Finland: Varieng. Link
  36. Coats, Steven. (2017). Gender and grammatical frequencies in social media English from the Nordic countries. In Darja Fišer and Michael Beißwenger (eds.), Investigating social media corpora, 102–121. Ljubljana, Slovenia: U. of Ljubljana Academic Publishing. Link
  37. Coats, Steven. (2017). European language ecology and bilingualism with English on Twitter. In Egon Stemle and Ciara Wigham (eds.), Proceedings of the 5th conference on CMC and social media corpora for the humanities, 35–38. Bozen/Bolzano: Eurac Research. Article
  38. Coats, Steven. (2016). Grammatical feature frequencies of English on Twitter in Finland. In Lauren Squires (ed.), English in computer-mediated communication: Variation, representation, and change, 179–210. Boston/Berlin: de Gruyter Mouton. Link Article
  39. Coats, Steven. (2016). Grammatical frequencies and gender in Nordic Twitter Englishes. In Darja Fišer and Michael Beißwenger (eds.), Proceedings of the 4th conference on CMC and social media corpora for the humanities, 12–16. Ljubljana: U. of Ljubljana Academic Publishing. Article
  40. Kretzschmar, William A. Jr., Paulina Bounds, Jacqueline Hettel, Steven Coats, Lee Pederson, Lisa-Lena Opas-Hänninen, Ilkka Juuso, and Tapio Seppänen. (2012). Digital Archive of Southern Speech. Philadelphia, PA: Linguistic Data Consortium. Link

Presentations

  1. A development outlook for CLARIN’s northernmost center. Presentation at the CLARIN Annual Conference 2024, Barcelona, Spain, 17 October 2024. Slides
  2. Regional variation in monophthongs in Australian and New Zealand Englishes: A big data approach. Presentation at the 10th Biennial Conference for the Linguistics of Contemporary English, Alicante, Spain, 27 September 2024. Slides
  3. A framework for analysis of speech and chat content in YouTube and Twitch streams. Presentation at the 11th International Conference on CMC and Social Media Corpora for the Humanities, Nice, France, 5 September 2024. Slides Code
  4. Analysis of online discourse from text to multimedia. Keynote presentation for the University of Eastern Finland SCE Program, 20 May 2024. Slides
  5. YouTube Phonetics Pipeline Workshop. Workshop for the ALOES pre-conference event, Paris, France, 28 March 2024. Slides Code
  6. A pipeline for the large-scale acoustic analysis of streamed content. Presentation at the 10th International Conference on CMC and Social Media Corpora for the Humanities, Mannheim, Germany, 15 September 2023. Slides
  7. The Corpus of Australian and New Zealand Spoken English (CoANZSE). Virtual presentation at the Workshop on Language Corpora in Australia, Canberra, Australia, 3 July 2023. Slides
  8. Corpora for the study of multimodal variation in English: Acoustic analysis from CoNASE. Presentation at Kielitieteen Päivät, Oulu, Finland, 25 May 2023. Slides
  9. Corpora of automatic speech recognition transcripts for the study of variation in English: Syntactic and phonetic perspectives. Presentation at PAC2023, Nanterre, France, 12 April 2023. Slides
  10. CoANZSE: The Corpus of Australian and New Zealand Spoken English: A new resource of naturalistic speech transcripts. Virtual presentation at the 20th Annual Workshop of the Australasian Language Technology Association, Adelaide, Australia, 16 December 2022. Slides
  11. Civic engagement with local government videos: Comparing YouTube transcripts with user comments. Presentation at the 9th Conference on CMC and Social Media Corpora for the Humanities, Santiago de Compostela, Spain, 29 September 2022. Slides
  12. CoANZSE: The Corpus of Australian and New Zealand Spoken English. Computational Thinking in the Humanities Online Workshop, Brisbane, Australia, 1 September 2022. Slides
  13. Double modals in YouTube videos from North America and the British Isles. Presentation at CoCorDial Workshop, Helsinki, Finland, 27 April 2022. Slides
  14. The Corpus of British Isles Spoken English (CoBISE): A new resource of contemporary British and Irish speech. Virtual presentation at DHNB 2022, Uppsala, Sweden, 17 March 2022. Slides
  15. Scraping online dictionaries for usage annotations. Virtual presentation at the 7th SwiSca Symposium, Reykjavík, Iceland, 2 December 2021. Slides
  16. A database of North American multiple modals from YouTube. Presentation at the 8th Conference on CMC and Social Media Corpora for the Humanities, Nijmegen, the Netherlands, 29 October 2021. Slides
  17. Multiple modals in the wild: A study of 24,530 multiple modal sequences in naturalistic North American speech. Virtual presentation for the workshop "The March of Data" at the Sixth International Society for the Linguistics of English Conference, Joensuu, Finland, 2 June 2021. Slides
  18. Comparing word frequencies and lexical diversity with the ZipfExplorer tool. Virtual presentation at the Fifth Digital Humanities in the Nordic Countries Conference, National Library of Latvia, Riga, Latvia, 23 October 2020. Slides
  19. Dialect corpora from YouTube. Virtual presentation at ICAME 41, University of Heidelberg, Germany, 20–24 May 2020. Slides Video
  20. Steven Coats and Adrien Barbaresi. Productivity of anglicism bases in hyphenated German compounds. Presentation at the 7th Conference on CMC and Social Media Corpora for the Humanities, Cergy, France, 10 September 2019. Slides
  21. Regional variation in speech rate in American English from YouTube videos. Presentation at Research Data and Humanities Conference, University of Oulu, Finland, 14 August 2019, and 9th Conference of the Finnish Society for the Study of English, University of Tampere, Finland, 15 August 2019. Slides
  22. Swearing on Twitter: Harvesting and visualizing data. Slides for workshop at the 6th SwiSca Symposium, Södertörn University, Sweden, 23 May 2019. Slides
  23. A Corpus of regional American language from YouTube. Presentation at the Fourth Digital Humanities in the Nordic Countries Conference, University of Copenhagen, Denmark, 8 March 2019. Slides Article
  24. Variation of new German verbal Anglicisms in a social media corpus. Presentation at the 6th Conference on CMC and Social Media Corpora for the Humanities, Antwerp, Belgium, 17 September 2018. Slides Slides (German version)
  25. Exploring code-switching and borrowing using word vectors. Presentation at the 14th Conference of the European Society for the Study of English, Brno, Czechia, 1 September 2018. Slides
  26. William A. Kretzschmar, Jr. and Steven Coats. Fractal visualization of corpus data. ICAME 39, University of Tampere, Finland, 30 May 2018.
  27. Skin tone emoji and sentiment on Twitter. Presentation at the Third Digital Humanities in the Nordic Countries Conference, University of Helsinki, Finland, 7 March 2018. Slides Article
  28. Profanity in the Nordics on Twitter. Invited presentation at the Higher Seminar Series, Södertörn University, Sweden, 18 January 2018. Slides
  29. Profanity in the Nordics on Twitter. Presentation at the 5th SwiSca Symposium "What the HEL", University of Helsinki, Finland, 23 November 2017. Slides
  30. European language ecology and bilingualism with English on Twitter. Presentation at the 5th Conference on CMC and Social Media Corpora for the Humanities, EURAC Research, Bozen/Bolzano, Italy, 3 October 2017. Slides Article
  31. Multilingual clusters and gender in Nordic Twitter. Presentation at the CLARIN-PLUS Workshop "Creation and Use of Social Media Resources", Vytautas Magnus University, Kaunas, Lithuania, 19 May 2017. Slides
  32. Multilingual clusters and gender in Nordic Twitter. Presentation at the Second Digital Humanities in the Nordic Countries Conference, University of Gothenburg, Sweden, 14 March 2017. Slides
  33. Grammatical frequencies and gender in Nordic Twitter Englishes. Presentation at the 4th Conference on CMC and Social Media Corpora for the Humanities, University of Ljubljana, Slovenia, 27 September 2016. Slides Article
  34. Nordic Englishes on Twitter. Presentation at Digital Humanities in the Nordic Countries Conference, University of Oslo, Norway, 16 March 2016. Slides
  35. Gender and grammatical type frequencies in Finland Twitter English. Presentation at Interrelating Distance and Interaction Workshop, University of Oulu, Finland, 3 November 2015.
  36. Gender and lexical type frequencies in Finland Twitter English. Presentation at From data to evidence: Big data, rich data, uncharted data, University of Helsinki, Finland, 22 October 2015. Slides
  37. Non-standard lexical and grammatical resources in Finland Twitter English. Presentation at the 45th Poznań Linguistic Meeting, Poznań, Poland, 19 September 2015. Slides
  38. English-language social media in Finland: Twitter data collection and analysis. Presentation at the 12th Conference of the European Society for the Study of English, Košice, Slovakia, 31 August 2014. Slides
  39. Web corpora for discourse analysis: The language of travel and tourism. Presentation at the 79th Southeastern Conference on Linguistics, University of Kentucky, 13 April 2012.
  40. Lisa Lena Opas-Hänninen, Ilkka Juuso, William A. Kretzschmar, Jr., Tapio Seppänen, Steven Coats. The Digital Archive of Southern Speech. Helsinki Corpus Festival, 1 October 2011.
  41. Constituting mental maps: A corpus linguistics-based approach to perceptual geography in the GDR. Presentation at the 77th Southeastern Conference on Linguistics, University of Mississippi, 29 April 2010.
  42. William A. Kretzschmar, Jr., Paulina Bounds, Steven Coats, Tony Snodgrass, Lisa Lena Opas-Hänninen, Tapio Seppänen, and Ilkka Juuso. The Digital Archive of Southern Speech. 76th Southeastern Conference on Linguistics, Tulane University, 9 April 2009.

Education

Teaching

I teach courses on Academic Communication, Sociolinguistics, Digital Humanities, and North American Studies. If you are a student in one of my courses, go to Moodle for course information and materials.

I was one of the group leaders at the Helsinki Digital Humanities Hackathon for the theme "Brexit in Transnational Social Media" in May 2019.

Professional activities

Visualizations

You can find a map of the semantic similarity of emoji types here. A representation of the links between languages for a sample of bilingual European Twitter users is available here. Check out some maps of variation in articulation rate in American English here. A table with more than 1,000 authentic double modals (with links to the videos at the time of utterance), as well as about 1,000 two-modal sequences that are instances of "self-repair", is here. The ZipfExplorer is a tool for the visualization of word frequency differences in texts.

My GitHub is here.