Steven Coats

steven.coats (at) oulu.fi
University Lecturer, English Philology, University of Oulu, Finland

I'm a linguist interested in corpus linguistics, language variation, online language and social media, and computational approaches to language analysis, among other topics.

My research background is mostly in dialectology, sociolinguistics, and digital humanities. I've created the Corpus of North American Spoken English (CoNASE), the Corpus of British Isles Spoken English (CoBISE), the Corpus of Australian and New Zealand Spoken English (CoANZSE), and the Corpus of German Speech (CoGS). These are large corpora of geolocated YouTube transcripts. CoANZSE Audio is the searchable online version of CoANZSE; it contains audio and alignments in addition to speech transcripts. Two recent corpus projects are the YouTube Corpus of Singapore English Podcasts and the Corpus of Recorded Investigative, Media, and Evidence-based Proceedings.

Professional experience

University Lecturer, University of Oulu, Finland, 2015 –
University Teacher, University of Oulu, Finland, 2012 – 2015
Research Assistant, Linguistic Atlas Project, University of Georgia, USA, 2008 – 2010

Publications

Coats, Steven. (2026). A fine-tuned ASR model for historical American dialect recordings. In Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek and Antonio Toral (eds.), Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), 1372–1381. Link Article
Morin, Cameron, Steven Coats, and Jonathan Dunn. (2026). How register and region shape the language network: Evidence from Computational Construction Grammar. Constructions 18(1). Link
Coats, Steven and Dana Roemling. (2025). The Corpus of Recorded Investigative, Media, and Evidence-based Proceedings. In Annamária Fábián and Igor Trost (eds.), Impulses and Approaches to Computer-Mediated Communication: Proceedings of the 12th International Conference on Computer Mediated Communication and Social Media Corpora for the Humanities, 45–49. Bayreuth, Germany: University of Bayreuth. Link Article Website
Coats, Steven, Carmelo Alessandro Basile, Cameron Morin and Robert Fuchs. (2025). The YouTube Corpus of Singapore English Podcasts. English World-Wide 46(3), 274–298. Article Website
Coats, Steven. (2025). An automatic pipeline for processing streamed content: New horizons for corpus linguistics and phonetics. In Louis Cotgrove, Laura Herzberg, and Harald Lüngen (eds.), Exploring digitally-mediated communication with corpora: Methods, analyses, and corpus construction, 257–274. Berlin: De Gruyter Brill. Link Article
Morin, Cameron and Steven Coats. (2025). Double modals in Australian and New Zealand English. World Englishes 44(3), 415–438. Link
Coats, Steven. (2025). 'What the X' in Anglophone government meetings: Areal distribution, emotionality, and euphemism. Lingua 321. Link
Coats, Steven, Chloé Diskin-Holdaway, and Debbie Loakes. (2025). Regional distribution of the /el/-/æl/ merger in Australian English. In Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, and Marcos Zampieri (eds.), Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects. Article
Coats, Steven. (2024). Language of social media and online communication in Germanic. In Sebastian Kürschner and Antje Dammel (eds.), The Oxford Encyclopedia of Germanic Linguistics. Link
Coats, Steven and Cameron Morin. (2024). Double modals beyond the Atlantic: New evidence from computational sociolinguistics. English Today. Link
Coats, Steven. (2024). A development outlook for CLARIN’s northernmost center. In Vincent Vandeghinste and Thalassia Kontino (eds.), CLARIN Annual Conference Proceedings 2024, 85–89. Link
Coats, Steven. (2024). Commenting on local politics: An analysis of YouTube video comments for local government videos. Research in Corpus Linguistics 13(1), 1–25. Link
Coats, Steven. (2024). A framework for analysis of speech and chat content in YouTube and Twitch streams. In Céline Poudat and Mathilde Guernut (eds.), Proceedings of the 11th Conference on CMC and Social Media Corpora for the Humanities, 16–19. Nice, France: CORLI. Link Article
Coats, Steven. (2024). Building a searchable online corpus of Australian and New Zealand aligned speech. Australian Journal of Linguistics. Link
Coats, Steven. (2024). CoANZSE Audio: Creation of an online corpus for linguistic and phonetic analysis of Australian and New Zealand Englishes. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue (eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 3407–3412. Link
Coats, Steven. (2024). Naturalistic double modals in North America. American Speech 99(1), 47–77. Link Article
Coats, Steven and Veronika Laippala, eds. (2024). Linguistics across disciplinary borders: The march of data. London: Bloomsbury Academic. Link
Coats, Steven and Veronika Laippala. (2024). Introduction. In Steven Coats and Veronika Laippala (eds.), Linguistics across disciplinary borders: The march of data, 1–16. London: Bloomsbury Academic. Link
Coats, Steven. (2024). Noisy data: Using automatic speech recognition transcripts for linguistic research. In Steven Coats and Veronika Laippala (eds.), Linguistics across disciplinary borders: The march of data, 17–39. London: Bloomsbury Academic. Link
Méli, Adrien, Steven Coats and Nicolas Ballier. (2023). Methods for phonetic scraping of Youtube videos. In Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023), 244–249. Link
Coats, Steven. (2023). Raumgeographische Verteilung von Twitter-Hashtags im deutschen Sprachraum. Neuphilologische Mitteilungen 114(2), 97–126. Link
Coats, Steven. (2023). A new corpus of geolocated ASR transcripts from Germany. Language Resources and Evaluation. Link
Coats, Steven. (2023). A pipeline for the large-scale acoustic analysis of streamed content. In Louis Cotgrove, Laura Herzberg, Harald Lüngen, and Ines Pisetta (eds.), Proceedings of the 10th International Conference on CMC and Social Media Corpora for the Humanities (CMC-Corpora 2023), 51–54. Mannheim: Leibniz-Institut für Deutsche Sprache. Link Article
Coats, Steven. (2023). Double modals in contemporary British and Irish speech. English Language and Linguistics 27(4), 693–718. Link Article
Coats, Steven. (2023). Dialect corpora from YouTube. In Beatrix Busse, Nina Dumrukcic, and Ingo Kleiber (eds.), Language and linguistics in a complex world, 79–102. Berlin: de Gruyter. Link
Coats, Steven. (2022). The Corpus of Australian and New Zealand Spoken English: A new resource of naturalistic speech transcripts. In Pradeesh Parameswaran, Jennifer Biggs, and David Powers (eds.), Proceedings of the 20th Annual Workshop of the Australasian Language Technology Association, 1–5. Adelaide, Australia: Australasian Language Technology Association. (Awarded "best short paper") Link
Coats, Steven. (2022). A database of North American double modals and self-repairs from YouTube. Psychology of Language and Communication 26, 273–296. Link
Coats, Steven. (2022). The Corpus of British Isles Spoken English (CoBISE): A new resource of contemporary British and Irish speech. In Karl Berglund, Matti La Mela, and Inge Zwart (eds.), Proceedings of the 6th Digital Humanities in the Nordic and Baltic Countries Conference, Uppsala, Sweden, March 15–18, 2022, 187–194. Aachen, Germany: CEUR. Link
Coats, Steven. (2021). ZipfExplorer: A Tool for the Comparison of Shared Lexis. In Sanita Reinsone, Inguna Skadiņa, Anda Baklāne, and Jānis Daugavieti (eds.), Post-Proceedings of the 5th Digital Humanities in the Nordic Countries Conference, Riga, Latvia, October 21–23, 2020, 145–155. Aachen, Germany: CEUR. Article Tool
Coats, Steven. (2021). 'Bad language' in the Nordics: profanity and gender in a social media corpus. Acta Linguistica Hafniensia 53(1), 22–57. Link Article
Coats, Steven. (2020). Comparing word frequencies and lexical diversity with the ZipfExplorer tool. In Sanita Reinsone, Inguna Skadiņa, Anda Baklāne, and Jānis Daugavieti (eds.), Proceedings of the 5th Digital Humanities in the Nordic Countries Conference, Riga, Latvia, October 21–23, 2020, 219–225. Aachen, Germany: CEUR. Article Tool
Coats, Steven. (2020). Anglicism diversity in hyphenated German compounds. In Julien Longhi and Claudia Marinica (eds.), CMC Corpora through the Prism of Digital Humanities, 75–92. Paris: L'Harmattan. Link
Coats, Steven. (2020). Articulation rate in American English in a corpus of YouTube videos. Language and Speech 63(4), 799–831. Link Article
Coats, Steven and Adrien Barbaresi. (2019). Productivity of anglicism bases in hyphenated German compounds. In Julien Longhi and Claudia Marinica (eds.), Proceedings of the 7th Conference on CMC and Social Media Corpora for the Humanities, 53–58. Cergy, France: Cergy-Pontoise University. Link Article
Coats, Steven. (2019). Lexicon geupdated: New German anglicisms in a social media corpus. European Journal of Applied Linguistics 7(2), 255–280. Link Article
Coats, Steven. (2019). Language choice and gender in a Nordic social media corpus. Nordic Journal of Linguistics 42(1), 31–55. Link Article
Coats, Steven. (2019). Online language ecology: Twitter in Europe. In Egon Stemle and Ciara Wigham (eds.), Building computer-mediated communication corpora for sociolinguistic analysis, 73–96. Clermont-Ferrand: Presses universitaires Blaise Pascal. Link Article
Coats, Steven. (2019). A Corpus of regional American language from YouTube. In Costanza Navarretta et al. (eds.), Proceedings of the 4th Digital Humanities in the Nordic Countries Conference, Copenhagen, Denmark, March 6–8, 2019, 79–91. Aachen, Germany: CEUR. Article
Coats, Steven. (2018). Variation of new German verbal Anglicisms in a social media corpus. In Reinhild Vandekerckhove, Darja Fišer, and Lisa Hilte (eds.), Proceedings of the 6th conference on CMC and social media corpora for the humanities, 27–32. Antwerp, Belgium: University of Antwerp. Link Article Data
Coats, Steven. (2018). Skin tone emoji and sentiment on Twitter. In Eetu Mäkelä and Mikko Tolonen (eds.), Proceedings of the 3rd Digital Humanities in the Nordic Countries Conference, Helsinki, Finland, March 7–9, 2018, 122–138. Aachen, Germany: CEUR. Link Article
Coats, Steven. (2018). Collecting Twitter data. In Christine Mallinson, Becky Childs, and Gerard Van Herk (eds.), Data collection in sociolinguistics: Methods and applications (2nd Ed.), 248–251. London/New York: Routledge. Link
Coats, Steven. (2017). Gender and lexical type frequencies in Finland Twitter English. In Turo Hiltunen, Joe McVeigh, and Tanja Säily (eds.), Big and rich data in English corpus linguistics: Methods and explorations (= Studies in Variation, Contacts and Change in English 19). Helsinki, Finland: Varieng. Link
Coats, Steven. (2017). Gender and grammatical frequencies in social media English from the Nordic countries. In Darja Fišer and Michael Beißwenger (eds.), Investigating social media corpora, 102–121. Ljubljana, Slovenia: U. of Ljubljana Academic Publishing. Link
Coats, Steven. (2017). European language ecology and bilingualism with English on Twitter. In Egon Stemle and Ciara Wigham (eds.), Proceedings of the 5th conference on CMC and social media corpora for the humanities, 35–38. Bozen/Bolzano: Eurac Research. Article
Coats, Steven. (2016). Grammatical feature frequencies of English on Twitter in Finland. In Lauren Squires (ed.), English in computer-mediated communication: Variation, representation, and change, 179–210. Boston/Berlin: de Gruyter Mouton. Link Article
Coats, Steven. (2016). Grammatical frequencies and gender in Nordic Twitter Englishes. In Darja Fišer and Michael Beißwenger (eds.), Proceedings of the 4th conference on CMC and social media corpora for the humanities, 12–16. Ljubljana: U. of Ljubljana Academic Publishing. Article
Kretzschmar, William A. Jr., Paulina Bounds, Jacqueline Hettel, Steven Coats, Lee Pederson, Lisa-Lena Opas-Hänninen, Ilkka Juuso, and Tapio Seppänen. (2012). Digital Archive of Southern Speech. Philadelphia, PA: Linguistic Data Consortium. Link

Presentations

Fine-Tuning ASR for corpus linguistics: Singapore English. Presentation at ICAME 47, Koblenz, Germany, 28 May 2026. Slides
Compiling Corpora from Social Media: Combined Audio and Chat Transcripts for Recorded Video Streams. Presentation at the University of Bonn, Germany, 22 May 2026. Slides Code
MD_NLP: Reconstructing an Australian English Heritage Dialect Corpus from the Mitchell-Delbridge Recordings. Presentation at DialRes Workshop, LREC, Palma, Spain, 16 May 2026. Slides Dataset
A Fine-tuned ASR Model for Historical American Dialect Recordings. Presentation at LREC, Palma, Spain, 13 May 2026. Slides Model Dataset
Combined audio and chat transcripts for recorded video streams. Presentation at Love Data Week, Toulouse, France, 10 February 2026. Slides Code
Regional variation in Australian English monophthongs. Presentation at the Australian Linguistic Society Conference, Gold Coast, Australia, 4 December 2025. Slides
The Corpus of Recorded Investigative, Media, and Evidence-based Proceedings. Presentation at the 12th International Conference on CMC and Social Media Corpora for the Humanities, Bayreuth, Germany, 5 September 2025. Slides Website Article
The YouTube Corpus of Singapore English Podcasts. Presentation at the 8th ISLE Conference, Santiago de Compostela, Spain, 2 September 2025. Slides Website Article
What the heck? Euphemisms and emotionality in Anglophone government meetings. Presentation at the 9th SwiSca Conference, Helsinki, Finland, 22 January 2025. Slides
Regional distribution of the /el/-/æl/ merger in Australian English. Poster and booster presentation at the 12th VarDial Workshop, COLING, Abu Dhabi, UAE, 19 January 2025. Slides Article
A development outlook for CLARIN’s northernmost center. Presentation at the CLARIN Annual Conference 2024, Barcelona, Spain, 17 October 2024. Slides
Regional variation in monophthongs in Australian and New Zealand Englishes: A big data approach. Presentation at the 10th Biannual Conference for the Linguistics of Contemporary English, Alicante, Spain, 27 September 2024. Slides
A framework for analysis of speech and chat content in YouTube and Twitch streams. Presentation at the 11th International Conference on CMC and Social Media Corpora for the Humanities, Nice, France, 5 September 2024. Slides Code
Analysis of online discourse from text to multimedia. Keynote presentation for the University of Eastern Finland SCE Program, 20 May 2024. Slides
YouTube Phonetics Pipeline Workshop. Workshop for the ALOES pre-conference event, Paris, France, 28 March 2024. Slides Code
A pipeline for the large-scale acoustic analysis of streamed content. Presentation at the 10th International Conference on CMC and Social Media Corpora for the Humanities, Mannheim, Germany, 15 September 2023. Slides
The Corpus of Australian and New Zealand Spoken English (CoANZSE). Virtual presentation at the Workshop on Language Corpora in Australia, Canberra, Australia, 3 July 2023. Slides
Corpora for the study of multimodal variation in English: Acoustic analysis from CoNASE. Presentation at Kielitieteen Päivät, Oulu, Finland, 25 May 2023. Slides
Corpora of automatic speech recognition transcripts for the study of variation in English: Syntactic and phonetic perspectives. Presentation at PAC2023, Nanterre, France, 12 April 2023. Slides
CoANZSE: The Corpus of Australian and New Zealand Spoken English: A new resource of naturalistic speech transcripts. Virtual presentation at the 20th Annual Workshop of the Australasian Language Technology Association, Adelaide, Australia, 16 December 2022. Slides
Civic engagement with local government videos: Comparing YouTube transcripts with user comments. Presentation at the 9th Conference on CMC and Social Media Corpora for the Humanities, Santiago de Compostela, Spain, 29 September 2022. Slides
CoANZSE: The Corpus of Australian and New Zealand Spoken English. Computational Thinking in the Humanities Online Workshop, Brisbane, Australia, 1 September 2022. Slides
Double modals in YouTube videos from North America and the British Isles. Presentation at CoCorDial Workshop, Helsinki, Finland, 27 April 2022. Slides
The Corpus of British Isles Spoken English (CoBISE): A new resource of contemporary British and Irish speech. Virtual presentation at DHNB 2022, Uppsala, Sweden, 17 March 2022. Slides
Scraping online dictionaries for usage annotations. Virtual presentation at the 7th SwiSca Symposium, Reykjavík, Iceland, 2 December 2021. Slides
A database of North American multiple modals from YouTube. Presentation at the 8th Conference on CMC and Social Media Corpora for the Humanities, Nijmegen, the Netherlands, 29 October 2021. Slides
Multiple modals in the wild: A study of 24,530 multiple modal sequences in naturalistic North American speech. Virtual presentation for the workshop "The March of Data" at the Sixth International Society for the Study of English Conference, Joensuu, Finland, 2 June 2021. Slides
Comparing word frequencies and lexical diversity with the ZipfExplorer tool. Virtual presentation at the Fifth Digital Humanities in the Nordic Countries Conference, National Library of Latvia, Riga, Latvia, 23 October 2020. Slides
Dialect corpora from YouTube. Virtual presentation at ICAME 41, University of Heidelberg, Germany, 20–24 May 2020. Slides Video
Steven Coats and Adrien Barbaresi. Productivity of anglicism bases in hyphenated German compounds. Presentation at the 7th Conference on CMC and Social Media Corpora for the Humanities, Cergy, France, 10 September 2019. Slides
Regional variation in speech rate in American English from YouTube videos. Presentation at Research Data and Humanities Conference, University of Oulu, Finland, 14 August 2019, and 9th Conference of the Finnish Society for the Study of English, University of Tampere, Finland, 15 August 2019. Slides
Swearing on Twitter: Harvesting and visualizing data. Slides for workshop at the 6th SwiSca Symposium, Södertörn University, Sweden, 23 May 2019. Slides
A Corpus of regional American language from YouTube. Presentation at the Fourth Digital Humanities in the Nordic Countries Conference, University of Copenhagen, Denmark, 8 March 2019. Slides Article
Variation of new German verbal Anglicisms in a social media corpus. Presentation at the 6th Conference on CMC and Social Media Corpora for the Humanities, Antwerp, Belgium, 17 September 2018. Slides Slides (German version)
Exploring code-switching and borrowing using word vectors. Presentation at the 14th Conference of the European Society for the Study of English, Brno, Czechia, 1 September 2018. Slides
William A. Kretzschmar, Jr. and Steven Coats. Fractal visualization of corpus data. ICAME 39, University of Tampere, Finland, 30 May 2018.
Skin tone emoji and sentiment on Twitter. Presentation at the Third Digital Humanities in the Nordic Countries Conference, University of Helsinki, Finland, 7 March 2018. Slides Article
Profanity in the Nordics on Twitter. Invited presentation at the Higher Seminar Series, Södertörn University, Sweden, 18 January 2018. Slides
Profanity in the Nordics on Twitter. Presentation at the 5th SwiSca Symposium "What the HEL", University of Helsinki, Finland, 23 November 2017. Slides
European language ecology and bilingualism with English on Twitter. Presentation at the 5th Conference on CMC and Social Media Corpora for the Humanities, EURAC Research, Bozen/Bolzano, Italy, 3 October 2017. Slides Article
Multilingual clusters and gender in Nordic Twitter. Presentation at the CLARIN-PLUS Workshop "Creation and Use of Social Media Resources", Vytautas Magnus University, Kaunas, Lithuania, 19 May 2017. Slides
Multilingual clusters and gender in Nordic Twitter. Presentation at the Second Digital Humanities in the Nordic Countries Conference, University of Gothenburg, Sweden, 14 March 2017. Slides
Grammatical frequencies and gender in Nordic Twitter Englishes. Presentation at the 4th Conference on CMC and Social Media Corpora for the Humanities, University of Ljubljana, Slovenia, 27 September 2016. Slides Article
Nordic Englishes on Twitter. Presentation at Digital Humanities in the Nordic Countries Conference, University of Oslo, Norway, 16 March 2016. Slides
Gender and grammatical type frequencies in Finland Twitter English. Presentation at Interrelating Distance and Interaction Workshop, University of Oulu, Finland, 3 November 2015.
Gender and lexical type frequencies in Finland Twitter English. Presentation at From data to evidence: Big data, rich data, uncharted data, University of Helsinki, Finland, 22 October 2015. Slides
Non-standard lexical and grammatical resources in Finland Twitter English. Presentation at the 45th Poznań Linguistic Meeting, Poznań, Poland, 19 September 2015. Slides
English-language social media in Finland: Twitter data collection and analysis. Presentation at the 12th Conference of the European Society for the Study of English, Košice, Slovakia, 31 August 2014. Slides
Web corpora for discourse analysis: The language of travel and tourism. Presentation at the 79th Southeastern Conference on Linguistics, University of Kentucky, 13 April 2012.
Lisa Lena Opas-Hänninen, Ilkka Juuso, William A. Kretzschmar, Jr., Tapio Seppänen, Steven Coats. The Digital Archive of Southern Speech. Helsinki Corpus Festival, 1 October 2011.
Constituting mental maps: A corpus linguistics-based approach to perceptual geography in the GDR. Presentation at the 77th Southeastern Conference on Linguistics, University of Mississippi, 29 April 2010.
William A. Kretzschmar, Jr., Paulina Bounds, Steven Coats, Tony Snodgrass, Lisa Lena Opas-Hänninen, Tapio Seppänen, and Ilkka Juuso. The Digital Archive of Southern Speech. 76th Southeastern Conference on Linguistics, Tulane University, 9 April 2009.

Education

University of Georgia, Athens, Georgia, USA: Ph.D., Linguistics, 2015
Ruprecht-Karls-Universität Heidelberg, Germany: Magister Artium, Deutsche Philologie, 2007
Oberlin College, Ohio, USA: B.A., Biology, 1996

I teach courses on Academic Communication, Sociolinguistics, Digital Humanities, and North American studies. If you are a student in one of my courses, go to Moodle for course information and materials.

I was one of the group leaders at the Helsinki Digital Humanities Hackathon for the theme "Brexit in Transnational Social Media" in May 2019.

Professional activities

Reviewer, CMC-Corpora conference series, DHNB conference series, VarDial, ASRU, and other conferences
Reviewer, Text & Talk, Human IT, Journal of Pragmatics, AJL, and other publications
Steering committee, Conference on CMC and Social Media Corpora for the Humanities series
Member, European Association for Digital Humanities
Member, Finnish Society for the Study of English

Visualizations

You can find a map of the semantic similarity of emoji types here. A representation of the links between languages for a sample of bilingual European Twitter users is available here. Check out some maps of variation in articulation rate in American English here. A table with more than 1,000 authentic double modals (with links to the videos at the time of utterance), as well as about 1,000 two-modal sequences that are instances of "self-repair", is here. The ZipfExplorer is a tool for the visualization of word frequency differences in texts.

My GitHub is here.