class: inverse, center, middle
background-image: url(data:image/png;base64,#https://cc.oulu.fi/~scoats/oululogoRedTransparent.png); background-repeat: no-repeat; background-size: 80px 57px; background-position:right top;
exclude: true

---
class: title-slide

<br>

## Double modals in YouTube videos from North America and the British Isles

<br><br><br>

Steven Coats<br>
English Philology, University of Oulu, Finland<br>
<a href="mailto:steven.coats@oulu.fi">steven.coats@oulu.fi</a><br>
Corpus-based and Computational Approaches to Variation Workshop, Helsinki <br>
April 27th, 2022<br>

---
layout: true

<div class="my-header"><img border="0" alt="W3Schools" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats                             Double modals | CorCoDial Workshop, Helsinki</span></div>

---

## Outline

1. CoNASE and CoBISE
2. YouTube ASR captions files, data collection and geocoding
3. Methods: Frequency analysis (frequent features), manual inspection/annotation (rare features)
4. Double modals in North America and in the British Isles
5. Caveats, summary

.footnote[Slides for the presentation are on my homepage at https://cc.oulu.fi/~scoats]

---

### Introduction

- Renaissance in corpus-based study of English varieties <span class="small">(Nerbonne 2009; Szmrecsanyi 2011, 2013; Grieve et al.
2019)</span>
- Available corpora of transcribed spoken English <span class="small">(Anderwald & Wagner 2007; Corbett 2014; Corrigan et al. 2012; Du Bois et al. 2000-2005; Kallen & Kirk 2007)</span> are small or lack a broad geographic focus; their limited size can make rare features difficult to find
- [CoNASE](https://cc.oulu.fi/~scoats/CoNASE.html): 1.25b-token corpus of 301,846 word-timed, part-of-speech-tagged Automatic Speech Recognition (ASR) transcripts <span class="small">(Coats forthcoming a)</span>
- [CoBISE](https://cc.oulu.fi/~scoats/CoBISE.html): 112m-token corpus of 38,680 ASR transcripts <span class="small">(Coats forthcoming a)</span>
- Together they correspond to more than 166,000 hours of video from more than 3,000 YouTube channels of local councils and other government entities in the US, Canada, England, Scotland, Wales, Northern Ireland, and the Republic of Ireland
- Freely available for research use; download from the Harvard Dataverse [here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/X8QJJV) and [here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UGIIWD)

---

### YouTube captions files

- Videos can have multiple captions files: user-uploaded captions, auto-generated captions created using automatic speech recognition (ASR), both, or neither
- User-uploaded captions can be manually created or generated automatically by third-party ASR software
- Auto-generated captions are generated by YT's speech-to-text service
- CoNASE and CoBISE: target YT ASR captions

---
exclude: true

### WebVTT file

![](data:image/png;base64,#WY9RPeXA3pw_vtt.png)

---

### Focus on regional and local council channels

Many recordings of meetings of elected councillors: advantages in terms of representativeness and comparability

- Speaker place of residence (cf.
videos collected based on place-name search alone)
- Topical contents and communicative contexts comparable

---

### Data collection and processing

- Identification of relevant channels (YouTube API, searches of public-facing server, lists of councils with YT channels)
- Inspection of returned channels to remove false positives
- Retrieval of ASR transcripts using [YouTube-DL](https://github.com/ytdl-org/youtube-dl)
- VPN or [Tor](https://www.torproject.org/) to circumvent IP blocking
- Geocoding: String containing council name + channel name + country location sent to Google's geocoding service
- PoS tagging with spaCy <span class="small">(Honnibal et al. 2019)</span>

---

### Transcript accuracy

- ASR transcripts contain errors (word error rate, WER, ~22%)
- High-frequency phenomena: the signal from correct transcriptions will be stronger than the noise from errors <span class="small">(Agarwal et al. 2007)</span> → classifiers
- Low-frequency phenomena: manually inspect corpus hits

![:scale 60%](data:image/png;base64,#asr_wordfreqs.png)

---
exclude: true

### Corpus use cases and size

- Regional language (dialectology): e.g.
syntax, mood and modality
- Pragmatics: Turn-taking, politeness markers
- Script pipeline: Use corpus to identify areas/speakers/words/phonemes of interest, get videos, convert to audio (FFmpeg), automated formant extraction/vowel quality analysis on a large scale

.small[.pull-left[

**CoNASE**

Country | Channels | Videos | Tokens | Length (h)
--------------|---------|------|-----------|-----------
US | 2,189 | 270,931 | 1,149,030,824 | 141,455.11
Canada | 383 | 30,916 | 103,035,369 | 12,586.77

**CoANZSE** (coming soon)

Country | Channels | Videos | Tokens | Length (h)
--------------|---------|------|-----------|-----------
Australia | 408 | 38,786 | 111,470,235 | 13,885.1
New Zealand | 74 | 18,029 | 84,058,661 | 1,083.75

]]

<div style="top:-40px">

.small[.pull-right[

**CoBISE**

Country | Channels | Videos | Tokens | Length (h)
-------------------|---------|------|-----------|-----------
England | 324 | 23,657 | 72,879,173 | 8,518.39
Northern Ireland | 10 | 1,898 | 6,508,505 | 774.17
Republic of Ireland | 26 | 2,525 | 6,264,276 | 680.81
Scotland | 75 | 8,135 | 17,111,396 | 1,845.35
Wales | 18 | 2,465 | 8,800,264 | 982.66

**CoGS** (coming soon)

Country | Channels | Videos | Tokens | Length (h)
--------------|---------|------|-----------|-----------
Germany | 1,313 | 39,495 | 50,554,070 | 7,223.44

]]

</div>

---

### Example analysis: Double modals

- Rare non-standard syntactic feature in the British Isles, North America, and elsewhere <span class="small">(Montgomery & Nagle 1994; Coats 2022)</span>
- **Will you can help me with this?**
- Occurs only in the American Southeast and in Scotland/Northern Ireland/Northern England?
- Most studies based on non-naturalistic data with limited geographical scope <span class="small">(*LAMSAS*; *LAGS*; Murray 1873; Wright 1898-1905; Anderwald & Wagner 2007; Kallen & Kirk 2007; Smith et al.
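2019)</span>

---

### Script: Generating a table for manual inspection of double modals

- Base modals *will, would, can, could, might, may, must, should, shall, used to, 'll, ought to, oughta*
- Script to generate regex of two-tier combinations, plus forms with intervening pronouns, auxiliary verbs, negations

```python
import re
from itertools import permutations
import pandas as pd

# cobise_df: one row per transcript; "text_pos" holds space-separated word_POS_time tokens
# Single-word base modals only; multi-word forms (used to, ought to) need separate patterns
modals = ["will", "would", "can", "could", "might", "may", "must", "should", "shall", "oughta"]
tok = r"\w+_\S+_\S+\s"  # one token plus trailing whitespace
hits = []
for m1, m2 in permutations(modals, 2):
    # two adjacent modals with three tokens of context on either side
    pat = re.compile(rf"((?:{tok}){{3}}{m1}_\w+_\S+\s{m2}n?_\w+_\S+\s(?:{tok}){{3}})", re.IGNORECASE)
    for _, row in cobise_df.iterrows():
        for hit in pat.findall(row["text_pos"]):
            seq = " ".join(t.split("_")[0] for t in hit.split())
            time = float(hit.split()[0].split("_")[-1])  # timestamp of first context token
            # link to the video, starting 3 seconds before the hit
            url = "https://youtu.be/" + row["video_id"] + "?t=" + str(max(0, round(time - 3)))
            hits.append((row["country"], row["channel_title"], seq, url))
pd.DataFrame(hits, columns=["country", "channel_title", "sequence", "url"])
```

---
class: small

### Excerpt from generated table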
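---

### Sketch: allowing intervening tokens

The search script also covers forms with intervening pronouns and negations. A minimal sketch of how that could be expressed is given here; it is not the script used for the study. The helper names, the `max_gap` parameter, the toy transcript string, and the assumption that pronouns and negations carry the Penn Treebank tags `PRP` and `RB` are all illustrative. Intervening auxiliary verbs are not handled.

```python
import re


def double_modal_pattern(first, second, max_gap=1):
    """Hypothetical helper: regex for `first` followed by `second`, with up to
    `max_gap` intervening word_POS_time tokens tagged PRP (pronoun) or
    RB (adverb/negation). The tag names are an assumption."""
    gap = rf"(?:\S+_(?:PRP|RB)_\S+\s){{0,{max_gap}}}"
    return re.compile(
        rf"(?<!\S){first}_\w+_\S+\s{gap}{second}_\w+_\S+(?!\S)",
        re.IGNORECASE,
    )


def find_hits(text_pos, first, second, max_gap=1):
    """Return the matched word sequences with tags and timestamps stripped."""
    pattern = double_modal_pattern(first, second, max_gap)
    return [
        " ".join(tok.split("_")[0] for tok in m.group(0).split())
        for m in pattern.finditer(text_pos)
    ]


# Toy string in the word_POS_time format (invented for illustration only)
toy = (
    "i_PRP_1.0 might_MD_1.2 could_MD_1.5 help_VB_1.8 "
    "will_MD_2.4 you_PRP_2.6 can_MD_2.8 come_VB_3.0"
)
print(find_hits(toy, "might", "could"))
print(find_hits(toy, "will", "can"))
```

Setting `max_gap=0` restricts the search to strictly adjacent modals, which keeps the table for manual inspection shorter at the cost of missing inverted question forms.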
---

### Double modals

- Approach: regular-expression search followed by manual annotation
- Double modals can be found in the US North and West and in Canada; in Scotland, N. Ireland, and N. England, but also in the English Midlands and South and in Wales <span class="small">(Coats in review)</span>

---
class: center, middle
background-image: url(data:image/png;base64,#dm_map_0.png)
background-size: contain

---
class: center, middle
background-image: url(data:image/png;base64,#uk_dm_map.png)
background-size: contain

---

### A few caveats

- Meetings of local government are not representative of speech in general
- ASR errors: transcript quality depends on audio quality as well as dialect features <span class="small">(Tatman 2017; Meyer et al. 2020; Markl & Lai 2021)</span>

---

### Summary and outlook

- Large corpora of ASR transcripts from YouTube channels of local governments in the US, Canada, Britain, and Ireland (coming soon: Australia/NZ, 190m tokens; Germany, 56m tokens)
- Useful for corpus studies of spoken language, dialectology, pragmatics
- Double modals are more widespread than has previously been documented

---

# Thank you!

---

### References

.small[
.hangingindent[

Agarwal, S., S. Godbole, D. Punjani & S. Roy. 2007. [How much noise is too much: A study in automatic text classification](https://doi.org/10.1109/ICDM.2007.21). In: *Seventh IEEE International Conference on Data Mining (ICDM 2007)*, 3–12.

Anderwald, L. & S. Wagner. 2007. The Freiburg English Dialect Corpus: Applying corpus-linguistic research tools to the analysis of dialect data. In: J. C. Beal, K. P. Corrigan & H. Moisl (Eds.), *Creating and digitizing language corpora volume 1: Synchronic databases*, 35–53. Palgrave Macmillan.

Coats, S. In review. Double modals in contemporary British and Irish speech.

Coats, S. Forthcoming a. Dialect corpora from YouTube. *Proceedings of ICAME41*. De Gruyter.

Coats, S. 2022. [Naturalistic double modals in North America](https://doi.org/10.1215/00031283-9766889).
*American Speech*.

Corbett, J. 2014. Syntactic variation: Evidence from the Scottish Corpus of Text and Speech. In: R. Lawson (Ed.), *Sociolinguistics in Scotland*, 258–276. Palgrave Macmillan.

Corrigan, K. P., I. Buchstaller, A. Mearns & H. Moisl. 2012. [*The Diachronic Electronic Corpus of Tyneside English*](https://research.ncl.ac.uk/decte).

Du Bois, J. W., W. L. Chafe, C. Meyer, S. A. Thompson, R. Englebretson & N. Martey. 2000-2005. Santa Barbara corpus of spoken American English, Parts 1-4. Philadelphia: Linguistic Data Consortium.

Grieve, J., C. Montgomery, A. Nini, A. Murakami & D. Guo. 2019. [Mapping lexical dialect variation in British English using Twitter](https://doi.org/10.3389/frai.2019.00011). *Frontiers in Artificial Intelligence* 2.

Honnibal, M., I. Montani, H. Peters, S. V. Landeghem, M. Samsonov, J. Geovedi, J. Regan, G. Orosz, S. L. Kristiansen, P. O. McCann, D. Altinok, Roman, G. Howard, S. Bozek, E. Bot, M. Amery, W. Phatthiyaphaibun, L. U. Vogelsang, B. Böing, P. K. Tippa, jeannefukumaru, G. Dubbin, V. Mazaev, R. Balakrishnan, J. D. Møllerhøj, wbwseeker, M. Burton, thomasO & A. Patel. 2019. [Explosion/spaCy v2.1.7: Improved evaluation, better language factories and bug fixes](https://doi.org/10.5281/zenodo.3358113).

Kallen, J. & J. Kirk. 2007. ICE-Ireland: Local variations on global standards. In: J. C. Beal, K. P. Corrigan & H. Moisl (Eds.), *Creating and digitizing language corpora volume 1: Synchronic databases*, 121–162. Palgrave Macmillan.

]]

---

### References II

.small[
.hangingindent[

Markl, N. & C. Lai. 2021. [Context-sensitive evaluation of automatic speech recognition: considering user experience & language variation](https://aclanthology.org/2021.hcinlp-1.6). In: *Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing*, 34–40. Association for Computational Linguistics.

Meyer, J., L. Rauchenstein, J. D. Eisenberg & N. Howell. 2020.
[Artie bias corpus: An open dataset for detecting demographic bias in speech applications](https://aclanthology.org/2020.lrec-1.796). In: *Proceedings of the 12th Language Resources and Evaluation Conference*, 6462–6468. European Language Resources Association.

Montgomery, M. B. & S. J. Nagle. 1994. Double modals in Scotland and the Southern United States: Trans-atlantic inheritance or independent development? *Folia Linguistica Historica* 14, 91–108.

Murray, J. 1873. *The dialect of the southern counties of Scotland: Its pronunciation, grammar, and historical relations.* London: Asher & Co.

Nerbonne, J. 2009. Data-driven dialectology. *Language and Linguistics Compass* 3, 175–198.

Smith, J., D. Adger, B. Aitken, C. Heycock, E. Jamieson & G. Thoms. 2019. [*The Scots Syntax Atlas*](https://scotssyntaxatlas.ac.uk). University of Glasgow.

Szmrecsanyi, B. 2011. Corpus-based dialectometry: A methodological sketch. *Corpora* 6, 45–76.

Szmrecsanyi, B. 2013. *Grammatical variation in British English dialects: A study in corpus-based dialectometry*. Cambridge University Press.

Tatman, R. 2017. [Gender and dialect bias in YouTube’s automatic captions](https://aclanthology.org/W17-1606). In: *Proceedings of the First ACL Workshop on Ethics in Natural Language Processing*, 53–59. Association for Computational Linguistics.

Wright, J. 1898–1905. *The English dialect dictionary* (6 volumes). London: Henry Frowde.

]]