class: inverse, center, middle
background-image: url(data:image/png;base64,#https://cc.oulu.fi/~scoats/oululogoRedTransparent.png);
background-repeat: no-repeat;
background-size: 80px 57px;
background-position:right top;
exclude: true

---
class: title-slide

<br>

## <span style="color:black;-webkit-text-fill-color: orange;-webkit-text-stroke: 1px;">CoANZSE: The Corpus of Australian and New Zealand Spoken English</span>

<br><br><br>

Steven Coats<br>
English, University of Oulu, Finland<br>
<a href="mailto:steven.coats@oulu.fi">steven.coats@oulu.fi</a><br>
Computational Thinking in the Humanities Online Workshop<br>
September 1st, 2022<br>

---
layout: true

<div class="my-header"><img border="0" alt="W3Schools" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats                          CoANZSE | Computational Approaches to Language Variation, Joensuu</span></div>

---
exclude: true

## Outline

1. Background, YouTube ASR captions files, data collection and processing
2. CoANZSE locations and size
3. Double modals in Australia and New Zealand?
4. Caveats, summary

.footnote[Slides for the presentation are on my homepage at https://cc.oulu.fi/~scoats]

---
### Background

- Renaissance in corpus-based study of English varieties <span class="small">(Nerbonne 2009; Szmrecsanyi 2011, 2013; Grieve et al.
2019)</span>
- Many available corpora of transcribed spoken English are small or lack sufficient geographical granularity to make inferences about regional distributions of features

.small[
Corpus                | Location(s)        | nr_words | Reference
----------------------|--------------------|----------|--------------------------
FRED                  | Britain            | ~2.5m    | Anderwald & Wagner 2007
SCOTS Corpus          | Scotland           | ~1m      | Corbett 2014
NECTE/DECTE           | Newcastle/Tyneside | ~315k    | Corrigan et al. 2012
Santa Barbara Corpus  | US                 | ~249k    | Du Bois et al. 2000-2005
ICE-Ireland           | Ireland            | ~600k    | Kallen & Kirk 2007
ICE-Aus               | Australia          | ~600k    | Cassidy et al. 2012
Monash Corpus         | Melbourne          | ~96k     | Bradshaw et al. 2010
Griffith Corpus       | Brisbane           | ~32k     | Cassidy et al. 2012
]

- Automatic Speech Recognition (ASR) transcripts are available online for speech from specific locations
- Videos from local councils and other government entities can be harvested to create large corpora

---
### Example video

<iframe width="560" height="315" src="https://www.youtube.com/embed/cn8vWlUae7Y?rel=0&&showinfo=0&cc_load_policy=1&cc_lang_pref=en" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---
### WebVTT file

![](data:image/png;base64,#./Maranoa_webvtt_example.png)

---
exclude: true
### YouTube captions files

- A video can have user-uploaded captions, auto-generated captions created using automatic speech recognition (ASR), both, or neither
- User-uploaded captions can be manually created or generated automatically by third-party ASR software
- Auto-generated captions are produced by YouTube's speech-to-text service
- CoANZSE (and CoNASE and CoBISE): target YouTube ASR captions

---
### YouTube ASR Corpora

US, Canada, England, Scotland, Wales, Northern Ireland, the Republic of Ireland, Australia, New Zealand, and Germany

- [CoNASE](https://cc.oulu.fi/~scoats/CoNASE.html): 1.25b token corpus of 301,846 word-timed, part-of-speech-tagged
Automatic Speech Recognition (ASR) transcripts <span class="small">(Coats forthcoming a)</span>
- [CoBISE](https://cc.oulu.fi/~scoats/CoBISE.html): 112m tokens, 452 locations, 38,680 ASR transcripts <span class="small">(Coats forthcoming b)</span>
- [CoGS](https://cc.oulu.fi/~scoats/CoGS.html): 50.5m tokens, 39.5k transcripts, 1,308 locations <span class="small">(Coats in review b)</span>
- <span class="large">[CoANZSE](https://cc.oulu.fi/~scoats/CoANZSE.html): 190m tokens, 57k transcripts, 482 locations </span><span class="small">(Coats in review b)</span>

Freely available for research use; download from the Harvard Dataverse ([CoNASE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/X8QJJV), [CoBISE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UGIIWD), [CoGS](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3Y1YVB), [CoANZSE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GW35AK))

---
### Data format

<div>
<table border="1" class="dataframe" style="font-size:8pt;border-collapse: collapse;">
<thead>
<tr style="text-align: right;"> <th></th> <th>country</th> <th>state</th> <th>name</th> <th>channel_name</th> <th>channel_url</th> <th>video_title</th> <th>video_id</th> <th>upload_date</th> <th>video_length</th> <th>text_pos</th> <th>location</th> <th>latlong</th> <th>nr_words</th> </tr>
</thead>
<tbody1>
<tr> <th>0</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Road Resurfacing Video</td> <td>zVr6S5XkJ28</td> <td>20181127</td> <td>146.120</td> <td>g_NNP_2.75 'day_XX_2.75 my_PRP$_3.75 name_NN_4.53 is_VBZ_4.74 ...
<td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>433</td> </tr> <tr> <th>1</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Weather update 5pm 1 March 2022 - Mayor Matt Gould</td> <td>p4MjirCc1oU</td> <td>20220301</td> <td>181.959</td> <td>hi_UH_0.64 guys_NNS_0.96 i_PRP_1.439 'm_VBP_1.439 just_RB_1.76 ... <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>620</td> </tr> <tr> <th>2</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Transport Capital Works Video</td> <td>DXlkVTcmeho</td> <td>20180417</td> <td>140.450</td> <td>council_NNP_0.53 is_VBZ_1.53 placing_VBG_1.65 is_VBZ_2.07 2018-19_CD_2.57 ... <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>347</td> </tr> <tr> <th>3</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Council Meeting Wrap Up February 2022</td> <td>2NhuhF2fBu8</td> <td>20220224</td> <td>107.840</td> <td>g_NNP_0.399 'day_NNP_0.399 guys_NNS_0.799 and_CC_1.12 welcome_JJ_1.199 ... <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>341</td> </tr> <tr> <th>4</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>CITY DEAL 4 March 2018</td> <td>4-cv69ZcwVs</td> <td>20180305</td> <td>130.159</td> <td>[Music]_XX_0.85 it_PRP_2.27 's_VBZ_2.27 a_DT_3.27 fantastic_JJ_3.36 ... 
<td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>420</td> </tr>
</tbody1>
</table></div>

---
### Focus on regional and local council channels

Many recordings of meetings of elected councillors: advantages in terms of representativeness and comparability

- Speakers' place of residence is known (cf. videos collected based on place-name search alone)
- Topical contents and communicative contexts are comparable
- In most jurisdictions government content is in the public domain

---
### Data collection and processing

- Identification of relevant channels (lists of councils with web pages -> scrape pages for links to YouTube)
- Inspection of returned channels to remove false positives
- Retrieval of ASR transcripts using [yt-dlp](https://github.com/yt-dlp/yt-dlp)
- Geocoding: string containing council name + address + country sent to Google's geocoding service
- PoS tagging with spaCy <span class="small">(Honnibal et al. 2019)</span>

---
### CoANZSE channel locations

<div class="container">
<iframe src="https://cc.oulu.fi/~scoats/ausnz_dot.html" style="width: 100%; height: 450px;" style="max-width = 100%" sandbox="allow-same-origin allow-scripts" scrolling="yes" seamless="seamless" frameborder="0" align="middle"></iframe>
</div>

---
### CoANZSE corpus size by country/state/territory

.small[
Territory                    | nr_channels | nr_videos | nr_words    | video_length (h)
-----------------------------|-------------|-----------|-------------|-----------------
Australian Capital Territory | 8           | 650       | 915,542     | 111.79
New South Wales              | 114         | 9,741     | 27,580,773  | 3,428.87
Northern Territory           | 11          | 289       | 315,300     | 48.72
New Zealand                  | 74          | 18,029    | 84,058,661  | 10,175.80
Queensland                   | 58          | 7,356     | 19,988,051  | 2,642.75
South Australia              | 50          | 3,537     | 13,856,275  | 1,716.72
Tasmania                     | 21          | 1,260     | 5,086,867   | 636.99
Victoria                     | 78          | 12,138    | 35,304,943  | 4,205.40
Western Australia            | 68          | 3,815     | 8,422,484   | 1,063.78
                             |             |           |             |
Total                        | 482         | 56,815    | 195,528,896 | 24,030.82
]

---
exclude: true
### Potential analyses

- Non-numerical quantifiers
*heaps* and *lots*

---
exclude: true
### Corpus use cases and size

- Regional language (dialectology): e.g. syntax, mood and modality
- Pragmatics: turn-taking, politeness markers
- Script pipeline: use corpus to identify areas/speakers/words/phonemes of interest, get videos, convert to audio (FFmpeg), automated formant extraction/vowel quality analysis on a large scale

.small[.pull-left[
**CoNASE**

Country       | Channels | Videos  | Tokens        | Length (h)
--------------|----------|---------|---------------|-----------
US            | 2,189    | 270,931 | 1,149,030,824 | 141,455.11
Canada        | 383      | 30,916  | 103,035,369   | 12,586.77

**CoANZSE** (coming soon)

Country       | Channels | Videos  | Tokens        | Length (h)
--------------|----------|---------|---------------|-----------
Australia     | 408      | 38,786  | 111,470,235   | 13,855.02
New Zealand   | 74       | 18,029  | 84,058,661    | 10,175.80
]]

<div style="top:-40px">
.small[.pull-right[
**CoBISE**

Country             | Channels | Videos | Tokens     | Length (h)
--------------------|----------|--------|------------|-----------
England             | 324      | 23,657 | 72,879,173 | 8,518.39
Northern Ireland    | 10       | 1,898  | 6,508,505  | 774.17
Republic of Ireland | 26       | 2,525  | 6,264,276  | 680.81
Scotland            | 75       | 8,135  | 17,111,396 | 1,845.35
Wales               | 18       | 2,465  | 8,800,264  | 982.66

**CoGS** (coming soon)

Country       | Channels | Videos | Tokens     | Length (h)
--------------|----------|--------|------------|-----------
Germany       | 1,313    | 39,495 | 50,554,070 | 7,223.44
]]
</div>

---
### Example analysis: Double modals

- Non-standard rare syntactic feature<span class="small"> (Montgomery & Nagle 1994; Coats 2022)</span>
- **I might could help you with this**
- Occurs only in the American Southeast and in Scotland/Northern England/Northern Ireland?
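- Most studies based on non-naturalistic data with limited geographical scope <span class="small">(data from *LAMSAS*, *LAGS*, surveys administered mostly in American Southeast and North of Britain)</span>
- More widely used in North America and the British Isles than previously thought (Coats 2022; Coats in review a)
- Not yet reported in Australian or NZ speech

---
### Script: Generating a table for manual inspection of double modals

- Base modals *will, would, can, could, might, may, must, should, shall, used to, 'll, ought to, oughta*
- Script to generate regexes of two-tier combinations

```python
import re
from itertools import permutations
import pandas as pd

# Single-token base modals; "used to" and "ought to" span two tokens and need their own patterns
modals = ["will", "would", "can", "could", "might", "may", "must", "should", "shall", "'ll", "oughta"]
hits = []
# coanzse_df: the corpus DataFrame (see "Data format"); pairing via permutations is an assumption
for m1, m2 in permutations(modals, 2):
    # Two adjacent token_TAG_time items, each anchored at the start of a token
    pat = re.compile(r"(?<!\S)" + re.escape(m1) + r"_\w+_\S+\s+" + re.escape(m2) + r"_\w+_\S+\s", re.IGNORECASE)
    for _, row in coanzse_df.iterrows():
        match = pat.search(row["text_pos"])  # first hit per transcript
        if match:
            tokens = match.group(0).split()
            seq = " ".join(t.split("_")[0] for t in tokens)
            start = float(tokens[0].split("_")[-1])
            # Link to the video three seconds before the first modal
            url = "https://youtu.be/" + row["video_id"] + "?t=" + str(max(0, round(start - 3)))
            hits.append((row["country"], row["channel_name"], seq, url))
pd.DataFrame(hits, columns=["country", "channel_name", "sequence", "url"])
```

---
class: small
### Excerpt from generated table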
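
Each row of the table pairs a matched modal sequence with a time-stamped link to the video. The following is a minimal, self-contained sketch of how one such row could be derived from a `token_TAG_time` string. The modal subset, the function name `find_double_modals`, the toy transcript, and the video ID are all hypothetical and only illustrate the approach of the script above.

```python
import re
from itertools import permutations

# Hypothetical subset of the base modals (single-token forms only)
MODALS = ["might", "could", "would", "should", "can", "may", "must", "will"]


def find_double_modals(text_pos, video_id):
    """Return (sequence, url) pairs for adjacent modal pairs in a token_TAG_time string."""
    hits = []
    for first, second in permutations(MODALS, 2):
        # (?<!\S) anchors the match at the start of a token, so "pecan" does not match "can"
        pattern = re.compile(
            r"(?<!\S)" + re.escape(first) + r"_\S+_(\S+)\s+"
            + re.escape(second) + r"_\S+_\S+",
            re.IGNORECASE,
        )
        for match in pattern.finditer(text_pos):
            start = float(match.group(1))  # onset time of the first modal
            # Start playback three seconds earlier, but never before 0
            offset = max(0, round(start - 3))
            hits.append((f"{first} {second}", f"https://youtu.be/{video_id}?t={offset}"))
    return hits


# Toy transcript in the corpus format; the video ID is a placeholder
example = "i_PRP_10.2 might_MD_10.4 could_MD_10.8 help_VB_11.1 you_PRP_11.4"
print(find_double_modals(example, "VIDEO_ID"))
```

Every hit still needs manual inspection: the link lets an annotator listen to the passage and decide whether the sequence is a genuine double modal or an ASR error.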
---
### Finding features

- Regular-expression-search and manual annotation approach
- Double modals can be found in the US North and West and in Canada; in Scotland, N. Ireland, and N. England, but also in the English Midlands and South and in Wales <span class="small">(Coats in review a)</span>
- Also in Australia and New Zealand!

---
### A few caveats

- Videos of local government not representative of speech in general
- ASR errors (mean word error rate (WER) after filtering ~14%), quality of transcript related to quality of audio as well as dialect features <span class="small">(Tatman 2017; Meyer et al. 2020; Markl & Lai 2021)</span>
- Low-frequency phenomena: manually inspect corpus hits
- High-frequency phenomena: signal of correct transcriptions will be stronger <span class="small">(Agarwal et al. 2007)</span> → classifiers

---
exclude: true
### Summary and outlook

- Large corpus of ASR transcripts from YouTube channels of local governments in Australia/NZ
- Possibly useful for corpus studies of spoken language, dialectology, pragmatics
- Double modals are more widespread than has previously been documented

---
# Thank you!

---
### References

.small[
.hangingindent[
Agarwal, S., S. Godbole, D. Punjani & S. Roy. 2007. [How much noise is too much: A study in automatic text classification](https://doi.org/10.1109/ICDM.2007.21). In: *Seventh IEEE International Conference on Data Mining (ICDM 2007)*, 3–12.

Anderwald, L. & S. Wagner. 2007. The Freiburg English Dialect Corpus: Applying corpus-linguistic research tools to the analysis of dialect data. In: J. C. Beal, K. P. Corrigan & H. Moisl (Eds.), *Creating and digitizing language corpora volume 1: Synchronic databases*, 35–53. Palgrave Macmillan.

Bradshaw, J., K. Burridge & M. Clyne. 2010. The Monash Corpus of Spoken Australian English. In: L. de Beuzeville & P. Peters (Eds.), *Proceedings of the 2008 Conference of the Australian Linguistic Society*.

Cassidy, S., M. Haugh, P. Peters & M. Fallu. 2012.
The Australian National Corpus: National infrastructure for language resources. In: *Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)*, 3295–3299. http://www.lrec-conf.org/proceedings/lrec2012/pdf/400_Paper.pdf

Coats, S. In review a. Double modals in contemporary British and Irish Speech.

Coats, S. In review b. New Corpora of Geolocated ASR Transcripts from Australia/New Zealand and Germany.

Coats, S. Forthcoming a. Dialect corpora from YouTube. *Proceedings of ICAME41*. De Gruyter.

Coats, S. Forthcoming b. The Corpus of British Isles Spoken English (CoBISE): A New Resource of Contemporary British and Irish Speech. *Proceedings of the 6th Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2022), Uppsala, Sweden, March 15-18, 2022*.

Coats, S. 2022. [Naturalistic double modals in North America](https://doi.org/10.1215/00031283-9766889). *American Speech*.

Corbett, J. 2014. Syntactic variation: Evidence from the Scottish Corpus of Text and Speech. In: R. Lawson (Ed.), *Sociolinguistics in Scotland*, 258–276. Palgrave Macmillan.

Corrigan, K. P., I. Buchstaller, A. Mearns & H. Moisl. 2012. [*The Diachronic Electronic Corpus of Tyneside English*](https://research.ncl.ac.uk/decte).

Du Bois, J. W., W. L. Chafe, C. Meyer, S. A. Thompson, R. Englebretson & N. Martey. 2000-2005. Santa Barbara corpus of spoken American English, Parts 1-4. Philadelphia: Linguistic Data Consortium.
]]

---
### References II

.small[
.hangingindent[
Grieve, J., C. Montgomery, A. Nini, A. Murakami & D. Guo. 2019. [Mapping lexical dialect variation in British English using Twitter](https://doi.org/10.3389/frai.2019.00011). *Frontiers in Artificial Intelligence* 2.

Honnibal, M., I. Montani, H. Peters, S. V. Landeghem, M. Samsonov, J. Geovedi, J. Regan, G. Orosz, S. L. Kristiansen, P. O. McCann, D. Altinok, Roman, G. Howard, S. Bozek, E. Bot, M. Amery, W. Phatthiyaphaibun, L. U. Vogelsang, B. Böing, P. K. Tippa, jeannefukumaru, G.
Dubbin, V. Mazaev, R. Balakrishnan, J. D. Møllerhøj, wbwseeker, M. Burton, thomasO & A. Patel. 2019. [Explosion/spaCy v2.1.7: Improved evaluation, better language factories and bug fixes](https://doi.org/10.5281/zenodo.3358113).

Kallen, J. & J. Kirk. 2007. ICE-Ireland: Local variations on global standards. In: J. C. Beal, K. P. Corrigan & H. Moisl (Eds.), *Creating and digitizing language corpora volume 1: Synchronic databases*, 121–162. Palgrave Macmillan.

Markl, N. & C. Lai. 2021. [Context-sensitive evaluation of automatic speech recognition: considering user experience & language variation](https://aclanthology.org/2021.hcinlp-1.6). In: *Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing*, 34–40. Association for Computational Linguistics.

Meyer, J., L. Rauchenstein, J. D. Eisenberg & N. Howell. 2020. [Artie bias corpus: An open dataset for detecting demographic bias in speech applications](https://aclanthology.org/2020.lrec-1.796). In: *Proceedings of the 12th Language Resources and Evaluation Conference (Marseille, France, 2020)*, 6462–6468. European Language Resources Association.

Montgomery, M. B. & S. J. Nagle. 1994. Double modals in Scotland and the Southern United States: Trans-atlantic inheritance or independent development? *Folia Linguistica Historica* 14, 91–108.

Nerbonne, J. 2009. Data-driven dialectology. *Language and Linguistics Compass* 3, 175–198.

Szmrecsanyi, B. 2011. Corpus-based dialectometry: A methodological sketch. *Corpora* 6, 45–76.

Szmrecsanyi, B. 2013. *Grammatical variation in British English dialects: A study in corpus-based dialectometry*. Cambridge University Press.

Tatman, R. 2017. [Gender and dialect bias in YouTube’s automatic captions](https://aclanthology.org/W17-1606). In: *Proceedings of the First ACL Workshop on Ethics in Natural Language Processing*, 53–59. Association for Computational Linguistics.
]]