class: inverse, center, middle
background-image: url(data:image/png;base64,#https://cc.oulu.fi/~scoats/oululogoRedTransparent.png);
background-repeat: no-repeat;
background-size: 80px 57px;
background-position:right top;
exclude: true

---
class: title-slide

<br>

## <span style="color:black;-webkit-text-fill-color: #32CD32;-webkit-text-stroke: 1px;">The Corpus of Australian and New Zealand Spoken English: A New Resource of Naturalistic Speech Transcripts</span>

<br><br><br>

Steven Coats<br>
English, University of Oulu, Finland<br>
<a href="mailto:steven.coats@oulu.fi">steven.coats@oulu.fi</a><br>
ALTA 2022<br>
December 16th, 2022<br>

---
layout: true

<div class="my-header"><img border="0" alt="W3Schools" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats                                CoANZSE | ALTA Conference, Adelaide</span></div>

---
exclude: true

## Outline

1. Background, YouTube ASR captions files, data collection and processing
2. CoANZSE locations and size
3. Example use cases: Double modals, variety classification, acoustic analysis pipeline
4. Caveats, summary

.footnote[Slides for the presentation are on my homepage at https://cc.oulu.fi/~scoats]

---

### Background

- Renaissance in corpus-based study of English varieties <span class="small">(Nerbonne 2009; Szmrecsanyi 2011, 2013; Grieve et al.
2019)</span>
- Some corpora of transcribed spoken English have limited availability, are small in size, or lack sufficient geographical granularity to make inferences about regional distributions of features

.small[
Corpus |Location(s) |nr_words| Reference
----------------------|-------------------|--------|--------------------------
ICE-Aus | Australia |~600k | Cassidy et al. 2012
Monash Corpus | Melbourne |~96k | Bradshaw et al. 2010
Griffith Corpus | Brisbane |~32k | Cassidy et al. 2012
Wellington Corpus | NZ |~1m | Holmes et al. 1998
ONZE Corpus | NZ |? | Gordon et al. 2007
]

- Automatic Speech Recognition (ASR) transcripts are available online for speech from specific locations
- Videos from local councils and other government entities can be harvested to create large corpora

---

### Example video

<iframe width="560" height="315" src="https://www.youtube.com/embed/cn8vWlUae7Y?rel=0&&showinfo=0&cc_load_policy=1&cc_lang_pref=en" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---

### WebVTT file

![](data:image/png;base64,#./Maranoa_webvtt_example.png)

---
exclude: true

### YouTube captions files

- Videos can have multiple captions files: user-uploaded captions, auto-generated captions created using automatic speech recognition (ASR), or both, or neither
- User-uploaded captions can be manually created or generated automatically by 3rd-party ASR software
- Auto-generated captions are generated by YT's speech-to-text service
- CoANZSE (and CoNASE and CoBISE): target YT ASR captions

---

### YouTube ASR Corpora

US, Canada, England, Scotland, Wales, Northern Ireland, the Republic of Ireland, Germany, Australia, and New Zealand

- [CoNASE](https://cc.oulu.fi/~scoats/CoNASE.html): 1.25b token corpus of 301,846 word-timed, part-of-speech-tagged Automatic Speech Recognition (ASR) transcripts <span class="small">(Coats forthcoming a)</span>
- [CoBISE](https://cc.oulu.fi/~scoats/CoBISE.html): 112m
tokens, 452 locations, 38,680 ASR transcripts <span class="small">(Coats 2022b)</span>
- [CoGS](https://cc.oulu.fi/~scoats/CoGS.html): 50.5m tokens, 39.5k transcripts, 1,308 locations <span class="small">(Coats in review)</span>
- [CoANZSE](https://cc.oulu.fi/~scoats/CoANZSE.html): 190m tokens, 57k transcripts, 482 locations

Freely available for research use; download from the Harvard Dataverse ([CoNASE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/X8QJJV), [CoBISE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UGIIWD), [CoGS](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3Y1YVB), [CoANZSE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GW35AK))

---

### Data format

<div>
<table border="1" class="dataframe" style="font-size:8pt;border-collapse: collapse;">
<thead>
<tr style="text-align: right;">
<th></th> <th>country</th> <th>state</th> <th>name</th> <th>channel_name</th> <th>channel_url</th> <th>video_title</th>
<th>video_id</th> <th>upload_date</th> <th>video_length</th> <th>text_pos</th> <th>location</th> <th>latlong</th> <th>nr_words</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td>
<td>Road Resurfacing Video</td> <td>zVr6S5XkJ28</td> <td>20181127</td> <td>146.120</td>
<td>g_NNP_2.75 'day_XX_2.75 my_PRP$_3.75 name_NN_4.53 is_VBZ_4.74 ...</td>
<td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>433</td>
</tr>
<tr>
<th>1</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td>
<td>Weather update 5pm 1 March 2022 - Mayor Matt Gould</td> <td>p4MjirCc1oU</td> <td>20220301</td> <td>181.959</td>
<td>hi_UH_0.64 guys_NNS_0.96 i_PRP_1.439 'm_VBP_1.439 just_RB_1.76 ...</td>
<td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>620</td>
</tr>
<tr>
<th>2</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td>
<td>Transport Capital Works Video</td> <td>DXlkVTcmeho</td> <td>20180417</td> <td>140.450</td>
<td>council_NNP_0.53 is_VBZ_1.53 placing_VBG_1.65 is_VBZ_2.07 2018-19_CD_2.57 ...</td>
<td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>347</td>
</tr>
<tr>
<th>3</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td>
<td>Council Meeting Wrap Up February 2022</td> <td>2NhuhF2fBu8</td> <td>20220224</td> <td>107.840</td>
<td>g_NNP_0.399 'day_NNP_0.399 guys_NNS_0.799 and_CC_1.12 welcome_JJ_1.199 ...</td>
<td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>341</td>
</tr>
<tr>
<th>4</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td>
<td>CITY DEAL 4 March 2018</td> <td>4-cv69ZcwVs</td> <td>20180305</td> <td>130.159</td>
<td>[Music]_XX_0.85 it_PRP_2.27 's_VBZ_2.27 a_DT_3.27 fantastic_JJ_3.36 ...</td>
<td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>420</td>
</tr>
</tbody>
</table>
</div>

---

### Focus on regional and local council channels

Many recordings of meetings of elected councillors: advantages in terms of representativeness and comparability

- Speaker place of residence (cf.
videos collected based on place-name search alone)
- Topical contents and communicative contexts comparable
- In most jurisdictions government content is in the public domain

---

### Data collection and processing

- Identification of relevant channels (lists of councils with web pages -> scrape pages for links to YouTube)
- Inspection of returned channels to remove false positives
- Retrieval of ASR transcripts using [YT-DLP](https://github.com/yt-dlp/yt-dlp)
- Geocoding: String containing council name + address + country sent to Google's geocoding service
- PoS tagging with spaCy <span class="small">(Honnibal et al. 2019)</span>

---

### CoANZSE channel locations

.small[Circle size corresponds to channel size in number of words]

<div class="container">
<iframe src="https://cc.oulu.fi/~scoats/anz_dot2.html" style="width: 100%; height: 450px;" style="max-width = 100%" sandbox="allow-same-origin allow-scripts" scrolling="yes" seamless="seamless" frameborder="0" align="middle"></iframe>
</div>

---

### CoANZSE corpus size by country/state/territory

.small[
Territory |nr_channels|nr_videos |nr_words|video_length (h)
----------------------------|---|-------|-----------|----
Australian Capital Territory| 8 |650 |915,542 |111.79
New South Wales |114|9,741 |27,580,773 |3,428.87
Northern Territory |11 | 289 |315,300 |48.72
New Zealand |74 |18,029 |84,058,661 |10,175.80
Queensland |58 |7,356 |19,988,051 |2,642.75
South Australia |50 |3,537 |13,856,275 |1,716.72
Tasmania |21 |1,260 |5,086,867 |636.99
Victoria |78 |12,138 |35,304,943 |4,205.40
Western Australia |68 |3,815 |8,422,484 |1,063.78
| | | |
Total |482|56,815 |195,528,896|24,030.82
]

---
exclude: true

### Potential analyses

- Non-numerical quantifiers *heaps* and *lots*

---
exclude: true

### Corpus use cases and size

- Regional language (dialectology): e.g.
syntax, mood and modality
- Pragmatics: Turn-taking, politeness markers
- Script pipeline: Use corpus to identify areas/speakers/words/phonemes of interest, get videos, convert to audio (FFMpeg), automated formant extraction/vowel quality analysis on a large scale

.small[.pull-left[
**CoNASE**

Country | Channels|Videos|Tokens |Length (h)
--------------|---------|------|-----------|-----------
US |2,189 |270,931 |1,149,030,824 | 141,455.11
Canada | 383 |30,916 |103,035,369 |12,586.77

**CoANZSE** (coming soon)

Country | Channels|Videos|Tokens |Length (h)
--------------|---------|------|-----------|-----------
Australia |408 |38,786 |111,470,235 | 13,855.02
New Zealand | 74 |18,029 |84,058,661 |10,175.80
]]

<div style="top:-40px">
.small[.pull-right[
**CoBISE**

Country | Channels|Videos|Tokens |Length (h)
-------------------|---------|------|-----------|-----------
England |324 |23,657|72,879,173 |8,518.39
Northern Ireland | 10 |1,898 |6,508,505 |774.17
Republic of Ireland| 26 |2,525 |6,264,276 |680.81
Scotland |75 |8,135 |17,111,396 |1,845.35
Wales |18 |2,465 |8,800,264 |982.66

**CoGS** (coming soon)

Country | Channels|Videos|Tokens |Length (h)
--------------|---------|------|-----------|-----------
Germany |1,313 |39,495 |50,554,070 | 7,223.44
]]
</div>

---

### Example analysis: Double modals

- Rare, non-standard syntactic feature <span class="small">(Montgomery & Nagle 1994; Coats 2022a)</span>
- *I might could help you with this*
- Occurs only in the American Southeast and in Scotland/Northern England/Northern Ireland?
- Most studies based on non-naturalistic data with limited geographical scope <span class="small">(data from linguistic atlas interviews, surveys administered mostly in American Southeast and North of Britain)</span>
- More widely used in North America and the British Isles than previously thought (Coats 2022a, Coats in review)
- Little studied in Australian and New Zealand speech

---

### Script: Generating a table for manual inspection of double modals

- Base modals *will, would, can, could, might, may, must, should, shall, used to, 'll, ought to, oughta*
- Script to generate regexes of two-tier combinations

```python
import re
from itertools import permutations
import pandas as pd

# coanzse_df: the corpus data frame (columns as on the "Data format" slide)
# Assumed construction of the modal pairs; multiword forms ("used to", "ought to")
# span two tokens in text_pos and would need their own patterns
base_modals = ["will", "would", "can", "could", "might", "may", "must", "should", "shall", "'ll", "oughta"]
modals = list(permutations(base_modals, 2))  # ordered two-modal combinations

hits = []
for first, second in modals:
    # Tokens have the form word_TAG_time; (?<!\S) anchors the match at a token start
    pat = re.compile(r"(?<!\S)(" + re.escape(first) + r"_\w+_\S+\s+" + re.escape(second) + r"_\w+_\S+)\s", re.IGNORECASE)
    for _, row in coanzse_df.iterrows():
        for tok1, tok2 in (m.split() for m in pat.findall(row["text_pos"])):
            seq = tok1.split("_")[0] + " " + tok2.split("_")[0]
            start = max(0, round(float(tok1.split("_")[-1]) - 3))  # 3 s before the hit
            hits.append((row["country"], row["channel_name"], seq,
                         "https://youtu.be/" + row["video_id"] + "?t=" + str(start)))
pd.DataFrame(hits, columns=["country", "channel_name", "sequence", "url"])
```

- The script creates a URL for each search hit at a time 3 seconds before the targeted utterance
- In the resulting data frame, each utterance can be annotated after examining the targeted video sequence
- Filter out non-double-modals (clause overlap, speaker self-repairs, ASR errors)

---
class: small

### Excerpt from generated table
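---

### Sketch: reading `text_pos` tokens

The search script relies on the `word_TAG_time` token format shown on the Data format slide. The following is a minimal sketch of how that format might be unpacked, not the code used to build the corpus: `Token` and `parse_text_pos` are hypothetical helper names, and the time value is assumed to be an offset in seconds (as the `?t=` URLs in the script suggest).

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str
    tag: str
    time: float  # assumed: offset in seconds from the start of the video


def parse_text_pos(text_pos: str) -> List[Token]:
    """Split a text_pos string into (word, tag, time) tokens."""
    tokens = []
    for item in text_pos.split():
        # Split from the right so that a word containing "_" stays intact
        word, tag, time = item.rsplit("_", 2)
        tokens.append(Token(word, tag, float(time)))
    return tokens


tokens = parse_text_pos("hi_UH_0.64 guys_NNS_0.96 i_PRP_1.439 'm_VBP_1.439 just_RB_1.76")
print([t.word for t in tokens])  # ['hi', 'guys', 'i', "'m", 'just']
```

Splitting from the right keeps tags such as `PRP$` and bracketed items such as `[Music]` intact, since only the last two underscores are treated as separators.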
---
exclude: True

### Finding features

- Regular-expression-search and manual annotation approach
- Double modals can be found in the US North and West and in Canada; in Scotland, N. Ireland, and N. England, but also in the English Midlands and South and in Wales <span class="small">(Coats in review)</span>
- Also in Australia and (especially) New Zealand!

---

### Training a classifier on the basis of common word types

- Simple machine-learning classifiers using SVM, logistic regression, or other algorithms can distinguish between Australian and NZ transcripts on the basis of the 500 most common words in CoANZSE

<br><br>

<style type="text/css">
.tg {border-collapse:collapse;border-color:#aaa;border-spacing:0;}
.tg td{background-color:#fff;border-color:#aaa;border-style:solid;border-width:0px;color:#333;
  font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{background-color:#f38630;border-color:#aaa;border-style:solid;border-width:0px;color:#fff;
  font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-0lax{text-align:left;vertical-align:top}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
<tr>
<th class="tg-0lax"></th>
<th class="tg-0pky">Precision</th>
<th class="tg-0pky">Recall</th>
<th class="tg-0pky">F1</th>
<th class="tg-0pky">Support</th>
<th class="tg-0pky">Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-0lax">Australia</td>
<td class="tg-baqh">0.82</td>
<td class="tg-baqh">0.90</td>
<td class="tg-baqh">0.86</td>
<td class="tg-baqh">1,359</td>
<td rowspan="2" align="center">0.80</td>
</tr>
<tr>
<td class="tg-0lax">New Zealand</td>
<td class="tg-baqh">0.74</td>
<td class="tg-baqh">0.59</td>
<td class="tg-baqh">0.66</td>
<td class="tg-baqh">641</td>
</tr>
</tbody>
</table>

---

### Pipeline for acoustic analysis

- Regular expressions to
target specific words/phrases in the corpus
- Extract audio span containing the targeted item(s) from YT stream
- Feed audio and transcript excerpt to forced aligner
- Extract desired sounds/acoustic phenomena

---

### Extracted *today* tokens

<iframe width="800" height="500" src="https://cc.oulu.fi/~scoats/coanzse_today.html" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" scrolling="no" allowfullscreen></iframe>

---

### Average <span style="font-family:serif;font-weight:bolder">eɪ</span> diphthong

![:scale 70%](data:image/png;base64,#./eY_coanzse.png)

---

### A few caveats

- Videos of local government not representative of speech in general
- ASR errors (mean WER after filtering ~14%), quality of transcript related to quality of audio as well as dialect features <span class="small">(Tatman 2017; Meyer et al. 2020; Markl & Lai 2021)</span>
- Low-frequency phenomena: manually inspect corpus hits
- High-frequency phenomena: signal of correct transcriptions will be stronger <span class="small">(Agarwal et al. 2007)</span> → classifiers

---
exclude: true

### Summary and outlook

- Large corpus of ASR transcripts from YouTube channels of local governments in Australia/NZ
- Possibly useful for corpus studies of spoken language, dialectology, pragmatics
- Double modals are more widespread than has previously been documented

---

# Thank you!

---

### References

.small[
.hangingindent[
Agarwal, S., Godbole, S., Punjani, D., & Roy, S. (2007). [How much noise is too much: A study in automatic text classification](https://doi.org/10.1109/ICDM.2007.21). In: *Seventh IEEE International Conference on Data Mining (ICDM 2007)*, 3–12.

Bradshaw, J., Burridge, K., & Clyne, M. (2010). The Monash Corpus of Spoken Australian English. In L. de Beuzeville & P. Peters (Eds.), *Proceedings of the 2008 Conference of the Australian Linguistics Society*.

Cassidy, S., Haugh, M., Peters, P., & Fallu, M. (2012).
The Australian National Corpus: National infrastructure for language resources. In *Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)*, 3295–3299. http://www.lrec-conf.org/proceedings/lrec2012/pdf/400_Paper.pdf

Coats, S. (In review). Double modals in contemporary British and Irish speech.

Coats, S. (Forthcoming). Dialect corpora from YouTube. In B. Busse, N. Dumrukcic, & I. Kleiber (Eds.), *Language and linguistics in a complex world*. De Gruyter.

Coats, S. (2022a). [Naturalistic double modals in North America](https://doi.org/10.1215/00031283-9766889). *American Speech*.

Coats, S. (2022b). The Corpus of British Isles Spoken English (CoBISE): A new resource of contemporary British and Irish speech. In K. Berglund, M. La Mela, & I. Zwart (Eds.), [*Proceedings of the 6th Digital Humanities in the Nordic and Baltic Countries Conference, Uppsala, Sweden, March 15–18, 2022*](http://ceur-ws.org/Vol-3232/paper15.pdf), 187–194. Aachen, Germany: CEUR.

Gordon, E., Maclagan, M., & Hay, J. (2007). The ONZE corpus. In J. C. Beal, K. P. Corrigan, & H. Moisl (Eds.), *Creating and digitizing language corpora volume 2: Diachronic databases*, 82–104. Palgrave Macmillan.

Grieve, J., Montgomery, C., Nini, A., Murakami, A., & Guo, D. (2019). [Mapping lexical dialect variation in British English using Twitter](https://doi.org/10.3389/frai.2019.00011). *Frontiers in Artificial Intelligence* 2.
]]

---

### References II

.small[
.hangingindent[
Holmes, J., Vine, B., & Johnson, G. (1998). [*Guide to the Wellington Corpus of Spoken New Zealand English*](http://korpus.uib.no/icame/manuals/WSC/INDEX.HTM).

Honnibal, M. et al. (2019). [Explosion/spaCy v2.1.7: Improved evaluation, better language factories and bug fixes](https://doi.org/10.5281/zenodo.3358113).

Markl, N. & Lai, C. (2021). [Context-sensitive evaluation of automatic speech recognition: considering user experience & language variation](https://aclanthology.org/2021.hcinlp-1.6).
In: *Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing*, 34–40. Association for Computational Linguistics.

Meyer, J., Rauchenstein, L., Eisenberg, J. D., & Howell, N. (2020). [Artie bias corpus: An open dataset for detecting demographic bias in speech applications](https://aclanthology.org/2020.lrec-1.796). In: *Proceedings of the 12th Language Resources and Evaluation Conference*, 6462–6468. European Language Resources Association.

Montgomery, M. B. & Nagle, S. J. (1994). Double modals in Scotland and the Southern United States: Trans-atlantic inheritance or independent development? *Folia Linguistica Historica* 14, 91–108.

Nerbonne, J. (2009). Data-driven dialectology. *Language and Linguistics Compass* 3, 175–198.

Szmrecsanyi, B. (2011). Corpus-based dialectometry: A methodological sketch. *Corpora* 6, 45–76.

Szmrecsanyi, B. (2013). *Grammatical variation in British English dialects: A study in corpus-based dialectometry*. Cambridge University Press.

Tatman, R. (2017). [Gender and dialect bias in YouTube’s automatic captions](https://aclanthology.org/W17-1606). In: *Proceedings of the First ACL Workshop on Ethics in Natural Language Processing*, 53–59. Association for Computational Linguistics.
]]