class: inverse, center, middle
background-image: url(data:image/png;base64,#https://cc.oulu.fi/~scoats/oululogoRedTransparent.png); background-repeat: no-repeat; background-size: 80px 57px; background-position: right top;
exclude: true

---
class: title-slide

<br><br><br><br><br>
.pull-right[
<span style="font-family:Roboto Condensed;font-size:24pt;font-weight: 900;font-style: normal;float:right;text-align: right;color:white;-webkit-text-fill-color: black;-webkit-text-stroke: 0.8px;">The Corpus of Australian and New Zealand Spoken English (CoANZSE)</span>
]

<br><br><br><br>
<p style="float:right;text-align: right;color:white;font-weight: 700;font-style: normal;-webkit-text-fill-color: black;-webkit-text-stroke: 0.5px;">
Steven Coats<br>
English, University of Oulu, Finland<br>
<a href="mailto:steven.coats@oulu.fi">steven.coats@oulu.fi</a><br>
Workshop on Language Corpora in Australia<br>
July 3rd, 2023<br>
</p>

---
layout: true

<div class="my-header"><img border="0" alt="W3Schools" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats                               CoANZSE | WLCA 2023, Canberra</span></div>

---
exclude: true

<div class="my-header"><img border="0" alt="W3Schools" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats                               CoANZSE | WLCA 2023, Canberra</span></div>

## Outline

1. Background, YouTube ASR captions files, data collection and processing
2. CoANZSE overview
3. Examples: Double modals, acoustic analysis pipeline
4. Caveats, summary

.footnote[Slides for the presentation are on my homepage at https://cc.oulu.fi/~scoats]

<div class="my-header"><img border="0" alt="W3Schools" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats                               CoANZSE | WLCA 2023, Canberra</span></div>

---
### Background

- Vast amounts of streamed content are available online
- Use of automatic speech recognition (ASR) transcripts is ubiquitous
- Technical protocols for streaming (DASH, HLS): data accessible via HTTP

Possible to create specialized corpora for specific locations/topics/speech genres

- Transcripts (**CoANZSE**, CoNASE, CoBISE, CoGS)
  - Analysis of grammar/syntax, lexis, pragmatics, discourse
- Audio
  - Analysis of phonetic and prosodic variation
- Video
  - Analysis of multimodal communication

---
### Example video

<iframe width="560" height="315" src="https://www.youtube.com/embed/cn8vWlUae7Y?rel=0&&showinfo=0&cc_load_policy=1&cc_lang_pref=en" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---
### WebVTT file

![](data:image/png;base64,#./Maranoa_webvtt_example.png)

---
### YouTube captions files

- Videos can have multiple captions files: user-uploaded captions, YouTube's ASR captions, or both, or neither
- User-uploaded captions may be manually created or generated automatically by 3rd-party ASR software
- CoANZSE (and CoNASE, CoBISE, CoGS): target YT ASR captions
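
---
### Reading a WebVTT captions file (sketch)

YouTube's ASR captions are delivered as WebVTT files like the one shown above. As a minimal illustration (the CoANZSE processing scripts may differ), cue timings and text can be read with the `webvtt-py` package; the file name here is a placeholder.

```python
# pip install webvtt-py
import webvtt

# Iterate over the cues of a downloaded captions file (placeholder file name)
for caption in webvtt.read("Maranoa_example.en.vtt"):
    # Each cue has a start time, an end time, and the caption text
    print(caption.start, caption.end, caption.text)
```

Word-level timestamps (the basis of the per-word timing in the corpus) are embedded in the cue payload as `<00:00:02.220>` tags and require additional parsing.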
---
### CoANZSE and other YouTube ASR Corpora

Corpus of Australian and New Zealand Spoken English - [CoANZSE](https://cc.oulu.fi/~scoats/CoANZSE.html): Processed ASR captions from 56k transcripts, collected from 478 Australian and New Zealand YouTube channels of local or district councils, 196m word tokens corresponding to 24,007 hours of video from 2007–2022 <span class="small">(Coats 2023a)</span>

Corpus of North American Spoken English - [CoNASE](https://cc.oulu.fi/~scoats/CoNASE.html): 302k transcripts, 2,572 channels, 1.29b tokens <span class="small">(Coats 2023c, also available with a searchable online interface: https://lncl6.lawcorpus.byu.edu)</span>

Corpus of Britain and Ireland Spoken English - [CoBISE](https://cc.oulu.fi/~scoats/CoBISE.html): 39k transcripts, 497 channels, 112m tokens <span class="small">(Coats 2022b)</span>

Corpus of German Speech - [CoGS](https://cc.oulu.fi/~scoats/CoGS.html): 39.5k transcripts, 1,313 channels, 50.5m tokens <span class="small">(Coats in review)</span>

All are freely available for research use; download from the Harvard Dataverse ([CoNASE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/X8QJJV), [CoBISE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UGIIWD), [CoGS](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3Y1YVB), [CoANZSE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GW35AK))

---
### Focus on council channels

Content consists of recordings of council meetings, news announcements, interviews, cultural events, etc.

Advantages in terms of representativeness and comparability

- Speaker place of residence is known (cf. videos collected based on place-name search alone)
- Topical contents and communicative contexts are comparable
- Government content is non-profit: can be used under "fair dealing"/"fair use" provisions of copyright law (e.g. Australian Copyright Act 1968, U.S.C. Title 17)

---
### Data collection and processing

- Identification of relevant channels (lists of councils with web pages ➡ scrape pages for links to YouTube)
- Inspection of returned channels to remove false positives
- Retrieval of ASR transcripts using [yt-dlp](https://github.com/yt-dlp/yt-dlp)
- Geocoding: a string containing council name + address + country is sent to Google's geocoding service
- PoS tagging with spaCy <span class="small">(Honnibal et al. 2019)</span>
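
---
### Retrieving ASR captions with yt-dlp (sketch)

A minimal sketch of the caption-retrieval step, using yt-dlp's Python API rather than the exact CoANZSE collection scripts; the channel URL and output template are placeholders.

```python
from yt_dlp import YoutubeDL

# Download only YouTube's automatic (ASR) captions, not the media itself
opts = {
    "skip_download": True,        # no audio/video, captions only
    "writeautomaticsub": True,    # request the auto-generated (ASR) track
    "subtitleslangs": ["en"],     # English captions
    "subtitlesformat": "vtt",     # WebVTT output
    "outtmpl": "%(id)s.%(ext)s",  # file name pattern (placeholder)
    "ignoreerrors": True,         # skip videos without ASR captions
}

with YoutubeDL(opts) as ydl:
    # A channel or playlist URL is expanded to its individual videos
    ydl.download(["https://www.youtube.com/c/wollondillyshire"])
```

The same options are available on the command line (`yt-dlp --skip-download --write-auto-sub --sub-lang en --sub-format vtt URL`).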
---
### CoANZSE data format

<div>
<table border="1" class="dataframe" style="font-size:8pt;border-collapse: collapse;">
<thead>
<tr style="text-align: left;"> <th></th> <th>country</th> <th>state</th> <th>name</th> <th>channel_name</th> <th>channel_url</th> <th>video_title</th> <th>video_id</th> <th>upload_date</th> <th>video_length</th> <th>text_pos</th> <th>location</th> <th>latlong</th> <th>nr_words</th> </tr>
</thead>
<tbody>
<tr> <th>0</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Road Resurfacing Video</td> <td>zVr6S5XkJ28</td> <td>20181127</td> <td>146.120</td> <td>g_NNP_2.75 'day_XX_2.75 my_PRP$_3.75 name_NN_4.53 is_VBZ_4.74 ...</td> <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>433</td> </tr>
<tr> <th>1</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Weather update 5pm 1 March 2022 - Mayor Matt Gould</td> <td>p4MjirCc1oU</td> <td>20220301</td> <td>181.959</td> <td>hi_UH_0.64 guys_NNS_0.96 i_PRP_1.439 'm_VBP_1.439 just_RB_1.76 ...</td> <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>620</td> </tr>
<tr> <th>2</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Transport Capital Works Video</td> <td>DXlkVTcmeho</td> <td>20180417</td> <td>140.450</td> <td>council_NNP_0.53 is_VBZ_1.53 placing_VBG_1.65 is_VBZ_2.07 2018-19_CD_2.57 ...</td> <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>347</td> </tr>
<tr> <th>3</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Council Meeting Wrap Up February 2022</td> <td>2NhuhF2fBu8</td> <td>20220224</td> <td>107.840</td> <td>g_NNP_0.399 'day_NNP_0.399 guys_NNS_0.799 and_CC_1.12 welcome_JJ_1.199 ...</td> <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>341</td> </tr>
<tr> <th>4</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>CITY DEAL 4 March 2018</td> <td>4-cv69ZcwVs</td> <td>20180305</td> <td>130.159</td> <td>[Music]_XX_0.85 it_PRP_2.27 's_VBZ_2.27 a_DT_3.27 fantastic_JJ_3.36 ...</td> <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>420</td> </tr>
</tbody>
</table>
</div>

---
exclude: true

### Potential analyses

- Non-numerical quantifiers *heaps* and *lots*

---
### CoANZSE corpus size by country/state/territory

.small[
Location |nr_channels|nr_videos |nr_words|video_length (h)
----------------------------|---|-------|-----------|----
Australian Capital Territory| 8 |650 |915,542 |111.79
New South Wales |114|9,741 |27,580,773 |3,428.87
Northern Territory |11 | 289 |315,300 |48.72
New Zealand |74 |18,029 |84,058,661 |10,175.80
Queensland |58 |7,356 |19,988,051 |2,642.75
South Australia |50 |3,537 |13,856,275 |1,716.72
Tasmania |21 |1,260 |5,086,867 |636.99
Victoria |78 |12,138 |35,304,943 |4,205.40
Western Australia |68 |3,815 |8,422,484 |1,063.78
 | | | |
Total |482|56,815 |195,528,896|24,030.82
]

---
### CoANZSE channel locations

.small[Circle size corresponds to channel size in number of words]

<div class="container">
<iframe src="https://cc.oulu.fi/~scoats/anz_dot2.html" style="width: 100%; height: 450px; max-width: 100%;" sandbox="allow-same-origin allow-scripts" scrolling="yes" seamless="seamless" frameborder="0" align="middle"></iframe>
</div>
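
---
### Working with the text_pos field (sketch)

Each transcript is stored as a string of `word_TAG_timestamp` triples (see the data format table above). A minimal sketch of how such a string can be unpacked; the file name and delimiter for a locally saved copy of the corpus are placeholders.

```python
import pandas as pd

# Placeholder file name/format for a downloaded copy of the corpus
coanzse_df = pd.read_csv("coanzse_sample.tsv", sep="\t")

def parse_text_pos(text_pos):
    """Split a text_pos string into (word, pos, seconds) triples."""
    triples = []
    for token in text_pos.split():
        # rsplit from the right: the word itself may contain underscores
        word, pos, ts = token.rsplit("_", 2)
        triples.append((word, pos, float(ts)))
    return triples

first = parse_text_pos(coanzse_df.loc[0, "text_pos"])
print(first[:3])  # e.g. [("g", "NNP", 2.75), ("'day", "XX", 2.75), ...]
```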
---
exclude: true

### ASR transcript and audio quality metric

- The quality of ASR transcripts can be evaluated by using a language model trained on a very large set of ASR transcripts generated for the same audio files at different rates of compression <span class="small">(Yuksel et al. 2023)</span>

.pull-left[
.small[
*5 ASR transcripts generated from the same video*

rank |compression|quality|hypothetical ASR excerpt
--------|-----------|-------|-------------------------
1 | none | best |it's really fantastic that we
2 | little | good | it's really fantastic we
3 | medium | middle| it's really fantasy with
4 | high | poor | it rifle fantasy that wonder
5 | most | worst | Ik reed met fantasie
]]
.pull-right[
.large[
<br><br>
➡️ language model ➡️ classification of transcripts/audio
]]

<br><br>

- Applied with an adapted PyTorch model <span class="small">(https://huggingface.co/aixplain/NoRefER)</span>
- Assigns a numerical rating from 0 (very bad ASR/audio) to 1 (excellent ASR/audio)

---
exclude: true

### Corpus use cases: Syntax/grammar/pragmatics

- Regional variation in syntax, mood and modality
- Lexical items
- Contractions
- Hortatives/commands/interjections
- Pragmatics: Turn-taking, politeness markers
- Multidimensional analysis à la Biber
- Typological comparison at country/state/regional level

---
### Example analysis: Double modals

- Non-standard, rare syntactic feature<span class="small"> (Montgomery & Nagle 1994; Coats 2022a)</span>
- *I might could help you with this*
- Occurs only in the American Southeast and in Scotland/Northern England/Northern Ireland?
- Most studies based on non-naturalistic data with limited geographical scope <span class="small">(data from linguistic atlas interviews, surveys administered mostly in the American Southeast and the North of Britain)</span>
- More widely used in North America and the British Isles than previously thought <span class="small">(Coats 2022a, Coats 2023b)</span>
- Little studied in Australian and New Zealand speech

.verysmall[
]

---
exclude: true

### Script: Generating a table for manual inspection of double modals

- Base modals *will, would, can, could, might, may, must, should, shall, used to, 'll, ought to, oughta*
- Script to generate regexes of two-tier combinations

```python
import re
import pandas as pd

# All ordered pairs of base modals (two-tier combinations)
base_modals = ["will", "would", "can", "could", "might", "may", "must",
               "should", "shall", "used to", "'ll", "ought to", "oughta"]
modals = [(m1, m2) for m1 in base_modals for m2 in base_modals if m1 != m2]

hits = []
for x in modals:
    # Match "modal1_TAG_time modal2_TAG_time" sequences in the tagged transcripts
    pat1 = re.compile("("+x[0]+"_\\w+_\\S+\\s+"+x[1]+"_\\w+_\\S+\\s)", re.IGNORECASE)
    for i, y in coanzse_df.iterrows():
        finds = pat1.findall(y["text_pos"])
        for z in finds:
            # Recover the two word forms and the timestamp of the first modal
            seq = z.split()[0].split("_")[0].strip()+" "+z.split()[1].split("_")[0].strip()
            time = z.split()[0].split("_")[-1]
            # Build a URL 3 seconds before the hit, for manual checking
            hits.append((y["country"], y["channel_name"], seq,
                         "https://youtu.be/"+y["video_id"]+"?t="+str(round(float(time)-3))))
pd.DataFrame(hits)
```

- The script creates a URL for each search hit at a time 3 seconds before the targeted utterance
- In the resulting data frame, each utterance can be annotated after examining the targeted video sequence
- Filter out non-double-modals (clause overlap, speaker self-repairs, ASR errors)

---
exclude: true
class: small

### Excerpt from generated table
---
exclude: true

### Finding features

- Regular-expression-search and manual annotation approach
- Double modals can be found in the US North and West and in Canada; in Scotland, N. Ireland, and N. England, but also in the English Midlands and South and in Wales <span class="small">(Coats in review)</span>
- Also in Australia and (especially) New Zealand!

---
exclude: true

### Training a classifier on the basis of common word types

- Simple machine-learning classifiers using SVM, logistic regression, or other algorithms can distinguish between Australian and NZ transcripts on the basis of the 500 most common words in CoANZSE

<br><br>

<style type="text/css">
.tg {border-collapse:collapse;border-color:#aaa;border-spacing:0;}
.tg td{background-color:#fff;border-color:#aaa;border-style:solid;border-width:0px;color:#333; font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{background-color:#f38630;border-color:#aaa;border-style:solid;border-width:0px;color:#fff; font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-0lax{text-align:left;vertical-align:top}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
<tr> <th class="tg-0lax"></th> <th class="tg-0pky">Precision</th> <th class="tg-0pky">Recall</th> <th class="tg-0pky">F1</th> <th class="tg-0pky">Support</th> <th class="tg-0pky">Accuracy</th> </tr>
</thead>
<tbody>
<tr> <td class="tg-0lax">Australia</td> <td class="tg-baqh">0.82</td> <td class="tg-baqh">0.90</td> <td class="tg-baqh">0.86</td> <td class="tg-baqh">1,359</td> <td rowspan="2" align="center">0.80</td> </tr>
<tr> <td class="tg-0lax">New Zealand</td> <td class="tg-baqh">0.74</td> <td class="tg-baqh">0.59</td> <td class="tg-baqh">0.66</td> <td class="tg-baqh">641</td> </tr>
</tbody>
</table>
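
---
exclude: true

### Classifier sketch (illustrative)

A minimal sketch of this kind of classifier with scikit-learn; the feature set (500 most frequent word types) follows the slide above, but the column names, label values, and choice of logistic regression are illustrative rather than the exact setup behind the table.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Plain-text transcripts and country labels derived from the corpus dataframe
texts = coanzse_df["text_pos"].str.replace(r"_\S+", "", regex=True)  # strip tags/timestamps
labels = coanzse_df["country"]                                       # e.g. "AUS" / "NZ" (values illustrative)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0)

# 500 most frequent word types as features, logistic regression as classifier
clf = make_pipeline(CountVectorizer(max_features=500), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```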
---
### CoANZSE audio data (work in progress)

- Cut YouTube transcripts into 20-word chunks
- Using transcript timing information and the DASH manifest, extract audio segment for each chunk with yt-dlp
- Feed audio and transcript excerpts to Montreal Forced Aligner <span class="small">(McAuliffe et al. 2017)</span>
  - <span class="small">Grapheme to phoneme dictionary, pronunciation dictionary: US ARPAbet</span>
  - <span class="small">Acoustic model: from Librispeech Corpus (Panayotov et al. 2015)</span>
  - <span class="small">Language model: MFA English 2.0.0</span>
- Output is Praat TextGrids
- Get features of interest from TextGrids + audio chunks with Parselmouth <span class="small">(Python port of Praat functions; Jadoul et al. 2018)</span>
- Analyze phenomena of interest (formants, voice onset time, pitch, etc.)
- Currently 30m vowels, 130m measurements

---
### Pipeline for acoustic analysis

![:scale 50%](data:image/png;base64,#./Github_phonetics_pipeline_screenshot.png)

- A Jupyter notebook that collects transcripts and audio from YouTube, aligns the transcripts, and extracts vowel formants
- Click your way through the process in a Colab environment
- Can be used for any language that has ASR transcripts
- With a few script modifications, also works for manual transcripts

https://github.com/stcoats/phonetics_pipeline

---
### Example: Excerpt from a video of the City of Adelaide

<span class="small">(former mayor Sandy Verschoor, https://www.youtube.com/watch?v=f-GX8-qszPE)</span>

<iframe width="500" height="400" controls src="https://cc.oulu.fi/~scoats/Sandy_example.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" scrolling="no" sandbox allowfullscreen></iframe>

---
### Pipeline for acoustic analysis: Vowel formants

For each transcript/audio pair in the collection:

- Send transcript + audio to Montreal Forced Aligner <span class="small">(McAuliffe et al. 2017)</span>; output is Praat TextGrids <span class="small">(Boersma & Weenink 2023)</span>
- Select features of interest using TextGrid timings and Parselmouth <span class="small">(Python port of Praat functions; Jadoul et al. 2018)</span>

<pre style="font-size:11px">were raised by councillors which discussed [oʊ]<br/>a broad range of topics and issues of<br />particular note was the further promotion</pre>

<audio controls id="player1" preload="none" name="media">
<source src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/OdhGckWy5Dw_0001358500014315_17.wav" type="audio/wav">
</audio>
<audio controls id="player2" preload="none" name="media">
<source src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/OdhGckWy5Dw_0001358500014315_17_vw.wav" type="audio/wav">
</audio>

<img src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/adelaide_praat_example2.png" width="600px" class="center">

---
### Formants: F1/F2 values for a single utterance

.pull-left[
<iframe src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/example_Adelaide_sandy.html" height="500" width="500" class="center"></iframe>
]
.pull-right[
- The script makes 9 F1/F2 measurements per token, at quantiles of the token duration
- Circles are individual measurement points
- The line represents the formant trajectory for a single token
- Retain segments for which at least 5 measurements were possible
]

---
### Formants: F1/F2 values for a single location (filtered)

.pull-left[
<iframe src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/example_Adelaide_t4.html" height="500" width="500" class="center"></iframe>
]
.pull-right[
- Sample of [oʊ] realizations from the City of Adelaide channel
- Retain tokens for which at least 5 measurements were possible
- This visualization filters out segments shorter than 100 milliseconds in duration
]

---
### Formants: Mean values

.pull-left[
![:scale 100%](data:image/png;base64,#./adelaide_formant_plot2.png)
]
.pull-right[
- Mean values for a single video, a single channel, a single location, etc.
- Circle locations represent the average value for that duration quantile (subscript)
- Circle size is proportional to the number of measurements for that quantile (formant values are more likely to be measurable in the middle of the vowel than at the beginning/end)
]
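
---
### Formant measurement with Parselmouth (sketch)

A minimal sketch of the kind of Parselmouth call that could produce F1/F2 values for an aligned vowel interval. The file name and interval times are placeholders; in the pipeline they come from the MFA TextGrids.

```python
import numpy as np
import parselmouth

snd = parselmouth.Sound("chunk_audio.wav")             # one 20-word audio chunk (placeholder)
formants = snd.to_formant_burg(maximum_formant=5500)   # Praat's Burg formant tracker

# Vowel interval boundaries from the forced alignment (placeholder values)
start, end = 1.358, 1.512

# Nine measurement points spread over the token duration
times = np.linspace(start, end, 9)
f1 = [formants.get_value_at_time(1, t) for t in times]  # NaN where no value could be measured
f2 = [formants.get_value_at_time(2, t) for t in times]
print(list(zip(np.round(f1), np.round(f2))))
```

Tokens with fewer than five valid (non-NaN) measurements are discarded, as on the preceding slides.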
---
### GOAT vowel

- First target of /oʊ/ is more back and closed in South Australia compared to other Australian locations <span class="small">(Butcher 2007, Cox & Palethorpe 2019)</span>

---
#### Average F1 and F2 values for the first targets of the diphthongs /eɪ/, /aɪ/, /oʊ/, and /aʊ/, spatial autocorrelation <span class="small">(2,339,812 vowel tokens)</span>

<iframe width="800" height="500" src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/coanzse_diph_formants_WA_NT_SA_TAS.html" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" scrolling="no" allowfullscreen></iframe>

<span style="float: right; width:20%;">- Locations with at least 100 tokens<br>- Getis-Ord Gi* values based on a 20-nearest neighbor binary spatial weights matrix<br>- Only SA, WA, NT in this visualization (other states still being downloaded)</span>

---
exclude: true

### Comparison <small>(Grieve, Speelman & Geeraerts 2013, p. 37)</small>

.pull-left[
![](data:image/png;base64,#./Grieve_et_al_2013_eY.png)
]
.pull-right[
- Grieve et al. (2013) used a similar technique to analyze formant measurements from the *Atlas of North American English* (Labov et al. 2006)
- ANAE contains approximately 134,000 vowel measurements in total
]

---
exclude: true

### Multimodality

- Use regular expressions to search corpus
- Extract video as well as audio
- Manually or automatically analyze:
  - Gesture
  - Posture/body/head inclination
  - Facial expression
  - Handling of objects
  - Touching
  - (etc.)

---
exclude: true

### 'Heaps of' in Australian English

<iframe width="800" height="600" controls src="https://cc.oulu.fi/~scoats/heaps_of_CoANZSE_excerpt.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" scrolling="no" sandbox allowfullscreen></iframe>

---
exclude: true

### Extracted *today* tokens

<iframe width="800" height="500" src="https://cc.oulu.fi/~scoats/coanzse_today.html" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" scrolling="no" allowfullscreen></iframe>

---
exclude: true

### Average <span style="font-family:serif;font-weight:bolder">eɪ</span> diphthong

![:scale 70%](data:image/png;base64,#./eY_coanzse.png)
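
---
### Spatial autocorrelation sketch (Getis-Ord Gi*)

The hot-spot values in the map above are Getis-Ord Gi* statistics computed over channel locations. A minimal sketch with PySAL, assuming a GeoDataFrame `locations_gdf` with point geometries and a mean formant value per location; the column names are illustrative, not the exact variables used.

```python
from libpysal.weights import KNN
from esda.getisord import G_Local

# 20-nearest-neighbour binary spatial weights over the channel locations
w = KNN.from_dataframe(locations_gdf, k=20)

# Gi* (star=True includes each location in its own neighbourhood)
gi_star = G_Local(locations_gdf["f2_first_target_mean"], w,
                  transform="B", star=True, permutations=999)

locations_gdf["gi_z"] = gi_star.Zs      # z-scores: hot spots (+) and cold spots (-)
locations_gdf["gi_p"] = gi_star.p_sim   # pseudo p-values from the permutations
```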
---
### A few caveats

- Videos of local government are not representative of speech in general
- ASR errors (mean WER after filtering ~14%); quality of transcript related to quality of audio as well as dialect features <span class="small">(Tatman 2017; Meyer et al. 2020; Markl & Lai 2021)</span>
- Low-frequency phenomena: manually inspect corpus hits
- High-frequency phenomena: signal of correct transcriptions will be stronger <span class="small">(Agarwal et al. 2007)</span> → classifiers
- Machine learning model to identify higher quality transcripts/audio <span class="small">(Yuksel et al. 2023)</span>
- MFA pronunciation dictionary and acoustic model: US English models might fail for some features (rhoticity)? <span class="small">BUT see Gonzalez et al. (2020), MacKenzie and Turton (2020)</span>
- Need to analyze error rates of forced alignment
- Diarization, speaker demographic information

---
### Summary and outlook

- CoANZSE is a large corpus of ASR transcripts from YouTube channels of local governments in AUS and NZ
- It can be used for studies of regional variation in grammar, syntax, and discourse
- CoANZSE audio can be used for studies of phonetic variation: multivariate spatial analysis of vowel formants in Australian English

---
# Thank you!

### Please feel free to download and use the corpus!

---
### References

.small[
.hangingindent[
Agarwal, S., Godbole, S., Punjani, D. & Roy, S. (2007). [How much noise is too much: A study in automatic text classification](https://doi.org/10.1109/ICDM.2007.21). In *Seventh IEEE International Conference on Data Mining (ICDM 2007)*, 3–12.

Boersma, P. & Weenink, D. (2023). Praat: doing phonetics by computer. Version 6.3.09. http://www.praat.org

Coats, S. (2023a). CoANZSE: [The Corpus of Australian and New Zealand Spoken English: A new resource of naturalistic speech transcripts](https://doi.org/10.2478/plc-2022-13). In P. Parameswaran, J. Biggs & D. Powers (Eds.), *Proceedings of the 20th Annual Workshop of the Australasian Language Technology Association*, 1–5. Australasian Language Technology Association.

Coats, S. (2023b). [Double modals in contemporary British and Irish speech](https://doi.org/10.1017/S1360674323000126). *English Language and Linguistics*.

Coats, S. (2023c). [Dialect corpora from YouTube](https://doi.org/10.1515/9783111017433-005). In B. Busse, N. Dumrukcic & I. Kleiber (Eds.), *Language and linguistics in a complex world*, 79–102. Walter de Gruyter.

Coats, S. (2022a). [Naturalistic double modals in North America](https://doi.org/10.1215/00031283-9766889). *American Speech*.

Coats, S. (2022b). [The Corpus of British Isles Spoken English (CoBISE): A new resource of contemporary British and Irish speech](http://ceur-ws.org/Vol-3232/paper15.pdf). In K. Berglund, M. La Mela & I. Zwart (Eds.), *Proceedings of the 6th Digital Humanities in the Nordic and Baltic Countries Conference, Uppsala, Sweden, March 15–18, 2022*, 187–194. CEUR.

Gonzalez, S., Grama, J. & Travis, C. (2020). [Comparing the performance of forced aligners used in sociophonetic research](https://doi.org/10.1515/lingvan-2019-0058). *Linguistics Vanguard*, 5.

Honnibal, M. et al. (2019). [Explosion/spaCy v2.1.7: Improved evaluation, better language factories and bug fixes](https://doi.org/10.5281/zenodo.3358113).
]]

---
### References II

.small[
.hangingindent[
Jadoul, Y., Thompson, B. & de Boer, B. (2018). Introducing Parselmouth: A Python interface to Praat. *Journal of Phonetics*, 71, 1–15. https://doi.org/10.1016/j.wocn.2018.07.001

MacKenzie, L. & Turton, D. (2020). [Assessing the accuracy of existing forced alignment software on varieties of British English](https://doi.org/10.1515/lingvan-2018-0061). *Linguistics Vanguard*, 6.

Markl, N. & Lai, C. (2021). [Context-sensitive evaluation of automatic speech recognition: considering user experience & language variation](https://aclanthology.org/2021.hcinlp-1.6). In *Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing*, 34–40. Association for Computational Linguistics.

McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M. & Sonderegger, M. (2017). Montreal Forced Aligner: Trainable text-speech alignment using Kaldi.
In *Proceedings of the 18th Conference of the International Speech Communication Association*.

Meyer, J., Rauchenstein, L., Eisenberg, J. D. & Howell, N. (2020). [Artie Bias Corpus: An open dataset for detecting demographic bias in speech applications](https://aclanthology.org/2020.lrec-1.796). In *Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 2020*, 6462–6468. European Language Resources Association.

Montgomery, M. B. & Nagle, S. J. (1994). Double modals in Scotland and the Southern United States: Trans-atlantic inheritance or independent development? *Folia Linguistica Historica*, 14, 91–108.

Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. (2015). [Librispeech: An ASR corpus based on public domain audio books](https://doi.org/10.1109/ICASSP.2015.7178964). In *Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, 5206–5210.

Tatman, R. (2017). [Gender and dialect bias in YouTube's automatic captions](https://aclanthology.org/W17-1606). In *Proceedings of the First ACL Workshop on Ethics in Natural Language Processing*, 53–59. Association for Computational Linguistics.

Yuksel, K. A., Ferreira, T., Javadi, G., El-Badrashiny, M. & Gunduz, A. (2023). [NoRefER: A referenceless quality metric for Automatic Speech Recognition via semi-supervised language model fine-tuning with contrastive learning](https://arxiv.org/abs/2306.12577). arXiv:2306.12577 [cs.CL].
]]