class: inverse, center, middle
background-image: url(https://cc.oulu.fi/~scoats/oululogoRedTransparent.png)
background-repeat: no-repeat
background-size: 80px 57px
background-position: right top
exclude: true

---
class: title-slide

<br><br><br><br><br>
.pull-right[
<span style="font-family:Roboto Condensed;font-size:24pt;font-weight: 900;font-style: normal;float:right;text-align: right;color:white;-webkit-text-fill-color: black;-webkit-text-stroke: 0.8px;">Corpora for the study of multimodal variation in English: Acoustic analysis from CoNASE</span>
]
<br><br><br><br>
<p style="float:right;text-align: right;color:white;font-weight: 700;font-style: normal;-webkit-text-fill-color: black;-webkit-text-stroke: 0.5px;">
Steven Coats<br>
English, University of Oulu, Finland<br>
<a href="mailto:steven.coats@oulu.fi">steven.coats@oulu.fi</a><br>
KTP 2023<br>
May 25th, 2023<br>
</p>

---
layout: true

<div class="my-header"><img border="0" alt="University of Oulu" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;Corpora for Multimodal Variation | KTP 2023, Oulu</span></div>

---
exclude: true

<div class="my-header"><img border="0" alt="University of Oulu" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;Corpora for Multimodal Variation | KTP 2023, Oulu</span></div>

## Outline

1. Background, YouTube ASR captions files, data collection and processing
2. CoNASE, CoBISE, CoANZSE
3. Example: Acoustic analysis pipeline
4. Caveats, summary

.footnote[Slides for the presentation are on my homepage at https://cc.oulu.fi/~scoats]

<div class="my-header"><img border="0" alt="University of Oulu" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;Corpora for Multimodal Variation | KTP 2023, Oulu</span></div>

---
### Background

- Renaissance in corpus-based study of regional English varieties <span class="small">(Nerbonne 2009; Szmrecsanyi 2011, 2013)</span>
- Research data often consist of text sourced from the web and social media <span class="small">(e.g., Davies 2008–; Grieve et al. 2019)</span>
- There are relatively few large, geolocated multimodal corpora containing audio/video as well as transcribed text

.small[
Corpus | Location | # Words | Reference
----------------------|-------------------|--------|--------------------------
Santa Barbara Corpus | US | ~249k | Du Bois et al. 2000–2005
Spoken BNC2014 | UK | ~10m | Love et al. 2017; Brezina et al. 2018
]

- Vast amounts of streamed video data are available online, much of which can be harnessed for linguistic research
- Combining streamed content with Automatic Speech Recognition (ASR) transcripts and geolocation enables:
  - creation of multimodal corpora for specific locations
  - forced alignment for phonetic/prosodic analysis <span class="small">(Coto-Solano et al. 2021)</span>
  - analysis of grammatical, acoustic, pragmatic, and possibly visual properties of naturalistic speech

---
### Example video

<iframe width="560" height="315" src="https://www.youtube.com/embed/WY9RPeXA3pw?rel=0&&showinfo=0&cc_load_policy=1&cc_lang_pref=en" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---
### WebVTT file

![](./WY9RPeXA3pw_vtt.png)

---
### YouTube captions files

- Videos can have multiple captions files: user-uploaded captions, auto-generated captions created with automatic speech recognition (ASR), both, or neither
- User-uploaded captions can be manually created or generated by third-party ASR software
- Auto-generated captions are produced by YouTube's own speech-to-text service
- CoNASE, CoANZSE, and CoBISE target the YT ASR captions

---
### YouTube ASR Corpora

US, Canada, England, Scotland, Wales, Northern Ireland, the Republic of Ireland, Australia, New Zealand, and Germany

- [CoNASE](https://cc.oulu.fi/~scoats/CoNASE.html): 1.25b-token corpus of 301,846 word-timed, part-of-speech-tagged Automatic Speech Recognition (ASR) transcripts <span class="small">(Coats 2023)</span>
- [CoBISE](https://cc.oulu.fi/~scoats/CoBISE.html): 112m tokens, 452 locations, 38,680 ASR transcripts <span class="small">(Coats 2022b)</span>
- [CoANZSE](https://cc.oulu.fi/~scoats/CoANZSE.html): 190m tokens, 482 locations, 57k transcripts <span class="small">(Coats 2022b)</span>

Also [CoGS](https://cc.oulu.fi/~scoats/CoGS.html): 50.5m tokens, 1,308 locations, 39.5k transcripts <span class="small">(Coats in review)</span>

Freely available for research use; download from the Harvard Dataverse ([CoNASE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/X8QJJV), [CoBISE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UGIIWD), [CoGS](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3Y1YVB), [CoANZSE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GW35AK))

---
### Data format

<div>
<table border="1" class="dataframe" style="font-size:8pt;border-collapse: collapse;">
<thead>
<tr style="text-align: right;"> <th></th> <th>country</th> <th>state</th> <th>name</th> <th>channel_name</th> <th>channel_url</th> <th>video_title</th> <th>video_id</th> <th>upload_date</th> <th>video_length</th> <th>text_pos</th> <th>location</th> <th>latlong</th> <th>nr_words</th> </tr>
</thead>
<tbody>
<tr> <th>0</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Road Resurfacing Video</td> <td>zVr6S5XkJ28</td> <td>20181127</td> <td>146.120</td> <td>g_NNP_2.75 'day_XX_2.75 my_PRP$_3.75 name_NN_4.53 is_VBZ_4.74 ...</td> <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>433</td> </tr>
<tr> <th>1</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Weather update 5pm 1 March 2022 - Mayor Matt Gould</td> <td>p4MjirCc1oU</td> <td>20220301</td> <td>181.959</td> <td>hi_UH_0.64 guys_NNS_0.96 i_PRP_1.439 'm_VBP_1.439 just_RB_1.76 ...</td>
<td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>620</td> </tr>
<tr> <th>2</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Transport Capital Works Video</td> <td>DXlkVTcmeho</td> <td>20180417</td> <td>140.450</td> <td>council_NNP_0.53 is_VBZ_1.53 placing_VBG_1.65 is_VBZ_2.07 2018-19_CD_2.57 ...</td> <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>347</td> </tr>
<tr> <th>3</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Council Meeting Wrap Up February 2022</td> <td>2NhuhF2fBu8</td> <td>20220224</td> <td>107.840</td> <td>g_NNP_0.399 'day_NNP_0.399 guys_NNS_0.799 and_CC_1.12 welcome_JJ_1.199 ...</td> <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>341</td> </tr>
<tr> <th>4</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>CITY DEAL 4 March 2018</td> <td>4-cv69ZcwVs</td> <td>20180305</td> <td>130.159</td> <td>[Music]_XX_0.85 it_PRP_2.27 's_VBZ_2.27 a_DT_3.27 fantastic_JJ_3.36 ...</td> <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>420</td> </tr>
</tbody>
</table>
</div>

---
### Focus on regional and local council channels

Many recordings of meetings of elected councillors: advantages in terms of representativeness and comparability

- Speaker place of residence is known (cf. videos collected via place-name search alone)
- Topical contents and communicative contexts are comparable
- In most jurisdictions, government content is in the public domain

---
### Data collection and processing

- Identification of relevant channels (lists of councils with web pages → scrape the pages for links to YouTube)
- Inspection of returned channels to remove false positives
- Retrieval of ASR transcripts using [YT-DLP](https://github.com/yt-dlp/yt-dlp) (see the code sketch on a later slide)
- Geocoding: strings of council name + address + country are sent to Google's geocoding service (sketch on a later slide)
- PoS tagging with spaCy <span class="small">(Honnibal et al. 2019)</span>

---
exclude: true
### Potential analyses

- Non-numerical quantifiers *heaps* and *lots*

---
**CoNASE**

.verysmall[
| State | Channels | Videos | Words | Length (h) | State | Channels | Videos | Words | Length (h) | State | Channels | Videos | Words | Length (h) |
| -------------------- | -------- | ------ | ---------- | ---------- | ------------ | -------- | ------ | ---------- | ---------- | ------------------------- | -------- | ------ | ---------- | ---------- |
| Alabama | 27 | 2,827 | 10,581,345 | 1,315.67 | Michigan | 90 | 9,832 | 51,293,982 | 6,079.47 | Texas | 155 | 21,330 | 44,736,009 | 5,789.44 |
| Alaska | 6 | 451 | 1,854,654 | 248.37 | Minnesota | 80 | 8,666 | 31,366,468 | 3,661.89 | Utah | 21 | 2,561 | 7,766,782 | 940.21 |
| Arizona | 35 | 6,356 | 26,393,272 | 3,063.73 | Mississippi | 18 | 1,448 | 2,613,901 | 346.07 | Vermont | 3 | 94 | 131,558 | 16.62 |
| Arkansas | 14 | 986 | 6,748,658 | 882.77 | Missouri | 53 | 5,093 | 15,094,086 | 1,946.43 | Virginia | 42 | 9,209 | 34,806,149 | 4,059.67 |
| California | 211 | 18,278 | 83,915,246 | 10,146.57 | Montana | 3 | 145 | 926,229 | 143.2 | Washington | 51 | 6,178 | 28,949,403 | 3,387.77 |
| Colorado | 56 | 8,802 | 36,551,218 | 4,299.68 | Nebraska | 16 | 677 | 2,487,171 | 312.51 | W. Virginia | 6 | 101 | 196,479 | 25.86 |
| Connecticut | 25 | 3,731 | 24,549,746 | 3,010.04 | Nevada | 5 | 2,759 | 6,110,915 | 638.06 | Wisconsin | 83 | 9,514 | 45,983,568 | 5,744.59 |
| Delaware | 3 | 148 | 242,073 | 25.45 | N.H. | 11 | 1,305 | 10,913,552 | 1,469.04 | Wyoming | 7 | 251 | 2,638,963 | 348.39 |
| District of Columbia | 3 | 242 | 261,209 | 32.9 | New Jersey | 88 | 6,982 | 29,523,334 | 3,977.57 | Alberta | 95 | 6,623 | 21,239,251 | 2,497.45 |
| Florida | 89 | 17,625 | 64,647,923 | 7,468.48 | New Mexico | 14 | 1,895 | 6,750,477 | 883.1 | British Columbia | 102 | 10,002 | 26,853,481 | 3,246.83 |
| Georgia | 49 | 5,487 | 18,565,796 | 2,421.53 | New York | 97 | 8,037 | 37,560,959 | 4,856.87 | Manitoba | 20 | 3,286 | 2,771,200 | 318.21 |
| Hawaii | 1 | 152 | 123,617 | 15.42 | N. Carolina | 97 | 11,357 | 46,231,979 | 5,781.4 | New Brunswick | 8 | 382 | 2,347,141 | 278.05 |
| Idaho | 11 | 1,547 | 8,747,885 | 1,012.14 | N. Dakota | 10 | 768 | 3,616,363 | 442.05 | Newfoundland and Labrador | 2 | 108 | 186,070 | 29.99 |
| Illinois | 151 | 14,243 | 54,613,612 | 6,725.31 | Ohio | 97 | 7,647 | 33,695,476 | 4,268.46 | Northwest Territories | 3 | 32 | 21,404 | 3.27 |
| Indiana | 46 | 4,017 | 12,958,084 | 1,643.88 | Oklahoma | 19 | 1,977 | 5,271,339 | 643.35 | Nova Scotia | 11 | 332 | 1,229,149 | 148.38 |
| Iowa | 43 | 7,516 | 24,286,940 | 3,072.57 | Oregon | 38 | 2,769 | 15,675,898 | 1,992.84 | Nunavut | 1 | 6 | 1,230 | 0.23 |
| Kansas | 35 | 4,444 | 19,862,293 | 2,504.08 | Pennsylvania | 74 | 6,984 | 32,571,217 | 3,970.32 | Ontario | 112 | 8,404 | 45,970,092 | 5,774.59 |
| Kentucky | 26 | 4,965 | 17,834,978 | 2,092.75 | Rhode Island | 7 | 822 | 3,195,777 | 530.94 | Prince Edward Island | 6 | 753 | 777,772 | 95.87 |
| Louisiana | 16 | 2,018 | 10,500,407 | 1,221.96 | S. Carolina | 24 | 3,894 | 8,716,589 | 1,115.2 | Quebec | 6 | 166 | 486,265 | 60.29 |
| Maine | 12 | 819 | 5,879,165 | 797.01 | S. Dakota | 12 | 1,819 | 18,619,258 | 2,172.97 | Saskatchewan | 10 | 663 | 895,143 | 103.12 |
| Maryland | 32 | 7,373 | 34,009,832 | 4,100.84 | Tennessee | 33 | 7,194 | 43,286,858 | 5,127.52 | Yukon | 7 | 159 | 257,171 | 30.48 |
| Massachusetts | 44 | 17,596 | 11,517,230 | 14,682.19 | | | | | | | | | | |
]

---
### CoNASE channel locations

<div class="container">
<iframe src="https://cc.oulu.fi/~scoats/conase_channel_sizes.html" style="width: 100%; height: 450px; max-width: 100%;" sandbox="allow-same-origin allow-scripts" scrolling="yes" seamless="seamless" frameborder="0" align="middle"></iframe>
</div>

---
exclude: true
### CoBISE

Country | Channels | Videos | Tokens | Length (h)
--------------------|------|--------|-------------|-----------
England | 324 | 23,657 | 72,879,173 | 8,518.39
Northern Ireland | 10 | 1,898 | 6,508,505 | 774.17
Republic of Ireland | 26 | 2,525 | 6,264,276 | 680.81
Scotland | 75 | 8,135 | 17,111,396 | 1,845.35
Wales | 18 | 2,465 | 8,800,264 | 982.66
 | | | |
Total | 453 | 38,680 | 111,563,614 | 12,801.38

---
exclude: true
### CoBISE channel locations

<div class="container">
<iframe src="https://cc.oulu.fi/~scoats/cobise_channel_sizes.html" style="width: 100%; height: 450px; max-width: 100%;" sandbox="allow-same-origin allow-scripts" scrolling="yes" seamless="seamless" frameborder="0" align="middle"></iframe>
</div>

---
exclude: true
### CoANZSE

.small[
Territory | Channels | Videos | Words | Length (h)
----------------------------|-----|--------|-------------|------
Australian Capital Territory | 8 | 650 | 915,542 | 111.79
New South Wales | 114 | 9,741 | 27,580,773 | 3,428.87
Northern Territory | 11 | 289 | 315,300 | 48.72
New Zealand | 74 | 18,029 | 84,058,661 | 10,175.80
Queensland | 58 | 7,356 | 19,988,051 | 2,642.75
South Australia | 50 | 3,537 | 13,856,275 | 1,716.72
Tasmania | 21 | 1,260 | 5,086,867 | 636.99
Victoria | 78 | 12,138 | 35,304,943 | 4,205.40
Western Australia | 68 | 3,815 | 8,422,484 | 1,063.78
 | | | |
Total | 482 | 56,815 | 195,528,896 | 24,030.82
]

---
exclude: true
### CoANZSE channel locations

.small[Circle size corresponds to channel size in number of words]

<div class="container">
<iframe src="https://cc.oulu.fi/~scoats/anz_dot2.html" style="width: 100%; height: 450px; max-width: 100%;" sandbox="allow-same-origin allow-scripts" scrolling="yes" seamless="seamless" frameborder="0" align="middle"></iframe>
</div>

---
exclude: true
### Corpus use cases: Syntax/grammar/pragmatics

- Regional variation in syntax, mood and modality
- Lexical items
- Contractions
- Hortatives/commands/interjections
- Pragmatics: Turn-taking, politeness markers
- Multidimensional analysis à la Biber
- Typological comparison at country/state/regional level

---
### Example analysis: Double modals

- Rare non-standard syntactic feature <span class="small">(Montgomery & Nagle 1994; Coats 2022a)</span>
- *I might could help you with this*
- Occurs only in the American Southeast and in Scotland/Northern England/Northern Ireland?
- Most studies based on non-naturalistic data with limited geographical scope <span class="small">(data from linguistic atlas interviews and surveys, administered mostly in the American Southeast and the North of Britain)</span>
- More widely used in North America and the British Isles than previously thought <span class="small">(Coats 2022a, Coats 2023b)</span>
- Little studied in Australian and New Zealand speech

---
exclude: true
### Script: Generating a table for manual inspection of double modals

- Base modals *will, would, can, could, might, may, must, should, shall, used to, 'll, ought to, oughta*
- Script to generate regexes of two-modal combinations

```python
import re
import pandas as pd

hits = []
for x in modals:  # x is a pair of modal forms, e.g. ("might", "could")
    # match two adjacent word_POS_time tokens, e.g. "might_MD_12.3 could_MD_12.5"
    pat1 = re.compile("(" + x[0] + "_\\w+_\\S+\\s+" + x[1] + "_\\w+_\\S+\\s)", re.IGNORECASE)
    for i, y in coanzse_df.iterrows():
        finds = pat1.findall(y["text_pos"])
        for z in finds:
            # strip the PoS tags and timings to recover the word sequence
            seq = z.split()[0].split("_")[0].strip() + " " + z.split()[1].split("_")[0].strip()
            time = z.split()[0].split("_")[-1]  # timestamp of the first modal
            hits.append((y["country"], y["channel_title"], seq,
                         "https://youtu.be/" + y["video_id"] + "?t=" + str(round(float(time) - 3))))
pd.DataFrame(hits)
```

- The script creates a URL for each search hit at a time 3 seconds before the targeted utterance
- In the resulting data frame, each utterance can be annotated after examining the targeted video sequence
- Filter out non-double-modals (clause overlap, speaker self-repairs, ASR errors)

---
exclude: true
class: small
### Excerpt from generated table
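---
### Code sketch: Retrieving ASR captions with yt-dlp

A minimal sketch of the transcript-retrieval step, using yt-dlp's Python API (the option names are yt-dlp's own; the output template is illustrative, and the video ID is the example video shown earlier):

```python
from yt_dlp import YoutubeDL

# Fetch only YouTube's auto-generated (ASR) English captions as WebVTT,
# skipping the audio/video download itself.
opts = {
    "skip_download": True,        # captions only, no media
    "writeautomaticsub": True,    # the ASR track, not user-uploaded captions
    "subtitleslangs": ["en"],
    "subtitlesformat": "vtt",
    "outtmpl": "%(id)s.%(ext)s",  # yields e.g. WY9RPeXA3pw.en.vtt
}

with YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=WY9RPeXA3pw"])
```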
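---
### Code sketch: Geocoding a channel address

A sketch of the geocoding step, assuming the `googlemaps` client library as the interface to Google's geocoding service (the slides name only the service, not the client; the API key is a placeholder, and the query string follows the corpus's council name + address + country pattern):

```python
import googlemaps

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder key

# council name + address + country, as in the data-format example
res = gmaps.geocode("Wollondilly Shire Council, Menangle St, Picton, Australia")
if res:
    loc = res[0]["geometry"]["location"]
    print(res[0]["formatted_address"], (loc["lat"], loc["lng"]))
```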
---
exclude: true
### Finding features

- Regular-expression search and manual annotation approach
- Double modals can be found in the US North and West and in Canada; in Scotland, N. Ireland, and N. England, but also in the English Midlands and South and in Wales <span class="small">(Coats in review)</span>
- Also in Australia and (especially) New Zealand!

---
exclude: true
### Training a classifier on the basis of common word types

- Simple machine-learning classifiers using SVM, logistic regression, or other algorithms can distinguish between Australian and NZ transcripts on the basis of the 500 most common words in CoANZSE

<br><br>
<style type="text/css">
.tg {border-collapse:collapse;border-color:#aaa;border-spacing:0;}
.tg td{background-color:#fff;border-color:#aaa;border-style:solid;border-width:0px;color:#333; font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{background-color:#f38630;border-color:#aaa;border-style:solid;border-width:0px;color:#fff; font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-0lax{text-align:left;vertical-align:top}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
<tr> <th class="tg-0lax"></th> <th class="tg-0pky">Precision</th> <th class="tg-0pky">Recall</th> <th class="tg-0pky">F1</th> <th class="tg-0pky">Support</th> <th class="tg-0pky">Accuracy</th> </tr>
</thead>
<tbody>
<tr> <td class="tg-0lax">Australia</td> <td class="tg-baqh">0.82</td> <td class="tg-baqh">0.90</td> <td class="tg-baqh">0.86</td> <td class="tg-baqh">1,359</td> <td rowspan="2" align="center">0.80</td> </tr>
<tr> <td class="tg-0lax">New Zealand</td> <td class="tg-baqh">0.74</td> <td class="tg-baqh">0.59</td> <td class="tg-baqh">0.66</td> <td class="tg-baqh">641</td> </tr>
</tbody>
</table>

---
### Pipeline for acoustic analysis (work in progress)

- Regular expressions to target specific words/phrases in the corpora
- Extract audio segments containing the targeted item(s) from the YT stream
- Feed audio and transcript excerpt to a forced aligner
- Extract the desired sounds
- Measure acoustic phenomena of interest (formants, voice onset time, pitch, etc.)

---
### Pipeline for acoustic analysis

![:scale 50%](./Github_phonetics_pipeline_screenshot.png)

- A Jupyter notebook that collects transcripts and audio from YouTube, aligns the transcripts, and extracts vowel formants
- Click your way through the process in a Colab environment
- Can be used for any language that has ASR transcripts
- With a few script modifications, also works for manual transcripts (e.g., for Finnish)

https://github.com/stcoats/phonetics_pipeline

---
### Example: Excerpt from a council meeting in Gallatin, Tennessee (https://www.youtube.com/watch?v=yzjGnz_Rs7I)

<iframe width="500" height="400" src="https://cc.oulu.fi/~scoats/yzjGnz_Rs7I_have_a_great_day_on_that.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" scrolling="no" sandbox allowfullscreen></iframe>

---
### Pipeline for acoustic analysis: Vowel formants

For each transcript/video in the collection:

- Regular expressions to search for words with [eɪ]
- yt-dlp to download audio segments in a window around the target word
- Feed the segments (audio and corresponding transcript segment) to the Montreal Forced Aligner (McAuliffe et al. 2017); output is Praat TextGrids (Boersma & Weenink 2023)
- Select vowel(s) of interest using TextGrid timings and Parselmouth (Python interface to Praat functions; Jadoul et al. 2018)

<pre style="font-size:12px">have a great d**ay** on that [eɪ]</pre>

<audio controls preload="none">
<source src="https://cc.oulu.fi/~scoats/yzjGnz_Rs7I_have%20a%20great%20day%20on%20that.wav" type="audio/wav">
</audio>
<audio controls preload="none">
<source src="https://cc.oulu.fi/~scoats/yzjGnz_Rs7I_have_a_great_day_on_that_vw.wav" type="audio/wav">
</audio>

<img src="https://cc.oulu.fi/~scoats/yzjGnz_Rs7I_have%20a%20great%20day%20on%20that_TextGrid_praat.png" width="600px" class="center">

---
### Formants: F1/F2 values for a single utterance

.pull-left[
<iframe src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/TN/example_Gallatin.html" height="500px" width="500px" class="center"></iframe>
]
.pull-right[
- 9 measurements per segment in order to capture the trajectory of the vowel sound
- Retain segments for which at least 5 measurements were possible
]

---
### Formants: F1/F2 values for a single location (filtered)

.pull-left[
<iframe src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/TN/example_Gallatin_all.html" height="500px" width="500px" class="center"></iframe>
]
.pull-right[
- 9 measurements per segment in order to capture the trajectory of the vowel sound
- Retain segments for which at least 5 measurements were possible
- This visualization filters out segments that do not have the typical shape of the [eɪ] diphthong
]

---
### Formants: Values for a single location

.pull-left[
<img src="https://cc.oulu.fi/~scoats/Hendersonville_TN_v2.png" width="590px" class="center">
]
.pull-right[
- Circle locations represent the average value for that duration quantile (subscript)
- Circle size is proportional to the number of measurements for that quantile (formant values are more likely to be measurable in the middle of the vowel than at its beginning/end)
]

---
## Average F1 and F2 values for the nuclei of the diphthongs /eɪ/, /aɪ/, /oʊ/, and /aʊ/, spatial autocorrelation <span class="small">(12,931,728 vowel tokens)</span>

<iframe width="800" height="500" src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/conase_formants.html" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" scrolling="no" allowfullscreen></iframe>

<span style="float: right; width:20%;">- Locations with at least 100 tokens<br>- Getis-Ord Gi* values based on a 20-nearest-neighbor binary spatial weights matrix</span>

---
### Comparison <small>(Grieve, Speelman & Geeraerts 2013, p. 37)</small>

.pull-left[
![](./Grieve_et_al_2013_eY.png)
]
.pull-right[
- Grieve et al. (2013) used a similar technique to analyze formant measurements from the *Atlas of North American English* (Labov et al. 2006)
- The ANAE contains approximately 134,000 vowel measurements in total
]

---
### Multimodality

- Use regular expressions to search the corpus
- Extract video as well as audio
- Manually or automatically analyze:
  - Gesture
  - Posture/body/head inclination
  - Facial expression
  - Handling of objects
  - Touching
  - (etc.)
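---
### Code sketch: Extracting audio around a target word

A sketch of the segment-download step, assuming yt-dlp's `download_ranges` option; in the corpus, the timestamp would come from the word_POS_time annotation, and the window size and output template here are illustrative:

```python
from yt_dlp import YoutubeDL
from yt_dlp.utils import download_range_func

# 2-second window around a target word at t = 74.3 s (illustrative values)
start, end = 74.3 - 1.0, 74.3 + 1.0

opts = {
    "format": "bestaudio",
    # download only the targeted time range, re-cutting at keyframes
    "download_ranges": download_range_func(None, [(start, end)]),
    "force_keyframes_at_cuts": True,
    "outtmpl": "%(id)s_%(section_start)s.%(ext)s",
    # convert the snippet to WAV for the aligner and Praat
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "wav"}],
}

with YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=yzjGnz_Rs7I"])
```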
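---
### Code sketch: Formant measurement with Parselmouth

A sketch of the measurement step, assuming the vowel's start and end times have already been read off the MFA TextGrid (the boundary values and file name are illustrative). As on the preceding slides, nine equally spaced points are measured per segment and the segment is kept only if at least five succeed:

```python
import numpy as np
import parselmouth

snd = parselmouth.Sound("yzjGnz_Rs7I_have_a_great_day_on_that.wav")
vowel_start, vowel_end = 1.52, 1.71   # [eɪ] boundaries from the TextGrid (illustrative)

# Burg-method formant tracks
formant = snd.to_formant_burg(time_step=0.005, maximum_formant=5500)

# nine equally spaced measurement points across the vowel
points = np.linspace(vowel_start, vowel_end, 9)
track = [(t,
          formant.get_value_at_time(1, t),   # F1 (Hz)
          formant.get_value_at_time(2, t))   # F2 (Hz)
         for t in points]

# retain the segment only if at least five points yielded both formants
valid = [p for p in track if not (np.isnan(p[1]) or np.isnan(p[2]))]
if len(valid) >= 5:
    for t, f1, f2 in valid:
        print(f"{t:.3f} s  F1 = {f1:.0f} Hz  F2 = {f2:.0f} Hz")
```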
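---
### Code sketch: Getis-Ord Gi* for per-location formant means

A sketch of the spatial-autocorrelation step with `libpysal` and `esda`, matching the settings on the map slide (locations with at least 100 tokens, a binary 20-nearest-neighbor weights matrix); the input file and column names are hypothetical:

```python
import geopandas as gpd
from libpysal.weights import KNN
from esda.getisord import G_Local

# one row per location: mean F1 of the /eɪ/ nucleus, token count, point geometry
gdf = gpd.read_file("conase_locations.geojson")          # hypothetical file
gdf = gdf[gdf["n_tokens"] >= 100].reset_index(drop=True) # >= 100 vowel tokens

# binary spatial weights from the 20 nearest neighbors
w = KNN.from_dataframe(gdf, k=20)

# Gi* z-scores: positive = hot spot (high F1), negative = cold spot
gi_star = G_Local(gdf["mean_F1_eY"].values, w, transform="B", star=True)
gdf["gi_z"] = gi_star.Zs
gdf["gi_p"] = gi_star.p_sim   # pseudo p-values from permutations
```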
---
### 'Heaps of' in Australian English

<iframe width="800" height="600" src="https://cc.oulu.fi/~scoats/heaps_of_CoANZSE_excerpt.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" scrolling="no" sandbox allowfullscreen></iframe>

---
exclude: true
### Extracted *today* tokens

<iframe width="800" height="500" src="https://cc.oulu.fi/~scoats/coanzse_today.html" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" scrolling="no" allowfullscreen></iframe>

---
exclude: true
### Average <span style="font-family:serif;font-weight:bolder">eɪ</span> diphthong

![:scale 70%](./eY_coanzse.png)

---
### A few caveats

- Videos of local government are not representative of speech in general
- ASR errors (mean WER after filtering ~14%); transcript quality depends on audio quality as well as on dialect features <span class="small">(Tatman 2017; Meyer et al. 2020; Markl & Lai 2021)</span>
- Low-frequency phenomena: manually inspect corpus hits
- High-frequency phenomena: the signal of correct transcriptions will be stronger <span class="small">(Agarwal et al. 2007)</span> → classifiers
- Need to analyze error rates of forced alignment

---
### Summary and outlook

- Large corpora of ASR transcripts from YouTube channels of local governments
- Corpus studies of regional variation in spoken language: dialectology, pragmatics, phonetics, gestures
- Large-scale studies of phonetic variation

---
# Thank you!

---
### References

.small[
.hangingindent[
Agarwal, S., Godbole, S., Punjani, D. & Roy, S. (2007). [How much noise is too much: A study in automatic text classification](https://doi.org/10.1109/ICDM.2007.21). In *Seventh IEEE International Conference on Data Mining (ICDM 2007)*, 3–12.
Boersma, P. & Weenink, D. (2023). Praat: Doing phonetics by computer. Version 6.3.09. http://www.praat.org
Brezina, V., Love, R. & Aijmer, K. (2018). Corpus linguistics and sociolinguistics: Introducing the Spoken BNC2014. In V. Brezina, R. Love & K. Aijmer (Eds.), *Corpus approaches to contemporary British speech: Sociolinguistic studies of the Spoken BNC2014*, 3–9. Routledge.
Coats, S. (2023b). [Double modals in contemporary British and Irish speech](https://doi.org/10.1017/S1360674323000126). *English Language and Linguistics*.
Coats, S. (2023c). [Dialect corpora from YouTube](https://doi.org/10.1515/9783111017433-005). In B. Busse, N. Dumrukcic & I. Kleiber (Eds.), *Language and linguistics in a complex world*, 79–102. Walter de Gruyter.
Coats, S. (2022a). [Naturalistic double modals in North America](https://doi.org/10.1215/00031283-9766889). *American Speech*.
Coats, S. (2022b). [The Corpus of British Isles Spoken English (CoBISE): A new resource of contemporary British and Irish speech](http://ceur-ws.org/Vol-3232/paper15.pdf). In K. Berglund, M. La Mela & I. Zwart (Eds.), *Proceedings of the 6th Digital Humanities in the Nordic and Baltic Countries Conference, Uppsala, Sweden, March 15–18, 2022*, 187–194. CEUR.
Coto-Solano, R., Stanford, J. N. & Reddy, S. K. (2021). [Advances in completely automated vowel analysis for sociophonetics: Using end-to-end speech recognition systems with DARLA](https://doi.org/10.3389/frai.2021.662097). *Frontiers in Artificial Intelligence*, Section Language and Computation.
Davies, M. (2008–). [The Corpus of Contemporary American English (COCA)](https://www.english-corpora.org/coca/).
Du Bois, J. W., Chafe, W. L., Meyer, C., Thompson, S. A., Englebretson, R. & Martey, N. (2000–2005). *Santa Barbara corpus of spoken American English*, Parts 1–4. Linguistic Data Consortium.
Grieve, J., Montgomery, C., Nini, A., Murakami, A. & Guo, D. (2019). [Mapping lexical dialect variation in British English using Twitter](https://doi.org/10.3389/frai.2019.00011). *Frontiers in Artificial Intelligence* 2.
]]

---
### References II

.small[
.hangingindent[
Grieve, J., Speelman, D. & Geeraerts, D. (2013). [A multivariate spatial analysis of vowel formants in American English](https://doi.org/10.1017/jlg.2013.3). *Journal of Linguistic Geography* 1, 31–51.
Honnibal, M. et al. (2019). [Explosion/spaCy v2.1.7: Improved evaluation, better language factories and bug fixes](https://doi.org/10.5281/zenodo.3358113).
Jadoul, Y., Thompson, B. & de Boer, B. (2018). [Introducing Parselmouth: A Python interface to Praat](https://doi.org/10.1016/j.wocn.2018.07.001). *Journal of Phonetics* 71, 1–15.
Labov, W., Ash, S. & Boberg, C. (2006). *The Atlas of North American English*. Mouton de Gruyter.
Love, R., Dembry, C., Hardie, A., Brezina, V. & McEnery, T. (2017). The Spoken BNC2014: Designing and building a spoken corpus of everyday conversations. In T. McEnery, R. Love & V. Brezina (Eds.), *Compiling and analysing the Spoken British National Corpus 2014* [= *International Journal of Corpus Linguistics* 22(3)], 319–344.
Markl, N. & Lai, C. (2021). [Context-sensitive evaluation of automatic speech recognition: Considering user experience & language variation](https://aclanthology.org/2021.hcinlp-1.6). In *Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing*, 34–40. Association for Computational Linguistics.
McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M. & Sonderegger, M. (2017). Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In *Proceedings of the 18th Conference of the International Speech Communication Association*.
Meyer, J., Rauchenstein, L., Eisenberg, J. D. & Howell, N. (2020). [Artie Bias Corpus: An open dataset for detecting demographic bias in speech applications](https://aclanthology.org/2020.lrec-1.796). In *Proceedings of the 12th Language Resources and Evaluation Conference*, 6462–6468. European Language Resources Association.
Montgomery, M. B. & Nagle, S. J. (1994). Double modals in Scotland and the Southern United States: Trans-Atlantic inheritance or independent development? *Folia Linguistica Historica* 14, 91–108.
Nerbonne, J. (2009). Data-driven dialectology. *Language and Linguistics Compass* 3, 175–198.
Szmrecsanyi, B. (2013). *Grammatical variation in British English dialects: A study in corpus-based dialectometry*. Cambridge University Press.
Szmrecsanyi, B. (2011). Corpus-based dialectometry: A methodological sketch. *Corpora* 6, 45–76.
Tatman, R. (2017). [Gender and dialect bias in YouTube's automatic captions](https://aclanthology.org/W17-1606). In *Proceedings of the First ACL Workshop on Ethics in Natural Language Processing*, 53–59. Association for Computational Linguistics.
]]