class: inverse, center, middle
background-image: url(data:image/png;base64,#https://cc.oulu.fi/~scoats/oululogoRedTransparent.png);
background-repeat: no-repeat;
background-size: 80px 57px;
background-position: right top;
exclude: true

---
class: title-slide

<br><br><br><br><br>
.pull-right[
<span style="font-family:Roboto Condensed;font-size:24pt;font-weight: 900;font-style: normal;float:right;text-align: right;color:yellow;-webkit-text-fill-color: black;-webkit-text-stroke: 0.8px;">A Pipeline for the Large-Scale Acoustic Analysis of Streamed Content</span>
]
<br><br><br><br>
<p style="float:right;text-align: right;color:yellow;font-weight: 700;font-style: normal;-webkit-text-fill-color: black;-webkit-text-stroke: 0.5px;">
Steven Coats<br>
English, University of Oulu, Finland<br>
<a href="mailto:steven.coats@oulu.fi">steven.coats@oulu.fi</a><br>
CMC-Corpora 10, Mannheim <br>
September 15th, 2023<br>
</p>

---
layout: true

<div class="my-header"><img border="0" alt="University of Oulu logo" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats                          Pipeline for Acoustic Analysis | CMC-Corpora 10, Mannheim</span></div>

---
exclude: true

<div class="my-header"><img border="0" alt="University of Oulu logo" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats                          Pipeline for Acoustic Analysis | CMC-Corpora 10, Mannheim</span></div>

## Outline

1. Background
2. yt-dlp
3. Montreal Forced Aligner
4. Praat-Parselmouth
5. Examples: Double modals, acoustic analysis pipeline
6. Caveats, summary

.footnote[Slides for the presentation are on my homepage at https://cc.oulu.fi/~scoats]

<div class="my-header"><img border="0" alt="University of Oulu logo" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats                          Pipeline for Acoustic Analysis | CMC-Corpora 10, Mannheim</span></div>

---
### Background

- Vast amounts of streamed audio, video, and transcript data are available online
- Standard technical protocols for online streaming: DASH and HLS
- Creation of specialized corpora for specific locations/topics/speech genres

1. Transcript corpora from YouTube (or other platforms): CoANZSE, CoNASE, CoBISE, CoGS
    - Analysis of grammar/syntax, lexis, pragmatics, discourse
2. Audio extraction and forced alignment
    - Visualization and analysis of phonetic and prosodic variation
3. Video extraction
    - Analysis of multimodal communication

---
### Pipeline for acoustic analysis

![:scale 50%](data:image/png;base64,#./Github_phonetics_pipeline_screenshot.png)

- A Python Jupyter notebook that collects transcripts and audio from YouTube, aligns the transcripts, and extracts vowel formants
- Click your way through the process in a Google Colab environment
- Can be used for any language that has ASR transcripts
- With a few script modifications, also works for manual transcripts

https://github.com/stcoats/phonetics_pipeline

---
### Component: yt-dlp

.pull-left[
![](data:image/png;base64,#yt-dlp_screenshot.png)
]
.pull-right[
- Open-source fork of YouTube-DL
- When a viewer accesses a video on the YouTube web page, a cryptographic key is generated. Video, audio, and transcript content is then streamed to the browser using this key
- yt-dlp retrieves this key to access the content streams (minimal usage sketch on the next slide)
- Can be used to access any content streamed with the DASH or HLS protocols
]
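
---
### yt-dlp: minimal usage sketch

A minimal sketch of this kind of call via yt-dlp's Python API (illustrative only, not the pipeline notebook itself; the option names are standard yt-dlp options, and the video ID is the City of Adelaide example shown later):

```python
# Fetch the best audio-only stream plus YouTube's auto-generated (ASR)
# captions for a single video; the notebook applies this to many videos.
from yt_dlp import YoutubeDL

opts = {
    "format": "bestaudio/best",     # prefer an audio-only DASH stream
    "writeautomaticsub": True,      # download the ASR captions
    "subtitleslangs": ["en"],
    "subtitlesformat": "vtt",       # WebVTT caption format
    "outtmpl": "%(id)s.%(ext)s",
}

with YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=f-GX8-qszPE"])
```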

---
### Component: Montreal Forced Aligner <span class="small">(McAuliffe et al. 2017)</span>

.pull-left30[
.small[
- Forced alignment aligns a transcript with its audio track, so that the exact start and end times of segments (words, phones) can be determined
- Necessary for the automated analysis of vowel quality and other phonetic features
- MFA may perform better than some other aligners (P2FA, MAUS); a minimal alignment call is sketched on the next slide
]
]
.pull-right70[
![](data:image/png;base64,#mfa_screenshot.png)
]
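
---
### MFA: minimal alignment sketch

A minimal sketch of driving the aligner from Python, assuming MFA 2.x is installed in the environment and using its pretrained US English ARPA dictionary and acoustic model (directory names are placeholders; not the pipeline notebook itself):

```python
# corpus_dir/ is assumed to contain matched .wav and .txt files, one pair
# per 20-word chunk; MFA writes one Praat TextGrid per pair to aligned_dir/.
import subprocess

# one-time download of the pretrained dictionary and acoustic model
subprocess.run(["mfa", "model", "download", "dictionary", "english_us_arpa"], check=True)
subprocess.run(["mfa", "model", "download", "acoustic", "english_us_arpa"], check=True)

# run the alignment
subprocess.run(
    ["mfa", "align", "corpus_dir", "english_us_arpa", "english_us_arpa", "aligned_dir"],
    check=True,
)
```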

---
### Component: Praat-Parselmouth <span class="small">(Jadoul et al. 2018)</span>

- Python interface to Praat, a widely used program for acoustic analysis <span class="small">(Boersma & Weenink 2023)</span>
- Integration into Python simplifies workflows and analysis (basic usage sketched on the next slide)

![:scale 75%](data:image/png;base64,#praat_screenshot.png)
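
---
### Praat-Parselmouth: basic usage sketch

A minimal, illustrative sketch of the Praat analyses Parselmouth exposes in Python (the file name and time point are placeholders):

```python
import parselmouth

snd = parselmouth.Sound("adelaide_chunk.wav")   # hypothetical audio chunk

pitch = snd.to_pitch()                          # Praat "To Pitch..."
f0 = pitch.selected_array["frequency"]          # f0 track in Hz (0 = unvoiced)

formant = snd.to_formant_burg()                 # Praat "To Formant (burg)..."
f1 = formant.get_value_at_time(1, 1.0)          # F1 in Hz at t = 1.0 s
f2 = formant.get_value_at_time(2, 1.0)          # F2 in Hz at t = 1.0 s
```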

---
### Example: CoANZSE Audio

- Cut YouTube transcripts into 20-word chunks (chunking sketched on the next slide)
- Using the transcript timing information and the DASH manifest, extract the audio segment for each chunk with yt-dlp
- Feed audio and transcript excerpts to the Montreal Forced Aligner <span class="small">(McAuliffe et al. 2017)</span>
    - <span class="small">Grapheme-to-phoneme dictionary, pronunciation dictionary: US ARPAbet</span>
    - <span class="small">Acoustic model: from the Librispeech Corpus (Panayotov et al. 2015)</span>
    - <span class="small">Language model: MFA English 2.0.0</span>
- Output is Praat TextGrids
- Get features of interest from the TextGrids + audio chunks with Praat-Parselmouth
- Analyze phenomena of interest (formants, voice onset time, pitch, etc.)
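
---
### CoANZSE Audio: chunking sketch

A minimal sketch of the 20-word chunking step, assuming the WebVTT captions have already been parsed into `(word, start_time)` tuples (the parsing itself is not shown here):

```python
def chunk_words(words, size=20):
    """Group word-timed tokens into `size`-word chunks with start/end times.

    `words` is a list of (word, start_time_in_seconds) tuples.
    """
    chunks = []
    for i in range(0, len(words), size):
        piece = words[i:i + size]
        text = " ".join(w for w, _ in piece)
        start = piece[0][1]
        # end of the chunk = start of the next word, or the last word's start time
        end = words[i + size][1] if i + size < len(words) else piece[-1][1]
        chunks.append({"text": text, "start": start, "end": end})
    return chunks

# chunk["start"] / chunk["end"] can then be used to cut the corresponding
# segment from the DASH audio stream (e.g. with yt-dlp/ffmpeg).
```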

---
### Example: Excerpt from a video of the City of Adelaide <span class="small">(former mayor Sandy Verschoor, https://www.youtube.com/watch?v=f-GX8-qszPE)</span>

<iframe width="500" height="400" controls src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/adelaide_output.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" scrolling="no" sandbox allowfullscreen></iframe>

---
exclude: true
### Example video

<iframe width="560" height="315" src="https://www.youtube.com/embed/cn8vWlUae7Y?rel=0&&showinfo=0&cc_load_policy=1&cc_lang_pref=en" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---
exclude: true
### WebVTT file

![](data:image/png;base64,#./Maranoa_webvtt_example.png)

---
exclude: true
### YouTube captions files

- Videos can have multiple captions files: user-uploaded captions, auto-generated captions created using automatic speech recognition (ASR), or both, or neither
- User-uploaded captions can be manually created or generated automatically by 3rd-party ASR software
- Auto-generated captions are generated by YouTube's speech-to-text service
- CoANZSE, CoNASE, CoBISE: target YouTube ASR captions

---
exclude: true
### CoANZSE and other YouTube ASR Corpora

Corpus of Australian and New Zealand Spoken English
- [CoANZSE](https://cc.oulu.fi/~scoats/CoANZSE.html): 196m tokens, 472 locations, 56k transcripts corresponding to 24,007 hours of video from 2007-2022 <span class="small">(Coats 2023a)</span>

Corpus of North American Spoken English
- [CoNASE](https://cc.oulu.fi/~scoats/CoNASE.html): 1.25b-token corpus of 301,846 word-timed, part-of-speech-tagged Automatic Speech Recognition (ASR) transcripts <span class="small">(Coats 2023c, also available with a searchable online interface: https://lncl6.lawcorpus.byu.edu)</span>

Corpus of British Isles Spoken English
- [CoBISE](https://cc.oulu.fi/~scoats/CoBISE.html): 112m tokens, 452 locations, 38,680 ASR transcripts <span class="small">(Coats 2022b)</span>

Corpus of German Speech
- [CoGS](https://cc.oulu.fi/~scoats/CoGS.html): 50.5m tokens, 1,308 locations, 39.5k transcripts <span class="small">(Coats in review)</span>

All are freely available for research use; download from the Harvard Dataverse ([CoNASE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/X8QJJV), [CoBISE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UGIIWD), [CoGS](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3Y1YVB), [CoANZSE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GW35AK))

---
exclude: true
### Data collection and processing

- Identification of relevant channels (lists of councils with web pages -> scrape the pages for links to YouTube)
- Inspection of returned channels to remove false positives
- Retrieval of ASR transcripts using [yt-dlp](https://github.com/yt-dlp/yt-dlp)
- Geocoding: a string containing council name + address + country is sent to Google's geocoding service
- PoS tagging with spaCy <span class="small">(Honnibal et al. 2019)</span>
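
One way the geocoding step could be implemented (an illustrative sketch, not necessarily the code used for the corpora; assumes the geopy package and a Google API key):

```python
# Resolve a council name + address string to latitude/longitude coordinates.
from geopy.geocoders import GoogleV3

geolocator = GoogleV3(api_key="YOUR_API_KEY")   # placeholder key
loc = geolocator.geocode("Maranoa Regional Council, Roma QLD, Australia")
if loc is not None:
    print(loc.latitude, loc.longitude)
```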

---
exclude: true
### CoANZSE corpus size by country/state/territory

.small[
Location | Channels | Videos | Words | Video length (h)
----------------------------|---|-------|-----------|----
Australian Capital Territory| 8 |650 |915,542 |111.79
New South Wales |114|9,741 |27,580,773 |3,428.87
Northern Territory |11 | 289 |315,300 |48.72
New Zealand |74 |18,029 |84,058,661 |10,175.80
Queensland |58 |7,356 |19,988,051 |2,642.75
South Australia |50 |3,537 |13,856,275 |1,716.72
Tasmania |21 |1,260 |5,086,867 |636.99
Victoria |78 |12,138 |35,304,943 |4,205.40
Western Australia |68 |3,815 |8,422,484 |1,063.78
 | | | |
Total |482|56,815 |195,528,896|24,030.82
]

---
exclude: true
### CoANZSE channel locations

.small[Circle size corresponds to channel size in number of words]

<div class="container">
<iframe src="https://cc.oulu.fi/~scoats/anz_dot2.html" style="width: 100%; height: 450px; max-width: 100%;" sandbox="allow-same-origin allow-scripts" scrolling="yes" seamless="seamless" frameborder="0" align="middle"></iframe>
</div>

---
exclude: true
### ASR transcript and audio quality metric

- The quality of ASR transcripts can be evaluated with a language model trained on a very large set of ASR transcripts generated for the same audio files at different rates of compression <span class="small">(Yuksel et al. 2023)</span>

.pull-left[
.small[
*5 ASR transcripts generated from the same video*

rank |compression|quality|hypothetical ASR excerpt
--------|-----------|-------|-------------------------
1 | none | best |it's really fantastic that we
2 | little | good | it's really fantastic we
3 | medium | middle| it's really fantasy with
4 | high | poor | it rifle fantasy that wonder
5 | most | worst | Ik reed met fantasie
]]
.pull-right[
.large[
<br><br>
➡️ language model ➡️ classification of transcripts/audio
]]

<br><br>

- Applied with an adapted PyTorch model <span class="small">(https://huggingface.co/aixplain/NoRefER)</span>
- Assigns a numerical rating from 0 (very poor ASR/audio) to 1 (excellent ASR/audio)

---
exclude: true
### Corpus use cases: Syntax/grammar/pragmatics

- Regional variation in syntax, mood, and modality
- Lexical items
- Contractions
- Hortatives/commands/interjections
- Pragmatics: Turn-taking, politeness markers
- Multidimensional analysis à la Biber
- Typological comparison at country/state/regional level

---
exclude: true
### Example analysis: Double modals

- Rare non-standard syntactic feature <span class="small">(Montgomery & Nagle 1994; Coats 2022a)</span>
- *I might could help you with this*
- Occurs only in the American Southeast and in Scotland/Northern England/Northern Ireland?
- Most studies are based on non-naturalistic data with limited geographical scope <span class="small">(data from linguistic atlas interviews, surveys administered mostly in the American Southeast and the North of Britain)</span>
- More widely used in North America and the British Isles than previously thought <span class="small">(Coats 2022a, Coats 2023b)</span>
- Little studied in Australian and New Zealand speech

.verysmall[
]

---
exclude: true
### Script: Generating a table for manual inspection of double modals

- Base modals *will, would, can, could, might, may, must, should, shall, used to, 'll, ought to, oughta*
- Script to generate regexes for the two-modal combinations

```python
import re
import pandas as pd

# modals: list of two-modal combinations, e.g. [("might", "could"), ...]
# coanzse_df: data frame with word-timed, PoS-tagged transcripts in "text_pos"
hits = []
for x in modals:
    pat1 = re.compile("("+x[0]+"_\\w+_\\S+\\s+"+x[1]+"_\\w+_\\S+\\s)", re.IGNORECASE)
    for i, y in coanzse_df.iterrows():
        finds = pat1.findall(y["text_pos"])
        if finds:
            for z in finds:
                seq = z.split()[0].split("_")[0].strip()+" "+z.split()[1].split("_")[0].strip()
                time = z.split()[0].split("_")[-1]
                hits.append((y["country"], y["channel_title"], seq,
                             "https://youtu.be/"+y["video_id"]+"?t="+str(round(float(time)-3))))
pd.DataFrame(hits)
```

- The script creates a URL for each search hit at a time 3 seconds before the targeted utterance
- In the resulting data frame, each utterance can be annotated after examining the targeted video sequence
- Filter out non-double-modals (clause overlap, speaker self-repairs, ASR errors)

---
exclude: true
class: small
### Excerpt from generated table

---
exclude: true
### Finding features

- Regular-expression search and manual annotation approach
- Double modals can be found in the US North and West and in Canada; in Scotland, N. Ireland, and N. England, but also in the English Midlands and South and in Wales <span class="small">(Coats in review)</span>
- Also in Australia and (especially) New Zealand!

---
### Vowel formants from underlying audio

For each transcript/audio pair in the collection:

- Send transcript + audio to the Montreal Forced Aligner <span class="small">(McAuliffe et al. 2017)</span>; output is Praat TextGrids <span class="small">(Boersma & Weenink 2023)</span>
- Select features of interest using TextGrid timings and Parselmouth <span class="small">(Python interface to Praat functions; Jadoul et al. 2018)</span>

<pre style="font-size:11px">were raised by councillors which discussed [oʊ]<br/>a broad range of topics and issues of<br />particular note was the further promotion</pre>

<audio controls id="player" autostart="false" preload="none" name="media">
<source src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/OdhGckWy5Dw_0001358500014315_17.wav" type="audio/wav">
</audio>
<audio controls id="player2" autostart="false" preload="none" name="media">
<source src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/OdhGckWy5Dw_0001358500014315_17_vw.wav" type="audio/wav">
</audio>

<img src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/adelaide_praat_example2.png" width="600px" class="center">

---
### Formants: F1/F2 values for a single utterance

.pull-left[
<iframe src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/example_Adelaide_sandy.html" height="500" width="500" class="center"></iframe>
]
.pull-right[
- The script makes 9 F1/F2 measurements per token, at deciles of the token duration (measurement sketch on the next slide)
- Circles are individual measurement points
- The line represents the formant trajectory for a single token
- Retain segments for which at least 5 measurements were possible
]
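
---
### Formants: measurement sketch

A minimal, illustrative sketch of the per-token measurement logic (not the pipeline notebook itself), assuming a Parselmouth `Formant` object for the audio chunk and vowel start/end times taken from the MFA TextGrid; the times below are placeholders:

```python
import numpy as np
import parselmouth

snd = parselmouth.Sound("adelaide_chunk.wav")       # hypothetical audio chunk
formant = snd.to_formant_burg()

def decile_measurements(formant, start, end):
    """Return (time, F1, F2) at the 10% ... 90% points of a vowel interval."""
    points = []
    for k in range(1, 10):                          # deciles 0.1 ... 0.9
        t = start + k * (end - start) / 10
        f1 = formant.get_value_at_time(1, t)
        f2 = formant.get_value_at_time(2, t)
        if not (np.isnan(f1) or np.isnan(f2)):      # Praat returns NaN if undefined
            points.append((t, f1, f2))
    return points

measurements = decile_measurements(formant, 1.23, 1.41)
if len(measurements) >= 5:                          # keep tokens with >= 5 valid points
    pass                                            # add to the measurement data frame
```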

---
### Formants: F1/F2 values for a single location (filtered)

.pull-left[
<iframe src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/example_Adelaide_t4.html" height="500px" width="500px" class="center"></iframe>
]
.pull-right[
- Sample of [oʊ] realizations from the City of Adelaide channel
- Retain tokens for which at least 5 measurements were possible
- This visualization filters out segments shorter than 100 milliseconds in duration
]

---
### Formants: Mean values

.pull-left[
![:scale 100%](data:image/png;base64,#./adelaide_formant_plot2.png)
]
.pull-right[
- Mean values for a single video, a single channel, a single location, etc.
- Circle locations represent the average value for that duration decile (subscript)
- Circle size is proportional to the number of measurements for that decile (formant values are more likely to be obtained in the middle of the vowel than at the beginning/end)
]

---
exclude: true
### GOAT vowel

- First target of /oʊ/ is more back and closed in South Australia compared to other Australian locations <span class="small">(Butcher 2007, Cox & Palethorpe 2019)</span>

---
#### Average F1 and F2 values for the first targets of the diphthongs /eɪ/, /aɪ/, /oʊ/, and /aʊ/, spatial autocorrelation <span class="small">(2,339,812 vowel tokens)</span>

<iframe width="800" height="500" src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/coanzse_diph_formants_WA_NT_SA_TAS.html" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" scrolling="no" allowfullscreen></iframe>

<span style="float: right; width:20%;">- Locations with at least 100 tokens<br>- Getis-Ord Gi* values based on a 20-nearest-neighbor binary spatial weights matrix (computation sketched on the next slide)<br>- Only SA, WA, NT in this visualization (other states still being downloaded)</span>
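
---
### Spatial autocorrelation: Gi* computation sketch

One way to compute the Gi* values, sketched with the PySAL stack (illustrative only, not necessarily the code behind the map; assumes a GeoDataFrame `locs` with one point per location and a column of mean formant values for locations with at least 100 tokens):

```python
import geopandas as gpd
from libpysal.weights import KNN
from esda.getisord import G_Local

locs = gpd.read_file("coanzse_locations.gpkg")      # hypothetical file

w = KNN.from_dataframe(locs, k=20)                  # 20 nearest neighbours
w.transform = "B"                                   # binary spatial weights matrix

gi_star = G_Local(locs["F2_mean"].values, w, transform="B", star=True)
locs["gi_z"] = gi_star.Zs                           # z-scores (hot/cold spots)
locs["gi_p"] = gi_star.p_sim                        # pseudo p-values
```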

---
exclude: true
### Comparison <small>(Grieve, Speelman & Geeraerts 2013, p. 37)</small>

.pull-left[
![](data:image/png;base64,#./Grieve_et_al_2013_eY.png)
]
.pull-right[
- Grieve et al. (2013) used a similar technique to analyze formant measurements from the *Atlas of North American English* (Labov et al. 2006)
- The ANAE contains approximately 134,000 vowel measurements in total
]

---
exclude: true
### Multimodality

- Use regular expressions to search the corpus
- Extract video as well as audio
- Manually or automatically analyze:
    - Gesture
    - Posture/body/head inclination
    - Facial expression
    - Handling of objects
    - Touching
    - (etc.)

---
exclude: true
### 'Heaps of' in Australian English

<iframe width="800" height="600" controls src="https://cc.oulu.fi/~scoats/heaps_of_CoANZSE_excerpt.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" scrolling="no" sandbox allowfullscreen></iframe>

---
exclude: true
### Extracted *today* tokens

<iframe width="800" height="500" src="https://cc.oulu.fi/~scoats/coanzse_today.html" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" scrolling="no" allowfullscreen></iframe>

---
### In development: CoANZSE Audio

.pull-left30[
- 195 million words of Australian and NZ English
- Audio and TextGrids
- coanzse.org
]
.pull-right70[
![](data:image/png;base64,#./coanzse.org_screenshot.png)
]

---
### A few caveats

- ASR errors (mean WER after filtering ~14%); transcript quality depends on audio quality as well as on dialect features <span class="small">(Tatman 2017; Meyer et al. 2020; Markl & Lai 2021)</span>
- Low-frequency phenomena: manually inspect corpus hits
- High-frequency phenomena: the signal of correct transcriptions will be stronger <span class="small">(Agarwal et al. 2007)</span> → classifiers
- Machine learning model to identify higher-quality transcripts/audio <span class="small">(Yuksel et al. 2023)</span>
- MFA pronunciation dictionary and acoustic model: US English models might fail for some features (rhoticity)? <span class="small">BUT see Gonzalez et al. (2020), MacKenzie & Turton (2020)</span>
- Need to analyze error rates of forced alignment
- Diarization, speaker demographic information

---
### Summary and outlook

- Access to online audio data via the DASH and HLS protocols
- The pipeline can retrieve audio data from YouTube or other sites
- Automatic acoustic analysis of vowel formants or other speech properties
- CoANZSE Audio, built with the pipeline, for Australian and New Zealand English

---
# Thank you!

---
### References

.small[
.hangingindent[
Agarwal, S., Godbole, S., Punjani, D. & Roy, S. (2007). [How much noise is too much: A study in automatic text classification](https://doi.org/10.1109/ICDM.2007.21). In: *Seventh IEEE International Conference on Data Mining (ICDM 2007)*, 3–12.

Boersma, P. & Weenink, D. (2023). Praat: doing phonetics by computer. Version 6.3.09. http://www.praat.org

Coats, S. (2023a). CoANZSE: [The Corpus of Australian and New Zealand Spoken English: A new resource of naturalistic speech transcripts](https://doi.org/10.2478/plc-2022-13). In P. Parameswaran, J. Biggs & D. Powers (Eds.), *Proceedings of the 20th Annual Workshop of the Australasian Language Technology Association*, 1–5. Australasian Language Technology Association.

Coats, S. (2023b). [Double modals in contemporary British and Irish speech](https://doi.org/10.1017/S1360674323000126). *English Language and Linguistics*.

Coats, S. (2023c). [Dialect corpora from YouTube](https://doi.org/10.1515/9783111017433-005). In B. Busse, N. Dumrukcic & I. Kleiber (Eds.), *Language and linguistics in a complex world*, 79–102. Walter de Gruyter.

Coats, S. (2022a). [Naturalistic double modals in North America](https://doi.org/10.1215/00031283-9766889). *American Speech*.

Coats, S. (2022b). [The Corpus of British Isles Spoken English (CoBISE): A new resource of contemporary British and Irish speech](http://ceur-ws.org/Vol-3232/paper15.pdf). In K. Berglund, M. La Mela & I. Zwart (Eds.), *Proceedings of the 6th Digital Humanities in the Nordic and Baltic Countries Conference, Uppsala, Sweden, March 15–18, 2022*, 187–194. CEUR.

Gonzalez, S., Grama, J. & Travis, C. (2020). [Comparing the performance of forced aligners used in sociophonetic research](https://doi.org/10.1515/lingvan-2019-0058). *Linguistics Vanguard* 5.

Honnibal, M. et al. (2019). [Explosion/spaCy v2.1.7: Improved evaluation, better language factories and bug fixes](https://doi.org/10.5281/zenodo.3358113).
]]

---
### References II

.small[
.hangingindent[
Jadoul, Y., Thompson, B. & de Boer, B. (2018). Introducing Parselmouth: A Python interface to Praat. *Journal of Phonetics*, 71, 1–15. https://doi.org/10.1016/j.wocn.2018.07.001

MacKenzie, L. & Turton, D. (2020). [Assessing the accuracy of existing forced alignment software on varieties of British English](https://doi.org/10.1515/lingvan-2018-0061). *Linguistics Vanguard* 6.

Markl, N. & Lai, C. (2021). [Context-sensitive evaluation of automatic speech recognition: considering user experience & language variation](https://aclanthology.org/2021.hcinlp-1.6). In: *Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing*, 34–40. Association for Computational Linguistics.

McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M. & Sonderegger, M. (2017). Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In *Proceedings of the 18th Conference of the International Speech Communication Association*.

Meyer, J., Rauchenstein, L., Eisenberg, J. D. & Howell, N. (2020). [Artie Bias Corpus: An open dataset for detecting demographic bias in speech applications](https://aclanthology.org/2020.lrec-1.796). In: *Proceedings of the 12th Language Resources and Evaluation Conference*, 6462–6468. European Language Resources Association.

Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. (2015). [Librispeech: An ASR corpus based on public domain audio books](https://doi.org/10.1109/ICASSP.2015.7178964). In *Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, 5206–5210.

Tatman, R. (2017). [Gender and dialect bias in YouTube’s automatic captions](https://aclanthology.org/W17-1606). In: *Proceedings of the First ACL Workshop on Ethics in Natural Language Processing*, 53–59. Association for Computational Linguistics.

Yuksel, K. A., Ferreira, T., Javadi, G., El-Badrashiny, M. & Gunduz, A. (2023). [NoRefER: A referenceless quality metric for Automatic Speech Recognition via semi-supervised language model fine-tuning with contrastive learning](https://arxiv.org/abs/2306.12577). arXiv:2306.12577 [cs.CL].
]]