class: inverse, center, middle
background-image: url(https://cc.oulu.fi/~scoats/oululogoRedTransparent.png); background-repeat: no-repeat; background-size: 80px 57px; background-position:right top;
exclude: true

---
class: title-slide

<div style="position: absolute; top: 10%; left: 5%; background-color: rgba(255, 255, 255, 0.5); padding: 30px 40px; border-radius: 8px; max-width: 42%; box-shadow: 0px 10px 25px rgba(0,0,0,0.1); text-align: left;">
  <h1 style="font-family: 'Rubik', sans-serif; font-size: 2.4em; font-weight: 700; color: #1a202c; line-height: 1.15; margin: 0 0 20px 0;">
    A Fine-tuned ASR Model for Historical American Dialect Recordings
  </h1>
  <div style="width: 70px; height: 6px; background-color: #901a1e; margin: 0 0 25px 0;"></div>
  <p style="font-family: 'Rubik', sans-serif; font-size: 1.2em; font-weight: 400; color: #1a202c; line-height: 1.5; margin: 0;">
    <strong>Steven Coats</strong><br>
    University of Oulu, Finland<br>
    <a href="mailto:steven.coats@oulu.fi" style="color: #901a1e; text-decoration: none;">steven.coats@oulu.fi</a><br>
    LREC 2026
  </p>
</div>

---
layout: true

<div class="my-header"><img border="0" alt="Oulu logo" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats                               Historical dialect ASR | LREC 2026</span></div>

---
exclude: true

### Outline

1. Background and motivation
2. DASS2019\_NLP dataset
3. Fine-tuning and evaluation
4. Main results
5. 
Error analysis, outlook, and limitations

---

### Motivation

- Modern ASR performs very well on many contemporary speech domains <span class="small"> (Radford et al., 2022; Puvvada et al., 2024; Saon et al., 2025; Peng et al., 2025)</span>
- Performance drops under **domain mismatch**:
  - Conversational speech
  - Dialectal speech
  - Historical recordings
  - Noisy archival audio
- Historical dialect archives contain thousands of hours of American English speech, but much remains untranscribed
- Fine-tuned ASR can make these materials searchable and analyzable at scale

---

### Background: Linguistic Atlas Project, LAGS, and DASS

- Large-scale U.S. dialectological fieldwork beginning in **1929**
- Thousands of interviews documenting regional and social variation in American English
- Major archival resources now housed at the **University of Kentucky** and the **University of Georgia**

.pull-left[
- **Linguistic Atlas of the Gulf States (LAGS)** <span class="small">(Pederson et al., 1986-92)</span>
  - Subproject of the Linguistic Atlas Project (LAP)
  - Fieldwork conducted 1968-1983
  - 1,121 interviews
- **Digital Archive of Southern Speech (DASS)**: 64-informant sample of digitized LAGS interviews <span class="small">(Kretzschmar et al., 2012)</span>
- **DASS2019**: manually transcribed and time-aligned release <span class="small">(Olsen et al., 2017; Kretzschmar et al., 2019)</span>
]

.pull-right[  ]

---
exclude: true

### Related work

- DASS and DASS2019 have already been used for phonetic and dialectological research
- Prior studies examine features such as:
  - The <span class="ipa">/aɪ/</span> diphthong <span class="small">(Olsen et al., 2018)</span>
  - Front vowels in chain shifts <span class="small">(Renwick & Stanley, 2017)</span>
  - The *whine-wine* merger <span class="small">(Bridwell & Renwick, 2024)</span>
- For English, dialect-specific Whisper fine-tuning remains limited, especially for historical American dialect recordings, but also for African American speech <span class="small">(Chang et al., 2024; 
Koenecke et al., 2020; Mojarad & Tang, 2025)</span>

---

### Research questions

1. Can Whisper models be successfully fine-tuned on historical Southern American English recordings?
2. How large are the gains on **in-domain** held-out DASS2019\_NLP data?
3. Do these gains transfer to a subset of the Corpus of Regional African American Language (CORAAL)? <span class="small">(Kendall & Farrington, 2023)</span>
4. What happens on a mismatched read-speech dataset, Common Voice v22? <span class="small">(Ardila et al., 2020)</span>

---

### DASS2019\_NLP

- Starting point: manually transcribed and time-aligned DASS2019 recordings
- Transcripts cleaned and normalized; audio cut into ~20-30 s segments
- 139 speakers, 48,214 segments, 3,084,208 transcribed words, 344.04 hours of speech
- Result: a training resource suitable for fine-tuning multiple Whisper model sizes
- Public release includes:
  - the **DASS2019\_NLP** dataset https://huggingface.co/datasets/stcoats/DASS2019_NLP
  - the best-performing fine-tuned model (**Whisper-large-v3-DASS2019-ct2**) https://huggingface.co/stcoats/whisper-large-v3-DASS2019-ct2

---

### Fine-tuning and evaluation

- Six Whisper model sizes: tiny, base, small, medium, large-v2, large-v3
- Fine-tuned on DASS2019\_NLP; training setup:
  - 3 epochs, batch size 4, learning rate 5×10⁻⁶
  - Weight decay 0.01, linear scheduler
  - BF16 mixed precision, gradient checkpointing
- Trained on the CSC's <img src="./lumi.png" width="60"/> supercomputer (16× MI250X GPUs)
- Data split:
  - 80 / 10 / 10 (train / validation / test, segment-level)
- Evaluation (WER, CER):
  - DASS2019\_NLP test: in-domain
  - CORAAL subset: near-domain
  - Common Voice Southern US: out-of-domain

---

### Evaluation results

.small[
| Model size | Variant    | DASS WER  | DASS CER | CORAAL WER | CORAAL CER | CV Southern US WER | CV Southern US CER |
|------------|------------|-----------|----------|------------|------------|--------------------|--------------------|
| tiny       | OpenAI     | 41.93     | 26.75    | 48.01      | 33.41      | *25.27*            | *13.73*            |
|            | Fine-tuned | **25.72** | **15.71** | **43.17** | **27.91**  | **21.08**          | **8.66**           |
| base       | OpenAI     | 32.09     | 20.96    | 36.67      | 26.99      | ***12.11***        | ***4.65***         |
|            | Fine-tuned | **20.85** | **13.55** | **35.67** | **23.87**  | 16.63              | 6.54               |
| small      | OpenAI     | 22.60     | 14.67    | 27.09      | 19.70      | ***7.58***         | ***2.80***         |
|            | Fine-tuned | **14.28** | **9.03** | **25.46**  | **17.35**  | 9.21               | 3.12               |
| medium     | OpenAI     | 20.17     | 13.46    | 27.96      | 20.01      | ***5.45***         | ***1.89***         |
|            | Fine-tuned | **12.47** | **8.00** | **20.81**  | **14.98**  | 7.15               | 2.55               |
| large-v2   | OpenAI     | 19.64     | 13.14    | 24.54      | 18.96      | ***4.66***         | ***1.62***         |
|            | Fine-tuned | **12.09** | **7.62** | **20.14**  | **14.72**  | 5.72               | 1.93               |
| large-v3   | OpenAI     | 18.50     | 12.18    | 22.99      | 17.17      | **5.05**           | **1.72**           |
|            | Fine-tuned | ***11.83*** | ***7.44*** | ***19.11*** | ***14.24*** | 6.52          | 1.99               |
]

---

### Training curve for held-out DASS2019\_NLP data

---

### Main results

- Held-out **DASS2019\_NLP** data: fine-tuning yields strong gains for all six model sizes
- Average **37% relative reduction in WER** (35-39% per model); the size of the improvement does not depend strongly on model size
- Fine-tuning also improves performance on **CORAAL** for every model size
- On Southern **Common Voice**, fine-tuning hurts performance for every model except tiny
- This pattern is consistent with style and domain mismatch

---
exclude: true

### Training trajectories

- Rapid improvement during the first **~1,000 training steps**
- More gradual gains until roughly **step 5,000**
- Little additional improvement after **~6,000 steps**
- Suggests a relatively stable convergence profile across model sizes

---

### Cross-domain interpretation

- **CORAAL** gains are smaller than in-domain gains
- Transfer may be helped by overlap between Southern American English and African American English
- Southern American English and African American English share important historical and contact-related features, e.g. 
monophthongization, vowel mergers <span class="small">(Thomas, 2007)</span>
- Best CORAAL performance is also achieved by Whisper-large-v3-DASS2019-ct2
- Common Voice: specialization via fine-tuning on conversational historical speech can reduce accuracy on clean read speech

---

### Example 1: Disfluencies retained

<iframe style="width:420px;height:40px;border:none;overflow:hidden;" scrolling="no" srcdoc="
<body style='margin:0;overflow:hidden;background:transparent'>
<audio controls style='width:320px;height:50px;display:block'>
<source src='https://cc.oulu.fi/~scoats/Coats_LREC2026_DASS/863_segment.mp3' type='audio/mpeg'>
</audio>
</body>
">
</iframe>

**Reference (manual transcript)**
Mm-kay. Any, um, things that people used to make to, similar to a bedspread?

**Base model (Whisper-large-v3)**
Okay. Any things that people used to make similar to a bedspread?

**Fine-tuned model (Whisper-large-v3-DASS2019-ct2)**
Mm-kay. Any, um, things that people used to make to similar to a bedspread?

**Key difference**
→ Disfluencies (*mm-kay*, *um*, *to*) preserved

---

### Example 2: Non-standard morphology

<iframe style="width:420px;height:40px;border:none;overflow:hidden;" scrolling="no" srcdoc="
<body style='margin:0;overflow:hidden;background:transparent'>
<audio controls style='width:320px;height:50px;display:block'>
<source src='https://cc.oulu.fi/~scoats/Coats_LREC2026_DASS/856C_segment.mp3' type='audio/mpeg'>
</audio>
</body>
">
</iframe>

**Reference**
&lt;Interviewer&gt;: and uh what other crops did you grow around here?
&lt;856C&gt;: well that's about all I growed was cotton and corn, that's all I grew

**Base model**
And what other crops did you grow around here? Well, that's about all I grew was cotton and corn. That's all I grew.

**Fine-tuned model**
and uh what other crops did you grow around here. well that's about all I growed was cotton and corn that's all I grew.
**Key difference**
→ Non-standard form (*growed*) retained

---

### Example 3: Lexical accuracy

<iframe style="width:420px;height:40px;border:none;overflow:hidden;" scrolling="no" srcdoc="
<body style='margin:0;overflow:hidden;background:transparent'>
<audio controls style='width:320px;height:50px;display:block'>
<source src='https://cc.oulu.fi/~scoats/Coats_LREC2026_DASS/894_segment.mp3' type='audio/mpeg'>
</audio>
</body>
">
</iframe>

**Reference**
&lt;Interviewer&gt;: Say if you had a little cut on your finger a brown liquid medicine you could put on that stains a lot
&lt;894&gt;: Iodine

**Base model**
Say if you had a little cut on your finger, a brown liquid medicine you could put on that stings a lot.<br>
I don’t.

**Fine-tuned model**
Say if you had a little cut on your finger a brown liquid medicine you could put on that stings a lot. Iodine

**Key differences**
→ Correct lexical item recovered (*iodine* vs *I don’t*)<br>
→ The reference itself contains an error: the human transcriber wrote *stains* where the speaker says *stings*

---

### Summary and outlook

- Default Whisper models often normalize toward clearer, more standard-like output
- Fine-tuning improves verbatim accuracy for conversational and dialect speech (important for historical sociolinguistics, where non-standard forms and discourse features are analytically valuable)
- **DASS2019\_NLP** provides a cleaned computational version of an important historical dialect resource
- Fine-tuned Whisper models substantially improve transcription accuracy on in-domain data
- Improvements also transfer to **CORAAL**, but not generally to **read speech**
- Outlook: apply Whisper-large-v3-DASS2019-ct2 to thousands of hours of LAGS recordings; similar approach for other untranscribed LAP data
- Build an integrated pipeline with:
  - Voice activity detection
  - Speaker diarization
  - Possible speaker-role correction <span class="small">(Cheng et al., 2025)</span>

---
exclude: true

### Caveats and limitations

- Recording quality varies considerably 
across archival materials
- Manual transcripts contain some errors, which may affect training and evaluation
- Overlapping speech was only partially represented in the training labels
- Even so, the results show that carefully curated dialect data can substantially improve ASR for historical speech

---

### Thanks for your attention!

#### Acknowledgements

- Supported by the **European Union -- NextGenerationEU** instrument
- Funded by the **Research Council of Finland**, grant **358720**
- Computational resources provided by **Finland's Centre for Scientific Computing**

---

### References

.verysmall[
Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M., & Weber, G. (2020). Common Voice: A massively multilingual speech corpus. https://commonvoice.mozilla.org

Cheng, L., Wang, H., Deng, C., Zheng, S., Chen, Y., Huang, R., Zhang, Q., Chen, Q., Li, X., & Wang, W. (2025). Integrating audio, visual, and semantic information for enhanced multimodal speaker diarization. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics* (pp. 19914–19928). https://aclanthology.org

Kendall, T., & Farrington, C. (2023). *The Corpus of Regional African American Language*. https://doi.org/10.7264/1ad5-6t35

Kretzschmar, W. A., Bounds, P., Hettel, J., Coats, S., Pederson, L., Opas-Hänninen, L. L., Juuso, I., & Seppänen, T. (2012). *Digital Archive of Southern Speech*. Linguistic Data Consortium. https://doi.org/10.35111/5bnt-r659

Kretzschmar, W. A., Renwick, M. E. L., Lipani, L. M., Olsen, M. L., Olsen, R. M., Shi, Y., & Stanley, J. A. (2019). Transcriptions of the Digital Archive of Southern Speech. https://doi.org/10.35111/5bnt-r659

Olsen, R. M., Olsen, M. L., Stanley, J. A., Renwick, M. E. L., & Kretzschmar, W. A. (2017). Methods for transcription and forced alignment of a legacy speech corpus. *Proceedings of Meetings on Acoustics, 30*(1), 060001. 
https://doi.org/10.1121/2.0000654

Pederson, L., McDaniel, S. L., & Adams, C. M. (Eds.). (1986–1992). *Linguistic Atlas of the Gulf States* (Vols. 1–7). University of Georgia Press.

Peng, Y., Shakeel, M., Sudo, Y., Chen, W., Tian, J., Lin, C.-J., & Watanabe, S. (2025). OWSM v4: Improving Open Whisper-style speech models via data scaling and cleaning. In *Proceedings of INTERSPEECH 2025*. https://www.isca-speech.org/archive

Puvvada, K. C., Żelasko, P., Huang, H., Hrinchuk, O., Koluguri, N. R., Dhawan, K., Majumdar, S., Rastorgueva, E., Chen, Z., Lavrukhin, V., Balam, J., & Ginsburg, B. (2024). Less is more: Accurate speech recognition & translation without web-scale data. In *Proceedings of INTERSPEECH 2024* (pp. 3964–3968). https://www.isca-speech.org/archive

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. https://arxiv.org/abs/2212.04356

Renwick, M. E. L., & Stanley, J. A. (2017). Static and dynamic approaches to vowel shifting in the Digital Archive of Southern Speech. *Proceedings of Meetings on Acoustics, 30*(1), 060003. https://doi.org/10.1121/2.0000656

Saon, G., Dekel, A., Brooks, A., Nagano, T., Daniels, A., Satt, A., Mittal, A., Kingsbury, B., Haws, D., Morais, E., Kurata, G., Aronowitz, H., Ibrahim, I., Kuo, J., Soule, K., Lastras, L., Suzuki, M., Hoory, R., Thomas, S., Novitasari, S., Fukuda, T., Sunder, V., Cui, X., & Kons, Z. (2025). Granite-speech: Open-source speech-aware LLMs with strong English ASR capabilities. https://arxiv.org

Thomas, E. R. (2007). Phonological and phonetic characteristics of African American Vernacular English. *Language and Linguistics Compass, 1*(5), 450–475. https://doi.org/10.1111/j.1749-818X.2007.00029.x
]
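
---

### Appendix: recomputing the WER reductions

A minimal sketch, not part of the original analysis, showing how the average WER reduction quoted under Main results can be recomputed from the evaluation table. The WER values are copied from that table. The helper names (`relative_reduction`, `mean_relative_reduction`) are illustrative, and the sketch assumes the reported average is an unweighted mean of the per-model relative reductions.

```python
# WER (%) copied from the evaluation table: model -> (OpenAI baseline, fine-tuned).
DASS_WER = {
    "tiny": (41.93, 25.72),
    "base": (32.09, 20.85),
    "small": (22.60, 14.28),
    "medium": (20.17, 12.47),
    "large-v2": (19.64, 12.09),
    "large-v3": (18.50, 11.83),
}
CV_WER = {
    "tiny": (25.27, 21.08),
    "base": (12.11, 16.63),
    "small": (7.58, 9.21),
    "medium": (5.45, 7.15),
    "large-v2": (4.66, 5.72),
    "large-v3": (5.05, 6.52),
}


def relative_reduction(baseline: float, fine_tuned: float) -> float:
    """Relative error reduction in percent; negative means fine-tuning hurt."""
    return 100.0 * (baseline - fine_tuned) / baseline


def mean_relative_reduction(table: dict) -> float:
    """Unweighted mean of per-model relative reductions (an assumption, see lead-in)."""
    reductions = [relative_reduction(b, f) for b, f in table.values()]
    return sum(reductions) / len(reductions)


if __name__ == "__main__":
    for name, table in (("DASS", DASS_WER), ("CV Southern US", CV_WER)):
        print(name)
        for model, (baseline, fine_tuned) in table.items():
            print(f"  {model:>8}: {relative_reduction(baseline, fine_tuned):6.1f}%")
        print(f"  {'mean':>8}: {mean_relative_reduction(table):6.1f}%")
```

Under that assumption the in-domain mean rounds to the 37% quoted on the Main results slide, while on Common Voice only the tiny model shows a positive reduction.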