MD_NLP: Reconstructing an Australian English Heritage Dialect Corpus from the Mitchell-Delbridge Recordings through LLM-Assisted Speaker Attribution

class: inverse, center, middle
background-image: url(https://cc.oulu.fi/~scoats/oululogoRedTransparent.png);
background-repeat: no-repeat;
background-size: 80px 57px;
background-position:right top;
exclude: true

---

<h2 style="font-family: 'Rubik', sans-serif; font-size: 1.8em; font-weight: 700; color: #1a202c; line-height: 1.15; margin: 0 0 20px 0;">
    MD_NLP: Reconstructing an Australian English Heritage Dialect Corpus from the Mitchell-Delbridge Recordings
  </h2>

<p style="font-family: 'Rubik', sans-serif; font-size: 1.2em; font-weight: 400; color: #1a202c; line-height: 1.5; margin: 0;">
    <strong>Steven Coats</strong><br>
    University of Oulu, Finland<br>
    <a href="mailto:steven.coats@oulu.fi" style="color: #901a1e; text-decoration: none;">steven.coats@oulu.fi</a><br>
    DialRes, LREC 2026
  </p>


---

<div class="my-footer"><span>Steven Coats&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;Mitchell-Delbridge NLP | DialRes, LREC 2026</span></div>

---

### Outline

1. Background: Australian English dialects and the Mitchell-Delbridge recordings

2. Approach: a hybrid ASR workflow

3. Accuracy improvement

4. The *MD_NLP* dataset

5. Conclusion

---

### Background: Australian English Dialects?

- "Australia is, generally speaking, linguistically unified" <span class="small">(Mitchell & Delbridge, 1965: 13)</span>
- The *Mitchell-Delbridge* recordings (1959/60)
- Recordings of 7,735 secondary school pupils in 327 locations across Australia <span class="small">(Mitchell & Delbridge, 1998)</span>
- Extensive metadata (age, school location, school type, birthplace, parents' birthplaces, father's occupation)
- Students read a word list and a test sentence, then engaged in a short conversation
- Tape recordings digitized in 1998 
- Important for the study of Australian English <span class="small">(Cox et al., 2014, 2024)</span>
- Highly variable acoustic quality, narratives not previously transcribed

![](./MD_Narrative_locations.png)

---

### The Approach: A hybrid ASR workflow

- ASR: WhisperX for initial transcription <span class="small">(Radford et al., 2023; Bain et al., 2023)</span>

- Diarization: Pyannote 4.0.1 <span class="small">(Bredin, 2023)</span>

- Diarization correction: Gemini 2.5 Flash LLM to fix the discourse roles using interactional structure <span class="small">(cf. Cheng et al., 2025)</span>

- Alignment: Montreal Forced Aligner <span class="small">(McAuliffe et al., 2017)</span> for precise word-level boundaries

![](./MD_NLP_flowchart.png)
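The diarization-correction step can be illustrated with a minimal sketch. It assumes each diarized segment is a dict with hypothetical `speaker` and `text` keys and that the LLM replies with one role label per line. The prompt wording, the role names `INTERVIEWER` and `STUDENT`, and both helper functions are illustrative assumptions, not the prompt or code used to build MD_NLP. The call to the LLM itself is left out.

```python
from typing import Dict, List, Optional

ROLES = ("INTERVIEWER", "STUDENT")  # hypothetical role labels


def build_prompt(segments: List[Dict[str, str]]) -> str:
    """Format diarized segments as a numbered transcript for an LLM."""
    lines = [
        f"{i}: [{seg['speaker']}] {seg['text']}"
        for i, seg in enumerate(segments)
    ]
    header = (
        "Each numbered turn comes from an interview between an adult "
        "interviewer and a school pupil. The bracketed speaker labels may "
        "be wrong. Reply with one line per turn, in the form "
        "'<number>: INTERVIEWER' or '<number>: STUDENT'."
    )
    return header + "\n\n" + "\n".join(lines)


def apply_roles(
    segments: List[Dict[str, str]], llm_reply: str
) -> List[Dict[str, Optional[str]]]:
    """Return copies of the segments with a 'role' key taken from the reply.

    Turns that the reply omits, or labels with an unknown role, get
    role=None so they can be checked by hand.
    """
    roles = {}
    for line in llm_reply.splitlines():
        index, sep, label = line.partition(":")
        label = label.strip().upper()
        if sep and index.strip().isdigit() and label in ROLES:
            roles[int(index)] = label
    return [dict(seg, role=roles.get(i)) for i, seg in enumerate(segments)]


# Toy turns for illustration, not taken from the corpus. The diarizer has
# wrongly assigned both turns to the same speaker.
segments = [
    {"speaker": "SPEAKER_00", "text": "What do you do on weekends?"},
    {"speaker": "SPEAKER_00", "text": "I play cricket with my brothers."},
]
prompt = build_prompt(segments)
# A reply of this shape would then be merged back into the segments:
corrected = apply_roles(segments, "0: INTERVIEWER\n1: STUDENT")
```

Keeping prompt construction and reply parsing separate from the model call means the LLM can be swapped without touching the rest of the workflow, and unparseable replies surface as `None` roles rather than silent errors.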

---

### Accuracy improvement

![](./MD_NLP_comparison.png)
<iframe
  style="width:420px;height:40px;border:none;overflow:hidden;"
  scrolling="no"
  srcdoc="
    <body style='margin:0;overflow:hidden;background:transparent'>
      <audio controls style='width:320px;height:50px;display:block'>
        <source src='https://cc.oulu.fi/~scoats/Coats_LREC2026_MD_NLP/LREC_Dialres_example_combined.mp3' type='audio/mpeg'>
      </audio>
    </body>
  ">
</iframe><br>


Speaker turn accuracy

| System | Accuracy (%) | 
|-------|----------|
| Baseline (WhisperX + Pyannote)  | 62.70  | 
| Full Pipeline (LLM-assisted)  | **95.68** |

- LLM-assisted pipeline improves speaker turn accuracy by about 33 percentage points (62.70% → 95.68%)

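The gain can be read as an absolute or a relative change, and a short sketch makes the difference explicit. It assumes that speaker turn accuracy is the share of turns whose predicted role matches a reference label; the `turn_accuracy` helper is hypothetical, and only the two percentages come from the table.

```python
def turn_accuracy(predicted, reference):
    """Percentage of turns whose predicted role equals the reference role."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty label lists of equal length")
    hits = sum(p == r for p, r in zip(predicted, reference))
    return 100 * hits / len(reference)


baseline, full = 62.70, 95.68  # accuracies reported in the table

absolute_gain = full - baseline                 # in percentage points
relative_gain = 100 * absolute_gain / baseline  # in percent of the baseline

print(f"{absolute_gain:.2f} points, {relative_gain:.1f}% relative")
```

The absolute gain is 32.98 percentage points; relative to the baseline, the improvement is roughly 52.6%.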

---

### The *MD_NLP* Dataset: https://huggingface.co/datasets/stcoats/MD_NLP

![](./MD_NLP_HF_screenshot.png)

- 177.2 hours of speech, 1.79 million word tokens
- The `interview_metadata.csv` file in the Hugging Face dataset contains additional metadata fields for each informant
- Researchers can now query 177 hours of historical AusE, filter by student background, map recordings to specific coordinates, and extract phonological data with precise timestamps
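A metadata query of this kind might look like the following sketch, which uses only the Python standard library. The column names (`informant_id`, `state`, `school_type`) and the inline rows are invented for illustration; the actual fields in `interview_metadata.csv` may be named differently.

```python
import csv
import io

# Toy stand-in for interview_metadata.csv; columns and values are hypothetical.
TOY_CSV = """informant_id,state,school_type
A001,NSW,state
A002,VIC,independent
A003,NSW,independent
"""


def filter_informants(csv_file, **criteria):
    """Return the rows whose columns equal every given criterion."""
    reader = csv.DictReader(csv_file)
    return [
        row for row in reader
        if all(row.get(col) == value for col, value in criteria.items())
    ]


matches = filter_informants(
    io.StringIO(TOY_CSV), state="NSW", school_type="independent"
)
print([row["informant_id"] for row in matches])
```

With a local copy of the real file, the same function would take `open("interview_metadata.csv", newline="", encoding="utf-8")` in place of the `io.StringIO` object, once the criteria are adjusted to the actual column names.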

---

### Conclusion

- Unlocks spatial and diachronic research for Australian English

- Pipeline architecture is language-agnostic and modular

- Provides a blueprint for rescuing other legacy dialect archives (e.g. US Linguistic Atlas Project)

---

### Thanks for your attention!

#### Acknowledgements

- Supported by the **European Union -- NextGenerationEU** instrument

- Funded by the **Research Council of Finland**, grant **358720**

- Computational resources provided by **Finland's Centre for Scientific Computing**

---

### References

Bain, M., Huh, J., Han, T., & Zisserman, A. (2023). WhisperX: Time-accurate speech transcription of long-form audio. In *Proceedings of Interspeech 2023* (pp. 4489–4493). https://doi.org/10.21437/Interspeech.2023-78

Bredin, H. (2023). Pyannote.audio 2.1 speaker diarization pipeline: Principle, benchmark, and recipe. In *Proceedings of Interspeech 2023* (pp. 1983–1987). https://doi.org/10.21437/Interspeech.2023-105

Cheng, L., Wang, H., Deng, C., Zheng, S., Chen, Y., Huang, R., Zhang, Q., Chen, Q., Li, X., & Wang, W. (2025). Integrating audio, visual, and semantic information for enhanced multimodal speaker diarization. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics* (pp. 19914–19928). https://aclanthology.org

Cox, F., Palethorpe, S., & Bentink, S. (2014). Phonetic archaeology and 50 years of change to Australian English /iː/. *Australian Journal of Linguistics, 34*(1), 50–75. https://doi.org/10.1080/07268602.2014.875455

Cox, F., Penney, J., & Palethorpe, S. (2024). Australian English monophthong change across 50 years: Static versus dynamic measures. *Languages, 9*(3), 99. https://doi.org/10.3390/languages9030099

McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., & Sonderegger, M. (2017). Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In *Proceedings of Interspeech 2017* (pp. 498–502). https://doi.org/10.21437/Interspeech.2017-1386

Mitchell, A. G., & Delbridge, A. (1965). *The pronunciation of English in Australia*. Angus and Robertson.

Mitchell, A. G., & Delbridge, A. (1998). *The speech of Australian adolescents: Research data and recordings collected by AG Mitchell and Arthur Delbridge in 1959 and 1960*. The University of Sydney. https://doi.org/10.25910/jkwy-wk76

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. In *Proceedings of the 40th International Conference on Machine Learning* (PMLR 202, pp. 28492–28518). https://proceedings.mlr.press/v202/radford23a.html
