class: inverse, center, middle
background-image: url(data:image/png;base64,#https://cc.oulu.fi/~scoats/oululogoRedTransparent.png); background-repeat: no-repeat; background-size: 80px 57px; background-position: right top;
exclude: true

---
class: title-slide

<br><br><br><br><br>
.pull-right[
<span style="font-family:Roboto Condensed;font-size:24pt;font-weight: 900;font-style: normal;float:right;text-align: right;color:white;-webkit-text-fill-color: black;-webkit-text-stroke: 0.8px;">The Corpus of Australian and New Zealand Spoken English (CoANZSE)</span>
]

<br><br><br><br>
<p style="float:right;text-align: right;color:white;font-weight: 700;font-style: normal;-webkit-text-fill-color: black;-webkit-text-stroke: 0.5px;">
Steven Coats<br>
English, University of Oulu, Finland<br>
<a href="mailto:steven.coats@oulu.fi">steven.coats@oulu.fi</a><br>
Workshop on Language Corpora in Australia<br>
July 3rd, 2023<br>
</p>

---
layout: true

<div class="my-header"><img border="0" alt="W3Schools" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats                               CoANZSE | WLCA 2023, Canberra</span></div>

---
exclude: true

<div class="my-header"><img border="0" alt="W3Schools" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats                               CoANZSE | WLCA 2023, Canberra</span></div>

## Outline

1. Background, YouTube ASR captions files, data collection and processing
2. CoANZSE overview
3. Examples: Double modals, acoustic analysis pipeline
4. Caveats, summary

.footnote[Slides for the presentation are on my homepage at https://cc.oulu.fi/~scoats]

<div class="my-header"><img border="0" alt="W3Schools" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats                               CoANZSE | WLCA 2023, Canberra</span></div>

---
### Background

- Vast amounts of streamed content are available online
- Use of automatic speech recognition (ASR) transcripts is ubiquitous
- Technical protocols for streaming (DASH, HLS): data accessible via HTTP

Possible to create specialized corpora for specific locations/topics/speech genres

- Transcripts (**CoANZSE**, CoNASE, CoBISE, CoGS)
  - Analysis of grammar/syntax, lexis, pragmatics, discourse
- Audio
  - Analysis of phonetic and prosodic variation
- Video
  - Analysis of multimodal communication

---
### Example video

<iframe width="560" height="315" src="https://www.youtube.com/embed/cn8vWlUae7Y?rel=0&&showinfo=0&cc_load_policy=1&cc_lang_pref=en" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---
### WebVTT file

![](data:image/png;base64,#./Maranoa_webvtt_example.png)

---
### YouTube captions files

- Videos can have multiple captions files: user-uploaded captions, YouTube's ASR captions, or both, or neither
- User-uploaded captions may be manually created or generated automatically by 3rd-party ASR software
- CoANZSE (and CoNASE, CoBISE, CoGS): target YT ASR captions
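
---
### Reading a WebVTT captions file (sketch)

YouTube's ASR captions are delivered as WebVTT files like the one shown above. As a minimal illustration (the CoANZSE processing scripts may differ), cue timings and text can be read with the `webvtt-py` package; the file name here is a placeholder.

```python
# pip install webvtt-py
import webvtt

# Iterate over the cues of a downloaded captions file (placeholder file name)
for caption in webvtt.read("Maranoa_example.en.vtt"):
    # Each cue has a start time, an end time, and the caption text
    print(caption.start, caption.end, caption.text)
```

Word-level timestamps (the basis of the per-word timing in the corpus) are embedded in the cue payload as `<00:00:02.220>` tags and require additional parsing.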
---
### CoANZSE and other YouTube ASR Corpora

Corpus of Australian and New Zealand Spoken English - [CoANZSE](https://cc.oulu.fi/~scoats/CoANZSE.html): Processed ASR captions from 56k transcripts, collected from 478 Australian and New Zealand YouTube channels of local or district councils, 196m word tokens corresponding to 24,007 hours of video from 2007–2022 <span class="small">(Coats 2023a)</span>

Corpus of North American Spoken English - [CoNASE](https://cc.oulu.fi/~scoats/CoNASE.html): 302k transcripts, 2,572 channels, 1.29b tokens <span class="small">(Coats 2023c, also available with a searchable online interface: https://lncl6.lawcorpus.byu.edu)</span>

Corpus of Britain and Ireland Spoken English - [CoBISE](https://cc.oulu.fi/~scoats/CoBISE.html): 39k transcripts, 497 channels, 112m tokens <span class="small">(Coats 2022b)</span>

Corpus of German Speech - [CoGS](https://cc.oulu.fi/~scoats/CoGS.html): 39.5k transcripts, 1,313 channels, 50.5m tokens <span class="small">(Coats in review)</span>

All are freely available for research use; download from the Harvard Dataverse ([CoNASE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/X8QJJV), [CoBISE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UGIIWD), [CoGS](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3Y1YVB), [CoANZSE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GW35AK))

---
### Focus on council channels

Content consists of recordings of council meetings, news announcements, interviews, cultural events, etc.

Advantages in terms of representativeness and comparability

- Speaker place of residence is known (cf. videos collected based on place-name search alone)
- Topical contents and communicative contexts are comparable
- Government content is non-profit: can be used under "fair dealing"/"fair use" provisions of copyright law (e.g. Australian Copyright Act 1968, U.S.C. Title 17)

---
### Data collection and processing

- Identification of relevant channels (lists of councils with web pages ➡ scrape pages for links to YouTube)
- Inspection of returned channels to remove false positives
- Retrieval of ASR transcripts using [yt-dlp](https://github.com/yt-dlp/yt-dlp)
- Geocoding: a string containing council name + address + country is sent to Google's geocoding service
- PoS tagging with spaCy <span class="small">(Honnibal et al. 2019)</span>
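
---
### Retrieving ASR captions with yt-dlp (sketch)

A minimal sketch of the caption-retrieval step, using yt-dlp's Python API rather than the exact CoANZSE collection scripts; the channel URL and output template are placeholders.

```python
from yt_dlp import YoutubeDL

# Download only YouTube's automatic (ASR) captions, not the media itself
opts = {
    "skip_download": True,        # no audio/video, captions only
    "writeautomaticsub": True,    # request the auto-generated (ASR) track
    "subtitleslangs": ["en"],     # English captions
    "subtitlesformat": "vtt",     # WebVTT output
    "outtmpl": "%(id)s.%(ext)s",  # file name pattern (placeholder)
    "ignoreerrors": True,         # skip videos without ASR captions
}

with YoutubeDL(opts) as ydl:
    # A channel or playlist URL is expanded to its individual videos
    ydl.download(["https://www.youtube.com/c/wollondillyshire"])
```

The same options are available on the command line (`yt-dlp --skip-download --write-auto-sub --sub-lang en --sub-format vtt URL`).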
---
### CoANZSE data format

<div>
<table border="1" class="dataframe" style="font-size:8pt;border-collapse: collapse;">
<thead>
<tr style="text-align: left;"> <th></th> <th>country</th> <th>state</th> <th>name</th> <th>channel_name</th> <th>channel_url</th> <th>video_title</th> <th>video_id</th> <th>upload_date</th> <th>video_length</th> <th>text_pos</th> <th>location</th> <th>latlong</th> <th>nr_words</th> </tr>
</thead>
<tbody>
<tr> <th>0</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Road Resurfacing Video</td> <td>zVr6S5XkJ28</td> <td>20181127</td> <td>146.120</td> <td>g_NNP_2.75 'day_XX_2.75 my_PRP$_3.75 name_NN_4.53 is_VBZ_4.74 ...</td> <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>433</td> </tr>
<tr> <th>1</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Weather update 5pm 1 March 2022 - Mayor Matt Gould</td> <td>p4MjirCc1oU</td> <td>20220301</td> <td>181.959</td> <td>hi_UH_0.64 guys_NNS_0.96 i_PRP_1.439 'm_VBP_1.439 just_RB_1.76 ...</td> <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>620</td> </tr>
<tr> <th>2</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Transport Capital Works Video</td> <td>DXlkVTcmeho</td> <td>20180417</td> <td>140.450</td> <td>council_NNP_0.53 is_VBZ_1.53 placing_VBG_1.65 is_VBZ_2.07 2018-19_CD_2.57 ...</td> <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>347</td> </tr>
<tr> <th>3</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Council Meeting Wrap Up February 2022</td> <td>2NhuhF2fBu8</td> <td>20220224</td> <td>107.840</td> <td>g_NNP_0.399 'day_NNP_0.399 guys_NNS_0.799 and_CC_1.12 welcome_JJ_1.199 ...</td> <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>341</td> </tr>
<tr> <th>4</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>CITY DEAL 4 March 2018</td> <td>4-cv69ZcwVs</td> <td>20180305</td> <td>130.159</td> <td>[Music]_XX_0.85 it_PRP_2.27 's_VBZ_2.27 a_DT_3.27 fantastic_JJ_3.36 ...</td> <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>420</td> </tr>
</tbody>
</table>
</div>

---
exclude: true

### Potential analyses

- Non-numerical quantifiers *heaps* and *lots*

---
### CoANZSE corpus size by country/state/territory

.small[
Location |nr_channels|nr_videos |nr_words|video_length (h)
----------------------------|---|-------|-----------|----
Australian Capital Territory| 8 |650 |915,542 |111.79
New South Wales |114|9,741 |27,580,773 |3,428.87
Northern Territory |11 | 289 |315,300 |48.72
New Zealand |74 |18,029 |84,058,661 |10,175.80
Queensland |58 |7,356 |19,988,051 |2,642.75
South Australia |50 |3,537 |13,856,275 |1,716.72
Tasmania |21 |1,260 |5,086,867 |636.99
Victoria |78 |12,138 |35,304,943 |4,205.40
Western Australia |68 |3,815 |8,422,484 |1,063.78
 | | | |
Total |482|56,815 |195,528,896|24,030.82
]

---
### CoANZSE channel locations

.small[Circle size corresponds to channel size in number of words]

<div class="container">
<iframe src="https://cc.oulu.fi/~scoats/anz_dot2.html" style="width: 100%; height: 450px; max-width: 100%;" sandbox="allow-same-origin allow-scripts" scrolling="yes" seamless="seamless" frameborder="0" align="middle"></iframe>
</div>
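
---
### Working with the text_pos field (sketch)

Each transcript is stored as a string of `word_TAG_timestamp` triples (see the data format table above). A minimal sketch of how such a string can be unpacked; the file name and delimiter for a locally saved copy of the corpus are placeholders.

```python
import pandas as pd

# Placeholder file name/format for a downloaded copy of the corpus
coanzse_df = pd.read_csv("coanzse_sample.tsv", sep="\t")

def parse_text_pos(text_pos):
    """Split a text_pos string into (word, pos, seconds) triples."""
    triples = []
    for token in text_pos.split():
        # rsplit from the right: the word itself may contain underscores
        word, pos, ts = token.rsplit("_", 2)
        triples.append((word, pos, float(ts)))
    return triples

first = parse_text_pos(coanzse_df.loc[0, "text_pos"])
print(first[:3])  # e.g. [("g", "NNP", 2.75), ("'day", "XX", 2.75), ...]
```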
---
exclude: true

### ASR transcript and audio quality metric

- The quality of ASR transcripts can be evaluated by using a language model trained on a very large set of ASR transcripts generated for the same audio files at different rates of compression <span class="small">(Yuksel et al. 2023)</span>

.pull-left[
.small[
*5 ASR transcripts generated from the same video*

rank |compression|quality|hypothetical ASR excerpt
--------|-----------|-------|-------------------------
1 | none | best |it's really fantastic that we
2 | little | good | it's really fantastic we
3 | medium | middle| it's really fantasy with
4 | high | poor | it rifle fantasy that wonder
5 | most | worst | Ik reed met fantasie
]]
.pull-right[
.large[
<br><br>
➡️ language model ➡️ classification of transcripts/audio
]]

<br><br>

- Applied with an adapted PyTorch model <span class="small">(https://huggingface.co/aixplain/NoRefER)</span>
- Assigns a numerical rating from 0 (very bad ASR/audio) to 1 (excellent ASR/audio)

---
exclude: true

### Corpus use cases: Syntax/grammar/pragmatics

- Regional variation in syntax, mood and modality
- Lexical items
- Contractions
- Hortatives/commands/interjections
- Pragmatics: Turn-taking, politeness markers
- Multidimensional analysis à la Biber
- Typological comparison at country/state/regional level

---
### Example analysis: Double modals

- Non-standard, rare syntactic feature<span class="small"> (Montgomery & Nagle 1994; Coats 2022a)</span>
- *I might could help you with this*
- Occurs only in the American Southeast and in Scotland/Northern England/Northern Ireland?
- Most studies based on non-naturalistic data with limited geographical scope <span class="small">(data from linguistic atlas interviews, surveys administered mostly in the American Southeast and the North of Britain)</span>
- More widely used in North America and the British Isles than previously thought <span class="small">(Coats 2022a, Coats 2023b)</span>
- Little studied in Australian and New Zealand speech

.verysmall[
]

---
exclude: true

### Script: Generating a table for manual inspection of double modals

- Base modals *will, would, can, could, might, may, must, should, shall, used to, 'll, ought to, oughta*
- Script to generate regexes of two-tier combinations

```python
import re
import pandas as pd

# All ordered pairs of base modals (two-tier combinations)
base_modals = ["will", "would", "can", "could", "might", "may", "must",
               "should", "shall", "used to", "'ll", "ought to", "oughta"]
modals = [(m1, m2) for m1 in base_modals for m2 in base_modals if m1 != m2]

hits = []
for x in modals:
    # Match "modal1_TAG_time modal2_TAG_time" sequences in the tagged transcripts
    pat1 = re.compile("("+x[0]+"_\\w+_\\S+\\s+"+x[1]+"_\\w+_\\S+\\s)", re.IGNORECASE)
    for i, y in coanzse_df.iterrows():
        finds = pat1.findall(y["text_pos"])
        for z in finds:
            # Recover the two word forms and the timestamp of the first modal
            seq = z.split()[0].split("_")[0].strip()+" "+z.split()[1].split("_")[0].strip()
            time = z.split()[0].split("_")[-1]
            # Build a URL 3 seconds before the hit, for manual checking
            hits.append((y["country"], y["channel_name"], seq,
                         "https://youtu.be/"+y["video_id"]+"?t="+str(round(float(time)-3))))
pd.DataFrame(hits)
```

- The script creates a URL for each search hit at a time 3 seconds before the targeted utterance
- In the resulting data frame, each utterance can be annotated after examining the targeted video sequence
- Filter out non-double-modals (clause overlap, speaker self-repairs, ASR errors)

---
exclude: true
class: small

### Excerpt from generated table
---
exclude: true

### Finding features

- Regular-expression-search and manual annotation approach
- Double modals can be found in the US North and West and in Canada; in Scotland, N. Ireland, and N. England, but also in the English Midlands and South and in Wales <span class="small">(Coats in review)</span>
- Also in Australia and (especially) New Zealand!

---
exclude: true

### Training a classifier on the basis of common word types

- Simple machine-learning classifiers using SVM, logistic regression, or other algorithms can distinguish between Australian and NZ transcripts on the basis of the 500 most common words in CoANZSE

<br><br>

<style type="text/css">
.tg {border-collapse:collapse;border-color:#aaa;border-spacing:0;}
.tg td{background-color:#fff;border-color:#aaa;border-style:solid;border-width:0px;color:#333; font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{background-color:#f38630;border-color:#aaa;border-style:solid;border-width:0px;color:#fff; font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-0lax{text-align:left;vertical-align:top}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
<tr> <th class="tg-0lax"></th> <th class="tg-0pky">Precision</th> <th class="tg-0pky">Recall</th> <th class="tg-0pky">F1</th> <th class="tg-0pky">Support</th> <th class="tg-0pky">Accuracy</th> </tr>
</thead>
<tbody>
<tr> <td class="tg-0lax">Australia</td> <td class="tg-baqh">0.82</td> <td class="tg-baqh">0.90</td> <td class="tg-baqh">0.86</td> <td class="tg-baqh">1,359</td> <td rowspan="2" align="center">0.80</td> </tr>
<tr> <td class="tg-0lax">New Zealand</td> <td class="tg-baqh">0.74</td> <td class="tg-baqh">0.59</td> <td class="tg-baqh">0.66</td> <td class="tg-baqh">641</td> </tr>
</tbody>
</table>
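
---
exclude: true

### Classifier sketch (illustrative)

A minimal sketch of this kind of classifier with scikit-learn; the feature set (500 most frequent word types) follows the slide above, but the column names, label values, and choice of logistic regression are illustrative rather than the exact setup behind the table.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Plain-text transcripts and country labels derived from the corpus dataframe
texts = coanzse_df["text_pos"].str.replace(r"_\S+", "", regex=True)  # strip tags/timestamps
labels = coanzse_df["country"]                                       # e.g. "AUS" / "NZ" (values illustrative)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0)

# 500 most frequent word types as features, logistic regression as classifier
clf = make_pipeline(CountVectorizer(max_features=500), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```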
---
### CoANZSE audio data (work in progress)

- Cut YouTube transcripts into 20-word chunks
- Using transcript timing information and the DASH manifest, extract audio segment for each chunk with yt-dlp
- Feed audio and transcript excerpts to Montreal Forced Aligner <span class="small">(McAuliffe et al. 2017)</span>
  - <span class="small">Grapheme to phoneme dictionary, pronunciation dictionary: US ARPAbet</span>
  - <span class="small">Acoustic model: from Librispeech Corpus (Panayotov et al. 2015)</span>
  - <span class="small">Language model: MFA English 2.0.0</span>
- Output is Praat TextGrids
- Get features of interest from TextGrids + audio chunks with Parselmouth <span class="small">(Python port of Praat functions; Jadoul et al. 2018)</span>
- Analyze phenomena of interest (formants, voice onset time, pitch, etc.)
- Currently 30m vowels, 130m measurements

---
### Pipeline for acoustic analysis

![:scale 50%](data:image/png;base64,#./Github_phonetics_pipeline_screenshot.png)

- A Jupyter notebook that collects transcripts and audio from YouTube, aligns the transcripts, and extracts vowel formants
- Click your way through the process in a Colab environment
- Can be used for any language that has ASR transcripts
- With a few script modifications, also works for manual transcripts

https://github.com/stcoats/phonetics_pipeline

---
### Example: Excerpt from a video of the City of Adelaide

<span class="small">(former mayor Sandy Verschoor, https://www.youtube.com/watch?v=f-GX8-qszPE)</span>

<iframe width="500" height="400" controls src="https://cc.oulu.fi/~scoats/Sandy_example.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" scrolling="no" sandbox allowfullscreen></iframe>

---
### Pipeline for acoustic analysis: Vowel formants

For each transcript/audio pair in the collection:

- Send transcript + audio to Montreal Forced Aligner <span class="small">(McAuliffe et al. 2017)</span>; output is Praat TextGrids <span class="small">(Boersma & Weenink 2023)</span>
- Select features of interest using TextGrid timings and Parselmouth <span class="small">(Python port of Praat functions; Jadoul et al. 2018)</span>

<pre style="font-size:11px">were raised by councillors which discussed [oʊ]<br/>a broad range of topics and issues of<br />particular note was the further promotion</pre>

<audio controls id="player1" preload="none" name="media">
<source src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/OdhGckWy5Dw_0001358500014315_17.wav" type="audio/wav">
</audio>
<audio controls id="player2" preload="none" name="media">
<source src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/OdhGckWy5Dw_0001358500014315_17_vw.wav" type="audio/wav">
</audio>

<img src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/adelaide_praat_example2.png" width="600px" class="center">

---
### Formants: F1/F2 values for a single utterance

.pull-left[
<iframe src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/example_Adelaide_sandy.html" height="500" width="500" class="center"></iframe>
]
.pull-right[
- The script makes 9 F1/F2 measurements per token, at quantiles of the token duration
- Circles are individual measurement points
- The line represents the formant trajectory for a single token
- Retain segments for which at least 5 measurements were possible
]

---
### Formants: F1/F2 values for a single location (filtered)

.pull-left[
<iframe src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/example_Adelaide_t4.html" height="500" width="500" class="center"></iframe>
]
.pull-right[
- Sample of [oʊ] realizations from the City of Adelaide channel
- Retain tokens for which at least 5 measurements were possible
- This visualization filters out segments shorter than 100 milliseconds in duration
]

---
### Formants: Mean values

.pull-left[
![:scale 100%](data:image/png;base64,#./adelaide_formant_plot2.png)
]
.pull-right[
- Mean values for a single video, a single channel, a single location, etc.
- Circle locations represent the average value for that duration quantile (subscript)
- Circle size is proportional to the number of measurements for that quantile (formant values are more likely to be measurable in the middle of the vowel than at the beginning/end)
]
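
---
### Formant measurement with Parselmouth (sketch)

A minimal sketch of the kind of Parselmouth call that could produce F1/F2 values for an aligned vowel interval. The file name and interval times are placeholders; in the pipeline they come from the MFA TextGrids.

```python
import numpy as np
import parselmouth

snd = parselmouth.Sound("chunk_audio.wav")             # one 20-word audio chunk (placeholder)
formants = snd.to_formant_burg(maximum_formant=5500)   # Praat's Burg formant tracker

# Vowel interval boundaries from the forced alignment (placeholder values)
start, end = 1.358, 1.512

# Nine measurement points spread over the token duration
times = np.linspace(start, end, 9)
f1 = [formants.get_value_at_time(1, t) for t in times]  # NaN where no value could be measured
f2 = [formants.get_value_at_time(2, t) for t in times]
print(list(zip(np.round(f1), np.round(f2))))
```

Tokens with fewer than five valid (non-NaN) measurements are discarded, as on the preceding slides.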
---
### GOAT vowel

- First target of /oʊ/ is more back and closed in South Australia compared to other Australian locations <span class="small">(Butcher 2007, Cox & Palethorpe 2019)</span>

---
#### Average F1 and F2 values for the first targets of the diphthongs /eɪ/, /aɪ/, /oʊ/, and /aʊ/, spatial autocorrelation <span class="small">(2,339,812 vowel tokens)</span>

<iframe width="800" height="500" src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/coanzse_diph_formants_WA_NT_SA_TAS.html" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" scrolling="no" allowfullscreen></iframe>

<span style="float: right; width:20%;">- Locations with at least 100 tokens<br>- Getis-Ord Gi* values based on a 20-nearest neighbor binary spatial weights matrix<br>- Only SA, WA, NT in this visualization (other states still being downloaded)</span>

---
exclude: true

### Comparison <small>(Grieve, Speelman & Geeraerts 2013, p. 37)</small>

.pull-left[
![](data:image/png;base64,#./Grieve_et_al_2013_eY.png)
]
.pull-right[
- Grieve et al. (2013) used a similar technique to analyze formant measurements from the *Atlas of North American English* (Labov et al. 2006)
- ANAE contains approximately 134,000 vowel measurements in total
]

---
exclude: true

### Multimodality

- Use regular expressions to search corpus
- Extract video as well as audio
- Manually or automatically analyze:
  - Gesture
  - Posture/body/head inclination
  - Facial expression
  - Handling of objects
  - Touching
  - (etc.)

---
exclude: true

### 'Heaps of' in Australian English

<iframe width="800" height="600" controls src="https://cc.oulu.fi/~scoats/heaps_of_CoANZSE_excerpt.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" scrolling="no" sandbox allowfullscreen></iframe>

---
exclude: true

### Extracted *today* tokens

<iframe width="800" height="500" src="https://cc.oulu.fi/~scoats/coanzse_today.html" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" scrolling="no" allowfullscreen></iframe>

---
exclude: true

### Average <span style="font-family:serif;font-weight:bolder">eɪ</span> diphthong

![:scale 70%](data:image/png;base64,#./eY_coanzse.png)
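
---
### Spatial autocorrelation sketch (Getis-Ord Gi*)

The hot-spot values in the map above are Getis-Ord Gi* statistics computed over channel locations. A minimal sketch with PySAL, assuming a GeoDataFrame `locations_gdf` with point geometries and a mean formant value per location; the column names are illustrative, not the exact variables used.

```python
from libpysal.weights import KNN
from esda.getisord import G_Local

# 20-nearest-neighbour binary spatial weights over the channel locations
w = KNN.from_dataframe(locations_gdf, k=20)

# Gi* (star=True includes each location in its own neighbourhood)
gi_star = G_Local(locations_gdf["f2_first_target_mean"], w,
                  transform="B", star=True, permutations=999)

locations_gdf["gi_z"] = gi_star.Zs      # z-scores: hot spots (+) and cold spots (-)
locations_gdf["gi_p"] = gi_star.p_sim   # pseudo p-values from the permutations
```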
---
### A few caveats

- Videos of local government are not representative of speech in general
- ASR errors (mean WER after filtering ~14%); quality of transcript related to quality of audio as well as dialect features <span class="small">(Tatman 2017; Meyer et al. 2020; Markl & Lai 2021)</span>
- Low-frequency phenomena: manually inspect corpus hits
- High-frequency phenomena: signal of correct transcriptions will be stronger <span class="small">(Agarwal et al. 2007)</span> → classifiers
- Machine learning model to identify higher quality transcripts/audio <span class="small">(Yuksel et al. 2023)</span>
- MFA pronunciation dictionary and acoustic model: US English models might fail for some features (rhoticity)? <span class="small">BUT see Gonzalez et al. (2020), MacKenzie and Turton (2020)</span>
- Need to analyze error rates of forced alignment
- Diarization, speaker demographic information

---
### Summary and outlook

- CoANZSE is a large corpus of ASR transcripts from YouTube channels of local governments in AUS and NZ
- It can be used for studies of regional variation in grammar, syntax, and discourse
- CoANZSE audio can be used for studies of phonetic variation: multivariate spatial analysis of vowel formants in Australian English

---
# Thank you!

### Please feel free to download and use the corpus!

---
### References

.small[
.hangingindent[
Agarwal, S., Godbole, S., Punjani, D. & Roy, S. (2007). [How much noise is too much: A study in automatic text classification](https://doi.org/10.1109/ICDM.2007.21). In *Seventh IEEE International Conference on Data Mining (ICDM 2007)*, 3–12.

Boersma, P. & Weenink, D. (2023). Praat: doing phonetics by computer. Version 6.3.09. http://www.praat.org

Coats, S. (2023a). CoANZSE: [The Corpus of Australian and New Zealand Spoken English: A new resource of naturalistic speech transcripts](https://doi.org/10.2478/plc-2022-13). In P. Parameswaran, J. Biggs & D. Powers (Eds.), *Proceedings of the 20th Annual Workshop of the Australasian Language Technology Association*, 1–5. Australasian Language Technology Association.

Coats, S. (2023b). [Double modals in contemporary British and Irish speech](https://doi.org/10.1017/S1360674323000126). *English Language and Linguistics*.

Coats, S. (2023c). [Dialect corpora from YouTube](https://doi.org/10.1515/9783111017433-005). In B. Busse, N. Dumrukcic & I. Kleiber (Eds.), *Language and linguistics in a complex world*, 79–102. Walter de Gruyter.

Coats, S. (2022a). [Naturalistic double modals in North America](https://doi.org/10.1215/00031283-9766889). *American Speech*.

Coats, S. (2022b). [The Corpus of British Isles Spoken English (CoBISE): A new resource of contemporary British and Irish speech](http://ceur-ws.org/Vol-3232/paper15.pdf). In K. Berglund, M. La Mela & I. Zwart (Eds.), *Proceedings of the 6th Digital Humanities in the Nordic and Baltic Countries Conference, Uppsala, Sweden, March 15–18, 2022*, 187–194. CEUR.

Gonzalez, S., Grama, J. & Travis, C. (2020). [Comparing the performance of forced aligners used in sociophonetic research](https://doi.org/10.1515/lingvan-2019-0058). *Linguistics Vanguard*, 5.

Honnibal, M. et al. (2019). [Explosion/spaCy v2.1.7: Improved evaluation, better language factories and bug fixes](https://doi.org/10.5281/zenodo.3358113).
]]

---
### References II

.small[
.hangingindent[
Jadoul, Y., Thompson, B. & de Boer, B. (2018). Introducing Parselmouth: A Python interface to Praat. *Journal of Phonetics*, 71, 1–15. https://doi.org/10.1016/j.wocn.2018.07.001

MacKenzie, L. & Turton, D. (2020). [Assessing the accuracy of existing forced alignment software on varieties of British English](https://doi.org/10.1515/lingvan-2018-0061). *Linguistics Vanguard*, 6.

Markl, N. & Lai, C. (2021). [Context-sensitive evaluation of automatic speech recognition: considering user experience & language variation](https://aclanthology.org/2021.hcinlp-1.6). In *Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing*, 34–40. Association for Computational Linguistics.

McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M. & Sonderegger, M. (2017). Montreal Forced Aligner: Trainable text-speech alignment using Kaldi.
In *Proceedings of the 18th Conference of the International Speech Communication Association*.

Meyer, J., Rauchenstein, L., Eisenberg, J. D. & Howell, N. (2020). [Artie Bias Corpus: An open dataset for detecting demographic bias in speech applications](https://aclanthology.org/2020.lrec-1.796). In *Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 2020*, 6462–6468. European Language Resources Association.

Montgomery, M. B. & Nagle, S. J. (1994). Double modals in Scotland and the Southern United States: Trans-atlantic inheritance or independent development? *Folia Linguistica Historica*, 14, 91–108.

Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. (2015). [Librispeech: An ASR corpus based on public domain audio books](https://doi.org/10.1109/ICASSP.2015.7178964). In *Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, 5206–5210.

Tatman, R. (2017). [Gender and dialect bias in YouTube's automatic captions](https://aclanthology.org/W17-1606). In *Proceedings of the First ACL Workshop on Ethics in Natural Language Processing*, 53–59. Association for Computational Linguistics.

Yuksel, K. A., Ferreira, T., Javadi, G., El-Badrashiny, M. & Gunduz, A. (2023). [NoRefER: A referenceless quality metric for Automatic Speech Recognition via semi-supervised language model fine-tuning with contrastive learning](https://arxiv.org/abs/2306.12577). arXiv:2306.12577 [cs.CL].
]]