The Corpus of Australian and New Zealand Spoken English: A New Resource of Naturalistic Speech Transcripts

class: inverse, center, middle
background-image: url(data:image/png;base64,#https://cc.oulu.fi/~scoats/oululogoRedTransparent.png);
background-repeat: no-repeat;
background-size: 80px 57px;
background-position:right top;
exclude: true

---

<br>

## <span style="color:black;-webkit-text-fill-color: #32CD32;-webkit-text-stroke: 1px;">The Corpus of Australian and New Zealand Spoken English: A New Resource of Naturalistic Speech Transcripts</span>

Steven Coats<br>
English, University of Oulu, Finland<br>
<a href="mailto:steven.coats@oulu.fi">steven.coats@oulu.fi</a><br>

ALTA 2022<br> 
December 16th, 2022<br>

---

<div class="my-footer"><span>Steven Coats&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;CoANZSE | ALTA Conference, Adelaide</span></div>

---

## Outline

1. Background, YouTube ASR captions files, data collection and processing

2. CoANZSE locations and size

3. Example use cases: Double modals, variety classification, acoustic analysis pipeline

4. Caveats, summary

---

### Background

- Renaissance in corpus-based study of English varieties <span class="small">(Nerbonne 2009; Szmrecsanyi 2011, 2013; Grieve et al. 2019)</span>
- Some corpora of transcribed spoken English have limited availability, are small in size, or lack sufficient geographical granularity to make inferences about regional distributions of features

.small[
Corpus	              |Location(s)        |nr_words| Reference
----------------------|-------------------|--------|--------------------------				
ICE-Aus               | Australia         |~600k   | Cassidy et al. 2012
Monash Corpus         | Melbourne         |~96k    | Bradshaw et al. 2010
Griffith Corpus       | Brisbane          |~32k    | Cassidy et al. 2012
Wellington Corpus     | NZ                |~1m     | Holmes et al. 1998
ONZE Corpus           | NZ                |?       | Gordon et al. 2007
]

- Automatic Speech Recognition (ASR) transcripts are available online for speech from specific locations
- Videos from local councils and other government entities can be harvested to create large corpora

---

### Example video

---

### WebVTT file

![](data:image/png;base64,#./Maranoa_webvtt_example.png)
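
Word timings can be recovered from a captions file of this kind. The following is a minimal sketch, not the corpus pipeline itself: it assumes that auto-generated cues carry inline tags of the form `<HH:MM:SS.mmm><c> word</c>` and that the first word of a tagged line starts at the cue start time. The names `word_timings` and `to_seconds` and the `sample` string are illustrative.

```python
import re

# Cue header, e.g. "00:00:02.750 --> 00:00:04.740 align:start position:0%"
CUE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) --> ")
# Inline word tag, e.g. "<00:00:03.750><c> my</c>" (assumed format)
TAGGED = re.compile(r"<(\d{2}:\d{2}:\d{2}\.\d{3})><c>\s*([^<]+?)\s*</c>")


def to_seconds(stamp):
    """Convert an HH:MM:SS.mmm timestamp to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def word_timings(vtt_text):
    """Return (word, start_time) pairs from lines that carry inline word tags."""
    words = []
    cue_start = None
    for line in vtt_text.splitlines():
        header = CUE.match(line)
        if header:
            cue_start = to_seconds(header.group(1))
            continue
        if "<c>" not in line or cue_start is None:
            continue  # skip headers, blank lines, and untagged repeated lines
        first = line.split("<", 1)[0].strip()
        if first:
            words.append((first, cue_start))  # untagged first word of the line
        for stamp, word in TAGGED.findall(line):
            words.append((word, to_seconds(stamp)))
    return words


sample = """WEBVTT

00:00:02.750 --> 00:00:04.740 align:start position:0%
g'day<00:00:03.750><c> my</c><00:00:04.530><c> name</c>
"""
print(word_timings(sample))
```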

---

### YouTube captions files

- Videos can have multiple captions files: user-uploaded captions, auto-generated captions created using automatic speech recognition (ASR), or both, or neither

- User-uploaded captions can be manually created or generated automatically by 3rd-party ASR software

- Auto-generated captions are generated by YT's speech-to-text service

- CoANZSE (and CoNASE and CoBISE): target YT ASR captions

---

### YouTube ASR Corpora

US, Canada, England, Scotland, Wales, Northern Ireland, the Republic of Ireland, Germany, Australia, and New Zealand 
  - [CoNASE](https://cc.oulu.fi/~scoats/CoNASE.html): 1.25b token corpus of 301,846 word-timed, part-of-speech-tagged Automatic Speech Recognition (ASR) transcripts <span class="small">(Coats forthcoming)</span>
  - [CoBISE](https://cc.oulu.fi/~scoats/CoBISE.html): 112m tokens, 452 locations, 38,680 ASR transcripts <span class="small">(Coats 2022b)</span>
  - [CoGS](https://cc.oulu.fi/~scoats/CoGS.html): 50.5m tokens, 39.5k transcripts, 1,308 locations <span class="small">(Coats in review)</span>
  - [CoANZSE](https://cc.oulu.fi/~scoats/CoANZSE.html): 190m tokens, 57k transcripts, 482 locations
  
Freely available for research use; download from the Harvard Dataverse ([CoNASE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/X8QJJV), [CoBISE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UGIIWD), [CoGS](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3Y1YVB),
[CoANZSE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GW35AK))

---

### Data format

<div>
<table border="1" class="dataframe" style="font-size:8pt;border-collapse: collapse;">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>country</th>
      <th>state</th>
      <th>name</th>
      <th>channel_name</th>
      <th>channel_url</th>
      <th>video_title</th>
      <th>video_id</th>
      <th>upload_date</th>
      <th>video_length</th>
      <th>text_pos</th>
      <th>location</th>
      <th>latlong</th>
      <th>nr_words</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>AUS</td>
      <td>NSW</td>
      <td>Wollondilly Shire Council</td>
      <td>Wollondilly Shire</td>
      <td>https://www.youtube.com/c/wollondillyshire</td>
      <td>Road Resurfacing Video</td>
      <td>zVr6S5XkJ28</td>
      <td>20181127</td>
      <td>146.120</td>
      <td>g_NNP_2.75 'day_XX_2.75 my_PRP$_3.75 name_NN_4.53 is_VBZ_4.74 ...</td>
      <td>62/64 Menangle St, Picton NSW 2571, Australia</td>
      <td>(-34.1700078, 150.612913)</td>
      <td>433</td>
    </tr>
    <tr>
      <th>1</th>
      <td>AUS</td>
      <td>NSW</td>
      <td>Wollondilly Shire Council</td>
      <td>Wollondilly Shire</td>
      <td>https://www.youtube.com/c/wollondillyshire</td>
      <td>Weather update 5pm 1 March 2022 - Mayor Matt Gould</td>
      <td>p4MjirCc1oU</td>
      <td>20220301</td>
      <td>181.959</td>
      <td>hi_UH_0.64 guys_NNS_0.96 i_PRP_1.439 'm_VBP_1.439 just_RB_1.76 ...</td>
      <td>62/64 Menangle St, Picton NSW 2571, Australia</td>
      <td>(-34.1700078, 150.612913)</td>
      <td>620</td>
    </tr>
    <tr>
      <th>2</th>
      <td>AUS</td>
      <td>NSW</td>
      <td>Wollondilly Shire Council</td>
      <td>Wollondilly Shire</td>
      <td>https://www.youtube.com/c/wollondillyshire</td>
      <td>Transport Capital Works Video</td>
      <td>DXlkVTcmeho</td>
      <td>20180417</td>
      <td>140.450</td>
      <td>council_NNP_0.53 is_VBZ_1.53 placing_VBG_1.65 is_VBZ_2.07 2018-19_CD_2.57 ...</td>
      <td>62/64 Menangle St, Picton NSW 2571, Australia</td>
      <td>(-34.1700078, 150.612913)</td>
      <td>347</td>
    </tr>
    <tr>
      <th>3</th>
      <td>AUS</td>
      <td>NSW</td>
      <td>Wollondilly Shire Council</td>
      <td>Wollondilly Shire</td>
      <td>https://www.youtube.com/c/wollondillyshire</td>
      <td>Council Meeting Wrap Up February 2022</td>
      <td>2NhuhF2fBu8</td>
      <td>20220224</td>
      <td>107.840</td>
      <td>g_NNP_0.399 'day_NNP_0.399 guys_NNS_0.799 and_CC_1.12 welcome_JJ_1.199 ...</td>
      <td>62/64 Menangle St, Picton NSW 2571, Australia</td>
      <td>(-34.1700078, 150.612913)</td>
      <td>341</td>
    </tr>
    <tr>
      <th>4</th>
      <td>AUS</td>
      <td>NSW</td>
      <td>Wollondilly Shire Council</td>
      <td>Wollondilly Shire</td>
      <td>https://www.youtube.com/c/wollondillyshire</td>
      <td>CITY DEAL  4 March 2018</td>
      <td>4-cv69ZcwVs</td>
      <td>20180305</td>
      <td>130.159</td>
      <td>[Music]_XX_0.85 it_PRP_2.27 's_VBZ_2.27 a_DT_3.27 fantastic_JJ_3.36 ...</td>
      <td>62/64 Menangle St, Picton NSW 2571, Australia</td>
      <td>(-34.1700078, 150.612913)</td>
      <td>420</td>
    </tr>
  </tbody>
</table></div>
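
Each item in the `text_pos` column has the form `word_TAG_time`. A minimal sketch of how such a string could be unpacked into tuples is shown here; the function name `parse_text_pos` is illustrative and not part of the corpus distribution. Splitting from the right keeps word forms that themselves contain underscores intact.

```python
def parse_text_pos(text_pos):
    """Split a text_pos string into (word, tag, time_in_seconds) tuples."""
    tokens = []
    for item in text_pos.split():
        parts = item.rsplit("_", 2)  # split from the right: word, tag, time
        if len(parts) != 3:
            continue  # skip anything that is not word_TAG_time
        word, tag, time = parts
        try:
            tokens.append((word, tag, float(time)))
        except ValueError:
            continue  # skip items whose last field is not a number
    return tokens


print(parse_text_pos("hi_UH_0.64 guys_NNS_0.96 i_PRP_1.439 'm_VBP_1.439"))
```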

---

### Focus on regional and local council channels

Many recordings of meetings of elected councillors: advantages in terms of representativeness and comparability

- Speaker place of residence (cf. videos collected based on place-name search alone)

- Topical contents and communicative contexts comparable

- In most jurisdictions government content is in the public domain

---

### Data collection and processing

- Identification of relevant channels (lists of councils with web pages -> scrape pages for links to YouTube)
- Inspection of returned channels to remove false positives
- Retrieval of ASR transcripts using [yt-dlp](https://github.com/yt-dlp/yt-dlp)
- Geocoding: string containing council name + address + country sent to Google's geocoding service
- PoS tagging with spaCy <span class="small">(Honnibal et al. 2019)</span>
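
The channel identification step can be sketched as follows. This is an illustration under assumptions, not the script used for the corpus: it takes the HTML of an already downloaded council page and pulls out anything that looks like a YouTube channel URL. The pattern `YT_LINK` and the function `youtube_channel_links` are hypothetical names, and the URL shapes covered (`/c/`, `/channel/`, `/user/`, `/@`) are assumed to be the relevant ones.

```python
import re

# Assumed shapes of channel URLs: /c/name, /channel/ID, /user/name, /@handle
YT_LINK = re.compile(
    r"https?://(?:www\.)?youtube\.com/(?:c/|channel/|user/|@)[\w\-.]+",
    re.IGNORECASE,
)


def youtube_channel_links(html):
    """Return the distinct YouTube channel URLs found in a page's HTML."""
    return sorted(set(YT_LINK.findall(html)))


page = (
    '<a href="https://www.youtube.com/c/wollondillyshire">YouTube</a> '
    '<a href="https://example.org/contact">Contact</a>'
)
print(youtube_channel_links(page))
```

The returned links still need the manual inspection mentioned above, since pages may link to channels that do not belong to the council.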

---

### CoANZSE channel locations

---

### CoANZSE corpus size by country/state/territory

.small[
Territory	                  |nr_channels|nr_videos  |nr_words|video_length (h)
----------------------------|---|-------|-----------|----				
Australian Capital Territory|	8	|650	  |915,542	  |111.79
New South Wales             |114|9,741  |27,580,773	|3,428.87
Northern Territory	        |11 |	289	  |315,300	  |48.72
New Zealand	                |74	|18,029	|84,058,661	|10,175.80
Queensland	                |58	|7,356	|19,988,051	|2,642.75
South Australia	            |50	|3,537	|13,856,275	|1,716.72
Tasmania	                  |21	|1,260	|5,086,867	|636.99
Victoria	                  |78	|12,138	|35,304,943	|4,205.40
Western Australia	          |68	|3,815	|8,422,484	|1,063.78
| | | |
Total                       |482|56,815 |195,528,896|24,030.82
]
---

### Potential analyses

- Non-numerical quantifiers *heaps* and *lots*

---

### Corpus use cases and size

- Regional language (dialectology): e.g. syntax, mood and modality
- Pragmatics: Turn-taking, politeness markers
- Script pipeline: Use corpus to identify areas/speakers/words/phonemes of interest, get videos, convert to audio (FFmpeg), automated formant extraction/vowel quality analysis on a large scale

**CoNASE**

Country       | Channels|Videos|Tokens      |Length (h) 
--------------|---------|------|-----------|-----------
US            |2,189     |270,931 |1,149,030,824 | 141,455.11
Canada        | 383      |30,916 |103,035,369  |12,586.77

**CoANZSE**

Country       | Channels|Videos|Tokens      |Length (h) 
--------------|---------|------|-----------|-----------
Australia     |408    |38,786 |111,470,235 | 13,855.02
New Zealand   | 74     |18,029 |84,058,661  |10,175.80

**CoBISE**

Country            | Channels|Videos|Tokens      |Length (h) 
-------------------|---------|------|-----------|-----------
England            |324      |23,657|72,879,173 |8,518.39  
Northern Ireland   | 10      |1,898 |6,508,505  |774.17     
Republic of Ireland| 26      |2,525 |6,264,276  |680.81     
Scotland           |75       |8,135 |17,111,396 |1,845.35   
Wales              |18       |2,465 |8,800,264  |982.66

**CoGS**

Country       | Channels|Videos|Tokens      |Length (h) 
--------------|---------|------|-----------|-----------
Germany     |1,313    |39,495 |50,554,070 | 7,223.44

---

### Example analysis: Double modals

- Non-standard rare syntactic feature<span class="small"> (Montgomery & Nagle 1994; Coats 2022a)</span>
  - *I might could help you with this*
- Occurs only in the American Southeast and in Scotland/Northern England/Northern Ireland?
- Most studies based on non-naturalistic data with limited geographical scope <span class="small">(data from linguistic atlas interviews, surveys administered mostly in American Southeast and North of Britain)</span>
- More widely used in North America and the British Isles than previously thought (Coats 2022a, Coats in review)
- Little studied in Australian and New Zealand speech

---

### Script: Generating a table for manual inspection of double modals

- Base modals *will, would, can, could, might, may, must, should, shall, used to, 'll, ought to, oughta*
- Script to generate regexes of two-tier combinations

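```python
import re
import pandas as pd

# Assumes `modals` is a list of (first, second) modal pairs, e.g. ("might", "could"),
# and `coanzse_df` is the corpus data frame with the columns shown under "Data format".
hits = []
for m1, m2 in modals:
    # Two adjacent word_TAG_time tokens whose word forms are the two modals
    pat = re.compile(
        r"(?<!\S)(" + re.escape(m1) + r"_\w+_\S+\s+" + re.escape(m2) + r"_\w+_\S+)",
        re.IGNORECASE,
    )
    for _, row in coanzse_df.iterrows():
        for z in pat.findall(row["text_pos"]):
            first, second = z.split()
            seq = first.split("_")[0] + " " + second.split("_")[0]
            time = float(first.split("_")[-1])
            # Link to the video 3 seconds before the first modal (never before 0)
            url = "https://youtu.be/" + row["video_id"] + "?t=" + str(max(0, round(time - 3)))
            hits.append((row["country"], row["channel_name"], seq, url))
pd.DataFrame(hits, columns=["country", "channel", "dm", "link"])
```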

- The script creates a URL for each search hit at a time 3 seconds before the targeted utterance 
- In the resulting data frame, each utterance can be annotated after examining the targeted video sequence
- Filter out non-double-modals (clause overlap, speaker self-repairs, ASR errors)
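
The script above iterates over `modals` without showing how that list is built. One possible construction, offered as a sketch under assumptions, is every ordered pair of distinct base modals. Only single-token forms are included here: *used to* and *ought to* span two tokens in the `word_TAG_time` format and would need a pattern of their own. The name `BASE_MODALS` is illustrative.

```python
from itertools import permutations

# Single-token base modals from the list above (multi-word forms left out)
BASE_MODALS = [
    "will", "would", "can", "could", "might", "may",
    "must", "should", "shall", "'ll", "oughta",
]

# Every ordered pair of two different modals, e.g. ("might", "could")
modals = list(permutations(BASE_MODALS, 2))

print(len(modals))
print(modals[:3])
```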

---

### Excerpt from generated table

<div id="htmlwidget-25c96499dc51214e6c3a" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-25c96499dc51214e6c3a">{"x":{"filter":"none","vertical":false,"data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49","50","51","52","53","54","55","56","57"],["NSW","NSW","NSW","NSW","NSW","NSW","NSW","NSW","NSW","VIC","VIC","VIC","SA","SA","SA","SA","SA","SA","SA","QLD","QLD","QLD","TAS","TAS","TAS","TAS","TAS","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ","NZ"],["Central Darling Shire Council","Dubbo Regional Council","Inner West Council","Ku-ring-gai Council","Ku-ring-gai Council","mosmancouncil","Wingecarribee Shire Council","Wingecarribee Shire Council","Hunter Joint Organisation","Cardinia TV","Latrobe City Council","WyndhamCity","City of Adelaide","City of Burnside","Town of Gawler","Town of Gawler","City of Onkaparinga","CityOfPlayford","City of Victor Harbor","RRCouncil","Logan City Council","NOOSA COUNCIL TV","Clarence City Council","George Town Council Tasmania","Glamorgan Spring Bay Council","Huon Valley Council","King Island TV","Bay of Plenty Regional Council","Environment Canterbury","Christchurch City Council","Christchurch City Council","Christchurch City Council","Christchurch City Council","Christchurch City Council","Christchurch City Council","Christchurch City Council","Christchurch City Council","Dunedin City Council","Hamilton City Council","Hamilton City Council","Hamilton City Council","Hamilton City Council","Hamilton City Council","Hamilton City Council","Hastings District Council","Napier City Council","Nelson City Council","Otago Regional Council Official","Taupo District Council","Tauranga City Council","Tauranga City Council","Waikato District Council","Waikato District Council","Waikato District 
Council","Waipa District Council","Westland District Council","Whanganui District Council"],["24 February 2021 Part 2","Dubbo City Council State of the City Report 2014","Speaker Series - Shiver with Allie Reynolds","3D Bushfire Simulation and CWC Workshop","Ordinary Meeting of Council 20_08_2019","Mosman Art Prize - In Conversation Salote Tawale","Extraordinary Council Meeting 16 Feb 2022","Ordinary Meeting of Council 13 May 2020 - part one","Hunter Global Summit Day 1 Session 1","Cardinia Shire Council Meeting, May 18 2020","Latrobe City Council Webinars - Branding and Design","Powers of Attorney & Planning for the future - 2019 Disability Expo","Council Assessment Panel Meeting - 25 May 2020","Burnside Council Meeting 12 March 2019 7 pm","Council Member Workshop - Climate Emergency Action Plan","Special Council Meeting - 29_9_2020","Council meeting 18 May 2021","Ordinary Council - 15 December 2020","City of Victor Harbor Ordinary Council Meeting _ July 2021 (Continuation)","The Short Fall talking about RADF","16_02_21 - The City Planning, Economic Development & Environment Committee","Noosa Council Services & Organisation Committee Meeting - 10 November 2020","Clarence City Council - Council Meeting 28th February 2022","George Town Council Ordinary Meeting held 22nd February 2022 Part 3","Ordinary Meeting of Council - December 11, 2018","Huon Valley Council - Ordinary Council Meeting 27 May 2020","Interview Peter Youd","Public Transport Committee Zoom VIDEO Recording - 30 November 2021","Council Meeting 9 December 2021","09.12.21 - Item 31 - Memorandum of Understanding","30.04.19 - Item 3 - Colin Meurk","09.08.18 - Christchurch City Council meeting","09.08.2018 - Item 17 - Water Supply Programme Update","28.06.18 - Item 10 - Voluntary Smokefree Outdoor Dining in Council-licenced footpath areas","05.04.18 - Item 8 - Coastal - Burwood Community Board Report to Council","19.05.15 - Item 1 - Hearings of Submissions - The Cranmer Bridge Club - John Nimmo","17.04.14 - 
District Plan Review - Part 4","Committee Meetings - 31 August 2020","HCC HCC Meeting 14 March 2019 Part 4","HCC Annual Plan Meeting 26 February 2019 Part 2","HCC Finance Meeting 28 August Part 1","HCC Meeting 24th May 2018 Part 1","HCC G&I meeting 20 Feb 201 7 part 2","HCC Community and services meeting 27th June Part 2","Council Meeting – 14_07_2020","Sustainable Napier Committee - 13th February 2020 - Part 3","Council meeting Thursday, 24 June 2021","Strategy and Planning Committee - 11 August 2021","2014-08-26 Taupo Council Meeting - Part 2","Policy Committee - 19 February 2020","Urban Form and Transport Development Committee meeting - 17 March 2020 - Part 1 of 2","Local Alcohol Policy Workshop - 11 April 2022","Raglan Community Board - 27 October 2021","LTP Workshop - 30 June 2020","Finance & Corporate Committee - Zoom Meeting","Capital Projects & Tenders Committee Meeting","Building Owners Meeting - 27 May 2019"],["would might","'ll can","would might","might would","would might","might could","would might","would might","will can","would might","would might","would might","might could","'ll can","would might","would might","might could","would might","would might","would might","would might","would might","would might","would might","'ll can","would might","might would","would might","would might","might could","'ll can","might would","might would","might would","would might","would might","might would","would might","would might","would might","would might","'ll can","would might","would might","would might","would might","would might","would might","might can","would might","would might","would might","would might","would might","might would","might could","'ll can"],["<a href=https://youtu.be/4JhDv6H_rMQ?t=63>https://youtu.be/4JhDv6H_rMQ?t=63<\/a>","<a href=https://youtu.be/zOyDAMACmFk?t=190>https://youtu.be/zOyDAMACmFk?t=190<\/a>","<a href=https://youtu.be/WrmDQhsqv5s?t=568>https://youtu.be/WrmDQhsqv5s?t=568<\/a>","<a 
href=https://youtu.be/KhxiXPQBFXs?t=1232>https://youtu.be/KhxiXPQBFXs?t=1232<\/a>","<a href=https://youtu.be/n80tXfiqQzA?t=6192>https://youtu.be/n80tXfiqQzA?t=6192<\/a>","<a href=https://youtu.be/jQbDqA1yvhM?t=117>https://youtu.be/jQbDqA1yvhM?t=117<\/a>","<a href=https://youtu.be/kwGrKSIIDcQ?t=2997>https://youtu.be/kwGrKSIIDcQ?t=2997<\/a>","<a href=https://youtu.be/whP9EfvuouQ?t=3822>https://youtu.be/whP9EfvuouQ?t=3822<\/a>","<a href=https://youtu.be/6kHJiJMugPs?t=2351>https://youtu.be/6kHJiJMugPs?t=2351<\/a>","<a href=https://youtu.be/LX88aDEQCHY?t=1206>https://youtu.be/LX88aDEQCHY?t=1206<\/a>","<a href=https://youtu.be/7ukJvOujPfQ?t=1044>https://youtu.be/7ukJvOujPfQ?t=1044<\/a>","<a href=https://youtu.be/jFwUaeH452Q?t=804>https://youtu.be/jFwUaeH452Q?t=804<\/a>","<a href=https://youtu.be/6Tk9LilbFQU?t=2586>https://youtu.be/6Tk9LilbFQU?t=2586<\/a>","<a href=https://youtu.be/NwPfjcB8cq8?t=9061>https://youtu.be/NwPfjcB8cq8?t=9061<\/a>","<a href=https://youtu.be/nayq_0Stx2E?t=1519>https://youtu.be/nayq_0Stx2E?t=1519<\/a>","<a href=https://youtu.be/qgN_NF2Plqc?t=6825>https://youtu.be/qgN_NF2Plqc?t=6825<\/a>","<a href=https://youtu.be/e5kOcWgU4o8?t=13474>https://youtu.be/e5kOcWgU4o8?t=13474<\/a>","<a href=https://youtu.be/H35lwri328Q?t=4148>https://youtu.be/H35lwri328Q?t=4148<\/a>","<a href=https://youtu.be/TAIn0QH8VKM?t=10799>https://youtu.be/TAIn0QH8VKM?t=10799<\/a>","<a href=https://youtu.be/3a_MEXeW7H8?t=55>https://youtu.be/3a_MEXeW7H8?t=55<\/a>","<a href=https://youtu.be/6ro3nmNtutc?t=3751>https://youtu.be/6ro3nmNtutc?t=3751<\/a>","<a href=https://youtu.be/efGprIT2zho?t=5256>https://youtu.be/efGprIT2zho?t=5256<\/a>","<a href=https://youtu.be/cW_jBLyo0vo?t=6760>https://youtu.be/cW_jBLyo0vo?t=6760<\/a>","<a href=https://youtu.be/1lUsn3fwm_Y?t=450>https://youtu.be/1lUsn3fwm_Y?t=450<\/a>","<a href=https://youtu.be/4mum1Yur000?t=703>https://youtu.be/4mum1Yur000?t=703<\/a>","<a href=https://youtu.be/uBMu9GMDFaU?t=4480>https://youtu.be/uBMu9GMDFaU?t=4480<\/a>","<a 
href=https://youtu.be/pFb49I4p0xQ?t=308>https://youtu.be/pFb49I4p0xQ?t=308<\/a>","<a href=https://youtu.be/mHtIRAlc2w4?t=7061>https://youtu.be/mHtIRAlc2w4?t=7061<\/a>","<a href=https://youtu.be/h-Ue9-iD3mc?t=6800>https://youtu.be/h-Ue9-iD3mc?t=6800<\/a>","<a href=https://youtu.be/JO7vMyroJQo?t=1425>https://youtu.be/JO7vMyroJQo?t=1425<\/a>","<a href=https://youtu.be/MRZSHSAhqZ4?t=281>https://youtu.be/MRZSHSAhqZ4?t=281<\/a>","<a href=https://youtu.be/jzZzR2yHjf4?t=7062>https://youtu.be/jzZzR2yHjf4?t=7062<\/a>","<a href=https://youtu.be/khHQeskq9VY?t=3782>https://youtu.be/khHQeskq9VY?t=3782<\/a>","<a href=https://youtu.be/T5PwFRVU2vo?t=863>https://youtu.be/T5PwFRVU2vo?t=863<\/a>","<a href=https://youtu.be/BM25w7hI628?t=1034>https://youtu.be/BM25w7hI628?t=1034<\/a>","<a href=https://youtu.be/nmgg2LeCRh8?t=380>https://youtu.be/nmgg2LeCRh8?t=380<\/a>","<a href=https://youtu.be/l7fuhKQ-Nrs?t=197>https://youtu.be/l7fuhKQ-Nrs?t=197<\/a>","<a href=https://youtu.be/ifMwL7L4ZRc?t=3353>https://youtu.be/ifMwL7L4ZRc?t=3353<\/a>","<a href=https://youtu.be/CbR4GSo5Tr0?t=1291>https://youtu.be/CbR4GSo5Tr0?t=1291<\/a>","<a href=https://youtu.be/KR7DEpF6cPo?t=2352>https://youtu.be/KR7DEpF6cPo?t=2352<\/a>","<a href=https://youtu.be/x2MIZAQbtlg?t=4392>https://youtu.be/x2MIZAQbtlg?t=4392<\/a>","<a href=https://youtu.be/PTCRbmvQ1_w?t=9366>https://youtu.be/PTCRbmvQ1_w?t=9366<\/a>","<a href=https://youtu.be/UGHGqS_OO6o?t=696>https://youtu.be/UGHGqS_OO6o?t=696<\/a>","<a href=https://youtu.be/cWElounayJo?t=5189>https://youtu.be/cWElounayJo?t=5189<\/a>","<a href=https://youtu.be/_u_QyZmmhq4?t=2725>https://youtu.be/_u_QyZmmhq4?t=2725<\/a>","<a href=https://youtu.be/gdzgqjJ4nhY?t=1453>https://youtu.be/gdzgqjJ4nhY?t=1453<\/a>","<a href=https://youtu.be/z3aqzSzw8ek?t=3266>https://youtu.be/z3aqzSzw8ek?t=3266<\/a>","<a href=https://youtu.be/nQ_zHzfPBXk?t=1346>https://youtu.be/nQ_zHzfPBXk?t=1346<\/a>","<a href=https://youtu.be/6tHoNtddg_4?t=78>https://youtu.be/6tHoNtddg_4?t=78<\/a>","<a 
href=https://youtu.be/InYpTU9ZuTI?t=2251>https://youtu.be/InYpTU9ZuTI?t=2251<\/a>","<a href=https://youtu.be/FpEpRZGeQDw?t=11023>https://youtu.be/FpEpRZGeQDw?t=11023<\/a>","<a href=https://youtu.be/KkSNB-dJZs8?t=2945>https://youtu.be/KkSNB-dJZs8?t=2945<\/a>","<a href=https://youtu.be/RWFqTJCqkYE?t=6684>https://youtu.be/RWFqTJCqkYE?t=6684<\/a>","<a href=https://youtu.be/AbrvLqxsSTg?t=1292>https://youtu.be/AbrvLqxsSTg?t=1292<\/a>","<a href=https://youtu.be/53yPfrqbpkE?t=2611>https://youtu.be/53yPfrqbpkE?t=2611<\/a>","<a href=https://youtu.be/yIR6wFdNUKI?t=1888>https://youtu.be/yIR6wFdNUKI?t=1888<\/a>","<a href=https://youtu.be/WTP15-spw3A?t=4809>https://youtu.be/WTP15-spw3A?t=4809<\/a>"],["t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","sr d t","t","t","t","t","t","fp a2 t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","t","fp a2 t","t"],["\"however, the senior planning officer would might may want to make comment\"","\"we'll, we'll can forget about that plan for a while\"","also in embedded manual transcript","\"for anything that might would... 
go wrong\""," "," ","\"if you would might just convey\"","\"if they could move them down the hill further, I think they would might find that\""," "," ","\"people would might have chosen a Commodore over a Merc\"","\"again, there would might be a conflict of interest\"","\"it's a reasonable height, it might could have been taller\""," "," ","\"that would might be a bit of a challenge\"","\"I accept that they now might could have been better worded\""," ","\"it would might be, and this is the reason why\"","younger people","\"are there any other further councilors that would might make comment?\""," ","\"then we would might be able to get\"","\"this morning's workshop, would it might have been a good place...\" dm in question","\"so I'll can raise that with the relevant people\""," ","\"My question might would have been\"","slight pause after dm"," ","\"if they'd done something different way back, things might could have been better\"","\"they'll can accumulate over the next week\"","\"something that might would be in public ownership\"","same as 2003","\"once they've finished with this cigarette, if they must might, we'd like them to go outside\""," "," ","\"and that might would assist them\""," "," "," ","\"how did it slip through would mighta been another way of putting it\""," ","\"we would might have liked them to do that\"","\"where would might it be covered\" question form","Scottish accent?","\"I think what would might be useful is\" Wh- word"," "," ","\"on what might can constitute\""," ","\"we would might want to do, as councilor Morris said\" American/Canadian accent"," "," "," ","\"she might would be aware there are 23 projects\"","\"that we would might get a\"","\"and he'll can walk into a building\" narrative"]],"container":"<table class=\"cell-border stripe\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>Location<\/th>\n      <th>Channel<\/th>\n      <th>Video<\/th>\n      <th>DM<\/th>\n      <th>Link<\/th>\n      <th>Type<\/th>\n      
<th>Notes<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"pageLength":100,"dom":"tip","scrollY":"400px","rownames":false,"order":[],"autoWidth":false,"orderClasses":false,"columnDefs":[{"orderable":false,"targets":0}],"rowCallback":"function(row, data, displayNum, displayIndex, dataIndex) {\n}"}},"evals":["options.rowCallback"],"jsHooks":[]}</script>

---

### Finding features

- Regular-expression-search and manual annotation approach
- Double modals can be found in the US North and West and in Canada; in Scotland, N. Ireland, and N. England, but also in the English Midlands and South and in Wales <span class="small">(Coats in review)</span>
- Also in Australia and (especially) New Zealand!

---

### Training a classifier on the basis of common word types

- Simple machine-learning classifiers using SVM, logistic regression, or other algorithms can distinguish between Australian and NZ transcripts on the basis of the 500 most common words in CoANZSE
<br><br>
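
The feature side of such a classifier can be sketched with the standard library alone. The following is an illustration, assuming transcripts have already been reduced to plain strings of word forms: it builds a vocabulary of the most common word types and turns each transcript into a vector of relative frequencies, which could then be passed to an SVM or logistic regression implementation. The names `top_words` and `featurize` and the toy `docs` list are hypothetical.

```python
from collections import Counter


def top_words(transcripts, n=500):
    """Return the n most frequent word types across all transcripts."""
    counts = Counter(w for t in transcripts for w in t.lower().split())
    return [w for w, _ in counts.most_common(n)]


def featurize(transcript, vocab):
    """Relative frequency of each vocabulary word in one transcript."""
    counts = Counter(transcript.lower().split())
    total = sum(counts.values()) or 1  # avoid division by zero for empty input
    return [counts[w] / total for w in vocab]


docs = ["heaps heaps heaps of of", "lots"]
vocab = top_words(docs, n=2)
print(vocab)
print(featurize("heaps of lots lots", vocab))
```

Relative rather than raw frequencies are used so that long and short transcripts are comparable.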

---

### Pipeline for acoustic analysis

- Regular expressions to target specific words/phrases in the corpus
- Extract audio span containing the targeted item(s) from YT stream
- Feed audio and transcript excerpt to forced aligner
- Extract desired sounds/acoustic phenomena
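
The first two steps can be sketched as below, under several assumptions: the audio has already been saved to a local file (here the hypothetical `video.m4a`), the corpus supplies only word start times, so a fixed window is cut around each hit, and the command is built but not executed. The function names and the default window sizes are illustrative; the FFmpeg options used are `-ss` (seek), `-t` (duration), `-ac` (channels) and `-ar` (sample rate).

```python
import re


def word_times(text_pos, word):
    """Start times (in seconds) of every occurrence of `word` in a text_pos string."""
    pat = re.compile(
        r"(?<!\S)" + re.escape(word) + r"_\w+_(\d+(?:\.\d+)?)(?!\S)",
        re.IGNORECASE,
    )
    return [float(t) for t in pat.findall(text_pos)]


def ffmpeg_clip_command(audio_path, start, out_path, before=0.5, length=2.0):
    """Build an FFmpeg command that cuts a mono 16 kHz clip around a word onset."""
    clip_start = max(0.0, start - before)
    return [
        "ffmpeg", "-ss", f"{clip_start:.3f}", "-t", f"{length:.3f}",
        "-i", audio_path, "-ac", "1", "-ar", "16000", out_path,
    ]


times = word_times("hi_UH_0.64 today_NN_1.5 is_VBZ_2.0 Today_NN_7.25", "today")
print(times)
print(ffmpeg_clip_command("video.m4a", times[0], "today_0.wav"))
```

The resulting clips, together with the matching transcript excerpts, would then go to the forced aligner.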

---

### Extracted *today* tokens

---

### Average <span style="font-family:serif;font-weight:bolder">eɪ</span> diphthong

![:scale 70%](data:image/png;base64,#./eY_coanzse.png)

---

### A few caveats

- Videos of local government not representative of speech in general
- ASR errors (mean WER after filtering ~14%), quality of transcript related to quality of audio as well as dialect features <span class="small">(Tatman 2017; Meyer et al. 2020; Markl & Lai 2021)</span>
  - Low-frequency phenomena: manually inspect corpus hits
  - High-frequency phenomena: signal of correct transcriptions will be stronger <span class="small">(Agarwal et al. 2007)</span> → classifiers
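
For reference, word error rate is conventionally defined as the number of word substitutions, deletions and insertions needed to turn the ASR output into the reference transcript, divided by the number of reference words. A minimal sketch of that standard definition follows; how the ~14% figure above was computed is not specified here, and the function name `wer` is illustrative.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# One of six reference words is missing from the hypothesis
print(wer("it might could have been taller", "it might have been taller"))
```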

---

### Summary and outlook

- Large corpus of ASR transcripts from YouTube channels of local governments in Australia/NZ
- Possibly useful for corpus studies of spoken language, dialectology, pragmatics
- Double modals are more widespread than has previously been documented

---

# Thank you!

---

### References

.small[
.hangingindent[

Agarwal, S., Godbole, S., Punjani, D., & Roy, S. (2007). [How much noise is too much: A study in automatic text classification](https://doi.org/10.1109/ICDM.2007.21). In: *Seventh IEEE International Conference on Data Mining (ICDM 2007)*, 3–12.

Bradshaw, J., Burridge, K., & Clyne, M. (2010). The Monash Corpus of Spoken Australian English. In L. de Beuzeville & P. Peters (Eds.), *Proceedings of the 2008 Conference of the Australian Linguistics Society*.

Cassidy, S., Haugh, M., Peters, P., & Fallu, M. (2012). The Australian National Corpus: National infrastructure for language resources. In *Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)*, 3295–3299. http://www.lrec-conf.org/proceedings/lrec2012/pdf/400_Paper.pdf

Coats, S. (In review). Double modals in contemporary British and Irish speech.

Coats, S. (Forthcoming). Dialect corpora from YouTube. In B. Busse, N. Dumrukcic, & I. Kleiber (Eds.), *Language and linguistics in a complex world*. De Gruyter.

Coats, S. (2022a). [Naturalistic double modals in North America](https://doi.org/10.1215/00031283-9766889). *American Speech*.

Coats, S. (2022b). The Corpus of British Isles Spoken English (CoBISE): A new resource of contemporary British and Irish speech. In K. Berglund, M. La Mela, & I. Zwart (Eds.), [*Proceedings of the 6th Digital Humanities in the Nordic and Baltic Countries Conference, Uppsala, Sweden, March 15–18, 2022*](http://ceur-ws.org/Vol-3232/paper15.pdf), 187–194. Aachen, Germany: CEUR.

Gordon, E., Maclagan, M., & Hay, J. (2007). The ONZE corpus. In J. C. Beal, K. P. Corrigan, & H. Moisl (Eds.), *Creating and digitizing language corpora volume 2: Diachronic databases*, 82–104. Palgrave Macmillan.

Grieve, J., Montgomery, C., Nini, A., Murakami, A., & Guo, D. (2019). [Mapping lexical dialect variation in British English using Twitter](https://doi.org/10.3389/frai.2019.00011). *Frontiers in Artificial Intelligence* 2.
]]

---

### References II

.small[
.hangingindent[
Holmes, J., Vine, B., & Johnson, G. (1998). [*Guide to the Wellington Corpus of Spoken New Zealand English*](http://korpus.uib.no/icame/manuals/WSC/INDEX.HTM).

Honnibal, M. et al. (2019). [Explosion/spaCy v2.1.7: Improved evaluation, better language factories and bug fixes](https://doi.org/10.5281/zenodo.3358113).

Markl, N. & Lai, C. (2021). [Context-sensitive evaluation of automatic speech recognition: considering user experience & language variation](https://aclanthology.org/2021.hcinlp-1.6). In: *Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing*, 34–40. Association for Computational Linguistics.

Meyer, J., Rauchenstein, L., Eisenberg, J. D., & Howell, N. (2020). [Artie bias corpus: An open dataset for detecting demographic bias in speech applications](https://aclanthology.org/2020.lrec-1.796). In: *Proceedings of the 12th Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020*, 6462–6468.

Montgomery, M. B. & Nagle, S. J. (1994). Double modals in Scotland and the Southern United States: Trans-atlantic inheritance or independent development? *Folia Linguistica Historica* 14, 91–108.

Nerbonne, J. (2009). Data-driven dialectology. *Language and Linguistics Compass* 3, 175–198.

Szmrecsanyi, B. (2011). Corpus-based dialectometry: A methodological sketch. *Corpora* 6, 45–76.

Szmrecsanyi, B. (2013). *Grammatical variation in British English dialects: A study in corpus-based dialectometry*. Cambridge University Press.

Tatman, R. (2017). [Gender and dialect bias in YouTube’s automatic captions](https://aclanthology.org/W17-1606). In: *Proceedings of the First ACL Workshop on Ethics in Natural Language Processing*, 53–59. Association for Computational Linguistics.

]]