class: inverse, center, middle
background-image: url(https://cc.oulu.fi/~scoats/oululogoRedTransparent.png)
background-repeat: no-repeat
background-size: 80px 57px
background-position: right top
exclude: true

---
class: title-slide

<br><br><br><br><br>
.pull-right[
<span style="font-family:Roboto Condensed;font-size:24pt;font-weight: 900;font-style: normal;float:right;text-align: right;color:white;-webkit-text-fill-color: black;-webkit-text-stroke: 0.8px;">Corpora for the study of multimodal variation in English: Acoustic analysis from CoNASE</span>
]
<br><br><br><br>
<p style="float:right;text-align: right;color:white;font-weight: 700;font-style: normal;-webkit-text-fill-color: black;-webkit-text-stroke: 0.5px;">
Steven Coats<br>
English, University of Oulu, Finland<br>
<a href="mailto:steven.coats@oulu.fi">steven.coats@oulu.fi</a><br>
KTP 2023<br>
May 25th, 2023<br>
</p>

---
layout: true

<div class="my-header"><img border="0" alt="University of Oulu" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;Corpora for Multimodal Variation | KTP 2023, Oulu</span></div>

---
exclude: true

<div class="my-header"><img border="0" alt="University of Oulu" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;Corpora for Multimodal Variation | KTP 2023, Oulu</span></div>

## Outline

1. Background, YouTube ASR captions files, data collection and processing
2. CoNASE, CoBISE, CoANZSE
3. Example: Acoustic analysis pipeline
4. Caveats, summary

.footnote[Slides for the presentation are on my homepage at https://cc.oulu.fi/~scoats]

<div class="my-header"><img border="0" alt="University of Oulu" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;Corpora for Multimodal Variation | KTP 2023, Oulu</span></div>

---
### Background

- Renaissance in corpus-based study of regional English varieties <span class="small">(Nerbonne 2009; Szmrecsanyi 2011, 2013)</span>
- Research data often consist of text sourced from the web and social media <span class="small">(e.g., Davies 2008–; Grieve et al. 2019)</span>
- There are relatively few large, geolocated multimodal corpora containing audio/video as well as transcribed text

.small[
Corpus | Location | # Words | Reference
----------------------|-------------------|--------|--------------------------
Santa Barbara Corpus | US | ~249k | Du Bois et al. 2000–2005
Spoken BNC2014 | UK | ~10m | Love et al. 2017; Brezina et al. 2018
]

- Vast amounts of streamed video data are available online, much of which can be harnessed for linguistic research
- Combining streamed content with Automatic Speech Recognition (ASR) transcripts and geolocation enables:
  - creation of multimodal corpora for specific locations
  - forced alignment for phonetic/prosodic analysis <span class="small">(Coto-Solano et al. 2021)</span>
  - analysis of grammatical, acoustic, pragmatic, and possibly visual properties of naturalistic speech

---
### Example video

<iframe width="560" height="315" src="https://www.youtube.com/embed/WY9RPeXA3pw?rel=0&&showinfo=0&cc_load_policy=1&cc_lang_pref=en" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---
### WebVTT file

![](./WY9RPeXA3pw_vtt.png)

---
### YouTube captions files

- Videos can have multiple captions files: user-uploaded captions, auto-generated captions created with automatic speech recognition (ASR), both, or neither
- User-uploaded captions can be manually created or generated by third-party ASR software
- Auto-generated captions are produced by YouTube's own speech-to-text service
- CoNASE, CoANZSE, and CoBISE target the YT ASR captions

---
### YouTube ASR Corpora

US, Canada, England, Scotland, Wales, Northern Ireland, the Republic of Ireland, Australia, New Zealand, and Germany

- [CoNASE](https://cc.oulu.fi/~scoats/CoNASE.html): 1.25b-token corpus of 301,846 word-timed, part-of-speech-tagged Automatic Speech Recognition (ASR) transcripts <span class="small">(Coats 2023)</span>
- [CoBISE](https://cc.oulu.fi/~scoats/CoBISE.html): 112m tokens, 452 locations, 38,680 ASR transcripts <span class="small">(Coats 2022b)</span>
- [CoANZSE](https://cc.oulu.fi/~scoats/CoANZSE.html): 190m tokens, 482 locations, 57k transcripts <span class="small">(Coats 2022b)</span>

Also [CoGS](https://cc.oulu.fi/~scoats/CoGS.html): 50.5m tokens, 1,308 locations, 39.5k transcripts <span class="small">(Coats in review)</span>

Freely available for research use; download from the Harvard Dataverse ([CoNASE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/X8QJJV), [CoBISE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UGIIWD), [CoGS](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3Y1YVB), [CoANZSE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GW35AK))

---
### Data format

<div>
<table border="1" class="dataframe" style="font-size:8pt;border-collapse: collapse;">
<thead>
<tr style="text-align: right;"> <th></th> <th>country</th> <th>state</th> <th>name</th> <th>channel_name</th> <th>channel_url</th> <th>video_title</th> <th>video_id</th> <th>upload_date</th> <th>video_length</th> <th>text_pos</th> <th>location</th> <th>latlong</th> <th>nr_words</th> </tr>
</thead>
<tbody>
<tr> <th>0</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Road Resurfacing Video</td> <td>zVr6S5XkJ28</td> <td>20181127</td> <td>146.120</td> <td>g_NNP_2.75 'day_XX_2.75 my_PRP$_3.75 name_NN_4.53 is_VBZ_4.74 ...</td> <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>433</td> </tr>
<tr> <th>1</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Weather update 5pm 1 March 2022 - Mayor Matt Gould</td> <td>p4MjirCc1oU</td> <td>20220301</td> <td>181.959</td> <td>hi_UH_0.64 guys_NNS_0.96 i_PRP_1.439 'm_VBP_1.439 just_RB_1.76 ...</td>
<td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>620</td> </tr>
<tr> <th>2</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Transport Capital Works Video</td> <td>DXlkVTcmeho</td> <td>20180417</td> <td>140.450</td> <td>council_NNP_0.53 is_VBZ_1.53 placing_VBG_1.65 is_VBZ_2.07 2018-19_CD_2.57 ...</td> <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>347</td> </tr>
<tr> <th>3</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>Council Meeting Wrap Up February 2022</td> <td>2NhuhF2fBu8</td> <td>20220224</td> <td>107.840</td> <td>g_NNP_0.399 'day_NNP_0.399 guys_NNS_0.799 and_CC_1.12 welcome_JJ_1.199 ...</td> <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>341</td> </tr>
<tr> <th>4</th> <td>AUS</td> <td>NSW</td> <td>Wollondilly Shire Council</td> <td>Wollondilly Shire</td> <td>https://www.youtube.com/c/wollondillyshire</td> <td>CITY DEAL 4 March 2018</td> <td>4-cv69ZcwVs</td> <td>20180305</td> <td>130.159</td> <td>[Music]_XX_0.85 it_PRP_2.27 's_VBZ_2.27 a_DT_3.27 fantastic_JJ_3.36 ...</td> <td>62/64 Menangle St, Picton NSW 2571, Australia</td> <td>(-34.1700078, 150.612913)</td> <td>420</td> </tr>
</tbody>
</table>
</div>

---
### Focus on regional and local council channels

Many recordings of meetings of elected councillors: advantages in terms of representativeness and comparability

- Speaker place of residence is known (cf. videos collected via place-name search alone)
- Topical contents and communicative contexts are comparable
- In most jurisdictions, government content is in the public domain

---
### Data collection and processing

- Identification of relevant channels (lists of councils with web pages → scrape the pages for links to YouTube)
- Inspection of returned channels to remove false positives
- Retrieval of ASR transcripts using [YT-DLP](https://github.com/yt-dlp/yt-dlp) (see the code sketch on a later slide)
- Geocoding: strings of council name + address + country are sent to Google's geocoding service (sketch on a later slide)
- PoS tagging with spaCy <span class="small">(Honnibal et al. 2019)</span>

---
exclude: true
### Potential analyses

- Non-numerical quantifiers *heaps* and *lots*

---
**CoNASE**

.verysmall[
| State | Channels | Videos | Words | Length (h) | State | Channels | Videos | Words | Length (h) | State | Channels | Videos | Words | Length (h) |
| -------------------- | -------- | ------ | ---------- | ---------- | ------------ | -------- | ------ | ---------- | ---------- | ------------------------- | -------- | ------ | ---------- | ---------- |
| Alabama | 27 | 2,827 | 10,581,345 | 1,315.67 | Michigan | 90 | 9,832 | 51,293,982 | 6,079.47 | Texas | 155 | 21,330 | 44,736,009 | 5,789.44 |
| Alaska | 6 | 451 | 1,854,654 | 248.37 | Minnesota | 80 | 8,666 | 31,366,468 | 3,661.89 | Utah | 21 | 2,561 | 7,766,782 | 940.21 |
| Arizona | 35 | 6,356 | 26,393,272 | 3,063.73 | Mississippi | 18 | 1,448 | 2,613,901 | 346.07 | Vermont | 3 | 94 | 131,558 | 16.62 |
| Arkansas | 14 | 986 | 6,748,658 | 882.77 | Missouri | 53 | 5,093 | 15,094,086 | 1,946.43 | Virginia | 42 | 9,209 | 34,806,149 | 4,059.67 |
| California | 211 | 18,278 | 83,915,246 | 10,146.57 | Montana | 3 | 145 | 926,229 | 143.2 | Washington | 51 | 6,178 | 28,949,403 | 3,387.77 |
| Colorado | 56 | 8,802 | 36,551,218 | 4,299.68 | Nebraska | 16 | 677 | 2,487,171 | 312.51 | W. Virginia | 6 | 101 | 196,479 | 25.86 |
| Connecticut | 25 | 3,731 | 24,549,746 | 3,010.04 | Nevada | 5 | 2,759 | 6,110,915 | 638.06 | Wisconsin | 83 | 9,514 | 45,983,568 | 5,744.59 |
| Delaware | 3 | 148 | 242,073 | 25.45 | N.H. | 11 | 1,305 | 10,913,552 | 1,469.04 | Wyoming | 7 | 251 | 2,638,963 | 348.39 |
| District of Columbia | 3 | 242 | 261,209 | 32.9 | New Jersey | 88 | 6,982 | 29,523,334 | 3,977.57 | Alberta | 95 | 6,623 | 21,239,251 | 2,497.45 |
| Florida | 89 | 17,625 | 64,647,923 | 7,468.48 | New Mexico | 14 | 1,895 | 6,750,477 | 883.1 | British Columbia | 102 | 10,002 | 26,853,481 | 3,246.83 |
| Georgia | 49 | 5,487 | 18,565,796 | 2,421.53 | New York | 97 | 8,037 | 37,560,959 | 4,856.87 | Manitoba | 20 | 3,286 | 2,771,200 | 318.21 |
| Hawaii | 1 | 152 | 123,617 | 15.42 | N. Carolina | 97 | 11,357 | 46,231,979 | 5,781.4 | New Brunswick | 8 | 382 | 2,347,141 | 278.05 |
| Idaho | 11 | 1,547 | 8,747,885 | 1,012.14 | N. Dakota | 10 | 768 | 3,616,363 | 442.05 | Newfoundland and Labrador | 2 | 108 | 186,070 | 29.99 |
| Illinois | 151 | 14,243 | 54,613,612 | 6,725.31 | Ohio | 97 | 7,647 | 33,695,476 | 4,268.46 | Northwest Territories | 3 | 32 | 21,404 | 3.27 |
| Indiana | 46 | 4,017 | 12,958,084 | 1,643.88 | Oklahoma | 19 | 1,977 | 5,271,339 | 643.35 | Nova Scotia | 11 | 332 | 1,229,149 | 148.38 |
| Iowa | 43 | 7,516 | 24,286,940 | 3,072.57 | Oregon | 38 | 2,769 | 15,675,898 | 1,992.84 | Nunavut | 1 | 6 | 1,230 | 0.23 |
| Kansas | 35 | 4,444 | 19,862,293 | 2,504.08 | Pennsylvania | 74 | 6,984 | 32,571,217 | 3,970.32 | Ontario | 112 | 8,404 | 45,970,092 | 5,774.59 |
| Kentucky | 26 | 4,965 | 17,834,978 | 2,092.75 | Rhode Island | 7 | 822 | 3,195,777 | 530.94 | Prince Edward Island | 6 | 753 | 777,772 | 95.87 |
| Louisiana | 16 | 2,018 | 10,500,407 | 1,221.96 | S. Carolina | 24 | 3,894 | 8,716,589 | 1,115.2 | Quebec | 6 | 166 | 486,265 | 60.29 |
| Maine | 12 | 819 | 5,879,165 | 797.01 | S. Dakota | 12 | 1,819 | 18,619,258 | 2,172.97 | Saskatchewan | 10 | 663 | 895,143 | 103.12 |
| Maryland | 32 | 7,373 | 34,009,832 | 4,100.84 | Tennessee | 33 | 7,194 | 43,286,858 | 5,127.52 | Yukon | 7 | 159 | 257,171 | 30.48 |
| Massachusetts | 44 | 17,596 | 11,517,230 | 14,682.19 | | | | | | | | | | |
]

---
### CoNASE channel locations

<div class="container">
<iframe src="https://cc.oulu.fi/~scoats/conase_channel_sizes.html" style="width: 100%; height: 450px; max-width: 100%;" sandbox="allow-same-origin allow-scripts" scrolling="yes" seamless="seamless" frameborder="0" align="middle"></iframe>
</div>

---
exclude: true
### CoBISE

Country | Channels | Videos | Tokens | Length (h)
--------------------|------|--------|-------------|-----------
England | 324 | 23,657 | 72,879,173 | 8,518.39
Northern Ireland | 10 | 1,898 | 6,508,505 | 774.17
Republic of Ireland | 26 | 2,525 | 6,264,276 | 680.81
Scotland | 75 | 8,135 | 17,111,396 | 1,845.35
Wales | 18 | 2,465 | 8,800,264 | 982.66
 | | | |
Total | 453 | 38,680 | 111,563,614 | 12,801.38

---
exclude: true
### CoBISE channel locations

<div class="container">
<iframe src="https://cc.oulu.fi/~scoats/cobise_channel_sizes.html" style="width: 100%; height: 450px; max-width: 100%;" sandbox="allow-same-origin allow-scripts" scrolling="yes" seamless="seamless" frameborder="0" align="middle"></iframe>
</div>

---
exclude: true
### CoANZSE

.small[
Territory | Channels | Videos | Words | Length (h)
----------------------------|-----|--------|-------------|------
Australian Capital Territory | 8 | 650 | 915,542 | 111.79
New South Wales | 114 | 9,741 | 27,580,773 | 3,428.87
Northern Territory | 11 | 289 | 315,300 | 48.72
New Zealand | 74 | 18,029 | 84,058,661 | 10,175.80
Queensland | 58 | 7,356 | 19,988,051 | 2,642.75
South Australia | 50 | 3,537 | 13,856,275 | 1,716.72
Tasmania | 21 | 1,260 | 5,086,867 | 636.99
Victoria | 78 | 12,138 | 35,304,943 | 4,205.40
Western Australia | 68 | 3,815 | 8,422,484 | 1,063.78
 | | | |
Total | 482 | 56,815 | 195,528,896 | 24,030.82
]

---
exclude: true
### CoANZSE channel locations

.small[Circle size corresponds to channel size in number of words]

<div class="container">
<iframe src="https://cc.oulu.fi/~scoats/anz_dot2.html" style="width: 100%; height: 450px; max-width: 100%;" sandbox="allow-same-origin allow-scripts" scrolling="yes" seamless="seamless" frameborder="0" align="middle"></iframe>
</div>

---
exclude: true
### Corpus use cases: Syntax/grammar/pragmatics

- Regional variation in syntax, mood and modality
- Lexical items
- Contractions
- Hortatives/commands/interjections
- Pragmatics: Turn-taking, politeness markers
- Multidimensional analysis à la Biber
- Typological comparison at country/state/regional level

---
### Example analysis: Double modals

- Rare non-standard syntactic feature <span class="small">(Montgomery & Nagle 1994; Coats 2022a)</span>
- *I might could help you with this*
- Occurs only in the American Southeast and in Scotland/Northern England/Northern Ireland?
- Most studies based on non-naturalistic data with limited geographical scope <span class="small">(data from linguistic atlas interviews and surveys, administered mostly in the American Southeast and the North of Britain)</span>
- More widely used in North America and the British Isles than previously thought <span class="small">(Coats 2022a, Coats 2023b)</span>
- Little studied in Australian and New Zealand speech

---
exclude: true
### Script: Generating a table for manual inspection of double modals

- Base modals *will, would, can, could, might, may, must, should, shall, used to, 'll, ought to, oughta*
- Script to generate regexes of two-modal combinations

```python
import re
import pandas as pd

hits = []
for x in modals:  # x is a pair of modal forms, e.g. ("might", "could")
    # match two adjacent word_POS_time tokens, e.g. "might_MD_12.3 could_MD_12.5"
    pat1 = re.compile("(" + x[0] + "_\\w+_\\S+\\s+" + x[1] + "_\\w+_\\S+\\s)", re.IGNORECASE)
    for i, y in coanzse_df.iterrows():
        finds = pat1.findall(y["text_pos"])
        for z in finds:
            # strip the PoS tags and timings to recover the word sequence
            seq = z.split()[0].split("_")[0].strip() + " " + z.split()[1].split("_")[0].strip()
            time = z.split()[0].split("_")[-1]  # timestamp of the first modal
            hits.append((y["country"], y["channel_title"], seq,
                         "https://youtu.be/" + y["video_id"] + "?t=" + str(round(float(time) - 3))))
pd.DataFrame(hits)
```

- The script creates a URL for each search hit at a time 3 seconds before the targeted utterance
- In the resulting data frame, each utterance can be annotated after examining the targeted video sequence
- Filter out non-double-modals (clause overlap, speaker self-repairs, ASR errors)

---
exclude: true
class: small
### Excerpt from generated table
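---
### Code sketch: Retrieving ASR captions with yt-dlp

A minimal sketch of the transcript-retrieval step, using yt-dlp's Python API (the option names are yt-dlp's own; the output template is illustrative, and the video ID is the example video shown earlier):

```python
from yt_dlp import YoutubeDL

# Fetch only YouTube's auto-generated (ASR) English captions as WebVTT,
# skipping the audio/video download itself.
opts = {
    "skip_download": True,        # captions only, no media
    "writeautomaticsub": True,    # the ASR track, not user-uploaded captions
    "subtitleslangs": ["en"],
    "subtitlesformat": "vtt",
    "outtmpl": "%(id)s.%(ext)s",  # yields e.g. WY9RPeXA3pw.en.vtt
}

with YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=WY9RPeXA3pw"])
```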
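---
### Code sketch: Geocoding a channel address

A sketch of the geocoding step, assuming the `googlemaps` client library as the interface to Google's geocoding service (the slides name only the service, not the client; the API key is a placeholder, and the query string follows the corpus's council name + address + country pattern):

```python
import googlemaps

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder key

# council name + address + country, as in the data-format example
res = gmaps.geocode("Wollondilly Shire Council, Menangle St, Picton, Australia")
if res:
    loc = res[0]["geometry"]["location"]
    print(res[0]["formatted_address"], (loc["lat"], loc["lng"]))
```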
---
exclude: true
### Finding features

- Regular-expression search and manual annotation approach
- Double modals can be found in the US North and West and in Canada; in Scotland, N. Ireland, and N. England, but also in the English Midlands and South and in Wales <span class="small">(Coats in review)</span>
- Also in Australia and (especially) New Zealand!

---
exclude: true
### Training a classifier on the basis of common word types

- Simple machine-learning classifiers using SVM, logistic regression, or other algorithms can distinguish between Australian and NZ transcripts on the basis of the 500 most common words in CoANZSE

<br><br>
<style type="text/css">
.tg {border-collapse:collapse;border-color:#aaa;border-spacing:0;}
.tg td{background-color:#fff;border-color:#aaa;border-style:solid;border-width:0px;color:#333; font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{background-color:#f38630;border-color:#aaa;border-style:solid;border-width:0px;color:#fff; font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-0lax{text-align:left;vertical-align:top}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
<tr> <th class="tg-0lax"></th> <th class="tg-0pky">Precision</th> <th class="tg-0pky">Recall</th> <th class="tg-0pky">F1</th> <th class="tg-0pky">Support</th> <th class="tg-0pky">Accuracy</th> </tr>
</thead>
<tbody>
<tr> <td class="tg-0lax">Australia</td> <td class="tg-baqh">0.82</td> <td class="tg-baqh">0.90</td> <td class="tg-baqh">0.86</td> <td class="tg-baqh">1,359</td> <td rowspan="2" align="center">0.80</td> </tr>
<tr> <td class="tg-0lax">New Zealand</td> <td class="tg-baqh">0.74</td> <td class="tg-baqh">0.59</td> <td class="tg-baqh">0.66</td> <td class="tg-baqh">641</td> </tr>
</tbody>
</table>

---
### Pipeline for acoustic analysis (work in progress)

- Regular expressions to target specific words/phrases in the corpora
- Extract audio segments containing the targeted item(s) from the YT stream
- Feed audio and transcript excerpt to a forced aligner
- Extract the desired sounds
- Measure acoustic phenomena of interest (formants, voice onset time, pitch, etc.)

---
### Pipeline for acoustic analysis

![:scale 50%](./Github_phonetics_pipeline_screenshot.png)

- A Jupyter notebook that collects transcripts and audio from YouTube, aligns the transcripts, and extracts vowel formants
- Click your way through the process in a Colab environment
- Can be used for any language that has ASR transcripts
- With a few script modifications, also works for manual transcripts (e.g., for Finnish)

https://github.com/stcoats/phonetics_pipeline

---
### Example: Excerpt from a council meeting in Gallatin, Tennessee (https://www.youtube.com/watch?v=yzjGnz_Rs7I)

<iframe width="500" height="400" src="https://cc.oulu.fi/~scoats/yzjGnz_Rs7I_have_a_great_day_on_that.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" scrolling="no" sandbox allowfullscreen></iframe>

---
### Pipeline for acoustic analysis: Vowel formants

For each transcript/video in the collection:

- Regular expressions to search for words with [eɪ]
- yt-dlp to download audio segments in a window around the target word
- Feed the segments (audio and corresponding transcript segment) to the Montreal Forced Aligner (McAuliffe et al. 2017); output is Praat TextGrids (Boersma & Weenink 2023)
- Select vowel(s) of interest using TextGrid timings and Parselmouth (Python interface to Praat functions; Jadoul et al. 2018)

<pre style="font-size:12px">have a great d**ay** on that [eɪ]</pre>

<audio controls preload="none">
<source src="https://cc.oulu.fi/~scoats/yzjGnz_Rs7I_have%20a%20great%20day%20on%20that.wav" type="audio/wav">
</audio>
<audio controls preload="none">
<source src="https://cc.oulu.fi/~scoats/yzjGnz_Rs7I_have_a_great_day_on_that_vw.wav" type="audio/wav">
</audio>

<img src="https://cc.oulu.fi/~scoats/yzjGnz_Rs7I_have%20a%20great%20day%20on%20that_TextGrid_praat.png" width="600px" class="center">

---
### Formants: F1/F2 values for a single utterance

.pull-left[
<iframe src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/TN/example_Gallatin.html" height="500px" width="500px" class="center"></iframe>
]
.pull-right[
- 9 measurements per segment in order to capture the trajectory of the vowel sound
- Retain segments for which at least 5 measurements were possible
]

---
### Formants: F1/F2 values for a single location (filtered)

.pull-left[
<iframe src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/TN/example_Gallatin_all.html" height="500px" width="500px" class="center"></iframe>
]
.pull-right[
- 9 measurements per segment in order to capture the trajectory of the vowel sound
- Retain segments for which at least 5 measurements were possible
- This visualization filters out segments that do not have the typical shape of the [eɪ] diphthong
]

---
### Formants: Values for a single location

.pull-left[
<img src="https://cc.oulu.fi/~scoats/Hendersonville_TN_v2.png" width="590px" class="center">
]
.pull-right[
- Circle locations represent the average value for that duration quantile (subscript)
- Circle size is proportional to the number of measurements for that quantile (formant values are more likely to be measurable in the middle of the vowel than at its beginning/end)
]

---
## Average F1 and F2 values for the nuclei of the diphthongs /eɪ/, /aɪ/, /oʊ/, and /aʊ/, spatial autocorrelation <span class="small">(12,931,728 vowel tokens)</span>

<iframe width="800" height="500" src="https://a3s.fi/swift/v1/AUTH_319b50570e56446f94b58088b66fcdb2/test_sounds1/conase_formants.html" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" scrolling="no" allowfullscreen></iframe>

<span style="float: right; width:20%;">- Locations with at least 100 tokens<br>- Getis-Ord Gi* values based on a 20-nearest-neighbor binary spatial weights matrix</span>

---
### Comparison <small>(Grieve, Speelman & Geeraerts 2013, p. 37)</small>

.pull-left[
![](./Grieve_et_al_2013_eY.png)
]
.pull-right[
- Grieve et al. (2013) used a similar technique to analyze formant measurements from the *Atlas of North American English* (Labov et al. 2006)
- The ANAE contains approximately 134,000 vowel measurements in total
]

---
### Multimodality

- Use regular expressions to search the corpus
- Extract video as well as audio
- Manually or automatically analyze:
  - Gesture
  - Posture/body/head inclination
  - Facial expression
  - Handling of objects
  - Touching
  - (etc.)
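---
### Code sketch: Extracting audio around a target word

A sketch of the segment-download step, assuming yt-dlp's `download_ranges` option; in the corpus, the timestamp would come from the word_POS_time annotation, and the window size and output template here are illustrative:

```python
from yt_dlp import YoutubeDL
from yt_dlp.utils import download_range_func

# 2-second window around a target word at t = 74.3 s (illustrative values)
start, end = 74.3 - 1.0, 74.3 + 1.0

opts = {
    "format": "bestaudio",
    # download only the targeted time range, re-cutting at keyframes
    "download_ranges": download_range_func(None, [(start, end)]),
    "force_keyframes_at_cuts": True,
    "outtmpl": "%(id)s_%(section_start)s.%(ext)s",
    # convert the snippet to WAV for the aligner and Praat
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "wav"}],
}

with YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=yzjGnz_Rs7I"])
```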
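---
### Code sketch: Formant measurement with Parselmouth

A sketch of the measurement step, assuming the vowel's start and end times have already been read off the MFA TextGrid (the boundary values and file name are illustrative). As on the preceding slides, nine equally spaced points are measured per segment and the segment is kept only if at least five succeed:

```python
import numpy as np
import parselmouth

snd = parselmouth.Sound("yzjGnz_Rs7I_have_a_great_day_on_that.wav")
vowel_start, vowel_end = 1.52, 1.71   # [eɪ] boundaries from the TextGrid (illustrative)

# Burg-method formant tracks
formant = snd.to_formant_burg(time_step=0.005, maximum_formant=5500)

# nine equally spaced measurement points across the vowel
points = np.linspace(vowel_start, vowel_end, 9)
track = [(t,
          formant.get_value_at_time(1, t),   # F1 (Hz)
          formant.get_value_at_time(2, t))   # F2 (Hz)
         for t in points]

# retain the segment only if at least five points yielded both formants
valid = [p for p in track if not (np.isnan(p[1]) or np.isnan(p[2]))]
if len(valid) >= 5:
    for t, f1, f2 in valid:
        print(f"{t:.3f} s  F1 = {f1:.0f} Hz  F2 = {f2:.0f} Hz")
```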
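---
### Code sketch: Getis-Ord Gi* for per-location formant means

A sketch of the spatial-autocorrelation step with `libpysal` and `esda`, matching the settings on the map slide (locations with at least 100 tokens, a binary 20-nearest-neighbor weights matrix); the input file and column names are hypothetical:

```python
import geopandas as gpd
from libpysal.weights import KNN
from esda.getisord import G_Local

# one row per location: mean F1 of the /eɪ/ nucleus, token count, point geometry
gdf = gpd.read_file("conase_locations.geojson")          # hypothetical file
gdf = gdf[gdf["n_tokens"] >= 100].reset_index(drop=True) # >= 100 vowel tokens

# binary spatial weights from the 20 nearest neighbors
w = KNN.from_dataframe(gdf, k=20)

# Gi* z-scores: positive = hot spot (high F1), negative = cold spot
gi_star = G_Local(gdf["mean_F1_eY"].values, w, transform="B", star=True)
gdf["gi_z"] = gi_star.Zs
gdf["gi_p"] = gi_star.p_sim   # pseudo p-values from permutations
```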
---
### 'Heaps of' in Australian English

<iframe width="800" height="600" src="https://cc.oulu.fi/~scoats/heaps_of_CoANZSE_excerpt.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" scrolling="no" sandbox allowfullscreen></iframe>

---
exclude: true
### Extracted *today* tokens

<iframe width="800" height="500" src="https://cc.oulu.fi/~scoats/coanzse_today.html" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" scrolling="no" allowfullscreen></iframe>

---
exclude: true
### Average <span style="font-family:serif;font-weight:bolder">eɪ</span> diphthong

![:scale 70%](./eY_coanzse.png)

---
### A few caveats

- Videos of local government are not representative of speech in general
- ASR errors (mean WER after filtering ~14%); transcript quality depends on audio quality as well as on dialect features <span class="small">(Tatman 2017; Meyer et al. 2020; Markl & Lai 2021)</span>
- Low-frequency phenomena: manually inspect corpus hits
- High-frequency phenomena: the signal of correct transcriptions will be stronger <span class="small">(Agarwal et al. 2007)</span> → classifiers
- Need to analyze error rates of forced alignment

---
### Summary and outlook

- Large corpora of ASR transcripts from YouTube channels of local governments
- Corpus studies of regional variation in spoken language: dialectology, pragmatics, phonetics, gestures
- Large-scale studies of phonetic variation

---
# Thank you!

---
### References

.small[
.hangingindent[
Agarwal, S., Godbole, S., Punjani, D. & Roy, S. (2007). [How much noise is too much: A study in automatic text classification](https://doi.org/10.1109/ICDM.2007.21). In *Seventh IEEE International Conference on Data Mining (ICDM 2007)*, 3–12.
Boersma, P. & Weenink, D. (2023). Praat: Doing phonetics by computer. Version 6.3.09. http://www.praat.org
Brezina, V., Love, R. & Aijmer, K. (2018). Corpus linguistics and sociolinguistics: Introducing the Spoken BNC2014. In V. Brezina, R. Love & K. Aijmer (Eds.), *Corpus approaches to contemporary British speech: Sociolinguistic studies of the Spoken BNC2014*, 3–9. Routledge.
Coats, S. (2023b). [Double modals in contemporary British and Irish speech](https://doi.org/10.1017/S1360674323000126). *English Language and Linguistics*.
Coats, S. (2023c). [Dialect corpora from YouTube](https://doi.org/10.1515/9783111017433-005). In B. Busse, N. Dumrukcic & I. Kleiber (Eds.), *Language and linguistics in a complex world*, 79–102. Walter de Gruyter.
Coats, S. (2022a). [Naturalistic double modals in North America](https://doi.org/10.1215/00031283-9766889). *American Speech*.
Coats, S. (2022b). [The Corpus of British Isles Spoken English (CoBISE): A new resource of contemporary British and Irish speech](http://ceur-ws.org/Vol-3232/paper15.pdf). In K. Berglund, M. La Mela & I. Zwart (Eds.), *Proceedings of the 6th Digital Humanities in the Nordic and Baltic Countries Conference, Uppsala, Sweden, March 15–18, 2022*, 187–194. CEUR.
Coto-Solano, R., Stanford, J. N. & Reddy, S. K. (2021). [Advances in completely automated vowel analysis for sociophonetics: Using end-to-end speech recognition systems with DARLA](https://doi.org/10.3389/frai.2021.662097). *Frontiers in Artificial Intelligence*, Section Language and Computation.
Davies, M. (2008–). [The Corpus of Contemporary American English (COCA)](https://www.english-corpora.org/coca/).
Du Bois, J. W., Chafe, W. L., Meyer, C., Thompson, S. A., Englebretson, R. & Martey, N. (2000–2005). *Santa Barbara corpus of spoken American English*, Parts 1–4. Linguistic Data Consortium.
Grieve, J., Montgomery, C., Nini, A., Murakami, A. & Guo, D. (2019). [Mapping lexical dialect variation in British English using Twitter](https://doi.org/10.3389/frai.2019.00011). *Frontiers in Artificial Intelligence* 2.
]]

---
### References II

.small[
.hangingindent[
Grieve, J., Speelman, D. & Geeraerts, D. (2013). [A multivariate spatial analysis of vowel formants in American English](https://doi.org/10.1017/jlg.2013.3). *Journal of Linguistic Geography* 1, 31–51.
Honnibal, M. et al. (2019). [Explosion/spaCy v2.1.7: Improved evaluation, better language factories and bug fixes](https://doi.org/10.5281/zenodo.3358113).
Jadoul, Y., Thompson, B. & de Boer, B. (2018). [Introducing Parselmouth: A Python interface to Praat](https://doi.org/10.1016/j.wocn.2018.07.001). *Journal of Phonetics* 71, 1–15.
Labov, W., Ash, S. & Boberg, C. (2006). *The Atlas of North American English*. Mouton de Gruyter.
Love, R., Dembry, C., Hardie, A., Brezina, V. & McEnery, T. (2017). The Spoken BNC2014: Designing and building a spoken corpus of everyday conversations. In T. McEnery, R. Love & V. Brezina (Eds.), *Compiling and analysing the Spoken British National Corpus 2014* [= *International Journal of Corpus Linguistics* 22(3)], 319–344.
Markl, N. & Lai, C. (2021). [Context-sensitive evaluation of automatic speech recognition: Considering user experience & language variation](https://aclanthology.org/2021.hcinlp-1.6). In *Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing*, 34–40. Association for Computational Linguistics.
McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M. & Sonderegger, M. (2017). Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In *Proceedings of the 18th Conference of the International Speech Communication Association*.
Meyer, J., Rauchenstein, L., Eisenberg, J. D. & Howell, N. (2020). [Artie Bias Corpus: An open dataset for detecting demographic bias in speech applications](https://aclanthology.org/2020.lrec-1.796). In *Proceedings of the 12th Language Resources and Evaluation Conference*, 6462–6468. European Language Resources Association.
Montgomery, M. B. & Nagle, S. J. (1994). Double modals in Scotland and the Southern United States: Trans-Atlantic inheritance or independent development? *Folia Linguistica Historica* 14, 91–108.
Nerbonne, J. (2009). Data-driven dialectology. *Language and Linguistics Compass* 3, 175–198.
Szmrecsanyi, B. (2013). *Grammatical variation in British English dialects: A study in corpus-based dialectometry*. Cambridge University Press.
Szmrecsanyi, B. (2011). Corpus-based dialectometry: A methodological sketch. *Corpora* 6, 45–76.
Tatman, R. (2017). [Gender and dialect bias in YouTube's automatic captions](https://aclanthology.org/W17-1606). In *Proceedings of the First ACL Workshop on Ethics in Natural Language Processing*, 53–59. Association for Computational Linguistics.
]]