Double modals in YouTube videos from North America and the British Isles

class: inverse, center, middle
background-image: url(https://cc.oulu.fi/~scoats/oululogoRedTransparent.png);
background-repeat: no-repeat;
background-size: 80px 57px;
background-position:right top;
exclude: true

---

<br>

## Double modals in YouTube videos from North America and the British Isles

Steven Coats<br>
English Philology, University of Oulu, Finland<br>
<a href="mailto:steven.coats@oulu.fi">steven.coats@oulu.fi</a><br>

Corpus-based and Computational Approaches to Variation Workshop, Helsinki <br> 
April 27th, 2022<br>

---

<div class="my-footer"><span>Steven Coats&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;Double modals | CorCoDial Workshop, Helsinki</span></div>

---

## Outline

1. CoNASE and CoBISE

2. YouTube ASR captions files, data collection and geocoding

3. Methods: Frequency analysis (frequent features), manual inspection/annotation (rare features)

4. Double modals in North America and in the British Isles

5. Caveats, summary

---

### Introduction

- Renaissance in corpus-based study of English varieties <span class="small">(Nerbonne 2009; Szmrecsanyi 2011, 2013; Grieve et al. 2019)</span>
- Available corpora of transcribed spoken English <span class="small">(Anderwald & Wagner 2007; Corbett 2014; Corrigan et al. 2012; Du Bois et al. 2000-2005; Kallen & Kirk 2007)</span> are small or lack a broad geographic focus; size may make it difficult to find some features
- [CoNASE](https://cc.oulu.fi/~scoats/CoNASE.html): 1.25b token corpus of 301,846 word-timed, part-of-speech-tagged Automatic Speech Recognition (ASR) transcripts <span class="small">(Coats forthcoming a)</span>
- [CoBISE](https://cc.oulu.fi/~scoats/CoBISE.html): 112m token corpus of 38,680 ASR transcripts <span class="small">(Coats forthcoming a)</span>
- Correspond to more than 166,000 hours of video from more than 3,000 YouTube channels of local councils and other government entities in locations in the US, Canada, England, Scotland, Wales, Northern Ireland, and the Republic of Ireland
- Freely available for research use; download from the Harvard Dataverse [here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/X8QJJV) and [here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UGIIWD)

---

### YouTube captions files

- Videos can have multiple captions files: user-uploaded captions, auto-generated captions created using automatic speech recognition (ASR), or both, or neither

- User-uploaded captions can be manually created or generated automatically by 3rd-party ASR software

- Auto-generated captions are generated by YT's speech-to-text service

- CoNASE and CoBISE: target YT ASR captions

---

### WebVTT file

![](WY9RPeXA3pw_vtt.png)
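
A word-timed captions file of this kind can be turned into (word, time) pairs with a few lines of Python. The sketch below is only an illustration, not the pipeline used for the corpora: it assumes that each auto-generated cue line starts with an untagged first word and continues with `<timestamp><c> word</c>` groups, and the names `to_seconds` and `word_times` are made up for this example.

```python
import re

# Cue header, e.g. "00:00:01.000 --> 00:00:03.500 align:start position:0%"
CUE = re.compile(r"(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})")
# Inline word timing, e.g. "<00:00:01.400><c> might</c>" (assumed layout)
WORD = re.compile(r"<(\d{2}:\d{2}:\d{2}\.\d{3})><c>\s*([^<]+?)\s*</c>")


def to_seconds(stamp):
    """Convert an HH:MM:SS.mmm timestamp to seconds."""
    h, m, s = stamp.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)


def word_times(vtt_text):
    """Return a list of (word, start time in seconds) pairs."""
    words = []
    cue_start = None
    for line in vtt_text.splitlines():
        cue = CUE.match(line)
        if cue:
            cue_start = to_seconds(cue.group(1))
            continue
        if "<c>" not in line or cue_start is None:
            continue  # skip the header, blank lines, and untimed repeat lines
        first = line.split("<", 1)[0].strip()
        if first:
            words.append((first, cue_start))  # first word starts with the cue
        for stamp, word in WORD.findall(line):
            words.append((word, to_seconds(stamp)))
    return words


sample = (
    "WEBVTT\nKind: captions\nLanguage: en\n\n"
    "00:00:01.000 --> 00:00:03.500 align:start position:0%\n \n"
    "we<00:00:01.400><c> might</c><00:00:01.800><c> could</c>\n"
)
print(word_times(sample))
```

Lines without `<c>` tags are skipped because, in the assumed layout, they only repeat text that was already timed in an earlier cue.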

---

### Focus on regional and local council channels

Many recordings of meetings of elected councillors: advantages in terms of representativeness and comparability

- Speaker place of residence (cf. videos collected based on place-name search alone)

- Topical contents and communicative contexts comparable

---

### Data collection and processing

- Identification of relevant channels (YouTube API, searches of public-facing server, lists of councils with YT channels)
- Inspection of returned channels to remove false positives
- Retrieval of ASR transcripts using [YouTube-DL](https://github.com/ytdl-org/youtube-dl)
- VPN or [Tor](https://www.torproject.org/) to circumvent IP blocking
- Geocoding: String containing council name + channel name + country location to Google's geocoding service
- PoS tagging with SpaCy <span class="small">(Honnibal et al. 2019)</span>
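
The retrieval step can be sketched as a small helper that builds a youtube-dl call for the auto-generated captions of one video. This is an assumption about how such a call might look, not the script used for the corpora; the function name `caption_command` and the output template are chosen for illustration, and the optional proxy address assumes a local Tor client on its default SOCKS port.

```python
def caption_command(video_id, lang="en", proxy=None):
    """Build a youtube-dl command that fetches only the ASR captions."""
    cmd = [
        "youtube-dl",
        "--skip-download",       # captions only, no video file
        "--write-auto-sub",      # request the automatically generated captions
        "--sub-lang", lang,
        "--sub-format", "vtt",
        "-o", "%(id)s.%(ext)s",  # name the output after the video ID
    ]
    if proxy:
        cmd += ["--proxy", proxy]  # e.g. "socks5://127.0.0.1:9050" for Tor
    cmd.append(f"https://www.youtube.com/watch?v={video_id}")
    return cmd


# Print the command rather than running it; pass the list to
# subprocess.run(cmd, check=True) to actually download.
print(" ".join(caption_command("VogFB5X_1UM")))
```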

---

### Transcript accuracy

- ASR transcripts contain errors (WER ~22%)
- High-frequency phenomena: the signal of correct transcriptions will be stronger than the noise of ASR errors <span class="small">(Agarwal et al. 2007)</span> → suitable for classifiers
- Low-frequency phenomena: manually inspect corpus hits

![:scale 60%](asr_wordfreqs.png)

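
For reference, word error rate (WER) is the word-level edit distance between a manual reference transcript and the ASR output, divided by the number of reference words. The following sketch shows the standard calculation; the function name and the example sentences are illustrative and are not taken from the corpus evaluation.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j]: edit distance between the reference words seen so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# An ASR system that drops the second modal: one deletion in five words
print(word_error_rate("we might could do that", "we might do that"))
```

The example also illustrates why rare non-standard features are vulnerable: a single dropped word removes the double modal from the transcript.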
---

### Corpus use cases and size

- Regional language (dialectology): e.g. syntax, mood and modality
- Pragmatics: Turn-taking, politeness markers
- Script pipeline: Use corpus to identify areas/speakers/words/phonemes of interest, get videos, convert to audio (FFmpeg), automated formant extraction/vowel quality analysis on a large scale
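
The audio-conversion step of such a pipeline might look like the sketch below, which builds an FFmpeg call that cuts a short mono clip around a corpus hit. The function name `clip_command`, the clip length, and the 16 kHz sampling rate are assumptions for illustration; formant extraction itself is not shown.

```python
def clip_command(video_path, start, duration=6.0, out_path="clip.wav"):
    """Build an FFmpeg command that extracts a short audio clip."""
    return [
        "ffmpeg",
        "-ss", f"{max(start, 0):.2f}",  # seek to the clip start (seconds)
        "-t", f"{duration:.2f}",        # read only this many seconds
        "-i", video_path,
        "-vn",                          # drop the video stream
        "-ac", "1",                     # mono
        "-ar", "16000",                 # 16 kHz sampling rate
        out_path,
    ]


# Print the command; pass the list to subprocess.run(cmd, check=True) to run it.
print(" ".join(clip_command("VogFB5X_1UM.mp4", 490.0)))
```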

**CoNASE**

Country | Channels | Videos  | Tokens        | Length (h)
--------|----------|---------|---------------|-----------
US      | 2,189    | 270,931 | 1,149,030,824 | 141,455.11
Canada  | 383      | 30,916  | 103,035,369   | 12,586.77

**CoANZSE** (coming soon)

Country       | Channels|Videos|Tokens      |Length (h) 
--------------|---------|------|-----------|-----------
Australia     |408    |38,786 |111,470,235 | 13,885.1
New Zealand   | 74     |18,029 |84,058,661  |1,083.75


**CoBISE**

Country             | Channels | Videos | Tokens     | Length (h)
--------------------|----------|--------|------------|-----------
England             | 324      | 23,657 | 72,879,173 | 8,518.39
Northern Ireland    | 10       | 1,898  | 6,508,505  | 774.17
Republic of Ireland | 26       | 2,525  | 6,264,276  | 680.81
Scotland            | 75       | 8,135  | 17,111,396 | 1,845.35
Wales               | 18       | 2,465  | 8,800,264  | 982.66

**CoGS** (coming soon)

Country       | Channels|Videos|Tokens      |Length (h) 
--------------|---------|------|-----------|-----------
Germany     |1,313    |39,495 |50,554,070 | 7,223.44

---

### Example analysis: Double modals

- Rare, non-standard syntactic feature in the British Isles, North America, and elsewhere <span class="small">(Montgomery & Nagle 1994; Coats 2022)</span>
- **Will you can help me with this?**
- Occurs only in the American Southeast and in Scotland/Northern Ireland/Northern England?
- Most studies based on non-naturalistic data with limited geographical scope <span class="small">(*LAMSAS*, *LAGS*, Murray 1873; Wright 1898-1905; Anderwald & Wagner 2007; Kallen & Kirk 2007; Smith et al. 2019)</span>

---

### Script: Generating a table for manual inspection of double modals

- Base modals *will, would, can, could, might, may, must, should, shall, used to, 'll, ought to, oughta*
- Script to generate regex of two-tier combinations, plus forms with intervening pronouns, auxiliary verbs, negations

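```python
import re
from itertools import permutations
import pandas as pd

# Single-token base modals; multi-word forms (used to, ought to) need their own patterns.
modals = ["will", "would", "can", "could", "might", "may", "must", "should", "shall", "'ll", "oughta"]
tok = r"\S+_\S+_\S+"  # one token in word_TAG_time format

hits = []
for m1, m2 in permutations(modals, 2):
    # Three tokens of context on either side of the two adjacent modals
    pat1 = re.compile(rf"((?:{tok}\s){{3}}{m1}_\w+_\S+\s{m2}n?_\w+_\S+(?:\s{tok}){{3}})", re.IGNORECASE)
    for i, x in cobise_df.iterrows():
        found = pat1.search(x["text_pos"])
        if found:  # keep the first hit per transcript
            tokens = found.group(1).split()
            seq = " ".join(t.split("_")[0] for t in tokens)
            time = tokens[0].split("_")[-1]  # timestamp of the first context token
            link = "https://youtu.be/" + x["video_id"] + "?t=" + str(max(round(float(time) - 3), 0))
            hits.append((x["country"], x["channel_title"], seq, link))
pd.DataFrame(hits, columns=["country", "channel", "regex_hit", "link"])
```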
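
The forms with intervening material mentioned above could be generated along the following lines. This is a sketch under stated assumptions rather than the script behind the table: it assumes tokens in `word_TAG_time` format, allows up to two intervening tokens, and uses short illustrative word lists (`PRONOUNS`, `NEGATIONS`, `AUXILIARIES`) that are not the lists from the original script.

```python
import re
from itertools import permutations

MODALS = ["will", "would", "can", "could", "might", "may", "must",
          "should", "shall", "'ll"]
# Illustrative lists of material allowed between the two modals
PRONOUNS = ["i", "you", "he", "she", "it", "we", "they"]
NEGATIONS = ["not", "n't", "'t"]
AUXILIARIES = ["have"]


def tagged(words):
    """Regex for one word_TAG_time token whose word is one of `words`."""
    alternatives = "|".join(re.escape(w) for w in words)
    return rf"(?:{alternatives})_\S+_\S+"


def double_modal_pattern(m1, m2, max_gap=2):
    """Pattern for m1 followed by m2 with up to `max_gap` tokens in between."""
    filler = tagged(PRONOUNS + NEGATIONS + AUXILIARIES)
    gap = rf"(?:\s{filler}){{0,{max_gap}}}"
    return re.compile(
        rf"(?<!\S){tagged([m1])}{gap}\s{tagged([m2])}(?!\S)",
        re.IGNORECASE,
    )


# One compiled pattern per ordered pair of distinct modals
patterns = {pair: double_modal_pattern(*pair) for pair in permutations(MODALS, 2)}

text = "will_MD_1.0 you_PRP_1.2 can_MD_1.4 help_VB_1.6 me_PRP_1.8"
print(bool(patterns[("will", "can")].search(text)))
```

Allowing intervening tokens raises recall for questions and negated forms but also admits many false positives, which is why the hits still need manual inspection.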

---

### Excerpt from generated table

<div id="htmlwidget-d66bae25821427a69fd5" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-d66bae25821427a69fd5">{"x":{"filter":"none","vertical":false,"data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49","50","51","52","53","54","55","56","57","58","59","60","61","62","63","64","65","66","67","68","69","70","71","72","73","74","75","76","77","78","79","80","81","82","83","84","85","86","87","88","89","90","91","92","93","94","95","96","97","98","99","100","101"],["England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England",""],["Fylde Council","Fylde Council","Fylde Council","Fylde Council","Fylde Council","Fylde Council","Fylde Council","Fylde Council","Fylde Council","Fylde Council","Fylde Council","Fylde Council","Fylde Council","Fylde Council","Fylde Council","Fylde Council","Fylde Council","Fylde Council","Gloucester Arts Council","Gloucester Arts Council","Gloucester Arts 
Council","Gloucester Arts Council","Gloucester Arts Council","Gloucester Arts Council","Gloucester Arts Council","Gloucester Arts Council","Gloucestershire County Council","Gloucestershire County Council","Gloucestershire County Council","Halton Borough Council","Hammersmith & Fulham Council","Hammersmith & Fulham Council","Hammersmith & Fulham Council","Hammersmith & Fulham Council","Hammersmith & Fulham Council","Hammersmith & Fulham Council","Hammersmith & Fulham Council","Hammersmith & Fulham Council","Hammersmith & Fulham Council","Hammersmith & Fulham Council","Hammersmith & Fulham Council","Harlow Council","Harlow Council","Harlow Council","Harlow Council","Harlow Council","Harlow Council","Harlow Council","Harlow Council","IW Youth Council","IWCouncil","IWCouncil","IWCouncil","IWCouncil","IWCouncil","IWCouncil","IWCouncil","IWCouncil","IWCouncil","Islington Council","Kensington & Chelsea Social Council","Kent County Council","Kent County Council","Kent County Council","Kingston Council","Kingston Council","Kingston Council","Kirklees Council","Kirklees Council","Leeds City Council","Leicester City Council","Leicester City Council","LincolnshireCC","LincolnshireCC","LincolnshireCC","LincolnshireCC","LincolnshireCC","LincolnshireCC","LincolnshireCC","LincolnshireCC","LincolnshireCC","LincolnshireCC","LincolnshireCC","Maidstone Council","Maidstone Council","Maidstone Council","Maidstone Council","Maidstone Council","Maidstone Council","Maidstone Council","Maidstone Council","Maidstone Council","Maidstone Council","Maidstone Council","Maidstone Council","Maidstone Council","Maidstone Council","Maidstone Council","Maidstone Council","Maidstone Council",""],["would will","may would","could can","may would","should can","'ll can","would can","should can","can will","could would","could can","would can","might may","will must","will would","would must","can must","can would","can will","would could","shall will","will may","'ll must","might will","would 
might","should will","will would","would will","would should","will 'll","could will","could would","would could","'ll can","can will","can 'll","would could","can should","should will","would could","should can","might 'll","can would","can will","would can","can should","can 'll","will should","should will","will 'll","may will","can may","must can","could should","might will","will can","can would","should can","may can","will can","should could","should will","will 'll","will would","should will","can 'll","can would","should 'll","might would","would will","may used to","will 'll","will can","could should","would might","can could","would can","will would","would should","would could","could may","will 'll","can 'll","may would","will 'll","would will","should can","could would","will 'll","would could","would can","'ll can","could would","might shall","will should","might can","could would","would must","can could","might would",""],["would not will not","may would","could he can","may I would","should you can","'ll can","would we can 't","should you can","can will","could would","could we can","would can","might may","will we must","will would","would must","can 't we must","can 't he would","can 't I will","would have I could","shall not I will","will I may not","'ll must","might will","would I might","should they will","will he wouldn 't","would i will","would should","will you 'll","could will","could would have","would could","'ll you can 't","can 't it will","can I 'll","would it could","can I should","should have it will","would could","should we can","might you 'll","can I would","can they will","would you can","can it should","can we 'll","will should","should we will","will we 'll","May we will have","can 't we may not","must can 't","could should","might I will","will can","can we would","should can 't","may can","will you can 't","should I could","should will","will we 'll","will would","should will","can 't we 'll","can I would","should she 
'll","might I would","would we will not","May we used to","will it 'll","will can 't","could we should","would have it might have","can 't we couldn","would I can 't","will would","would should have","would could","could may","will we 'll","can they 'll","may would","will we 'll","would we will","shouldn 't they can","could I would","will I 'll","would could","wouldn 't you can","'ll we can","could I would have","might shall","will should","might can","could we would","would must","can could not","might would",""],["<a href=https://youtu.be/VogFB5X_1UM?t=490>https://youtu.be/VogFB5X_1UM?t=490<\/a>","<a href=https://youtu.be/6UudWle_wYM?t=1733>https://youtu.be/6UudWle_wYM?t=1733<\/a>","<a href=https://youtu.be/udvH0BtQ2ls?t=960>https://youtu.be/udvH0BtQ2ls?t=960<\/a>","<a href=https://youtu.be/V3YFSetBxgM?t=2184>https://youtu.be/V3YFSetBxgM?t=2184<\/a>","<a href=https://youtu.be/6erR7ZuYtjc?t=97>https://youtu.be/6erR7ZuYtjc?t=97<\/a>","<a href=https://youtu.be/zA7LkTk0Vt4?t=1447>https://youtu.be/zA7LkTk0Vt4?t=1447<\/a>","<a href=https://youtu.be/f1_x5C1ttCk?t=1702>https://youtu.be/f1_x5C1ttCk?t=1702<\/a>","<a href=https://youtu.be/7AsXVW1vako?t=119>https://youtu.be/7AsXVW1vako?t=119<\/a>","<a href=https://youtu.be/APNxUQP3Zok?t=1718>https://youtu.be/APNxUQP3Zok?t=1718<\/a>","<a href=https://youtu.be/X3rY_QDk5kA?t=492>https://youtu.be/X3rY_QDk5kA?t=492<\/a>","<a href=https://youtu.be/ded7W7m7id4?t=15>https://youtu.be/ded7W7m7id4?t=15<\/a>","<a href=https://youtu.be/PiSnXSuI8tQ?t=447>https://youtu.be/PiSnXSuI8tQ?t=447<\/a>","<a href=https://youtu.be/hXFo6xTncow?t=1969>https://youtu.be/hXFo6xTncow?t=1969<\/a>","<a href=https://youtu.be/tGt5g0CtIns?t=4281>https://youtu.be/tGt5g0CtIns?t=4281<\/a>","<a href=https://youtu.be/tGt5g0CtIns?t=7324>https://youtu.be/tGt5g0CtIns?t=7324<\/a>","<a href=https://youtu.be/tGt5g0CtIns?t=3605>https://youtu.be/tGt5g0CtIns?t=3605<\/a>","<a href=https://youtu.be/0A6Ng-XaP7w?t=3092>https://youtu.be/0A6Ng-XaP7w?t=3092<\/a>","<a 
href=https://youtu.be/nmrN8Qr-y2w?t=1235>https://youtu.be/nmrN8Qr-y2w?t=1235<\/a>","<a href=https://youtu.be/DHJivGsQSEY?t=4941>https://youtu.be/DHJivGsQSEY?t=4941<\/a>","<a href=https://youtu.be/DHJivGsQSEY?t=2320>https://youtu.be/DHJivGsQSEY?t=2320<\/a>","<a href=https://youtu.be/DHJivGsQSEY?t=5129>https://youtu.be/DHJivGsQSEY?t=5129<\/a>","<a href=https://youtu.be/R-cxOGyqmVQ?t=291>https://youtu.be/R-cxOGyqmVQ?t=291<\/a>","<a href=https://youtu.be/R-cxOGyqmVQ?t=1037>https://youtu.be/R-cxOGyqmVQ?t=1037<\/a>","<a href=https://youtu.be/PyCm4_tzCPk?t=868>https://youtu.be/PyCm4_tzCPk?t=868<\/a>","<a href=https://youtu.be/PyCm4_tzCPk?t=1462>https://youtu.be/PyCm4_tzCPk?t=1462<\/a>","<a href=https://youtu.be/qsDZFlJlQqk?t=2796>https://youtu.be/qsDZFlJlQqk?t=2796<\/a>","<a href=https://youtu.be/-MRLP4PgUuo?t=190>https://youtu.be/-MRLP4PgUuo?t=190<\/a>","<a href=https://youtu.be/wMCYYCIkHcc?t=267>https://youtu.be/wMCYYCIkHcc?t=267<\/a>","<a href=https://youtu.be/wMCYYCIkHcc?t=543>https://youtu.be/wMCYYCIkHcc?t=543<\/a>","<a href=https://youtu.be/65HJW6CDYRg?t=602>https://youtu.be/65HJW6CDYRg?t=602<\/a>","<a href=https://youtu.be/Xxor4Z5AKW8?t=2396>https://youtu.be/Xxor4Z5AKW8?t=2396<\/a>","<a href=https://youtu.be/6dbWhvANdTU?t=1494>https://youtu.be/6dbWhvANdTU?t=1494<\/a>","<a href=https://youtu.be/M-cwxUutH9Q?t=402>https://youtu.be/M-cwxUutH9Q?t=402<\/a>","<a href=https://youtu.be/qrrBqtker1I?t=49>https://youtu.be/qrrBqtker1I?t=49<\/a>","<a href=https://youtu.be/N7Tv7FSYUeM?t=2082>https://youtu.be/N7Tv7FSYUeM?t=2082<\/a>","<a href=https://youtu.be/N7Tv7FSYUeM?t=1167>https://youtu.be/N7Tv7FSYUeM?t=1167<\/a>","<a href=https://youtu.be/N7Tv7FSYUeM?t=453>https://youtu.be/N7Tv7FSYUeM?t=453<\/a>","<a href=https://youtu.be/AqV82AuRxhQ?t=1817>https://youtu.be/AqV82AuRxhQ?t=1817<\/a>","<a href=https://youtu.be/TB9MGdrWOB0?t=1691>https://youtu.be/TB9MGdrWOB0?t=1691<\/a>","<a href=https://youtu.be/vIFUqgtPss0?t=1389>https://youtu.be/vIFUqgtPss0?t=1389<\/a>","<a 
href=https://youtu.be/BE_wVJJdfsg?t=54>https://youtu.be/BE_wVJJdfsg?t=54<\/a>","<a href=https://youtu.be/vUrQSOIuHDs?t=2045>https://youtu.be/vUrQSOIuHDs?t=2045<\/a>","<a href=https://youtu.be/AR23tEWrvrA?t=1231>https://youtu.be/AR23tEWrvrA?t=1231<\/a>","<a href=https://youtu.be/9X7SqKEpjBQ?t=2367>https://youtu.be/9X7SqKEpjBQ?t=2367<\/a>","<a href=https://youtu.be/9X7SqKEpjBQ?t=1912>https://youtu.be/9X7SqKEpjBQ?t=1912<\/a>","<a href=https://youtu.be/2cu3CeFdUhg?t=3294>https://youtu.be/2cu3CeFdUhg?t=3294<\/a>","<a href=https://youtu.be/2dC0G3EWzqw?t=1025>https://youtu.be/2dC0G3EWzqw?t=1025<\/a>","<a href=https://youtu.be/2dC0G3EWzqw?t=495>https://youtu.be/2dC0G3EWzqw?t=495<\/a>","<a href=https://youtu.be/521iskNlHx8?t=2978>https://youtu.be/521iskNlHx8?t=2978<\/a>","<a href=https://youtu.be/Xm9diPw5WMk?t=176>https://youtu.be/Xm9diPw5WMk?t=176<\/a>","<a href=https://youtu.be/rNC9keNLC8Q?t=6523>https://youtu.be/rNC9keNLC8Q?t=6523<\/a>","<a href=https://youtu.be/rNC9keNLC8Q?t=5028>https://youtu.be/rNC9keNLC8Q?t=5028<\/a>","<a href=https://youtu.be/rNC9keNLC8Q?t=1991>https://youtu.be/rNC9keNLC8Q?t=1991<\/a>","<a href=https://youtu.be/rNC9keNLC8Q?t=2383>https://youtu.be/rNC9keNLC8Q?t=2383<\/a>","<a href=https://youtu.be/tcKuBzN9Bb8?t=4823>https://youtu.be/tcKuBzN9Bb8?t=4823<\/a>","<a href=https://youtu.be/tcKuBzN9Bb8?t=2122>https://youtu.be/tcKuBzN9Bb8?t=2122<\/a>","<a href=https://youtu.be/mXEGwm7eolA?t=6171>https://youtu.be/mXEGwm7eolA?t=6171<\/a>","<a href=https://youtu.be/mXEGwm7eolA?t=6681>https://youtu.be/mXEGwm7eolA?t=6681<\/a>","<a href=https://youtu.be/M_gI5ewdUmc?t=2784>https://youtu.be/M_gI5ewdUmc?t=2784<\/a>","<a href=https://youtu.be/EfBKYTPWDQ8?t=248>https://youtu.be/EfBKYTPWDQ8?t=248<\/a>","<a href=https://youtu.be/0iZWE5cwSn0?t=209>https://youtu.be/0iZWE5cwSn0?t=209<\/a>","<a href=https://youtu.be/wZvdkXcPy_Q?t=301>https://youtu.be/wZvdkXcPy_Q?t=301<\/a>","<a href=https://youtu.be/nFKxxrk5WqY?t=256>https://youtu.be/nFKxxrk5WqY?t=256<\/a>","<a 
href=https://youtu.be/1s1Ez_gkW94?t=9>https://youtu.be/1s1Ez_gkW94?t=9<\/a>","<a href=https://youtu.be/9uUOmgpjez0?t=298>https://youtu.be/9uUOmgpjez0?t=298<\/a>","<a href=https://youtu.be/vGgnHcp0fNw?t=1774>https://youtu.be/vGgnHcp0fNw?t=1774<\/a>","<a href=https://youtu.be/vGgnHcp0fNw?t=1224>https://youtu.be/vGgnHcp0fNw?t=1224<\/a>","<a href=https://youtu.be/ew8YXqYmYAk?t=150>https://youtu.be/ew8YXqYmYAk?t=150<\/a>","<a href=https://youtu.be/o-nRD07almk?t=149>https://youtu.be/o-nRD07almk?t=149<\/a>","<a href=https://youtu.be/4SMlmhHhcYc?t=91>https://youtu.be/4SMlmhHhcYc?t=91<\/a>","<a href=https://youtu.be/25V7yYIWQ2c?t=569>https://youtu.be/25V7yYIWQ2c?t=569<\/a>","<a href=https://youtu.be/SB5jpCzFwiQ?t=2969>https://youtu.be/SB5jpCzFwiQ?t=2969<\/a>","<a href=https://youtu.be/uPHbOl-Ypm0?t=43>https://youtu.be/uPHbOl-Ypm0?t=43<\/a>","<a href=https://youtu.be/woRbDuL1w4Y?t=6244>https://youtu.be/woRbDuL1w4Y?t=6244<\/a>","<a href=https://youtu.be/nG-p-vXRokg?t=4796>https://youtu.be/nG-p-vXRokg?t=4796<\/a>","<a href=https://youtu.be/BSxHwMeDKhU?t=3679>https://youtu.be/BSxHwMeDKhU?t=3679<\/a>","<a href=https://youtu.be/BSxHwMeDKhU?t=8261>https://youtu.be/BSxHwMeDKhU?t=8261<\/a>","<a href=https://youtu.be/oIfC4edM6Pw?t=7846>https://youtu.be/oIfC4edM6Pw?t=7846<\/a>","<a href=https://youtu.be/vRX7UT_0VzI?t=3501>https://youtu.be/vRX7UT_0VzI?t=3501<\/a>","<a href=https://youtu.be/vRX7UT_0VzI?t=2476>https://youtu.be/vRX7UT_0VzI?t=2476<\/a>","<a href=https://youtu.be/vRX7UT_0VzI?t=2304>https://youtu.be/vRX7UT_0VzI?t=2304<\/a>","<a href=https://youtu.be/vXr1nVcStWM?t=2144>https://youtu.be/vXr1nVcStWM?t=2144<\/a>","<a href=https://youtu.be/cPGCknaizPw?t=580>https://youtu.be/cPGCknaizPw?t=580<\/a>","<a href=https://youtu.be/05RmQEZ20Go?t=2757>https://youtu.be/05RmQEZ20Go?t=2757<\/a>","<a href=https://youtu.be/nOFIlcUWSy0?t=3706>https://youtu.be/nOFIlcUWSy0?t=3706<\/a>","<a href=https://youtu.be/nOFIlcUWSy0?t=4115>https://youtu.be/nOFIlcUWSy0?t=4115<\/a>","<a 
href=https://youtu.be/0BNjOwKSdpE?t=5928>https://youtu.be/0BNjOwKSdpE?t=5928<\/a>","<a href=https://youtu.be/0BNjOwKSdpE?t=6733>https://youtu.be/0BNjOwKSdpE?t=6733<\/a>","<a href=https://youtu.be/Zm9VbA6cYaA?t=3239>https://youtu.be/Zm9VbA6cYaA?t=3239<\/a>","<a href=https://youtu.be/Zm9VbA6cYaA?t=386>https://youtu.be/Zm9VbA6cYaA?t=386<\/a>","<a href=https://youtu.be/k7wzI5qbGUE?t=858>https://youtu.be/k7wzI5qbGUE?t=858<\/a>","<a href=https://youtu.be/S2WdzuvW6E4?t=1243>https://youtu.be/S2WdzuvW6E4?t=1243<\/a>","<a href=https://youtu.be/S2WdzuvW6E4?t=1057>https://youtu.be/S2WdzuvW6E4?t=1057<\/a>","<a href=https://youtu.be/CFeAAHo8q1g?t=4755>https://youtu.be/CFeAAHo8q1g?t=4755<\/a>","<a href=https://youtu.be/CFeAAHo8q1g?t=2877>https://youtu.be/CFeAAHo8q1g?t=2877<\/a>","<a href=https://youtu.be/aFyKmKsuJHU?t=5505>https://youtu.be/aFyKmKsuJHU?t=5505<\/a>","<a href=https://youtu.be/aFyKmKsuJHU?t=3145>https://youtu.be/aFyKmKsuJHU?t=3145<\/a>","<a href=https://youtu.be/E2IcPi0AYcE?t=10414>https://youtu.be/E2IcPi0AYcE?t=10414<\/a>","<a href=https://youtu.be/PUf8v6NLNyk?t=5338>https://youtu.be/PUf8v6NLNyk?t=5338<\/a>","<a href=https://youtu.be/PUf8v6NLNyk?t=4969>https://youtu.be/PUf8v6NLNyk?t=4969<\/a>","<a href=><\/a>"],["fp o1","t","","","","t","","sr d","t","","","fp a1 o1","","","","fp a2","","","","","","","fp a1","fp o1 a2","t *","","","","sr d t *","","","","","","","","","","","sr d","","","","","","","","sr d t *","","","","","fp a1","","","fp hn1","sr p","","","","sr d","fp o1","","","sr d","","","","","","","","sr p t *","","","","","","","sr p","sr p","","","fp o1","","","","","","sr d","","","","fp o1","sr d","fp a2","","sr d t *","","t",""],["","\"the wording, which it may would be better to read\"","","","","\"and we'll can see how\"","","","\"that business rating can will neither increase nor\" Scottish accent","","","","","","","\"I would much prefer\"","","","","","","","\"how else must\" Shakespeare performance","\"if I might. 
We'll...\" Shakespeare performance","Christopher Marlowe, Edward II, text is \"might I keep thee here\"","","","","no disfluency","","","","","","","","","","","","","","","","","","","","","","","","\"bus can't\"","","","\"if the political will can exist\"","","","","","","","","","","","","","","","","","","","","","","","","","","","","\"if I may, would\"","","","","","","","","","","\"that might, shall we say, ...\"","","\"it might then grow over\"","","","","\"I think three years, twelve terms might would be reasonable\"",""]],"container":"<table class=\"cell-border stripe\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>Country<\/th>\n      <th>Channel<\/th>\n      <th>DM<\/th>\n      <th>Regex_hit<\/th>\n      <th>link<\/th>\n      <th>type<\/th>\n      <th>notes<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"pageLength":45,"dom":"tip","scrollY":"400px","rownames":false,"order":[],"autoWidth":false,"orderClasses":false,"columnDefs":[{"orderable":false,"targets":0}],"lengthMenu":[10,25,45,50,100],"rowCallback":"function(row, data, displayNum, displayIndex, dataIndex) {\n}"}},"evals":["options.rowCallback"],"jsHooks":[]}</script>

---

### Double modals

- Regular-expression-search and manual annotation approach
- Double modals can be found in the US North and West and in Canada; in the British Isles, they occur in Scotland, N. Ireland, and N. England, but also in the English Midlands and South and in Wales <span class="small">(Coats in review)</span>

---

class: center, middle
background-image: url(dm_map_0.png)
background-size: contain

---

class: center, middle
background-image: url(uk_dm_map.png)
background-size: contain

---

### A few caveats

- Meetings of local government not representative of speech in general
- ASR errors: transcript quality depends on audio quality as well as on dialect features <span class="small">(Tatman 2017; Meyer et al. 2020; Markl & Lai 2021)</span>

---

### Summary and outlook

- Large corpora of ASR transcripts from YouTube channels of local governments in the US, Canada, Britain, and Ireland (coming soon: Australia/NZ, 190m tokens, Germany, 56m tokens)
- Useful for corpus studies of spoken language, dialectology, pragmatics
- Double modals are more widespread than has previously been documented

---

# Thank you!

---

### References

Agarwal, S., S. Godbole, D. Punjani & S. Roy. 2007. [How much noise is too much: A study in automatic text classification](https://doi.org/10.1109/ICDM.2007.21). In: *Seventh IEEE International Conference on Data Mining (ICDM 2007)*, 3–12.

Anderwald, L. & S. Wagner. 2007. The Freiburg English Dialect Corpus: Applying corpus-linguistic research tools to the analysis of dialect data. In: J. C. Beal, K. P. Corrigan & H. Moisl (Eds.), *Creating and digitizing language corpora volume 1: Synchronic databases*, 35–53. Palgrave Macmillan.

Coats, S. In review. Double modals in contemporary British and Irish speech.

Coats, S. Forthcoming a. Dialect corpora from YouTube. *Proceedings of ICAME41*. De Gruyter.

Coats, S. 2022. [Naturalistic double modals in North America](https://doi.org/10.1215/00031283-9766889). *American Speech*.

Corbett, J. 2014. Syntactic variation: Evidence from the Scottish Corpus of Text and Speech. In: R. Lawson (Ed.), *Sociolinguistics in Scotland*, 258–276. Palgrave Macmillan.

Corrigan, K. P., I. Buchstaller, A. Mearns & H. Moisl. 2012. [*The Diachronic Electronic Corpus of Tyneside English*](https://research.ncl.ac.uk/decte).

Du Bois, J. W., W. L. Chafe, C. Meyer, S. A. Thompson, R. Englebretson & N. Martey. 2000-2005. Santa Barbara corpus of spoken American English, Parts 1-4. Philadelphia: Linguistic Data Consortium.

Grieve, J., C. Montgomery, A. Nini, A. Murakami & D. Guo. 2019. [Mapping lexical dialect variation in British English using Twitter](https://doi.org/10.3389/frai.2019.00011). *Frontiers in Artificial Intelligence* 2.

Honnibal, M., I. Montani, H. Peters, S. V. Landeghem, M. Samsonov, J. Geovedi, J. Regan,
G. Orosz, S. L. Kristiansen, P. O. McCann, D. Altinok, Roman, G. Howard, S. Bozek, E. Bot,
M. Amery, W. Phatthiyaphaibun, L. U. Vogelsang, B. Böing, P. K. Tippa, jeannefukumaru,
G. Dubbin, V. Mazaev, R. Balakrishnan, J. D. Møllerhøj, wbwseeker, M. Burton, thomasO &
A. Patel. 2019. [Explosion/spaCy v2.1.7: Improved evaluation, better language factories and bug
fixes](https://doi.org/10.5281/zenodo.3358113).

Kallen, J. & J. Kirk. 2007. ICE-Ireland: Local variations on global standards. In: J. C. Beal, K. P. Corrigan & H. Moisl (Eds.), *Creating and digitizing language corpora volume 1: Synchronic databases*, 121–162. Palgrave Macmillan.


---

### References II

Markl, N. & C. Lai. 2021. [Context-sensitive evaluation of automatic speech recognition: considering user experience & language variation](https://aclanthology.org/2021.hcinlp-1.6). In: *Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing*, 34–40. Association for Computational Linguistics.

Meyer, J., L. Rauchenstein, J. D. Eisenberg & N. Howell. 2020. [Artie bias corpus: An open dataset for detecting demographic bias in speech applications](https://aclanthology.org/2020.lrec-1.796). In: *Proceedings of the 12th Language Resources and Evaluation Conference*, 6462–6468. European Language Resources Association.

Montgomery, M. B. & S. J. Nagle. 1994. Double modals in Scotland and the Southern United States:
Trans-atlantic inheritance or independent development? *Folia Linguistica Historica* 14, 91–108.

Murray, J. 1873. *The dialect of the southern counties of Scotland: Its pronunciation, grammar, and historical relations.* London: Asher & Co.

Nerbonne, J. 2009. Data-driven dialectology. *Language and Linguistics Compass* 3, 175–198.

Smith, J., D. Adger, B. Aitken, C. Heycock, E. Jamieson & G. Thoms. 2019. [*The Scots Syntax Atlas*](https://scotssyntaxatlas.ac.uk). University of Glasgow.

Szmrecsanyi, B. 2013. *Grammatical variation in British English dialects: A study in corpus-based
dialectometry*. Cambridge University Press.

Szmrecsanyi, B. 2011. Corpus-based dialectometry: A methodological sketch. *Corpora* 6, 45–76.

Tatman, R. 2017. [Gender and dialect bias in YouTube’s automatic captions](https://aclanthology.org/W17-1606). In: *Proceedings of the First ACL Workshop on Ethics in Natural Language Processing*, 53–59. Association for Computational Linguistics.

Wright, J. 1898–1905. *The English dialect dictionary* (6 volumes). London: Henry Frowde.
