The Corpus of British Isles Spoken English (CoBISE)

class: inverse, center, middle
background-image: url(data:image/png;base64,#https://cc.oulu.fi/~scoats/oululogoRedTransparent.png);
background-repeat: no-repeat;
background-size: 80px 57px;
background-position:right top;
exclude: true

---

<br>

## The Corpus of British Isles Spoken English (CoBISE)
<span style="font-size:30px;font-family:Raleway;font-weight: 800;font-style: italic;color:#ffeddd;">A New Resource of Contemporary British and Irish Speech</span>

Steven Coats<br>
English Philology, University of Oulu, Finland<br>
<a href="mailto:steven.coats@oulu.fi">steven.coats@oulu.fi</a><br>

DHNB22 Conference, Uppsala <br> 
March 17th, 2022<br>

---

<div class="my-footer"><span>Steven Coats&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;CoBISE | DHNB22, Uppsala</span></div>

---

## Outline

1. Introduction

2. Data collection and processing

3. Transcript accuracy and corpus use cases

4. Example: Manual inspection/annotation of specific features

5. Caveats, summary

---

### Introduction

- Renaissance in corpus-based study of English varieties <span class="small">(Nerbonne 2009; Szmrecsanyi 2011, 2013; Grieve et al. 2019)</span>
- Available corpora of British and Irish English <span class="small">(Anderwald & Wagner 2007; Corbett 2014; Corrigan et al. 2012; Kallen & Kirk 2007)</span> are mostly text-based or limited to specific countries/regions; their size can make rare features difficult to find
- [CoBISE](https://cc.oulu.fi/~scoats/CoBISE.html): 112-million-word corpus of 38,680 word-timed, part-of-speech-tagged Automatic Speech Recognition (ASR) transcripts <span class="small">(Coats forthcoming a)</span>
- \> 12,801 hours of video from 495 YouTube channels of local councils and other government entities in 453 locations in England, Scotland, Wales, Northern Ireland, and the Republic of Ireland
- Created using procedures similar to those for [CoNASE](https://cc.oulu.fi/~scoats/CoNASE.html)
- Freely available for research use; download from the [Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UGIIWD)

---

### Focus on regional and local council channels

Many recordings of meetings of elected councillors: advantages in terms of representativeness and comparability

- Speakers' place of residence can be inferred with more confidence (cf. videos collected based on place-name search alone)

- Topical contents and communicative contexts comparable

---

### Data collection and processing

- Identification of relevant channels (YouTube API, searches of public-facing server, lists of councils with YT channels)
- Inspection of returned channels to remove false positives
- Download all available ASR transcripts as .vtt files using [YouTube-DL](https://github.com/ytdl-org/youtube-dl)
- Use [Tor](https://www.torproject.org/) to circumvent IP blocking by YT
- Remove transcripts < 50 words and those that are not ASR
- Geocode each channel: send a string containing council name + channel name + country location to Google's geocoding service
- Check results, correct if necessary
- PoS tagging with spaCy <span class="small">(Honnibal et al. 2019)</span>
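
The word-count filter in the steps above can be sketched as follows. This is a minimal illustration, not the script used to build the corpus: the function names `vtt_words` and `keep_transcript` are hypothetical, and the sample assumes that YouTube ASR `.vtt` files carry inline word timings and repeat each caption line in the following cue. The check for whether a transcript is ASR is not covered here.

```python
import re

# Cue timing lines such as "00:00:01.000 --> 00:00:03.000 align:start"
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> ")
# Inline tags such as <00:00:01.500> or <c> ... </c>
TAG = re.compile(r"<[^>]+>")
HEADER = ("WEBVTT", "Kind:", "Language:")


def vtt_words(vtt_text):
    """Return the words of a .vtt transcript without header, timings, or tags."""
    words, previous = [], None
    for line in vtt_text.splitlines():
        line = line.strip()
        if not line or line.startswith(HEADER) or TIMING.match(line):
            continue
        text = TAG.sub("", line).strip()
        # Assumption: a caption line repeated verbatim is a rolling-caption
        # duplicate, so it is counted only once.
        if text and text != previous:
            words.extend(text.split())
            previous = text
    return words


def keep_transcript(vtt_text, min_words=50):
    """Keep a transcript only if it has at least min_words words."""
    return len(vtt_words(vtt_text)) >= min_words


# Invented sample in the assumed YouTube ASR layout
sample = """WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:02.000 align:start position:0%
good<00:00:00.500><c> evening</c><00:00:01.000><c> everyone</c>

00:00:02.000 --> 00:00:04.000 align:start position:0%
good evening everyone
welcome<00:00:02.500><c> to</c><00:00:03.000><c> the</c><00:00:03.500><c> meeting</c>
"""

print(vtt_words(sample))
print(keep_transcript(sample))
```

The seven-word sample falls below the 50-word threshold and would be removed.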

---

### Transcript accuracy

- ASR transcripts contain errors
- Given a minimum accuracy level, for high-frequency phenomena the signal of correct transcriptions will be stronger <span class="small">(Agarwal et al. 2007)</span>; for low-frequency phenomena one can manually inspect corpus hits
![:scale 60%](data:image/png;base64,#asr_wordfreqs.png)
---

### Corpus use cases and size

- Regional language (dialectology): e.g. syntax, mood and modality
- Pragmatics: Turn-taking, politeness markers
- Script pipeline: Use corpus to identify areas/speakers/words/phonemes of interest, get videos, convert to audio (FFmpeg), automated formant extraction/vowel quality analysis on a large scale

Country             | Channels | Videos | Words      | Length (h)
--------------------|----------|--------|------------|-----------
England             | 324      | 23,657 | 72,879,173 | 8,518.39
Northern Ireland    | 10       | 1,898  | 6,508,505  | 774.17
Republic of Ireland | 26       | 2,525  | 6,264,276  | 680.81
Scotland            | 75       | 8,135  | 17,111,396 | 1,845.35
Wales               | 18       | 2,465  | 8,800,264  | 982.66
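
The audio-conversion step of the script pipeline could look roughly like this. It is a sketch under assumptions: the helper name `ffmpeg_audio_command`, the file names, and the choice of mono 16 kHz WAV output are illustrative, not taken from the corpus scripts. The function only builds the argument list; running it requires FFmpeg to be installed.

```python
def ffmpeg_audio_command(video_path, audio_path, start=None, duration=None,
                         sample_rate=16000):
    """Build an FFmpeg argument list that extracts mono audio from a video.

    start and duration (in seconds) optionally cut out a segment, e.g. the
    stretch around a corpus hit.
    """
    cmd = ["ffmpeg"]
    if start is not None:
        cmd += ["-ss", str(start)]      # seek to the start of the segment
    if duration is not None:
        cmd += ["-t", str(duration)]    # limit the length of the segment
    cmd += [
        "-i", video_path,
        "-vn",                          # drop the video stream
        "-ac", "1",                     # mono
        "-ar", str(sample_rate),        # audio sample rate in Hz
        audio_path,
    ]
    return cmd


# Hypothetical file names; pass the list to subprocess.run(cmd, check=True)
cmd = ffmpeg_audio_command("meeting.mp4", "hit.wav", start=885, duration=10)
print(" ".join(cmd))
```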

---

### Script: Generating a table for manual inspection of '*I daresay*'

- Pseudo-modal with interesting grammatical properties
- Used in spoken language, quite rare in written language (excepting dialogue)

```python
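import re
import pandas as pd

# Tokens in "text_pos" appear to have the form word_POS_time. The pattern
# matches "I daresay" with three tokens of context on either side.
pat1 = re.compile(
    r"((\w+_\S+_\S+\s){3}i_\w+_\S+ daresay_\w+_\S+\s(\w+_\S+_\S+\s){3})",
    re.IGNORECASE,
)
hits = []
for _, row in cobise_df.iterrows():  # cobise_df: one row per transcript
    match = pat1.search(row["text_pos"])
    if match:  # only the first hit in each transcript is kept
        tokens = match.group(1).split()
        seq = " ".join(tok.split("_")[0] for tok in tokens)
        time = tokens[0].split("_")[-1]  # start time of the first context token
        # Link to the video three seconds before the start of the hit
        url = "https://youtu.be/" + row["video_id"] + "?t=" + str(round(float(time) - 3))
        hits.append((row["country"], row["channel_title"], seq, url))
pd.DataFrame(hits)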
```
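
To see what the pattern captures, it can be run on a single toy transcript string. The tokens, tags, timings, and the `VIDEO_ID` placeholder below are invented; only the `word_POS_time` layout is inferred from the script above.

```python
import re

pat = re.compile(
    r"((\w+_\S+_\S+\s){3}i_\w+_\S+ daresay_\w+_\S+\s(\w+_\S+_\S+\s){3})",
    re.IGNORECASE,
)

# Invented example in the assumed word_POS_time format
toy = ("thank_VB_9.2 you_PRP_9.6 and_CC_10.5 so_RB_11.0 I_PRP_11.4 "
       "daresay_VBP_11.8 there_EX_12.3 might_MD_12.6 be_VB_12.9 some_DT_13.2")

match = pat.search(toy)
tokens = match.group(1).split()
seq = " ".join(tok.split("_")[0] for tok in tokens)   # words only
start = float(tokens[0].split("_")[-1])               # time of first token
link = "https://youtu.be/" + "VIDEO_ID" + "?t=" + str(round(start - 3))

print(seq)   # you and so I daresay there might be
print(link)  # https://youtu.be/VIDEO_ID?t=7
```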

---

### Table

<div id="htmlwidget-f9cc8394ec54faea3110" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-f9cc8394ec54faea3110">{"x":{"filter":"none","vertical":false,"data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42"],["England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","England","Wales","Wales","Wales","Wales","Wales","Wales","Wales","Wales","Wales","Wales","Scotland"],["Babergh Mid Suffolk District Councils","Bristol City Council Live","Bristol City Council Live","Cambridgeshire County Council","City of York Council","City of York Council","City of York Council","City of York Council","City of York Council","IWCouncil","Maidstone Council","Maidstone Council","Maidstone Council","Maidstone Council","Maidstone Council","NSDCouncil","Plymouth City Council Webcast Archive","Richmond Council","ThanetCouncil","The Council of the Isles of Scilly","Wiltshire Council","middlesbroughcouncil","rochfordcouncil","rochfordcouncil","rochfordcouncil","rochfordcouncil","southwarkcouncil","Colchester Borough Council","Royal Borough of Windsor and Maidenhead","Suffolk County Council","Suffolk County Council","Monmouthshire CC","Monmouthshire CC","Monmouthshire CC","Monmouthshire CC","Monmouthshire CC","Torfaen Council Democracy and Scrutiny","Torfaen Council Democracy and Scrutiny","Torfaen Council Democracy and Scrutiny","Torfaen Council Democracy and Scrutiny","Torfaen Council Democracy and Scrutiny","North Lanarkshire Council"],["I think and I daresay you will review","the resources although I daresay we would all","Steve Pierce which I daresay some of you","put support now I daresay most other councillors","you very much I 
daresay there might be","much and then I daresay there may be","their own volition I daresay that choice has","months ago and I daresay might be here","more substantive but I daresay he may feed","of Wight and I daresay members of the","such proposals and I daresay we will continue","that surrounding area I daresay they would probably","the night so I daresay we ought not","park visitor and I daresay there will be","them before and i daresay will do again","assemble outside in I daresay the rain which","remarks which is I daresay the problem that","that we do I daresay there are plenty","it was easy I daresay most people on","in April although I daresay may may tell","up his leave I daresay you could say","finish up that I daresay probably a job","asked for it I daresay the the owner","all councils and I daresay as with any","as going through I daresay hullbridge I will","good interest rate I daresay they must have","meets our expectations I daresay certain stages in","headline for publicity I daresay it is indirectly","have been withdrawn i daresay will come back","hope and actually I daresay I hope and","council delivering that I daresay there might be","goes until 2019 I daresay the results won","the thing is I daresay in a rural","moment so that I daresay will come back","they want so I daresay that not having","you fell and I daresay most of us","landscaping proposals and I daresay the applicant would","time for us I daresay others will be","pick up that I daresay that was on","is come support I daresay this is covered","it as well I daresay that blessing zabal","just now which i daresay would be seen"],["<a href=https://youtu.be/9h1HBOjuiKs?t=888>https://youtu.be/9h1HBOjuiKs?t=888<\/a>","<a href=https://youtu.be/uYlgqEbDTBQ?t=1622>https://youtu.be/uYlgqEbDTBQ?t=1622<\/a>","<a href=https://youtu.be/5duBHnj3IPA?t=1777>https://youtu.be/5duBHnj3IPA?t=1777<\/a>","<a href=https://youtu.be/bDmodpBb0vU?t=5132>https://youtu.be/bDmodpBb0vU?t=5132<\/a>","<a 
href=https://youtu.be/6P03L8C5KS4?t=1941>https://youtu.be/6P03L8C5KS4?t=1941<\/a>","<a href=https://youtu.be/TFcWdkaQ6Js?t=356>https://youtu.be/TFcWdkaQ6Js?t=356<\/a>","<a href=https://youtu.be/lPRHzXqJjEA?t=1005>https://youtu.be/lPRHzXqJjEA?t=1005<\/a>","<a href=https://youtu.be/SsYxaxz3opw?t=1656>https://youtu.be/SsYxaxz3opw?t=1656<\/a>","<a href=https://youtu.be/6PK_GieQSro?t=2976>https://youtu.be/6PK_GieQSro?t=2976<\/a>","<a href=https://youtu.be/p3zKKB7-eI8?t=448>https://youtu.be/p3zKKB7-eI8?t=448<\/a>","<a href=https://youtu.be/x_g3mmcB_o0?t=2157>https://youtu.be/x_g3mmcB_o0?t=2157<\/a>","<a href=https://youtu.be/b5g0mqKdH_U?t=3178>https://youtu.be/b5g0mqKdH_U?t=3178<\/a>","<a href=https://youtu.be/nfEWdS6tEUE?t=1949>https://youtu.be/nfEWdS6tEUE?t=1949<\/a>","<a href=https://youtu.be/x0z0mZZ7CDY?t=10921>https://youtu.be/x0z0mZZ7CDY?t=10921<\/a>","<a href=https://youtu.be/zQAocyilWek?t=2699>https://youtu.be/zQAocyilWek?t=2699<\/a>","<a href=https://youtu.be/zP59mzqAIN8?t=99>https://youtu.be/zP59mzqAIN8?t=99<\/a>","<a href=https://youtu.be/OdWUiy9kD2M?t=1933>https://youtu.be/OdWUiy9kD2M?t=1933<\/a>","<a href=https://youtu.be/YT5H7uVguMM?t=1799>https://youtu.be/YT5H7uVguMM?t=1799<\/a>","<a href=https://youtu.be/pfGHz9qUMVE?t=530>https://youtu.be/pfGHz9qUMVE?t=530<\/a>","<a href=https://youtu.be/dO-euz882_8?t=2475>https://youtu.be/dO-euz882_8?t=2475<\/a>","<a href=https://youtu.be/tH85jRl3ow0?t=6165>https://youtu.be/tH85jRl3ow0?t=6165<\/a>","<a href=https://youtu.be/CzotLP1vHB8?t=469>https://youtu.be/CzotLP1vHB8?t=469<\/a>","<a href=https://youtu.be/-c-uzM8h_vA?t=946>https://youtu.be/-c-uzM8h_vA?t=946<\/a>","<a href=https://youtu.be/ejm2D6vjGt8?t=465>https://youtu.be/ejm2D6vjGt8?t=465<\/a>","<a href=https://youtu.be/yI4DSRx9FH0?t=1769>https://youtu.be/yI4DSRx9FH0?t=1769<\/a>","<a href=https://youtu.be/W4Z3gtUgPJ0?t=2692>https://youtu.be/W4Z3gtUgPJ0?t=2692<\/a>","<a href=https://youtu.be/48aTc2xD1LI?t=5724>https://youtu.be/48aTc2xD1LI?t=5724<\/a>","<a 
href=https://youtu.be/FU_gkc0Xy90?t=1718>https://youtu.be/FU_gkc0Xy90?t=1718<\/a>","<a href=https://youtu.be/ZhPutbP6v_8?t=7353>https://youtu.be/ZhPutbP6v_8?t=7353<\/a>","<a href=https://youtu.be/VM03gJmZCGs?t=2416>https://youtu.be/VM03gJmZCGs?t=2416<\/a>","<a href=https://youtu.be/os01kCsUJV4?t=3893>https://youtu.be/os01kCsUJV4?t=3893<\/a>","<a href=https://youtu.be/B4UPz38EY90?t=11041>https://youtu.be/B4UPz38EY90?t=11041<\/a>","<a href=https://youtu.be/9J81WsJva6k?t=1690>https://youtu.be/9J81WsJva6k?t=1690<\/a>","<a href=https://youtu.be/Vnecx5-GzV8?t=804>https://youtu.be/Vnecx5-GzV8?t=804<\/a>","<a href=https://youtu.be/ipt2TOCJBdg?t=2734>https://youtu.be/ipt2TOCJBdg?t=2734<\/a>","<a href=https://youtu.be/0w5qm4AQJqw?t=7457>https://youtu.be/0w5qm4AQJqw?t=7457<\/a>","<a href=https://youtu.be/pHhq2v9iCoQ?t=1115>https://youtu.be/pHhq2v9iCoQ?t=1115<\/a>","<a href=https://youtu.be/Pe-PiKZrnAc?t=1514>https://youtu.be/Pe-PiKZrnAc?t=1514<\/a>","<a href=https://youtu.be/yzU1kXZWMAM?t=7040>https://youtu.be/yzU1kXZWMAM?t=7040<\/a>","<a href=https://youtu.be/a1R-6XmT5Kc?t=4763>https://youtu.be/a1R-6XmT5Kc?t=4763<\/a>","<a href=https://youtu.be/o6a5MkbIrJM?t=842>https://youtu.be/o6a5MkbIrJM?t=842<\/a>","<a href=https://youtu.be/il6X2qjjDvI?t=5353>https://youtu.be/il6X2qjjDvI?t=5353<\/a>"]],"container":"<table class=\"cell-border stripe\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>Country<\/th>\n      <th>Channel<\/th>\n      <th>Regex_hit<\/th>\n      <th>link<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"pageLength":45,"dom":"tip","scrollY":"400px","rownames":false,"order":[],"autoWidth":false,"orderClasses":false,"columnDefs":[{"orderable":false,"targets":0}],"lengthMenu":[10,25,45,50,100],"rowCallback":"function(row, data, displayNum, displayIndex, dataIndex) {\n}"}},"evals":["options.rowCallback"],"jsHooks":[]}</script>

---

### Example analysis: Double modals

- Rare, non-standard syntactic feature attested in the British Isles, North America, and elsewhere <span class="small">(Montgomery & Nagle 1994; Coats forthcoming b)</span>
- **Will you can help me with this?**
- Occurs exclusively in Scotland, Northern Ireland, and Northern England?
- Most studies based on non-naturalistic data with limited geographical scope <span class="small">(Murray 1873; Wright 1898-1905; Anderwald & Wagner 2007; Kallen & Kirk 2007; Smith et al. 2019)</span>

---

### Double modals

.pull-left[
- Regular-expression-search and manual annotation approach
- Double modals can be found in Scotland, N. Ireland, and N. England, but also in the English Midlands and South and in Wales <span class="small">(Coats in review)</span>  
]
<div style="top:-40px">
.pull-right[
![:scale 55%](data:image/png;base64,#Br_DM_pmw.png)
]
</div>
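
A regular-expression search for candidate double modals might be sketched as below. The modal lists and function names are assumptions for illustration, not the search used in the study. Hits are only candidates that still need manual annotation, and non-adjacent forms such as *Will you can help me* are not captured by this simple pattern.

```python
import re

# Illustrative (incomplete) lists of first- and second-position modals
FIRST = r"(?:might|may|must|will|would|should|used to)"
SECOND = r"(?:can|could|should|would|will|might)"
DOUBLE_MODAL = re.compile(rf"\b({FIRST})\s+({SECOND})\b", re.IGNORECASE)


def find_double_modals(text):
    """Return (first, second) pairs of adjacent modals as candidates."""
    pairs = []
    for m in DOUBLE_MODAL.finditer(text):
        first, second = m.group(1).lower(), m.group(2).lower()
        if first != second:  # skip repetitions such as "would would"
            pairs.append((first, second))
    return pairs


def per_million(hits, corpus_words):
    """Normalise a raw count to a frequency per million words."""
    return hits * 1_000_000 / corpus_words


print(find_double_modals("I might could help you with that"))
print(find_double_modals("we would would like to thank you"))
```

Normalising counts with `per_million` makes regions with different amounts of transcript text comparable.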

---

### A few caveats

- Meetings of local government not representative of speech in general
- ASR errors: transcript quality depends on audio quality as well as on dialect features <span class="small">(Tatman 2017; Meyer et al. 2020; Markl & Lai 2021)</span>

---

### Summary and outlook

- Large corpus of automatic speech-to-text transcripts from YouTube channels of local governments in Britain and Ireland
- Useful for corpus studies of spoken language, dialectology, pragmatics
- Freely available!

---

# Thank you!

---

### References

Agarwal, S., S. Godbole, D. Punjani & S. Roy. 2007. [How much noise is too much: A study in automatic text classification](https://doi.org/10.1109/ICDM.2007.21). In: *Seventh IEEE International Conference on Data Mining (ICDM 2007)*, 3–12.

Anderwald, L. & S. Wagner. 2007. The Freiburg English Dialect Corpus: Applying corpus-linguistic research tools to the analysis of dialect data. In: J. C. Beal, K. P. Corrigan & H. Moisl (Eds.), *Creating and digitizing language corpora volume 1: Synchronic databases*, 35–53. Palgrave Macmillan.

Coats, S. In review. Double modals in contemporary British and Irish Speech.

Coats, S. Forthcoming a. Dialect corpora from YouTube. *Proceedings of ICAME41*. De Gruyter.

Coats, S. Forthcoming b. Naturalistic double modals in North America. *American Speech*.

Coats, S. 2019. [A corpus of regional American language from YouTube](https://ceur-ws.org/Vol-2364/7_paper.pdf). In: C. Navarretta, M. Agirrezabal & B. Maegaard (Eds.), *Proceedings of the 4th Digital Humanities in the Nordic Countries Conference, Copenhagen, Denmark, March 6–8, 2019*, 79–91. CEUR-WS.

Corbett, J. 2014. Syntactic variation: Evidence from the Scottish Corpus of Text and Speech. In: R. Lawson (Ed.), *Sociolinguistics in Scotland*, 258–276. Palgrave Macmillan.

Corrigan, K. P., I. Buchstaller, A. Mearns & H. Moisl. 2012. [*The Diachronic Electronic Corpus of Tyneside English*](https://research.ncl.ac.uk/decte).

Grieve, J., C. Montgomery, A. Nini, A. Murakami & D. Guo. 2019. [Mapping lexical dialect variation in British English using Twitter](https://doi.org/10.3389/frai.2019.00011). *Frontiers in Artificial Intelligence* 2.

Honnibal, M., I. Montani, H. Peters, S. V. Landeghem, M. Samsonov, J. Geovedi, J. Regan, G. Orosz, S. L. Kristiansen, P. O. McCann, D. Altinok, Roman, G. Howard, S. Bozek, E. Bot, M. Amery, W. Phatthiyaphaibun, L. U. Vogelsang, B. Böing, P. K. Tippa, jeannefukumaru, G. Dubbin, V. Mazaev, R. Balakrishnan, J. D. Møllerhøj, wbwseeker, M. Burton, thomasO & A. Patel. 2019. [Explosion/spaCy v2.1.7: Improved evaluation, better language factories and bug fixes](https://doi.org/10.5281/zenodo.3358113).

Kallen, J. & J. Kirk. 2007. ICE-Ireland: Local variations on global standards. In: J. C. Beal, K. P. Corrigan & H. Moisl (Eds.), *Creating and digitizing language corpora volume 1: Synchronic databases*, 121–162. Palgrave Macmillan.


---

### References II

Markl, N. & C. Lai. 2021. [Context-sensitive evaluation of automatic speech recognition: considering user experience & language variation](https://aclanthology.org/2021.hcinlp-1.6). In: *Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing*, 34–40. Association for Computational Linguistics.

Meyer, J., L. Rauchenstein, J. D. Eisenberg & N. Howell. 2020. [Artie bias corpus: An open dataset for detecting demographic bias in speech applications](https://aclanthology.org/2020.lrec-1.796). In: *Proceedings of the 12th Language Resources and Evaluation Conference*, 6462–6468. European Language Resources Association.

Montgomery, M. B. & S. J. Nagle. 1994. Double modals in Scotland and the Southern United States: Trans-atlantic inheritance or independent development? *Folia Linguistica Historica* 14, 91–108.

Murray, J. 1873. *The dialect of the southern counties of Scotland: Its pronunciation, grammar, and historical relations.* London: Asher & Co.

Nerbonne, J. 2009. Data-driven dialectology. *Language and Linguistics Compass* 3, 175–198.

Smith, J., D. Adger, B. Aitken, C. Heycock, E. Jamieson & G. Thoms. 2019. [*The Scots Syntax Atlas*](https://scotssyntaxatlas.ac.uk). University of Glasgow.

Szmrecsanyi, B. 2013. *Grammatical variation in British English dialects: A study in corpus-based dialectometry*. Cambridge University Press.

Szmrecsanyi, B. 2011. Corpus-based dialectometry: A methodological sketch. *Corpora* 6, 45–76.

Tatman, R. 2017. [Gender and dialect bias in YouTube’s automatic captions](https://aclanthology.org/W17-1606). In: *Proceedings of the First ACL Workshop on Ethics in Natural Language Processing*, 53–59. Association for Computational Linguistics.

Wright, J. 1898–1905. *The English dialect dictionary* (6 volumes). London: Henry Frowde.
