YouTube and transcript files
Corpus creation from the API (Application Programming Interface) and scraping
US corpus, Canada corpus, British Isles corpus
Example/application: Analysis of articulation rate in the US (Coats 2019b)
Caveats, summary, future outlook
Billions of YouTube videos, many with data that may be useful for studies in sociolinguistics, dialectology, phonetics, etc.
Videos downloadable, audio signal can be extracted, captions files (user-uploaded or automatically-generated) exist (for many videos) and can be downloaded
Metadata such as location, occasion, interaction type, etc. can often be inferred
Data accessible through API and the web
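Since the audio signal has to be extracted from downloaded videos, a command-builder sketch using ffmpeg (FFmpeg Developers 2019, in the reference list) may help. The function name and the specific output settings (16 kHz mono PCM, a common input format for phonetic analysis tools) are illustrative assumptions, not part of the original workflow description:

```python
def audio_extract_command(video_path, wav_path, sample_rate=16000):
    """Assemble an ffmpeg call that drops the video stream and writes
    16-bit mono PCM audio (illustrative settings)."""
    return [
        "ffmpeg",
        "-i", video_path,        # input video file
        "-vn",                   # discard the video stream
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        "-ac", "1",              # mono
        "-ar", str(sample_rate), # resample to target rate
        wav_path,
    ]
```

The returned list can be passed to `subprocess.run` to perform the extraction.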
Videos may have multiple captions files: user-uploaded captions, auto-generated captions created using automatic speech recognition (ASR), or both, or neither
User-uploaded captions can be created manually or generated by third-party ASR software
Auto-generated captions are generated by Google's speech-to-text service
First ASR captions 2009 (Google 2009)
Advances in neural-network-based automatic speech-to-text transcription increase transcript accuracy (Dahl et al. 2012, Jaitly et al. 2012, Liao, McDermott & Senior 2013)
Google reports word error rates (WER) between 4.1% and 5.6% for recent neural-network-based ASR models (Chiu et al. 2018; cf. Ziman et al. 2018)
WER higher for videos
Provides access to videos, channels, and playlists that match search criteria
Can also be used to get activity summaries, automatically-generated video categories, image thumbnails, comments, like/dislike ratios, etc.
Data is returned in JSON format
Access limited by quota; all API calls have a quota cost
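A channel search against the YouTube Data API v3 REST endpoint can be sketched as below. The endpoint and the `part`, `q`, `type`, `maxResults`, and `key` parameters are the API's real ones; the function names and the placeholder API key are assumptions for illustration. Note that a `search.list` call has a quota cost of 100 units, which is why the quota limit constrains large-scale searching:

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/youtube/v3/search"

def build_channel_search(query, api_key, max_results=50):
    """Build the request URL for a channel search (search.list)."""
    params = {
        "part": "snippet",         # return titles, descriptions, etc.
        "q": query,                # e.g. "Alabama city council"
        "type": "channel",         # restrict results to channels
        "maxResults": max_results,
        "key": api_key,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)

def search_channels(query, api_key):
    """Execute the search and return the parsed JSON response."""
    with urllib.request.urlopen(build_channel_search(query, api_key)) as resp:
        return json.load(resp)
```

The JSON response contains an `items` list whose entries carry channel IDs and snippets.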
Identify content of interest (e.g. local government channels) from the API or by scraping
Get the captions files (and video/audio) using YouTube-DL
Get geographical locations
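The caption-retrieval step can be sketched with youtube-dl's subtitle flags (`--write-sub`, `--write-auto-sub`, `--sub-lang`, `--skip-download` are real options; the wrapper function is illustrative):

```python
import subprocess

def caption_command(video_id, lang="en"):
    """Assemble a youtube-dl call that fetches caption files without the video."""
    return [
        "youtube-dl",
        "--write-sub",        # user-uploaded captions, if any
        "--write-auto-sub",   # ASR-generated captions, if any
        "--sub-lang", lang,
        "--skip-download",    # captions only, no video/audio
        f"https://www.youtube.com/watch?v={video_id}",
    ]

def fetch_captions(video_id):
    subprocess.run(caption_command(video_id), check=True)
```

Dropping `--skip-download` retrieves the video as well, from which the audio signal can then be extracted.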
Videos are often public meetings of elected representatives at town/city/county/state level: advantages in terms of representativeness and comparability
Speaker place of residence (cf. videos collected based on place-name search alone)
Topical contents comparable
Communicative contexts comparable
Audio quality often high
Make calls to the API with search terms that include place names
Search for channels
"Alabama city council", "Arizona city council", "Arkansas city council", etc.
New quota limit makes this unrealistic
Python module Selenium for browser automation
Imitate a user interacting with YouTube's web search interface
Check tag attributes to get channel URLs
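The scraping step can be sketched with Selenium's WebDriver API; YouTube's page markup changes over time, so the idea of harvesting all anchor tags and filtering their `href` attributes for channel-style paths is an illustrative heuristic, not the exact selectors used:

```python
from urllib.parse import urljoin

def channel_urls(hrefs):
    """Keep only links that point at a YouTube channel page."""
    out = []
    for href in hrefs:
        if href and ("/channel/" in href or "/user/" in href or "/c/" in href):
            out.append(urljoin("https://www.youtube.com", href))
    return sorted(set(out))

def scrape_search_results(query):
    """Drive a real browser through YouTube's search UI
    (requires Selenium and a browser driver)."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get("https://www.youtube.com/results?search_query="
                   + query.replace(" ", "+"))
        anchors = driver.find_elements(By.TAG_NAME, "a")
        return channel_urls([a.get_attribute("href") for a in anchors])
    finally:
        driver.quit()
```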
Many homepages of municipalities have links to their social media presences
Scrape pages to get links to YouTube channels
YouTube will block IPs for 48, 72, or more hours if too many HTTP requests are made; therefore:
Change IP address after every 1,000 HTTP requests
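Rotating the IP on a fixed request budget can be sketched as a thin wrapper around a fetch function. The rotation mechanism itself is left as a caller-supplied callback (e.g. signalling a Tor client to build a new circuit, cf. Loesing et al. 2010 in the reference list); the class and its names are illustrative:

```python
class RotatingFetcher:
    """Wrap an HTTP fetch function and rotate the IP every `limit` requests.

    `rotate` is a caller-supplied callback; no specific mechanism
    (Tor, proxy pool, etc.) is assumed here.
    """

    def __init__(self, fetch, rotate, limit=1000):
        self.fetch = fetch
        self.rotate = rotate
        self.limit = limit
        self.count = 0

    def get(self, url):
        if self.count and self.count % self.limit == 0:
            self.rotate()   # change the outgoing IP address
        self.count += 1
        return self.fetch(url)
```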
Some videos have a geographical location tag (lat-long coordinates) assigned by the user, but most don't
Infer geographical location based on search term and channel title
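The inference step can be sketched with geopy (Esmukov et al. 2018, in the reference list): combine the channel title with the state from the search term into a geocodable place string. The keyword-stripping heuristic and function names are assumptions for illustration; the Nominatim call is geopy's real API but requires network access:

```python
def location_query(channel_title, state):
    """Combine a channel title with the state from the search term
    into a geocodable place string, stripping institution keywords."""
    for kw in ("city council", "City Council"):
        channel_title = channel_title.replace(kw, "")
    return f"{channel_title.strip()}, {state}, USA"

def geocode(query):
    """Resolve a place string to lat/long with geopy's Nominatim geocoder."""
    from geopy.geocoders import Nominatim
    geolocator = Nominatim(user_agent="youtube-corpus-example")
    loc = geolocator.geocode(query)
    return (loc.latitude, loc.longitude) if loc else None
```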
1.6 billion words, 3,211 channels, 427,652 videos, 195,641 hours of video
Weights matrices: polygon contiguity (binary); 5, 10, 25, and 50 nearest-neighbor (binary); and inverse distance with cutoffs of 200 km and 100 km
Ratio of contracted to uncontracted forms; 50 nearest-neighbor binary weights matrix; Getis–Ord Gi*
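The Gi* statistic (Getis & Ord 1992; Ord & Getis 1995) with a k-nearest-neighbor binary weights matrix can be sketched in plain numpy; in practice a spatial analysis library (e.g. PySAL) would be used, so this minimal implementation is illustrative. For Gi* the focal point is included among its own neighbors, and the z-score form is (Σⱼwᵢⱼxⱼ − x̄Σⱼwᵢⱼ) / (S·√([nΣⱼwᵢⱼ² − (Σⱼwᵢⱼ)²]/(n−1))):

```python
import numpy as np

def knn_binary_weights(coords, k):
    """Binary k-nearest-neighbor weights; each point also neighbors
    itself (the 'star' in Gi*)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    w = np.zeros_like(d)
    for i in range(len(coords)):
        w[i, np.argsort(d[i])[:k + 1]] = 1.0   # k neighbors + self
    return w

def getis_ord_gstar(x, w):
    """Gi* z-scores for values x under weights w (focal point included)."""
    n = len(x)
    xbar = x.mean()
    s = x.std()                     # population standard deviation
    wx = w @ x                      # local weighted sums
    wsum = w.sum(axis=1)
    w2sum = (w ** 2).sum(axis=1)
    denom = s * np.sqrt((n * w2sum - wsum ** 2) / (n - 1))
    return (wx - xbar * wsum) / denom
```

Positive z-scores mark hot spots (local clusters of high values, e.g. high contraction ratios), negative z-scores cold spots.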
Chiu, C.-C., Sainath, T. N., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., Kannan, A., Weiss, R. J., Rao, K., Gonina, E., Jaitly, N., Li, B., Chorowski, J. & Bacchiani, M. (2018). State-of-the-art speech recognition with sequence-to-sequence models. arXiv:1712.01769v6 [cs.CL].
Coats, S. (2019a). A corpus of regional American language from YouTube. In Navarretta, C. et al. (Eds.), Proceedings of the 4th Digital Humanities in the Nordic Countries Conference, Copenhagen, Denmark, March 6–8, 2019 (pp. 79–91). Aachen, Germany: CEUR.
Coats, S. (2019b). Articulation rate in American English in a corpus of YouTube videos. Language and Speech. https://doi.org/10.1177/0023830919894720
Dahl, G. E., Yu, D., Deng, L. & Acero, A. (2012). Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing 20(1), 30–42.
De Jong, N. H. & Wempe, T. (2009). Praat script to detect syllable nuclei and measure speech rate automatically. Behavior Research Methods 41(2), 385–390.
Esmukov, K., et al. (2018). Geopy. [Python module]. https://github.com/geopy/geopy
FFmpeg Developers. (2019). ffmpeg tool (Version 4.1.3) [Computer software]. http://ffmpeg.org
Getis, A. & Ord, J. K. (1992). The analysis of spatial association by use of distance statistics. Geographical Analysis 24(3), 189–206.
Google. (2009). Automatic captions in YouTube. https://googleblog.blogspot.com/2009/11/automatic-captions-in-youtube.html
Grieve, J. (2016). Regional variation in written American English. Cambridge, UK: Cambridge University Press.
Grieve, J., Speelman, D. & Geeraerts, D. (2011). A statistical method for the identification and aggregation of regional linguistic variation. Language Variation and Change 23, 193–221.
Halpern, Y., Hall, K. B., Schogol, V., Riley, M., Roark, B., Skobeltsyn, G. & Bäuml, M. (2016). Contextual prediction models for speech recognition. Proceedings of INTERSPEECH 2016, 2338–2342.
Jaitly, N., Nguyen, P., Senior, A. & Vanhoucke, V. (2012). Application of pretrained deep neural networks to large vocabulary speech recognition. Proceedings of INTERSPEECH 2012, 2578–2581.
Liao, H., McDermott, E. & Senior, A. (2013). Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription. Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, 368–373.
Loesing, K., Murdoch, S. J., & Dingledine, R. (2010). A case study on measuring statistical data in the Tor anonymity network. In R. Sion et al. (Eds.), Financial Cryptography and Data Security: FC 2010 Workshops, RLCPS, WECSR, and WLC 2010 Tenerife, Canary Islands, Spain, January 2010, Revised Selected Papers, 203–215.
Ord, J. K. & Getis, A. (1995). Local spatial autocorrelation statistics: Distributional issues and application. Geographical Analysis 27(4), 286–306.
Orton, H., Sanderson, S. & Widdowson, J.D.A. (1978). The Linguistic Atlas of England. London and Atlantic Highlands, New Jersey: Croom Helm.
Szmrecsanyi, B. (2011). Corpus-based dialectometry: A methodological sketch. Corpora 6(1), 45–76.
Yen, C.-H., Remita, A. & Sergey M. (2019). Youtube-dl [Computer software]. https://github.com/rg3/youtube-dl/blob/master/README.md