Outline

  1. Data in dialectology: Linguistic atlases and dialect corpora

  2. Data collection from YouTube and corpus creation

  3. Preliminary analysis: Getis-Ord G*i statistic, lexical and grammatical variables

Slides for the presentation are on my homepage at https://cc.oulu.fi/~scoats

American dialectology data

Data from traditional language atlases

  • Artificial speech situation (questionnaire)
  • One speaker/one variant per location ➜ implied categoricity (Kretzschmar 2009, Nerbonne 2009, Szmrecsanyi 2011)
  • Linguistic Atlas of the Middle and South Atlantic States: "What word is used for a child born out of wedlock?"

Dialect corpora

  • Multiple speakers, relative frequencies ➜ more nuanced view of geographical distribution of linguistic forms
  • UK: Freiburg Corpus of English Dialects (transcribed interviews, Szmrecsanyi and Hernández 2007)
  • US: Letters to the editor corpus (written texts, Grieve 2016)
  • No corpora of transcribed American speech with broad geographic coverage

(Kurath 1949)

YouTube automatic speech-to-text captions

Billions of YouTube videos, many with speech relevant for dialectological research

Automatically generated speech-to-text captions first introduced in 2009 (Google 2009)

Recent advances in neural-network-based speech-to-text transcription increase transcript accuracy (Chiu et al. 2018)

Focus on US local government/civic organization meetings

Public meetings of elected representatives at town/city/state level: advantages in terms of representativeness and comparability

  • Speakers' place of residence is known
  • Similar in terms of topical content and communicative contexts

Example video

.vtt file

Data collection

Script to search YouTube API for channels:

  • Substrings county of, city of, municipal, town meeting, city council, county supervisors, board of supervisors, government, and official government + names/abbreviations of the 50 U.S. states, or names of the 312 largest municipalities and 100 largest counties in the United States by population + corresponding state name/abbreviation
  • E.g. county of Alabama, city council CA, official government Chicago, Illinois, official government Los Angeles County, California
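
A minimal sketch of how such search strings could be assembled, assuming each substring is simply paired with each place string. The function name `build_queries` and the short `sample_places` list are illustrative and not taken from the original script; the call to the YouTube API itself is left out.

```python
from itertools import product

# Substrings as listed on the slide.
SUBSTRINGS = [
    "county of", "city of", "municipal", "town meeting", "city council",
    "county supervisors", "board of supervisors", "government",
    "official government",
]


def build_queries(substrings, places):
    """Pair every search substring with every place name or abbreviation."""
    return [f"{substring} {place}" for substring, place in product(substrings, places)]


# Hypothetical sample of place strings; the full list would cover all states,
# municipalities, and counties mentioned above.
sample_places = ["Alabama", "CA", "Chicago, Illinois", "Los Angeles County, California"]
queries = build_queries(SUBSTRINGS, sample_places)
```

Each resulting string, such as `county of Alabama`, would then be sent as one channel search query.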

Data filtering

  • 1,680 channel matches; false positives and duplicates removed
  • Geocoding API to assign exact latitude-longitude coordinates (Esmukov et al. 2018)
  • 53,743 caption files downloaded in .vtt format; script to extract text and timings
  • PoS tagging with NLTK (Bird, Loper & Klein 2009)
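
The text-and-timing extraction step could look roughly like this. It is a sketch under assumptions, not the script used for the corpus: it assumes `hh:mm:ss.mmm` cue timestamps, strips inline tags such as `<c>`, and does not deduplicate lines that auto-generated captions may repeat across cues. The names `parse_vtt` and `to_seconds` are hypothetical.

```python
import re

# Cue timing line, e.g. "00:00:01.000 --> 00:00:03.500 align:start"
CUE_TIMING = re.compile(r"(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})")
# Inline markup such as <c> or <00:00:01.500>
INLINE_TAG = re.compile(r"<[^>]+>")


def to_seconds(stamp):
    """Convert an hh:mm:ss.mmm timestamp to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_vtt(text):
    """Return a list of cues, each with start time, end time, and plain text."""
    cues = []
    current = None
    for line in text.splitlines():
        match = CUE_TIMING.match(line)
        if match:
            current = {
                "start": to_seconds(match.group(1)),
                "end": to_seconds(match.group(2)),
                "text": "",
            }
            cues.append(current)
        elif not line.strip():
            current = None  # a blank line ends the current cue
        elif current is not None:
            cleaned = INLINE_TAG.sub("", line).strip()
            if cleaned:
                current["text"] = (current["text"] + " " + cleaned).strip()
    return cues


sample = """WEBVTT

00:00:01.000 --> 00:00:03.500
good evening and welcome

00:00:03.500 --> 00:00:06.000
to the <c>city</c> council meeting
"""
cues = parse_vtt(sample)
```

The cue texts can then be joined into one transcript per video, with the timings kept for later duration or speech-rate calculations.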

Spoken American YouTube Corpus (SpAmYT)

  • First corpus of spoken language for the entire US with fine geographic granularity
  • Largest spoken language corpus
  • 579 locations
  • 29,267.14 hours of video
  • 252,259,141 words

Channels sampled (channels with at least 1,000 words)

 
Exploratory analysis

  • Channel locations ➜ Voronoi tessellation; relative frequencies ➜ spatial autocorrelation

  • Proof of concept: How are lexical items pertaining to weather (snow, sun) spatially distributed?
  • Copula contraction: Where do Americans use more contracted forms (e.g. he's, they're) compared to uncontracted forms (e.g. he is, they are)?
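
As an illustration of the per-location values that feed the spatial analysis, the sketch below computes a frequency per million words and a contraction rate from raw text. It is an assumption-laden simplification: it uses a regex tokenizer instead of PoS tags, covers only a handful of pronoun forms, and treats every `'s` after a pronoun as a contracted copula (so `he's` for `he has` would be miscounted). All names are hypothetical.

```python
import re

# Illustrative subset of pronoun + copula pairs and their contracted forms.
FULL_FORMS = {"i": "am", "he": "is", "she": "is", "it": "is",
              "we": "are", "you": "are", "they": "are"}
CONTRACTED_FORMS = {"i'm", "he's", "she's", "it's", "we're", "you're", "they're"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (straight apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def contraction_rate(text):
    """Share of contracted forms among all contracted plus uncontracted forms."""
    tokens = tokenize(text)
    contracted = sum(1 for token in tokens if token in CONTRACTED_FORMS)
    full = sum(1 for first, second in zip(tokens, tokens[1:])
               if FULL_FORMS.get(first) == second)
    total = contracted + full
    return contracted / total if total else None


def per_million(count, n_words):
    """Relative frequency of an item per million words."""
    return count * 1_000_000 / n_words


sample = "He's here and they are late, but we're fine and she is not."
rate = contraction_rate(sample)
snow_frequency = per_million(tokenize("snow is likely and more snow later").count("snow"), 7)
```

Computing one such value per channel location gives the variable x whose spatial clustering is then tested.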

Getis-Ord G*i (Getis & Ord 1992; Ord & Getis 1995)

Spatial autocorrelation statistic used in geography and recently in dialectology (e.g. Grieve 2016)

For each point in spatially distributed data: Positive value ➜ in a cluster of high values, negative value ➜ in a cluster of low values, zero ➜ not in a cluster

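$$G_i^* = \frac{\sum_{j=1}^{n} w_{ij} x_j - \bar{X} \sum_{j=1}^{n} w_{ij}}{s \sqrt{\dfrac{n \sum_{j=1}^{n} w_{ij}^2 - \left( \sum_{j=1}^{n} w_{ij} \right)^2}{n-1}}}$$

$n$ = number of locations, $i, j$ = location indexes, $x$ = value of variable, $w_{ij}$ = spatial weight between locations $i$ and $j$, $\bar{X}$ = mean of $x$, $s$ = standard deviation of $x$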

Result is a standard deviate (significant at $p = 0.05$, one-tailed, for $|G_i^*| \geq 1.645$)
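
The statistic can be sketched in plain Python as follows. This is an illustrative implementation, not the code used for the maps: it assumes a full n × n weights matrix given as nested lists, includes location i in its own neighborhood (the G*i convention), and takes s to be the population standard deviation. The function name is hypothetical.

```python
import math


def getis_ord_gi_star(values, weights):
    """Return one G*i score per location.

    values  -- list of n observed values (x)
    weights -- n x n nested list, weights[i][j] = spatial weight w_ij
    """
    n = len(values)
    mean = sum(values) / n
    # Assumption: s is the population standard deviation of x.
    variance = sum(x * x for x in values) / n - mean ** 2
    s = math.sqrt(max(0.0, variance))
    scores = []
    for i in range(n):
        row = weights[i]
        sum_w = sum(row)
        sum_w2 = sum(w * w for w in row)
        numerator = sum(w * x for w, x in zip(row, values)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        # A zero denominator (constant data or uniform weights) yields no cluster signal.
        scores.append(numerator / denominator if denominator else 0.0)
    return scores


# Toy data: six locations on a line, high values on the left, low on the right.
toy_values = [10, 9, 8, 1, 1, 1]
# Binary weights: each location is linked to itself and its direct neighbors.
toy_weights = [[1.0 if abs(i - j) <= 1 else 0.0 for j in range(6)] for i in range(6)]
toy_scores = getis_ord_gi_star(toy_values, toy_weights)
```

In the toy data the leftmost location receives a positive score and the rightmost a negative one, matching the high-cluster and low-cluster reading described above.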

Spatial weights can be binary, based on polygon contiguity, a cutoff distance, or a nearest-neighbor function; or continuous, based on inverse distance or other functions
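
Two of these weighting schemes, binary nearest-neighbor and inverse distance with a cutoff, might be built from latitude-longitude pairs as sketched below. The helper names, the haversine great-circle distance, and the self-weight of 1 in the inverse-distance matrix are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def knn_binary_weights(coords, k):
    """Weight 1 for the k nearest locations (counting the location itself), else 0."""
    n = len(coords)
    weights = []
    for i in range(n):
        by_distance = sorted((haversine_km(coords[i], coords[j]), j) for j in range(n))
        nearest = {j for _, j in by_distance[:k]}
        weights.append([1.0 if j in nearest else 0.0 for j in range(n)])
    return weights


def inverse_distance_weights(coords, cutoff_km, self_weight=1.0):
    """Weight 1/d for locations within cutoff_km, 0 beyond it.

    Assumption: each location gets self_weight for itself.
    """
    n = len(coords)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                weights[i][j] = self_weight
                continue
            distance = haversine_km(coords[i], coords[j])
            if 0 < distance <= cutoff_km:
                weights[i][j] = 1.0 / distance
    return weights


# Approximate coordinates for Seattle, Portland (OR), Boston, and New York.
coords = [(47.61, -122.33), (45.52, -122.68), (42.36, -71.06), (40.71, -74.01)]
knn = knn_binary_weights(coords, 2)
inverse = inverse_distance_weights(coords, 250)
```

Either matrix can serve as the w_ij input to the statistic; the choice of k or cutoff controls how local the detected clusters are.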

Exploratory analysis: weather terms

(Weights matrices: 50 nearest-neighbor binary and inverse distance with cutoffs of 200 km and 100 km)

Exploratory analysis: copula contraction

(Weights matrices: 50 nearest-neighbor binary)

Copula contraction: Grieve 2016 (p. 151)

  • More contraction in the West, less in the East/Southeast
  • Spoken-language pattern corresponds to written-language pattern

Summary and outlook

  • First dialect corpus of spoken American English
  • First large corpus from automatic speech-to-text transcripts
  • Largest dialect corpus (252m words), extensive geographical coverage
  • First corpus from local government/civic organization meetings

Spatial autocorrelation/visualization shows that

  • Lexical types show interpretable regional variation
  • Geographical distribution of copula contraction: spoken language similar to written language

  • Analysis of lexico-grammatical features for aggregate dialectometry (cf. Grieve 2016)
  • NLP-based analyses (regional topics: topic modeling, regional word semantics: word vectors)
  • Speech/articulation rate analysis (in progress!)
  • Automatic annotation of speaker variables?

Thank you!


References

Bird, S., Loper, E., & Klein, E. 2009. Natural Language Processing with Python. Newton, MA: O'Reilly.

Chiu, C.-C., Sainath, T., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., Kannan, A., Weiss, R. J., Rao, K., Gonina, E., Jaitly, N., Li, B., Chorowski, J., & Bacchiani, M. 2018. State-of-the-art speech recognition with sequence-to-sequence models. arXiv:1712.01769v6 [cs.CL].

Esmukov, K., et al. 2018. GeoPy (Python library).

Getis, A., & Ord, J. K. 1992. The analysis of spatial association by use of distance statistics. Geographical Analysis 24(3), 189–206.

Google. 2009. Automatic captions in YouTube.

Grieve, J. 2016. Regional variation in written American English. Cambridge, UK: Cambridge University Press.

Kretzschmar, W. A. 2009. The linguistics of speech. Cambridge, UK: Cambridge University Press.

Kurath, H. 1949. A word geography of the Eastern United States. Ann Arbor, MI: University of Michigan Press.

Nerbonne, J. 2009. Data-driven dialectology. Language and Linguistics Compass 3(1), 175–198.

Ord, J. K., & Getis, A. 1995. Local spatial autocorrelation statistics: Distributional issues and application. Geographical Analysis 27(4), 286–306.

Szmrecsanyi, B. 2011. Corpus-based dialectometry: a methodological sketch. Corpora 6(1), 45–76.

Szmrecsanyi, B., & Hernández, N. 2007. Manual of information to accompany the Freiburg Corpus of English Dialects Sampler (FRED-S). Freiburg: University of Freiburg.
