English Philology, University of Oulu, Finland
steven.coats@oulu.fi
4th DHN Conference, Copenhagen
March 8th, 2019
Data in dialectology: Linguistic atlases and dialect corpora
Data collection from YouTube and corpus creation
Preliminary analysis: Getis-Ord G*i statistic, lexical and grammatical variables
Slides for the presentation are on my homepage at https://cc.oulu.fi/~scoats
Data from traditional language atlases
Dialect corpora
(Kurath 1949)
Billions of YouTube videos, many with speech relevant for dialectological research
First automatically generated speech-to-text captions in 2009 (Google 2009)
Recent advances in neural-network-based speech-to-text transcription increase transcript accuracy (Chiu et al. 2018)
Public meetings of elected representatives at town/city/state level: advantages in terms of representativeness and comparability
Script to search the YouTube API for channels, combining:
Search terms: county of, city of, municipal, town meeting, city council, county supervisors, board of supervisors, government, and official government
+ names/abbreviations of the 50 U.S. states, or names of the 312 largest municipalities and 100 largest counties by population in the United States + corresponding state names/abbreviations
Example queries: county of Alabama; city council CA; official government Chicago, Illinois; official government Los Angeles County, California
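The query construction described above can be sketched as a cross product of search terms and place names. This is only an illustration, not the script used for the study: the function name `build_queries` and the short `PLACES` list are hypothetical, and the actual YouTube API call is left out.

```python
from itertools import product

# Search terms listed on the slide.
SEARCH_TERMS = [
    "county of", "city of", "municipal", "town meeting", "city council",
    "county supervisors", "board of supervisors", "government",
    "official government",
]

# Hypothetical, heavily truncated place list (assumption): the full list
# would hold state names/abbreviations plus municipality and county names
# paired with their states.
PLACES = [
    "Alabama",
    "CA",
    "Chicago, Illinois",
    "Los Angeles County, California",
]


def build_queries(terms, places):
    """Return one query string per (search term, place) combination."""
    return [f"{term} {place}" for term, place in product(terms, places)]


queries = build_queries(SEARCH_TERMS, PLACES)
```

Each resulting string (for example `county of Alabama`) would then be submitted as a channel search query.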
Spatial autocorrelation statistic used in geography and recently in dialectology (e.g. Grieve 2016)
For each point in spatially distributed data: Positive value ➜ in a cluster of high values, negative value ➜ in a cluster of low values, zero ➜ not in a cluster
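$$G_i^* = \frac{\sum_{j=1}^{n} w_{ij} x_j - \bar{X} \sum_{j=1}^{n} w_{ij}}{s \sqrt{\dfrac{n \sum_{j=1}^{n} w_{ij}^2 - \left(\sum_{j=1}^{n} w_{ij}\right)^2}{n-1}}}$$
$n$ = number of locations; $i, j$ = location indexes; $x$ = value of variable; $w_{ij}$ = spatial weight between locations $i$ and $j$; $\bar{X}$ = mean of $x$; $s$ = standard deviation of $x$
Result is a standard deviate (significant at p = 0.05 for $|G_i^*| \geq 1.645$)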
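A minimal NumPy sketch of the statistic, assuming a dense n × n weight matrix `w` whose diagonal is non-zero (G*i counts location i itself among its neighbors). The function name `getis_ord_gi_star` is hypothetical, and taking s as the population standard deviation (dividing by n) is an assumption following the usual definition of the statistic.

```python
import numpy as np


def getis_ord_gi_star(x, w):
    """Compute Getis-Ord G*i for every location.

    x : sequence of n values of the variable, one per location
    w : (n, n) array of spatial weights; w[i, j] is the weight between
        locations i and j (the diagonal is included for G*i)
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n = x.size

    x_bar = x.mean()
    s = x.std()  # population standard deviation (ddof=0), an assumption

    w_sum = w.sum(axis=1)            # sum_j w_ij, per location i
    w_sq_sum = (w ** 2).sum(axis=1)  # sum_j w_ij^2, per location i

    numerator = w @ x - x_bar * w_sum
    denominator = s * np.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
    return numerator / denominator


# Four locations on a line; each is a neighbor of itself and of the
# adjacent locations (binary weights).
values = [10.0, 9.0, 1.0, 0.0]
weights = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
])
g = getis_ord_gi_star(values, weights)
```

In this toy example the two locations at the high-value end get positive scores and the two at the low-value end get negative scores. If a location's neighborhood covers every location with equal weight, the denominator is zero and the result is undefined.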
Spatial weights can be binary, based on polygon contiguity, a cutoff distance, or a nearest-neighbor function; or continuous, based on inverse distance or other functions
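Three of the weighting schemes named above (cutoff distance, nearest neighbors, inverse distance) can be sketched as follows. The sketch assumes planar coordinates with Euclidean distance and distinct locations; for latitude/longitude data a geodesic distance would be needed instead. All function names and the `self_weight` parameter are hypothetical.

```python
import numpy as np


def pairwise_distances(coords):
    """Euclidean distance matrix for an (n, 2) array of planar coordinates."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def cutoff_weights(dist, cutoff):
    """Binary weights: 1 if within the cutoff distance (self included), else 0."""
    return (dist <= cutoff).astype(float)


def knn_weights(dist, k):
    """Binary weights: 1 for each location itself and its k nearest neighbors."""
    n = dist.shape[0]
    w = np.zeros_like(dist)
    # Self has distance 0 and sorts first, so take k + 1 columns per row.
    nearest = np.argsort(dist, axis=1, kind="stable")[:, : k + 1]
    w[np.arange(n)[:, None], nearest] = 1.0
    return w


def inverse_distance_weights(dist, self_weight=1.0):
    """Continuous weights 1/d; the diagonal gets self_weight (an assumption)."""
    w = np.zeros_like(dist)
    off_diagonal = dist > 0
    w[off_diagonal] = 1.0 / dist[off_diagonal]
    np.fill_diagonal(w, self_weight)
    return w


coords = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
dist = pairwise_distances(coords)
w_cutoff = cutoff_weights(dist, cutoff=1.5)
w_knn = knn_weights(dist, k=1)
w_inverse = inverse_distance_weights(dist)
```

Note that the nearest-neighbor matrix is generally not symmetric: a location can count another as its nearest neighbor without the reverse holding.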
Spatial autocorrelation/visualization shows that
Bird, S., Loper, E., & Klein, E. 2009. Natural Language Processing with Python. Newton, MA: O'Reilly.
Chiu, C.-C., Sainath, T., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., Kannan, A., Weiss, R. J., Rao, K., Gonina, E., Jaitly, N., Li, B., Chorowski, J., & Bacchiani, M. 2018. State-of-the-art speech recognition with sequence-to-sequence models. arXiv:1712.01769v6 [cs.CL].
Esmukov, K., et al. 2018. GeoPy (Python library).
Getis, A., & Ord, J. K. 1992. The analysis of spatial association by use of distance statistics. Geographical Analysis 24(3), 189–206.
Google. 2009. Automatic captions in YouTube.
Grieve, J. 2016. Regional variation in written American English. Cambridge, UK: Cambridge University Press.
Kretzschmar, W. A. 2009. The linguistics of speech. Cambridge, UK: Cambridge University Press.
Kurath, H. 1949. A word geography of the Eastern United States. Ann Arbor, MI: University of Michigan Press.
Nerbonne, J. 2009. Data-driven dialectology. Language and Linguistics Compass 3(1), 175–198.
Ord, J. K., & Getis, A. 1995. Local spatial autocorrelation statistics: Distributional issues and an application. Geographical Analysis 27(4), 286–306.
Szmrecsanyi, B. 2011. Corpus-based dialectometry: a methodological sketch. Corpora 6(1), 45–76.
Szmrecsanyi, B., & Hernández, N. 2007. Manual of information to accompany the Freiburg Corpus of English Dialects Sampler (FRED-S). Freiburg: University of Freiburg.