class: inverse, center, middle background-image: url(data:image/png;base64,#https://cc.oulu.fi/~scoats/oululogoRedTransparent.png); background-repeat: no-repeat; background-size: 80px 57px; background-position:right top; exclude: true --- class: title-slide <br><br><br><br><br> .pull-right[ <span style="font-family:Rubik;font-size:24pt;font-weight: 700;font-style: normal;float:right;text-align: right;color:white;-webkit-text-fill-color: black;-webkit-text-stroke: 0.8px;">Analysis of Online Discourse from Text to Multimedia</span> ] <br><br><br><br> <p style="float:right;text-align: right;color:white;font-weight: 700;font-style: normal;-webkit-text-fill-color: black;-webkit-text-stroke: 0.5px;"> Steven Coats<br> English, University of Oulu, Finland<br> <a href="mailto:steven.coats@oulu.fi">steven.coats@oulu.fi</a><br> SCE Spring School, University of Eastern Finland<br> May 20th, 2024<br> </p> --- layout: true <div class="my-header"><img border="0" alt="W3Schools" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div> <div class="my-header1"><img border="0" alt="W3Schools" src="./uef_logo1.png" width="80" height="80"></div> <div class="my-footer"><span>Steven Coats                        Online Discourse from Text to Multimedia | UEF SCE Program</span></div> --- exclude: true <div class="my-header"><img border="0" alt="W3Schools" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div> <div class="my-header1"><img border="0" alt="W3Schools" src="./uef_logo1.png" width="80" height="80"></div> <div class="my-footer"><span>Steven Coats                        Online Discourse from Text to Multimedia | UEF SCE Program</span></div> --- ### Outline 1. Theoretical background - Computation as *data* and computation as *method* - Hermeneutics, Corpus linguistics, AI - Online communication as text, augmented text, and multimedia 2. Text - Twitter/𝕏: Skin tone emoji in tweets 3. Speech/Audio - ASR transcripts, corpora from YouTube, double modals - Corpus phonetics, Praat-Parselmouth 4. Towards multimodal corpora - Streaming content: Scripts and pipelines 5. Caveats, outlook, summary .footnote[Slides for the presentation are on my homepage at https://cc.oulu.fi/~scoats] <div class="my-header"><img border="0" alt="W3Schools" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div> <div class="my-header1"><img border="0" alt="W3Schools" src="./uef_logo1.png" width="80" height="80"></div> <div class="my-footer"><span>Steven Coats                          Discourse from Text to Multimedia | UEF SCE Program</span></div> --- ### Computation as data and method <span class="small">(cf. 
Cioffi-Revilla 2014)</span>

Computation as ***data***
- The representation of discourse as digitized, annotated records
- XML, JSON, base64, FFT of audio signals, video codecs
- Large data sets

Computation as ***method***
- Data analytical approaches for text, speech, video
- Frequencies and statistics, predictive algorithms, neural networks and LLMs

Visualization, mapping
- Big data from multiple platforms, utilizing the methods of computational social science and computational sociolinguistics (Lazer et al., 2009; Nguyen et al., 2016)

“Research begins where counting ends” (Rissanen 2009: 66)

---
### Theoretical background

.small[
- Humanities and social sciences have traditionally focused on the analysis of *discourse* in its various forms

Past: Traditional methods for the analysis of textual discourse
- Hermeneutics, exegesis, close reading
- Focused on interpretation of key texts and authors

![:scale 50%](data:image/png;base64,#discoursers.png)
<br>Schleiermacher, Dilthey, Deleuze, Foucault]

--

.small[
Present: The **computational turn** has brought about new methods for the analysis of discourse
- Content analysis (Krippendorff 2004)
- Distant reading (Moretti 2007, 2013)
- Cultural analytics (Manovich 2020)

Future: AI-augmented interpretation of content
]

---

![:scale 75%](data:image/png;base64,#book_covers.png)<br>

<div class="small">
"Content analysis is a research technique for making replicable and valid inferences from texts (or other meaningful matter) to the contexts of their use" (Krippendorff 2004: 18)<br>
- Includes "works of art, images, maps, sounds, signs, symbols, and numerical records"<br><br>
"Instead of concrete, individual works, a trio of artificial constructs — graphs, maps, and trees — in which the reality of the text undergoes a process of deliberate reduction and abstraction" (Moretti 2007: 1)<br><br>
"The use of computational and design methods — including data visualization, media and interaction design, statistics, and machine learning — for exploration and analysis of contemporary culture" (Manovich 2020: 9)</div>

---
exclude: True
### Discourse as communication, discourse as information

.pull-left[
![](data:image/png;base64,#./Jakobson_1960.png)
- Discourse is *information* from which meanings emerge contextually
- Information can be quantified and modeled
- Contextual, channel, code, sender, and receiver factors influence interpretations
]

.pull-right[
![](data:image/png;base64,#./entropy.png)
- In a computational sense, the newness of any information can be quantified
- Metrics derived from Shannon Entropy are widely used in machine learning and AI for the comparison of models
- Shannon Entropy is not discourse or meaning, but it quantifies information in a way that ultimately allows us to model discourses
- Can the other factors be modeled, given large enough models?
] --- exclude: True .center[ ![:scale 60%](data:image/png;base64,#./Hierarchy1.png) ] --- exclude: True .center[ ![:scale 75%](data:image/png;base64,#./Hierarchy2.png) ] --- exclude: True Cartoon --- .pull-left[ **Max Weber 1910** <span class="small">(Deutsche Gesellschaft für Soziologie 1911)</span> - Proposes a comprehensive empirical study of all newspapers to investigate relationships between media owners, consumers, advertisers, and politics - The nature of the press affects how people read and understand the world <span class="small">("Es sind unzweifelhaft gewaltige Verschiebungen, die die Presse da in den Lesegewohnheiten vornimmt, und damit gewaltige Verschiebungen der Prägung, der ganzen Art, wie der moderne Mensch von außen her rezipiert" DGfS 1911: 51)<br> ![:scale 40%](data:image/png;base64,#./Weber.png) <p class="small"> Digitized newspapers are now widely available and accessible through APIs<br> <a href="https://www.deutsche-digitale-bibliothek.de/newspaper">Deutsche Digitale Bibliothek</a><br> <a href="https://github.com/CSCfi/kielipankki-nlf-harvester">NLF Harvester</a> (Kielipankki/National Library of Finland digitized versions of almost all Finnish newspapers and periodicals since 1771) <a href="https://sites.utu.fi/digital-history-literature-finland">DHL-FI Project</a> (Finnish literature 1809-1917) </p> ] .pull-right[ **Andrei Markov 1913** <span class="small">(Markov 2006 [1913])</span> - Probabilities of character sequences in natural language text - Markov Chains, Bayesian modelling - Large Language Models/generative AI utilize what are essentially sophisticated multidimensional Markov Chains based on scrapes of the entire web + all social media + whatever else, including Pushkin ![:scale 50%](data:image/png;base64,#./Markov.png) ![:scale 45%](data:image/png;base64,#./Onegin.png) ] --- exclude: True ### Precursors .pull-right30[     ![:scale 90%](data:image/png;base64,#./Weber.png) ] .pull-left70[ Max Weber 1910: Proposes a comprehensive empirical study of the newspaper press to investigate - Relationships between media owners, consumers, and advertisers, especially with regards to politics - How does the commercial and capitalist organization of the media affect public opinion? <span class="small">("was bedeutet die kapitalistische Entwicklung innerhalb des Pressewesens für die soziologische Position der Presse im allgemeinen, für ihre Rolle innerhalb der Entstehung der öffentlichen Meinung?" 
Deutsche Gesellschaft für Soziologie 1911: 47)</span>
- The nature of the press affects how people read; also how they understand the world around them <span class="small">("Es sind unzweifelhaft gewaltige Verschiebungen, die die Presse da in den Lesegewohnheiten vornimmt, und damit gewaltige Verschiebungen der Prägung, der ganzen Art, wie der moderne Mensch von außen her rezipiert" *ibid.*: 51)</span>

Digitized newspapers are now widely available and accessible through APIs
- https://www.deutsche-digitale-bibliothek.de/newspaper
- https://github.com/CSCfi/kielipankki-nlf-harvester (Kielipankki/National Library of Finland has digitized versions of almost all Finnish newspapers and periodicals since 1771)
]

---
exclude: True
### Precursors

.pull-left[
Andrei Markov (2006 [1913])
- Probabilities of character sequences in natural language text
- → Markov Chains, Bayesian modelling
- Large Language Models/generative AI utilize what are essentially sophisticated multidimensional Markov Chains based on scrapes of the entire web + all social media + whatever else, including Pushkin
]

.pull-right[
![:scale 45%](data:image/png;base64,#./Markov.png)
![:scale 45%](data:image/png;base64,#./Onegin.png)
]

---
exclude: True
### Background III

- *computational social science* and *computational linguistics*
- Auguste Comte's rational/empirical conception of the study of society
- Psychology, Political Science, Sociology, Economics, Anthropology (the "big five")
- Computation as a paradigm for understanding, interpreting, and modeling society

Computational politics, computational sociology, digital humanities

---
exclude: True
### Shift to online contexts

In the last 40 years, two significant shifts have affected research in the humanities and social sciences

1. We have access to far larger amounts of *data* than previously
   - Digitized historical records
   - Population, census, statistical data from national and supranational sources
   - Constantly new *born-online* data: videos/streams, chat messages, comments, posts, etc.
2. Much discourse has shifted to *online* contexts
   - News and commentary on political developments
   - Informal communication types
   - Online discourse has undergone an evolution from asynchronous text → synchronous text → text + images → text + images and sound → video → video + text + images + sound
   - Interlocutor configurations have also become more complex

---
### Twitter/𝕏

- Founded in 2006 as an online equivalent to SMS
- Access to data via API (until June 2023)
- Archetypal source for online text data
- Data used for studies of language diversity and multilingualism (e.g. Mocanu et al. 2013; Coats 2019), dialects (e.g. Grieve et al. 2019; Purschke and Hovy 2019), pragmatics (Zappavigna 2012), politics, migration, catastrophes and other topics (see, e.g., Tumasjan et al. 2010; Hübl et al. 2017; Murzintcev and Cheng 2017).

--

Many researchers have collected and stored Twitter/𝕏 data
- Nordic Tweet Stream (Laitinen et al. 2018)
- A post can contain text, images, video, and various types of metadata (location, network, etc.)
- Let us start with text and consider *emoji*

---
exclude: True

- Telegram, Signal, Discord, Mastodon
- Jodel (Purschke)
- Purschke and Hovy (2019) collected a corpus of German-language messages from Germany, Austria, and Switzerland from the social media service Jodel, which allows for anonymous communication within a 10km radius of a user’s location.
They found that clustering individual locations in their data based on similarity of lemmatized content words largely recapitulates traditional dialect divisions for German derived from older linguistic atlas data (Lameli, 2013). --- ### Emoji Pictorial representations in graphical form. Origins in Japan in 1990s, introduced into Unicode late 2000s as dedicated code points. Currently 3,782 unique emoji, more with every Unicode update. .center[ ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f600.png)![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f63a.png)![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f478.png)![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f680.png)![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f1eb-1f1ee.png) .small[Glyphs from [twemoji](https://github.com/twitter/twemoji)]] - Emoji are used in computer-mediated communication in most languages - What meanings do emoji have? - What emoji are used in what places? #### Creating a corpus to analyze emoji in discourse - Write script to get tweets from Twitter API - Select all tweets with emoji - Use sentiment analysis and word vector similarity to consider emoji meanings --- ### Skin tone emoji Since Unicode 8.0 (June 17, 2015), skin tone characters are part of Unicode -- .pull-left30[ ![](data:image/png;base64,#Fitzpatrick1a.png).small[[source](./Fitzpatrick1a.png)]] -- .pull-right70[ ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f3fb.png)Emoji Modifier Fitzpatrick Type-1-2<br>![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f3fc.png)Emoji Modifier Fitzpatrick Type-3<br>![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f3fd.png)Emoji Modifier Fitzpatrick Type-4<br>![](data:image/png;base64,#https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f3fe.png)Emoji Modifier Fitzpatrick Type-5<br> ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f3ff.png)Emoji Modifier Fitzpatrick Type-6<br> ] --- ### Skin tone emoji use Skin tone is shown using sequences of Unicode characters .pull-left30[ ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f478.png) + ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f3fb.png) = ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f478-1f3fb.png)<br> ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f478.png) + ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f3fc.png) = ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f478-1f3fc.png)<br> ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f478.png) + ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f3fd.png) = ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f478-1f3fd.png)<br> ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f478.png) + ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f3fe.png) = ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f478-1f3fe.png)<br> ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f478.png) + ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f3ff.png) = ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f478-1f3ff.png)<br> ] .pull-right70[ <br><br> \U0001f478\U0001f3fb<br><br><br> \U0001f478\U0001f3fc<br><br><br> \U0001f478\U0001f3fd<br><br><br> \U0001f478\U0001f3fe<br><br><br> \U0001f478\U0001f3ff ] --- exclude: 
True
### Emoji sequences

Since Unicode 9.0 (late 2016), emoji sequences can also be used to indicate activities, professions, groups, etc. These can usually be combined with skin tone as well.

![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f468.png) + ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/2695.png) = ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f468-200d-2695-fe0f.png)

![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f9db.png) + ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f3ff.png) + ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/2640.png) = ![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f9db-1f3ff-200d-2640-fe0f.png)

Sequences can utilize additional **zero-width joiner** and **variation selector** code points to show that the sequence is to be parsed as one character

![]( https://cdnjs.cloudflare.com/ajax/libs/twemoji/14.0.2/72x72/1f9d6-1f3fb-200d-2642-fe0f.png) = \U0001f9d6\U0001f3fb\U0000200d\U00002642\U0000fe0f

- Parsing and tokenization of emoji sequences can present difficulties

---
### Emoji corpus

Twitter's API was used to collect 600m tweets. After filtering, this resulted in a global corpus of 24,231,885 tweets containing skin tone emoji

How are these skin tone emoji being used globally?
- How often do users select a skin tone variant compared to a default version?
- Which skin tones are being used?

---
### Proportion of potential skin tone emoji assigned skin tone

<iframe src="https://cc.oulu.fi/~scoats/chartProportion1.html" width="900" height="500" frameborder="0" align="middle"> </iframe>

---
### Median skin tone values

<iframe src="https://cc.oulu.fi/~scoats/chartMedianST1.html" width="900" height="500" frameborder="0" align="middle"> </iframe>

---
### Global distribution of skin colors

![:scale 80%](data:image/png;base64,#ancestralSkintone.png).small[[source](https://sruk.org.uk/skin-color-an-example-of-adaptation-to-the-environment/)]

---
### Emoji sentiment rankings (Kralj-Novak et al. 2015)

- L1 annotators categorized tweets containing emoji in 13 European languages as "negative", "neutral", or "positive"
- Aggregate statistics were used to assign sentiment values to individual emoji

"I love it, it's great!!! 😊" → positive

"This is the worst thing ever, terrible 😒" → negative

- This dictionary was applied to evaluate sentiment in my data

---
### Emoji sentiment rankings (Kralj-Novak et al. 2015)

<div class="midcenter">
<iframe src="http://kt.ijs.si/data/Emoji_sentiment_ranking/" style="max-width: 100%" sandbox="allow-same-origin allow-scripts" width="100%" height="550" scrolling="yes" frameborder="0" align="middle"> </iframe>
</div>

---
### Calculation of mean sentiment by country/territory for tweets with potential skin tone emoji

- ~25m tweets with potential skin tone emoji
- Tweets stripped of usernames, URLs, and hashtags, then tokenized
- Mean values per country/territory

![](data:image/png;base64,#tokens_emojidata.png)
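
---
### Example: skin tones and emoji sentiment from tweet text

A minimal sketch of how the tweet-level step could look in Python: emoji are picked out of the tweet text, Fitzpatrick skin tone modifiers (U+1F3FB–U+1F3FF) are identified, and a mean sentiment is computed from a dictionary of per-emoji scores. The `SENTIMENT` values and the example strings are illustrative placeholders, not the Kralj-Novak et al. (2015) scores or corpus data.

```python
from statistics import mean

# Stand-in sentiment scores for a few emoji; the Kralj-Novak et al. (2015)
# ranking assigns each emoji a mean sentiment in [-1, 1] (values here are invented)
SENTIMENT = {"\U0001F60A": 0.7, "\U0001F612": -0.4, "\U0001F478": 0.4}

# The five Fitzpatrick modifiers U+1F3FB..U+1F3FF and their type labels
FITZPATRICK = dict(zip((chr(c) for c in range(0x1F3FB, 0x1F400)),
                       ["1-2", "3", "4", "5", "6"]))

def analyze(tweet):
    """Return the skin tone modifiers and the mean emoji sentiment found in one tweet."""
    tones = [FITZPATRICK[ch] for ch in tweet if ch in FITZPATRICK]
    scores = [SENTIMENT[ch] for ch in tweet if ch in SENTIMENT]
    return tones, (mean(scores) if scores else None)

# Hypothetical examples: a princess emoji with and without a tone modifier
print(analyze("great day \U0001F478\U0001F3FD \U0001F60A"))  # (['4'], 0.55)
print(analyze("so annoying \U0001F612"))                     # ([], -0.4)
```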

---
### Correlation of tweet sentiment with skin tone emoji

.center[
![](data:image/png;base64,#sentColor50_top50nTweets.png)
]

---
exclude: True
### A "Distant Reading" interpretation of 24m tweets (https://cc.oulu.fi/~scoats/emojiBokeh.html)

.pull-left[
![](data:image/png;base64,#./emoji_tsne_detail.png)
]

.pull-right[
- Word2Vec (Mikolov et al. 2013) to analyze content
- 307,831,369 tokens (4,492,662 unique types)
- t-SNE (van der Maaten & Hinton 2008) to reduce 400-dimensional vectors of 2,704 unique emoji types to 2 dimensions
- Emoji that are closer together are used in similar contexts: similar meanings
- Emoji that are further apart are more distant semantically
]

---
exclude: True
### Hashtags and @

Hashtags are quasi-lexical elements whose discourse, grammatical, and pragmatic properties diverge from those of normal lexical items. On social media, they are used as indexing elements for the organization and contextualization of discourse (Wikström, 2014; Bruns & Burgess, 2015; Squires, 2015; Zappavigna, 2018)

- Social distributions of hashtags
- Geographical distributions of hashtags (Hiippala et al. 2021; Coats 2023)

---
exclude: True
### Speech/Audio: ASR Corpora from YouTube

- Renaissance in corpus-based study of English varieties <span class="small">(Nerbonne 2009; Szmrecsanyi 2011, 2013; Grieve et al. 2019)</span>
- Some corpora of transcribed spoken English have limited availability, are small in size, or lack sufficient geographical granularity to make inferences about regional distributions of features

.small[
Corpus                |Location(s)        |nr_words| Reference
----------------------|-------------------|--------|--------------------------
ICE-Aus               | Australia         |~600k   | Cassidy et al. 2012
Monash Corpus         | Melbourne         |~96k    | Bradshaw et al. 2010
Griffith Corpus       | Brisbane          |~32k    | Cassidy et al. 2012
Wellington Corpus     | NZ                |~1m     | Holmes et al. 1998
ONZE Corpus           | NZ                |?       | Gordon et al. 2007
]

- Automatic Speech Recognition (ASR) transcripts are available online for speech from specific locations
- Videos from local councils and other government entities can be harvested to create large corpora

---
### Speech/Audio: YouTube ASR Corpora <span class="small">(Coats 2023c)</span>

US, Canada, England, Scotland, Wales, Northern Ireland, the Republic of Ireland, Germany, Australia, and New Zealand

- [CoNASE](https://cc.oulu.fi/~scoats/CoNASE.html): 1.25b token corpus of 301,846 word-timed, part-of-speech-tagged Automatic Speech Recognition (ASR) transcripts
- [CoBISE](https://cc.oulu.fi/~scoats/CoBISE.html): 112m tokens, 452 locations, 38,680 ASR transcripts <span class="small">(Coats 2022b)</span>
- [CoGS](https://cc.oulu.fi/~scoats/CoGS.html): 50.5m tokens, 39.5k transcripts, 1,308 locations
- [CoANZSE](https://cc.oulu.fi/~scoats/CoANZSE.html): 190m tokens, 57k transcripts, 482 locations; also [coanzse.org](https://coanzse.org)

Freely available for research use; download from the Harvard Dataverse ([CoNASE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/X8QJJV), [CoBISE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UGIIWD), [CoGS](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3Y1YVB), [CoANZSE](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GW35AK))

---
### Example video

<iframe width="560" height="315" src="https://www.youtube.com/embed/cn8vWlUae7Y?rel=0&&showinfo=0&cc_load_policy=1&cc_lang_pref=en" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---
### WebVTT file

![](data:image/png;base64,#../ALOES_preconference_workshop/Workshop_presentation/Maranoa_webvtt_example.png)
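
---
### Example: word timings from a WebVTT transcript

The word-level timing in these corpora can be recovered from the inline timestamps in YouTube's auto-generated WebVTT captions. A minimal sketch, assuming the cue format shown above (per-word timestamps wrapped in `<c>` tags); the cue text is invented:

```python
import re

# One cue line in the style of a YouTube auto-caption (.vtt) file; the first
# word ("we") takes the cue's own start time, which a full parser would handle separately
cue = "we<00:00:01.280><c> welcome</c><00:00:01.680><c> everyone</c><00:00:02.240><c> here</c>"

def to_seconds(ts):
    h, m, s = ts.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

# Pair each word with the timestamp that precedes it
word_times = [(w.strip(), to_seconds(t))
              for t, w in re.findall(r"<(\d{2}:\d{2}:\d{2}\.\d{3})><c>([^<]*)</c>", cue)]
print(word_times)  # [('welcome', 1.28), ('everyone', 1.68), ('here', 2.24)]
```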

---
### Example analysis: Double modals

- Non-standard rare syntactic feature<span class="small"> (Montgomery & Nagle 1994; Coats 2024)</span>
- *I might could help you with this*
- Occurs only in the American Southeast and in Scotland/Northern England/Northern Ireland?
- Most studies based on non-naturalistic data with limited geographical scope <span class="small">(data from linguistic atlas interviews, surveys administered mostly in American Southeast and North of Britain)</span>
- More widely used in North America and the British Isles than previously thought (Coats 2024; Coats 2023b; Morin & Coats 2023)
- Can also be found in Australian and New Zealand speech

---
### Script: Generating a table for manual inspection of double modals

- Base modals *will, would, can, could, might, may, must, should, shall, used to, 'll, ought to, oughta*
- Script to generate regexes of two-tier combinations

```python
import re
import pandas as pd

# Assumes coanzse_df is the corpus data frame, with transcript tokens in the
# form word_POStag_starttime in its "text_pos" column, and modals is a list of
# two-tier (modal, modal) pairs built from the base modals listed above
hits = []
for x in modals:
    for i, y in coanzse_df.iterrows():
        pat1 = re.compile("(" + x[0] + "_\\w+_\\S+\\s+" + x[1] + "_\\w+_\\S+\\s)", re.IGNORECASE)
        finds = pat1.findall(y["text_pos"])
        if finds:
            for z in finds:
                # Recover the two word forms and the start time of the first modal
                seq = z.split()[0].split("_")[0].strip() + " " + z.split()[1].split("_")[0].strip()
                time = z.split()[0].split("_")[-1]
                # Link to the video three seconds before the candidate double modal
                hits.append((y["country"], y["channel_title"], seq,
                             "https://youtu.be/" + y["video_id"] + "?t=" + str(round(float(time) - 3))))
pd.DataFrame(hits)
```

- The script creates a URL for each search hit at a time 3 seconds before the targeted utterance
- In the resulting data frame, each utterance can be annotated after examining the targeted video sequence
- Filter out non-double-modals (clause overlap, speaker self-repairs, ASR errors)

---
### North America

![:scale 80%](data:image/png;base64,#./Coats_Figure1_Dec23.png)

---
### Britain and Ireland

![:scale 30%](data:image/png;base64,#UK_dm.png)

---
### Australia and New Zealand

![:scale 70%](data:image/png;base64,#../../CoANZSE_DM_paper/CoANZSE_dms_pmw_Oct23.png)

---
### Speech/Audio: Articulation rate

Prosodic features are increasingly being considered as bearing indexicality in the same manner as (e.g.) lexical or grammatical variables .small[(Ray & Zahn, 1999; Kendall, 2013)]

Attitudes about regional or urban-rural differences in speech temporality are common in the U.S. .small[(Preston, 1989, 1999; Roach, 1998)]

Faster speech can be associated with
- Competence, intelligence, and expertise .small[(Smith et al., 1975; Street & Brady, 1982; Thakerar & Giles, 1981)]
- Persuasiveness .small[(Apple et al., 1979; Giles & Powesland, 1975; Miller et al., 1976)]
- Attractiveness .small[(Street et al., 1983)]<br>
compared to slower speech

---
### Example video (slow talker)

<iframe width="600" height="400" controls src="https://cc.oulu.fi/~scoats/slowtalk.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" scrolling="no" sandbox allowfullscreen></iframe>

---
### Example video (fast talker)

<iframe width="600" height="400" controls src="https://cc.oulu.fi/~scoats/fasttalk.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" scrolling="no" sandbox allowfullscreen></iframe>

---
exclude: True
### Speaking rate and articulation rate

*Speaking rate*: Sum of units of speech (e.g. phones, syllables, or words) divided by total utterance time

*Articulation rate*: Sum of units of speech divided by total utterance time, **omitting pauses between segments of unbroken speech**

- Pause duration has been shown to vary .small[(Goldman-Eisler, 1961)], also according to demographic and regional parameters .small[(Clopper & Smiljanic, 2011, 2015)]
- In this study **articulation rate**, measured in σ/sec., is compared

---
### Factors that can affect articulation rate

- Type of speech: Reading, monologue, conversation
- Conversation: Interlocutor familiarity, topic under discussion .small[(Yuan, Liberman & Cieri, 2006)]
- Utterance-internal considerations .small[(Byrd & Saltzman, 1998; Yuan, Liberman & Cieri, 2006; Oller, 1973)]
- Anatomical, physiological, or neurological parameters .small[(Tsao & Weismer, 1997; Tsao, Weismer & Iqbal, 2006)]
- Demographic, social, or **regional** identity .small[(Byrd, 1992, 1994; Jacewicz et al., 2009, 2010; Kendall, 2014)]

---
### Articulation rate

<iframe style="text-align: center" src="https://cc.oulu.fi/~scoats/artic_rate_rdhum.html" width="800" height="500" frameBorder="0"> </iframe>

- Based on word timing information extracted from more than 300k videos
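
---
### Example: articulation rate from word timings

To illustrate how word timing information can be turned into a rate measure, the sketch below counts syllables with a rough orthographic heuristic and excludes inter-word gaps above a pause threshold from the denominator. It is a toy approximation with invented timings, not the exact procedure behind the figures above.

```python
import re

def count_syllables(word):
    """Rough orthographic heuristic: count groups of vowel letters."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def articulation_rate(words, pause_threshold=0.25):
    """Syllables per second, excluding inter-word pauses longer than the threshold.

    `words` is a list of (word, start_time, end_time) tuples in seconds.
    """
    syllables = sum(count_syllables(w) for w, _, _ in words)
    speaking_time = sum(end - start for _, start, end in words)
    # Add back short inter-word gaps (treated as articulation, not pause)
    for (_, _, prev_end), (_, next_start, _) in zip(words, words[1:]):
        gap = next_start - prev_end
        if 0 < gap <= pause_threshold:
            speaking_time += gap
    return syllables / speaking_time

# Invented word timings for a short utterance
words = [("we", 0.00, 0.15), ("welcome", 0.18, 0.55), ("everyone", 0.60, 1.10),
         ("to", 1.80, 1.90), ("the", 1.92, 2.00), ("meeting", 2.02, 2.45)]
print(round(articulation_rate(words), 2))  # σ/sec over speech time; the 1.10–1.80 pause is excluded
```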

---
exclude: true
### Pipeline for acoustic analysis

![:scale 50%](data:image/png;base64,#../../PapersConferences2024/ALOES_preconference_workshop/Workshop_presentation//Github_phonetics_pipeline_screenshot.png)

- A Jupyter notebook for Python that collects transcripts and audio from YouTube, aligns the transcripts, and extracts vowel formants
- Click your way through the process in a Google Colab environment
- Can be used for any language that has ASR transcripts
- With a few script modifications, also works for manual transcripts

https://github.com/stcoats/phonetics_pipeline

---
### Scripting pipelines for multimedia analysis

- Scripts in Python or R in a cloud-based notebook environment (CSC's Tykky or Google's Colab)
- Dependency conflict issues are minimal
- Can use immediately without extensive setup of servers, databases
- Script components are customizable
- Can be adapted to handle various types of content
- Can be adapted to handle large amounts of data

#### https://t.ly/3HhGJ A Colab pipeline for acoustic analysis of YouTube content
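
---
### Pipeline sketch: yt-dlp + Parselmouth

A minimal sketch of the audio leg of such a pipeline: yt-dlp fetches the audio track of one video and Praat-Parselmouth measures its pitch. It assumes that yt-dlp, ffmpeg, and praat-parselmouth are installed; the video ID is the council video embedded earlier in these slides, used here only as an example.

```python
import yt_dlp
import parselmouth
import numpy as np

VIDEO_ID = "cn8vWlUae7Y"   # the example council video shown earlier

# Download the best audio stream and convert it to WAV (requires ffmpeg)
opts = {
    "format": "bestaudio/best",
    "outtmpl": "%(id)s.%(ext)s",
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "wav"}],
}
with yt_dlp.YoutubeDL(opts) as ydl:
    ydl.download([f"https://www.youtube.com/watch?v={VIDEO_ID}"])

# Load the audio in Parselmouth and run Praat's default pitch analysis
snd = parselmouth.Sound(f"{VIDEO_ID}.wav")
pitch = snd.to_pitch()
f0 = pitch.selected_array["frequency"]
voiced = f0[f0 > 0]                      # unvoiced frames are reported as 0 Hz
print(f"duration: {snd.duration:.1f} s, median f0: {np.median(voiced):.1f} Hz")
```

The component slides that follow describe these two building blocks in more detail.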

---
### Component: yt-dlp

.pull-left[
![](data:image/png;base64,#../../PapersConferences2024/ALOES_preconference_workshop/Workshop_presentation/yt-dlp_screenshot.png)
]

.pull-right[
- Open-source fork of YouTube-DL
- Can be used to access any content streamed with DASH or HLS protocols
- Can be used to get video
]

---
### Component: Praat-Parselmouth <span class="small">(Jadoul et al. 2018)</span>

- Python interface to Praat, widely used software for acoustic analysis <span class="small">(Boersma & Weenink 2023)</span>
- Integration into Python simplifies workflows and analysis

![:scale 75%](data:image/png;base64,#../../PapersConferences2024/ALOES_preconference_workshop/Workshop_presentation/praat_screenshot.png)

---
### Video: 'heaps of' in Australian English

<iframe width="600" height="400" controls src="https://cc.oulu.fi/~scoats/heaps_of_CoANZSE_excerpt.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" scrolling="no" sandbox allowfullscreen></iframe>

---
exclude: true
### Component: Montreal Forced Aligner <span class="small">(McAuliffe et al. 2017)</span>

.pull-left30[
.small[
- Forced alignment is aligning a transcript to an audio track so that the exact start and end times of segments (words, phones) can be determined
- Necessary for automated analysis of vowel quality or other phonetic analysis
- MFA may perform better than some other aligners (P2FA, MAUS)
- MFA is fragile
]
]

.pull-right70[
![](data:image/png;base64,#../../PapersConferences2024/ALOES_preconference_workshop/Workshop_presentation/mfa_screenshot.png)
]

---
exclude: True
### Google Colab

- Google Colaboratory is an online service for running code in Python or R in a notebook environment
- You need a Google account to use Colab
- Advantages include access to GPU/TPU, collaborative editing, cloud-based execution, and integration with code on GitHub/Gitlab

![:scale 70%](data:image/png;base64,#../../PapersConferences2024/ALOES_preconference_workshop/Workshop_presentation/Colab.png)

---
### Summary and outlook

We have identified two diachronic conceptual axes
- Methodology: Hermeneutics → corpus approaches → AI
- Online discourse: Text → augmented text → multimedia
- We can use data analytical techniques to conduct "distant reading" analyses on discourse
- Meanings of emoji and global patterns of skin tone emoji use
- Use of a rare syntactic feature, double modals, in English varieties
- Analyses of articulation rate in speech
- Future: Automated video analyses
- Once the counting has ended: We can't yet forget about hermeneutics

---
### Resources

.pull-left[
![:scale 60%](data:image/png;base64,#./Statistics-for-Linguistics-with-R.jpg)
]

.pull-right[
![:scale 70%](data:image/png;base64,#./Nelimarkka.jpg)
]

---
### Thank you for your attention!

---
### References

.verysmall[
.hangingindent[
Apple, W., Streeter, L. A., & Krauss, R. M. 1979. Effects of pitch and speech rate on personal attributions. *Journal of Personality and Social Psychology* 37, 715–27.

Byrd, D., & Saltzman, E. 1998. Intragestural dynamics of multiple phrasal boundaries. *Journal of Phonetics* 26, 173–199.

Cioffi-Revilla, C. (2014). *Introduction to Computational Social Science*. Springer. https://doi.org/10.1007/978-1-4471-5661-1_2

Coats, S. (2019). [Language choice and gender in a Nordic social media corpus](https://doi.org/10.1017/S0332586519000039). *Nordic Journal of Linguistics* 42(1), 31–55.

Coats, S. (2023a).
CoANZSE: [The Corpus of Australian and New Zealand Spoken English: A new resource of naturalistic speech transcripts](https://doi.org/10.2478/plc-2022-13 ). In P. Parameswaran, J. Biggs & D. Powers (Eds.), *Proceedings of the the 20th Annual Workshop of the Australasian Language Technology Association*, 1–5. Australasian Language Technology Association. Coats, S. (2023b). [Double modals in contemporary British and Irish Speech](https://doi.org/10.1017/S1360674323000126). *English Language and Linguistics*. Coats, S. (2023c). [Dialect corpora from YouTube](https://doi.org/10.1515/9783111017433-005 ). In B. Busse, N. Dumrukcic & I. Kleiber (Eds.), *Language and linguistics in a complex world*, 79–102. Walter de Gruyter. Coats, S. (2024). [Naturalistic double modals in North America](https://doi.org/10.1215/00031283-9616142). *American Speech* 99(1), 47–77. Deutsche Gesellschaft für Soziologie. (1911). *Verhandlungen des Ersten Deutschen Soziologentages vom 19. bis 22. Oktober 1910 in Frankfurt a. M.. Reden und Vorträge von Georg Simmel, Ferdinand Tönnies, Max Weber, Werner Sombart, Alfred Ploetz, Ernst Troeltsch, Eberhard Gothein, Andreas Voigt, Hermann Kantorowicz und Debatten*. J.C.B. Mohr (Paul Siebeck). Giles, H., & Powesland, P. 1975. *Speech Style and Social Evaluation*. London/New York: Academic Press. Hübl, F., S. Cvetojevic, H. Hochmair, & G. Paulus (2017). Analyzing refugee migration patterns using geo-tagged tweets. *International Journal of Geo-Information*, 6 (10). https://doi.org/10.3390/ijgi6100302 Kendall, T. 2013. *Speech rate, pause, and sociolinguistic variation: Studies in corpus sociophonetics*. London: Palgrave-Macmillan. Kralj-Novak, P., Smailovic, J., Sluban, B., and Mozetic, I. (2015). [Sentiment of emojis](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144296). *PLoS ONE* 10(12). Krippendorff, K. (2004). *Content analysis*. Sage. Laitinen, M., Lundberg, J., Levin, M., & Martins, R.M. (2018). The Nordic Tweet Stream: A dynamic real-time monitor corpus of big and rich language data. In *Digital Humanities in the Nordic Countries 3rd Conference, Helsinki, Finland, March 7-9, 2018* (pp. 349-362). Lazer, David et al. (2009). Computational social science. *Science*, 323(5915), 721–723. https://doi.org/10.1126/science.1167742 Manovich, L. (2020). Cultural analytics. MIT Press. Markov, A. A. (2006 [1913]). An example of statistical investigation of the text Eugene Onegin concerning the connection of samples in chains. *Science in Context*, 19, 591–600. https://doi.org/10.1017/S0269889706001074 Miller, N., Maruyama, G., Beaber, R. J., & Valone, K. 1976. Speed of speech and persuasion. *Journal of Personality and Social Psychology* 34, 615–25. Mocanu, D., A. Baronchelli, N. Perra, B. Gonçalves, Q. Zhang, and A. Vespignani (2013). The Twitter of Babel: Mapping World Languages Through Microblogging Platforms. *PLoS ONE*, 8 (4). https://doi.org/10.1371/journal.pone.0061981 ]] --- ### References II .verysmall[ .hangingindent[ Morin, C., and S. Coats. (2023). [Double modals in Australian and New Zealand English](https://doi.org/10.1111/weng.12639). *World Englishes*. Murzintcev, N., & C. Cheng (2017). Disaster Hashtags in Social Media. *International Journal of Geo-Information*, 6 (7). https://doi.org/10.3390/ijgi6070204 Moretti, F. (2013). *Distant reading*. Verso. Moretti, F. (2007). *Graphs, maps, trees: Abstract models for a literary history*. Verso. Nelimarkka, M. (2024). 
*Computational thinking and social science: Combining programming, methodologies and fundamental concepts*. Sage. Oller, D. K. 1973. The effect of position in utterance on speech segment duration in English. *Journal of the Acoustical Society of America* 54, 1235–1247. Purschke, C. & Hovy, D. (2019). Lörres, Möppes, and the Swiss: (Re)Discovering regional patterns in anonymous social media data. *Journal of Linguistic Geography*, 7(2), 113–134. https://doi.org/10.1017/jlg.2019.10 Mikolov, T, W.-T. Yih, & G. Zweig. (2013). Linguistic regularities in continuous space word representations. In *Proceedings of HLT-NAACL 13* (pp. 746–751). Nguyen, D., Doğruöz, S., Rosé, C. P., & de Jong, F. (2016). Computational sociolinguistics: A survey. *Computational Linguistics*, 42(3), 537–593. https://doi.org/10.1162/COLI_a_00258 Rissanen, M. (2009). Corpus linguistics and historical linguistics. In A. Lüdeling and M. Kytö (Eds.), *Corpus Linguistics: An International Handbook*. Vol. 1 (pp. 53–68.). Berlin: Mouton de Gruyter. Roach, P. 1998. Myth 18: Some languages are spoken more quickly than others. In L. Bauer & P. Trudgill (eds.), *Language myths*. London/New York: Penguin, 150–158. Smith, B. L., Brown, B., Strong, W. J., & Rencher, A. C. 1975. Effects of speech rate on personality perception. *Language and Speech* 18(2), 145–52. Street, R. L., Jr., & Brady, R. M. 1982. Speech rate acceptance ranges as a function of evaluative domain, listener speech rate, and communication context. *Communication Monographs* 49(4), 290–308. Street, R. L., Jr., Brady, R. M., & Putman, W. B. 1983. The influence of speech rate stereotypes and rate similarity on listeners' evaluations of speakers. *Journal of Language and Social Psychology* 2(1), 37–56. Thakerar, J. N., & Giles, H. 1981. They are – so they speak: Noncontent speech stereotypes. *Language and Communication* 1, 251–256. Tsao, Y.-C., & Weismer, G. 1997. Interspeaker variation in habitual speaking rate: Evidence for a neuromuscular component. *Journal of Speech, Language, and Hearing Research* 40, 858–866. Tsao, Y.-C., Weismer, G., & Iqbal, K. 2006. Interspeaker variation in habitual speaking rate: Additional evidence. *Journal of Speech, Language, and Hearing Research* 49, 1156–1164. Tumasjan, A., T. Sprenger, P. Sandner, & I. Welpe (2010). Predicting Elections with Twitter: What 140 characters reveal about political sentiment. In *Proceedings of the International AAAI Conference on Web and Social Media* (pp. 178–185). Association for the Advancement of Artificial Intelligence. van der Maaten, L., & G. Hinton (2008). Visualizing High-Dimensional Data Using t-SNE. *Journal of Machine Learning Research*, 9, 2579–2605. Wikström, Peter. (2014). #srynotfunny: Communicative functions of hashtags on Twitter. *SKY Journal of Linguistics*, 27, 127–152. Yuan, J., Cieri, C., & Liberman, M. 2006. Towards an integrated understanding of speaking rate in conversation. *Proceedings of Interspeech 2006, Pittsburgh, PA*, 541–544. ]]