Compiling Corpora from Social Media: Combined Audio and Chat Transcripts for Recorded Video Streams

class: inverse, center, middle
background-image: url(data:image/png;base64,#https://cc.oulu.fi/~scoats/oululogoRedTransparent.png);
background-repeat: no-repeat;
background-size: 80px 57px;
background-position:right top;
exclude: true

---

.pull-left[![:scale 25%](data:image/png;base64,#./pewfull.png)![:scale 40%](data:image/png;base64,#https://cdn.betterttv.net/emote/5f9db60640eb9502e22372d2/3x.webp)![:scale 25%](data:image/png;base64,#https://cdn.betterttv.net/emote/61fe1a5206fd6a9f5be370df/3x.webp)]

.pull-right[
<span style="font-family:Rubik;font-size:24pt;font-weight: 700;font-style: normal;float:right;text-align: right;color:white;-webkit-text-fill-color: black;-webkit-text-stroke: 0.8px;">Compiling Corpora from Social Media: Combined Audio and Chat Transcripts for Recorded Video Streams</span>
]
<p style="float:right;text-align: right;color:white;font-weight: 700;font-style: normal;-webkit-text-fill-color: black;-webkit-text-stroke: 0.5px;">

Steven Coats<br>
English, University of Oulu, Finland<br>
<a href="mailto:steven.coats@oulu.fi">steven.coats@oulu.fi</a><br>
University of Bonn<br> 
May 22nd, 2026<br>
</p>

---

<div class="my-footer"><span>Steven Coats&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;Framework for Stream Analysis | Uni Bonn Workshop</span></div>

---

### Outline

1. Background
   - Computer-Mediated Communication (CMC), video streaming, and recorded streams ("Video on Demand", VoD)
   - Corpus approaches to multimodal content: Automatic Speech Recognition (ASR) transcript + chat content (+ video content)

2. VoD Toolkit: Pipeline components

3. Use cases
   - Chat density
   - Lexical alignment
   - Sentiment analysis
   - Video summarization, phonetic analysis

4. Workshop in Colab/Jupyter

---

### Background

- Increasing popularity of streaming
  - Twitch (mostly gaming), YouTube Live, Instagram Live, Facebook Live, X Livestream, Kick, and others
  - Increasing importance as an economic activity <span class="small">(Zhou et al. 2019; Johnson & Woodcock 2019; Yu et al. 2018)</span>

- Recorded streams contain multiple levels of communication <span class="small">(Sjöblom et al. 2019; Recktenwald 2017)</span>
  - Speech of the streamer (and potentially of others)
  - Text and graphical image content (emoji, emotes) of chat participants ("crowdspeak")
  - Text and graphical content of system messages (e.g. bots showing tips to streamer)
  - Secondary visual content (and text and speech) of video output (embedded windows showing gameplay)

- Most corpus-based analyses have focused on live chat content <span class="small">(Olejniczak 2015; Kim et al. 2022)</span> 
- Few studies consider multiple levels

---

### VoD Toolkit (https://shorturl.at/TF3Kn)

A script pipeline to generate a structured, time-aligned transcript that combines the stream speech transcript with chat contributions

- Jupyter Notebook / Google Colab using Python
- Generates output that can be analyzed with corpus methods
- Can also be used to capture video for multimodal analysis
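
The idea of a time-aligned transcript can be sketched in a few lines. This is a minimal illustration, not the toolkit's actual code: the `Event` record, the `merge_timeline` helper, and the sample data are all hypothetical, and it assumes that both speech segments and chat messages carry a start time in seconds from the beginning of the recording.

```python
from dataclasses import dataclass

@dataclass
class Event:
    start: float   # seconds from the start of the recording
    source: str    # "speech" (ASR segment) or "chat" (chat message)
    speaker: str
    text: str

def merge_timeline(speech, chat):
    """Interleave speech segments and chat messages by start time."""
    return sorted(speech + chat, key=lambda e: e.start)

# Invented sample data, for illustration only
speech = [
    Event(0.0, "speech", "streamer", "welcome back everyone"),
    Event(6.5, "speech", "streamer", "this boss fight is cursed"),
]
chat = [
    Event(2.1, "chat", "viewer_a", "hi!"),
    Event(7.9, "chat", "viewer_b", "CURSED"),
]

timeline = merge_timeline(speech, chat)
for e in timeline:
    print(f"{e.start:7.1f}  [{e.source}] {e.speaker}: {e.text}")
```

Keeping a `source` field on every row makes it easy to filter the merged transcript back into its speech and chat layers for corpus queries.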

---

### Pipeline components

![:scale 55%](data:image/png;base64,#./VoD_v2_screenshot.png)
- [yt-dlp](https://github.com/yt-dlp/yt-dlp)
- [TwitchDownloaderCLI](https://github.com/lay295/TwitchDownloader) 
- [faster-whisper](https://github.com/SYSTRAN/faster-whisper)

Outputs are files in HTML or other formats

---

### Component: [yt-dlp](https://github.com/yt-dlp/yt-dlp)

![](data:image/png;base64,#./yt-dlp_screenshot.png)

- Fork of youtube-dl

- Can access most content streamed with the DASH or HLS protocols

- Can also be used to download video

---

### Component: [TwitchDownloaderCLI](https://github.com/lay295/TwitchDownloader)

![](data:image/png;base64,#./TwitchDownloaderCLI_screenshot.png)

- Command-line interface for retrieving Twitch videos and chats

---

### Component: [faster-whisper](https://github.com/SYSTRAN/faster-whisper)

![](data:image/png;base64,#./faster-whisper.png)

Library that reimplements OpenAI's Whisper for automatic speech recognition (ASR)

- Word-level timestamps
- Faster than the original Whisper implementation, especially with a GPU

---

### Data collection workflow

![](data:image/png;base64,#./diagram4.png)

---

### Example: YouTube stream

---

### Example: YouTube output

---

### Example: Twitch stream

---

### Example: Twitch output

---

### Use cases: Chat density

![:scale 75%](data:image/png;base64,#./chat_density.png)
- Chat density can be compared and correlated with streamer utterances
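
One simple way to operationalize chat density is to count messages per fixed time bin. The sketch below assumes chat timestamps are available as seconds from the start of the stream; the function name `chat_density`, the 60-second bin width, and the sample timestamps are illustrative choices, not part of the toolkit.

```python
from collections import Counter

def chat_density(timestamps, bin_seconds=60):
    """Count chat messages per time bin; returns {bin_start_seconds: count}."""
    counts = Counter(int(t // bin_seconds) * bin_seconds for t in timestamps)
    return dict(sorted(counts.items()))

# Invented message times (seconds from stream start), for illustration only
timestamps = [3.2, 15.0, 59.9, 61.0, 125.4, 130.1, 131.7]
density = chat_density(timestamps)
print(density)
```

The same binning can be applied to the start times of ASR segments, so that the two series share a time axis and can be correlated.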

---

### Use cases: Lexical alignment

#### What is it?

Lexical alignment (also called *entrainment* or *accommodation*) happens when people in a conversation begin using the same words, phrases, or expressions.

Examples:

- Repeating a nickname, emoji, emote
- Adopting someone else's phrasing
- Converging on shared vocabulary
- Reusing terms introduced earlier in the interaction


> Streamer:
>
> “This boss fight is cursed.”

<br>

> Chat:
>
> “CURSED”
>
> “this run is cursed”
>
> “cursed stream”

<br>

Later:

> Streamer:
>
> “okay this stream is officially cursed”


Lexical alignment is a common phenomenon in human conversation and is linked to coordination and successful communication <span class="small">(Srivastava et al. 2025)</span>

---

### Why is lexical alignment interesting?

Researchers argue that alignment helps people:

- Communicate more efficiently
- Build shared understanding
- Coordinate socially
- Show engagement or affiliation
- Create group identity

Theories like the *Interactive Alignment Model* suggest that speakers gradually build shared “mental models” through repeated linguistic choices <span class="small">(Pickering & Garrod 2004)</span>

---

### In livestreams...

Alignment can reveal:
- Audience engagement
- Meme formation
- Streamer ↔ audience influence
- Moments of hype or emotional intensity
- How online communities construct shared language


We will compare:

- the ASR transcript of the streamer
- the live chat messages

to see how much their language overlaps and changes over time.
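
As a rough sketch of what "overlap over time" can mean, the example below groups both layers into fixed time windows and lists the word types they share in each window. It assumes each layer is a list of `(start_seconds, text)` pairs; the helper `tokens_by_window`, the 300-second window, and the whitespace tokenization are simplifying assumptions for illustration.

```python
from collections import defaultdict

def tokens_by_window(events, window_seconds=300):
    """Group lowercased word types by time window index."""
    windows = defaultdict(set)
    for start, text in events:
        windows[int(start // window_seconds)].update(text.lower().split())
    return windows

# Invented (start_seconds, text) pairs, for illustration only
speech = [(10, "this boss fight is cursed"),
          (320, "okay this stream is officially cursed")]
chat = [(15, "CURSED"), (40, "this run is cursed"), (330, "cursed stream")]

speech_w = tokens_by_window(speech)
chat_w = tokens_by_window(chat)

# Word types used by both streamer and chat, per window
shared = {w: sorted(speech_w[w] & chat_w[w])
          for w in sorted(speech_w.keys() & chat_w.keys())}
print(shared)
```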


#### Possible questions

- Does chat adopt the streamer’s vocabulary?
- Does the streamer echo chat phrases?
- Which words become shared “community language”?
- Do alignment levels increase during exciting moments?
- Which streams show the strongest audience engagement?

---

### Metric 1: Jaccard Similarity (The "Dictionary" Match)

What it measures: The overlap of unique words (types).

`$$\text{Jaccard Similarity}(A, B)=\frac{|A \cap B|}{|A \cup B|}$$`

- "Did the streamer and the chat use the same words?"
- Ignores how often a word is said

Example: If the streamer says the word "based" exactly once and the chat spams "based" 5,000 times, Jaccard treats the word as shared in exactly the same way as if both had used it once. They both have "based" in their dictionary.

Why we use it: It tells us if the streamer and the chat belong to the same "Community of Practice." Do they share the same slang, emotes, and inside jokes?
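
The formula can be tried out directly with Python sets. This is a minimal sketch with invented token lists; the function name `jaccard` and the decision to return 0.0 for two empty inputs are choices made here for illustration.

```python
def jaccard(a_tokens, b_tokens):
    """Jaccard similarity between the sets of word types in two token lists."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0  # convention chosen here for two empty inputs
    return len(a & b) / len(a | b)

# Invented examples, for illustration only
streamer = "that take is based honestly".split()
chat_spam = ["based"] * 5000 + ["lol", "honestly"]
chat_once = ["based", "lol", "honestly"]

score = jaccard(streamer, chat_spam)
print(score)                                   # shared types / all types
print(score == jaccard(streamer, chat_once))   # repetition does not matter
```

Here two of the six word types are shared, and repeating "based" 5,000 times leaves the score unchanged.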

---

### Metric 2: Cosine Similarity (The "Intensity" Match)

The Formula:

`$$\text{Cosine Similarity}(A, B) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\sqrt{\sum_{i=1}^{n} B_i^2}}$$`

* **Vectors A and B:** The word counts for the streamer and the chat, with one dimension per word.
* **The top (dot product):** We multiply the streamer's count for a word by the chat's count for that same word, and add up the products over all words.
    * *(This rewards words they both used heavily.)*
* **The bottom (magnitudes):** We calculate the length (Euclidean norm) of each count vector, i.e. the square root of the sum of its squared counts, and multiply the two lengths.
    * *(This normalizes for volume: the score depends on the relative proportions of words rather than on how much each side says, and for word counts it always lies between 0 and 1.)*
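
A small sketch of the same computation on word counts, using only the standard library. The function name `cosine` and the token lists are invented for illustration; returning 0.0 when one side is empty is a convention chosen here.

```python
import math
from collections import Counter

def cosine(a_tokens, b_tokens):
    """Cosine similarity between the word-count vectors of two token lists."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here when one side has no tokens
    return dot / (norm_a * norm_b)

# Invented examples, for illustration only
streamer = ["cursed", "cursed", "boss"]
chat_same_mix = ["cursed"] * 4 + ["boss"] * 2   # same proportions, twice the volume
chat_other_mix = ["lol", "lol", "cursed"]

print(cosine(streamer, chat_same_mix))
print(cosine(streamer, chat_other_mix))
```

The first pair scores (up to floating-point rounding) 1.0 because the proportions match even though the chat says twice as much; the second pair scores 0.4 because only "cursed" is shared.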

---

### Use cases: Sentiment

![:scale 75%](data:image/png;base64,#./sentiment_score.png)
- How does sentiment evolve over the course of a stream?

---

### Potential use case: Automated analysis of video streams

- Retrieve video with yt-dlp
- Add cells to VoD Toolkit to load multimodal models such as [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) <span class="small">(Wang et al. 2024)</span> or [LLaVA-NeXT-Video-7B-h9](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-h) <span class="small">(Zhang et al. 2024)</span>, or other libraries

- Automatically generate text describing what is going on in different parts of the video
  - Who is chatting about what in different parts of the video?
  - Is chat about (for example) video content, other chat, or speech content?

---

### Potential use cases: Acoustic analysis

![:scale 50%](data:image/png;base64,#./Github_phonetics_pipeline_screenshot.png)

- Acoustic features of particular streamers or streams with different topics/from different locations etc. can be analyzed <span class="small">(cf. Coats 2025, 2023; Méli et al. 2023)</span>

https://colab.research.google.com/github/stcoats/phonetics_pipeline/blob/main/phonetics_pipeline_v3.ipynb

---

### Google Colab

- Google Colaboratory (Colab) is a hosted service for running Python or R code in a notebook environment
- You need a Google account to use Colab
- Advantages include access to GPU/TPU, collaborative editing, cloud-based execution, and integration with code on GitHub/GitLab

![:scale 70%](../../ALOES_preconference_workshop/Workshop_presentation/Colab.png)

---

### Workshop

1. Package installs
2. Collect some YouTube content
3. Collect some Twitch content
4. Preliminary analyses of YouTube content
5. Small group work

#### Let's dig in!

- VoD Toolkit: https://shorturl.at/TF3Kn
- Editable Google Slides for group work: https://docs.google.com/presentation/d/16OUWTnGAtE2XG_BGQOXXFcngCZtPly7L0SlT0QA_T_I

---

### References

Coats, S. (2025). [An automatic pipeline for processing streamed content: New horizons for corpus linguistics and phonetics](https://doi.org/10.1515/9783111434018-011). In L. Cotgrove, L. Herzberg, & H. Lüngen (eds.), *Exploring digitally-mediated communication with corpora: Methods, analyses, and corpus construction*, 257–274. Berlin: De Gruyter Brill.

Coats, S. (2023). [A pipeline for the large-scale acoustic analysis of streamed content](https://doi.org/10.14618/1z5k-pb25). In L. Cotgrove, L. Herzberg, H. Lüngen, & I. Pisetta (eds.), *Proceedings of the 10th International Conference on CMC and Social Media Corpora for the Humanities (CMC-Corpora 2023)*, 51–54. Mannheim: Leibniz-Institut für Deutsche Sprache.

Herring, S. (1999). [Interactional coherence in CMC](https://doi.org/10.1111/j.1083-6101.1999.tb00106.x). *Journal of Computer-Mediated Communication*, 4(4).

Johnson, M. R., & Woodcock, J. (2019). [The impacts of live streaming and Twitch.tv on the video game industry](https://doi.org/10.1177/0163443718818363). *Media, Culture & Society*, 41(5), 670–688.

Kim, J., Wohn, D. Y., & Cha, M. (2022). [Understanding and identifying the use of emotes in toxic chat on Twitch](https://doi.org/10.1016/j.osnem.2021.100180). *Online Social Networks and Media* 27.

Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S., & Ling, H. (2022). [Expanding language-image pretrained models for general video recognition](https://doi.org/10.48550/arXiv.2208.02816). *arXiv*, cs.CV, 2208.02816.

Olejniczak, J. (2015). A linguistic study of language variety used on twitch.tv: Descriptive and corpus-based approaches. In *Proceedings of RCIC’15: Redefining Community in Intercultural Context, Brasov, 21–23 May 2015* (pp. 329–334).

Pickering, M. J., & Garrod, S. (2004). Toward a mechanistic psychology of dialogue. *Behavioral and Brain Sciences, 27*(2), 169–190. https://doi.org/10.1017/S0140525X04000056

Recktenwald, D. (2017). Toward a transcription and analysis of live streaming on Twitch. *Journal of Pragmatics* 115, 68–81.

Robert, A. J. (2025). Modelling the interaction space of Twitch: A multimodal framework for corpus structuring and analysis. In A. Fabián & I. Trost (eds.), [*Impulses and Approaches to Computer-Mediated Communication: Proceedings of the 12th International Conference on Computer Mediated Communication and Social Media Corpora for the Humanities*](https://doi.org/10.15495/EPub_UBT_00008705), 94–98. Bayreuth: University of Bayreuth.

Sjöblom, M., Törhönen, M., Hamari, J., & Macey, J. (2019). The ingredients of Twitch streaming: Affordances of game streams. *Computers in Human Behavior*, 92, 20–28.

Srivastava, S., Wentzel, S. D., Catala, A., & Theune, M. (2025). Measuring and implementing lexical alignment: A systematic literature review. *Computer Speech & Language, 90*, 101731. https://doi.org/10.1016/j.csl.2024.101731

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., & Lin, J. (2024). Qwen2-VL: Enhancing vision-language models’ perception of the world at any resolution. *arXiv*. https://arxiv.org/abs/2409.12191

Yu, E., Jung, C., Kim, H., & Jung, J. (2018). Impact of viewer engagement on gift-giving in live video streaming. *Telematics and Informatics*, 35(5), 1450–1460.

Zhang, Y., Li, B., Liu, H., Lee, Y. G., Gui, L., Fu, D., Feng, J., Liu, Z., & Li, C. (2024). [LLaVA-NeXT: A Strong Zero-shot Video Understanding Model](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/).

Zhou, J., Zhou, J., Ding, Y., & Wang, H. (2019). The magic of danmaku: A social interaction perspective of gift sending on live streaming platforms. *Electronic Commerce Research and Applications* 34, 100815.
