class: inverse, center, middle
background-image: url(https://cc.oulu.fi/~scoats/oululogoRedTransparent.png)
background-repeat: no-repeat
background-size: 80px 57px
background-position: right top
exclude: true

---
class: title-slide

<br><br><br><br><br>
.pull-left[]
.pull-right[
<span style="font-family:Rubik;font-size:24pt;font-weight: 700;font-style: normal;float:right;text-align: right;color:white;-webkit-text-fill-color: black;-webkit-text-stroke: 0.8px;">Combined Audio and Chat Transcripts for Recorded Video Streams</span>
]

<p style="float:right;text-align: right;color:white;font-weight: 700;font-style: normal;-webkit-text-fill-color: black;-webkit-text-stroke: 0.5px;">
Steven Coats<br>
English, University of Oulu, Finland<br>
<a href="mailto:steven.coats@oulu.fi">steven.coats@oulu.fi</a><br>
Love Data Week, Université Toulouse – Jean Jaurès<br>
February 12th, 2026<br>
</p>

---
layout: true

<div class="my-header"><img border="0" alt="University of Oulu logo" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats                            Framework for Stream Analysis | Formation données langagières</span></div>

---

### Outline

1. Background
  - Video streaming as an increasingly popular CMC modality
  - Study of multimodal content: ASR transcript + chat stream (+ video)
2. VoD Toolkit: Pipeline components
3. Use cases
  - Chat density
  - Sentiment
4. 
Workshop in Colab/Jupyter

.footnote[Slides for the presentation are on my homepage at https://cc.oulu.fi/~scoats]

---

### Background

- Increasing popularity of streaming
  - Twitch (mostly gaming), YouTube Live, Instagram Live, Facebook Live, X Livestream, Kick, and others
- Increasing importance as an economic activity <span class="small">(Zhou et al. 2019; Johnson & Woodcock 2019; Yu et al. 2018)</span>
- Recorded streams contain multiple levels of communication <span class="small">(Sjöblom et al. 2019; Recktenwald 2017)</span>
  - Speech of the streamer (and potentially of others)
  - Text and graphical image content (emoji, emotes) of chat participants
  - Text and graphical content of system messages (e.g. bots showing tips to the streamer)
  - Secondary visual content (and text and speech) of the video output (e.g. embedded windows showing gameplay)
- Most corpus-based analyses have focused on live chat content <span class="small">(Olejniczak 2015; Kim et al.
2022)</span>
- Few studies consider multiple levels

---

### VoD Toolkit (https://shorturl.at/TF3Kn)

A script pipeline to generate a structured, time-aligned transcript that combines the stream speech transcript with chat contributions

- Jupyter Notebook / Google Colab
- Generates output that can be analyzed with corpus methods
- Can also be used to capture video for multimodal analysis

---

### Pipeline components

- [yt-dlp](https://github.com/yt-dlp/yt-dlp)
- [TwitchDownloaderCLI](https://github.com/lay295/TwitchDownloader)
- [faster-whisper](https://github.com/SYSTRAN/faster-whisper)

The toolkit's output is an HTML file

---

### Component: [yt-dlp](https://github.com/yt-dlp/yt-dlp)

.pull-left[

]
.pull-right[
- Fork of youtube-dl
- Can access any content streamed with the DASH or HLS protocols
- Can also retrieve the video itself
]

---

### Component: [TwitchDownloaderCLI](https://github.com/lay295/TwitchDownloader)

.pull-left[

]
.pull-right[
- Command-line interface for retrieving Twitch videos and chats
]

---

### Component: [faster-whisper](https://github.com/SYSTRAN/faster-whisper)

.pull-left[

]
.pull-right[
Library based on OpenAI's Whisper providing automatic speech recognition (ASR)

- Word-level timestamps
- Faster than Whisper, especially with a GPU
]

---

### Workflow



---

### Example: YouTube stream

<iframe width="800" height="450" src="https://a3s.fi/swift/v1/Toulouse_Workshop/PewDiePie_videoClip.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" scrolling="no" sandbox allowfullscreen></iframe>

---

### Example: YouTube output

<div class="container">
<iframe src="https://a3s.fi/swift/v1/Toulouse_Workshop/PewDiePie26_mini.html" style="width: 100%; height: 450px; max-width: 100%;" sandbox="allow-same-origin allow-scripts" scrolling="yes" seamless="seamless" frameborder="0" align="middle"></iframe>
</div>

---

### Example: Twitch stream

<iframe width="800" height="450"
src="https://a3s.fi/swift/v1/Toulouse_Workshop/Anyme.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" scrolling="no" sandbox allowfullscreen></iframe>

---

### Example: Twitch output

<div class="container">
<iframe src="https://a3s.fi/swift/v1/Toulouse_Workshop/Anyme.html" style="width: 100%; height: 450px; max-width: 100%;" sandbox="allow-same-origin allow-scripts" scrolling="yes" seamless="seamless" frameborder="0" align="middle"></iframe>
</div>

---

### Use cases: Chat density



- Chat density can be compared and correlated with streamer utterances

---

### Use cases: Sentiment



- How does sentiment evolve over the course of a stream?

---

### Potential use case: Automated analysis of video streams

- Retrieve video with yt-dlp
- Add cells to the VoD Toolkit to import (e.g.) [X-CLIP](https://huggingface.co/microsoft/xclip-base-patch32) <span class="small">(Ni et al. 2022)</span>, [LLaVA-NeXT-Video-7B-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf) <span class="small">(Zhang et al. 2024)</span>, or other libraries
- Automatically generate text describing what is going on in different parts of the video
  - Who is chatting about what parts of the video?
  - Is chat about (for example) video content, other chat, or speech content?

---
exclude: true

### Potential use cases: Acoustic analysis



- Acoustic features of particular streamers, or of streams with different topics/from different locations etc., can be analyzed <span class="small">(cf. Coats 2025, 2023; Méli et al.
2023)</span>

https://colab.research.google.com/github/stcoats/phonetics_pipeline/blob/main/phonetics_pipeline_v3.ipynb

---
exclude: true

### Google Colab

- Google Colaboratory is a hosted online service for running Python or R code in a notebook environment
- You need a Google account to use Colab
- Advantages include access to GPU/TPU, collaborative editing, cloud-based execution, and integration with code on GitHub/GitLab



---

### CMC-Corpora Conference

https://cmc2026.org

Submissions portal open! Deadline: 15 April

---

### Workshop

1. Package installs
2. Collect some YouTube content
3. Collect some Twitch content
4. Preliminary analyses of YouTube content

#### Allons-y!

VoD Toolkit (https://shorturl.at/TF3Kn)

---

### References

.verysmall[
.hangingindent[
Coats, S. (2025). [An automatic pipeline for processing streamed content: New horizons for corpus linguistics and phonetics](https://doi.org/10.1515/9783111434018-011). In L. Cotgrove, L. Herzberg, & H. Lüngen (eds.), *Exploring digitally-mediated communication with corpora: Methods, analyses, and corpus construction*, 257–274. Berlin: De Gruyter Brill.

Coats, S. (2023). [A pipeline for the large-scale acoustic analysis of streamed content](https://doi.org/10.14618/1z5k-pb25). In L. Cotgrove, L. Herzberg, H. Lüngen, & I. Pisetta (eds.), *Proceedings of the 10th International Conference on CMC and Social Media Corpora for the Humanities (CMC-Corpora 2023)*, 51–54. Mannheim: Leibniz-Institut für Deutsche Sprache.

Herring, S. (1999). [Interactional coherence in CMC](https://doi.org/10.1111/j.1083-6101.1999.tb00106.x). *Journal of Computer-Mediated Communication*, 4(4).

Johnson, M. R., & Woodcock, J. (2019). [The impacts of live streaming and Twitch.tv on the video game industry](https://doi.org/10.1177/0163443718818363). *Media, Culture & Society*, 41(5), 670–688.

Kim, J., Wohn, D. Y., & Cha, M. (2022). [Understanding and identifying the use of emotes in toxic chat on Twitch](https://doi.org/10.1016/j.osnem.2021.100180).
*Online Social Networks and Media*, 27.

Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S., & Ling, H. (2022). [Expanding language-image pretrained models for general video recognition](https://doi.org/10.48550/arXiv.2208.02816). *arXiv*, cs.CV, 2208.02816.

Olejniczak, J. (2015). A linguistic study of language variety used on twitch.tv: Descriptive and corpus-based approaches. In *Proceedings of RCIC’15: Redefining Community in Intercultural Context, Brasov, 21–23 May 2015*, 329–334.

Recktenwald, D. (2017). Toward a transcription and analysis of live streaming on Twitch. *Journal of Pragmatics*, 115, 68–81.

Robert, A. J. (2025). Modelling the interaction space of Twitch: A multimodal framework for corpus structuring and analysis. In A. Fabián & I. Trost (eds.), [*Impulses and Approaches to Computer-Mediated Communication: Proceedings of the 12th International Conference on Computer Mediated Communication and Social Media Corpora for the Humanities*](https://doi.org/10.15495/EPub_UBT_00008705), 94–98. Bayreuth: University of Bayreuth.

Sjöblom, M., Törhönen, M., Hamari, J., & Macey, J. (2019). The ingredients of Twitch streaming: Affordances of game streams. *Computers in Human Behavior*, 92, 20–28.

Yu, E., Jung, C., Kim, H., & Jung, J. (2018). Impact of viewer engagement on gift-giving in live video streaming. *Telematics and Informatics*, 35(5), 1450–1460.

Zhang, Y., Li, B., Liu, H., Lee, Y. G., Gui, L., Fu, D., Feng, J., Liu, Z., & Li, C. (2024). [LLaVA-NeXT: A strong zero-shot video understanding model](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/).

Zhou, J., Zhou, J., Ding, Y., & Wang, H. (2019). The magic of danmaku: A social interaction perspective of gift sending on live streaming platforms. *Electronic Commerce Research and Applications*, 34, 100815.
]]
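---

### Appendix: Chat density sketch

The chat-density use case can be sketched in a few lines of pure Python: bin chat timestamps into fixed-width windows and count messages per window. This is an illustrative sketch, not VoD Toolkit code; the function name and toy timestamps are invented for the example.

```python
from collections import Counter

def chat_density(timestamps, window=60.0):
    """Count chat messages per fixed-width time window (in seconds)."""
    counts = Counter(int(t // window) for t in timestamps)
    n_windows = int(max(timestamps) // window) + 1 if timestamps else 0
    # Return a dense list, so windows with no chat show up as zeros
    return [counts.get(i, 0) for i in range(n_windows)]

# Toy timestamps: seconds from stream start
msgs = [3.2, 15.1, 58.9, 61.0, 62.5, 190.0]
print(chat_density(msgs))  # → [3, 2, 0, 1]
```

The resulting per-window counts can then be correlated with, for example, the streamer's utterance rate in the same windows.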
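---

### Appendix: Time-alignment sketch

Conceptually, the combined transcript is an interleaving of ASR segments and chat messages by timestamp. A minimal sketch in pure Python, assuming invented input tuples; the toolkit's actual data structures and HTML output differ.

```python
def merge_transcript(asr_segments, chat_messages):
    """Interleave ASR segments and chat messages in temporal order.

    asr_segments:  list of (start_seconds, text) from the ASR step
    chat_messages: list of (seconds, username, text) from the chat download
    """
    events = [(t, "SPEECH", "STREAMER", text) for t, text in asr_segments]
    events += [(t, "CHAT", user, text) for t, user, text in chat_messages]
    lines = []
    for t, kind, who, text in sorted(events):
        stamp = f"{int(t) // 60:02d}:{int(t) % 60:02d}"  # mm:ss
        lines.append(f"[{stamp}] {kind} {who}: {text}")
    return lines

asr = [(0.5, "welcome back everyone"), (12.0, "let's check the chat")]
chat = [(5.2, "viewer1", "hi!"), (13.4, "viewer2", "PogChamp")]
for line in merge_transcript(asr, chat):
    print(line)  # e.g. "[00:05] CHAT viewer1: hi!"
```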
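---

### Appendix: Sentiment-over-time sketch

Tracking how sentiment evolves over a stream reduces to scoring each chat message and averaging the scores per time window. A toy sketch with an invented five-word lexicon; a real analysis would substitute a proper sentiment resource (e.g. VADER).

```python
# Toy lexicon, invented for this example only
LEXICON = {"love": 1.0, "great": 0.8, "pog": 0.6, "bad": -0.8, "hate": -1.0}

def message_score(text):
    """Mean lexicon score of the words in one message (0.0 if none match)."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

def sentiment_by_window(chat, window=60.0):
    """Average chat sentiment per time window; None for empty windows.

    chat: list of (seconds, text) pairs
    """
    if not chat:
        return []
    n = int(max(t for t, _ in chat) // window) + 1
    sums, counts = [0.0] * n, [0] * n
    for t, text in chat:
        i = int(t // window)
        sums[i] += message_score(text)
        counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```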