class: inverse, center, middle background-image: url(data:image/png;base64,#https://cc.oulu.fi/~scoats/oululogoRedTransparent.png); background-repeat: no-repeat; background-size: 80px 57px; background-position:right top; exclude: true --- class: title-slide <br><br><br><br><br> .pull-left[![:scale 25%](data:image/png;base64,#./pewfull.png)![:scale 25%](data:image/png;base64,#./pewfull.png)![:scale 25%](data:image/png;base64,#./pewfull.png)] .pull-right[ <span style="font-family:Rubik;font-size:24pt;font-weight: 700;font-style: normal;float:right;text-align: right;color:white;-webkit-text-fill-color: black;-webkit-text-stroke: 0.8px;">Video on Demand Toolkit: A Framework for Analysis of Speech and Chat Content in YouTube and Twitch Streams</span> ] <p style="float:right;text-align: right;color:white;font-weight: 700;font-style: normal;-webkit-text-fill-color: black;-webkit-text-stroke: 0.5px;"> Steven Coats<br> English, University of Oulu, Finland<br> <a href="mailto:steven.coats@oulu.fi">steven.coats@oulu.fi</a><br> CMC-Corpora 11, Université Côte d'Azur<br> September 5th, 2024<br> </p> --- layout: true <div class="my-header"><img border="0" alt="University of Oulu logo" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div> <div class="my-footer"><span>Steven Coats                            Framework for Stream Analysis | CMC-Corpora 11</span></div> --- ### Outline 1. Background - Video streams as an increasingly popular CMC modality - Corpus-based study of video streams 2. VoD Toolkit: Pipeline components 3. Use cases - Chat density - Automated analysis of video streams 4. 
Outlook and summary .footnote[Slides for the presentation are on my homepage at https://cc.oulu.fi/~scoats] --- ### Background - In the past 15 years, video-based CMC modalities have become popular - Many streaming sites have large numbers of users - Twitch (mostly gaming), YouTube Live (mostly music, live vlogging, tutorials, etc.), Instagram Live, Facebook Live, X Livestream, and others - Increasing importance as an economic activity <span class="small">(Zhou et al. 2019; Johnson & Woodcock 2019; Yu et al. 2018)</span> - Recorded streams contain multiple types of communication at multiple levels <span class="small">(Sjöblom et al. 2019; Recktenwald 2017)</span>, for example - Speech and visual content (e.g. facial expressions or gestures) of the streamer - Text and graphical image content (emoji, emotes) of live chat - Text and graphical content of system messages (e.g. bots showing tips to the streamer) - Secondary visual content (and text and speech) of video output (e.g., a window showing gameplay) - These offer new perspectives for the study of online interactional coherence <span class="small">(Herring 1999)</span> - Most corpus-based analyses have focused on live chat content <span class="small">(Olejniczak 2015; Kim et al. 
2022)</span> - Few studies consider the content of the streamer as well as chat and comments --- exclude: true ### Scripting pipelines for multimedia analysis - Scripts in Python or R in a cloud-based notebook environment (CSC's Tykky or Google's Colab) - Dependency conflict issues are minimal - Can be used immediately without extensive setup of servers or databases - Script components are customizable - Can be adapted to handle various types of content - Can be adapted to handle large amounts of data --- ### VoD Toolkit (https://t.ly/le6_e) This contribution: a script pipeline to generate a structured, time-aligned transcript that combines the stream speech transcript with chat contributions and other types of content - Python Jupyter environment - Google Colab - Generates textual output that can be analyzed with corpus methods - Includes emoji and custom emotes - Can also be used to capture video/audio for combined multimedia analysis --- ### Pipeline components ![:scale 65%](data:image/png;base64,#./VoD_screenshot.png) - [yt-dlp](https://github.com/yt-dlp/yt-dlp) - [TwitchDownloaderCLI](https://github.com/lay295/TwitchDownloader) - [Whisper](https://github.com/openai/whisper)/[WhisperX](https://github.com/m-bain/whisperX) The toolkit's output is an HTML file --- ### Component: [yt-dlp](https://github.com/yt-dlp/yt-dlp) .pull-left[ ![](data:image/png;base64,#../../ALOES_preconference_workshop/Workshop_presentation/yt-dlp_screenshot.png) ] .pull-right[ - A fork of youtube-dl - Can be used to access any content streamed with the DASH or HLS protocols - Can also be used to download video ] --- ### Component: [TwitchDownloaderCLI](https://github.com/lay295/TwitchDownloader) .pull-left[ ![](data:image/png;base64,#./TwitchDownloaderCLI_screenshot.png) ] .pull-right[ - Command-line interface for retrieving Twitch videos and chats ] --- ### Component: [WhisperX](https://github.com/m-bain/whisperX) .pull-left[ ![](data:image/png;base64,#./WhisperX_screenshot.png) ] .pull-right[ A library based 
on OpenAI's Whisper providing automatic speech recognition - Word-level timestamps - Speaker diarization - Faster than Whisper, especially with GPU ] --- ### Schematic representation of pipeline functions ![:scale 35%](data:image/png;base64,#./diagram3.png) --- ### Example: YouTube stream <iframe width="800" height="450" controls src="https://cc.oulu.fi/~scoats/PewDiePie_videoClip.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" scrolling="no" sandbox allowfullscreen></iframe> --- ### Example: YouTube output <div class="container"> <iframe src="./PewDiePie.html" style="width: 100%; height: 450px; max-width: 100%;" sandbox="allow-same-origin allow-scripts" scrolling="yes" seamless="seamless" frameborder="0" align="middle"></iframe> </div> --- ### Example: Twitch stream <iframe width="800" height="450" controls src="https://cc.oulu.fi/~scoats/AutomaticJak_videoClip.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" scrolling="no" sandbox allowfullscreen></iframe> --- ### Example: Twitch output <div class="container"> <iframe src="./AutomaticJak.html" style="width: 100%; height: 450px; max-width: 100%;" sandbox="allow-same-origin allow-scripts" scrolling="yes" seamless="seamless" frameborder="0" align="middle"></iframe> </div> --- ### Use cases: Chat density ![:scale 75%](data:image/png;base64,#./chat_density.png) - Chat density can be compared and correlated with streamer utterances --- ### Potential use case: Automated analysis of video streams - Retrieve video with yt-dlp - Add cells to VoD Toolkit to import (e.g.) [X-CLIP](https://huggingface.co/microsoft/xclip-base-patch32) <span class="small">(Ni et al. 2022)</span>, [LLaVA-NeXT-Video-7B-h9](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-h) <span class="small">(Zhang et al. 
2024)</span>, or other libraries - Automatically generate text describing what is going on in different parts of the video - Who is chatting about what parts of the video? - Is chat about (for example) video content, other chat, or speech content? --- ### Use cases: Acoustic analysis ![:scale 50%](data:image/png;base64,#../../ALOES_preconference_workshop/Workshop_presentation//Github_phonetics_pipeline_screenshot.png) - Acoustic features of particular streamers, or of streams on different topics or from different locations, can be analyzed <span class="small">(cf. Coats 2023; Méli et al. 2023)</span> https://colab.research.google.com/github/stcoats/phonetics_pipeline/blob/main/phonetics_pipeline_v3.ipynb --- exclude: true ### Component: Praat-Parselmouth <span class="small">(Jadoul et al. 2018)</span> - Python interface to Praat, widely used software for acoustic analysis <span class="small">(Boersma & Weenink 2023)</span> - Integration into Python simplifies workflows and analysis ![:scale 75%](data:image/png;base64,#../../ALOES_preconference_workshop/Workshop_presentation/praat_screenshot.png) --- exclude: true ### Component: Montreal Forced Aligner <span class="small">(McAuliffe et al. 
2017)</span> .pull-left30[ .small[ - Forced alignment aligns a transcript with its audio track so that the exact start and end times of segments (words, phones) can be determined - Necessary for automated analysis of vowel quality and other phonetic features - MFA may perform better than some other aligners (P2FA, MAUS) - MFA is fragile ] ] .pull-right70[ ![](data:image/png;base64,#../../ALOES_preconference_workshop/Workshop_presentation/mfa_screenshot.png) ] --- exclude: true ### Google Colab - Google Colaboratory is an online service for running code in Python or R in a notebook environment - You need a Google account to use Colab - Advantages include access to GPU/TPU, collaborative editing, cloud-based execution, and integration with code on GitHub/GitLab ![:scale 70%](data:image/png;base64,#../../ALOES_preconference_workshop/Workshop_presentation/Colab.png) --- ### Summary and outlook - Ready-to-use scripting pipeline for generation of combined speech transcript/live chat files - Can be used for various text-based corpus-analytic research questions - Can be used, with additional modules, for multimodal analysis of video and audio content --- ### Thank you for your attention! --- ### References .small[ .hangingindent[ Coats, S. (2023). [A pipeline for the large-scale acoustic analysis of streamed content](https://doi.org/10.14618/1z5k-pb25). In L. Cotgrove, L. Herzberg, H. Lüngen, & I. Pisetta (eds.), *Proceedings of the 10th International Conference on CMC and Social Media Corpora for the Humanities (CMC-Corpora 2023)*, 51–54. Mannheim: Leibniz-Institut für Deutsche Sprache. Herring, S. (1999). [Interactional coherence in CMC](https://doi.org/10.1111/j.1083-6101.1999.tb00106.x). *Journal of Computer-Mediated Communication*, 4(4). Johnson, M. R., & Woodcock, J. (2019). [The impacts of live streaming and Twitch.tv on the video game industry](https://doi.org/10.1177/0163443718818363). *Media, Culture & Society*, 41(5), 670–688. Kim, J., Wohn, D. 
Y., & Cha, M. (2022). [Understanding and identifying the use of emotes in toxic chat on Twitch](https://doi.org/10.1016/j.osnem.2021.100180). *Online Social Networks and Media*, 27. Méli, A., Coats, S., & Ballier, N. (2023). [Methods for phonetic scraping of Youtube videos](https://aclanthology.org/2023.icnlsp-1.25/). In *Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023)*, 244–249. Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S., & Ling, H. (2022). [Expanding language-image pretrained models for general video recognition](https://doi.org/10.48550/arXiv.2208.02816). *arXiv*, cs.CV, 2208.02816. Olejniczak, J. (2015). A linguistic study of language variety used on twitch.tv: Descriptive and corpus-based approaches. In *Proceedings of RCIC’15: Redefining Community in Intercultural Context, Brasov, 21–23 May 2015* (pp. 329–334). Recktenwald, D. (2017). Toward a transcription and analysis of live streaming on Twitch. *Journal of Pragmatics*, 115, 68–81. Sjöblom, M., Törhönen, M., Hamari, J., & Macey, J. (2019). The ingredients of Twitch streaming: Affordances of game streams. *Computers in Human Behavior*, 92, 20–28. Yu, E., Jung, C., Kim, H., & Jung, J. (2018). Impact of viewer engagement on gift-giving in live video streaming. *Telematics and Informatics*, 35(5), 1450–1460. Zhang, Y., Li, B., Liu, H., Lee, Y. G., Gui, L., Fu, D., Feng, J., Liu, Z., & Li, C. (2024). [LLaVA-NeXT: A Strong Zero-shot Video Understanding Model](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/). Zhou, J., Zhou, J., Ding, Y., & Wang, H. (2019). The magic of danmaku: A social interaction perspective of gift sending on live streaming platforms. *Electronic Commerce Research and Applications*, 34, 100815. ]]
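--- ### Appendix: time alignment (sketch)

The toolkit's central step, combining the ASR transcript with chat into one time-aligned sequence, can be illustrated with a minimal sketch. This is not the toolkit's actual code: the field names (`start`, `end`, `text`, `offset`, `user`) are assumptions standing in for whatever WhisperX and the chat downloader emit.

```python
# Minimal sketch: interleave ASR speech segments and chat messages into one
# time-aligned sequence. Field names are illustrative, not the toolkit's schema.

segments = [  # WhisperX-style output: start/end times in seconds
    {"start": 0.5, "end": 2.1, "text": "welcome back everyone"},
    {"start": 4.0, "end": 6.3, "text": "let's check the chat"},
]
chat = [  # chat messages: offset from stream start in seconds
    {"offset": 1.2, "user": "viewer1", "text": "hi!"},
    {"offset": 5.0, "user": "viewer2", "text": "PogChamp"},
]

# Tag each event with its type, then sort everything on the shared time axis
events = sorted(
    [("SPEECH", s["start"], s["text"]) for s in segments]
    + [("CHAT", c["offset"], f"{c['user']}: {c['text']}") for c in chat],
    key=lambda e: e[1],
)

for kind, t, text in events:
    print(f"[{t:6.1f}s] {kind:6s} {text}")
```

Sorting tagged events on a shared time axis is what makes questions like "which streamer utterance does this chat message respond to?" tractable.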
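--- ### Appendix: chat density (sketch)

The chat-density use case can be sketched as binning chat messages into one-minute windows and counting them. Again, the `offset` field (seconds from stream start) is an assumption about the chat data's shape, not the toolkit's actual schema.

```python
# Sketch: messages-per-minute chat density from message offsets (in seconds).
# The "offset" field name is illustrative, not the toolkit's actual schema.
from collections import Counter

chat = [
    {"offset": 12.0, "user": "a", "text": "hello"},
    {"offset": 47.5, "user": "b", "text": "gg"},
    {"offset": 65.2, "user": "a", "text": "lol"},
]

# Map each message to its minute bin and count messages per bin
density = Counter(int(msg["offset"] // 60) for msg in chat)
for minute in sorted(density):
    print(f"minute {minute}: {density[minute]} messages")
```

The resulting per-minute series can then be correlated with, for example, the timing of streamer utterances in the ASR transcript.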