class: inverse, center, middle background-image: url(data:image/png;base64,#https://cc.oulu.fi/~scoats/oululogoRedTransparent.png); background-repeat: no-repeat; background-size: 80px 57px; background-position:right top; exclude: true --- class: title-slide <br><br><br><br><br> .pull-left[![:scale 25%](data:image/png;base64,#./pewfull.png)![:scale 25%](data:image/png;base64,#./pewfull.png)![:scale 25%](data:image/png;base64,#./pewfull.png)] .pull-right[ <span style="font-family:Rubik;font-size:24pt;font-weight: 700;font-style: normal;float:right;text-align: right;color:white;-webkit-text-fill-color: black;-webkit-text-stroke: 0.8px;">Video on Demand Toolkit: A Framework for Analysis of Speech and Chat Content in YouTube and Twitch Streams</span> ] <p style="float:right;text-align: right;color:white;font-weight: 700;font-style: normal;-webkit-text-fill-color: black;-webkit-text-stroke: 0.5px;"> Steven Coats<br> English, University of Oulu, Finland<br> <a href="mailto:steven.coats@oulu.fi">steven.coats@oulu.fi</a><br> CMC-Corpora 11, Université Côte d'Azur<br> September 5th, 2024<br> </p> --- layout: true <div class="my-header"><img border="0" alt="University of Oulu logo" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div> <div class="my-footer"><span>Steven Coats                            Framework for Stream Analysis | CMC-Corpora 11</span></div> --- ### Outline 1. Background - Video streams as an increasingly popular CMC modality - Corpus-based study of video streams 2. VoD Toolkit: Pipeline components 3. Use cases - Chat density - Automated analysis of video streams 4. 
Outlook and summary .footnote[Slides for the presentation are on my homepage at https://cc.oulu.fi/~scoats] --- ### Background - In the past 15 years, video-based CMC modalities have become popular - Many streaming sites have large numbers of users - Twitch (mostly gaming), YouTube Live (mostly music, live vlogging, tutorials, etc.), Instagram Live, Facebook Live, X Livestream, and others - Increasing importance as an economic activity <span class="small">(Zhou et al. 2019; Johnson & Woodcock 2019; Yu et al. 2018)</span> - Recorded streams contain multiple types of communication at multiple levels <span class="small">(Sjöblom et al. 2019; Recktenwald 2017)</span>, for example - Speech and visual content (e.g. facial expressions or gestures) of the streamer - Text and graphical image content (emoji, emotes) of live chat - Text and graphical content of system messages (e.g. bots showing tips to the streamer) - Secondary visual content (and text and speech) of video output (e.g., a window showing gameplay) - These offer new perspectives for the study of online interactional coherence <span class="small">(Herring 1999)</span> - Most corpus-based analyses have focused on live chat content <span class="small">(Olejniczak 2015; Kim et al. 
2022)</span> - Few studies consider the content of the streamer as well as chat and comments --- exclude: true ### Scripting pipelines for multimedia analysis - Scripts in Python or R in a cloud-based notebook environment (CSC's Tykky or Google's Colab) - Dependency conflict issues are minimal - Can be used immediately without extensive setup of servers or databases - Script components are customizable - Can be adapted to handle various types of content - Can be adapted to handle large amounts of data --- ### VoD Toolkit (https://t.ly/le6_e) This contribution: a script pipeline to generate a structured, time-aligned transcript that combines the stream speech transcript with chat contributions and other types of content - Python Jupyter environment - Google Colab - Generates textual output that can be analyzed with corpus methods - Includes emoji and custom emotes - Can also be used to capture video/audio for combined multimedia analysis --- ### Pipeline components ![:scale 65%](data:image/png;base64,#./VoD_screenshot.png) - [yt-dlp](https://github.com/yt-dlp/yt-dlp) - [TwitchDownloaderCLI](https://github.com/lay295/TwitchDownloader) - [Whisper](https://github.com/openai/whisper)/[WhisperX](https://github.com/m-bain/whisperX) The toolkit's output is an HTML file --- ### Component: [yt-dlp](https://github.com/yt-dlp/yt-dlp) .pull-left[ ![](data:image/png;base64,#../../ALOES_preconference_workshop/Workshop_presentation/yt-dlp_screenshot.png) ] .pull-right[ - A fork of youtube-dl - Can be used to access any content streamed with the DASH or HLS protocols - Can also be used to download video ] --- ### Component: [TwitchDownloaderCLI](https://github.com/lay295/TwitchDownloader) .pull-left[ ![](data:image/png;base64,#./TwitchDownloaderCLI_screenshot.png) ] .pull-right[ - Command-line interface for retrieving Twitch videos and chats ] --- ### Component: [WhisperX](https://github.com/m-bain/whisperX) .pull-left[ ![](data:image/png;base64,#./WhisperX_screenshot.png) ] .pull-right[ A library based 
on OpenAI's Whisper providing automatic speech recognition - Word-level timestamps - Speaker diarization - Faster than Whisper, especially with GPU ] --- ### Schematic representation of pipeline functions ![:scale 35%](data:image/png;base64,#./diagram3.png) --- ### Example: YouTube stream <iframe width="800" height="450" controls src="https://cc.oulu.fi/~scoats/PewDiePie_videoClip.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" scrolling="no" sandbox allowfullscreen></iframe> --- ### Example: YouTube output <div class="container"> <iframe src="./PewDiePie.html" style="width: 100%; height: 450px; max-width: 100%;" sandbox="allow-same-origin allow-scripts" scrolling="yes" seamless="seamless" frameborder="0" align="middle"></iframe> </div> --- ### Example: Twitch stream <iframe width="800" height="450" controls src="https://cc.oulu.fi/~scoats/AutomaticJak_videoClip.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" scrolling="no" sandbox allowfullscreen></iframe> --- ### Example: Twitch output <div class="container"> <iframe src="./AutomaticJak.html" style="width: 100%; height: 450px; max-width: 100%;" sandbox="allow-same-origin allow-scripts" scrolling="yes" seamless="seamless" frameborder="0" align="middle"></iframe> </div> --- ### Use cases: Chat density ![:scale 75%](data:image/png;base64,#./chat_density.png) - Chat density can be compared and correlated with streamer utterances --- ### Potential use case: Automated analysis of video streams - Retrieve video with yt-dlp - Add cells to VoD Toolkit to import (e.g.) [X-CLIP](https://huggingface.co/microsoft/xclip-base-patch32) <span class="small">(Ni et al. 2022)</span>, [LLaVA-NeXT-Video-7B-h9](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-h) <span class="small">(Zhang et al. 
2024)</span>, or other libraries - Automatically generate text describing what is going on in different parts of the video - Who is chatting about what parts of the video? - Is chat about (for example) video content, other chat, or speech content? --- ### Use cases: Acoustic analysis ![:scale 50%](data:image/png;base64,#../../ALOES_preconference_workshop/Workshop_presentation//Github_phonetics_pipeline_screenshot.png) - Acoustic features of particular streamers, or of streams on different topics or from different locations, can be analyzed <span class="small">(cf. Coats 2023; Méli et al. 2023)</span> https://colab.research.google.com/github/stcoats/phonetics_pipeline/blob/main/phonetics_pipeline_v3.ipynb --- exclude: true ### Component: Praat-Parselmouth <span class="small">(Jadoul et al. 2018)</span> - Python interface to Praat, widely used software for acoustic analysis <span class="small">(Boersma & Weenink 2023)</span> - Integration into Python simplifies workflows and analysis ![:scale 75%](data:image/png;base64,#../../ALOES_preconference_workshop/Workshop_presentation/praat_screenshot.png) --- exclude: true ### Component: Montreal Forced Aligner <span class="small">(McAuliffe et al. 
2017)</span> .pull-left30[ .small[ - Forced alignment aligns a transcript with its audio track so that the exact start and end times of segments (words, phones) can be determined - Necessary for automated analysis of vowel quality and other phonetic features - MFA may perform better than some other aligners (P2FA, MAUS) - MFA is fragile ] ] .pull-right70[ ![](data:image/png;base64,#../../ALOES_preconference_workshop/Workshop_presentation/mfa_screenshot.png) ] --- exclude: true ### Google Colab - Google Colaboratory is an online service for running code in Python or R in a notebook environment - You need a Google account to use Colab - Advantages include access to GPU/TPU, collaborative editing, cloud-based execution, and integration with code on GitHub/GitLab ![:scale 70%](data:image/png;base64,#../../ALOES_preconference_workshop/Workshop_presentation/Colab.png) --- ### Summary and outlook - Ready-to-use scripting pipeline for generation of combined speech transcript/live chat files - Can be used for various text-based corpus-analytic research questions - Can be used, with additional modules, for multimodal analysis of video and audio content --- ### Thank you for your attention! --- ### References .small[ .hangingindent[ Coats, S. (2023). [A pipeline for the large-scale acoustic analysis of streamed content](https://doi.org/10.14618/1z5k-pb25). In L. Cotgrove, L. Herzberg, H. Lüngen, & I. Pisetta (eds.), *Proceedings of the 10th International Conference on CMC and Social Media Corpora for the Humanities (CMC-Corpora 2023)*, 51–54. Mannheim: Leibniz-Institut für Deutsche Sprache. Herring, S. (1999). [Interactional coherence in CMC](https://doi.org/10.1111/j.1083-6101.1999.tb00106.x). *Journal of Computer-Mediated Communication*, 4(4). Johnson, M. R., & Woodcock, J. (2019). [The impacts of live streaming and Twitch.tv on the video game industry](https://doi.org/10.1177/0163443718818363). *Media, Culture & Society*, 41(5), 670–688. Kim, J., Wohn, D. 
Y., & Cha, M. (2022). [Understanding and identifying the use of emotes in toxic chat on Twitch](https://doi.org/10.1016/j.osnem.2021.100180). *Online Social Networks and Media*, 27. Méli, A., Coats, S., & Ballier, N. (2023). [Methods for phonetic scraping of Youtube videos](https://aclanthology.org/2023.icnlsp-1.25/). In *Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023)*, 244–249. Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S., & Ling, H. (2022). [Expanding language-image pretrained models for general video recognition](https://doi.org/10.48550/arXiv.2208.02816). *arXiv*, cs.CV, 2208.02816. Olejniczak, J. (2015). A linguistic study of language variety used on twitch.tv: Descriptive and corpus-based approaches. In *Proceedings of RCIC’15: Redefining Community in Intercultural Context, Brasov, 21–23 May 2015* (pp. 329–334). Recktenwald, D. (2017). Toward a transcription and analysis of live streaming on Twitch. *Journal of Pragmatics*, 115, 68–81. Sjöblom, M., Törhönen, M., Hamari, J., & Macey, J. (2019). The ingredients of Twitch streaming: Affordances of game streams. *Computers in Human Behavior*, 92, 20–28. Yu, E., Jung, C., Kim, H., & Jung, J. (2018). Impact of viewer engagement on gift-giving in live video streaming. *Telematics and Informatics*, 35(5), 1450–1460. Zhang, Y., Li, B., Liu, H., Lee, Y. G., Gui, L., Fu, D., Feng, J., Liu, Z., & Li, C. (2024). [LLaVA-NeXT: A Strong Zero-shot Video Understanding Model](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/). Zhou, J., Zhou, J., Ding, Y., & Wang, H. (2019). The magic of danmaku: A social interaction perspective of gift sending on live streaming platforms. *Electronic Commerce Research and Applications*, 34, 100815. ]]
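--- ### Appendix: time alignment (sketch)

The toolkit's central step, combining the ASR transcript with chat into one time-aligned sequence, can be illustrated with a minimal sketch. This is not the toolkit's actual code: the field names (`start`, `end`, `text`, `offset`, `user`) are assumptions standing in for whatever WhisperX and the chat downloader emit.

```python
# Minimal sketch: interleave ASR speech segments and chat messages into one
# time-aligned sequence. Field names are illustrative, not the toolkit's schema.

segments = [  # WhisperX-style output: start/end times in seconds
    {"start": 0.5, "end": 2.1, "text": "welcome back everyone"},
    {"start": 4.0, "end": 6.3, "text": "let's check the chat"},
]
chat = [  # chat messages: offset from stream start in seconds
    {"offset": 1.2, "user": "viewer1", "text": "hi!"},
    {"offset": 5.0, "user": "viewer2", "text": "PogChamp"},
]

# Tag each event with its type, then sort everything on the shared time axis
events = sorted(
    [("SPEECH", s["start"], s["text"]) for s in segments]
    + [("CHAT", c["offset"], f"{c['user']}: {c['text']}") for c in chat],
    key=lambda e: e[1],
)

for kind, t, text in events:
    print(f"[{t:6.1f}s] {kind:6s} {text}")
```

Sorting tagged events on a shared time axis is what makes questions like "which streamer utterance does this chat message respond to?" tractable.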
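--- ### Appendix: chat density (sketch)

The chat-density use case can be sketched as binning chat messages into one-minute windows and counting them. Again, the `offset` field (seconds from stream start) is an assumption about the chat data's shape, not the toolkit's actual schema.

```python
# Sketch: messages-per-minute chat density from message offsets (in seconds).
# The "offset" field name is illustrative, not the toolkit's actual schema.
from collections import Counter

chat = [
    {"offset": 12.0, "user": "a", "text": "hello"},
    {"offset": 47.5, "user": "b", "text": "gg"},
    {"offset": 65.2, "user": "a", "text": "lol"},
]

# Map each message to its minute bin and count messages per bin
density = Counter(int(msg["offset"] // 60) for msg in chat)
for minute in sorted(density):
    print(f"minute {minute}: {density[minute]} messages")
```

The resulting per-minute series can then be correlated with, for example, the timing of streamer utterances in the ASR transcript.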