class: inverse, center, middle
background-image: url(data:image/png;base64,#https://cc.oulu.fi/~scoats/oululogoRedTransparent.png)
background-repeat: no-repeat
background-size: 80px 57px
background-position: right top
exclude: true

---
class: title-slide

<br>
## <span style="color:black;-webkit-text-fill-color: orange;-webkit-text-stroke: 1px;">Civic Engagement with Local Government Videos: Comparing YouTube Transcripts with User Comments</span>
<br><br><br>
<span style="color:yellow;">Steven Coats</span><br>
<span style="color:yellow;">English, University of Oulu, Finland</span><br>
<a href="mailto:steven.coats@oulu.fi">steven.coats@oulu.fi</a><br>
<span style="color:yellow;">9th CMC-Corpora Conference, Santiago</span><br>
<span style="color:yellow;">September 29th, 2022</span><br>
<br><br><span style="font-size:8pt;float: right"><a href="https://i.cbc.ca/1.6424620.1650474599!/fileImage/httpImage/image.jpg_gen/derivatives/16x9_780/town-of-raymond.jpg">image source</a></span>

---
layout: true

<div class="my-header"><img border="0" alt="University of Oulu logo" src="https://cc.oulu.fi/~scoats/oululogonewEng.png" width="80" height="80"></div>
<div class="my-footer"><span>Steven Coats&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;Comparing YouTube Transcripts and Comments | CMC-Corpora 9, Santiago</span></div>

---
exclude: true

## Outline

1. Background, YouTube ASR captions files, data
2. Methods: Sentiment analysis and topic modeling with transformer models
3. Preliminary results
4. Caveats, summary

.footnote[Slides for the presentation are on my homepage at https://cc.oulu.fi/~scoats]

---
### Background

YouTube as a platform for mediated quasi-interaction with three communicative levels <span class="small">(Bou-Franch et al. 2012, Dynel 2014)</span>
- First level: Face-to-face spoken interaction
- Second level: Corresponds to classifications of mass media
- *Third level*: Additional modalities/affordances of the platform (e.g. commenting)

Mediated interaction can be important for societal stakeholders such as companies, organizations, and governments
- How do people interact with content uploaded by local governments: what do they like and dislike?

---
### Prior studies of YouTube comments

.pull-left[
- Quality of comments <span class="small">(e.g. Goode et al. 2011)</span>
- Typological classifications <span class="small">(e.g. Herring & Chae 2021, Häring et al. 2018)</span>
- Sentiment of comments <span class="small">(e.g. Ksiazek 2018)</span>
- Like ratio vs. text of comments <span class="small">(e.g. Schultes et al. 2013, Siersdorfer et al. 2014)</span>

This study: **Transcripts of videos vs. comments**
- First large-scale comparison of the discourse content of videos and their comments?
- Exploratory study, to be developed further
]

.pull-right[
![:scale 75%](data:image/png;base64,#great_content2.png)
]

---
### Data: transcripts

Corpus of North American Spoken English ([CoNASE](https://cc.oulu.fi/~scoats/CoNASE.html)): 1.25b-word corpus of 301,846 word-timed, part-of-speech-tagged Automatic Speech Recognition (ASR) transcripts <span class="small">(Coats forthcoming)</span>
- Mostly transcripts of meetings and other local government content
- Freely available for research use; download from the [Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/X8QJJV)

![](data:image/png;base64,#conase_screenshot.png)

---
### Focus on regional and local government channels

Many recordings of meetings of elected councillors: advantages in terms of representativeness and comparability
- Speaker place of residence is known (cf. videos collected on the basis of place-name search alone)
- Topical contents and communicative contexts are comparable
- In most jurisdictions, government content is in the public domain

---
### Data: comments

- For all videos in CoNASE, retrieve all available comments with [youtube-comment-downloader](https://github.com/egbertbouman/youtube-comment-downloader) (see the appendix for a retrieval sketch)
- Result: 190,079 total comments for 20,965 videos (6.95% of CoNASE videos), 116,009 unique users, 5,334,096 word tokens
- Most of these videos have few views/likes/comments
- Local government does not engage people as much as music videos, video game streaming, makeup tutorials, and other popular YouTube content

---
### Sentiment

- Transformer models outperform "bag-of-words"-based sentiment analysis
- YouTube comments are rich in emoji, so sentiment models need to handle emoji

![](data:image/png;base64,#sentiment_screenshot.png)

Model: Twitter-roBERTa-base, trained on ~124m tweets from January 2018 to December 2021 <span class="small">(Loureiro et al. 2022, Barbieri et al. 2020)</span>
- BERT-derived transformer processing pipelines can typically only handle texts up to 512 tokens long
- Code to chunk the transcripts, assign values to the chunks, and take mean values (a sketch follows on a later slide)
- Assign sentiment values in the range 0 (negative) to 2 (positive) to all video transcripts and all comments

---
### Topic Modeling

BERTopic <span class="small">(Grootendorst 2022)</span>
- Groups lexical items and documents together into "topics"
- A form of dimensionality reduction that can give insight into the discourse/content of text data
- BERTopic uses embeddings from sentence-transformer models (these take word context into account)

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words="english")
topic_model = BERTopic(language="english",
                       embedding_model="all-MiniLM-L12-v2",
                       vectorizer_model=vectorizer_model,
                       nr_topics="auto")
topics, probs = topic_model.fit_transform(list(conase_subset.text))
```

- Used [all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2), a model trained on 1.7 billion words of web texts from various genres

---
### Research questions

- What kind of discourse content is represented in the videos?
- What does topic modeling tell us about the content of the transcripts?
- Which content attracts positive/negative comments?
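---
### Sentiment chunking: a sketch

A minimal sketch of the chunk-and-average step, not the exact code used for the study: the checkpoint name, the chunk size, and the label-to-score mapping are assumptions.

```python
from transformers import pipeline

# TimeLMs-based Twitter-RoBERTa sentiment model (Loureiro et al. 2022);
# the exact checkpoint used for the study is an assumption
sentiment = pipeline("sentiment-analysis",
                     model="cardiffnlp/twitter-roberta-base-sentiment-latest")

# Map model labels to the 0 (negative) to 2 (positive) scale
LABEL_TO_SCORE = {"negative": 0, "neutral": 1, "positive": 2}

def transcript_sentiment(text, chunk_size=300):
    """Split a transcript into ~300-word chunks (to stay under the
    512-token limit), score each chunk, and return the mean score."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    scores = [LABEL_TO_SCORE[res["label"].lower()]
              for res in sentiment(chunks, truncation=True)]
    return sum(scores) / len(scores)
```

Comments are short enough to fit in a single chunk, so the same function can score comments and transcripts alike.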
---
### 8 largest topics (transcripts)

<div style="text-align: center;">
<iframe style="text-align: center" src="8_topics_words_e.html" width="100%" height="500" id="igraph" scrolling="no" seamless="seamless" frameBorder="0"> </iframe>
</div>

---
exclude: true
### Topics

<div style="text-align: center;">
<iframe style="text-align: center" src="plotly_c.html" width="100%" height="500" id="igraph" scrolling="no" seamless="seamless" frameBorder="0"> </iframe>
</div>

---
### Topics interpretation

.large[
.pull-left[
- Firefighting
- City/community
- Waste disposal and management
- Art
]
.pull-right[
- Police
- Sports
- Discourse (?)
- City/discourse (?)
]
]

---
### Sentiment of transcripts by topic

![:scale 50%](data:image/png;base64,#topic_transcript_sentiment_boxplot.png)

---
### Sentiment of comments by topic

![:scale 50%](data:image/png;base64,#topic_sentiment_boxplot.png)

---
exclude: true
### Data collection and processing

- Identification of relevant channels (lists of councils with web pages → scrape the pages for links to YouTube)
- Inspection of returned channels to remove false positives
- Retrieval of ASR transcripts using [yt-dlp](https://github.com/yt-dlp/yt-dlp)
- Geocoding: a string containing council name + address + country was sent to Google's geocoding service
- PoS tagging with SpaCy <span class="small">(Honnibal et al. 2019)</span>

---
### Mutual love score

For a given channel with `\(n\)` videos and `\(m\)` comments in total, the mutual *love score* is the mean of the transcript sentiments times the mean of the comment sentiments (see the sketch on a following slide):

`$$love~score = \frac{1}{n}\sum_{i=1}^{n}st_{i} \cdot \frac{1}{m}\sum_{j=1}^{m}sc_{j}$$`

- Ranges from 0 (negative) to 4 (positive videos and positive comments)

---
exclude: true

|transcript|transcript sentiment|comment|comment sentiment|LR|
|----------|--------------------|-------|-----------------|--|
|That's wonderful!|2|I love this video|2|4|
| | |amazing thank you!|2| |
| | |the best!!|2| |
|We love you!|2|This video sucks|0|2|
| | |I don't know|1| |
| | |Love you too!|2| |

`$$lovefest~ratio = \frac{1}{2}(2+2) \cdot \frac{1}{6}(2+2+2+0+1+2) = 2 \cdot \frac{9}{6} = 3$$`

---
### Lovefest ratio
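---
### Computing the love score: a sketch

A minimal sketch of the love score computation defined above; the function and variable names are hypothetical, not from the study's code.

```python
def love_score(transcript_scores, comment_scores):
    """Mutual love score for one channel: mean transcript sentiment
    times mean comment sentiment, both on the 0-2 scale."""
    mean_st = sum(transcript_scores) / len(transcript_scores)
    mean_sc = sum(comment_scores) / len(comment_scores)
    return mean_st * mean_sc  # 0 (negative) to 4 (positive)

# The worked lovefest example: two transcripts, six comments
love_score([2, 2], [2, 2, 2, 0, 1, 2])  # 2 * 1.5 = 3.0
```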
---
class: small
### Comment sentiment map

<div style="text-align: center;">
<iframe style="text-align: center" src="yt_comments_testmap_CMC9a.html" width="100%" height="500" id="igraph" scrolling="no" seamless="seamless" frameBorder="0"> </iframe>
</div>

---
### Implications for local government

- Citizen engagement leads to better communities <span class="small">(Gaventa & Barrett 2012)</span>
- More engagement in the form of art/food/outreach videos, fewer police videos?

---
### A few caveats

- Videos of local government are not representative of speech in general
- ASR errors (mean WER after filtering ~14%); transcript quality is related to audio quality as well as to dialect features <span class="small">(Tatman 2017; Meyer et al. 2020; Markl & Lai 2021)</span>
- Low-frequency phenomena: manually inspect corpus hits
- High-frequency phenomena: the signal of correct transcriptions will be stronger <span class="small">(Agarwal et al. 2009)</span> → classifiers
- High variability in discourse across videos, and high variability in the number of comments (most videos have few comments)

![](data:image/png;base64,#comments_per_video.png)

---
### A few caveats

The sample is small and probably not statistically reliable. A better approach:
- Get channels with very many videos
- Randomly sample a large number of videos
- Randomly sample a large number of comments
- Bigger datasets are needed (coming...)

Also: transformer models (like all-MiniLM-L12-v2) are trained on segmented text with clear boundaries, but transcript text mostly has no punctuation such as periods or commas.

---
exclude: true
### Summary and outlook

- Large corpus of ASR transcripts from YouTube channels of local governments in North America
- Possibly useful for corpus studies of spoken language, dialectology, pragmatics
- Sentiment analysis and topic modeling allow the discourse content of videos and comments to be compared

---
# Thank you!

---
### References

.small[
.hangingindent[
Barbieri, Francesco, Jose Camacho-Collados, Leonardo Neves & Luis Espinosa-Anke. (2020). [TweetEval: Unified benchmark and comparative evaluation for tweet classification](https://doi.org/10.48550/arXiv.2010.12421). *arXiv*:2010.12421 [cs.CL].

Bou-Franch, Patricia, Nuria Lorenzo-Dus & Pilar Garcés-Conejos Blitvich. (2012). Social interaction in YouTube text-based polylogues: A study of coherence. *Journal of Computer-Mediated Communication* 17, 501–521.

Coats, Steven. (Forthcoming). Dialect corpora from YouTube. In B. Busse & N. Dumrukcic (eds.), *Proceedings of ICAME*.

Dynel, Marta. (2014). [Participation framework underlying YouTube interaction](https://doi.org/10.1016/j.pragma.2014.04.001). *Journal of Pragmatics* 73, 37–52.

Gaventa, John & Gregory Barrett. (2012). [Mapping the outcomes of citizen engagement](https://doi.org/10.1016/j.worlddev.2012.05.014). *World Development* 40(12), 2399–2410.

Goode, Luke, Alexis McCullough & Gelise O'Hare. (2011). Unruly publics and the fourth estate on YouTube. *Participations: Journal of Audience and Reception Studies* 8(2), 594–615.

Grootendorst, Maarten. (2022). [BERTopic: Neural topic modeling with a class-based TF-IDF procedure](https://doi.org/10.48550/arXiv.2203.05794). *arXiv*:2203.05794 [cs.CL].

Häring, Mario, Wiebke Loosen & Walid Maalej. (2018). [Who is addressed in this comment? Automatically classifying meta-comments in news comments](https://doi.org/10.1145/3274336). *Proceedings of the ACM on Human-Computer Interaction* 2(CSCW), 1–20.

Herring, Susan & Seung Woo Chae. (2021). Prompt-rich CMC on YouTube: To what or to whom do comments respond?
In *Proceedings of the 54th Hawaii International Conference on System Sciences*, 2906–2915.

Ksiazek, Thomas B. (2018). [Commenting on the News](https://doi.org/10.1080/1461670X.2016.1209977). *Journalism Studies* 19(5), 650–673.

Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer & Veselin Stoyanov. (2019). [RoBERTa: A robustly optimized BERT pretraining approach](https://doi.org/10.48550/arXiv.1907.11692). *arXiv*:1907.11692 [cs.CL].
]]

---
### References 2

.small[
.hangingindent[
Loureiro, Daniel, Francesco Barbieri, Leonardo Neves, Luis Espinosa Anke & Jose Camacho-Collados. (2022). [TimeLMs: Diachronic language models from Twitter](https://doi.org/10.48550/arXiv.2202.03829). *arXiv*:2202.03829v2 [cs.CL].

Markl, Nina & Catherine Lai. (2021). [Context-sensitive evaluation of automatic speech recognition: Considering user experience & language variation](https://aclanthology.org/2021.hcinlp-1.6). In *Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing*, 34–40.

Meyer, Josh, Lindy Rauchenstein, Joshua D. Eisenberg & Nicholas Howell. (2020). [Artie bias corpus: An open dataset for detecting demographic bias in speech applications](https://aclanthology.org/2020.lrec-1.796). In *Proceedings of the 12th Language Resources and Evaluation Conference*, 6462–6468.

Schultes, Peter, Verena Dorner & Franz Lehner. (2013). Leave a comment! An in-depth analysis of user comments on YouTube. In *11th International Conference on Wirtschaftsinformatik, 27th February – 1st March 2013, Leipzig, Germany*, 659–673.

Siersdorfer, Stefan, Sergiu Chelaru, Jose San Pedro, Ismail Sengor Altingovde & Wolfgang Nejdl. (2014). [Analyzing and mining comments and comment ratings on the social web](https://doi.org/10.1145/2628441). *ACM Transactions on the Web* 8(3), Article 17.

Tatman, Rachel. (2017). Gender and dialect bias in YouTube’s automatic captions. In *Proceedings of the First Workshop on Ethics in Natural Language Processing, April 4th, 2017, Valencia, Spain*, 53–59.
]]
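---
### Appendix: comment retrieval sketch

A minimal sketch of comment retrieval with [youtube-comment-downloader](https://github.com/egbertbouman/youtube-comment-downloader); the method name and the dictionary keys follow the library's README and may differ across versions, and the video URL is a placeholder.

```python
from youtube_comment_downloader import YoutubeCommentDownloader

downloader = YoutubeCommentDownloader()
url = "https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder

# get_comments_from_url yields one dict per comment; the "text" key
# holds the comment body (keys as documented in the project README)
comment_texts = [c["text"] for c in downloader.get_comments_from_url(url)]
```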