Corpus of Australian and New Zealand Spoken English

The Corpus of Australian and New Zealand Spoken English (CoANZSE) is a 196-million-word corpus of geolocated automatic speech recognition (ASR) YouTube transcripts from local government channels in Australia and New Zealand, created for the study of lexical, grammatical, and discourse-pragmatic phenomena of spoken language. Annotation includes individual word timings and video IDs of transcripts, making it easy to instantly view the video(s) for any given search. CoANZSE represents a rich resource for the study of language and communication in Australia and New Zealand, but given the content of the transcripts, may also be of interest to students and researchers in diverse digital humanities and social science fields such as cultural studies, geography, political science, sociology, urban studies and planning, or tourism studies, among others.

Corpus description

The corpus was created from 55,896 ASR transcripts from 478 YouTube channels, corresponding to almost 24,007 hours of video. The size of the corpus is 195,527,977 tokens. The channels sampled in the corpus are associated with local government entities such as local, city, county, district, and regional councils. The transcripts are from a range of video types, and recordings of public meetings are well-represented. Metadata includes exact latitude-longitude coordinates for the council authorities, making geographical analysis of language use possible.

Format and metadata

The corpus exists in two versions: one in which the transcripts have not been annotated, and one in which word tokens have been annotated with part-of-speech tags and timing. Data is in tabular form in csv format (with the pipe character "|" as a separator). The columns of the table are 'country', 'state', 'council_name', 'channel_title', 'channel_url', 'video_title', 'video_id', 'upload_date', 'video_length', 'location','nr_words','text_pos' (or 'text', in the untagged version), and 'latlong'. Each row corresponds to an individual transcript (see excerpt below).

Transcripts are in the 'text_pos' column (or the 'text' column, for the untagged version). Tagging was done with spaCy and the Penn Treebank tagset. Tagged tokens have the format "token_POS_10.0", where "token" is the transcribed lexical item, "POS" the tag, and "10.0" the time offset in seconds from the start of the corresponding video.

	country	state	name	channel_name	channel_url	video_title	video_id	upload_date	video_length	text_pos	location	latlong	nr_words
0	AUS	NSW	Wollondilly Shire Council	Wollondilly Shire	https://www.youtube.com/c/wollondillyshire	Road Resurfacing Video	zVr6S5XkJ28	20181127	146.120	g_NNP_2.75 'day_XX_2.75 my_PRP$_3.75 name_NN_4.53 is_VBZ_4.74 ...	62/64 Menangle St, Picton NSW 2571, Australia	(-34.1700078, 150.612913)	433
1	AUS	NSW	Wollondilly Shire Council	Wollondilly Shire	https://www.youtube.com/c/wollondillyshire	Weather update 5pm 1 March 2022 - Mayor Matt Gould	p4MjirCc1oU	20220301	181.959	hi_UH_0.64 guys_NNS_0.96 i_PRP_1.439 'm_VBP_1.439 just_RB_1.76 ...	62/64 Menangle St, Picton NSW 2571, Australia	(-34.1700078, 150.612913)	620
2	AUS	NSW	Wollondilly Shire Council	Wollondilly Shire	https://www.youtube.com/c/wollondillyshire	Transport Capital Works Video	DXlkVTcmeho	20180417	140.450	council_NNP_0.53 is_VBZ_1.53 placing_VBG_1.65 is_VBZ_2.07 2018-19_CD_2.57 ...	62/64 Menangle St, Picton NSW 2571, Australia	(-34.1700078, 150.612913)	347
3	AUS	NSW	Wollondilly Shire Council	Wollondilly Shire	https://www.youtube.com/c/wollondillyshire	Council Meeting Wrap Up February 2022	2NhuhF2fBu8	20220224	107.840	g_NNP_0.399 'day_NNP_0.399 guys_NNS_0.799 and_CC_1.12 welcome_JJ_1.199 ...	62/64 Menangle St, Picton NSW 2571, Australia	(-34.1700078, 150.612913)	341
4	AUS	NSW	Wollondilly Shire Council	Wollondilly Shire	https://www.youtube.com/c/wollondillyshire	CITY DEAL 4 March 2018	4-cv69ZcwVs	20180305	130.159	[Music]_XX_0.85 it_PRP_2.27 's_VBZ_2.27 a_DT_3.27 fantastic_JJ_3.36 ...	62/64 Menangle St, Picton NSW 2571, Australia	(-34.1700078, 150.612913)	420

Locations

Channels have been sampled in all of the principal provinces, territories and regions of Australia and New Zealand. The map below shows the locations, names, and sizes of the sampled channels.

Sample script

Corpus hits can be presented as a table with links to the videos at the times of utterance, permitting quick manual filtering and annotation of audio and video material. This Python script will retrieve all instances of "good on you" in the PoS-tagged text in the table coanzse_df with the immediate context:


import re
hits = []
pat1 = re.compile("((\\w+_\\S+_\\S+\\s){3}good_\\w+_\\S+\\son_\\w+_\\S+\\syou_\\w+_\\S+\\s(\\w+_\\S+_\\S+\\s){3})",re.IGNORECASE)
for i,x in coanzse_df.iterrows():
	if pat1.search(x["text_pos"]):
		finds1 = pat1.findall(x["text_pos"])[0]
		seq = " ".join([x.split("_")[0] for x in finds1[0].split()])
		time = finds1[0].split()[0].split("_")[-1] 
		hits.append((x["country"],x["channel_title"],x["video_title"],seq,"https://youtu.be/"+x["video_id"]+"?t="+str(round(float(time)-3))))
pd.DataFrame(hits)

(...)

Access

The corpus is available free of charge for the purposes of research, education, and scholarship. Users may not redistribute the corpus or use the corpus for commercial purposes. The corpus is available through the Harvard Dataverse here. Please also see the Corpus of North American Spoken English and the Corpus of British Isles Spoken English, created using similar methods.

Citation

To cite the corpus, please use

Coats, Steven. 2022. The Corpus of Australian and New Zealand Spoken English: A new resource of naturalistic speech transcripts. In Pradeesh Parameswaran, Jennifer Biggs, and David Powers (eds.), Proceedings of the The 20th Annual Workshop of the Australasian Language Technology Association, 1–5. Adelaide, Australia: Australasian Language Technology Association. https://aclanthology.org/2022.alta-1.1/