class: center, middle, inverse, title-slide

# Exploring Code-switching and Borrowing Using Word Vectors
### <div class="line-block">Steven Coats<br />
English Philology, University of Oulu, Finland<br />
<a href="mailto:steven.coats@oulu.fi">steven.coats@oulu.fi</a></div>
### <div class="line-block"><br />
14th ESSE Conference, Brno<br />
September 1st, 2018</div>

---

class: inverse, center, middle
background-image: url(https://cc.oulu.fi/~scoats/oululogoRedTransparent.png);
background-repeat: no-repeat;
background-size: 80px 57px;
background-position:right top;
exclude: true

---

layout: true

<div class="my-footer"><span>Steven Coats  
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;Exploring Code-switching and Borrowing using Word Vectors | ESSE 18</span></div>

---

## Outline

1. German-English code-switching and borrowing in a Twitter data set

2. Data collection

3. Word vectors and embeddings

4. Identification of code-switches and borrowings

5. Tracing changes in meaning and visualizing borrowings

.footnote[Slides for the presentation are on my homepage at https://cc.oulu.fi/~scoats]

---

### German-English code-switching and borrowing

Code-switching: Use of two or more languages in an utterance/turn

Lexical borrowing: Use of a second-language lexical item as a lone element within an L1 matrix (Myers-Scotton 1997)

#### German-English code-switched tweet
- Der Dauerregen ist zuviel für unseren Platz, daher müssen wir das heutige Spiel absagen. - Land under, the rain won today. So it's a rainout. .small[(*The unrelenting rain is too much for our pitch, so we have to cancel today's game...*)]

#### German tweets with English borrowings
- @user Gibt es schon einen Ort, also location für das Treffen? .small[(*@user Is there already a place, that is, a location for the meet-up?*)]

- Bahhhh draußen scheint voll die Sonne. EKELHAFT. Erinnert mich daran, dass ich noch immer nicht in shape für den Sommer bin. ![](https://twemoji.maxcdn.com/16x16/1f629.png) .small[(*Bahhhh the sun is totally shining outside. DISGUSTING. Reminds me that I'm still not in shape for summer.* ![](https://twemoji.maxcdn.com/16x16/1f629.png))]

---

### German-English code-switching and borrowing in a Twitter data set

- Lexical borrowings can undergo semantic shift: many Anglicisms in German have meanings that diverge from their meanings in English (Onysko 2007)

@user Danke für den like ![](https://twemoji.maxcdn.com/16x16/1f49c.png) .small[(*@user Thanks for the like!* ![](https://twemoji.maxcdn.com/16x16/1f49c.png))]

@user yo danke für deinen follow bro! ich weiß das zu schätzen. #RealHipHop .small[(*@user yo thanks for your follow bro! I appreciate it. #RealHipHop*)]

- How can we trace the semantic shift of English borrowings in German?

- By using word embeddings from large corpora that contain **borrowings**

---

### Data collection

- 653,457,659 tweets with *place* metadata collected globally via the Twitter Streaming API between November 2016 and June 2017

- 60,683 authors identified who posted at least one German-language tweet with place metadata from Germany, Austria, or Switzerland; all of their tweets, or their most recent 3,250 where the timeline was longer, downloaded from the REST API in April 2018

- Retain tweets in German according to Twitter's metadata

- 36,240,530 tweets (59.3%) are in German, comprising 534,211,366 tokens
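
The language filter can be sketched as follows. This is only an illustration, assuming the downloaded tweets are stored one JSON object per line; the helper name `german_tweets` and the two sample records are made up for the example. It relies on the `lang` field of the tweet object, which holds Twitter's own language guess.

```python
import json

def german_tweets(lines):
    """Yield the text of tweets whose `lang` metadata field is "de"."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        tweet = json.loads(line)
        if tweet.get("lang") == "de":
            # Extended payloads carry `full_text`; older ones use `text`.
            yield tweet.get("full_text") or tweet.get("text", "")

# Two made-up records standing in for downloaded API output.
sample = [
    '{"lang": "de", "text": "Danke für den like"}',
    '{"lang": "en", "text": "Thanks for the like"}',
]
print(list(german_tweets(sample)))
```

Only the first sample record passes the filter.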

---

### Identifying sentences with code-switching/borrowing

- Tokenize all tweets, remove punctuation, URLs, user names, hashtags, emoji

- Match each word in each message against large German and English word lists

Anyone in Oberwart, der am Abend nach Wien fährt & mir was abholen und mitbringen könnte? Biete Aufwandsentschädigung & ewige Dankbarkeit! .small[(*Anyone in Oberwart who is driving this evening to Vienna and can pick up something and bring it to me? I offer reimbursement for the effort & eternal thanks!*)]

- 1 English word of 19: Borrowing

Seit Anfang dieses Jahres habe ich soooo oft heißhunger auf Asiatisches Essen. This year I have so often cravings for asian #food. ![](https://twemoji.maxcdn.com/16x16/1f371.png)![](https://twemoji.maxcdn.com/16x16/1f359.png)![](https://twemoji.maxcdn.com/16x16/1f35b.png)![](https://twemoji.maxcdn.com/16x16/1f364.png)![](https://twemoji.maxcdn.com/16x16/1f363.png)

- 9 English words of 21: Code-switching

- We can distinguish code-switching from borrowing on the basis of counts of English and German types
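
A minimal sketch of this counting approach is given below. The word sets are tiny stand-ins for the real lexica, and the names `tokenize`, `classify`, and `max_borrowed` are hypothetical. Two assumptions go beyond the slides: a token counts as English only if it is absent from the German list, and anything above two English tokens is labeled code-switching.

```python
import re

# Toy stand-ins for the large word lists; the real lexica are far bigger.
GERMAN = {"gibt", "es", "schon", "einen", "ort", "also", "für", "das", "treffen"}
ENGLISH = {"location", "the", "rain", "won", "today", "so", "it's", "a", "rainout"}

# Material stripped before matching: URLs, user names, hashtags.
STRIP = re.compile(r"https?://\S+|[@#]\w+")

def tokenize(text):
    """Lowercase, drop URLs/user names/hashtags, and keep runs of letters."""
    cleaned = STRIP.sub(" ", text.lower())
    # Letters only (no digits, punctuation, or emoji), with an optional
    # apostrophe-joined suffix so that forms like "it's" stay whole.
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", cleaned)

def classify(text, max_borrowed=2):
    """Label a tweet by the number of tokens found only in the English list."""
    tokens = tokenize(text)
    english = [t for t in tokens if t in ENGLISH and t not in GERMAN]
    if not english:
        label = "german"
    elif len(english) <= max_borrowed:  # assumed cut-off, see lead-in
        label = "borrowing"
    else:
        label = "code-switching"
    return label, english

print(classify("@user Gibt es schon einen Ort, also location für das Treffen?"))
print(classify("Land under, the rain won today. So it's a rainout."))
```

With these toy lists the first tweet is labeled a borrowing (one English token, *location*) and the second a code-switch.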

---

### Lexica

English words:

- 236,736 English words from NLTK (Bird et al. 2009)

German words:

- 50k most frequent German words ([Dave 2017](https://github.com/hermitdave/FrequencyWords/tree/master/content/2016/de), Lison & Tiedemann 2016)

Issue: Cross-contamination of word lists (built automatically from web sources)
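
One way to inspect the cross-contamination is to intersect the two lists. The sketch below uses toy lists and a hypothetical helper, `split_lexica`; it is not necessarily how the overlap was handled in this study.

```python
def split_lexica(english, german):
    """Separate words unique to each list from words that appear in both."""
    english = {w.lower() for w in english}
    german = {w.lower() for w in german}
    shared = english & german
    return english - shared, german - shared, shared

# Toy word lists; the real ones come from NLTK and the frequency list.
english_only, german_only, shared = split_lexica(
    ["hand", "also", "location", "shape"],
    ["Hand", "also", "Ort", "location"],
)
print(sorted(shared))
print(sorted(english_only), sorted(german_only))
```

The shared set mixes different cases: homographs such as *also*, cognates such as *Hand*, and established borrowings such as *location* that already appear in the German frequency list. Discarding all shared words would therefore also discard some genuine borrowings.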

---

### Corpus for word embeddings

- Tweets with at least 8 tokens, of which one or two are English words from the list

- 2,488,673 of the 36 million German tweets contain borrowings (many more contain code-switches!)

---

### Word embeddings

- Distributional hypothesis (Harris 1968): Word meanings correspond to their aggregate contexts of use

- Collocational information can be represented with vectors of co-occurrence probabilities within a word span

- Similarity of collocational context (and thus meaning) for any two types in a data set (corpus) can then be quantified

- Word2Vec algorithm (Mikolov et al. 2013) in Gensim (Řehůřek and Sojka 2010), 5-token co-occurrence span, minimum of 20 occurrences, 200-dimensional vectors

- Vectors for 51,336 types (mostly German words, but many English words as well)
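
The idea of representing a type by its collocates can be illustrated with plain co-occurrence counts. This is a deliberately simpler, count-based stand-in: Word2Vec learns dense vectors by prediction rather than by counting. The function name `cooccurrence_vectors` is hypothetical, and `span` is assumed to mean the number of tokens on each side of the target. In Gensim, the settings listed above correspond, as far as I know, to the `window`, `min_count`, and `vector_size` parameters (`size` in versions before 4.0).

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, span=5, min_count=1):
    """Count, for each frequent type, the types within `span` tokens of it."""
    freq = Counter(tok for sent in sentences for tok in sent)
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, target in enumerate(sent):
            if freq[target] < min_count:
                continue  # ignore rare types, as with a minimum-count cut-off
            lo, hi = max(0, i - span), min(len(sent), i + span + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[target][sent[j]] += 1
    return vectors

# Two tokenized toy tweets taken from the examples earlier in the slides.
corpus = [
    ["danke", "für", "den", "like"],
    ["danke", "für", "deinen", "follow"],
]
vecs = cooccurrence_vectors(corpus, span=2)
print(vecs["für"])
print(vecs["like"])
```

Types that share collocates, here *like* and *follow*, end up with overlapping count vectors, which is what a similarity measure can then pick up.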

---

### Cosine similarity

- For word types `$a$` and `$b$`, corresponding to vectors `$\mathbf{a}$` and `$\mathbf{b}$`:

`$$\text{similarity} = \cos(\theta) = {\mathbf{a} \cdot \mathbf{b} \over \|\mathbf{a}\| \|\mathbf{b}\|} = \frac{ \sum\limits_{i=1}^{n}{a_i  b_i} }{ \sqrt{\sum\limits_{i=1}^{n}{a_i^2}}  \sqrt{\sum\limits_{i=1}^{n}{b_i^2}} }$$`

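- Values range from -1 (vectors point in opposite directions: the types occur in very different contexts, so their meanings are probably very different) to 1 (vectors point in the same direction: the types occur in the same contexts, so their meanings are probably very similar)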
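
The formula translates directly into a few lines of Python. The sketch below uses only the standard library; the function name `cosine_similarity` and the example vectors are chosen for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(a, b):
    """Return a·b / (‖a‖‖b‖) for two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

Up to floating-point rounding, the three calls give values at the top, middle, and bottom of the range. Because the measure depends only on direction, scaling a vector does not change the result.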

---

### Vectors for *mann*, *frau*, *könig*, and *königin*

![](unnamed-chunk-2-1.png)

---

### Cosine similarity

![](fraumannex.png)

- Cosine similarity preserves semantic relations

---

### Cosine similarity

![](ESSEexMannApps.png)

---

### Vectors for *banane*, *apfel*, *melone*, and *peach*

![](unnamed-chunk-4-1.png)

---

### Cosine similarity

![](fruitex.png)

- When the training data contains many borrowings, some semantic relations are preserved cross-linguistically

---

### Research questions

- Which English borrowings are closest in meaning to their German translations, and which are furthest from them (i.e. have undergone semantic shift)?

- What role do frequency effects play?

- How can we visualize the meanings of the Anglicisms in the German lexicon?

---