Learning Objectives
By the end of this tutorial, you will be able to:
- Train your own word embeddings from scratch
- Test for implicit associations using word embeddings trained on a large language corpus
Target audience
This tutorial is aimed at beginners with some knowledge of R.
Setting up the computational environment
The following R packages are required:
require(quanteda)
require(lsa)
require(rsparse)
require(sweater)
Duration
Around 15 minutes
Basic concepts
Cosine similarity
Suppose we have the following corpus of 10 documents:
Doc 1: berlin is the capital of germany
Doc 2: paris is the capital of france
Doc 3: tokyo is the capital of japan
Doc 4: the cat is weird
Doc 5: berlin
Doc 6: paris is nice
Doc 7: paris is nice
Doc 8: paris is nice
Doc 9: paris is nice
Doc 10: berlin is weird
The unique token types are: “berlin”, “is”, “the”, “capital”, “of”, “germany”, “paris”, “france”, “tokyo”, “japan”, “cat”, “weird”, “nice”. The representation of the above corpus as a document-term matrix is:
Doc | berlin | is | the | capital | of | germany | paris | france | tokyo | japan | cat | weird | nice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
4 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
7 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
8 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
9 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
10 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Each row vector is the vector representation of the corresponding document. One way to compare the similarity between two documents is to calculate the cosine similarity. The cosine similarity between two vectors (\(\mathbf{A}\), \(\mathbf{B}\)) is defined as:
\[ \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} \]
For example, the cosine similarity between Doc 1 [1,1,1,1,1,1,0,0,0,0,0,0,0]
and Doc 2 [0,1,1,1,1,0,1,1,0,0,0,0,0]
is:
\[ \begin{aligned} \mathbf{A} \cdot \mathbf{B} &= 1 \times 0 + 1 \times 1 + 1 \times 1 +1 \times 1 +1 \times 1 + 1 \times 0 + 0 \times 1 + 0 \times 1 + 0 \times 0 + 0 \times 0 + 0 \times 0 + 0 \times 0 + 0 \times 0 \\ &= 4\\ \|\mathbf{A}\| &= \sqrt{1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 0^2 + 0^2 + 0^2 + 0^2 + 0^2 + 0^2 + 0^2} \\ &= \sqrt{6}\\ \|\mathbf{B}\| &= \sqrt{0^2 + 1^2 + 1^2 + 1^2 + 1^2 + 0^2 + 1^2 + 1^2 + 0^2 + 0^2 + 0^2 + 0^2 + 0^2} \\ &= \sqrt{6}\\ \cos(\theta) &= { 4 \over \sqrt{6} \times \sqrt{6}}\\ &= 0.\overline{6} \end{aligned} \]
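The same arithmetic can be checked directly in base R from the definition (a minimal sketch; the object names vec_a and vec_b are introduced here only for illustration):

vec_a <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0)  ## Doc 1
vec_b <- c(0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0)  ## Doc 2
sum(vec_a * vec_b) / (sqrt(sum(vec_a^2)) * sqrt(sum(vec_b^2)))  ## 0.6666667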
In R, lsa (Wild, 2022) can be used to calculate cosine similarity.

library(lsa)
doc1 <- c(1,1,1,1,1,1,0,0,0,0,0,0,0)
doc2 <- c(0,1,1,1,1,0,1,1,0,0,0,0,0)
cosine(doc1, doc2)
[,1]
[1,] 0.6666667
Or using quanteda (Benoit et al., 2018):

library(quanteda)
docs <- c("berlin is the capital of germany",
          "paris is the capital of france",
          "tokyo is the capital of japan",
          "the cat is weird",
          "berlin",
          "paris is nice",
          "paris is nice",
          "paris is nice",
          "paris is nice",
          "berlin is weird")
docs_dtm <- corpus(docs) %>% tokens() %>% dfm()
docs_dtm
Document-feature matrix of: 10 documents, 13 features (70.77% sparse) and 0 docvars.
features
docs berlin is the capital of germany paris france tokyo japan
text1 1 1 1 1 1 1 0 0 0 0
text2 0 1 1 1 1 0 1 1 0 0
text3 0 1 1 1 1 0 0 0 1 1
text4 0 1 1 0 0 0 0 0 0 0
text5 1 0 0 0 0 0 0 0 0 0
text6 0 1 0 0 0 0 1 0 0 0
[ reached max_ndoc ... 4 more documents, reached max_nfeat ... 3 more features ]
cosine(as.vector(docs_dtm[1,]), as.vector(docs_dtm[2,]))
[,1]
[1,] 0.6666667
The cosine similarity between Doc 1 and Doc 6 is much lower, as the two documents share only one word, “is”.

doc6 <- c(0,1,0,0,0,0,1,0,0,0,0,0,1)
## or
## cosine(as.vector(docs_dtm[1,]), as.vector(docs_dtm[6,]))
cosine(doc1, doc6)
[,1]
[1,] 0.2357023
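For pairwise similarities across all documents at once, the quanteda.textstats package (not part of the setup above; this is an optional sketch assuming it is installed) provides textstat_simil():

library(quanteda.textstats)
## cosine similarity between every pair of documents in the dfm
textstat_simil(docs_dtm, margin = "documents", method = "cosine")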
One-hot vectors
In the traditional “bag of words” text representation, a word is represented as a so-called “one-hot” vector. In the above example, the fifth document contains just one word, “berlin”. This document is represented as the row vector [1,0,0,0,0,0,0,0,0,0,0,0,0]
in the document-term matrix. This vector is sparse (many zeros). One can also read it as the “one-hot” vector representation of the word “berlin”, because there is exactly one “1” in the entire vector. The one-hot representation of a word is not very useful for comparing words: the cosine similarity between the “one-hot” vectors of two different words is always 0.
## comparing "berlin" and "paris"
cosine(c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0))
[,1]
[1,] 0
## comparing "berlin" and "cat"
cosine(c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0))
[,1]
[1,] 0
## comparing "berlin" and "nice"
cosine(c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1))
[,1]
[1,] 0
## comparing "paris" and "nice"
cosine(c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0),
c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1))
[,1]
[1,] 0
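The same orthogonality can be demonstrated for all token types at once; a minimal sketch (the objects vocab and one_hot are introduced here only for illustration):

vocab <- c("berlin", "is", "the", "capital", "of", "germany", "paris",
           "france", "tokyo", "japan", "cat", "weird", "nice")
one_hot <- diag(length(vocab))   ## identity matrix: one row per token type
rownames(one_hot) <- vocab
cosine(one_hot["berlin", ], one_hot["paris", ])  ## 0, as for any two distinct words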
However, we would anticipate that “berlin” and “paris” are more similar in latent meaning (at least both are described in the context of “capital”) than “berlin” and “cat”.
(Dense) word vectors
An improvement is to use “dense” word vectors. One way is to train word embeddings on the corpus in order to generate dense word vectors. Word embeddings capture the distributional semantics of words: words that are used and occur in the same contexts tend to have similar meanings.
In the following example, GloVe embeddings are trained (Pennington et al., 2014). The GloVe algorithm is based on a weighted feature co-occurrence matrix (FCM) representation of the corpus, which can be generated by quanteda (Benoit et al., 2018).
Three parameters need to be specified: WINDOW_SIZE, RANK, and LEARNING_RATE. WINDOW_SIZE determines how close two words must be to each other to be counted as co-occurring. RANK determines the length of the output word vectors. LEARNING_RATE determines the learning rate of the algorithm.
The FCM is a square matrix in which each cell represents how frequently two words co-occur, weighted by their distance within the window.
WINDOW_SIZE <- 3

weighted_fcm_corpus <- corpus(docs) %>% tokens() %>%
    fcm(window = WINDOW_SIZE, weights = 1 / seq_len(WINDOW_SIZE),
        count = "weighted", context = "window", tri = TRUE)
weighted_fcm_corpus
Feature co-occurrence matrix of: 13 by 13 features.
features
features berlin is the capital of germany paris france tokyo
berlin 0 2 0.5 0.3333333 0 0 0 0 0
is 0 0 3.5 1.5000000 1.0 0 5.0000000 0 1.0000000
the 0 0 0 3.0000000 1.5 0.3333333 0.5000000 0.3333333 0.5000000
capital 0 0 0 0 3.0 0.5000000 0.3333333 0.5000000 0.3333333
of 0 0 0 0 0 1.0000000 0 1.0000000 0
germany 0 0 0 0 0 0 0 0 0
paris 0 0 0 0 0 0 0 0 0
france 0 0 0 0 0 0 0 0 0
tokyo 0 0 0 0 0 0 0 0 0
japan 0 0 0 0 0 0 0 0 0
features
features japan
berlin 0
is 0
the 0.3333333
capital 0.5000000
of 1.0000000
germany 0
paris 0
france 0
tokyo 0
japan 0
[ reached max_feat ... 3 more features, reached max_nfeat ... 3 more features ]
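Individual weights can be checked against the corpus; a small sketch (assuming the window weights 1 / seq_len(3), i.e. 1, 0.5, and 0.33): “berlin” and “is” are adjacent in Doc 1 and Doc 10 (weight 1 each, hence 2 in total), while “berlin” and “the” are two tokens apart in Doc 1 (weight 0.5).

fcm_dense <- as.matrix(weighted_fcm_corpus)  ## coerce the sparse FCM to a dense matrix
fcm_dense["berlin", "is"]    ## 2
fcm_dense["berlin", "the"]   ## 0.5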
library(rsparse)

RANK <- 5
LEARNING_RATE <- 0.05

glove <- GloVe$new(rank = RANK, x_max = RANK / 2, learning_rate = LEARNING_RATE)
wv_main <- glove$fit_transform(weighted_fcm_corpus, n_iter = 100,
                               convergence_tol = 0.01, n_threads = 8)
wv_context <- glove$components
dense_vectors <- wv_main + t(wv_context)
The (dense) word vectors are, as the name suggests, dense. Their dimensionality is equal to RANK. We can also think of the following matrix as a multidimensional word embedding space.
dense_vectors
[,1] [,2] [,3] [,4] [,5]
berlin 0.7962076 0.21984710 -0.06590360 0.4640666 0.200658567
is 1.1302406 1.03270491 -1.09595404 -0.1845924 0.874482343
the -0.8397201 0.29093737 -1.16928236 -0.7077817 -0.125785228
capital -1.6771152 0.25861811 -1.07368898 -0.3093713 -0.513707813
of -1.1292377 -0.34534147 -0.15581791 0.3416199 -0.078691491
germany 0.0763018 0.04335321 0.59521219 0.4454654 0.518075303
paris 1.5443045 0.57864880 -0.68058169 0.2215021 0.502033449
france 0.1234471 0.01210106 1.01941268 0.1776582 0.317007046
tokyo 0.3408898 -0.85104938 0.99850738 0.3228591 0.711689488
japan -0.1582605 -0.03597847 0.91172650 0.3585346 0.009605806
cat 0.1558439 -0.22169701 -0.24271641 0.5368465 -0.249243583
weird 0.4737327 0.26432514 0.04800918 -0.2826121 0.212263138
nice 0.8290189 0.49207891 0.19336488 -0.2503394 0.587893290
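A quick sanity check on the shape of the embedding space (13 token types, each represented by a RANK-dimensional vector):

dim(dense_vectors)
## [1] 13  5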
And the row vectors can be compared using cosine similarity. Now, we can see that the similarity between “berlin” and “paris” is higher than “berlin” and “cat”.
cosine(dense_vectors["berlin",], dense_vectors["paris",])
[,1]
[1,] 0.8859385
cosine(dense_vectors["berlin",], dense_vectors["cat",])
[,1]
[1,] 0.4307027
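To rank every word in the embedding space by its similarity to a chosen word, a small helper function can be written; this is not part of the tutorial workflow, just a sketch combining base R with lsa::cosine() (the name sim_to is introduced here for illustration):

## rank all words by cosine similarity to a chosen word
## (the word itself comes first with a similarity of 1)
sim_to <- function(word, vectors) {
  sims <- apply(vectors, 1, function(v) cosine(vectors[word, ], v))
  sort(sims, decreasing = TRUE)
}
sim_to("berlin", dense_vectors)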
Target words and attribute words
Suppose “paris”, “tokyo”, and “berlin” are the words we are interested in. We call these words target words. We can determine how similar these words are to some other words in the word embedding space. Suppose we set “nice” to be the attribute word. By determining the similarities between the target words and the attribute word, we can see the implicit associations between words. In this example, we can see that “paris” is more strongly associated with “nice” than “tokyo” and “berlin” are.
cosine(dense_vectors["paris",], dense_vectors["nice",])
[,1]
[1,] 0.7643881
cosine(dense_vectors["tokyo",], dense_vectors["nice",])
[,1]
[1,] 0.214991
cosine(dense_vectors["berlin",], dense_vectors["nice",])
[,1]
[1,] 0.6653055
Pretrained word embeddings
We can train our own word embeddings, but there are also several pretrained word embeddings, trained on large corpora, available for download:
- word2vec trained on Google News
- GloVe trained on Wikipedia, Common Crawl, and Gigaword
- fastText trained on Wikipedia and Common Crawl
In the following example, we will use the pretrained GloVe word embeddings to replicate the findings by Caliskan et al. (2017) and the pretrained word2vec word embeddings to replicate the findings from Garg et al. (2018). The R package sweater (Chan, 2022) can be used to read the downloaded word embedding file.
library(sweater)
glove <- read_word2vec("glove.840B.300d.txt")
The package also provides a subset of the word embeddings called glove_math.
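Before running a query, the object can be inspected like any other matrix of word vectors; a quick sketch (output omitted, and assuming the usual layout of one row per word with the words as row names):

dim(glove_math)
head(rownames(glove_math))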
Query
sweater uses the concept of a query to look for associations.
A query requires two sets of words: Target words (\(\mathcal{S}\), \(\mathcal{T}\)) and Attribute words (\(\mathcal{A}\), \(\mathcal{B}\)). The package provides different methods and they require different combinations of \(\mathcal{S}\), \(\mathcal{T}\), \(\mathcal{A}\), and \(\mathcal{B}\).
Method | Target words | Attribute words |
---|---|---|
Mean Average Cosine Similarity | \(\mathcal{S}\) | \(\mathcal{A}\) |
Relative Norm Distance | \(\mathcal{S}\) | \(\mathcal{A}\), \(\mathcal{B}\) |
Relative Negative Sentiment Bias | \(\mathcal{S}\) | \(\mathcal{A}\), \(\mathcal{B}\) |
SemAxis | \(\mathcal{S}\) | \(\mathcal{A}\), \(\mathcal{B}\) |
Normalized Association Score | \(\mathcal{S}\) | \(\mathcal{A}\), \(\mathcal{B}\) |
Embedding Coherence Test | \(\mathcal{S}\) | \(\mathcal{A}\), \(\mathcal{B}\) |
Word Embedding Association Test | \(\mathcal{S}\), \(\mathcal{T}\) | \(\mathcal{A}\), \(\mathcal{B}\) |
All methods use the same query function.
query(w, S_words, T_words, A_words, B_words, method = "guess", verbose = FALSE)
Case study: Gender biases in word embeddings
Word Embedding Association Test
The Word Embedding Association Test (WEAT) (Caliskan et al., 2017) requires all four word sets \(\mathcal{S}\), \(\mathcal{T}\), \(\mathcal{A}\), and \(\mathcal{B}\). The method is modeled after the Implicit Association Test (IAT): it measures the relative strength of \(\mathcal{S}\)’s association with \(\mathcal{A}\) versus \(\mathcal{B}\), against the same for \(\mathcal{T}\).
require(sweater)

S <- c("math", "algebra", "geometry", "calculus", "equations", "computation",
       "numbers", "addition")
T <- c("poetry", "art", "dance", "literature", "novel", "symphony", "drama",
       "sculpture")
A <- c("male", "man", "boy", "brother", "he", "him", "his", "son")
B <- c("female", "woman", "girl", "sister", "she", "her", "hers", "daughter")
sw <- query(glove_math, S, T, A, B)
sw
── sweater object ──────────────────────────────────────────────────────────────
Test type: weat
Effect size: 1.055015
── Functions ───────────────────────────────────────────────────────────────────
• `calculate_es()`: Calculate effect size
• `weat_resampling()`: Conduct statistical test
The effect size can be interpreted the same way as Cohen’s d. In this case, the effect size is positive, indicating that the math-related concepts are more associated with male words than with female words, whereas the art-related concepts are more associated with female words than with male words.
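The functions listed in the printout can be used for follow-up analyses (output omitted here):

calculate_es(sw)     ## point estimate of the effect size, as printed above
weat_resampling(sw)  ## conduct a statistical (resampling) test of the WEAT effect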
Relative Norm Distance
Garg et al. (2018) use Relative Norm Distance to quantify gender biases of occupation words in the pretrained Google News Word Embeddings. The method requires one set of target words and two sets of attribute words.
<- c("janitor", "statistician", "midwife", "bailiff", "auctioneer",
S1 "photographer", "geologist", "shoemaker", "athlete", "cashier",
"dancer", "housekeeper", "accountant", "physicist", "gardener",
"dentist", "weaver", "blacksmith", "psychologist", "supervisor",
"mathematician", "surveyor", "tailor", "designer", "economist",
"mechanic", "laborer", "postmaster", "broker", "chemist", "librarian",
"attendant", "clerical", "musician", "porter", "scientist", "carpenter",
"sailor", "instructor", "sheriff", "pilot", "inspector", "mason",
"baker", "administrator", "architect", "collector", "operator",
"surgeon", "driver", "painter", "conductor", "nurse", "cook",
"engineer", "retired", "sales", "lawyer", "clergy", "physician",
"farmer", "clerk", "manager", "guard", "artist", "smith", "official",
"police", "doctor", "professor", "student", "judge", "teacher",
"author", "secretary", "soldier")
<- c("he", "son", "his", "him", "father", "man", "boy", "himself",
A1 "male", "brother", "sons", "fathers", "men", "boys", "males",
"brothers", "uncle", "uncles", "nephew", "nephews")
<- c("she", "daughter", "hers", "her", "mother", "woman", "girl",
B1 "herself", "female", "sister", "daughters", "mothers", "women",
"girls", "females", "sisters", "aunt", "aunts", "niece", "nieces"
)<- query(googlenews, S_words = S1, A_words = A1, B_words = B1)
res res
── sweater object ──────────────────────────────────────────────────────────────
Test type: rnd
Effect size: -6.341598
── Functions ───────────────────────────────────────────────────────────────────
• `calculate_es()`: Calculate effect size
• `plot()`: Plot the bias of each individual word
A more positive effect size would indicate that the words in \(\mathcal{S}\) are more associated with \(\mathcal{B}\). As the effect size here is negative, it indicates that the concept of occupation is more associated with \(\mathcal{A}\), i.e. the male words.
We can also produce a visualization to study the differences among target words. Target words such as “nurse”, “midwife”, and “housekeeper” are more associated with female than male.
plot(res)
Conclusion
In this tutorial, I showed how the quantification of implicit associations between words works. The R package sweater was introduced for this task.
Social Science Usecase(s)
This method has been used in previous studies to evaluate implicit biases against minority groups in large news corpora (e.g. Müller et al., 2023).