Description
Duct tape the ‘quanteda’ ecosystem (Benoit et al., 2018) doi:10.21105/joss.00774 to modern Transformer-based text classification models (Wolf et al., 2020) doi:10.18653/v1/2020.emnlp-demos.6, in order to facilitate supervised machine learning for textual data. This package mimics the behaviors of ‘quanteda.textmodels’ and provides a function to set up the ‘Python’ environment needed to use the pretrained models from ‘Hugging Face’ https://huggingface.co/. More information: doi:10.5117/CCR2023.1.003.CHAN.
Keywords
- Deep Learning
- Supervised machine learning
- Text analysis
Use Cases
This package can be used in any typical supervised machine learning use case involving text data. The software paper (Chan, 2023) presents several cases, e.g. prediction of incivility based on tweets (Theocharis et al., 2020).
Input Data
grafzahl accepts text data as either a character vector or the corpus data structure of quanteda.
Sample Input and Output Data
A sample input is a corpus. This is an example dataset:
library(grafzahl)
library(quanteda)
unciviltweets
Corpus consisting of 19,982 documents and 1 docvar.
text1 :
"@ @ Karma gave you a second chance yesterday. Start doing m..."
text2 :
"@ With people like you, Steve King there's still hope for we..."
text3 :
"@ @ You bill is a joke and will sink the GOP. #WEDESERVEBETT..."
text4 :
"@ Dream on. The only thing trump understands is how to enric..."
text5 :
"@ @ Just like the Democrat taliban party was up front with t..."
text6 :
"@ you are going to have more of the same with HRC, and you a..."
[ reached max_ndoc ... 19,976 more documents ]
The output is an S3 object.
Hardware Requirements
Grafzahl runs on any machine that can run R. A GPU that supports CUDA is optional.
Environment Setup
With R installed:
install.packages("grafzahl")
How to Use
Before training, please set up the conda environment.
setup_grafzahl(cuda = TRUE) ## if you have GPU(s)
A typical workflow for training a model and making predictions:
input <- corpus(ecosent, text_field = "headline")
training_corpus <- corpus_subset(input, !gold)
Use the x (text data), y (label, in this case a docvar), and model_name (model name, from Hugging Face) parameters to control how the supervised machine learning model is trained.
model2 <- grafzahl(x = training_corpus,
                   y = "value",
                   model_name = "GroNLP/bert-base-dutch-cased")
test_corpus <- corpus_subset(input, gold)
predict(model2, test_corpus)
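Once predictions are available, a natural next step is to compare them with the gold labels. The following is a minimal base R sketch of that comparison, not part of grafzahl itself. The vectors `predicted` and `gold_labels` and their label values are hypothetical stand-ins; in practice they would presumably come from the output of predict() and from the label docvar of the test corpus.

```r
# Hypothetical stand-ins (assumptions for illustration only):
# `predicted` plays the role of the predict() output,
# `gold_labels` the role of the true labels of the test corpus.
predicted   <- c("pos", "neg", "neu", "pos", "neg", "neu")
gold_labels <- c("pos", "neg", "pos", "pos", "neu", "neu")

# Use one shared set of factor levels so the confusion matrix is square,
# even if a label never appears among the predictions.
lvls <- sort(unique(c(predicted, gold_labels)))
confusion <- table(predicted = factor(predicted, levels = lvls),
                   gold = factor(gold_labels, levels = lvls))

# Accuracy is the share of documents on the diagonal (prediction == gold).
accuracy <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```

Fixing the factor levels up front keeps the diagonal meaningful; without it, table() can return a non-square matrix when a class is never predicted.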
Technical Details
See the publication for tested and selected models and parameters, the reasoning behind the model selection, and employed datasets for training.
References
- Chan, C. H. (2023). grafzahl: fine-tuning Transformers for text data from within R. Computational Communication Research, 5(1), 76. https://doi.org/10.5117/CCR2023.1.003.CHAN
Contact Details
Maintainer: Chung-hong Chan chainsawtiney@gmail.com
Issue Tracker: https://github.com/gesistsa/grafzahl/issues