Methods Hub (beta)

grafzahl

Abstract:

Duct tape the quanteda ecosystem to modern Transformer-based text classification models.

Type: Method
Topics: Data Analysis
License: GNU General Public License v3.0 only
Programming Language: R

Description

Duct tape the ‘quanteda’ ecosystem (Benoit et al., 2018) doi:10.21105/joss.00774 to modern Transformer-based text classification models (Wolf et al., 2020) doi:10.18653/v1/2020.emnlp-demos.6, in order to facilitate supervised machine learning for textual data. This package mimics the behavior of ‘quanteda.textmodels’ and provides a function to set up the ‘Python’ environment needed to use the pretrained models from ‘Hugging Face’ https://huggingface.co/. More information: doi:10.5117/CCR2023.1.003.CHAN.

Keywords

  • Deep Learning
  • Supervised machine learning
  • Text analysis

Use Cases

This package can be used in any typical supervised machine learning use case involving text data. The software paper (Chan, 2023) presents several cases, e.g. the prediction of incivility in tweets (Theocharis et al., 2020).

Input Data

grafzahl accepts text data as either a character vector or a quanteda corpus.
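
As a rough illustration of the two input forms, the sketch below builds the same toy data once as a character vector and once as a quanteda corpus. The texts, the labels, and the docvar name sentiment are made up for this example; the code only uses quanteda and does not call grafzahl. With a corpus, the label can be referred to by docvar name (as in the y = "value" call under How to Use); with a character vector, the labels would presumably be supplied as a separate vector of the same length.

```r
library(quanteda)

# Hypothetical toy texts and labels, for illustration only
texts <- c("The economy is growing.",
           "Unemployment rises again.",
           "Markets remain stable.")
labels <- c("positive", "negative", "neutral")

# Form 1: a plain character vector; labels live in a separate vector
is.character(texts)

# Form 2: a quanteda corpus; labels are stored as a document variable (docvar)
toy_corpus <- corpus(texts, docvars = data.frame(sentiment = labels))
ndoc(toy_corpus)
docvars(toy_corpus, "sentiment")
```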

Sample Input and Output Data

A sample input is a corpus. The unciviltweets dataset is an example:

library(grafzahl)
library(quanteda)
unciviltweets
Corpus consisting of 19,982 documents and 1 docvar.
text1 :
"@ @ Karma gave you a second chance yesterday.  Start doing m..."

text2 :
"@ With people like you, Steve King there's still hope for we..."

text3 :
"@ @ You bill is a joke and will sink the GOP. #WEDESERVEBETT..."

text4 :
"@ Dream on. The only thing trump understands is how to enric..."

text5 :
"@ @ Just like the Democrat taliban party was up front with t..."

text6 :
"@ you are going to have more of the same with HRC, and you a..."

[ reached max_ndoc ... 19,976 more documents ]

The output is an S3 object representing the trained model, which can be passed to predict() to classify new text.

Hardware Requirements

grafzahl runs on any machine that can run R. A CUDA-capable GPU is optional.

Environment Setup

With R installed:

install.packages("grafzahl")

How to Use

Before training, set up the conda environment.

setup_grafzahl(cuda = TRUE) ## if you have GPU(s)

A typical workflow for training a model and making predictions starts by preparing the training data:

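## Build a corpus from the ecosent data, using the headline column as text
input <- corpus(ecosent, text_field = "headline")
## Keep the documents where the gold docvar is FALSE as training data
training_corpus <- corpus_subset(input, !gold)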

Use the parameters x (the text data), y (the label, in this case the name of a docvar), and model_name (the name of a pretrained model on Hugging Face) to control how the supervised machine learning model is trained.

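## Fine-tune a pretrained Dutch BERT model on the training corpus
model2 <- grafzahl(x = training_corpus,
                   y = "value",
                   model_name = "GroNLP/bert-base-dutch-cased")
## Predict labels for the held-out documents (those where gold is TRUE)
test_corpus <- corpus_subset(input, gold)
predict(model2, test_corpus)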
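
Predictions are usually compared against the known labels of the test set. The following is a minimal sketch in base R, assuming that the predictions and the gold labels are available as two vectors of equal length. The helper evaluate_predictions() is hypothetical and not part of grafzahl, and the toy vectors are placeholders rather than results produced by the package.

```r
# Hypothetical helper (not part of grafzahl): compare predictions with gold labels
evaluate_predictions <- function(predicted, actual) {
  stopifnot(length(predicted) == length(actual))
  # Use a shared set of levels so that the confusion matrix is square
  lv <- sort(unique(c(as.character(predicted), as.character(actual))))
  confusion <- table(predicted = factor(as.character(predicted), levels = lv),
                     actual = factor(as.character(actual), levels = lv))
  # Accuracy is the share of documents on the diagonal of the confusion matrix
  list(confusion = confusion,
       accuracy = sum(diag(confusion)) / sum(confusion))
}

# Placeholder vectors for illustration only; not output from a real model
toy_predicted <- c("-1", "0", "1", "1")
toy_actual <- c("-1", "0", "0", "1")
res <- evaluate_predictions(toy_predicted, toy_actual)
res$confusion
res$accuracy
```

With the objects from the example above, the call would presumably look like evaluate_predictions(predict(model2, test_corpus), docvars(test_corpus, "value")), assuming predict() returns one label per document.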

Technical Details

See the publication for the tested and selected models and parameters, the reasoning behind the model selection, and the datasets used for training.

References

  1. Chan, C. H. (2023). grafzahl: fine-tuning Transformers for text data from within R. Computational Communication Research, 5(1), 76. https://doi.org/10.5117/CCR2023.1.003.CHAN

Contact Details

Maintainer: Chung-hong Chan

Issue Tracker: https://github.com/gesistsa/grafzahl/issues
