Methods Hub (beta)

ELFEN - Efficient Linguistic Feature Extraction for Natural Language Datasets

Abstract:

Efficiently extract linguistic features from text datasets

Type: Method
License: MIT License
Programming Language: Python
Code Repository Version: 48747f5

Description

A Python package to efficiently extract linguistic features at scale for text datasets.

Keywords

  • Text as data
  • Linguistic features
  • Feature extraction
  • nlp

Use Cases

Though none of the following examples use our package, the features can be used, for example, for:

  • Assessment of stylistic differences of social media texts across sociodemographic groups (e.g. Flekova et al., 2016)
  • Finding patterns in LLM-produced text (e.g. Miaschi et al., 2024)

Input Data

The input data is any textual data for which a user may want to extract linguistic features. The expected format is a polars dataframe with a column containing text instances.

Sample Input and Output Data

The input is a polars dataframe with a column containing text instances:

> print(df)

shape: (2, 3)
┌────────────────────────────────┬─────────┬───────────┐
│ text                           ┆ subject ┆ condition │
│ ---                            ┆ ---     ┆ ---       │
│ str                            ┆ str     ┆ str       │
╞════════════════════════════════╪═════════╪═══════════╡
│ This is a test sentence.       ┆ A       ┆ C         │
│ This is another test sentence. ┆ B       ┆ D         │
└────────────────────────────────┴─────────┴───────────┘
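For illustration, a dataframe in this format can be constructed directly with polars; the subject and condition columns are arbitrary metadata that is simply carried along:

import polars as pl

# Example input dataframe with a text column and two arbitrary metadata columns
df = pl.DataFrame(
    {
        "text": [
            "This is a test sentence.",
            "This is another test sentence.",
        ],
        "subject": ["A", "B"],
        "condition": ["C", "D"],
    }
)
print(df)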

Running the extraction of a single feature, for example n_tokens, using the Extractor will yield the original dataframe with an additional column containing the extracted feature:

> print(extractor.data)

shape: (2, 4)
┌────────────────────────────────┬─────────┬───────────┬──────────┐
│ text                           ┆ subject ┆ condition ┆ n_tokens │
│ ---                            ┆ ---     ┆ ---       ┆ ---      │
│ str                            ┆ str     ┆ str       ┆ i64      │
╞════════════════════════════════╪═════════╪═══════════╪══════════╡
│ This is a test sentence.       ┆ A       ┆ C         ┆ 6        │
│ This is another test sentence. ┆ B       ┆ D         ┆ 6        │
└────────────────────────────────┴─────────┴───────────┴──────────┘

In practice, extractor.data will contain the additional helper columns nlp, and tokens and/or types (depending on which features are extracted).
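Putting the pieces together, a minimal end-to-end sketch could look like the following; note that the import path elfen.extractor and the calls Extractor(data=df) and extract("n_tokens") are assumptions based on this description, so consult the official documentation for the actual API:

import polars as pl
from elfen.extractor import Extractor  # assumed import path

df = pl.DataFrame({"text": ["This is a test sentence.",
                            "This is another test sentence."]})

# Assumed constructor and single-feature extraction call; the actual argument
# and method names may differ, see the official documentation
extractor = Extractor(data=df)
extractor.extract("n_tokens")

# extractor.data holds the original dataframe plus the extracted feature column
print(extractor.data)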

To save the original dataframe, including the extracted features but not the helper columns, to a CSV file, a user can run the following command:

extractor.write_csv("path/to/csv/")

Hardware Requirements

ELFEN is compatible with Python versions ≥ 3.10 and ≤ 3.12.11, on any supported hardware.

Environment Setup

Dependencies are defined in pyproject.toml.

To install the package along with the necessary dependencies, run

python -m pip install elfen

If you want to use the spacy backbone, you will need to download the respective model, e.g. “en_core_web_sm”:

python -m spacy download en_core_web_sm

For the full functionality, some external resources are necessary. While most of them are downloaded and located automatically, some have to be loaded manually.

To use wordnet features, download the Open Multilingual Wordnet using:

python -m wn download omw:1.4

Note that for some languages, you will need to install another wordnet collection. For example, for German, you can use the following command:

python -m wn download odenet:1.4

If you are running this in a Jupyter notebook on a binder instance, you can use %%bash magic commands to run the commands in a cell:

%%bash
python -m wn download omw:1.4

For more information on the available wordnet collections, consult the wn package documentation.
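If you want to verify from Python which wordnet collections have been downloaded, the wn package can be queried directly, for example:

import wn

# List the wordnet lexicons available in the local wn database
for lexicon in wn.lexicons():
    print(lexicon.id, lexicon.version, lexicon.language)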

Repository structure

The repo follows the usual structure of a Python package.

It contains two main directories: elfen/docs and elfen/elfen.

The top-level directory elfen contains files that define the package (pyproject.toml), the license file, and READMEs, along with some documentation files.

elfen/docs contains the reStructuredText files and configuration code for creating and rendering the documentation for the package.

elfen/elfen contains the package’s code structured in Python files per feature area, the main extractor class, and utilities.

elfen
├── docs
└── elfen

How to Use

For a comprehensive tutorial covering all the features, visit the official documentation page.

Technical Details

See the official documentation for further information about technical details.

Acknowledgements

While all feature extraction functions in this package are written from scratch, the choice of features in the readability and lexical richness feature areas (partially) follows the readability and lexicalrichness Python packages.

We use the wn Python package to extract Open Multilingual Wordnet synsets.

Disclaimer

Multiprocessing and limiting the number of cores used

The underlying dataframe library, polars, uses all available cores by default. If you are working on a shared server, you may want to limit the resources available to polars. To do that, set the POLARS_MAX_THREADS environment variable in your shell, e.g.:

export POLARS_MAX_THREADS=8
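Note that the variable has to be set before polars is imported. If you prefer to set it from within Python (for example at the top of a notebook), a minimal sketch:

import os

# Must be set before polars is imported; the thread pool size is fixed at startup
os.environ["POLARS_MAX_THREADS"] = "8"

import polars as pl

# Recent polars versions report the resulting thread pool size like this
print(pl.thread_pool_size())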

Usage of some of the features

The extraction of psycholinguistic, emotion/lexicon, and semantic features available through this package relies on third-party resources such as lexicons. Please refer to the original authors' licenses and usage conditions, and cite them if you use these resources through this package in your analyses.

For an overview of which features use which resources, and for how to export all third-party resource references as a BibTeX string, consult the documentation.

Contact Details

Maintainer: Maximilian Maurer

Issue Tracker: https://github.com/mmmaurer/elfen/issues