Methods Hub (beta)

webtrackR - preprocess & analyze webtracking data

Abstract:

R package to preprocess and analyze web tracking data, i.e., web browsing histories of participants

Type: Method
Topics: Preprocessing
License: MIT License
Programming Language: R
Code Repository Git Reference: bc39b96

Description

The package offers data structures and methods to work with web tracking data. The functions cover data preprocessing steps, the enrichment of web tracking data with external information, and methods for the analysis of digital behavior as used in several academic papers (e.g., Clemm von Hohenberg et al., 2023, doi:10.17605/OSF.IO/M3U9P; Stier et al., 2022, doi:10.1017/S0003055421001222).

Keywords

  • Digital Behavioral Data
  • Webtracking
  • Data Preprocessing

Use Cases

Web tracking data is a powerful resource for social science research, enabling the analysis of human behavior and interaction patterns in digital environments. By capturing detailed user activity, such as page visits, clickstreams, and time spent on various platforms, researchers can uncover insights into online decision-making, information diffusion, and social influence. Such data can also be used to study phenomena like polarization or the impact of targeted advertising on public opinion.

Input Data

webtrackR accepts raw web tracking data, such as the data provided by GESIS, and ships with a sample dataset of raw web tracking data (testdt_tracking).

Output Data

The functions in webtrackR return processed wt_dt objects (enhanced data frames) that contain enriched, cleaned, or summarized web tracking data. These objects remain in-memory within R and can be directly analyzed, visualized, or explicitly exported by the user into external file formats such as CSV or RDS.

# Example: export a processed webtrackR object to CSV
write.csv(my_wt_data, "processed_webtracking.csv", row.names = FALSE)

Environment Setup

With R installed:

install.packages("webtrackR")
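Alternatively, the development version can presumably be installed from the package's GitHub repository (this assumes the remotes package is available):

# development version from GitHub (assumes the remotes package is installed)
remotes::install_github("gesistsa/webtrackR")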

How to Use

The webtrackR package is designed to preprocess, classify, and analyze web tracking data (web browsing histories of participants in academic studies). A typical workflow combines preprocessing of raw tracking data, enrichment with additional information, classification of visits, and aggregation or summarization for further analysis.

Prepare your raw data

Raw web tracking data must contain at least the following variables:

  • panelist_id: the identifier of the participant from whom the data was collected
  • url: the visited URL
  • timestamp: the time of the visit

After loading the data into R, use the function as.wt_dt() to convert it into the special wt_dt format. This assigns the correct class, ensures that required variables are present, and converts the timestamp into POSIXct format. All subsequent functions in the package check for these variables and will throw an error if they are missing.
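As a minimal sketch, assuming the raw data has been read into a data frame raw_df (all values below are invented for illustration):

# hypothetical raw data with the three required variables
raw_df <- data.frame(
  panelist_id = c("p01", "p01", "p02"),
  url = c("https://example.org/news", "https://example.org/sports", "https://example.com/"),
  timestamp = c("2021-06-01 10:00:00", "2021-06-01 10:05:00", "2021-06-01 11:00:00")
)

# assign the wt_dt class; required variables are checked and timestamp becomes POSIXct
wt <- as.wt_dt(raw_df)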

Preprocess the data

Several functions can be used to derive new variables and enrich the raw data (a short sketch follows the list):

  • add_duration(): calculates the time spent on a visit by measuring the difference between subsequent timestamps, with options to handle unusually long gaps using the cutoff and replace_by arguments.
  • add_session(): groups subsequent visits into browsing sessions, stopping when the gap between visits exceeds a specified cutoff.
  • extract_host(), extract_domain(), and extract_path(): parse URLs into host, domain, and path components.
  • drop_query(): removes query strings or fragments from URLs.
  • add_next_visit() and add_previous_visit(): add the following or preceding URL, domain, or host as a new variable.
  • add_referral(): flags whether a visit was referred by a social media platform (based on Schmidt et al., 2023).
  • add_title(): retrieves the <title> text of the visited webpage and adds it as a variable.
  • add_panelist_data(): joins web tracking data with additional participant information such as survey data.
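A minimal sketch of some of these steps, using the example data shipped with the package (the cutoff values are illustrative, not recommendations):

library(webtrackR)

# load the bundled example data and convert it to wt_dt
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)

# compute visit durations, handling gaps longer than 300 seconds via the cutoff
wt <- add_duration(wt, cutoff = 300)

# group visits into sessions, splitting at gaps of more than 30 minutes
wt <- add_session(wt, cutoff = 1800)

# parse the URLs into domain and host components
wt <- extract_domain(wt)
wt <- extract_host(wt)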

Classify visits

To categorize website visits, use classify_visits(). Visits can be matched by extracting the domain or host and comparing them to a predefined list, or by applying regular expressions to the raw URL. This step is essential if you want to distinguish between classes of sites (e.g., news, social media, search engines).
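For example, using the example domain list that ships with the package (matching by extracted domain; matching by host works analogously):

# classify visits by comparing the extracted domain to a predefined list
data("domain_list")
wt <- classify_visits(wt, classes = domain_list, match_by = "domain")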

Summarize and aggregate

Once the data has been preprocessed and classified, it can be aggregated for analysis (an example follows the list):

  • deduplicate(): flags, drops, or aggregates consecutive visits to the same URL within a given time window.
  • sum_visits(): counts visits per participant and timeframe (e.g., by date or by week), optionally split by visit class.
  • sum_durations(): aggregates total visit durations across timeframes or classes.
  • sum_activity(): counts the number of active periods (e.g., active days) per participant.
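Continuing the sketch above (the "type" column is the class variable from the example domain list, as in the Example Code section below):

# drop consecutive duplicate visits to the same URL within one second
wt <- deduplicate(wt, within = 1, method = "drop")

# number of visits per panelist and day, split by domain class
wt_summ <- sum_visits(wt, timeframe = "date", visit_class = "type")

# total visit duration per panelist and day (requires add_duration() beforehand)
dur_summ <- sum_durations(wt, timeframe = "date")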

Work with the results

All functions return updated wt_dt objects, which are simply enhanced data frames. These objects remain in memory in R and can be directly inspected, analyzed, or visualized.

By default, webtrackR does not create output files. If you wish to save the processed or summarized data, you must export it explicitly. For example:

library("webtrackR")

# load example data and turn it into wt_dt
data("testdt_tracking")
my_wt_data <- as.wt_dt(testdt_tracking)

# export the data as a CSV file
write.csv(my_wt_data, "processed_webtracking.csv", row.names = FALSE)
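Analogously, the object can be saved to and restored from R's native RDS format:

# save as RDS and read it back later
saveRDS(my_wt_data, "processed_webtracking.rds")
my_wt_data <- readRDS("processed_webtracking.rds")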

Customize the workflow

The workflow is highly customizable through function arguments (see the sketch after this list):

  • Adjust cutoff values in add_duration() or add_session() to change how long gaps between visits are handled.
  • Specify within and method in deduplicate() to control how duplicates are flagged, dropped, or aggregated.
  • Set timeframe in sum_visits() or sum_durations() to change the level of aggregation (e.g., daily, weekly).
  • Provide your own domain lists or regular expressions in classify_visits() to match visits to custom categories.
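A few illustrative variations, with argument values chosen purely for demonstration (the options follow the argument descriptions above):

# treat gaps longer than 10 minutes as unusually long when computing durations
wt <- add_duration(wt, cutoff = 600)

# aggregate, rather than drop, consecutive visits to the same URL within 5 seconds
wt <- deduplicate(wt, within = 5, method = "aggregate")

# aggregate the number of visits per week instead of per day
wt_summ <- sum_visits(wt, timeframe = "week")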

Example Code

A typical workflow that preprocesses, classifies, and aggregates web tracking data looks like this (using the example data shipped with the package):

# load the package, then load the example data and turn it into wt_dt
library(webtrackR)
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)

# add duration
wt <- add_duration(wt)

# extract domains
wt <- extract_domain(wt)

# drop duplicates (consecutive visits to the same URL within one second)
wt <- deduplicate(wt, within = 1, method = "drop")

# load example domain classification and classify domains
data("domain_list")
wt <- classify_visits(wt, classes = domain_list, match_by = "domain")

# load example survey data and join with web tracking data
data("testdt_survey_w")
wt <- add_panelist_data(wt, testdt_survey_w)

# aggregate number of visits by day and panelist, and by domain class
wt_summ <- sum_visits(wt, timeframe = "date", visit_class = "type")

Technical Details

For further technical details, see the package's official CRAN page at https://CRAN.R-project.org/package=webtrackR.

Contact Details

Maintainer: David Schoch

Issue Tracker: https://github.com/gesistsa/webtrackR/issues