Methods Hub (beta)

adaR

Abstract:

Break down URLs into their components

Type: Method
Topics: Data Analysis
License: MIT License
Programming Language: R
Code Repository Git Reference: 23795fd

Description

A wrapper for ‘ada-url’, a ‘WHATWG’ compliant and fast URL parser written in modern ‘C++’. Also contains auxiliary functions such as a public suffix extractor.

Keywords

  • URL Parsing
  • Webtracking Data
  • Webscraping

Use Cases

URL parsing is an important step in the analysis of webtracking data, e.g. GESIS Web Tracking. The technique, though not this package itself, has been used in various social science publications, e.g. de León et al. (2023).

The package has been used in various webscraping projects for communication research, e.g. paperboy.

Input Data

The input data has to be a character vector of URLs, for example:

urls <- c("https://www.google.de/search?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1")

urls
[1] "https://www.google.de/search?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1"
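
In practice, URLs often arrive as a column of a larger table rather than as a ready-made vector. The following base R sketch shows one way to turn such a column into a clean character vector. The data frame `visits` and its column names are hypothetical and serve only as an illustration; dropping missing and empty entries is a cautious assumption, not a documented requirement of adaR.

```r
# Hypothetical web-tracking table; the column names are illustrative only
visits <- data.frame(
  participant = c("p1", "p1", "p2", "p2"),
  url = c(
    "https://www.google.de/search?q=GESIS",
    NA,
    "https://www.nytimes.com/",
    ""
  ),
  stringsAsFactors = FALSE
)

# Coerce the column to character, then drop missing and empty entries
clean_urls <- as.character(visits$url)
clean_urls <- clean_urls[!is.na(clean_urls) & nzchar(clean_urls)]

clean_urls
```

The resulting `clean_urls` vector has the same shape as `urls` above and can be passed on for parsing.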

Output Data

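The output data is a data frame with one row per input URL and one column per URL component: href, protocol, username, password, host, hostname, port, pathname, search and hash.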
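
Because the result is an ordinary data frame, it can be inspected and subset with standard R tools. The sketch below assumes adaR is installed and that components absent from a URL are returned as empty strings, as the printed output on this page suggests; the check for `NA` is added as a precaution in case that assumption does not hold.

```r
library(adaR)

urls <- c(
  "https://www.google.de/search?q=GESIS&client=ubuntu",
  "https://www.nytimes.com/2024/06/19/world/africa/sudan-darfur-takeaways.html",
  "https://www.nytimes.com/"
)

parsed <- ada_url_parse(urls)

# List the available component columns
names(parsed)

# Count how often each hostname occurs
host_counts <- table(parsed$hostname)

# Keep only the rows that carry a non-empty query string
has_query <- !is.na(parsed$search) & nzchar(parsed$search)
with_query <- parsed[has_query, c("hostname", "pathname", "search")]
```

Here `host_counts` and `with_query` are illustrative names; the same pattern applies to any of the other component columns.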

Hardware Requirements

adaR runs on any hardware that can run R.

Environment Setup

With R installed:

install.packages("adaR")

How to Use

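Please refer to “Introduction to adaR” for a comprehensive introduction to the package.

The main function of this package is ada_url_parse(), which decomposes a URL into its components. Note that percent-encoded characters, such as %C3%9F in the third URL below, appear decoded (ß) in the result.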

library(adaR)

urls <- c("https://www.google.de/search?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1",
          "https://www.nytimes.com/2024/06/19/world/africa/sudan-darfur-takeaways.html",
          "https://www.sueddeutsche.de/thema/Fu%C3%9Fball-EM")

ada_url_parse(urls)
                                                                                          href
1 https://www.google.de/search?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1
2                  https://www.nytimes.com/2024/06/19/world/africa/sudan-darfur-takeaways.html
3                                                 https://www.sueddeutsche.de/thema/Fußball-EM
  protocol username password                host            hostname port
1   https:                         www.google.de       www.google.de     
2   https:                       www.nytimes.com     www.nytimes.com     
3   https:                   www.sueddeutsche.de www.sueddeutsche.de     
                                              pathname
1                                              /search
2 /2024/06/19/world/africa/sudan-darfur-takeaways.html
3                                    /thema/Fußball-EM
                                                            search hash
1 ?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1     
2                                                                      
3                                                                      
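
The description also mentions a public suffix extractor. The sketch below assumes this helper is exported as public_suffix() and accepts a character vector of URLs, returning one suffix per input; please verify the name and behaviour against the package reference before relying on it.

```r
library(adaR)

urls <- c(
  "https://www.google.de/search?q=GESIS",
  "https://www.bbc.co.uk/news"
)

# Assumed API: public_suffix() returns one public suffix per input URL
suffixes <- public_suffix(urls)

suffixes
```

A public suffix such as a country-code or second-level registry domain is useful when visits should be grouped by registrable domain rather than by full hostname.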

Technical Details

See the official CRAN page for further technical details.

Contact Details

Maintainer: David Schoch

Issue Tracker: https://github.com/gesistsa/adaR/issues