Methods Hub (beta)

Taxonomy

  • Data Collection: This category covers methods for obtaining data from a variety of sources. It includes techniques such as web scraping, web tracking, API data extraction, and sensor data acquisition. These methods provide the foundation for assembling diverse datasets for subsequent analysis.

    • Web Scraping: Web scraping extracts data from websites, usually in an automated way. It allows information to be retrieved from web pages and assembled into diverse datasets for analysis (a minimal sketch follows this list).

    • Web Tracking: Web tracking monitors and records user activity on the Internet, typically using cookies or similar tracking mechanisms to analyze how users behave and interact with online content.

    • API (Application Programming Interface) data extraction: API data extraction involves accessing and retrieving data from web services through their programming interfaces. APIs provide a structured and controlled way to interact with external systems or platforms.

    • Sensor data acquisition methods (e.g., smartphones, RFID sensors): This method encompasses gathering data from physical sensors embedded in devices like smartphones or RFID sensors. It involves capturing real-world information such as location, movement, interactions with devices, or environmental conditions.
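
As a concrete illustration of web scraping, here is a minimal Python sketch using the requests and BeautifulSoup libraries. The URL and the CSS selector are placeholder assumptions and would need to be adapted to the structure (and terms of use) of the actual target site.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder target; replace with a page you are permitted to scrape.
URL = "https://example.org/articles"

response = requests.get(URL, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Assumed page structure: headlines as links inside <h2> elements.
records = [
    {"title": link.get_text(strip=True), "url": link["href"]}
    for link in soup.select("h2 a[href]")
]

for record in records:
    print(record)
```

The same pattern extends to API data extraction: requests.get is pointed at a documented endpoint and the JSON response is read with response.json() instead of an HTML parser.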

  • Preprocessing: Data preprocessing prepares raw data for analysis by cleaning, structuring, and improving its quality. It covers tasks such as data cleaning, text preprocessing, data encryption, data anonymization, feature engineering, normalization, and data transformation.

    • Data Cleaning: Data cleaning identifies and corrects errors or inconsistencies in datasets. It aims to improve data quality by correcting inaccuracies, removing duplicates, and handling missing values.

    • Text Preprocessing: Text preprocessing prepares textual data for analysis. It includes tasks such as tokenization, stemming, and stop-word removal to structure and clean text data (see the sketch after this list).

    • Data Encryption: Data encryption is the process of converting sensitive information into a coded form to protect it from unauthorized access. It ensures that only authorized parties can decipher and use the data.

    • Data Anonymization: Data anonymization involves modifying or removing personally identifiable information from datasets to protect the privacy of individuals while still maintaining the usefulness of the data for analysis.

    • Feature Engineering: Feature engineering is the process of selecting, transforming, or creating new features from raw data to improve the performance of machine learning models. It aims to highlight relevant information for predictive modeling.

    • Normalization: Normalization is the process of scaling numerical features to a standard range, typically between 0 and 1. It ensures that different features contribute equally to analyses and prevents the dominance of certain variables.

    • Data Transformation: Data transformation involves converting data from one format or structure to another. This step is crucial for preparing data for specific analyses or machine learning algorithms.
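
To make the text preprocessing step concrete, the following sketch lowercases, tokenizes, and removes stop words from a sentence. The tiny stop-word list is purely illustrative; real pipelines typically draw on a library list (e.g. from NLTK or spaCy) and add stemming or lemmatization on top.

```python
import re

# Small illustrative stop-word list; replace with a library-provided list in practice.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "in"}

def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize on alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

print(preprocess("The quick brown fox jumps over the lazy dog."))
# ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
```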

  • Data Analysis: The analysis category focuses on examining and interpreting data to derive meaningful conclusions. It includes methods such as social network analysis, spatial analysis, temporal analysis, time-series analysis, pattern recognition, community detection, outlier detection, anomaly detection, and the use of advanced language models. These methods provide insights into relationships, trends, and anomalies within the data.

    • Social Network Analysis: Social network analysis examines relationships and interactions within social networks, providing insights into the structure and dynamics of social connections.

    • Spatial Analysis: Spatial analysis involves studying the geographical distribution of data to uncover patterns or relationships related to location.

    • Temporal Analysis: Temporal analysis focuses on analyzing data over time to identify trends, patterns, or changes in behavior or phenomena.

    • Time-series Analysis: Time-series analysis involves studying data collected at regular intervals to understand patterns and trends that unfold over time.

    • Pattern Recognition: Pattern recognition is the identification of recurring structures or patterns within datasets, aiding in the understanding and interpretation of data.

    • Outlier Detection: Outlier detection aims to identify data points that deviate significantly from the expected or normal behavior within a dataset.

    • Community Detection: Community detection involves identifying groups or communities within a network based on the patterns of connections between nodes.

    • Anomaly Detection: Anomaly detection is the identification of abnormal or unexpected patterns or events within data, signaling potential issues or interesting phenomena.

    • Text Mining: Text mining involves extracting valuable insights and patterns from unstructured textual data. It includes tasks such as information retrieval, natural language processing, and sentiment analysis.

      • Sentiment Analysis: Sentiment analysis aims to determine the emotional tone expressed in text data. It is often used to understand public opinion, customer feedback, or social media sentiment.

      • Topic Modeling: Topic modeling is a technique to identify topics or themes within a collection of documents. It helps in uncovering the main subjects discussed in large datasets.

      • Named Entity Recognition: Named entity recognition involves identifying and classifying entities (such as names of people, organizations, or locations) within text data.

      • Text Classification: Text classification involves assigning predefined categories or labels to text data. It is commonly used for tasks like spam detection or sentiment categorization.

      • Text Clustering: Text clustering groups similar texts together based on their features, helping to organize and understand large collections of documents (a sketch follows the Data Analysis list).

    • Graph Mining: Graph mining involves analyzing relationships and patterns within graphs or networks. It is particularly useful for uncovering connections in social networks or organizational structures.

    • Data Classification: Data classification is the process of assigning predefined labels or categories to data instances. It is a core task in supervised machine learning.

    • Data Clustering: Data clustering involves grouping similar data points together based on certain characteristics. It is useful for exploring patterns and structures within datasets.

    • Data Enrichment: Data enrichment involves enhancing existing datasets with additional information from external sources. This process adds valuable context to the data for more comprehensive analysis.

    • Large Language Models: Large Language Models are advanced natural language processing models that are pre-trained on vast amounts of text data. They can be fine-tuned for specific tasks and are powerful tools for language-related analyses.

      • Fine-tuning: Fine-tuning is the process of adjusting a pre-trained model to perform specific tasks or cater to domain-specific data. It helps optimize model performance for a particular use case.

      • Pretraining: Pretraining involves training a model on a large, general dataset before fine-tuning it for specific tasks. It allows the model to capture broad patterns and knowledge from diverse data sources.

      • Prompt Engineering: Prompt engineering involves crafting specific queries or instructions (prompts) to elicit desired responses from language models. It is a technique to guide the model's output for specific tasks (see the sketch below).
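
As an illustration of the text mining and clustering methods in this category, here is a minimal text clustering sketch built on scikit-learn. The four toy documents and the choice of two clusters are assumptions made purely for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus standing in for a real document collection.
documents = [
    "the election results were announced on monday",
    "voters went to the polls across the country",
    "the football match ended in a late goal",
    "the striker scored twice in the second half",
]

# Turn the texts into TF-IDF feature vectors, dropping English stop words.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# Group the documents into two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for doc, label in zip(documents, labels):
    print(label, doc)
```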
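
The prompt engineering entry can likewise be made concrete. The sketch below assembles a few-shot classification prompt in plain Python; the task, label set, and example reviews are illustrative assumptions, and the resulting string would be passed to whichever language-model client is actually in use.

```python
# Illustrative few-shot examples for a sentiment classification task.
FEW_SHOT_EXAMPLES = [
    ("The service was fantastic and the staff were friendly.", "positive"),
    ("The package arrived broken and support never replied.", "negative"),
]

def build_prompt(text: str) -> str:
    """Assemble an instruction, a few labelled examples, and the new input."""
    lines = ["Classify the sentiment of each review as 'positive' or 'negative'.", ""]
    for example, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Review: {example}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {text}")
    lines.append("Sentiment:")
    return "\n".join(lines)

print(build_prompt("The documentation is clear and easy to follow."))
```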

  • Data Visualization: Data visualization involves presenting data in a visual format to aid understanding. This category includes network visualization for representing relationships and tabular data visualization for structured numerical information. Effective visualization enhances the communication of findings and patterns derived from the data.

    • Network Visualization: Network visualization represents relationships and interactions within a network graphically. It helps in understanding the structure and dynamics of complex systems (see the sketch after this list).

    • Tabular Data Visualization: Tabular data visualization involves presenting structured data in tables or charts, making it easier to interpret and derive insights from numerical information.
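
For network visualization, a minimal sketch with networkx and matplotlib is shown below; it uses the Zachary karate club graph bundled with networkx purely as example data.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Example data: the Zachary karate club graph bundled with networkx.
G = nx.karate_club_graph()

# Draw nodes and edges with a spring layout and label each node by its index.
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_color="lightsteelblue", edge_color="gray")

plt.savefig("karate_club.png", dpi=150)
```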

  • Computational Workflows: This category covers applications that help researchers set up reproducible computational workflows, learn the basics of data science and data management, and gain the skills needed to analyze complex datasets.

    • Validation: Validation techniques help ensure the reliability and accuracy of results by scrutinizing the methods used to obtain them. This can include applications for cross-validation, comparison against ground-truth data, or the use of expert judgement as a benchmark (see the sketch below).
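
As an example of the validation techniques above, the following sketch runs 5-fold cross-validation with scikit-learn. The bundled iris data and logistic regression model stand in for whatever dataset and estimator a real study would use.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Example data and model; replace with the dataset and estimator under study.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on four folds, evaluate on the held-out fold.
scores = cross_val_score(model, X, y, cv=5)

print("Accuracy per fold:", scores)
print("Mean accuracy:", round(scores.mean(), 3))
```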