
OWS Preprocessing pipeline

This pipeline is used to run jobs to extract metadata from a collection of WARC files. All necessary configurations can be made in the config.yaml file.

Schema

The parquet files produced by this pipeline will contain the following columns:

Fixed columns

| Column | Description | Pyarrow Datatype |
| --- | --- | --- |
| id | UUID of the WARC record | pyarrow.string |
| title | Title from the HTML | pyarrow.string |
| plain_text | Cleaned text from the HTML | pyarrow.string |
| warc_date | Date from the WARC header | pyarrow.date64 |
| warc_ip | IP address from the WARC header | pyarrow.string |
| url_scheme | URL scheme specifier | pyarrow.string |
| url_path | Hierarchical path after the TLD | pyarrow.string |
| url_params | Parameters for the last path element | pyarrow.string |
| url_query | Query component | pyarrow.string |
| url_fragment | Fragment identifier | pyarrow.string |
| url_subdomain | Subdomain of the network location | pyarrow.string |
| url_domain | Domain of the network location | pyarrow.string |
| url_suffix | Suffix according to the Public Suffix List | pyarrow.string |
| http_content_type | Content type from the HTTP header | pyarrow.string |
| http_server | Server from the HTTP header | pyarrow.string |
| language | Language as identified by language.py; code according to ISO 639-3 | pyarrow.string |
| domain_labels | List of language-specific domain labels according to Curlie.org | pyarrow.list_ |
| domain_labels_en | List of English domain labels according to Curlie.org; mapping by Lugeon, Sylvain; Piccardi, Tiziano | pyarrow.list_ |
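To inspect the resulting schema, a parquet file can be opened with pyarrow; the file path below is only a placeholder for any file produced by the pipeline:

```python
# Quick schema check of a pipeline output file; the path is a placeholder.
import pyarrow.parquet as pq

table = pq.read_table("output/part-00000.parquet")
print(table.schema)              # columns as listed above
print(table.column("language"))  # e.g. ISO 639-3 language codes
```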

Columns from HTML Modules

Additional columns can be added by providing modules as outlined in the respective README. One example is detecting outgoing links.

| Column | Description | Pyarrow Datatype |
| --- | --- | --- |
| outgoing_links | List of all hyperlinks in the HTML that start with 'http' | pyarrow.list_ |
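As an illustration, an outgoing-links module could look roughly like the sketch below. The actual module interface is defined in the modules README, so the function name, its signature, and the use of BeautifulSoup here are assumptions.

```python
# Hypothetical sketch of an HTML module; the real interface is defined
# in the modules README. Function name and signature are assumptions.
from bs4 import BeautifulSoup

def process(html: str) -> list[str]:
    """Return all hyperlinks in the HTML that start with 'http'."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].startswith("http")]
```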

Requirements

After cloning this repository, make sure to complete the following steps.

Configure pipeline

To make the jobs run as intended, you need to configure the pipeline in config.yaml (a sketch of the relevant sections follows this list):

  1. Define settings for your data center(s):
    1. Add the name of your data center (currently your_data_center in line two)
    2. Adapt the settings for the resources at each node, the maximum number of nodes to request, etc.:
      1. node_memory: Memory available at each node in MB
      2. node_cores: Number of cores available at each node
      3. max_nodes: The highest number of nodes to request at once
      4. min_nodes: The minimum number of nodes to request at once
      5. max_mem_per_core: How much memory to allocate at most to each core in MB
      6. sec_per_executor: How many seconds one executor takes to process one WARC file on average (default 60s)
  2. Update minio settings
    1. Add your credentials (endpoint, access_key, and secret_key)
    2. Potentially adapt other settings
      1. region: Region of the S3 bucket
      2. bucket_name: Name of the S3 bucket
      3. input_dir: Toplevel directory in the bucket containing the WARC files
      4. output_dir: Toplevel directory in the bucket containing the parquet files
      5. secure: Whether to use secure (TLS) connection to S3 service or not
  3. Optional: If you want to adapt which modules are used for HTML parsing, you can add/remove them under parse -> html. Note that each module needs to have three key-value pairs:
    1. module_name: Name of the python script (without .py) in modules
    2. column_name: Name of the column in the parquet file
    3. pyarrow_data_type: Type of the module output as pyarrow Datatype
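For orientation, the three parts above might look roughly like this in config.yaml. The key names and nesting here are assumptions derived from the descriptions above, so check the shipped config.yaml for the authoritative structure:

```yaml
# Hypothetical sketch of config.yaml; key names and nesting are assumptions.
your_data_center:
  node_memory: 256000        # memory available at each node in MB
  node_cores: 64             # cores available at each node
  max_nodes: 8               # highest number of nodes to request at once
  min_nodes: 1               # minimum number of nodes to request at once
  max_mem_per_core: 4000     # most memory to allocate per core in MB
  sec_per_executor: 60       # avg. seconds per WARC file per executor

minio:
  endpoint: minio.example.org:9000
  access_key: YOUR_ACCESS_KEY
  secret_key: YOUR_SECRET_KEY
  region: us-east-1
  bucket_name: ows-data
  input_dir: warc
  output_dir: parquet
  secure: true

parse:
  html:
    - module_name: outgoing_links   # python script in modules (without .py)
      column_name: outgoing_links   # column in the parquet file
      pyarrow_data_type: list_      # pyarrow datatype of the module output
```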

Prepare environment

After adapting the config.yaml, the first step is installing the necessary packages. For local experimentation, simply install the packages from requirements.txt using your preferred approach (conda, venv, ...). For data-center deployments, you can adapt the prepare.sh script so that the Makefile handles the preparation.
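For the local case, one possible setup (using venv; conda works just as well) is:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```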

Minio

If you don't have access to a minio instance, you can run a standalone server to experiment locally; see the MinIO documentation for details.
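One quick way to bring up such a standalone server is the official Docker image; data lands in the mounted directory and the web console is exposed on port 9001:

```bash
# Throwaway local MinIO for experiments; adjust ports and volume as needed.
docker run -p 9000:9000 -p 9001:9001 \
  -v "$PWD/minio-data:/data" \
  quay.io/minio/minio server /data --console-address ":9001"
```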

As outlined above, all minio settings (endpoint, credentials, bucket and region naming) can be configured in config.yaml.

Apache Spark

As the jobs use Apache Spark for parallel processing, the machine executing jobs needs access to a cluster or a local instance (e.g., through a container). Please refer to the respective documentation.
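For purely local runs, installing PySpark into the same environment is usually sufficient; it ships a local Spark instance that spark-submit can use. A containerised or site-provided cluster works as well:

```bash
pip install pyspark
spark-submit --version   # sanity check
```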

Job submission

The submission of a job to a Spark cluster depends on the cluster you have access to. The following files are designed to create a Spark cluster using Slurm and run the job on that cluster.

| File | Description |
| --- | --- |
| submit_spark.sh | Submits a job for a specified data center, using the resource estimates from estimate_resources.py |
| spark_your_data_center.sh | Slurm batch script that starts the Spark cluster and runs the job; adapt and rename it for your data center |

If that setup fits your data center, you can adapt the files to your specifics:

  • submit_spark.sh: Add cases for your data center in lines 10ff. and 58ff. The first part loads a module to run Python and the second submits the job
  • spark_your_data_center.sh: Load the correct modules (line 9) and rename the file for your data center
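The added case in submit_spark.sh typically boils down to loading the Python module your cluster provides; the sketch below is illustrative and the module name is a placeholder:

```bash
# Hypothetical case entry for submit_spark.sh; the module name is a placeholder.
case "$DATA_CENTER" in
  your_data_center)
    module load python/3.11
    ;;
esac
```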

Make commands

The main interaction point is the Makefile. Below is an outline of how to use the available targets.

Preparation

To prepare the environment for a specific data center, run the following command after adding a case for your data center to prepare.sh.

make prepare DATA_CENTER=YOUR_DATA_CENTER_NAME

Note: In addition to the environment, this will also create zip files of the following directories, which are shipped with the Spark job via --py-files:

  1. parse
  2. conf
  3. log

Local experiments

If you want to try the jobs locally, you can run make local_spark. With that command, you also need to define OBJECT_PATH, the path to the WARC files in the input_bucket. The same path will also be used for the parquet files in the output_bucket.

make local_spark OBJECT_PATH=test

Prerequisites

In order to have files to process, you need access to a minio instance as described under Requirements.

Spark cluster job

When running the Spark job on a cluster, you need to provide the following environment variables.

  1. DATA_CENTER: Which job script to use
  2. OBJECT_PATH: Path to the WARC files in the input_bucket; also used as the output path for the parquet files in the output_bucket
  3. MAX_MINUTES: Upper time limit (in minutes) for the job to be completed. Used to estimate the number of nodes

Example

make submit_spark DATA_CENTER=YOUR_DATA_CENTER_NAME OBJECT_PATH=test MAX_MINUTES=120