
OWS Preprocessing pipeline

This pipeline uses Apache Spark to run jobs that extract metadata from a collection of WARC files.

Necessary configurations are made in magpie.rc and the YAML files in conf.
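
To give a sense of what the jobs do per record, the sketch below reads a single WARC file with warcio and extracts the record ID, URL, title, and plain text. It is only an illustration of the extraction step, not the pipeline's actual code; the warcio and BeautifulSoup usage shown here is one possible implementation.

```python
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup


def extract_records(warc_path):
    """Yield (record_id, url, title, plain_text) from one WARC file (illustrative sketch)."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            # Only HTTP responses carry the HTML payload we care about.
            if record.rec_type != "response":
                continue
            record_id = record.rec_headers.get_header("WARC-Record-ID")
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            soup = BeautifulSoup(html, "html.parser")
            title = soup.title.get_text(strip=True) if soup.title else None
            plain_text = soup.get_text(separator=" ", strip=True)
            yield record_id, url, title, plain_text
```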

Schema

The parquet files produced by this pipeline will contain the following columns:

Fixed columns

| Column | Description | Pyarrow Datatype |
| --- | --- | --- |
| id | Unique ID based on a hash of the URL and crawling time | pyarrow.string |
| record_id | UUID of the WARC record | pyarrow.string |
| title | Title from the HTML | pyarrow.string |
| plain_text | Cleaned text from the HTML | pyarrow.string |
| json-ld | String list of JSON-LD (https://www.w3.org/TR/json-ld/#embedding-json-ld-in-html-documents) | pyarrow.string |
| microdata | String list of HTML Microdata (http://www.w3.org/TR/microdata/#json) | pyarrow.string |
| warc_date | Date from the WARC header | pyarrow.date64 |
| warc_ip | IP address from the WARC header | pyarrow.string |
| url | Full URL | pyarrow.string |
| url_scheme | URL scheme specifier | pyarrow.string |
| url_path | Hierarchical path after the TLD | pyarrow.string |
| url_params | Parameters for the last path element | pyarrow.string |
| url_query | Query component | pyarrow.string |
| url_fragment | Fragment identifier | pyarrow.string |
| url_subdomain | Subdomain of the network location | pyarrow.string |
| url_domain | Domain of the network location | pyarrow.string |
| url_suffix | Suffix according to the Public Suffix List | pyarrow.string |
| http_content_type | Content type from the HTTP header | pyarrow.string |
| http_server | Server from the HTTP header | pyarrow.string |
| language | Language as identified by language.py; code according to ISO 639-3 | pyarrow.string |
| domain_label | English top-level label according to Curlie.org. Mapping by Lugeon, Sylvain; Piccardi, Tiziano | pyarrow.string |
| domain_labels | List of language-specific domain labels according to Curlie.org | pyarrow.list_ |
| domain_labels_en | List of English domain labels according to Curlie.org. Mapping by Lugeon, Sylvain; Piccardi, Tiziano | pyarrow.list_ |
| valid | True: the record can be used for indexing and retrieval; False: the record is no longer valid and should not be processed | pyarrow.bool_ |
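
For reference, the fixed columns above could be expressed as a pyarrow schema roughly as follows. This is a sketch derived from the table, not the pipeline's own schema definition; in particular, the element type of the list columns (strings) is an assumption.

```python
import pyarrow as pa

# Sketch of the fixed part of the output schema; module columns are appended separately.
FIXED_SCHEMA = pa.schema([
    ("id", pa.string()),
    ("record_id", pa.string()),
    ("title", pa.string()),
    ("plain_text", pa.string()),
    ("json-ld", pa.string()),
    ("microdata", pa.string()),
    ("warc_date", pa.date64()),
    ("warc_ip", pa.string()),
    ("url", pa.string()),
    ("url_scheme", pa.string()),
    ("url_path", pa.string()),
    ("url_params", pa.string()),
    ("url_query", pa.string()),
    ("url_fragment", pa.string()),
    ("url_subdomain", pa.string()),
    ("url_domain", pa.string()),
    ("url_suffix", pa.string()),
    ("http_content_type", pa.string()),
    ("http_server", pa.string()),
    ("language", pa.string()),
    ("domain_label", pa.string()),
    ("domain_labels", pa.list_(pa.string())),    # element type assumed
    ("domain_labels_en", pa.list_(pa.string())), # element type assumed
    ("valid", pa.bool_()),
])
```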

Columns from HTML Modules

Additional columns can be added by providing modules as outlined in the respective README. One example is detecting outgoing links.

| Column | Description | Pyarrow Datatype |
| --- | --- | --- |
| outgoing_links | List of all hyperlinks in the HTML that start with 'http' | pyarrow.list_ |
| image_links | List of all links to images in the HTML that start with 'http' | pyarrow.list_ |
| video_links | List of all links to videos in the HTML that start with 'http' | pyarrow.list_ |
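
For illustration, a module producing the outgoing_links column might look like the sketch below. The actual module interface is defined in the modules README, so the function name, signature, and use of BeautifulSoup here are assumptions.

```python
from bs4 import BeautifulSoup


def outgoing_links(html: str) -> list[str]:
    """Return all hyperlinks in the HTML that start with 'http' (hypothetical module sketch)."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].startswith("http")]
```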

Requirements

After cloning this repository, make sure to complete the following steps.

Configure pipeline

In order to make the jobs run as intended, you need to configure the pipeline using the YAML files in the conf directory and set the PROJECT_DIR value in preprocessing:

  1. Change the constant PROJECT_DIR in preprocessing from "/opt" to what applies to your deployment. The directory should be the parent of the ows_preprocessing directory. It is used, among other things, to reference the resources directory on cluster deployments.
  2. Define settings for your data center(s) in data_centers.yaml:
    1. Add the name of your data center as the top-level key
    2. Adapt the settings for per-node resources, the maximum number of nodes to request, etc. (a resource-estimation sketch follows this list):
      1. node_memory: Memory available at each node in MB
      2. node_cores: Number of cores available at each node
      3. max_nodes: The highest number of nodes to request at once
      4. min_nodes: The minimum number of nodes to request at once
      5. max_mem_per_core: How much memory to allocate at most to each core in MB
      6. sec_per_executor: How many seconds one executor takes to process one WARC file on average (default 60s)
  3. Update minio settings (minio.yaml)
    1. Add your credentials (endpoint, access_key, and secret_key)
    2. Potentially adapt other settings
      1. region: Region of the S3 bucket
      2. bucket_name: Name of the S3 bucket
      3. input_dir: Top-level directory in the bucket containing the WARC files
      4. output_dir: Top-level directory in the bucket containing the parquet files
      5. secure: Whether to use secure (TLS) connection to S3 service or not
  4. Optional: To adapt which modules are used for HTML parsing, add or remove them in modules.yaml by listing the name of the Python script (without .py) under modules
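
As a rough sketch of how the data center settings interact, the hypothetical helper below reads data_centers.yaml and derives a node count from the number of WARC files and the allowed runtime. The authoritative estimate is computed in scripts/submit_spark.sh; the function name and exact arithmetic here are assumptions.

```python
import math
import yaml


def estimate_nodes(data_center, n_warc_files, max_minutes,
                   conf_path="conf/data_centers.yaml"):
    """Rough node estimate from the data_centers.yaml keys (sketch only)."""
    with open(conf_path) as f:
        dc = yaml.safe_load(f)[data_center]

    # One executor per core, bounded by the memory budget per core.
    executors_per_node = min(dc["node_cores"],
                             dc["node_memory"] // dc["max_mem_per_core"])
    # Total executor-seconds, assuming sec_per_executor seconds per WARC file.
    total_seconds = n_warc_files * dc.get("sec_per_executor", 60)
    executors_needed = math.ceil(total_seconds / (max_minutes * 60))
    nodes = math.ceil(executors_needed / max(executors_per_node, 1))
    return min(max(nodes, dc["min_nodes"]), dc["max_nodes"])
```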

Prepare environment

After adapting the configuration, the first step is to install the necessary packages. For local experimentation, simply install the packages from requirements.txt using your preferred approach (conda, venv, ...). For deployment at different data centers, you can adapt the prepare.sh script so that the Makefile handles the preparation.

Minio

If you don't have access to a minio instance, you can run a standalone server to experiment locally; see the minio documentation for instructions.

As outlined above, all minio settings (endpoint, credentials, region, and bucket and directory naming) are made in minio.yaml.
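
As a quick sanity check of these settings, a sketch like the following lists the WARC files for one object path using the minio Python client. It assumes the flat key layout described above for minio.yaml and an illustrative date prefix; the pipeline itself may access S3 differently (e.g. through Spark's s3a connector).

```python
import yaml
from minio import Minio

# Load the settings described above (flat key layout assumed).
with open("conf/minio.yaml") as f:
    cfg = yaml.safe_load(f)

client = Minio(
    cfg["endpoint"],
    access_key=cfg["access_key"],
    secret_key=cfg["secret_key"],
    secure=cfg["secure"],
    region=cfg["region"],
)

# List the WARC files below input_dir for an example OBJECT_PATH of 2024-01/01.
prefix = f"{cfg['input_dir']}/2024-01/01/"
for obj in client.list_objects(cfg["bucket_name"], prefix=prefix, recursive=True):
    print(obj.object_name)
```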

Apache Spark

As the jobs use Apache Spark for parallel processing, the machine executing the jobs needs to have access to a cluster or a local instance. Please refer to the respective documentation.
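
For local experimentation, a Spark session can be created along these lines. The s3a options shown are standard Hadoop settings with placeholder values; the pipeline does not read them from minio.yaml automatically, and the hadoop-aws package must be available on the classpath.

```python
from pyspark.sql import SparkSession

# Minimal local session; cluster deployments are handled by the Slurm/Magpie scripts below.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("ows-preprocessing-local")
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")  # placeholder
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")      # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")      # placeholder
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)
```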

Slurm deployment

To deploy the preprocessing pipeline on a compute cluster running Slurm, we make use of Magpie. Magpie ensures that Hadoop/HDFS and Spark are set up correctly within a Slurm batch allocation. We have set up a few scripts and configuration files to simplify deployment on a Slurm-powered cluster:

  • conf/magpie.rc: configuration of Magpie. Please ensure that the variables are set to the specifics of your data center
  • scripts/submit_spark.sh: computes a rough estimate of required resources and then calls a data center specific Slurm batch script
  • scripts/slurm_spark_cluster.sh: runs the Magpie commands required to set up a cluster with Hadoop and Spark within a Slurm allocation
  • scripts/run_preprocessor.sh: submits the Spark application to the temporary Spark cluster
  • scripts/spark_[YOUR_DATA_CENTER].sh: contains Slurm directives and data center specific Magpie configuration

To ensure that Magpie works as intended, please also apply the patch applicable to your Spark and Hadoop versions by executing the following command from the Spark directory (example for Spark 3.3.2 and Hadoop 3.x):

patch -p1 < [PATH TO MAGPIE]/patches/spark/spark-3.3.2-bin-hadoop3-alternate.patch

Once all configuration is complete, the preprocessing job can be run by issuing the following command:

make submit_spark DATA_CENTER=[DATA CENTER] OBJECT_PATH=[YYYY-MM/DD] MAX_MINUTES=[MAX RUNTIME AS INTEGER]

Additional steps to avoid common issues

  • Ensure that the cluster nodes can access necessary files such as curlie_domains.csv by cloning the repository to shared directories
  • Ensure hostnames are mapped correctly. For instance, add export MAGPIE_HOSTNAME_CMD="hostname -s" where necessary
  • Add module loading to ~/.bashrc if the environment is not transferred to cluster nodes

Make commands

The main interaction point is the Makefile. Below is an outline of how to use its main targets.

Preparation

To prepare the environment for a specific data center, run the following command after adding a case for your data center to prepare.sh.

make prepare DATA_CENTER=YOUR_DATA_CENTER_NAME

Note: In addition to the environment, this will also create a zip file of the preprocessing module to be shipped with a Spark job.
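
Conceptually, that zip file is attached to the Spark job roughly as sketched below, so that the preprocessing package can be imported on the executors. The actual submission logic lives in scripts/run_preprocessor.sh, and the file name shown here is illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ows-preprocessing").getOrCreate()
# Ship the zipped module to all executors (equivalent to passing --py-files to spark-submit).
spark.sparkContext.addPyFile("preprocessing.zip")  # file name illustrative
```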

Spark cluster job

When running the Spark job on a cluster, you need to provide the following environment variables.

  1. DATA_CENTER: Which job script to use
  2. OBJECT_PATH: Path to the WARC files in the input_bucket; also the output path for the parquet files in the output_bucket
  3. MAX_MINUTES: Upper time limit (in minutes) for the job to complete; used to estimate the number of nodes

Example

make submit_spark DATA_CENTER=YOUR_DATA_CENTER_NAME OBJECT_PATH=test MAX_MINUTES=120