OWS Preprocessing pipeline
This pipeline uses Apache Spark to run jobs that extract metadata from a collection of WARC files.
The necessary configuration is done in `magpie.rc` and the YAML files in `conf`.
Schema
The parquet files produced by this pipeline will contain the following columns:
Fixed columns
Column | Description | Pyarrow Datatype |
---|---|---|
id | Unique ID based on hash of the URL and crawling time | pyarrow.string |
record_id | UUID of the WARC record | pyarrow.string |
title | Title from the HTML | pyarrow.string |
plain_text | Cleaned text from the HTML | pyarrow.string |
json-ld | String list of JSON-LD (https://www.w3.org/TR/json-ld/#embedding-json-ld-in-html-documents) | pyarrow.string |
microdata | String list of HTML Microdata (http://www.w3.org/TR/microdata/#json) | pyarrow.string |
warc_date | Date from the WARC header | pyarrow.date64 |
warc_ip | IP Address from the WARC header | pyarrow.string |
url | Full URL | pyarrow.string |
url_scheme | URL scheme specifier | pyarrow.string |
url_path | Hierarchical path after TLD | pyarrow.string |
url_params | Parameters for last path element | pyarrow.string |
url_query | Query component | pyarrow.string |
url_fragment | Fragment identifier | pyarrow.string |
url_subdomain | Subdomain of the network location | pyarrow.string |
url_domain | Domain of the network location | pyarrow.string |
url_suffix | Suffix according to the Public Suffix List | pyarrow.string |
http_content_type | Content type from the HTTP Header | pyarrow.string |
http_server | Server from the HTTP Header | pyarrow.string |
language | Language as identified by language.py; Code according to ISO-639 Part 3 | pyarrow.string |
domain_label | English top-level label according to Curlie.org. Mapping by Lugeon, Sylvain; Piccardi, Tiziano | pyarrow.string |
domain_labels | List of language specific domain labels according to Curlie.org | pyarrow.list_ |
domain_labels_en | List of English domain labels according to Curlie.org. Mapping by Lugeon, Sylvain; Piccardi, Tiziano | pyarrow.list_ |
valid | True: The record can be used for indexing and retrieval; False: The record is no longer valid and should not be processed. | pyarrow.bool_ |
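For reference, the fixed part of this schema can be expressed in pyarrow roughly as follows. This is a hand-written sketch mirroring a subset of the table above, not the schema definition from the pipeline's source code; the element type of the list columns is assumed to be string.

```python
import pyarrow as pa

# Sketch of the fixed columns (subset), mirroring the table above.
fixed_schema = pa.schema([
    ("id", pa.string()),            # hash of URL and crawling time
    ("record_id", pa.string()),     # UUID of the WARC record
    ("title", pa.string()),
    ("plain_text", pa.string()),
    ("warc_date", pa.date64()),
    ("url", pa.string()),
    ("language", pa.string()),      # ISO-639 Part 3 code
    ("domain_label", pa.string()),
    ("domain_labels", pa.list_(pa.string())),  # element type assumed
    ("valid", pa.bool_()),
])
print(fixed_schema)
```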
HTML Modules
Additional columns can be added by providing modules as outlined in the respective README. One example is detecting outgoing links.
Column | Description | Pyarrow Datatype |
---|---|---|
outgoing_links | List of all hyperlinks in the HTML that start with 'http' | pyarrow.list_ |
image_links | List of all links to images in the HTML that start with 'http' | pyarrow.list_ |
video_links | List of all links to videos in the HTML that start with 'http' | pyarrow.list_ |
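To illustrate what such a module computes, the snippet below extracts outgoing and image links that start with 'http' using only the Python standard library. It is an illustration of the idea, not the module interface defined in the modules README.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect hyperlinks and image links that start with 'http'."""

    def __init__(self):
        super().__init__()
        self.outgoing_links = []
        self.image_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and (attrs.get("href") or "").startswith("http"):
            self.outgoing_links.append(attrs["href"])
        elif tag == "img" and (attrs.get("src") or "").startswith("http"):
            self.image_links.append(attrs["src"])

extractor = LinkExtractor()
extractor.feed('<a href="https://example.org">link</a><img src="https://example.org/logo.png">')
print(extractor.outgoing_links)  # ['https://example.org']
print(extractor.image_links)     # ['https://example.org/logo.png']
```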
Requirements
After cloning this repository, make sure to complete the following steps.
Configure pipeline
In order to make the jobs run as intended, you need to configure the pipeline using the YAML files in the `conf` directory and set the `PROJECT_DIR` value in `preprocessing`:

- Change the constant `PROJECT_DIR` in `preprocessing` from `"/opt"` to what applies to your deployment. The directory should be the parent of the `ows_preprocessing` directory. It is used, among other things, to reference the resources directory on cluster deployments.
- Define settings for your data center(s) in `data_centers.yaml` (a sketch of the assumed file structure follows after this list):
  - Add the name of your data center as the highest-level key
  - Adapt the settings for the resources available at each node, how many nodes should be requested at most, etc.:
    - `node_memory`: Memory available at each node in MB
    - `node_cores`: Number of cores available at each node
    - `max_nodes`: The highest number of nodes to request at once
    - `min_nodes`: The minimum number of nodes to request at once
    - `max_mem_per_core`: How much memory to allocate at most to each core in MB
    - `sec_per_executor`: How many seconds one executor takes to process one WARC file on average (default 60s)
- Update the minio settings (`minio.yaml`):
  - Add your credentials (`endpoint`, `access_key`, and `secret_key`)
  - Potentially adapt other settings:
    - `region`: Region of the S3 bucket
    - `bucket_name`: Name of the S3 bucket
    - `input_dir`: Top-level directory in the bucket containing the WARC files
    - `output_dir`: Top-level directory in the bucket containing the parquet files
    - `secure`: Whether to use a secure (TLS) connection to the S3 service or not
- Optional: If you want to adapt which modules are used for HTML parsing, you can add/remove them in `modules.yaml` by adding the name of the Python script (without `.py`) in `modules`
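As a rough sketch of the assumed shape of `conf/data_centers.yaml`: the key names follow the list above, while the data center name and all values below are placeholders, not defaults shipped with this repository.

```python
import yaml  # PyYAML

# Placeholder example of data_centers.yaml; the data center name and the
# numbers are illustrative only.
data_centers_yaml = """
my_data_center:            # highest-level key: name of your data center
  node_memory: 256000      # memory available at each node, in MB
  node_cores: 64           # cores available at each node
  max_nodes: 10            # highest number of nodes to request at once
  min_nodes: 1             # minimum number of nodes to request at once
  max_mem_per_core: 4000   # maximum memory per core, in MB
  sec_per_executor: 60     # average seconds per WARC file
"""

config = yaml.safe_load(data_centers_yaml)
print(config["my_data_center"]["node_cores"])  # -> 64
```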
Prepare environment
After adapting the configuration files, the first step is to install the necessary packages. For local experimentation, simply install the packages from requirements.txt using your preferred approach (conda, venv, ...). For different data centers, you can adapt the prepare.sh script to handle the preparation with the Makefile.
Minio
If you don't have access to a minio instance, you can run a standalone server to experiment locally. A description can be found here.
As outlined above, all minio configurations (endpoint, credentials, and bucket and zone naming) can be made in minio.yaml.
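A quick way to verify the values you put into minio.yaml is to connect with the MinIO Python SDK. The endpoint, credentials, and bucket name below are placeholders, not values from this repository.

```python
from minio import Minio

# Placeholder values; use the endpoint, credentials, and bucket name
# configured in minio.yaml.
client = Minio(
    "localhost:9000",
    access_key="YOUR_ACCESS_KEY",
    secret_key="YOUR_SECRET_KEY",
    secure=False,  # set to True when the endpoint uses TLS
)

# Check that the configured bucket is reachable before launching jobs.
print(client.bucket_exists("your-bucket-name"))
```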
Apache Spark
As the jobs use Apache Spark for parallel processing, the machine executing the jobs needs to have access to a cluster or a local instance. Please refer to the respective documentation.
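As a minimal smoke test (assuming PySpark is installed), the following starts a local Spark session; on a cluster deployment the master URL is provided by the environment (e.g. by Magpie) instead of `local[*]`.

```python
from pyspark.sql import SparkSession

# Start a local Spark session using all available cores; purely a
# connectivity check, not part of the pipeline itself.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("ows-preprocessing-smoke-test")
    .getOrCreate()
)
print(spark.version)
spark.stop()
```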
Slurm deployment
To deploy the preprocessing pipeline in a compute cluster running Slurm, we make use of Magpie. Magpie ensures Hadoop/HDFS and Spark are set up correctly within a Slurm batch allocation. We have set up a few scripts and configuration files to simplify deployment on a Slurm-powered cluster:
- `conf/magpie.rc`: configuration of Magpie. Please ensure that the variables are set to the specifics of your data center
- `scripts/submit_spark.sh`: computes a rough estimate of the required resources (a simplified sketch of such an estimate follows after this list) and then calls a data-center-specific Slurm batch script
- `scripts/slurm_spark_cluster.sh`: runs the Magpie commands required to set up a cluster with Hadoop and Spark within a Slurm allocation
- `scripts/run_preprocessor.sh`: submits the Spark application to the temporary Spark cluster
- `scripts/spark_[YOUR_DATA_CENTER].sh`: contains Slurm directives and data center specific Magpie configuration
To ensure that Magpie works as intended, please also apply the applicable patch for your Spark and Hadoop versions by executing the following command from the Spark directory (example for Spark version 3.3.2 and Hadoop 3.X.X):
patch -p1 < [PATH TO MAGPIE]/patches/spark/spark-3.3.2-bin-hadoop3-alternate.patch
Once all configuration is complete, the preprocessing job can be run by issuing the following command:
make submit_spark DATA_CENTER=[DATA CENTER] OBJECT_PATH=[YYYY-MM/DD] MAX_MINUTES=[MAX RUNTIME AS INTEGER]
Additional steps to avoid common issues
- Ensure that the cluster nodes can access necessary files like the curlie_domains.csv by cloning the repository to shared directories
- Ensure hostnames are mapped correctly. For instance, add `export MAGPIE_HOSTNAME_CMD="hostname -s"` where necessary
- Add module loading to `~/.bashrc` if the environment is not transferred to cluster nodes
Make commands
The main interaction point is the Makefile. Below is an outline of how to use the available commands.
Preparation
In order to prepare the environment for a specific data center, run the following command after adding a case for your data center to prepare.sh.
make prepare DATA_CENTER=YOUR_DATA_CENTER_NAME
Note: In addition to the environment, this will also create a zip file of the preprocessing module to be shipped with a Spark job.
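For reference, such a zip is what allows the Spark executors to import the preprocessing module. The generic PySpark mechanism looks like the snippet below; the file name preprocessing.zip is an assumption, and the pipeline's own submission scripts handle this step for you.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ows-preprocessing").getOrCreate()

# Ship the zipped module to the executors so that it can be imported
# inside Spark tasks; the archive name here is an assumption.
spark.sparkContext.addPyFile("preprocessing.zip")
```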
Spark cluster job
When running the Spark job on a cluster, you need to provide the following environment variables.

- `DATA_CENTER`: Which job script to use
- `OBJECT_PATH`: Path to the WARC files in the `input_bucket`; output path for the parquet files in the `output_bucket`
- `MAX_MINUTES`: Upper time limit (in minutes) for the job to complete. Used to estimate the number of nodes
Example
make submit_spark DATA_CENTER=YOUR_DATA_CENTER_NAME OBJECT_PATH=test MAX_MINUTES=120