OWS Preprocessing pipeline
This pipeline uses Apache Spark to run jobs that extract metadata from a collection of WARC files.
The necessary configuration is done in `magpie.rc` and the YAML files in `conf`.
Schema
The parquet files produced by this pipeline will contain the following columns:
Fixed columns
Column | Description | Pyarrow Datatype |
---|---|---|
id | Unique ID based on hash of the URL and crawling time | pyarrow.string |
record_id | UUID of the WARC record | pyarrow.string |
title | Title from the HTML | pyarrow.string |
plain_text | Cleaned text from the HTML | pyarrow.string |
json-ld | String list of JSON-LD (https://www.w3.org/TR/json-ld/#embedding-json-ld-in-html-documents) | pyarrow.string |
microdata | String list of HTML Microdata (http://www.w3.org/TR/microdata/#json) | pyarrow.string |
warc_date | Date from the WARC header | pyarrow.date64 |
warc_ip | IP Address from the WARC header | pyarrow.string |
url | Full URL | pyarrow.string |
url_scheme | URL scheme specifier | pyarrow.string |
url_path | Hierarchical path after TLD | pyarrow.string |
url_params | Parameters for last path element | pyarrow.string |
url_query | Query component | pyarrow.string |
url_fragment | Fragment identifier | pyarrow.string |
url_subdomain | Subdomain of the network location | pyarrow.string |
url_domain | Domain of the network location | pyarrow.string |
url_suffix | Suffix according to the Public Suffix List | pyarrow.string |
http_content_type | Content type from the HTTP Header | pyarrow.string |
http_server | Server from the HTTP Header | pyarrow.string |
language | Language as identified by language.py; Code according to ISO-639 Part 3 | pyarrow.string |
domain_label | English top-level label according to Curlie.org. Mapping by Lugeon, Sylvain; Piccardi, Tiziano | pyarrow.string |
domain_labels | List of language specific domain labels according to Curlie.org | pyarrow.list_ |
domain_labels_en | List of English domain labels according to Curlie.org. Mapping by Lugeon, Sylvain; Piccardi, Tiziano | pyarrow.list_ |
valid | True: The record can be used for indexing and retrieval; False: The record is no longer valid and should not be processed. | pyarrow.bool_ |
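For reference, the fixed part of this schema can be expressed in pyarrow roughly as follows. This is a hand-written sketch mirroring a subset of the table above, not the schema definition from the pipeline's source code; the element type of the list columns is assumed to be string.

```python
import pyarrow as pa

# Sketch of the fixed columns (subset), mirroring the table above.
fixed_schema = pa.schema([
    ("id", pa.string()),            # hash of URL and crawling time
    ("record_id", pa.string()),     # UUID of the WARC record
    ("title", pa.string()),
    ("plain_text", pa.string()),
    ("warc_date", pa.date64()),
    ("url", pa.string()),
    ("language", pa.string()),      # ISO-639 Part 3 code
    ("domain_label", pa.string()),
    ("domain_labels", pa.list_(pa.string())),  # element type assumed
    ("valid", pa.bool_()),
])
print(fixed_schema)
```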
HTML Modules
Additional columns can be added by providing modules as outlined in the respective README. One example is detecting outgoing links.
Column | Description | Pyarrow Datatype |
---|---|---|
outgoing_links | List of all hyperlinks in the HTML that start with 'http' | pyarrow.list_ |
image_links | List of all links to images in the HTML that start with 'http' | pyarrow.list_ |
video_links | List of all links to videos in the HTML that start with 'http' | pyarrow.list_ |
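To illustrate what such a module computes, the snippet below extracts outgoing and image links that start with 'http' using only the Python standard library. It is an illustration of the idea, not the module interface defined in the modules README.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect hyperlinks and image links that start with 'http'."""

    def __init__(self):
        super().__init__()
        self.outgoing_links = []
        self.image_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and (attrs.get("href") or "").startswith("http"):
            self.outgoing_links.append(attrs["href"])
        elif tag == "img" and (attrs.get("src") or "").startswith("http"):
            self.image_links.append(attrs["src"])

extractor = LinkExtractor()
extractor.feed('<a href="https://example.org">link</a><img src="https://example.org/logo.png">')
print(extractor.outgoing_links)  # ['https://example.org']
print(extractor.image_links)     # ['https://example.org/logo.png']
```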
Requirements
After cloning this repository, make sure to complete the following steps.
Configure pipeline
In order to make the jobs run as intended, you need to configure the pipeline using the YAML files in the `conf` directory and set the `PROJECT_DIR` value in `preprocessing`:

- Change the constant `PROJECT_DIR` in `preprocessing` from `"/opt"` to what applies to your deployment. The directory should be the parent of the `ows_preprocessing` directory. It is used, among other things, to reference the resources directory on cluster deployments.
- Define settings for your data center(s) in `data_centers.yaml` (a sketch of the assumed file structure follows after this list):
  - Add the name of your data center as the highest-level key
  - Adapt the settings for the resources available at each node, how many nodes should be requested at most, etc.:
    - `node_memory`: Memory available at each node in MB
    - `node_cores`: Number of cores available at each node
    - `max_nodes`: The highest number of nodes to request at once
    - `min_nodes`: The minimum number of nodes to request at once
    - `max_mem_per_core`: How much memory to allocate at most to each core in MB
    - `sec_per_executor`: How many seconds one executor takes to process one WARC file on average (default 60s)
- Update the minio settings (`minio.yaml`):
  - Add your credentials (`endpoint`, `access_key`, and `secret_key`)
  - Potentially adapt other settings:
    - `region`: Region of the S3 bucket
    - `bucket_name`: Name of the S3 bucket
    - `input_dir`: Top-level directory in the bucket containing the WARC files
    - `output_dir`: Top-level directory in the bucket containing the parquet files
    - `secure`: Whether to use a secure (TLS) connection to the S3 service or not
- Optional: If you want to adapt which modules are used for HTML parsing, you can add/remove them in `modules.yaml` by adding the name of the Python script (without `.py`) in `modules`
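As a rough sketch of the assumed shape of `conf/data_centers.yaml`: the key names follow the list above, while the data center name and all values below are placeholders, not defaults shipped with this repository.

```python
import yaml  # PyYAML

# Placeholder example of data_centers.yaml; the data center name and the
# numbers are illustrative only.
data_centers_yaml = """
my_data_center:            # highest-level key: name of your data center
  node_memory: 256000      # memory available at each node, in MB
  node_cores: 64           # cores available at each node
  max_nodes: 10            # highest number of nodes to request at once
  min_nodes: 1             # minimum number of nodes to request at once
  max_mem_per_core: 4000   # maximum memory per core, in MB
  sec_per_executor: 60     # average seconds per WARC file
"""

config = yaml.safe_load(data_centers_yaml)
print(config["my_data_center"]["node_cores"])  # -> 64
```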
Prepare environment
After adapting the configuration files, the first step is to install the necessary packages. For local experimentation, simply install the packages from requirements.txt using your preferred approach (conda, venv, ...). For different data centers, you can adapt the prepare.sh script to handle the preparation with the Makefile.
Minio
If you don't have access to a minio instance, you can run a standalone server to experiment locally. A description can be found here.
As outlined above, all minio configurations (endpoint, credentials, and bucket and zone naming) can be made in minio.yaml.
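A quick way to verify the values you put into minio.yaml is to connect with the MinIO Python SDK. The endpoint, credentials, and bucket name below are placeholders, not values from this repository.

```python
from minio import Minio

# Placeholder values; use the endpoint, credentials, and bucket name
# configured in minio.yaml.
client = Minio(
    "localhost:9000",
    access_key="YOUR_ACCESS_KEY",
    secret_key="YOUR_SECRET_KEY",
    secure=False,  # set to True when the endpoint uses TLS
)

# Check that the configured bucket is reachable before launching jobs.
print(client.bucket_exists("your-bucket-name"))
```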
Apache Spark
As the jobs use Apache Spark for parallel processing, the machine executing the jobs needs to have access to a cluster or a local instance. Please refer to the respective documentation.
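As a minimal smoke test (assuming PySpark is installed), the following starts a local Spark session; on a cluster deployment the master URL is provided by the environment (e.g. by Magpie) instead of `local[*]`.

```python
from pyspark.sql import SparkSession

# Start a local Spark session using all available cores; purely a
# connectivity check, not part of the pipeline itself.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("ows-preprocessing-smoke-test")
    .getOrCreate()
)
print(spark.version)
spark.stop()
```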
Slurm deployment
To deploy the preprocessing pipeline in a compute cluster running Slurm, we make use of Magpie. Magpie ensures Hadoop/HDFS and Spark are set up correctly within a Slurm batch allocation. We have set up a few scripts and configuration files to simplify deployment on a Slurm-powered cluster:
- `conf/magpie.rc`: configuration of Magpie. Please ensure that the variables are set to the specifics of your data center
- `scripts/submit_spark.sh`: computes a rough estimate of the required resources (a simplified sketch of such an estimate follows after this list) and then calls a data-center-specific Slurm batch script
- `scripts/slurm_spark_cluster.sh`: runs the Magpie commands required to set up a cluster with Hadoop and Spark within a Slurm allocation
- `scripts/run_preprocessor.sh`: submits the Spark application to the temporary Spark cluster
- `scripts/spark_[YOUR_DATA_CENTER].sh`: contains Slurm directives and data center specific Magpie configuration
To ensure that Magpie works as intended, please also apply the applicable patch for your Spark and Hadoop versions by executing the following command from the Spark directory (example for Spark version 3.3.2 and Hadoop 3.X.X):
patch -p1 < [PATH TO MAGPIE]/patches/spark/spark-3.3.2-bin-hadoop3-alternate.patch
Once all configuration is complete, the preprocessing job can be run by issuing the following command:
make submit_spark DATA_CENTER=[DATA CENTER] OBJECT_PATH=[YYYY-MM/DD] MAX_MINUTES=[MAX RUNTIME AS INTEGER]
Additional steps to avoid common issues
- Ensure that the cluster nodes can access necessary files like the curlie_domains.csv by cloning the repository to shared directories
- Ensure hostnames are mapped correctly. For instance, add `export MAGPIE_HOSTNAME_CMD="hostname -s"` where necessary
- Add module loading to `~/.bashrc` if the environment is not transferred to cluster nodes
Make commands
The main interaction point is the Makefile. Below is an outline of how to use the available commands.
Preparation
In order to prepare the environment for a specific data center, run the following command after adding a case for your data center to prepare.sh.
make prepare DATA_CENTER=YOUR_DATA_CENTER_NAME
Note: In addition to the environment, this will also create a zip file of the preprocessing module to be shipped with a Spark job.
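For reference, such a zip is what allows the Spark executors to import the preprocessing module. The generic PySpark mechanism looks like the snippet below; the file name preprocessing.zip is an assumption, and the pipeline's own submission scripts handle this step for you.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ows-preprocessing").getOrCreate()

# Ship the zipped module to the executors so that it can be imported
# inside Spark tasks; the archive name here is an assumption.
spark.sparkContext.addPyFile("preprocessing.zip")
```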
Spark cluster job
When running the Spark job on a cluster, you need to provide the following environment variables.

- `DATA_CENTER`: Which job script to use
- `OBJECT_PATH`: Path to the WARC files in the `input_bucket`; output path for the parquet files in the `output_bucket`
- `MAX_MINUTES`: Upper time limit (in minutes) for the job to complete. Used to estimate the number of nodes
Example
make submit_spark DATA_CENTER=YOUR_DATA_CENTER_NAME OBJECT_PATH=test MAX_MINUTES=120