OWS Preprocessing pipeline
This pipeline is used to run jobs to extract metadata from a collection of WARC files. All necessary configurations can be made in the config.yaml file.
Schema
The parquet files produced by this pipeline will contain the following columns:
Fixed columns
Column | Description | Pyarrow Datatype |
---|---|---|
id | UUID of the WARC record | pyarrow.string |
title | Title from the HTML | pyarrow.string |
plain_text | Cleaned text from the HTML | pyarrow.string |
warc_date | Date from the WARC header | pyarrow.date64 |
warc_ip | IP Address from the WARC header | pyarrow.string |
url_scheme | URL scheme specifier | pyarrow.string |
url_path | Hierarchical path after TLD | pyarrow.string |
url_params | Parameters for last path element | pyarrow.string |
url_query | Query component | pyarrow.string |
url_fragment | Fragment identifier | pyarrow.string |
url_subdomain | Subdomain of the network location | pyarrow.string |
url_domain | Domain of the network location | pyarrow.string |
url_suffix | Suffix according to the Public Suffix List | pyarrow.string |
http_content_type | Content type from the HTTP Header | pyarrow.string |
http_server | Server from the HTTP Header | pyarrow.string |
language | Language as identified by language.py; Code according to ISO-639 Part 3 | pyarrow.string |
domain_labels | List of language specific domain labels according to Curlie.org | pyarrow.list_ |
domain_labels_en | List of English domain labels according to Curlie.org. Mapping by Lugeon, Sylvain; Piccardi, Tiziano | pyarrow.list_ |
HTML Modules
Additional columns can be added by providing modules as outlined in the respective README. One example is detecting outgoing links.
Column | Description | Pyarrow Datatype |
---|---|---|
outgoing_links | List of all hyperlinks in the HTML that start with 'http' | pyarrow.list_ |
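Such a module is registered in the config.yaml under `parse -> html` (see the configuration steps below). As a rough sketch, an entry for the outgoing links example could look like this; the exact nesting and the notation for the pyarrow type are assumptions and may differ in the shipped file:

```yaml
# Illustrative only: check the shipped config.yaml for the exact layout.
parse:
  html:
    - module_name: outgoing_links    # modules/outgoing_links.py
      column_name: outgoing_links    # column added to the parquet files
      pyarrow_data_type: list_       # pyarrow.list_; exact notation may differ
```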
Requirements
After cloning this repository, make sure to complete the following steps.
Configure pipeline
In order to make the jobs run as intended, you need to configure the pipeline using the config.yaml (a sketch of the resulting sections is shown after this list):
- Define settings for your data center(s):
  - Add the name of your data center (currently `your_data_center` in line two)
  - Adapt the settings for the resources at each node, how many nodes should be requested at most, etc.:
    - `node_memory`: Memory available at each node in MB
    - `node_cores`: Number of cores available at each node
    - `max_nodes`: The highest number of nodes to request at once
    - `min_nodes`: The minimum number of nodes to request at once
    - `max_mem_per_core`: How much memory to allocate at most to each core in MB
    - `sec_per_executor`: How many seconds one executor takes to process one WARC file on average (default 60s)
- Update the minio settings:
  - Add your credentials (`endpoint`, `access_key`, and `secret_key`)
  - Potentially adapt other settings:
    - `region`: Region of the S3 bucket
    - `bucket_name`: Name of the S3 bucket
    - `input_dir`: Top-level directory in the bucket containing the WARC files
    - `output_dir`: Top-level directory in the bucket containing the parquet files
    - `secure`: Whether to use a secure (TLS) connection to the S3 service or not
- Optional: If you want to adapt which modules are used for HTML parsing, you can add/remove them under `parse -> html` (an example entry is sketched under "HTML Modules" above). Note that each module needs to have three key-value pairs:
  - `module_name`: Name of the python script (without `.py`) in modules
  - `column_name`: Name of the column in the parquet file
  - `pyarrow_data_type`: Type of the module output as a pyarrow datatype
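For orientation, the data center and minio settings described above could fit together roughly as sketched below. The key names are the ones listed in this section; the top-level structure, the nesting, and all example values are assumptions for illustration and may differ from the actual config.yaml shipped with this repository.

```yaml
# Illustrative sketch only; the shipped config.yaml defines the authoritative layout.
data_centers:
  your_data_center:            # rename to your data center
    node_memory: 256000        # memory available at each node in MB
    node_cores: 48             # cores available at each node
    max_nodes: 8               # highest number of nodes to request at once
    min_nodes: 1               # minimum number of nodes to request at once
    max_mem_per_core: 4000     # memory to allocate at most per core in MB
    sec_per_executor: 60       # average seconds per WARC file per executor

minio:
  endpoint: "localhost:9000"   # address of your minio/S3 instance
  access_key: YOUR_ACCESS_KEY
  secret_key: YOUR_SECRET_KEY
  region: us-east-1            # region of the S3 bucket
  bucket_name: my-bucket       # name of the S3 bucket
  input_dir: warc              # top-level directory containing the WARC files
  output_dir: parquet          # top-level directory for the parquet files
  secure: false                # whether to use a TLS connection to the S3 service
```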
Prepare environment
After adapting the config.yaml, the first step consists of installing the necessary packages. For local experimentation, simply install the packages from requirements.txt using your preferred approach (conda, venv, ...). For different data centers, you can adapt the prepare.sh script so that the preparation can be handled via the Makefile.
Minio
If you don't have access to a minio instance, you can run a standalone server to experiment locally. A description can be found here.
As outlined above, all minio configurations (endpoint, credentials, and bucket and zone naming) can be made in config.yaml.
Apache Spark
As the jobs use Apache Spark for parallel processing, the machine executing the jobs needs access to a cluster or a local instance (e.g., through a container). Please refer to the respective documentation.
Job submission
The submission of a job to a Spark cluster depends on the cluster you have access to. The following files are designed to create a Spark cluster using Slurm and run the job on that cluster.
File | Description |
---|---|
submit_spark.sh | Submit a job for a specified data center. Uses the resource estimates from estimate_resources.py |
spark_your_data_center.sh | Job script for a specific data center: loads the required modules, creates the Spark cluster via Slurm, and runs the job |
If that setup fits your data center, you can adapt the files to your specifics:
- submit_spark.sh: Add cases for your data center in lines 10ff. and 58ff. The first part loads a module to run Python and the second submits the job
- spark_your_data_center.sh: Load the correct modules (line 9) and rename the file for your data center
Make commands
The main interaction point is the Makefile. Below is an outline of how to use the available make commands.
Preparation
In order to prepare the environment for a specified data center, run the following command after adding a case for your data center to prepare.sh.
make prepare DATA_CENTER=YOUR_DATA_CENTER_NAME
Note: In addition to the environment, this will also create zip files for `--py-files` to be shipped with a Spark job.
Local experiments
If you want to try the jobs locally, you can run `make local_spark`.
With that command, you also need to define the `OBJECT_PATH` to the WARC files in the `input_bucket`. This path will also be used for the parquet files in the `output_bucket`.
make local_spark OBJECT_PATH=test
Prerequisites
In order to have files to process, you need to have access to a minio instance as described under requirements.
Spark cluster job
When running the Spark job on a cluster, you need to provide the following environment variables:
- `DATA_CENTER`: Which job script to use
- `OBJECT_PATH`: Path to the WARC files in the `input_bucket`; also used as the output path for the parquet files in the `output_bucket`
- `MAX_MINUTES`: Upper time limit (in minutes) for the job to be completed. Used to estimate the number of nodes
Example
make submit_spark DATA_CENTER=YOUR_DATA_CENTER_NAME OBJECT_PATH=test MAX_MINUTES=120