Resilipipe
Resilipipe is an open source software framework that implements a scalable, cluster-based web content analysis pipeline for web archive data, built on Resiliparse. It can be run on HPC clusters using Apache Spark and Magpie. Users can extend the content analysis by implementing their own modules following our interface.
Necessary configurations are made in the YAML files in conf.
Schema
The Parquet files produced by this pipeline will contain the following columns:
Fixed columns
Schema Version 0.1.0
Column | Description | Pyspark Datatype |
---|---|---|
id | Unique ID based on hash of the URL and crawling time | StringType() |
record_id | UUID of the WARC record | StringType() |
title | Title from the HTML | StringType() |
plain_text | Cleaned text from the HTML | StringType() |
json-ld | String list of JSON-LD (https://www.w3.org/TR/json-ld/#embedding-json-ld-in-html-documents) | StringType() |
microdata | String list of HTML Microdata (http://www.w3.org/TR/microdata/#json) | StringType() |
warc_date | Date from the WARC header | StringType() |
warc_ip | IP Address from the WARC header | StringType() |
url | Full URL | StringType() |
url_scheme | URL scheme specifier | StringType() |
url_path | Hierarchical path after TLD | StringType() |
url_params | Parameters for last path element | StringType() |
url_query | Query component | StringType() |
url_fragment | Fragment identifier | StringType() |
url_subdomain | Subdomain of the network location | StringType() |
url_domain | Domain of the network location | StringType() |
url_suffix | Suffix according to the Public Suffix List | StringType() |
url_is_private | If the URL has a private suffix | BooleanType() |
mime_type | MIME-Type from the HTTP Header | StringType() |
charset | charset from the HTTP Header | StringType() |
content_type_other | List of key, value pairs from the content type that could not be parsed into MIME-type or charset | MapType(StringType(), StringType()) |
http_server | Server from the HTTP Header | StringType() |
language | Language as identified by language.py; Code according to ISO-639 Part 3 | StringType() |
valid | True: The record is valid; False: The record is no longer valid and should not be processed. | BooleanType() |
warc_file | Name of the original WARC-file that contained the record | StringType() |
ows_canonical | The canonical link if it exists | StringType() |
ows_resource_type | Crawl from which the WARC-file originated; Files crawled by the University of Passau are labeled with "Owler" | StringType() |
ows_curlielabel | One of the 15 Curlie top level labels | StringType() |
ows_index | True: The content is allowed to be used for the purposes of web indexing/web search; False: The content cannot be used | BooleanType() |
ows_genai | True: The content is allowed to be used for the purposes of developing Generative AI models; False: The content cannot be used | BooleanType() |
ows_genai_details | If ows_genai=False, this provides additional context | StringType() |
ows_fetch_response_time | Fetch time in ms | IntegerType() |
ows_fetch_num_errors | Number of errors while fetching (Timeout is the most prominent fetch error) | StringType() |
schema_metadata | List of key, value pairs that contain global settings like the schema_version | MapType(StringType(), StringType()) |
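For illustration, here is a minimal PySpark sketch of loading the pipeline's Parquet output and querying a few of these columns. The file path and session setup are placeholders, not something the pipeline prescribes:

```python
from pyspark.sql import SparkSession

# Placeholder session and path; adjust both to your environment.
spark = SparkSession.builder.appName("resilipipe-schema-demo").getOrCreate()
df = spark.read.parquet("metadata.parquet")

df.printSchema()  # should list the fixed columns described above

# Keep only records that are still valid and identified as English ("eng", ISO-639 Part 3)
(df.filter(df.valid & (df.language == "eng"))
   .select("id", "url", "title", "language", "warc_date")
   .show(5, truncate=60))
```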
Columns from modules
Additional columns can be added by providing modules as outlined in the respective README. One example is detecting outgoing links.
Column | Description | Pyspark Datatype |
---|---|---|
outgoing_links | List of all hyperlinks in the HTML that start with 'http' | ArrayType(StringType()) |
image_links | List of all links to images in the HTML that start with 'http' | See get_spark_schema in links.py |
video_links | List of all links to videos in the HTML that start with 'http' or iframes with a video | See get_spark_schema in links.py |
iframes | List of tuples for nodes that contain an iframe (and are not a video) | See get_spark_schema in links.py |
curlielabels | List of language specific domain labels according to Curlie.org | ArrayType(StringType()) |
curlielabels_en | List of English domain labels according to Curlie.org. Mapping by Lugeon, Sylvain; Piccardi, Tiziano | ArrayType(StringType()) |
address | List of dictionaries containing extracted location and coordinates | See get_spark_schema in geoparsing.py |
collection_indices | List of collection indices that a record belongs to; defined via YAML files on the S3 instance | ArrayType(StringType()) |
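The exact module interface is documented in the modules README. Purely as a hypothetical sketch (the function names, signature, and column name below are assumptions, not the actual interface), a module is a Python script that declares the Spark schema of the column it adds and a function that computes its value for a record:

```python
# Hypothetical module "wordcount.py" -- an illustration only; see the modules
# README for the actual function names and signatures used by the pipeline.
from pyspark.sql.types import IntegerType, StructField


def get_spark_schema():
    # Declare the additional column this module contributes (cf. links.py, geoparsing.py).
    return StructField("word_count", IntegerType(), nullable=True)


def parse(plain_text=None, **kwargs):
    # Assumed entry point: receives extracted fields of a record and returns the column value.
    if not plain_text:
        return None
    return len(plain_text.split())
```

Such a script would then be listed in modules.yaml by its file name without .py (here: wordcount), as described under Setup.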
Setup
After cloning this repository, make sure to complete the following steps.
Configure the pipeline
If you want to adapt which modules are used for HTML parsing, you can add or remove them in modules.yaml by adding the name of the Python script (without .py) under modules.
Installation using make prepare
After configuration, the next step is to install the necessary packages in a virtual environment. This is done using the following command:
make prepare
Standalone Minio Instance
If you don't have access to an S3 instance, you can run a minio standalone server to experiment locally. A description can be found here.
Please add the following environment variables to your ~/.bashrc (adapt the values if necessary):
export MINIO_ENDPOINT="http://localhost:9000"
export MINIO_ACCESS_KEY="minioadmin"
export MINIO_SECRET_KEY="minioadmin"
export MINIO_BUCKET_NAME="bucket"
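To verify that the standalone server is reachable with these credentials, a quick sketch using boto3 (any S3-compatible client works; boto3 is just an assumption here, not a pipeline requirement) could look like this:

```python
import os
import boto3

# Uses the environment variables exported above.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["MINIO_ENDPOINT"],
    aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],
    aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
)

bucket = os.environ["MINIO_BUCKET_NAME"]
# Create the bucket if it does not exist yet, then report how many objects it holds.
existing = [b["Name"] for b in s3.list_buckets().get("Buckets", [])]
if bucket not in existing:
    s3.create_bucket(Bucket=bucket)
print(s3.list_objects_v2(Bucket=bucket).get("KeyCount", 0), "objects in", bucket)
```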
Apache Spark
As the jobs use Apache Spark for parallel processing, the machine executing the jobs needs to have access to a cluster or a local instance. Please refer to the respective documentation.
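A quick way to check that PySpark can reach an instance is to start a session and run a trivial job. The snippet below is only a smoke test and assumes a local master; replace local[*] with your cluster's master URL if applicable:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("resilipipe-smoke-test")  # arbitrary name
    .master("local[*]")                # or the master URL of your cluster
    .getOrCreate()
)
print("Spark version:", spark.version)
print("Row count:", spark.range(10).count())  # trivial job to confirm executors respond
spark.stop()
```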
Slurm deployment
To deploy the preprocessing pipeline on an HPC cluster, we make use of Magpie. The Spark deployment can be found here. Magpie ensures that Hadoop/HDFS and Spark are set up correctly within a Slurm batch allocation. We have set up a few scripts and configuration files to simplify deployment on a Slurm-powered cluster.
Additional steps to avoid common issues
- Ensure that the cluster nodes can access the resources-directory by cloning the repository to shared directories
- Add module loading to ~/.bashrc if the environment is not transferred to the cluster nodes
Docker image
In order to test the pipeline locally, you can use the Dockerfile. Simply mount a local directory (data
in the example below) that contains a WARC-file that you want to process and run this command:
docker run \
--rm \
-v "$PWD/data":/data \
opencode.it4i.eu:5050/openwebsearcheu-public/preprocessing-pipeline \
/data/crawl.warc.gz \
/data/metadata.parquet
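Once the container has finished, the resulting Parquet output can be inspected on the host, for example with pandas (assuming pandas and pyarrow are installed; the pipeline itself does not require them):

```python
import pandas as pd

# The container wrote its output to the mounted directory.
df = pd.read_parquet("data/metadata.parquet")
print(df.columns.tolist())                      # fixed columns plus any module columns
print(df[["url", "title", "language"]].head())  # small sample of the extracted metadata
```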