# Parsing
## warc_preprocessing.py
This file contains all basic functions to parse records of a WARC file by using the [Resiliparse](https://resiliparse.chatnoir.eu/) API. These mainly include HTML extraction and cleaning to obtain plain text.
Any further processing steps can be added by providing the necessary code in a .py file in the [html](./html) folder.
## html folder
In this folder, you can create your own modules to be applied to the [HTMLTree](https://resiliparse.chatnoir.eu/en/stable/api/parse/html.html#resiliparse.parse.html.HTMLTree) and/or plain text of a WARC record.
To do so, perform the following steps:

1. Write your code in a .py file that contains a function called `process_tree_and_text`, which returns the result of your preprocessing step. The function must accept an HTMLTree instance and, optionally, (i) the plain text and (ii) the language as strings.
2. Add your module to the [config.yaml](../conf/config.yaml) under `parse` and `html`. This requires three attributes:
   1. `module name`: The name of your .py file
   2. `column name`: The name of the output column in the parquet file
   3. `pyarrow_data_type`: The pyarrow [datatype](https://arrow.apache.org/docs/python/api/datatypes.html) of your output
3. Optional: If you use any additional Python packages, be sure to add them to the [requirements.txt](../requirements.txt).

These steps are illustrated for [outgoing_links.py](./html/outgoing_links.py) in the files mentioned in steps 1-3.
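
As a further illustration of step 1, here is a minimal sketch of what a custom module might look like. The file name `word_count.py`, the parameter names `tree`, `plain_text` and `language`, and the default values are assumptions made for this example; only the function name `process_tree_and_text` and the kinds of arguments are taken from the steps above.

```python
# word_count.py (hypothetical module placed in the html folder)
from typing import Optional


def process_tree_and_text(
    tree,                              # expected: a Resiliparse HTMLTree (unused in this sketch)
    plain_text: Optional[str] = None,  # assumed name for the optional plain text
    language: Optional[str] = None,    # assumed name for the optional language
) -> int:
    """Return the number of whitespace-separated tokens in the plain text."""
    # Records without extracted text yield a count of zero.
    if not plain_text:
        return 0
    return len(plain_text.split())
```

For a module like this, the entry in config.yaml would presumably name the file `word_count`, pick an output column name, and use an integer pyarrow data type such as `int64`.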