    Sebastian Heineking committed
    # Parsing
    ## warc_preprocessing.py
    This file contains all basic functions to parse records of a WARC file by using the [Resiliparse](https://resiliparse.chatnoir.eu/) API. These mainly include HTML extraction and cleaning to obtain plain text.
    
    Any further processing steps can be added by providing the necessary code in a .py file in the [html](./html) folder.
    
    ## html folder
    In this folder, you can create your own modules to be applied to the [HTMLTree](https://resiliparse.chatnoir.eu/en/stable/api/parse/html.html#resiliparse.parse.html.HTMLTree) and/or plain text of a WARC record.
    
    To do so, perform the following steps:
    
    
    1. Write your code to a .py file that contains a function called `process_tree_and_text`, which returns the result of your preprocessing step. The function must accept an HTMLTree instance and, optionally, the (i) plain text and (ii) language as strings.
    
    2. Add your module to the [config.yaml](../conf/config.yaml) under `parse` and `html`. This requires three attributes:
    	1. `module name`: The name of your .py file
    	2. `column name`: The name of the output column in the Parquet file
    	3. `pyarrow_data_type`: The pyarrow [datatype](https://arrow.apache.org/docs/python/api/datatypes.html) of your output
    	
    3. Optional: If you use any additional Python packages, be sure to add them to the [requirements.txt](../requirements.txt).
    
    
    The [outgoing_links.py](./html/outgoing_links.py) module illustrates these steps; see its entries in the files mentioned in steps 1-3.
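
As a further illustration of step 1, here is a minimal sketch of what a custom module could look like. The file name `word_count.py`, the parameter names `plain_text` and `language`, and the fallback to the body text are assumptions made for this example and are not taken from the repository. The only documented requirement is a function named `process_tree_and_text` that takes an HTMLTree and optionally the plain text and language.

```python
"""word_count.py: hypothetical example module for the html folder."""
from typing import Any, Optional


def process_tree_and_text(tree: Any,
                          plain_text: Optional[str] = None,
                          language: Optional[str] = None) -> int:
    """Return the number of whitespace-separated tokens in a record's text.

    `tree` is expected to be a Resiliparse HTMLTree. It is typed as Any so
    that this sketch also runs where Resiliparse is not installed.
    `language` is accepted to match the described interface but is unused.
    """
    if plain_text is None:
        # Assumption: if no plain text is passed, fall back to the text of
        # the <body> element (HTMLTree.body may be None for empty documents).
        body = tree.body
        plain_text = body.text if body is not None else ""
    return len(plain_text.split())
```

For such a module, the config entry from step 2 would presumably use `word_count` as the module name, a column name of your choice, and an integer pyarrow data type such as `int64`, since the function returns an `int`.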