# OWLer URLFrontier Log Ingestion
This repository facilitates the ingestion of crawler logs into the backend of the OWLer system.
## Overview

The project integrates with the owler-urlfrontier dependency, which provides backend interfaces for various storage systems such as ScyllaDB and OpenSearch.
## Key Components
- Parser Plugins: Tag and filter encountered URLs.
- Scheduler: Determines the next planned fetch time for URLs.
- Logs Consumer: Archives logs into a separate backend and triggers the logs aggregation pipeline.
The log format is defined in the owler-crawler repository.
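To illustrate the scheduler's role, here is a minimal sketch of deriving a next planned fetch time from a URL's fetch history. The backoff rule and the `NextFetchSketch` class are hypothetical and only demonstrate the idea; the project's actual scheduler may use entirely different logic.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch: compute the next planned fetch time for a URL.
// Not the project's actual scheduler; illustrative only.
public class NextFetchSketch {

    // Double the refetch interval for each consecutive unchanged fetch,
    // starting at 24 hours and capping at 30 days.
    static Instant nextFetch(Instant lastFetch, int unchangedCount) {
        long hours = Math.min(24L << Math.min(unchangedCount, 5), 24L * 30);
        return lastFetch.plus(Duration.ofHours(hours));
    }

    public static void main(String[] args) {
        Instant last = Instant.parse("2024-01-01T00:00:00Z");
        System.out.println(nextFetch(last, 0)); // first refetch 24 hours later
        System.out.println(nextFetch(last, 3)); // backed off to 192 hours
    }
}
```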
## Installation

### Prerequisites
- Java 11
- Maven
- Podman or Docker
### Step 1: Install the Dependency

Clone the owler-urlfrontier repository and install the dependency:

```bash
git clone https://opencode.it4i.eu/openwebsearcheu-public/open-web-crawler/owler-urlfrontier.git
cd owler-urlfrontier
mvn clean install -DskipTests
```
### Step 2: Build the Project

Clone this repository and build the project:

```bash
git clone https://opencode.it4i.eu/openwebsearcheu-public/open-web-crawler/owler-urlfrontier-ingest.git
cd owler-urlfrontier-ingest
mvn clean package -DskipTests
```
## Usage

### Configure the Ingestion Process

Update the `config.ini` file, located in the root directory of the project, with the necessary parameters:
```ini
# Default parameters

# Ingest Server
prometheus.port=0
http.stream.port=0

# Ingest Service
ingest.threads=4
input.directory=input
archive.directory=archive

# URL Parser Plugins
plugins.config.file=plugins.json

# Backend
backend.class=            # no default; must be set
backend.host=localhost
backend.port=             # default port depends on the backend class
backend.service.interest=default
backend.queue.id=default
```
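As a sketch of how such parameters could be consumed, the snippet below loads key-value pairs with `java.util.Properties`. This is an assumption about the mechanism, not the service's actual configuration code; note that `Properties` treats `#` as a comment only at the start of a line, so inline comments would become part of the value.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch of loading config.ini-style key=value pairs;
// the actual service may use a different parser.
public class ConfigSketch {

    static Properties load(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String ini = String.join("\n",
                "# Ingest Service",
                "ingest.threads=4",
                "input.directory=input",
                "backend.host=localhost");
        Properties cfg = load(new StringReader(ini));
        int threads = Integer.parseInt(cfg.getProperty("ingest.threads", "4"));
        System.out.println("threads=" + threads);
        System.out.println("host=" + cfg.getProperty("backend.host"));
    }
}
```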
### Start the Process
Run the following command to start the ingestion service:
```bash
podman-compose up -d --build
```
## License
This project is licensed under the MIT License. See the LICENSE file for details.