
OWLer URLFrontier Log Ingestion

This repository ingests crawler logs into the backend of the OWLer system.

Overview

The project integrates with the owler-urlfrontier dependency, which provides backend interfaces for various storage systems such as ScyllaDB and OpenSearch.

Key Components

  1. Parser Plugins: Tag and filter encountered URLs.
  2. Scheduler: Determines the next planned fetch time for URLs.
  3. Logs Consumer: Archives logs into a separate backend and triggers the logs aggregation pipeline.

The log format is defined in the owler-crawler repository.
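A URL parser plugin, as described above, tags or filters each encountered URL. The sketch below is a hypothetical illustration only: the interface name `UrlParserPlugin`, the `TaggedUrl` class, and the `process` method are assumptions for this example, not the actual API defined by owler-urlfrontier (the real plugins are configured via plugins.json, see below).

```java
import java.util.Optional;

// Hypothetical plugin contract: return a tagged URL, or empty to filter it out.
// The real interface is defined by the owler-urlfrontier dependency.
interface UrlParserPlugin {
    Optional<TaggedUrl> process(String url);
}

// Simple value holder for a URL and the tag assigned by a plugin.
class TaggedUrl {
    final String url;
    final String tag;

    TaggedUrl(String url, String tag) {
        this.url = url;
        this.tag = tag;
    }
}

// Example plugin: keep only HTTPS URLs and tag them as "secure";
// all other URLs are dropped from further processing.
class HttpsOnlyPlugin implements UrlParserPlugin {
    @Override
    public Optional<TaggedUrl> process(String url) {
        if (url.startsWith("https://")) {
            return Optional.of(new TaggedUrl(url, "secure"));
        }
        return Optional.empty(); // filtered out
    }
}
```

In this model, the scheduler would then consume the tagged URLs that survive filtering and assign each one its next planned fetch time.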


Installation

Prerequisites

  • Java 11
  • Maven
  • Podman or Docker

Step 1: Install Dependency

Clone the owler-urlfrontier repository and install it into your local Maven repository:

git clone https://opencode.it4i.eu/openwebsearcheu-public/open-web-crawler/owler-urlfrontier.git
cd owler-urlfrontier
mvn clean install -DskipTests

Step 2: Build the Project

Clone this repository and build the project:

git clone https://opencode.it4i.eu/openwebsearcheu-public/open-web-crawler/owler-urlfrontier-ingest.git
cd owler-urlfrontier-ingest
mvn clean package -DskipTests

Usage

Configure the Ingestion Process

Update the config.ini file, located in the root directory of the project, with the required parameters.

# Default parameters

# Ingest Server
prometheus.port=0
http.stream.port=0

# Ingest Service
ingest.threads=4
input.directory=input
archive.directory=archive

# URL Parser Plugins
plugins.config.file=plugins.json

# Backend
# backend.class has no default and must be set explicitly.
backend.class=
backend.host=localhost
# The default port depends on the backend class.
backend.port=
backend.service.interest=default
backend.queue.id=default
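The config.ini format above uses `key=value` pairs with `#` comment lines, which is compatible with `java.util.Properties`. The sketch below shows how such a file could be loaded and a numeric parameter parsed; whether the ingestion service actually uses `Properties` internally is an assumption, and the `loadConfig` helper is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class ConfigSketch {
    // Hypothetical helper: parse key=value configuration text.
    // With a real file, use props.load(new FileReader("config.ini")) instead.
    static Properties loadConfig(String ini) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(ini)); // '#' lines are treated as comments
        return props;
    }

    public static void main(String[] args) throws IOException {
        String ini = String.join("\n",
            "# Ingest Service",
            "ingest.threads=4",
            "input.directory=input",
            "backend.host=localhost");

        Properties cfg = loadConfig(ini);
        // Fall back to a default when a key is absent.
        int threads = Integer.parseInt(cfg.getProperty("ingest.threads", "4"));
        System.out.println("threads=" + threads + " host=" + cfg.getProperty("backend.host"));
    }
}
```

Note that with `java.util.Properties`, a `#` only starts a comment at the beginning of a line; text after a value on the same line would become part of the value, which is why the defaults above are documented on their own comment lines.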

Start the Process

Run the following command to start the ingestion service:

podman-compose up -d --build

License

This project is licensed under the MIT License. See the LICENSE file for details.