OWI CLI Documentation

    Overview

    The OWI (Open Web Index) CLI is a command-line tool designed for managing OpenWebIndex data slices. This documentation will guide you through its setup, usage, and troubleshooting.

    Data License

    By using this tool you agree to our Open Web Index License (OWIL), to our Terms of Use, and to the terms of use of the participating data centers (via B2ACCESS login).

    Software License

    owilix itself is licensed under the Apache 2.0 License.

    Installation

    Requirements

    • Python Version: 3.10 or 3.11 (Python 3.12 may have issues with py4lexis)
    • Package Manager: Pip
    • Optional Manager: Poetry (recommended for managing Python dependencies)
    • Operating Systems: Linux, macOS (not tested under Windows)

    Method 1: Install from Package URL

    # Create a new environment
    conda create -n owi pip python=3.11
    conda activate owi
    # Install required packages
    pip install py4lexis --index-url https://opencode.it4i.eu/api/v4/projects/107/packages/pypi/simple
    pip install owilix --index-url https://opencode.it4i.eu/api/v4/projects/92/packages/pypi/simple
    # Verify installation
    owilix --help

    Upgrade

    pip install --upgrade py4lexis --index-url https://opencode.it4i.eu/api/v4/projects/107/packages/pypi/simple
    pip install --upgrade owilix --index-url https://opencode.it4i.eu/api/v4/projects/92/packages/pypi/simple

    Method 2: Install from Repository

    Installing directly from the repository requires having Poetry installed (in your base environment).

    # Clone the repository
    git clone https://opencode.it4i.eu/openwebsearcheu-public/owi-cli.git
    cd owi-cli
    # Install directly from the repository
    poetry install

    Using Poetry for Installation

    # Add py4lexis as a source and install
    poetry source add --priority=supplemental py4lexis https://opencode.it4i.eu/api/v4/projects/107/packages/pypi/simple
    poetry add --source py4lexis py4lexis
    
    # Clone the repository and install with Poetry
    git clone https://opencode.it4i.eu/openwebsearcheu-public/owi-cli.git
    cd owi-cli
    poetry install

    Quick Checks After Installation

    After installation, run a few quick checks to verify that everything is working. Use

    owilix remote doctor

    to list the connection status and

    owilix remote ls all

    to list all available datasets. For more, see Commands and Examples below.

    Usage

    Defaults

    • Path: ~/.owi (modifiable via the OWS_OWI_PATH environment variable or the --target option)
    • File Names: {internalid}.tar.gz for datasets and {internalid}.json for metadata
    • Specifier Format: {datacenter|all}:{YYYY-MM-DD|latest}#{days}/{key=value;key=value}

    Specifier Components

    1. Datacenter: A specific data center or "all" (default: "all"). The current data centers are lrz, it4i, csc
    2. Date: In YYYY-MM-DD format or "latest" (default: current day)
    3. Days: Number of days in the past to retrieve (optional)
    4. Key=Value: Additional filters (optional). If the key ends with "*", the value is interpreted as a regular expression
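
    Putting the components together, a few example specifiers (taken from the examples further below):

      all:latest#14/access=public      (all data centers, latest date, looking 14 days back, public datasets only)
      lrz:2023-11-29#0/access=public   (lrz only, exactly 2023-11-29, public datasets only)
      it4i:latest#7                    (it4i only, the last 7 days)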

    Commandline Escapes

    • All parameters up to the command are parsed by click and follow click syntax
    • Parameters after a specific command are interpreted as args by default, except when they contain an "=", which turns them into kwargs
    • The kwargs translation can be avoided using parentheses: "where=WHERE A=b" fills the kwargs parameter named where, while "(where=WHERE A=b)" is treated as an args parameter at the given position
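
    For example, in the following call (taken from the query examples further below), limit and where are passed as kwargs, while the specifier all:2023-12-4 is an args parameter; wrapping where in parentheses, as in "(where=url_suffix='at')", would pass it as a positional args value instead:

      owilix query less --local all:2023-12-4 limit=500 "where=url_suffix='at'"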

    Commands and Examples

    Note that both the owi and owilix commands are installed.

    owilix defines several command groups, namely local, remote and admin.
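
    Each command group can be explored via --help (standard click behaviour):

      owilix --help           # top-level overview
      owilix local --help     # local dataset management
      owilix remote --help    # commands operating on the data centers
      owilix admin --help     # administrative commands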

    Local Commands

    Local commands owi local are used to manage local datasets.

    • Listing local datasets

      owilix local ls all
      owilix local ls all:latest
      owilix local ls all:latest#14/access=public

      Lists datasets according to the specifier

    • Listing Files

        # List datasets and enumerate files matching the files filter
        owilix local ls all:latest files="**/*ciff*"
        # List datasets, enumerate files matching the files filter and show a grouped aggregation of file counts with depth k
        owilix local ls all files="**/*" groups=4
    • Creating a local dataset and inserting files into local datasets
      Creating a dataset from a local directory. Note that the dataset is only moved / copied to the configured owilix directory and not pushed to the server (which needs to be done separately)

      • If no InternalID is provided, the dataset is newly created

        owilix local insert file:///data/owseu/owilix/it4i/231203 access=public collectionName="mgrani" move=False
      • If an internalID is provided, only files will be inserted

        owilix local insert file:///data/owseu/owilix/it4i/231203 access=public collectionName="mgrani" internalID=33ad44a3-8a85-41bf-8118-1647987c4e52 move=False
      • Creating datasets on the project level and then testing the diffs:

        owilix local insert /Users/username/mydata//2023-12-03 access=project collectionName="main" move=False
        owilix --profile M remote diff all/access=project
      • Inserting using sub path selector (e.g. only parts of a directory):

        owilix local insert file:///data/migration/it4i/2023-12-5 'sub_path=year=2023/month=12/day=6/**/*' access=public collectionName="main" move=False owner="OpenWebSearch.eu Consortium" creator="OpenWebSearch.eu Consortium" publisher="OpenWebSearch.eu Consortium"
      • Listing all datasets that have been inserted, but not pushed (i.e. data center is unknown)

        owilix local ls all/dataCenter=unknown

    Remote Commands

    Remote commands operate on the specified data centers. You can use the --exclude dc1,dc2 flag to exclude some data centers.

    Pulling datasets

    Pulling works per file and allows for specifying file-based glob filters.
    Locally, datasets are stored in the configured directory, by default ~/.owi (this can be changed via the OWS_OWI_PATH environment variable or using owilix config). The download is also synced, i.e. files that already exist locally are not downloaded again from the remote. Please be as specific as possible to avoid downloading too much data.
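
    For example, to redirect downloads to a different directory before pulling (the path below is a hypothetical example):

      export OWS_OWI_PATH=/data/owi   # hypothetical target directory; owilix defaults to ~/.owi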

    • Basic Pulls:

      owilix remote pull lrz:latest/access=public
      owilix remote pull lrz:2024-01-03
      owilix remote pull all/access=public
    • Advanced pulls using query filters:

      owilix remote ls lrz:2023-11-29#0/access=public --details --file_select ".*eng.*"
      owilix remote pull lrz:2024-01-03/access=public;ResourceType=warc
      owilix remote pull all/access=public;subResourceType*=.*parq.* --files "**/*=slv*" --details

      Note that in the query part a '*' suffix on the key allows specifying a regular expression (e.g. subResourceType*=.*parq.*)

      • Pulls and push to another remote

          owilix remote pull it4i:latest#7 num_threads=1 push_to_remote=myrepository

        Note that you can add your own S3 repository to be available in owilix (but be careful, as datasets may then appear multiple times). To do so, add the following entry to the owilix.cfg file (usually under ~/.owi):

         repositories:
            config:
              .....
              myrepository:
                 options:
                      protocol: s3a                        # protocol: file, s3a, irods
                      key: yourkey                         # optional access credentials
                      secret: yoursecret                   # optional access credentials
                      endpoint: https://yourendpoint
                      path: openwebsearch-public/{access}  # your path; {access} is needed to tell owilix where to find the datasets of the different access levels (e.g. public/project/private)
                      async: False                         # whether the protocol is async capable; False is the safe option
                      anonymous: True                      # whether the connection is anonymous; set to True for public buckets
                 repository: s3a
    • Diffs between datasets: You can create diffs between datasets on both the dataset and the file level. Note that we usually use the minimal display profile to get the full IDs.

      owilix --profile M remote diff all/access=project
      owilix --exclude lrz remote diff all:latest/access=public;subResourceType*=.*parq.* files=**/*
    • Configuration:

      owilix config fields "Date,Title,Access,DataCenter"
    • DuckDB Queries:

      owilix duckdb lrz:2023-11-29#0/Access=public --files "*.parquet"
    • Logout: owilix logs in via LEXIS and keeps a refresh / offline token. If you want to remove that token after usage or for security reasons, do:

      owilix remote logout

    Workflows for creating datasets

    Datasets can be created in two steps:

    1. Inserting dataset into the local repository:

      owilix local insert /Users/username/mydata//2023-12-03 access=project collectionName="main" move=False

      After inserting, you can still add some more metadata to the local json file, or you can add additional files in the repository.

      Metadata can be overwritten during insert.

      owilix local insert /Users/mgrani/owseudata/migration/it4i/2023-12-03 access=project collectionName="main" move=False owner="OpenWebSearch.eu Consortium" creator="OpenWebSearch.eu Consortium" publisher="OpenWebSearch.eu Consortium"
    2. Pushing the data to the server. Note that this changes the internal id.

      owilix remote push  it4i:latest/access=project;internalId=6ebaf89d-adf2-4680-ab67-82b8eaf999fa

      Pushing includes a selection of local datasets and the specification of the remote properties, particularly the data center. Note that in this case the data center in the dataset specifier is interpreted as the remote data center. This can also be set explicitly using dataCenter=<dc>. Further remarks:

      • You can specify a file glob using files=glob and only those files will be considered for syncing (see the sketch after this list)
      • Sync does not push files that are already on the server. This behaviour can be changed using overwrite=True
      • Be careful when pushing data not to pollute datasets on the server side.
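
      A minimal sketch combining these remarks (the glob pattern is hypothetical; the specifier follows the push example above):

        owilix remote push it4i:latest/access=project files="**/*.parquet" overwrite=True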

    Querying Commands

    owilix allows querying datasets using Parquet files plus DuckDB. This works on both local and remote data in parallel. However, on remote data it can take some time to issue the query, especially when larger chunks of data need to be transferred. So depending on the query (e.g. not very specific select or where statements), it might be better to first pull the dataset and then issue the query.
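
    A sketch of the pull-first approach, assuming the same specifier is used for the pull and the subsequent local query (date and glob are illustrative):

      owilix remote pull lrz:2023-12-4/access=public --files "**/*.parquet"
      owilix query less --local lrz:2023-12-4 limit=500 "where=url_suffix='at'"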

    Listing dataset content:

    owilix supports a less command to browse content in datasets:

    • list all ".at" websites in local datasets from all data-centers with date 2023-12-4

      owilix query less --local all:2023-12-4 limit=500 "where=url_suffix='at'"
    • list all ".at" websites in remote datasets from all data-centers with date 2023-12-4, but only use repository 'it4i'

      owilix --remotes it4i query less --remote all:2023-12-4 limit=500 "where=url_suffix='at'"
    • list all ".at" websites but fetch 10 batches as buffer and select only url and title (should be faster)

        owilix --remotes it4i query less --remote it4i:2023-12-3  limit=500 "where=url_suffix='at'" select=url,title page_size=30 prefetch=10
    • list all ".at" websites but only consider files under the partition "language=deu". This should speed up the setup time, as fewer files need to be considered (note that the file selector needs to select parquet files)

        owilix --remotes it4i query less --remote it4i:2023-12-3  limit=500 "where=url_suffix='at' and domain_label='Regional'" files="**/language=deu/*parquet"

      Note that querying scans linearly over the data, so the more specific a query is, the longer it can take (more data has to be scanned before enough matching rows are found). Querying over multiple data centers is parallelized and thus should be faster.

    Advanced Usage Examples

    Debugging Problems

    owilix usually only reports user-facing information and logs messages on exceptions, errors or warnings. To get more information, you can set the --loglevel DEBUG option, which logs everything to the output.

    e.g.

    owilix --loglevel DEBUG remote ls all/access=public;subResourceType*=.*parq.*

    Configuration

    owilix also supports configuration via the command line. Use owilix config set key=value key2=value2 to set and owilix config get key key2 to get config keys. Keys have the format key1.key2#4.key3, where #k indicates a position in a list (with -1 meaning append).

    owilix config set "user.name=Michael Granitzer" "user.email=michael.granitzer@uni-passau.de" "user.organistion=University of Passau"
    owilix config get user.name user.email user.organistion
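
    A hypothetical sketch of the list-position syntax (the key path below is made up for illustration):

      # '#0' addresses the first list entry, '#-1' appends a new one (the key path is hypothetical)
      owilix config set "some.list#0.name=first" "some.list#-1.name=appended"
      owilix config get some.list#0.name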

    Using the LexisHTTP repository

    owilix primarily uses direct file-based repositories, particularly iRODS. However, this requires non-default ports to be open, which can cause problems in non-standard environments.

    To overcome this, owilix provides the so-called lexishttp repository, which is deactivated by default. The repository is slower and does not support all commands as of now, but you should be able to list and pull datasets with it.

    The option --remotes lexishttp activates the repository. Note that in order to keep the data consistent, no other data center that is available via the Lexis Portal should be active.

    owilix --remotes lexishttp remote ls all/access=public;dataCenter=it4i files=**/language=eng/index*

    Note also that the filtering for data centers changes. Usually, owilix is configured with one repository per data center in order to enable maximum parallel data transfer. However, the lexishttp repository aggregates datasets over data centers. The specifier selects the repository to be used, here lexishttp, which does not filter by dataCenter. Consequently, filtering for the data center must be done in the query part, as in the example above.

    Administrative Commands

    Should be handled with care

    Setting metadata directly

    Mainly intended to be used after an HPC execution workflow. Metadata are inferred automatically, and placeholders in strings (e.g. {resourcetype}) are also replaced.

    There is an interactive mode for correcting metadata which can be turned off using owilix --yes ......

    owilix admin set_irods_metadata irods://username:password@server:port/ZONE/path_in_zone metadata1=value1

    Alternatively, metadata can be set using a py4lexis session without specifying username:password as above (but this might require login via the B2ACCESS browser login):

    owilix admin set_irods_metadata owilix://<ZONE>/[public|project]/<DatasetID> do_infer=True do_count=True metadata1=value1 metadata2=value2
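
    As a hedged illustration of the placeholder replacement (the metadata key and value below are made up; {resourcetype} stands for the automatically inferred resource type):

      owilix admin set_irods_metadata owilix://<ZONE>/public/<DatasetID> do_infer=True title="OWI {resourcetype} slice"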

    Example 1: Establishing a Session and Querying Data

    from py4lexis.session import LexisSession
    import fsspec, irods_fsspec, duckdb
    from py4lexis.lexis_irods import iRODS
    
    # Authenticate
    session = LexisSession(in_cli=True)
    
    # Setup filesystem and retrieve data
    irods = iRODS(session)
    fs = fsspec.filesystem("irods", session=irods._iRODS__get_irods_session())
    files = fs.glob('/IT4ILexisV2/public/proj862c5962623246664c1fda27b7afb108/**/*.parquet')
    
    # Register filesystem in DuckDB and execute a query
    parquet_files = ["irods://" + file for file in files]
    duckdb.register_filesystem(fs)
    df = duckdb.sql(f"SELECT COUNT(*) FROM read_parquet({parquet_files[:2]})").df()
    print(df)

    Example 2: Running a Filtered Query

    # Build a filtered query against the first remote parquet file (note the irods:// prefix from Example 1)
    query = f"SELECT url FROM read_parquet('{parquet_files[0]}') WHERE url LIKE 'https://www.ru.nl/%' LIMIT 5"
    # Inspect the physical query plan before executing the query
    physical_plan = duckdb.sql(f"EXPLAIN {query}").df().iloc[0, 1]
    print(physical_plan)

    Troubleshooting

    • Authentication Issues: Delete the .env and/or ~/.tokens_lxs files. You can also run owilix clean, which does this for you.
    • Python Errors: Ensure Python 3.11 is used. Reinstall dependencies if needed.
    • DuckDB Issues: Reinstall DuckDB via pip or as an OWI plugin.

    Releases

    To publish via Poetry:

    poetry config repositories.opencode https://opencode.it4i.eu/api/v4/projects/92/packages/pypi
    poetry build 
    poetry publish --repository opencode -u <token-name> -p <token-secret>

    For help, run python -m owi.cli --help or owi --help.

    Screencast

    A demonstration is available via this screencast demo.

    Development

    Semantic Versioning

    owilix uses semantic versioning for releases, concretely Python semantic versioning. While it is not perfectly set up yet, there is a build script ./scripts/build.sh. The script expects a .env-rc file containing credentials for pushing releases via the env variables GITLAB_TOKEN_NAME and GITLAB_TOKEN_SECRET.
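
    A minimal sketch of such a .env-rc file, assuming it is sourced as a shell script (the values are placeholders):

      # .env-rc - credentials used by ./scripts/build.sh for pushing releases
      export GITLAB_TOKEN_NAME=<token-name>
      export GITLAB_TOKEN_SECRET=<token-secret>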

    Commit messages follow the Angular commit message conventions:

    • feat: A new feature
    • fix: A bug fix
    • docs: Documentation only changes
    • style: Changes that do not affect the meaning of the code (white-space, formatting, missing semicolons, etc.)
    • refactor: A code change that neither fixes a bug nor adds a feature
    • perf: A code change that improves performance
    • test: Adding missing or correcting existing tests
    • chore: Changes to the build process or auxiliary tools and libraries such as documentation generation

    Misc

    • poetry add --group dev sphinx myst-parser
    • poetry run sphinx-build -b html docs/source/ docs/build/html

    Roadmap

    • update to duckdb 1.0
    • update to py4lexis v2.2.2
    • license agreement
    • loading default config from dashboard to push config changes if needed.
    • add elastic / opensearch stream consumer