OWI CLI Documentation
Overview
The OWI (Open Web Index) CLI is a command-line tool designed for managing OpenWebIndex data slices. This documentation will guide you through its setup, usage, and troubleshooting.
Data License
By using this tool you agree to our Open Web Index License (OWIL), our Terms of Use, and the terms of use of the data centers involved (via B2ACCESS login).
Software License
`owilix` itself is licensed under the Apache 2.0 License.
Installation
Requirements
- Python Version: 3.10 or 3.11 (Python 3.12 may have issues with `py4lexis`)
- Package Manager: pip
- Optional Manager: Poetry (recommended for managing Python dependencies)
- Operating Systems: Linux, macOS (not tested under Windows)
Method 1: Install from Package URL
```shell
# Create a new environment
conda create -n owi pip python=3.11
conda activate owi

# Install required packages
pip install py4lexis --index-url https://opencode.it4i.eu/api/v4/projects/107/packages/pypi/simple
pip install owilix --index-url https://opencode.it4i.eu/api/v4/projects/92/packages/pypi/simple

# Verify installation
owilix --help
```
Upgrade

```shell
pip install --upgrade py4lexis --index-url https://opencode.it4i.eu/api/v4/projects/107/packages/pypi/simple
pip install --upgrade owilix --index-url https://opencode.it4i.eu/api/v4/projects/92/packages/pypi/simple
```
Method 2: Install from Repository
Installing from the repository directly requires having Poetry installed (in your base environment).

```shell
# Clone the repository
git clone https://opencode.it4i.eu/openwebsearcheu-public/owi-cli.git
cd owi-cli

# Install directly from the repository
poetry install
```
Using Poetry for Installation
```shell
# Add py4lexis as a source and install
poetry source add --priority=supplemental py4lexis https://opencode.it4i.eu/api/v4/projects/107/packages/pypi/simple
poetry add --source py4lexis py4lexis

# Clone the repository and install with Poetry
git clone https://opencode.it4i.eu/openwebsearcheu-public/owi-cli.git
cd owi-cli
poetry install
```
Quick Checks After Installation
After installation, verify that everything is working: run `owilix remote doctor` to check the connection status and `owilix remote ls all` to list all available datasets. For more, see Commands and Examples below.
Usage
Defaults
- Path: `~/.owi` (modifiable via the `OWS_OWI_PATH` environment variable or using the `--target` option)
- File Names: `{internalid}.tar.gz` for datasets and `{internalid}.json` for metadata
- Specifier Format: `{datacenter|all}:{YYYY-MM-DD|latest}#{days}/{key=value;key=value}`
Specifier Components
- Datacenter: a specific data center or "all" (default: "all"). Current data centers are `lrz`, `it4i`, `csc`
- Date: in `YYYY-MM-DD` format or "latest" (default: current day)
- Days: number of days in the past to retrieve (optional)
- Key=Value: additional filters (optional). If a key ends with "*", the value is interpreted as a regular expression
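To make the specifier grammar concrete, here is a small parser sketch in Python. This is purely illustrative: the `Specifier` class, `parse_specifier` function, and field names are hypothetical and not part of owilix itself.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Specifier:
    """Illustrative container for the parts of an owilix dataset specifier."""
    datacenter: str = "all"
    date: str = "latest"
    days: int = 0
    filters: dict = field(default_factory=dict)

def parse_specifier(spec: str) -> Specifier:
    """Parse '{datacenter|all}:{YYYY-MM-DD|latest}#{days}/{key=value;...}'."""
    m = re.fullmatch(
        r"(?P<dc>[^:#/]+)"          # data center or 'all'
        r"(?::(?P<date>[^#/]+))?"   # optional date or 'latest'
        r"(?:#(?P<days>\d+))?"      # optional number of days in the past
        r"(?:/(?P<filters>.+))?",   # optional key=value;key=value filters
        spec,
    )
    if m is None:
        raise ValueError(f"invalid specifier: {spec!r}")
    filters = {}
    for pair in (m.group("filters") or "").split(";"):
        if pair:
            key, _, value = pair.partition("=")
            filters[key] = value  # a key ending in '*' marks the value as a regexp
    return Specifier(
        datacenter=m.group("dc"),
        date=m.group("date") or "latest",
        days=int(m.group("days") or 0),
        filters=filters,
    )

print(parse_specifier("lrz:2023-11-29#14/access=public;subResourceType*=.*parq.*"))
```

The same grammar covers the short forms: `all` alone defaults the date to "latest" with no filters.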
Commandline Escapes
- All parameters up to the command are parsed by Click and follow Click syntax
- Parameters for a specific command are interpreted as `args` by default, except when they contain an "=", which translates them into `kwargs`
- The `kwargs` translation can be avoided using parentheses: while `"where=WHERE A=b"` fills the `kwargs` parameter named `where`, `"(where=WHERE A=b)"` is treated as an `args` parameter at the given position
Commands and Examples
Note that both the `owi` and `owilix` commands are installed. `owilix` defines different command groups, namely `local`, `remote`, and `admin`.
Local Commands
Local commands (`owilix local`) are used to manage local datasets.
- Listing local datasets:

  ```shell
  owilix local ls all
  owilix local ls all:latest
  owilix local ls all:latest#14/access=public
  ```

  Lists datasets according to the specifier.

- Listing files:

  ```shell
  # List datasets and enumerate files matching the files= filter
  owilix local ls all:latest files="**/*ciff*"

  # List datasets, enumerate files matching the filter, and show a grouped
  # aggregation of file counts with depth 4
  owilix local ls all files="**/*" groups=4
  ```
- Creating a local dataset and inserting files into local datasets:

  Creating a dataset from a local directory. Note that the dataset is only moved/copied to the configured owilix directory and not pushed to the server (which needs to be done separately).

  - If no internalID is provided, a new dataset is created:

    ```shell
    owilix local insert file:///data/owseu/owilix/it4i/231203 access=public collectionName="mgrani" move=False
    ```

  - If an internalID is provided, only files will be inserted:

    ```shell
    owilix local insert file:///data/owseu/owilix/it4i/231203 access=public collectionName="mgrani" internalID=33ad44a3-8a85-41bf-8118-1647987c4e52 move=False
    ```

  - Creating datasets on the project level and then testing the diffs:

    ```shell
    owilix local insert /Users/username/mydata//2023-12-03 access=project collectionName="main" move=False
    owilix --profile M remote diff all/access=project
    ```

  - Inserting using a sub-path selector (e.g. only parts of a directory):

    ```shell
    owilix local insert file:///data/migration/it4i/2023-12-5 'sub_path=year=2023/month=12/day=6/**/*' access=public collectionName="main" move=False owner="OpenWebSearch.eu Consortium" creator="OpenWebSearch.eu Consortium" publisher="OpenWebSearch.eu Consortium"
    ```

- Listing all datasets that have been inserted but not yet pushed (i.e. the data center is unknown):

  ```shell
  owilix local ls all/dataCenter=unknown
  ```
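The `files=` glob filters used in the listing commands can be previewed with plain Python before running a command. The helper below is illustrative and not part of owilix; it uses `fnmatch`, whose `*` also crosses `/` boundaries, so it is only an approximation of full `**` glob semantics, but close enough to sanity-check a pattern:

```python
from fnmatch import fnmatch

def preview_filter(paths, pattern):
    """Return the paths that an owilix-style files= glob would select.

    fnmatch's '*' also matches '/', so this only approximates real
    '**' glob handling, but it previews filters such as '**/*ciff*'.
    """
    return [p for p in paths if fnmatch(p, pattern)]

sample = [
    "ds1/index/part-000.ciff",
    "ds1/metadata/dataset.json",
    "ds1/language=deu/data-000.parquet",
    "ds1/language=eng/data-001.parquet",
]
print(preview_filter(sample, "**/*ciff*"))               # the CIFF index file
print(preview_filter(sample, "**/language=deu/*parquet"))  # the German partition
```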
Remote Commands
Remote commands operate on the specified data centers. You can use the `--exclude dc1,dc2` flag to exclude some data centers.

Pulling datasets
Pulling works per file and allows for specifying file-based glob filters. Locally, datasets are stored in the configured directory, by default `~/.owi` (can be changed via the `OWS_OWI_PATH` environment variable or using `owilix config`). The download is also synced, i.e. files that already exist locally are not downloaded from the remote. Please be as specific as possible to avoid downloading too much data.
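The sync behaviour (skip files that already exist locally) can be pictured roughly as follows. This is an illustrative sketch only; owilix's actual sync logic may also compare sizes or checksums:

```python
from pathlib import Path

def files_to_download(remote_files, local_root):
    """Return the remote files that are missing locally.

    Sketch of sync-style pulling: a file is skipped when a file with
    the same relative path already exists under local_root.
    """
    root = Path(local_root)
    return [f for f in remote_files if not (root / f).exists()]

# Example: with an empty local directory, everything would be downloaded
import tempfile
with tempfile.TemporaryDirectory() as tmp:
    remote = ["ds1/data-000.parquet", "ds1/data-001.parquet"]
    print(files_to_download(remote, tmp))  # both files are missing locally
```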
- Basic pulls:

  ```shell
  owilix remote pull lrz:latest/access=public
  owilix remote pull lrz:2024-01-03
  owilix remote pull all/access=public
  ```

- Advanced pulls using query types:

  ```shell
  owilix remote ls lrz:2023-11-29#0/access=public --details --file_select ".*eng.*"
  owilix remote pull lrz:2024-01-03/access=public;ResourceType=warc
  owilix remote pull all/access=public;subResourceType*=.*parq.* --files "**/*=slv*" --details
  ```

  Note that in the query part a '*' suffix allows specifying a regular expression (e.g. `subResourceType*=.*parq.*`).
- Pulls and push to another remote:

  ```shell
  owilix remote pull it4i:latest#7 num_threads=1 push_to_remote=myrepository
  ```

  Note that you can add your own S3 repository to make it available in owilix (but be careful, as datasets then appear multiple times). To do so, add the following entry to the `owilix.cfg` file (usually under `~/.owi`):

  ```yaml
  repositories:
    config:
      .....
      myrepository:
        options:
          protocol: s3a        # protocol: file, s3a, irods
          key: yourkey         # optional access credentials
          secret: yoursecret   # optional access credentials
          endpoint: https://yourendpoint
          path: openwebsearch-public/{access}  # your path; note that {access} is needed to tell owilix
                                               # where to find datasets of different access levels
                                               # (e.g. public/project/private)
          async: False         # whether the protocol is async-capable; False is the safe option
          anonymous: True      # whether the connection is anonymous; to be used for public buckets
  download repository: s3a
  ```
- Diffs between datasets: you can create diffs between datasets on both the dataset and the file level. Note that we usually use the minimal display profile to get the full IDs.

  ```shell
  owilix --profile M remote diff all/access=project
  owilix --exclude lrz remote diff all:latest/access=public;subResourceType*=.*parq.* files=**/*
  ```
- Configuration:

  ```shell
  owilix config fields "Date,Title,Access,DataCenter"
  ```

- DuckDB queries:

  ```shell
  owilix duckdb lrz:2023-11-29#0/Access=public --files "*.parquet"
  ```
- Logout:

  `owilix` logs in via LEXIS and keeps a refresh/offline token. If you want to remove that token after usage or for security reasons, run:

  ```shell
  owilix remote logout
  ```
Workflows for creating datasets
Datasets can be created in two steps:

1. Inserting the dataset into the local repository:

   ```shell
   owilix local insert /Users/username/mydata//2023-12-03 access=project collectionName="main" move=False
   ```

   After inserting, you can still add more metadata to the local JSON file, or you can add additional files in the repository. Metadata can be overwritten during insert:

   ```shell
   owilix local insert /Users/mgrani/owseudata/migration/it4i/2023-12-03 access=project collectionName="main" move=False owner="OpenWebSearch.eu Consortium" creator="OpenWebSearch.eu Consortium" publisher="OpenWebSearch.eu Consortium"
   ```

2. Pushing the data to the server. Note that this changes the internal ID:

   ```shell
   owilix remote push it4i:latest/access=project;internalId=6ebaf89d-adf2-4680-ab67-82b8eaf999fa
   ```

   Pushing includes a selection of local datasets and the specification of the remote properties, particularly the data center. Note that in this case the data center in the dataset specifier is interpreted as the remote data center. It can also be set explicitly using `dataCenter=<dc>`. Further remarks:

   - You can specify a file glob using `files=glob`, and only those files will be considered for syncing
   - Sync does not push files that are already on the server. This behaviour can be changed using `overwrite=True`
   - Be careful when pushing data not to pollute datasets server-side
Querying Commands
`owilix` allows querying datasets using Parquet files plus DuckDB. This works on both local and remote data in parallel. However, on remote data it can take some time to issue the query, especially when larger chunks of data need to be transferred. So, depending on the query (e.g. a not very specific select or where statement), it might be better to first pull the dataset and then issue the query.
Listing dataset content:
`owilix` supports a `less` command to browse content in datasets:

- List all ".at" websites in local datasets from all data centers with date 2023-12-4:

  ```shell
  owilix query less --local all:2023-12-4 limit=500 "where=url_suffix='at'"
  ```

- List all ".at" websites in remote datasets from all data centers with date 2023-12-4, but only use the repository 'it4i':

  ```shell
  owilix --remotes it4i query less --remote all:2023-12-4 limit=500 "where=url_suffix='at'"
  ```

- List all ".at" websites, but fetch 10 batches as a buffer and select only url and title (should be faster):

  ```shell
  owilix --remotes it4i query less --remote it4i:2023-12-3 limit=500 "where=url_suffix='at'" select=url,title page_size=30 prefetch=10
  ```

- List all ".at" websites, but only consider files under the partition "language=deu". This should speed up the setup time, as fewer files need to be considered (note that the file selector needs to select Parquet files):

  ```shell
  owilix --remotes it4i query less --remote it4i:2023-12-3 limit=500 "where=url_suffix='at' and domain_label='Regional'" files="**/language=deu/*parquet"
  ```
Note that querying goes linearly over the data, so the more specific a query is, the longer it can take. Querying over multiple data centers is parallelized and thus should be faster.
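The `select=`, `where=`, and `limit=` parameters of `less` map naturally onto a DuckDB query over the dataset's Parquet files. The sketch below shows only that query construction as an assumption about how such parameters could translate to SQL; it is not owilix's actual implementation:

```python
def build_less_query(parquet_files, select="*", where=None, limit=None):
    """Compose a DuckDB-style SQL query from less-command parameters.

    Sketch only: shows how select=, where= and limit= could translate
    into SQL over read_parquet(); not owilix's internal code.
    """
    files = ", ".join(f"'{f}'" for f in parquet_files)
    sql = f"SELECT {select} FROM read_parquet([{files}])"
    if where:
        sql += f" WHERE {where}"
    if limit:
        sql += f" LIMIT {limit}"
    return sql

print(build_less_query(
    ["ds1/language=deu/data-000.parquet"],
    select="url,title",
    where="url_suffix='at'",
    limit=500,
))
```

Narrowing `parquet_files` with a `files=` partition glob (as in the last example above) shrinks the file list before the query even starts, which is why it speeds up setup.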
Advanced Usage Examples
Debugging Problems
`owilix` usually only reports user information, logging only exceptions, errors, or warnings. To get more information, you can set the `--loglevel DEBUG` option, which logs everything to the output, e.g.:

```shell
owilix --loglevel DEBUG remote ls all/access=public;subResourceType*=.*parq.*
```
Configuration
`owilix` also supports configuration via the command line: use `owilix config set key=value key2=value` and `owilix config get key key2` for setting and getting config keys. Keys have the format `key1.key2#4.key3`, where `#k` indicates a position in a list (with -1 meaning append).

```shell
owilix config set "user.name=Michael Granitzer" "user.email=michael.granitzer@uni-passau.de" "user.organisation=University of Passau"
owilix config get user.name user.email user.organisation
```
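The `key1.key2#4.key3` key format can be illustrated with a small resolver over a nested dict/list structure. This helper is hypothetical, written only to demonstrate the key grammar, and is not owilix's own code:

```python
def set_config_key(config, dotted_key, value):
    """Set a value in a nested dict/list config using owilix-style keys.

    'a.b#1.c' walks dict key 'a', list index 1, then dict key 'c';
    '#-1' appends a new element to the list. Illustrative sketch only.
    """
    def split(part):
        name, _, idx = part.partition("#")
        return name, (int(idx) if idx else None)

    parts = dotted_key.split(".")
    node = config
    for part in parts[:-1]:
        name, idx = split(part)
        node = node.setdefault(name, [] if idx is not None else {})
        if idx is not None:
            if idx == -1:
                node.append({})
            node = node[idx]
    name, idx = split(parts[-1])
    if idx is None:
        node[name] = value
    else:
        lst = node.setdefault(name, [])
        if idx == -1:
            lst.append(value)
        else:
            lst[idx] = value

cfg = {}
set_config_key(cfg, "user.name", "Michael Granitzer")
set_config_key(cfg, "remotes#-1", "it4i")
print(cfg)  # {'user': {'name': 'Michael Granitzer'}, 'remotes': ['it4i']}
```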
LexisHTTP repository
`owilix` primarily uses direct file-based repositories, particularly iRODS. However, this requires non-default ports to be open, which can cause problems in non-standard environments. To overcome this, `owilix` provides the so-called `lexishttp` repository, which is deactivated by default. This repository is slower and does not support all commands as of now, but you should be able to list and pull datasets with it. The option `--remotes lexishttp` activates the repository. Note that in order to keep the data consistent, no other data center that is available via the LEXIS Portal should be active.

```shell
owilix --remotes lexishttp remote ls all/access=public;dataCenter=it4i files=**/language=eng/index*
```

Note also that filtering by data center changes. Usually, `owilix` is configured with one repository per data center in order to enable maximum parallel data transfer. However, the lexishttp repository aggregates datasets over data centers. The specifier selects the repository to be used, here `lexishttp`, which does not itself filter by `dataCenter`. Consequently, filtering by data center must be done in the query part, as in the example above.
Administrative Commands
These should be handled with care.

Setting metadata directly
This is mainly intended to be used after an HPC execution workflow. Metadata are inferred automatically, and placeholders in strings (e.g. `{resourcetype}`) are also replaced. There is an interactive mode for correcting metadata, which can be turned off using `owilix --yes ...`.

```shell
owilix admin set_irods_metadata irods://username:password@server:port/ZONE/path_in_zone metadata1=value1
```

Alternatively, metadata can be set using a py4lexis session without specifying username:password above (but this might require login via the B2ACCESS browser login):

```shell
owilix admin set_irods_metadata owilix://<ZONE>/[public|project]/<DatasetID> do_infer=True do_count=True metadata1=value1 metadata2=value2
```
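The placeholder replacement mentioned above (e.g. `{resourcetype}`) can be pictured as simple template substitution over the metadata values. The function below is an illustrative sketch under that assumption; owilix's actual inference and replacement logic may be more involved:

```python
import re

def fill_placeholders(metadata):
    """Replace {key} placeholders in string values with other metadata values.

    Sketch only: a single pass that leaves unknown placeholders untouched.
    """
    def substitute(value):
        return re.sub(
            r"\{(\w+)\}",
            lambda m: str(metadata.get(m.group(1), m.group(0))),
            value,
        )
    return {k: substitute(v) if isinstance(v, str) else v
            for k, v in metadata.items()}

meta = {
    "resourcetype": "warc",
    "title": "Crawl {resourcetype} slice",
    "unknown": "keep {missing} as-is",
}
print(fill_placeholders(meta))
```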
Example 1: Establishing a Session and Querying Data
```python
from py4lexis.session import LexisSession
from py4lexis.lexis_irods import iRODS
import fsspec, irods_fsspec, duckdb

# Authenticate
session = LexisSession(in_cli=True)

# Set up the filesystem and retrieve data
irods = iRODS(session)
fs = fsspec.filesystem("irods", session=irods._iRODS__get_irods_session())
files = fs.glob('/IT4ILexisV2/public/proj862c5962623246664c1fda27b7afb108/**/*.parquet')

# Register the filesystem in DuckDB and execute a query
parquet_files = ["irods://" + file for file in files]
duckdb.register_filesystem(fs)
df = duckdb.sql(f"SELECT COUNT(*) FROM read_parquet({parquet_files[:2]})").df()
print(df)
```
Example 2: Running a Filtered Query
```python
query = f"SELECT url FROM read_parquet('{files[0]}') WHERE url LIKE 'https://www.ru.nl/' LIMIT 5"
physical_plan = duckdb.sql(f"EXPLAIN {query}").df().iloc[0, 1]
print(physical_plan)
```
Troubleshooting
- Authentication Issues: delete the `.env` and/or `~/.tokens_lxs` files. You can also run `owilix clean`, which does this for you.
- Python Errors: ensure Python 3.11 is used. Reinstall dependencies if needed.
- DuckDB Issues: reinstall DuckDB via pip or as an OWI plugin.
Releases
To publish via Poetry:

```shell
poetry config repositories.opencode https://opencode.it4i.eu/api/v4/projects/92/packages/pypi
poetry build
poetry publish --repository opencode -u <token-name> -p <token-secret>
```

For help, run `python -m owi.cli --help` or `owi --help`.
Screencast
A demonstration is available via this screencast demo.
Development
Semantic Versioning
`owilix` uses semantic versioning for releases, concretely Python semantic versioning. While it is not perfectly set up yet, there is a build script `./scripts/build.sh`. The script expects a `.env-rc` file containing credentials for pushing releases via the environment variables `GITLAB_TOKEN_NAME` and `GITLAB_TOKEN_SECRET`.
Commit messages follow the Angular commit message conventions:
- feat: A new feature
- fix: A bug fix
- docs: Documentation only changes
- style: Changes that do not affect the meaning of the code (white-space, formatting, missing semicolons, etc.)
- refactor: A code change that neither fixes a bug nor adds a feature
- perf: A code change that improves performance
- test: Adding missing or correcting existing tests
- chore: Changes to the build process or auxiliary tools and libraries such as documentation generation
Misc
```shell
poetry add --group dev sphinx myst-parser
poetry run sphinx-build -b html docs/source/ docs/build/html
```
Roadmap
- update to duckdb 1.0
- update to py4lexis v2.2.2
- license agreement
- loading default config from dashboard to push config changes if needed.
- add elastic / opensearch stream consumer