A small example project for efficiently configuring a Python application with YAMLs and the CLI

Overview

Project generated with PyScaffold

Hydra Example Project for Python

A small example project for efficiently configuring a Python application with YAMLs and the CLI.

Why should I care?

A frequent requirement for productive Python application is that they are configurable via configuration files and/or the command-line-interface (CLI). This allows you to change the behavior without touching the source code by e.g. configuring another database URL or the logging verbosity. For the CLI-part, Click is often used and with PyYAML configuration files can be easily read, so where is the problem?

The CLI and configuration file of a Python application have many things in commmon, i.e., both

  1. configure the runtime behaviour of your application,
  2. need to implement validations, e.g. is the port an integer above 1024,
  3. need to be consistent and mergeable, i.e. a CLI flag should be named like the YAML key and if both are passed the CLI overwrites the YAML configuration.

Thus implementing the CLI and configuration by a YAML file seperated from each other, leads to often to code duplication and inconsistent behavior, not to mention the enormous amount of work that must be done to get this right.

With this in mind, Facebook implemented the Hydra library, which allows you to do hierarchical configuration by composition and override it through config files and the command-line. This repository serves as an example project set up to demonstrate the most important features. We also show how Hydra can be used in conjunction with pydantic, which extends the validation capabilities of OmegaConf that is used internally by Hydra.

Ok, so give me the gist of it!

Sure, just take a look into src/my_pkg/cli.py and src/my_pkg/config.py first as these are the only files we added, roughly 70 lines of code. The hierarchical configuration can be found in configs.

├── configs
│   ├── main.yaml             <- entry point for configuration
│   ├── db                    <- database configuration group
│   │   ├── mysql.yaml        <- configuration of MySQL
│   │   └── postgresql.yaml   <- configuration for PostgreSQL
│   └── experiment            <- experiment configuration group
│       ├── exp1.yaml         <- configuration for experiment 1
│       ├── exp2.yaml         <- configuration for experiment 2
│       ├── missing_key.yaml  <- wrong configuration with missing key
│       └── wrong_type.yaml   <- wrong configuration with wrong type

Basically, this structure allows you to mix and match your configuration by choosing for instance the MySQL database with the setup of experiment 2. Each group then corresponds to an attribute in the configuration object, which includes other attributes, just think of a dictionary where some keys are again dictionaries.

We can invoke our application with the console command hydra-test and this will execute the main function in cli.py:

None: # this line actually runs the checks of pydantic OmegaConf.to_object(cfg) print(OmegaConf.to_yaml(cfg)) # we just print the config and wait a few secs # note that IDEs allow auto-complete for accessing the attributes! time.sleep(cfg.main.sleep)">
@hydra.main(config_path=None, config_name="main")
def main(cfg: Config) -> None:
    # this line actually runs the checks of pydantic
    OmegaConf.to_object(cfg)
    print(OmegaConf.to_yaml(cfg)) # we just print the config and wait a few secs
    # note that IDEs allow auto-complete for accessing the attributes!
    time.sleep(cfg.main.sleep)

So executing just hydra-test results in:

Cannot find primary config 'main'. Check that it's in your config search path.

Config search path:
	provider=hydra, path=pkg://hydra.conf
	provider=main, path=pkg://my_pkg
	provider=schema, path=structured://

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

This is due to the fact that we set config_path=None, which is desirable for productive application. The application itself doesn't know where it is gonna be installed and thus defining a path to the configuration files doesn't make sense. For this reason we pass the configuration at execution with -cd, short for --config-dir:

hydra-test -cd configs

This results in the error:

Error executing job with overrides: []
Traceback (most recent call last):
  File ".../hydra-example-project/src/my_pkg/cli.py", line 11, in main
    OmegaConf.to_object(cfg)
omegaconf.errors.MissingMandatoryValue: Structured config of type `Config` has missing mandatory value: experiment
    full_key: experiment
    object_type=Config

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

This is exactly as we want it, since by taking a look into config.py, we see that the schema of the main configuration is:

@dataclass
class Config:
    main: Main
    db: DataBase
    neptune: Neptune
    experiment: Experiment = MISSING

So experiment is a mandatory parameter that the CLI user needs to provide. Thus, we add +experiment=exp1 to select the configuration from exp1.yaml and finally get what we expect:

❯ hydra-test -cd configs +experiment=exp1
main:
  sleep: 3
neptune:
  project: florian.wilhelm/my_expriments
  api_token: ~/.neptune_api_token
  tags:
  - run-1
  description: Experiment run on GCP
  mode: async
db:
  driver: mysql
  host: server_string
  port: 1028
  username: myself
  password: secret
experiment:
  model: XGBoost
  l2: 0.01
  n_steps: 1000

Note that we had to use a plus sign in the flag +experiment as we added the mandatory experiment parameter.

So the section main and neptune are directly defined in main.yaml but why did Hydra now choose the MySQL database? This is due to fact that in main.yaml, we defined some defaults:

# hydra section to build up the config hierarchy with defaults
defaults:
  - _self_
  - base_config
  - db: mysql.yaml
  # experiment: is not mentioned here but in config.py to have a mandatory setting

We can override this default behavior by adding db=postgresql and this time without + as we override a default:

❯ hydra-test -cd configs +experiment=exp1 db=postgresql
Error executing job with overrides: ['+experiment=exp1', 'db=postgresql']
Traceback (most recent call last):
  File ".../hydra-example-project/src/my_pkg/cli.py", line 11, in main
    OmegaConf.to_object(cfg)
pydantic.error_wrappers.ValidationError: 1 validation error for DataBase
port
  Choose a non-privileged port! (type=value_error)

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Nice this worked as expected by telling us that our port configuration is actually wrong as we chose a privileged port! This is the magic of pydantic doing its validation work. Taking a look into config.py, we see the check that assures a port greater than 1023.

int: if port < 1024: raise ValueError(f"Choose a non-privileged port!") return port port: int username: str password: str">
@dataclass
class DataBase:
    driver: str
    host: str

    @validator("port")
    def validate_some_var(cls, port: int) -> int:
        if port < 1024:
            raise ValueError(f"Choose a non-privileged port!")
        return port

    port: int
    username: str
    password: str

Good, so we could fix our configuration file or pass an extra parameter if we are in a hurry, i.e.:

❯ hydra-test -cd configs +experiment=exp1 db=postgresql db.port=1832
main:
  sleep: 3
neptune:
  project: florian.wilhelm/my_expriments
  api_token: ~/.neptune_api_token
  tags:
  - run-1
  description: Experiment run on GCP
  mode: async
db:
  driver: postgreqsql
  host: server_string
  port: 1832
  username: me
  password: birthday
experiment:
  model: XGBoost
  l2: 0.01
  n_steps: 1000

And this works! So much flexibility and robustness in just 70 lines of code, awesome! While you are at it, you can also run hydra-test -cd configs +experiment=missing_key and hydra-test -cd configs +experiment=wrong_type to see some nice errors from pydantic telling you about a missing key and wrong type of the configuration value, respectively. Hydra and pydantic work nicely together. Just remember to use the dataclass from pydantic, not the standard library and call OmegaConf.to_object(cfg) at the start of your application to fail as early as possible.

Hydra has many more, really nice features. Imagine you want to run now the experiments exp1 and exp2 consecutively, you can just use the --multirun feature, or -m for short:

hydra-test -m -cd configs "+experiment=glob(exp*)"

There's much more to Hydra and several plugins even for hyperparameter optimization exist. So check it out!

BTW, how was this example project set up?

Glad you asked. This project was set up using the miraculous PyScaffold with the dsproject extension. The setup was as simple as:

putup -f hydra-example-project --dsproject -p my_pkg

In order to have the CLI command hydra-test, we changed in setup.cfg the following lines:

# Add here console scripts like:
console_scripts =
     hydra-test = my_pkg.cli:main

Remember to run pip install -e . after any changes to setup.cfg.

Installation & Testing

If you want to play around with this project, just clone it and set up the necessary environment:

  1. create an environment hydra-example-project with the help of conda:
    conda env create -f environment.yml
    
  2. activate the new environment with:
    conda activate hydra-example-project
    

and you are all set to run hydra-test.

Project Organization

├── AUTHORS.md              <- List of developers and maintainers.
├── CHANGELOG.md            <- Changelog to keep track of new features and fixes.
├── CONTRIBUTING.md         <- Guidelines for contributing to this project.
├── Dockerfile              <- Build a docker container with `docker build .`.
├── LICENSE.txt             <- License as chosen on the command-line.
├── README.md               <- The top-level README for developers.
├── configs                 <- Directory for configurations of model & application.
├── data
│   ├── external            <- Data from third party sources.
│   ├── interim             <- Intermediate data that has been transformed.
│   ├── processed           <- The final, canonical data sets for modeling.
│   └── raw                 <- The original, immutable data dump.
├── docs                    <- Directory for Sphinx documentation in rst or md.
├── environment.yml         <- The conda environment file for reproducibility.
├── models                  <- Trained and serialized models, model predictions,
│                              or model summaries.
├── notebooks               <- Jupyter notebooks. Naming convention is a number (for
│                              ordering), the creator's initials and a description,
│                              e.g. `1.0-fw-initial-data-exploration`.
├── pyproject.toml          <- Build configuration. Don't change! Use `pip install -e .`
│                              to install for development or to build `tox -e build`.
├── references              <- Data dictionaries, manuals, and all other materials.
├── reports                 <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures             <- Generated plots and figures for reports.
├── scripts                 <- Analysis and production scripts which import the
│                              actual PYTHON_PKG, e.g. train_model.
├── setup.cfg               <- Declarative configuration of your project.
├── setup.py                <- [DEPRECATED] Use `python setup.py develop` to install for
│                              development or `python setup.py bdist_wheel` to build.
├── src
│   └── my_pkg              <- Actual Python package where the main functionality goes.
├── tests                   <- Unit tests which can be run with `pytest`.
├── .coveragerc             <- Configuration for coverage reports of unit tests.
├── .isort.cfg              <- Configuration for git hook that sorts imports.
└── .pre-commit-config.yaml <- Configuration of pre-commit git hooks.

Note

This project has been set up using PyScaffold 4.1.4 and the dsproject extension 0.7.

Owner
Florian Wilhelm
Data Scientist with a mathematical background.
Florian Wilhelm
Apt2sbom python package generates SPDX or YAML files

Welcome to apt2sbom This package contains a library and a CLI tool to convert a Ubuntu software package inventory to a software bill of materials. You

Eliot Lear 15 Nov 13, 2022
Load Django Settings from Environmental Variables with One Magical Line of Code

DjEnv: Django + Environment Load Django Settings Directly from Environmental Variables features modify django configuration without modifying source c

Daniel J. Dufour 28 Oct 01, 2022
sqlconfig: manage your config files with sqlite

sqlconfig: manage your config files with sqlite The problem Your app probably has a lot of configuration in git. Storing it as files in a git repo has

Pete Hunt 4 Feb 21, 2022
A YAML validator for Programming Historian lessons.

phyaml A simple YAML validator for Programming Historian lessons. USAGE: python3 ph-lesson-yaml-validator.py lesson.md The script automatically detect

Riva Quiroga 1 Nov 07, 2021
Python-dotenv reads key-value pairs from a .env file and can set them as environment variables.

python-dotenv Python-dotenv reads key-value pairs from a .env file and can set them as environment variables. It helps in the development of applicati

Saurabh Kumar 5.5k Jan 04, 2023
A set of Python scripts and notebooks to help administer and configure Workforce projects.

Workforce Scripts A set of Python scripts and notebooks to help administer and configure Workforce projects. Notebooks Several example Jupyter noteboo

Esri 75 Sep 09, 2022
A slightly opinionated template for iPython configuration for interactive development

A slightly opinionated template for iPython configuration for interactive development. Auto-reload and no imports for packages and modules in the project.

Seva Zhidkov 24 Feb 16, 2022
environs is a Python library for parsing environment variables.

environs: simplified environment variable parsing environs is a Python library for parsing environment variables. It allows you to store configuration

Steven Loria 920 Jan 04, 2023
Napalm-vs-openconfig - Comparison of NAPALM and OpenConfig YANG with NETCONF transport

NAPALM vs NETCONF/OPENCONFIG Abstracts Multi vendor network management and autom

Anton Karneliuk 1 Jan 17, 2022
Pyleri is an easy-to-use parser created for SiriDB

Python Left-Right Parser Pyleri is an easy-to-use parser created for SiriDB. We first used lrparsing and wrote jsleri for auto-completion and suggesti

Cesbit 106 Dec 06, 2022
Yamale (ya·ma·lē) - A schema and validator for YAML.

Yamale (ya·ma·lē) ⚠️ Ensure that your schema definitions come from internal or trusted sources. Yamale does not protect against intentionally maliciou

23andMe 534 Dec 21, 2022
A Python library to parse PARI/GP configuration and header files

pari-utils A Python library to parse PARI/GP configuration and header files. This is mainly used in the code generation of https://github.com/sagemath

Sage Mathematical Software System 3 Sep 18, 2022
Python YAML Environment (ymlenv) by Problem Fighter Library

In the name of God, the Most Gracious, the Most Merciful. PF-PY-YMLEnv Documentation Install and update using pip: pip install -U PF-PY-YMLEnv Please

Problem Fighter 2 Jan 20, 2022
Event Coding for the HV Protocol MEG datasets

Scripts for QA and trigger preprocessing of NIMH HV Protocol Install pip install git+https://github.com/nih-megcore/hv_proc Usage hv_process.py will

2 Nov 14, 2022
A modern simfile parsing & editing library for Python 3

A modern simfile parsing & editing library for Python 3

ash garcia 38 Nov 01, 2022
Config files for my GitHub profile.

Hacked This is a python base script from which you can hack or clone any person's facebook friendlist or followers accounts which have simple password

2 Dec 10, 2021
An application pulls configuration information from JSON files generated

AP Provisioning Automation An application pulls configuration information from JSON files generated by Ekahau and then uses Netmiko to configure the l

Cisco GVE DevNet Team 1 Dec 17, 2021
Organize Django settings into multiple files and directories. Easily override and modify settings. Use wildcards and optional settings files.

Organize Django settings into multiple files and directories. Easily override and modify settings. Use wildcards in settings file paths and mark setti

Nikita Sobolev 942 Jan 05, 2023
A compact library for Python 3.10x that allows users to configure their SimPads real-time

SimpadLib v1.0.6 What is this? This is a python library programmed by Ashe Muller that allows users to interface directly with their SimPad devices, a

Ashe Muller 2 Jan 08, 2022
Dynamic Django settings.

Constance - Dynamic Django settings A Django app for storing dynamic settings in pluggable backends (Redis and Django model backend built in) with an

Jazzband 1.5k Jan 04, 2023