A Lightweight Cluster/Cloud VM Job Management Tool 🚀

Overview

Are you looking for a tool to manage your training runs locally, on Slurm/Open Grid Engine clusters, SSH servers or Google Cloud Platform VMs? mle-scheduler provides a lightweight API to launch and monitor job queues. It smoothly orchestrates simultaneous runs for different configurations and/or random seeds. It is meant to reduce boilerplate and to make job resource specification intuitive. It comes with two core pillars:

  • MLEJob: Launches and monitors a single job on a resource (Slurm, Open Grid Engine, GCP, SSH, etc.).
  • MLEQueue: Launches and monitors a queue of jobs with different training configurations and/or seeds.

For a quickstart, check out the notebook blog or the example scripts 📖

Installation

pip install mle-scheduler

Managing a Single Job with MLEJob Locally 🚀

from mle_scheduler import MLEJob

# python train.py -config base_config_1.yaml -exp_dir logs_single -seed_id 1
job = MLEJob(
    resource_to_run="local",
    job_filename="train.py",
    config_filename="base_config_1.yaml",
    experiment_dir="logs_single",
    seed_id=1
)

_ = job.run()
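
The comment above shows the exact command MLEJob composes. For context, here is a minimal sketch of a compatible train.py (flag names taken from that command; the script body is purely illustrative):

# train.py - minimal illustrative job script
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-config", type=str, help="Path to the YAML config")
    parser.add_argument("-exp_dir", type=str, help="Directory to store logs/results")
    parser.add_argument("-seed_id", type=int, default=0, help="Random seed for this run")
    args = parser.parse_args()
    print(f"Train with {args.config}, seed {args.seed_id}, log to {args.exp_dir}")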

Managing a Queue of Jobs with MLEQueue Locally 🚀 ... 🚀

from mle_scheduler import MLEQueue

# python train.py -config base_config_1.yaml -seed 0 -exp_dir logs_queue/<date>_base_config_1
# python train.py -config base_config_1.yaml -seed 1 -exp_dir logs_queue/<date>_base_config_1
# python train.py -config base_config_2.yaml -seed 0 -exp_dir logs_queue/<date>_base_config_2
# python train.py -config base_config_2.yaml -seed 1 -exp_dir logs_queue/<date>_base_config_2
queue = MLEQueue(
    resource_to_run="local",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_queue"
)

queue.run()

Launching Slurm Cluster-Based Jobs 🐒

", # Partition to schedule jobs on "env_name": "mle-toolbox", # Env to activate at job start-up "use_conda_venv": True, # Whether to use anaconda venv "num_logical_cores": 5, # Number of requested CPU cores per job "num_gpus": 1, # Number of requested GPUs per job "gpu_type": "V100S", # GPU model requested for each job "modules_to_load": "nvidia/cuda/10.0" # Modules to load at start-up } queue = MLEQueue( resource_to_run="slurm-cluster", job_filename="train.py", job_arguments=job_args, config_filenames=["base_config_1.yaml", "base_config_2.yaml"], experiment_dir="logs_slurm", random_seeds=[0, 1] ) queue.run() ">
# Each job requests 5 CPU cores & 1 V100S GPU & loads CUDA 10.0
job_args = {
    "partition": "
   
    "
   ,  # Partition to schedule jobs on
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True,  # Whether to use anaconda venv
    "num_logical_cores": 5,  # Number of requested CPU cores per job
    "num_gpus": 1,  # Number of requested GPUs per job
    "gpu_type": "V100S",  # GPU model requested for each job
    "modules_to_load": "nvidia/cuda/10.0"  # Modules to load at start-up
}

queue = MLEQueue(
    resource_to_run="slurm-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    experiment_dir="logs_slurm",
    random_seeds=[0, 1]
)
queue.run()

Launching GridEngine Cluster-Based Jobs 🐘

", # Queue to schedule jobs on "env_name": "mle-toolbox", # Env to activate at job start-up "use_conda_venv": True, # Whether to use anaconda venv "num_logical_cores": 5, # Number of requested CPU cores per job "num_gpus": 1, # Number of requested GPUs per job "gpu_type": "V100S", # GPU model requested for each job "gpu_prefix": "cuda" #$ -l {gpu_prefix}="{num_gpus}" } queue = MLEQueue( resource_to_run="slurm-cluster", job_filename="train.py", job_arguments=job_args, config_filenames=["base_config_1.yaml", "base_config_2.yaml"], experiment_dir="logs_grid_engine", random_seeds=[0, 1] ) queue.run() ">
# Each job requests 5 CPU cores & 1 V100S GPU w. CUDA 10.0 loaded
job_args = {
    "queue": "
   
    "
   ,  # Queue to schedule jobs on
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True,  # Whether to use anaconda venv
    "num_logical_cores": 5,  # Number of requested CPU cores per job
    "num_gpus": 1,  # Number of requested GPUs per job
    "gpu_type": "V100S",  # GPU model requested for each job
    "gpu_prefix": "cuda"  #$ -l {gpu_prefix}="{num_gpus}"
}

queue = MLEQueue(
    resource_to_run="slurm-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    experiment_dir="logs_grid_engine",
    random_seeds=[0, 1]
)
queue.run()
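
For illustration, here is how the gpu_prefix/num_gpus pair renders into the Grid Engine directive shown in the comment above (a standalone sketch, not library code):

# Render the resource directive from the comment above
gpu_prefix, num_gpus = "cuda", 1
print(f'#$ -l {gpu_prefix}="{num_gpus}"')  # prints: #$ -l cuda="1"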

Launching SSH Server-Based Jobs 🦊

", # SSH server user name "pkey_path": " ", # Private key path (e.g. ~/.ssh/id_rsa) "main_server": " ", # SSH Server address "jump_server": '', # Jump host address "ssh_port": 22, # SSH port "remote_dir": "mle-code-dir", # Dir to sync code to on server "start_up_copy_dir": True, # Whether to copy code to server "clean_up_remote_dir": True # Whether to delete remote_dir on exit } job_args = { "env_name": "mle-toolbox", # Env to activate at job start-up "use_conda_venv": True # Whether to use anaconda venv } queue = MLEQueue( resource_to_run="ssh-node", job_filename="train.py", config_filenames=["base_config_1.yaml", "base_config_2.yaml"], random_seeds=[0, 1], experiment_dir="logs_ssh_queue", job_arguments=job_args, ssh_settings=ssh_settings) queue.run() ">
ssh_settings = {
    "user_name": "
     
      "
     ,  # SSH server user name
    "pkey_path": "
     
      "
     ,  # Private key path (e.g. ~/.ssh/id_rsa)
    "main_server": "
     
      "
     ,  # SSH Server address
    "jump_server": '',  # Jump host address
    "ssh_port": 22,  # SSH port
    "remote_dir": "mle-code-dir",  # Dir to sync code to on server
    "start_up_copy_dir": True,  # Whether to copy code to server
    "clean_up_remote_dir": True  # Whether to delete remote_dir on exit
}

job_args = {
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True  # Whether to use anaconda venv
}

queue = MLEQueue(
    resource_to_run="ssh-node",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_ssh_queue",
    job_arguments=job_args,
    ssh_settings=ssh_settings
)

queue.run()

Launching GCP VM-Based Jobs 🦄

", # Name of your GCP project "bucket_name": " ", # Name of your GCS bucket "remote_dir": " ", # Name of code dir in bucket "start_up_copy_dir": True, # Whether to copy code to bucket "clean_up_remote_dir": True # Whether to delete remote_dir on exit } job_args = { "num_gpus": 0, # Number of requested GPUs per job "gpu_type": None, # GPU requested e.g. "nvidia-tesla-v100" "num_logical_cores": 1, # Number of requested CPU cores per job } queue = MLEQueue( resource_to_run="gcp-cloud", job_filename="train.py", config_filenames=["base_config_1.yaml", "base_config_2.yaml"], random_seeds=[0, 1], experiment_dir="logs_gcp_queue", job_arguments=job_args, cloud_settings=cloud_settings, ) queue.run() ">
cloud_settings = {
    "project_name": "
     
      "
     ,  # Name of your GCP project
    "bucket_name": "
     
      "
     , # Name of your GCS bucket
    "remote_dir": "
     
      "
     ,  # Name of code dir in bucket
    "start_up_copy_dir": True,  # Whether to copy code to bucket
    "clean_up_remote_dir": True  # Whether to delete remote_dir on exit
}

job_args = {
    "num_gpus": 0,  # Number of requested GPUs per job
    "gpu_type": None,  # GPU requested e.g. "nvidia-tesla-v100"
    "num_logical_cores": 1,  # Number of requested CPU cores per job
}

queue = MLEQueue(
    resource_to_run="gcp-cloud",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_gcp_queue",
    job_arguments=job_args,
    cloud_settings=cloud_settings,
)
queue.run()

Development & Milestones for Next Release

You can run the test suite via python -m pytest -vv tests/. If you find a bug or are missing your favourite feature, feel free to contact me @RobertTLange or create an issue 🤗. In future releases I plan to implement the following:

  • Clean up TPU GCP VM & JAX dependencies case
  • Add local launching of cluster jobs via SSH to headnode
  • Add Docker/Singularity container setup support
  • Add Azure support
  • Add AWS support
Comments
  • use sys.executable instead of 'python'

    On some systems (like mine, when I run locally via conda), the Python executable is not "python". I used a global variable here - not sure if that's the best way, but it allows for cases where we don't want the executable to be the same as sys.executable (e.g. if we want to execute the job with a different Python interpreter than the one we are using).
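
    A minimal sketch of the idea (the variable names here are illustrative, not part of the library):

    import sys

    # Use the interpreter running this process instead of whatever
    # "python" happens to resolve to on the PATH.
    executable = sys.executable  # e.g. /opt/conda/envs/mle/bin/python
    cmd = f"{executable} train.py -config base_config_1.yaml"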

    opened by boazbk 4
  • Handle case when experiment_dir is not provided

    At the moment, if experiment_dir is None, then cmd_line_args is never initialized, so later lines like cmd_line_args += " -config " + self.config_filename fail.

    The proposed change simply initializes cmd_line_args to the empty string and then appends all options to it afterwards.
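
    A sketch of the proposed fix (attribute names follow the snippet above and are illustrative):

    # Always initialize, so later concatenations cannot fail
    cmd_line_args = ""
    if self.experiment_dir is not None:
        cmd_line_args += " -exp_dir " + self.experiment_dir
    if self.config_filename is not None:
        cmd_line_args += " -config " + self.config_filename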

    opened by boazbk 2
  • [Feature] Make `meta_log` accessible from queue

    Instead of having to ...

    # Merge logs of random seeds & configs -> load & get final scores
    queue.merge_configs(merge_seeds=True)
    meta_log = load_meta_log("logs_search/meta_log.hdf5")
    test_scores = [meta_log[r].stats.test_loss.mean[-1] for r in queue.mle_run_ids]
    

    it would be great if load_meta_log were already executed within the MLEQueue when merge_configs is called.
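
    With the proposed API (hypothetical - this is the requested behaviour, not the current one), the snippet would shrink to:

    # merge_configs would load the merged meta log internally
    queue.merge_configs(merge_seeds=True)
    test_scores = [queue.meta_log[r].stats.test_loss.mean[-1]
                   for r in queue.mle_run_ids]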

    opened by RobertTLange 1
  • Handling Errors thrown in GCP VMs

    Complete newbie to using VMs, so I'm guessing this will be a rookie question.

    If an error is encountered while executing a job on a GCP VM, what are the best practices for handling it? I'm not even sure how to tell whether there was an error, which obviously complicates the debugging process.

    opened by wbrenton 0
  • Cmd capture

    • Adds MLEQueue option to delete config after job has finished
    • Adds debug_mode option to store stdout & stderr to files - partially addresses #3
    • Adds merging/loading of generated logs in MLEQueue w. automerge_configs option
    • Use system executable python version
    opened by RobertTLange 0
  • What environment does it depend on?

    It's great that you have built such a good tool for job scheduling. I want to know what environment it depends on, and whether it can run in a Kubernetes/Docker environment. Thanks!

    opened by kongjibai 0
Releases (v0.0.5)
  • v0.0.5 (Jan 5, 2022)

    • Adds MLEQueue option to delete config after job has finished (delete_config)
    • Adds debug_mode option to store stdout & stderr to files
    • Adds merging/loading of generated logs in MLEQueue w. automerge_configs option
    • Use system executable python version
  • v0.0.4 (Dec 7, 2021)

    • Track config base strings for auto-merging of mle-logs & add merge_configs
    • Allow scheduling on multiple partitions via -p <part1>,<part2> & queues via -q <queue1>,<queue2>
  • v0.0.1 (Nov 12, 2021)

    First release 🤗 implementing core API of MLEJob and MLEQueue

    # Each job requests 5 CPU cores & 1 V100S GPU & loads CUDA 10.0
    job_args = {
        "partition": "<SLURM_PARTITION>",  # Partition to schedule jobs on
        "env_name": "mle-toolbox",  # Env to activate at job start-up
        "use_conda_venv": True,  # Whether to use anaconda venv
        "num_logical_cores": 5,  # Number of requested CPU cores per job
        "num_gpus": 1,  # Number of requested GPUs per job
        "gpu_type": "V100S",  # GPU model requested for each job
        "modules_to_load": "nvidia/cuda/10.0"  # Modules to load at start-up
    }
    
    queue = MLEQueue(
        resource_to_run="slurm-cluster",
        job_filename="train.py",
        job_arguments=job_args,
        config_filenames=["base_config_1.yaml",
                          "base_config_2.yaml"],
        experiment_dir="logs_slurm",
        random_seeds=[0, 1]
    )
    queue.run()
    