Python 3.6+ toolbox for submitting jobs to Slurm

Overview

Submit it!

What is submitit?

Submitit is a lightweight tool for submitting Python functions for computation within a Slurm cluster. It wraps submission and provides access to results, logs and more. Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Submitit lets you switch seamlessly between executing on Slurm or locally.

An example is worth a thousand words: performing an addition

From inside an environment with submitit installed:

import submitit

def add(a, b):
    return a + b

# executor is the submission interface (logs are dumped in the folder)
executor = submitit.AutoExecutor(folder="log_test")
# set timeout in min, and partition for running the job
executor.update_parameters(timeout_min=1, slurm_partition="dev")
job = executor.submit(add, 5, 7)  # will compute add(5, 7)
print(job.job_id)  # ID of your job

output = job.result()  # waits for completion and returns output
assert output == 12  # 5 + 7 = 12...  your addition was computed in the cluster

The Job class also provides tools for reading the log files (job.stdout() and job.stderr()).

If what you want to run is a command, turn it into a Python function using submitit.helpers.CommandFunction, then submit it. By default stdout is silenced in CommandFunction, but it can be unsilenced with verbose=True.

Find more examples here!!!

Submitit is a Python 3.6+ toolbox for submitting jobs to Slurm. It aims at running Python functions from Python code.

Install

Quick install, in a virtualenv/conda environment where pip is installed (check which pip):

  • stable release:
    pip install submitit
    
  • stable release using conda:
    conda install -c conda-forge submitit
    
  • master branch:
    pip install git+https://github.com/facebookincubator/submitit@master#egg=submitit
    

You can try running the MNIST example to check that everything is working as expected (requires sklearn).

Documentation

See the following pages for more detailed information:

  • Examples: for a bunch of examples dealing with errors, concurrency, multi-tasking etc...
  • Structure and main objects: to get a better understanding of how submitit works, which files are created for each job, and the main objects you will interact with.
  • Checkpointing: to understand how you can configure your job to get checkpointed when preempted and/or timed-out.
  • Tips and caveats: for a bunch of information that can be handy when working with submitit.
  • Hyperparameter search with nevergrad: basic example of nevergrad usage and how it interfaces with submitit.

Goals

The aim of this Python3 package is to be able to launch jobs on Slurm painlessly from inside Python, using the same submission and job patterns as the standard library package concurrent.futures.

Here are a few benefits of using this lightweight package:

  • submit any function, even lambda and script-defined functions.
  • raises an error with stack trace if the job failed.
  • requeue preempted jobs (Slurm only)
  • swap between the submitit executor and one of the concurrent.futures executors in one line, so that it is easy to run your code either on Slurm, or locally with multithreading for instance.
  • checkpoints stateful callables when preempted or timed-out and requeue from current state (advanced feature).
  • easy access to task local/global rank for multi-nodes/tasks jobs.
  • same code can work for different clusters thanks to a plugin system.
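
The submission pattern shared with concurrent.futures can be sketched with the standard library alone. The `run_addition` helper below is a hypothetical name used for illustration. It relies only on `submit(...)` returning an object with a `result()` method, which is the pattern the addition example uses with `submitit.AutoExecutor`. Executor configuration (such as `update_parameters`) is specific to each executor and is left out of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def add(a, b):
    return a + b


def run_addition(executor, a, b):
    """Submit add(a, b) and block until the result is available.

    Only uses executor.submit(...) and the returned object's result(),
    the pattern shared by concurrent.futures and the addition example above.
    """
    job = executor.submit(add, a, b)
    return job.result()


# Local run with threads (standard library only).
with ThreadPoolExecutor(max_workers=1) as pool:
    local_output = run_addition(pool, 5, 7)

print(local_output)
```

Because the helper never looks at the executor's type, any executor following the same pattern could be passed in its place.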

Submitit is used by FAIR researchers on the FAIR cluster. The defaults are chosen to make their life easier, and might not be ideal for every cluster.

Non-goals

  • a commandline tool for running Slurm jobs. Here, everything happens inside Python. If you need a command line workflow, you can however use Hydra's submitit plugin (version >= 1.0.0).
  • a task queue: this only implements the ability to launch tasks, but does not schedule them in any way.
  • being used in Python2! This is a Python3.6+ only package :)

Comparison with dask.distributed

dask is a nice framework for distributed computing. dask.distributed provides the same concurrent.futures executor API as submitit:

from distributed import Client
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(processes=1, cores=2, memory="2GB")
cluster.scale(2)  # this may take a few seconds to launch
executor = Client(cluster)
executor.submit(...)

The key difference with submitit is that dask.distributed distributes the jobs to a pool of workers (see the cluster variable above) while submitit jobs are directly jobs on the cluster. In that sense submitit is a lower level interface than dask.distributed and you get more direct control over your jobs, including individual stdout and stderr, and possibly checkpointing in case of preemption and timeout. On the other hand, you should avoid submitting multiple small tasks with submitit, which would create many independent jobs and possibly overload the cluster, while you can do it without any problem through dask.distributed.
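
One way to follow the advice about small tasks is to group them so that each submission handles a chunk of inputs. This is a minimal sketch and not a submitit feature: `chunked` and `process_chunk` are hypothetical helpers, and a `ThreadPoolExecutor` stands in for any executor exposing `submit`.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


def chunked(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(chunk):
    """Run the small task on every element of one chunk, inside a single job."""
    return [square(x) for x in chunk]


inputs = list(range(10))
chunks = chunked(inputs, 4)  # 3 submissions instead of 10

with ThreadPoolExecutor(max_workers=2) as executor:  # stand-in executor
    jobs = [executor.submit(process_chunk, chunk) for chunk in chunks]
    # Flatten the per-chunk results back into one list, in input order.
    results = [value for job in jobs for value in job.result()]

print(len(jobs), results)
```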

Contributors

By chronological order: Jérémy Rapin, Louis Martin, Lowik Chanussot, Lucas Hosseini, Fabio Petroni, Francisco Massa, Guillaume Wenzek, Thibaut Lavril, Vinayak Tantia, Andrea Vedaldi, Max Nickel, Quentin Duval (feel free to contribute and add your name ;) )

License

Submitit is released under the MIT License.

Comments
  • Import error

    This bug is baffling me, I'm sure this is user error because I normally have no issues with your code. I have not figured out what is different than my other submissions, but maybe you've seen this before?

    .../python3.8/site-packages/submitit/core/_submit.py", line 7, in <module>
        from .submission import submitit_main
    ImportError: attempted relative import with no known parent package
    
    bug 
    opened by jgbos 26
  • Fixing deadlock when command prints a lot to stderr

    Currently, only stdout is read on the fly. If the stderr pipe fills up, the subprocess deadlocks when trying to write to stderr: since the parent process only reads from stdout and waits for the process to finish, this never resolves. This change instead uses the select function to find which file descriptor can be read from.

    This also adds a unit test for this specific case.
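
The idea described above can be sketched as follows. This is an illustrative reconstruction, not the code from this pull request: `run_and_capture` is a hypothetical name, and the sketch assumes a Unix-like system, since `select` only accepts pipes there.

```python
import os
import select
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd while draining stdout and stderr as soon as either has data.

    Returns (returncode, stdout_bytes, stderr_bytes).
    """
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE) as proc:
        out_fd, err_fd = proc.stdout.fileno(), proc.stderr.fileno()
        chunks = {out_fd: [], err_fd: []}
        open_fds = {out_fd, err_fd}
        while open_fds:
            # Block until at least one of the pipes can be read from.
            readable, _, _ = select.select(list(open_fds), [], [])
            for fd in readable:
                data = os.read(fd, 65536)
                if data:
                    chunks[fd].append(data)
                else:
                    # An empty read means the child closed this pipe.
                    open_fds.discard(fd)
        proc.wait()
    return proc.returncode, b"".join(chunks[out_fd]), b"".join(chunks[err_fd])


# The child writes far more to stderr than a pipe buffer typically holds,
# then prints to stdout.
child = [sys.executable, "-c", "import sys; sys.stderr.write('x' * 200000); print('done')"]
returncode, out, err = run_and_capture(child)
print(returncode, out.strip(), len(err))
```

A parent that read only stdout until the process finished would hang on this child once the stderr pipe filled up, which is the situation the fix addresses.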

    CLA Signed 
    opened by adefossez 9
  • Job is not ending - Bypassing signal SIGTERM

    Hello,

    I am sending a job to my Slurm cluster with submitit. The job runs as it is supposed to (you can see the Finished script log), but the Slurm job itself does not finish. Instead, I get these Bypassing signal messages. Because I need this job to finish before moving on to other jobs, I am in a deadlock. I am really not sure what I should do and would appreciate the help.

    Here are logs from my neverending job :(

    03/31/2022 19:56:03 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 19:56:03 - INFO - masking.scripts.shard_corpus - Finished script
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,093) - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    

    To provide more information, this happens when I run job arrays:

    jobs = executor.map_array(fn, configs)
    job2cfg = {job: cfg for job, cfg in zip(jobs, configs)}
    list(tqdm(submitit.helpers.as_completed(jobs), total=len(jobs)))
    

    Some of them finish successfully, and others get stuck (until I had to clear the queue and scancel them):

    ❯ tail /home/olab/kirstain/masking/log_test/71604_*/71604_*_0_log.err
    ==> /home/olab/kirstain/masking/log_test/71604_0/71604_0_0_log.err <==
    03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - Flattening
    03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - We have 292793 blocks to write
    03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - Batching
    03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - We have 3 shards to write
    100%|██████████| 3/3 [00:00<00:00, 3093.14it/s]
    03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - Writing 3 shards to /home/olab/kirstain/masking/data/5/1024/shards/ArXiv
    100%|██████████| 3/3 [00:08<00:00,  2.78s/it]
    03/31/2022 20:53:17 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 20:53:17 - INFO - masking.scripts.shard_corpus - Finished script
    03/31/2022 20:53:18 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_10/71604_10_0_log.err <==
    03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:56:55,179) - Bypassing signal SIGTERM
    03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:56:55,188) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:56:55,188) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:56:55,188) - Bypassing signal SIGTERM
    03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:56:55 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_11/71604_11_0_log.err <==
    03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:00:42,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:00:42,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:00:42,088) - Bypassing signal SIGTERM
    03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:00:42,093) - Bypassing signal SIGTERM
    03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:00:50 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_12/71604_12_0_log.err <==
    submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
    03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:51:13 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_13/71604_13_0_log.err <==
    100%|██████████| 65/65 [00:00<00:00, 10539.67it/s]
    03/31/2022 21:06:43 - INFO - masking.scripts.shard_corpus - Writing 65 shards to /home/olab/kirstain/masking/data/5/1024/shards/Wikipedia_en
    100%|██████████| 65/65 [01:37<00:00,  1.50s/it]
    03/31/2022 21:08:22 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 21:08:23 - INFO - masking.scripts.shard_corpus - Finished script
    submitit WARNING (2022-03-31 21:08:23,557) - Bypassing signal SIGTERM
    03/31/2022 21:08:23 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:08:23,561) - Bypassing signal SIGTERM
    03/31/2022 21:08:23 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:08:31 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_1/71604_1_0_log.err <==
    03/31/2022 20:59:04 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 20:59:04 - INFO - masking.scripts.shard_corpus - Finished script
    submitit WARNING (2022-03-31 20:59:04,562) - Bypassing signal SIGTERM
    03/31/2022 20:59:04 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:25:40,084) - Bypassing signal SIGCONT
    slurmstepd: error: *** STEP 71606.0 ON kilonova CANCELLED AT 2022-03-31T21:25:40 ***
    submitit WARNING (2022-03-31 21:25:40,085) - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
    slurmstepd: error: *** JOB 71606 ON kilonova CANCELLED AT 2022-03-31T21:25:40 ***
    
    ==> /home/olab/kirstain/masking/log_test/71604_2/71604_2_0_log.err <==
    03/31/2022 21:19:58 - INFO - masking.scripts.shard_corpus - Batching
    03/31/2022 21:19:58 - INFO - masking.scripts.shard_corpus - We have 11 shards to write
    100%|██████████| 11/11 [00:00<00:00, 3551.76it/s]
    03/31/2022 21:19:58 - INFO - masking.scripts.shard_corpus - Writing 11 shards to /home/olab/kirstain/masking/data/5/1024/shards/Books3
    100%|██████████| 11/11 [00:33<00:00,  3.09s/it]
    03/31/2022 21:20:33 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 21:20:33 - INFO - masking.scripts.shard_corpus - Finished script
    submitit WARNING (2022-03-31 21:20:33,823) - Bypassing signal SIGTERM
    03/31/2022 21:20:33 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:20:34 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_3/71604_3_0_log.err <==
    submitit WARNING (2022-03-31 20:52:43,363) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:52:43,363) - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_4/71604_4_0_log.err <==
    03/31/2022 21:05:52 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:05:52 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:05:52 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:25:40,096) - Bypassing signal SIGCONT
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: error: *** STEP 71609.0 ON rack-iscb-32 CANCELLED AT 2022-03-31T21:25:40 ***
    submitit WARNING (2022-03-31 21:25:40,106) - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
    slurmstepd: error: *** JOB 71609 ON rack-iscb-32 CANCELLED AT 2022-03-31T21:25:40 ***
    
    ==> /home/olab/kirstain/masking/log_test/71604_5/71604_5_0_log.err <==
    03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:25:40,092) - Bypassing signal SIGCONT
    slurmstepd: error: *** STEP 71610.0 ON rack-iscb-33 CANCELLED AT 2022-03-31T21:25:40 ***
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
    submitit WARNING (2022-03-31 21:25:40,093) - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
    slurmstepd: error: *** JOB 71610 ON rack-iscb-33 CANCELLED AT 2022-03-31T21:25:40 ***
    
    ==> /home/olab/kirstain/masking/log_test/71604_6/71604_6_0_log.err <==
    03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:25:40,089) - Bypassing signal SIGCONT
    slurmstepd: error: *** STEP 71611.0 ON rack-iscb-34 CANCELLED AT 2022-03-31T21:25:40 ***
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
    submitit WARNING (2022-03-31 21:25:40,091) - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
    slurmstepd: error: *** JOB 71611 ON rack-iscb-34 CANCELLED AT 2022-03-31T21:25:40 ***
    
    ==> /home/olab/kirstain/masking/log_test/71604_7/71604_7_0_log.err <==
    100%|██████████| 13/13 [00:00<00:00, 5749.26it/s]
    03/31/2022 20:53:43 - INFO - masking.scripts.shard_corpus - Writing 13 shards to /home/olab/kirstain/masking/data/5/1024/shards/OpenWebText2
    100%|██████████| 13/13 [00:27<00:00,  2.08s/it]
    03/31/2022 20:54:11 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 20:54:11 - INFO - masking.scripts.shard_corpus - Finished script
    submitit WARNING (2022-03-31 20:54:11,320) - Bypassing signal SIGTERM
    03/31/2022 20:54:11 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:54:11,321) - Bypassing signal SIGTERM
    03/31/2022 20:54:11 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:54:14 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_8/71604_8_0_log.err <==
    submitit WARNING (2022-03-31 20:50:43,863) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:50:43,863) - Bypassing signal SIGTERM
    03/31/2022 20:50:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:50:43 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:25:40,086) - Bypassing signal SIGCONT
    slurmstepd: error: *** STEP 71613.0 ON rack-iscb-36 CANCELLED AT 2022-03-31T21:25:40 ***
    submitit WARNING (2022-03-31 21:25:40,088) - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
    slurmstepd: error: *** JOB 71613 ON rack-iscb-36 CANCELLED AT 2022-03-31T21:25:40 ***
    
    ==> /home/olab/kirstain/masking/log_test/71604_9/71604_9_0_log.err <==
    100%|██████████| 21/21 [00:00<00:00, 4838.52it/s]
    03/31/2022 20:56:39 - INFO - masking.scripts.shard_corpus - Writing 21 shards to /home/olab/kirstain/masking/data/5/1024/shards/Pile-CC
    100%|██████████| 21/21 [00:42<00:00,  2.01s/it]
    03/31/2022 20:57:21 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 20:57:21 - INFO - masking.scripts.shard_corpus - Finished script
    submitit WARNING (2022-03-31 20:57:21,672) - Bypassing signal SIGTERM
    03/31/2022 20:57:21 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:57:21,674) - Bypassing signal SIGTERM
    03/31/2022 20:57:21 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:57:24 - INFO - submitit - Job completed successfully
    
    opened by yuvalkirstain 7
  • How to comment a slurm variable?

    Hi All,

    I observe the following error, which is due to the "#SBATCH --gpus-per-node=4" line added to the generated Slurm script.

    Error : submitit.core.utils.FailedJobError: sbatch: error: Batch job submission failed: Invalid generic resource (gres) specification

    Can developers/users of submitit guide me on where to comment out or delete the above line in the Slurm script before it is submitted by the sbatch command?

    Thanks, Amit Ruhela

    opened by aruhela 7
  • Asyncio methods for job

    asyncio has a lot of prebuilt tools for dealing with asynchronous execution (like submitit jobs). gather lets you deal with some of the jobs failing, as_completed is available out of the box, timeouts can be added, and we can transparently combine jobs with other async stuff easily.

    async also sounds cooler than blocking :D

    I added tests for the single task job cases as I didn't see other tests for the multi task code. But it might be worth adding these too.
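
The asyncio tools mentioned here can be illustrated without a cluster. In this sketch, `FakeJob` is a hypothetical stand-in for a job object with a blocking `result()` method, and `async_result` is an illustrative wrapper. Neither is the API added by this pull request.

```python
import asyncio
import time


class FakeJob:
    """Hypothetical job: result() blocks for `delay` seconds, then returns or raises."""

    def __init__(self, value, delay, fail=False):
        self.value = value
        self.delay = delay
        self.fail = fail

    def result(self):
        time.sleep(self.delay)
        if self.fail:
            raise RuntimeError(f"job {self.value} failed")
        return self.value


async def async_result(job):
    """Await a blocking result() call by running it in the default thread pool."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, job.result)


async def main():
    jobs = [FakeJob(1, 0.05), FakeJob(2, 0.01, fail=True), FakeJob(3, 0.02)]
    # return_exceptions=True collects failures instead of aborting the gather.
    gathered = await asyncio.gather(*(async_result(j) for j in jobs), return_exceptions=True)

    # as_completed yields awaitables in finishing order, with an overall timeout.
    pending = [async_result(FakeJob(10, 0.3)), async_result(FakeJob(20, 0.01))]
    finished = []
    for next_done in asyncio.as_completed(pending, timeout=5):
        finished.append(await next_done)
    return gathered, finished


gathered, finished = asyncio.run(main())
print(gathered, finished)
```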

    CLA Signed 
    opened by Mortimerp9 7
  • Temporary saved file already exists

    Hi,

    Thank you for this amazing tool! I just started using it recently. I'm encountering some weird error and I was hoping you could help me fix it. Here is the error log:

    submitit WARNING (2021-03-28 01:13:17,420) - Caught signal 15 on learnfair0463: this job is preempted.
    slurmstepd: error: *** STEP 38544509.0 ON learnfair0463 CANCELLED AT 2021-03-28T01:13:17 DUE TO JOB REQUEUE ***
    slurmstepd: error: *** JOB 38544509 ON learnfair0463 CANCELLED AT 2021-03-28T01:13:17 DUE TO JOB REQUEUE ***
    submitit WARNING (2021-03-28 01:13:17,482) - Bypassing signal 18
    submitit WARNING (2021-03-28 01:13:17,483) - Caught signal 15 on learnfair0463: this job is preempted.
    38544484_16: Job is pending execution
    submitit ERROR (2021-03-28 01:13:17,535) - Could not dump error:
    Command '['scontrol', 'requeue', '38544484_16']' returned non-zero exit status 1.
    
    because of A temporary saved file already exists.
    submitit ERROR (2021-03-28 01:13:17,535) - Submitted job triggered an exception
    Traceback (most recent call last):
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
        submitit_main()
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 71, in submitit_main
        process_job(args.folder)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 64, in process_job
        raise error
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 55, in process_job
        utils.cloudpickle_dump(("success", result), tmppath)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/utils.py", line 238, in cloudpickle_dump
        cloudpickle.dump(obj, ofile, pickle.HIGHEST_PROTOCOL)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/job_environment.py", line 209, in checkpoint_and_try_requeue
        self.env._requeue(countdown)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/slurm/slurm.py", line 193, in _requeue
        subprocess.check_call(["scontrol", "requeue", jid])
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/subprocess.py", line 364, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['scontrol', 'requeue', '38544484_16']' returned non-zero exit status 1.
    /bin/bash: /public/apps/anaconda3/2020.11/lib/libtinfo.so.6: no version information available (required by /bin/bash)
    submitit ERROR (2021-03-28 01:35:36,155) - Could not dump error:
    A temporary saved file already exists.
    
    because of A temporary saved file already exists.
    submitit ERROR (2021-03-28 01:35:36,156) - Submitted job triggered an exception
    Traceback (most recent call last):
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
        submitit_main()
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 71, in submitit_main
        process_job(args.folder)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 64, in process_job
        raise error
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
        with utils.temporary_save_path(paths.result_pickle) as tmppath:  # save somewhere else, and move
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/contextlib.py", line 113, in __enter__
        return next(self.gen)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/utils.py", line 171, in temporary_save_path
        assert not tmppath.exists(), "A temporary saved file already exists."
    AssertionError: A temporary saved file already exists.
    srun: error: learnfair0292: task 0: Exited with exit code 1
    srun: launch/slurm: _step_signal: Terminating StepId=38544509.1
    

    My analysis of the error is as follows. The temporary save file error is thrown in process_job here. One possible reason why this could happen is if the tmppath was created previously in the try block, but there was a failure before the context ended.

    This could happen either in the utils.cloudpickle_dump() call or in logger.info(). However, I can see a temporary save path 38544484_16_0_result.pkl.save_tmp that contains the following information ('success', None). So is the error with logger? Or am I completely off here?

    I'm running a job array with 1024 jobs and 128 slurm_array_parallelism. The code run by the jobs actually completed and the results were saved. So I don't think this is an error in the python function I ran.
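
The "save somewhere else, and move" pattern visible in the traceback, and the failure mode hypothesized above, can be reproduced in isolation. This is a simplified sketch written for illustration, not submitit's actual implementation: only the assertion message and the `.save_tmp` suffix are taken from the logs above, and everything else (including the simulated interruption) is an assumption.

```python
import contextlib
import pickle
import tempfile
from pathlib import Path


@contextlib.contextmanager
def temporary_save_path(filepath):
    """Yield a temporary path, then move it to filepath once writing succeeded."""
    filepath = Path(filepath)
    tmppath = filepath.with_name(filepath.name + ".save_tmp")
    assert not tmppath.exists(), "A temporary saved file already exists."
    yield tmppath
    # Only reached if the body did not raise: the move publishes the result.
    tmppath.rename(filepath)


with tempfile.TemporaryDirectory() as folder:
    result_path = Path(folder) / "result.pkl"

    # First attempt: interrupted after the dump but before the move.
    try:
        with temporary_save_path(result_path) as tmp:
            tmp.write_bytes(pickle.dumps(("success", None)))
            raise RuntimeError("simulated interruption")
    except RuntimeError:
        pass
    leftover_exists = (Path(folder) / "result.pkl.save_tmp").exists()

    # Second attempt: the leftover temporary file triggers the assertion.
    try:
        with temporary_save_path(result_path) as tmp:
            tmp.write_bytes(b"never written")
        second_error = None
    except AssertionError as error:
        second_error = str(error)

print(leftover_exists, second_error)
```

In this sketch the leftover file holds a complete `("success", None)` payload, which matches what the temporary file in the report contains.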

    opened by srama2512 7
  • Adding SnapshotManager

    This allows users to create a snapshot of the current git repository and launch the job from this snapshot. This can prevent jobs that are slow to start or re-queued from picking up local changes.

    CLA Signed 
    opened by lematt1991 7
  • [To be discussed] Add option to submit within a batch context

    The aim is to automatically batch jobs that can be batched together, while still submitting whenever we need information. For example, in nevergrad we send 40 jobs for evaluation, which could be packed together, and then whenever a job finishes we schedule a new evaluation. This is currently impossible with a batch context (or any other option), but with this change it becomes possible by running the optimization within a batch context. Initial submissions are then packed, and sent as soon as we start checking their status.
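
The behaviour proposed here can be sketched without Slurm. `LazyBatch` and `LazyJob` are hypothetical classes invented for this illustration and are not part of submitit: submissions are queued, and the whole queue is sent as one batch the first time any job's result is requested.

```python
class LazyJob:
    """Hypothetical job handle whose submission is deferred until it is needed."""

    def __init__(self, batch, fn, args):
        self._batch = batch
        self.fn = fn
        self.args = args
        self.submitted = False

    def result(self):
        if not self.submitted:
            self._batch.flush()  # asking for information triggers the submission
        return self.fn(*self.args)  # stand-in for waiting on the cluster


class LazyBatch:
    """Hypothetical batch context: queues submissions and sends them together."""

    def __init__(self):
        self.pending = []
        self.batch_sizes = []  # size of each batch actually sent

    def submit(self, fn, *args):
        job = LazyJob(self, fn, args)
        self.pending.append(job)
        return job

    def flush(self):
        if not self.pending:
            return
        self.batch_sizes.append(len(self.pending))
        for job in self.pending:
            job.submitted = True
        self.pending = []


def evaluate(x):
    return x * x


batch = LazyBatch()
jobs = [batch.submit(evaluate, x) for x in range(40)]  # nothing sent yet
first = jobs[0].result()  # sends all 40 together
follow_up = batch.submit(evaluate, 7)  # queued after a result came back
second = follow_up.result()  # sent as its own batch of 1
print(batch.batch_sizes, first, second)
```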

    CLA Signed 
    opened by jrapin 6
  • TypeError: an integer is required (got type bytes)

    Since upgrading to python 3.8 I can't access my old jobs' submission pickle (error below).

    The problem might be related to this issue or this one but I have no clue what it means.

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-17-0248393e65dd> in <module>
         30         if state != "COMPLETED":
         31             continue
    ---> 32         row = job.submission().kwargs
         33         row["scores"] = job.result()
         34         row["exp_name"] = exp_name
    
    ~/dev/ext/submitit/submitit/core/core.py in submission(self)
        206             self.paths.submitted_pickle.exists()
        207         ), f"Cannot find job submission pickle: {self.paths.submitted_pickle}"
    --> 208         return utils.DelayedSubmission.load(self.paths.submitted_pickle)
        209 
        210     def cancel_at_deletion(self, value: bool = True) -> "Job[R]":
    
    ~/dev/ext/submitit/submitit/core/utils.py in load(cls, filepath)
        133     @classmethod
        134     def load(cls: Type["DelayedSubmission"], filepath: Union[str, Path]) -> "DelayedSubmission":
    --> 135         obj = pickle_load(filepath)
        136         # following assertion is relaxed compared to isinstance, to allow flexibility
        137         # (Eg: copying this class in a project to be able to have checkpointable jobs without adding submitit as dependency)
    
    ~/dev/ext/submitit/submitit/core/utils.py in pickle_load(filename)
        271     # this is used by cloudpickle as well
        272     with open(filename, "rb") as ifile:
    --> 273         return pickle.load(ifile)
        274 
        275 
    
    TypeError: an integer is required (got type bytes)
    

    Repro: Start a job with Python 3.7 and then try to access it in Python 3.8. In Python 3.7:

    Python 3.7.4 (default, Aug 13 2019, 20:35:49)
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import submitit
    >>>
    >>> def add(a, b):
    ...     return a + b
    ...
    >>> executor = submitit.AutoExecutor(folder="log_test")
    >>> executor.update_parameters(timeout_min=1, slurm_partition="dev")
    >>> job = executor.submit(add, 5, 7)
    >>> print(job.job_id)
    33389760
    >>> job.submission()
    <submitit.core.utils.DelayedSubmission object at 0x7f42f5952bd0>
    

    In Python 3.8:

    Python 3.8.5 | packaged by conda-forge | (default, Jul 24 2020, 01:25:15)
    [GCC 7.5.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import submitit
    >>> job = submitit.SlurmJob("log_test", "33389760")
    >>> job.submission()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/private/home/louismartin/dev/ext/submitit/submitit/core/core.py", line 208, in submission
        return utils.DelayedSubmission.load(self.paths.submitted_pickle)
      File "/private/home/louismartin/dev/ext/submitit/submitit/core/utils.py", line 135, in load
        obj = pickle_load(filepath)
      File "/private/home/louismartin/dev/ext/submitit/submitit/core/utils.py", line 273, in pickle_load
        return pickle.load(ifile)
    TypeError: an integer is required (got type bytes)
    >>> job.submission()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/private/home/louismartin/dev/ext/submitit/submitit/core/core.py", line 208, in submission
        return utils.DelayedSubmission.load(self.paths.submitted_pickle)
      File "/private/home/louismartin/dev/ext/submitit/submitit/core/utils.py", line 135, in load
        obj = pickle_load(filepath)
      File "/private/home/louismartin/dev/ext/submitit/submitit/core/utils.py", line 273, in pickle_load
        return pickle.load(ifile)
    TypeError: an integer is required (got type bytes)
    
    opened by louismartin 6
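
The traceback above is consistent with a pickle written under one Python minor version being read under another; the cloudpickle documentation states that pickles should only be exchanged between identical Python versions. The following is a minimal sketch, not part of submitit, of how a version header could be stored ahead of a payload so that a mismatch fails with a readable message instead of a `TypeError`. The helper names `dump_with_version` and `load_with_version` are hypothetical.

```python
import pickle
import sys
import tempfile
from pathlib import Path


def dump_with_version(obj, path):
    """Write a (major, minor) header, then the payload, into one file."""
    with open(path, "wb") as f:
        pickle.dump(tuple(sys.version_info[:2]), f)  # small, portable header
        pickle.dump(obj, f)


def load_with_version(path, current=None):
    """Check the header before attempting to unpickle the payload."""
    current = tuple(sys.version_info[:2]) if current is None else current
    with open(path, "rb") as f:
        written_by = pickle.load(f)
        if written_by != current:
            raise RuntimeError(
                f"pickle written by Python {written_by}, read by Python {current}"
            )
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "submitted.pkl"
    dump_with_version({"args": (5, 7)}, path)
    restored = load_with_version(path)
```

Because the header is read first, the incompatible payload is never unpickled when the versions differ.
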
  • Set additional slurm parameters

    Hello,

    I would like to know if it's possible to set additional slurm parameters (and how to set them), because I couldn't find this information in the documentation.

    For example, I have a few arguments that I usually set using srun, such as --account=myaccount --hint=nomultithread --distribution=block:block --exclusive, but I have no idea how to set them in submitit.

    Thank you in advance for your answer!

    opened by netw0rkf10w 6
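
As an illustration of the question above, here is a small sketch that converts command-line style flags into a dictionary. The helper `flags_to_parameters` is hypothetical and not part of submitit. It assumes every flag is a long option of the form `--key=value` or a bare `--switch`.

```python
import shlex


def flags_to_parameters(flags):
    """Turn '--key=value --switch' into {'key': 'value', 'switch': True}."""
    parameters = {}
    for token in shlex.split(flags):
        if not token.startswith("--"):
            raise ValueError(f"expected a long option, got {token!r}")
        key, sep, value = token[2:].partition("=")
        # A bare switch such as --exclusive has no '=' and maps to True.
        parameters[key] = value if sep else True
    return parameters


extra = flags_to_parameters(
    "--account=myaccount --hint=nomultithread --distribution=block:block --exclusive"
)
print(extra)
```

Assuming the installed submitit version accepts a `slurm_additional_parameters` dictionary in `update_parameters` (worth checking against its documentation), the result could be passed as `executor.update_parameters(slurm_additional_parameters=extra)`. Such entries would presumably end up as sbatch options rather than srun arguments.
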
  • [BUG] `Scontrol` Error when checkpointing / preemption on slurm

    Hi,

    For me, submitit works great when there is no need for checkpointing / preemption, but I get the following error when I need to checkpoint: FileNotFoundError: [Errno 2] No such file or directory: 'scontrol'

    Specifically, I can reproduce this error by running docs/mnist.py. I ran the following three versions of the MNIST example to understand the issue:

    • Running docs/mnist.py on Slurm as is, I get the error above. Full logs: stderr, stdout
    • If I ssh into a Slurm node that I get allocated and run docs/mnist.py with the local executor (cluster="local"), everything works as it should, so submitit + checkpointing works fine.
    • Running docs/mnist.py without preemption (removing timeout_min and job._interrupt()), everything works fine, so Slurm + submitit works fine.

    Also, scontrol seems to work fine on my login node, so I don't understand why check_call(["scontrol", "requeue", jid]) fails. That said, scontrol does not work on the nodes I get allocated (it only works from the login nodes). From my understanding, check_call(["scontrol", "requeue", jid]) is called from where I call submitit, so not having scontrol on the allocated nodes shouldn't be an issue. Am I correct?

    Thank you !

    opened by YannDubs 5
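
One way to narrow this down is to check which Slurm commands are visible from the process that raises the error. The sketch below is a generic diagnostic, not submitit code; `missing_commands` is a hypothetical helper. It assumes the requeue call may be executed by the job itself on the allocated node, which is what the `FileNotFoundError` would suggest but is not confirmed here.

```python
import shutil


def missing_commands(commands):
    """Return the entries of `commands` that cannot be found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


# Run this inside a submitted job, not only on the login node.
missing = missing_commands(["scontrol", "sbatch", "squeue"])
if missing:
    print("not on PATH here:", ", ".join(missing))
else:
    print("all Slurm commands found")
```

Submitting `missing_commands` as a job and reading its result would show whether `scontrol` is reachable from the compute nodes.
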
  • Can submitit manage chain dependencies?

    Hi, thanks for this awesome project!

    I started to write something similar but then realized that submitit exists and is much more advanced than my small project!

    However, I realized that the chain dependency (as it is implemented in dask.distributed) seems missing in submitit.

    Would it make sense to implement it? Or maybe it's already there?

    It should be quite easy to implement by using the sbatch option --dependency.

    opened by eserie 0
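
To make the `--dependency` idea concrete, the sketch below builds the option value in the `type:job_id[:job_id...]` form that sbatch accepts. `dependency_spec` is a hypothetical helper, not a submitit function.

```python
def dependency_spec(job_ids, kind="afterok"):
    """Build the value of sbatch's --dependency option, e.g. 'afterok:12:34'."""
    if not job_ids:
        raise ValueError("need at least one job id")
    return ":".join([kind, *map(str, job_ids)])


spec = dependency_spec([33389760])
print(spec)  # afterok:33389760
```

Assuming submitit forwards extra sbatch options through a `slurm_additional_parameters` dictionary (to be verified for the installed version), a second job could be chained with `executor.update_parameters(slurm_additional_parameters={"dependency": dependency_spec([first_job.job_id])})` before calling `submit` again.
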
  • Should we submit job on login node?

    Hi, I'm trying to use submitit to submit a job to my Slurm cluster on GCP. In this case, does it make sense to run a submitit script from the login node? When I run the example script, I see it execute on the local machine rather than on the 'compute' instances of the cluster. It does not seem to allocate an instance from the partition that I give either.

    opened by surajmenon72 0
  • No user code logging output is shown in logs

    Summary: I am not seeing expected logging info in SLURM log files.

    Given this submitit script:

    import logging.config
    
    import submitit
    import yaml
    
    from src.the_module import the_func
    
    with open("log.yml", "rt", encoding="utf-8") as logconfig:
        config = yaml.load(logconfig.read(), Loader=yaml.FullLoader)
    logging.config.dictConfig(config)
    
    executor = submitit.AutoExecutor(folder="log_test")
    
    executor.update_parameters(timeout_min=1, slurm_partition="dev")
    job = executor.submit(the_func, the, args)
    
    output = job.result()
    

    the logging in src.the_module:

    LOGGER = logging.getLogger(__name__)
    ...
    LOGGER.info(...)
    ...
    

    the logging config in "log.yml":

    version: 1
    ...
    loggers:
      src.the_module:
        level: INFO
        handlers: [console, file]
      the_module:
        level: INFO
        handlers: [console, file]
    

    I do not see the_module INFO lines in "log_test/JOBID_0_log.out", only the default submitit INFO log lines and the job stdout. Is this even supposed to work that way or does logging have to be configured some other way in submitit?

    opened by fleimgruber 0
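
A possible explanation, stated here as an assumption rather than a confirmed answer, is that `dictConfig` only runs in the submitting process, while the function executes in a separate interpreter that never sees that configuration. The sketch below wraps a function so the logging configuration is applied in whichever process calls it. `WithLogging` is a hypothetical class, and the config dictionary is a reduced stand-in for the contents of "log.yml".

```python
import logging
import logging.config


class WithLogging:
    """Callable that applies a logging config, then runs the wrapped function."""

    def __init__(self, func, config):
        self.func = func
        self.config = config

    def __call__(self, *args, **kwargs):
        # Executed in the process that runs the job, not the one that submits it.
        logging.config.dictConfig(self.config)
        return self.func(*args, **kwargs)


LOGGER = logging.getLogger("the_module")


def the_func(a, b):
    LOGGER.info("adding %s and %s", a, b)
    return a + b


config = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {"console": {"class": "logging.StreamHandler"}},
    "loggers": {"the_module": {"level": "INFO", "handlers": ["console"]}},
}

wrapped = WithLogging(the_func, config)
result = wrapped(5, 7)
```

With submitit, the wrapped callable would be submitted in place of the bare function, for example `executor.submit(WithLogging(the_func, config), 5, 7)`. Note that `logging.StreamHandler` writes to stderr by default, so such lines would be expected in the job's stderr log rather than its stdout log.
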
  • be tolerating about sacct error?

    On my Slurm cluster I haven't set up accounting yet. Is the following error message related to that? Maybe the accounting option can be turned off to avoid this error message?

    I was running it with hydra.

    [2022-11-08 14:00:40,915][HYDRA] Call #2 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '37']' returned non-zero exit status 1., status may be inaccurate.
    Slurm accounting storage is disabled
    submitit WARNING (2022-11-08 14:00:43,921) - Call #3 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '37']' returned non-zero exit status 1., status may be inaccurate.
    submitit WARNING (2022-11-08 14:00:43,921) - Call #3 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '37']' returned non-zero exit status 1., status may be inaccurate.
    [2022-11-08 14:00:43,921][HYDRA] Call #3 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '37']' returned non-zero exit status 1., status may be inaccurate.
    
    opened by min-xu-ai 0
  • array_parallelism for LocalExecutor

    When using job arrays, SlurmExecutor can limit the number of concurrently running jobs via the array_parallelism parameter. However, LocalExecutor does not seem to have corresponding functionality. Would it be meaningful to add an option to LocalExecutor that limits the number of concurrently running jobs? Would it be too cumbersome to make PicklingExecutor._internal_process_submissions limit the concurrent jobs without causing problems?

    Use case

    • A big job is partitioned into many smaller jobs
    • The slurm queue is full with many pending jobs
    • Would like to use the same codebase using submitit in a separate compute environment without slurm, with minimal code changes

    Current solution

    • Use ThreadPoolExecutor, where each worker runs a function that creates its own LocalExecutor, submits, and waits until the job finishes.
    • The pool then controls the number of concurrent jobs.

    Suggestion

    • The following code should execute without running more than 5 jobs concurrently:
    executor = LocalExecutor(folder=somewhere)
    executor.update_parameters(array_parallelism=5)
    
    with executor.batch():
        ...
    
    opened by se-ok 0
Releases(1.2.0)
  • 1.2.0(Feb 1, 2021)

    • #1604 Load numpy first if available
    • #1603 Don't rely on Slurm for detecting timeout vs preemption, due to a regression in Slurm between 19.04 and 20.02
    • #1602 Fix quoting of paths in various places
    • #1598 Snapshot manager to copy code before starting job

  • 1.1.3(Oct 22, 2020)

Owner
Facebook Incubator
We work hard to contribute our work back to the web, mobile, big data, & infrastructure communities. NB: members must have two-factor auth.