Large-scale and asynchronous hyperparameter optimization at your fingertips.

Overview

Syne Tune

This package provides state-of-the-art distributed hyperparameter optimizers (HPO). Trials can be evaluated with several backends: a local backend evaluates them on the local machine, and a SageMaker backend evaluates them as separate SageMaker training jobs. Another backend with fast startup times is in the making.

Installing

To install Syne Tune from pip, run:

pip install syne-tune

This installs a bare-bones version. To additionally install our own Gaussian process based optimizers, Ray Tune, or the Bore optimizer, run pip install syne-tune[X], where X can be:

  • gpsearchers: For built-in Gaussian process based optimizers
  • raytune: For Ray Tune optimizers
  • benchmarks: For installing all dependencies required to run all benchmarks
  • extra: For installing all the above
  • bore: For Bore optimizer

For instance, pip install syne-tune[gpsearchers] will install Syne Tune along with many built-in Gaussian process optimizers.

To install the latest version from git, run the following:

pip install git+https://github.com/awslabs/syne-tune.git

For local development, we recommend the following setup, which lets you test your changes easily:

pip install --upgrade pip
git clone git@github.com:awslabs/syne-tune.git
cd syne-tune
pip install -e .[extra]

How to enable tuning and tuning script conventions

This section describes how to enable tuning of an entry point script. In particular, we describe:

  1. how hyperparameters are transmitted from the “tuner” to the user script
  2. how the user script communicates metrics to the “tuner” (which depends on the backend implementation)
  3. how the user enables checkpointing to pause and resume trials

Hyperparameters. Hyperparameters are passed through command line arguments, as in SageMaker. For instance, for a hyperparameter num_epochs:

import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--num_epochs', type=int, required=True)
args, _ = parser.parse_known_args()
for i in range(1, args.num_epochs + 1):
  ... # do something
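
Because a backend may append arguments of its own (for example the st_checkpoint_dir argument described later in this section), parse_known_args is a convenient choice: arguments the script does not declare are returned separately instead of raising an error. The following is a minimal sketch of that behavior; the helper name parse_hyperparameters and the example argument values are illustrative assumptions, not part of Syne Tune.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse the hyperparameters this script knows about; keep the rest aside."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_epochs", type=int, required=True)
    # parse_known_args returns (namespace, list_of_unrecognized_arguments)
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


# Hypothetical command line, as a backend might pass it to the script.
args, unknown = parse_hyperparameters(
    ["--num_epochs", "3", "--st_checkpoint_dir", "/tmp/checkpoints"]
)
print(args.num_epochs)  # prints 3
print(unknown)          # prints ['--st_checkpoint_dir', '/tmp/checkpoints']
```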

Communicating metrics. You should call a function to report metrics after each epoch or at the end of the trial. For example:

from syne_tune.report import report
for epoch in range(1, num_epochs + 1):
   # ... do something
   train_acc = compute_accuracy()
   report(train_acc=train_acc, epoch=epoch)

This example reports artificial results obtained in a dummy loop. In addition to user metrics, Syne Tune automatically adds the following metrics:

  • st_worker_timestamp: the time stamp when report was called
  • st_worker_time: the total time spent when report was called since the creation of the reporter
  • st_worker_cost (only when running on SageMaker): the dollar-cost spent since the creation of the reporter
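
To illustrate how such metrics can be attached to each report, here is a toy reporter. It is only a sketch: the class name SketchReporter and the choice to print one JSON line per report are assumptions made for this example and do not describe Syne Tune's actual implementation. The cost metric is left out because it depends on SageMaker.

```python
import json
import time


class SketchReporter:
    """Toy reporter that augments user metrics with timing information."""

    def __init__(self):
        # Reference point for the elapsed-time metric.
        self._created_at = time.time()

    def __call__(self, **metrics):
        now = time.time()
        record = dict(metrics)
        record["st_worker_timestamp"] = now
        record["st_worker_time"] = now - self._created_at
        # Printing one JSON line per report is an assumption of this sketch.
        print(json.dumps(record))
        return record


report = SketchReporter()
result = report(train_acc=0.9, epoch=1)
```

The returned dictionary contains the user metrics (train_acc, epoch) together with the two timing entries.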

Model output and checkpointing (optional). Since trials may be paused and resumed (either by schedulers or when using spot instances), the user can checkpoint intermediate results. Model outputs and checkpoints must be written to a specific local path given by the command line argument st_checkpoint_dir. Saving and loading model checkpoints from this directory makes it possible to save and restore the state when the job is stopped and resumed. Setting the folder correctly and uniquely per trial is the responsibility of the backend.

Under the hood, we use the SageMaker checkpoint mechanism to enable checkpointing when running tuning remotely or when using the SageMaker backend. Checkpoints are saved in s3://{s3_bucket}/syne-tune/{tuner-name}/{trial-id}/, where s3_bucket can be configured (it defaults to the default_bucket of the session).

We refer to checkpoint_example.py for a complete example of a script with checkpoint enabled.
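
Independently of that example, the convention can be sketched in a few lines: read st_checkpoint_dir from the command line, load the state if a checkpoint exists, and save it after every epoch. The file name state.json, the JSON format, and the function names below are assumptions made for this sketch; a real script would store model weights with its framework's own serialization.

```python
import argparse
import json
import os
import tempfile


def load_checkpoint(checkpoint_dir):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    path = os.path.join(checkpoint_dir, "state.json")  # file name is an assumption
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0}


def save_checkpoint(checkpoint_dir, state):
    """Write the state into the checkpoint directory."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    with open(os.path.join(checkpoint_dir, "state.json"), "w") as f:
        json.dump(state, f)


def train(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_epochs", type=int, required=True)
    parser.add_argument("--st_checkpoint_dir", type=str, default=None)
    args, _ = parser.parse_known_args(argv)

    state = {"epoch": 0}
    if args.st_checkpoint_dir is not None:
        state = load_checkpoint(args.st_checkpoint_dir)
    # Resume from the epoch after the last one that was checkpointed.
    for epoch in range(state["epoch"] + 1, args.num_epochs + 1):
        # ... do something
        state = {"epoch": epoch}
        if args.st_checkpoint_dir is not None:
            save_checkpoint(args.st_checkpoint_dir, state)
    return state


# Simulate a trial that is paused after 2 epochs and later resumed up to 5.
with tempfile.TemporaryDirectory() as tmp:
    first = train(["--num_epochs", "2", "--st_checkpoint_dir", tmp])
    resumed = train(["--num_epochs", "5", "--st_checkpoint_dir", tmp])
    print(first, resumed)  # prints {'epoch': 2} {'epoch': 5}
```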

Many other examples of scripts that can be tuned are available in examples/training_scripts.

Launching a tuning job

Tuning options. At a high level, tuning consists of a tuning loop that evaluates different trials in parallel and lets only the top ones continue. This loop runs until a stopping criterion is met (for instance, a maximum wallclock time). Each time a worker is available, the loop asks a scheduler (an HPO algorithm) to decide which trial should be evaluated next. Trials are executed on a backend. The pseudo-code of an HPO loop is as follows:

def hpo_loop(hpo_algorithm, backend):
    while not_done():
        if worker_is_free():
            config = hpo_algorithm.suggest()
            backend.start_trial(config)
        for result in backend.fetch_new_results():
            decision = hpo_algorithm.on_trial_result(result)
            if decision == "stop":
                backend.stop_trial(result.trial)

By changing the backend, users can decide whether trials are evaluated on the local machine, on SageMaker with a separate training job per trial, or on a cluster of multiple machines (available as a separate package for now).
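
To make the separation between scheduler and backend more tangible, the pseudo-code can be turned into a small runnable toy. Everything in it is a stand-in: RandomSearchSketch, InProcessBackendSketch, the objective, and the max_trials stopping rule are hypothetical and only mimic the roles described above; they are not Syne Tune classes.

```python
import random


class RandomSearchSketch:
    """Toy scheduler: suggests random configs and never stops a trial."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def suggest(self):
        return {
            "width": self._rng.randint(0, 20),
            "height": self._rng.randint(-100, 100),
        }

    def on_trial_result(self, result):
        return "continue"


class InProcessBackendSketch:
    """Toy backend: evaluates each trial immediately in the current process."""

    def __init__(self):
        self._pending = []
        self.stopped = []

    def start_trial(self, config):
        # Hypothetical objective, used only to produce a number to minimize.
        loss = 0.1 * config["width"] + config["height"]
        self._pending.append({"config": config, "mean_loss": loss})

    def fetch_new_results(self):
        results, self._pending = self._pending, []
        return results

    def stop_trial(self, result):
        self.stopped.append(result)


def hpo_loop(hpo_algorithm, backend, max_trials):
    """Run the loop until max_trials trials have been started; return all results."""
    all_results = []
    num_started = 0
    while num_started < max_trials:
        # A worker is always free here because trials finish immediately.
        backend.start_trial(hpo_algorithm.suggest())
        num_started += 1
        for result in backend.fetch_new_results():
            all_results.append(result)
            if hpo_algorithm.on_trial_result(result) == "stop":
                backend.stop_trial(result)
    return all_results


results = hpo_loop(RandomSearchSketch(seed=0), InProcessBackendSketch(), max_trials=10)
best = min(results, key=lambda r: r["mean_loss"])
print(len(results), best)
```

Swapping InProcessBackendSketch for another object with the same three methods changes where trials run without touching the loop or the scheduler, which is the point of the design.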

Below is a minimal example showing how to tune a script train_height.py with random search:

from pathlib import Path
from syne_tune.search_space import randint
from syne_tune.backend.local_backend import LocalBackend
from syne_tune.optimizer.schedulers.fifo import FIFOScheduler
from syne_tune.stopping_criterion import StoppingCriterion
from syne_tune.tuner import Tuner

config_space = {
    "steps": 100,
    "width": randint(0, 20),
    "height": randint(-100, 100)
}

# path of a training script to be tuned
entry_point = Path(__file__).parent / "training_scripts" / "height_example" / "train_height.py"

# Local back-end
backend = LocalBackend(entry_point=str(entry_point))

# Random search without stopping
scheduler = FIFOScheduler(
    config_space,
    searcher="random",
    mode="min",
    metric="mean_loss",
)

tuner = Tuner(
    backend=backend,
    scheduler=scheduler,
    stop_criterion=StoppingCriterion(max_wallclock_time=30),
    n_workers=4,
)

tuner.run()

An important part of this script is the definition of config_space, the configuration space (or search space). This tutorial provides some advice on this choice.
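
As a rough illustration of what a search space encodes, the sketch below draws one random configuration from a simplified stand-in for config_space. It does not use Syne Tune's domain classes: the (lower, upper) tuples and the function sample_config are invented for this example, and both integer bounds are treated as inclusive only because that is how Python's random.randint behaves.

```python
import random


def sample_config(config_space, rng):
    """Draw one configuration: tuples are integer ranges, other values are constants."""
    config = {}
    for name, domain in config_space.items():
        if isinstance(domain, tuple):
            lower, upper = domain
            # random.randint includes both bounds.
            config[name] = rng.randint(lower, upper)
        else:
            config[name] = domain
    return config


# Stand-in for the config_space above: (lower, upper) replaces randint(lower, upper).
toy_space = {"steps": 100, "width": (0, 20), "height": (-100, 100)}
config = sample_config(toy_space, random.Random(0))
print(config)
```

Constant entries such as "steps" are passed through unchanged, so every sampled configuration contains all the arguments the training script expects.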

Using the local backend LocalBackend(entry_point=...) runs the trials (4 at a time) on the local machine. If users prefer to evaluate trials on SageMaker instead, the SageMaker backend can be used, which allows tuning any SageMaker Framework (see launch_height_sagemaker.py for an example). Here is an example that runs a PyTorch estimator on a GPU:

from sagemaker.pytorch import PyTorch
from syne_tune.backend.sagemaker_backend.sagemaker_backend import SagemakerBackend
from syne_tune.backend.sagemaker_backend.sagemaker_utils import get_execution_role

backend = SagemakerBackend(
    # we tune a PyTorch Framework from Sagemaker
    sm_estimator=PyTorch(
        entry_point="path_to_your_entrypoint.py",
        instance_type="ml.p2.xlarge",
        instance_count=1,
        role=get_execution_role(),
        max_run=10 * 60,
        framework_version='1.7.1',
        py_version='py3',
    ),
)

Note that the Syne Tune code is sent along with the SageMaker Framework, so that the import of syne_tune.report (which provides the reporter) works when the training script is executed. As a result, there is no need to install Syne Tune in the docker image of the SageMaker Framework.

In addition, users can decide to run the tuning loop on a remote instance. This avoids having to keep a developer machine running and makes it easier to benchmark many seed/model options.

tuner = RemoteLauncher(
    tuner=Tuner(
        backend=backend,
        scheduler=scheduler,
        n_workers=n_workers,
        tuner_name="height-tuning",
        stop_criterion=StoppingCriterion(max_wallclock_time=600),
    ),
    # Extra arguments describing the resources of the remote tuning instance and whether we want to wait
    # for the tuning to finish. The instance type where the tuning job runs can be different from the
    # instance type used for evaluating the training jobs.
    instance_type='ml.m5.large',
)

tuner.run(wait=False)

In this case, the tuning loop is executed on an ml.m5.large instance instead of running locally. Both backends can be used with the remote launcher. With the SageMaker backend, the tuning loop runs on the instance type specified in the remote launcher, and the trials are evaluated on the instance(s) configured in the SageMaker framework, which may include several instances in the case of distributed training. In that case, a SageMaker job is created to execute the tuning loop, which then schedules a new SageMaker training job for each configuration to be evaluated. The options and use cases are summarized in this table:

| Tuning loop | Trial execution | Use-case | Example |
|-------------|-----------------|----------|---------|
| Local | Local | Quick tuning for cheap models, debugging. | launch_height_local.py |
| Local | SageMaker | Avoid saturating machine with trial computation with expensive trial, possibly use distributed training, enable debugging the tuning loop on a local machine. | launch_height_sagemaker.py |
| SageMaker | Local | Run remotely to benchmark many HPO algo/seeds options, possibly with a big machine with multiple CPUs or GPUs. | launch_height_sagemaker_remotely.py |
| SageMaker | SageMaker | Run remotely to benchmark many HPO algo/seeds options, enable distributed training or heavy computation. | launch_height_sagemaker_remotely.py with distribute_trials_on_SageMaker=True |

To summarize: to evaluate trials locally, users should use LocalBackend; to evaluate trials on SageMaker, users should use SageMakerBackend, which allows tuning any SageMaker Estimator. See launch_height_local.py and launch_height_sagemaker.py for examples. To run a tuning loop remotely, RemoteLauncher can be used; see launch_height_sagemaker_remotely.py for an example.

Output of a tuning job.

Every tuning experiment generates three files:

  • results.csv.zip contains live information on all the results seen by the scheduler, along with other information such as the decisions taken by the scheduler, the wallclock time, and the dollar cost of the tuning (only on SageMaker).
  • tuner.dill contains the checkpoint of the tuner, which includes the backend, the scheduler, and other information. This can be used to resume a tuning experiment, to use Spot instances for tuning, or to perform fine-grained analysis of the scheduler state.
  • metadata.json contains the time stamp when the Tuner actually started to run. It also contains any metadata provided by the user.

For instance, the following code:

tuner = Tuner(
   backend=backend,
   scheduler=scheduler,
   n_workers=4,
   tuner_name="height-tuning",
   stop_criterion=StoppingCriterion(max_wallclock_time=600),
   metadata={'description': 'just an example'},
)
tuner.run()

runs a tuning experiment that evaluates 4 configurations in parallel with the given backend and scheduler and stops after 600 seconds. Tuner appends a unique string to ensure that the tuner name is unique (in the above example, the id of the experiment may be height-tuning-2021-07-02-10-04-37-233). Results are updated every 30 seconds by default; this interval is configurable.
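
The example id above is consistent with a time stamp at millisecond resolution appended to the tuner name. Under that assumption (the exact scheme used by Tuner is not spelled out here), such a suffix could be produced as follows; the function name unique_name is hypothetical.

```python
from datetime import datetime


def unique_name(prefix, now=None):
    """Append a millisecond-resolution time stamp to `prefix`."""
    if now is None:
        now = datetime.now()
    # %f yields microseconds (6 digits); keep the first 3 for milliseconds.
    stamp = now.strftime("%Y-%m-%d-%H-%M-%S-%f")[:-3]
    return f"{prefix}-{stamp}"


name = unique_name("height-tuning", datetime(2021, 7, 2, 10, 4, 37, 233000))
print(name)  # prints height-tuning-2021-07-02-10-04-37-233
```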

Experiment data can be retrieved at a later stage for further analysis with the following command:

tuning_experiment = load_experiment("height-tuning-2021-07-02-10-04-37-233")
tuning_experiment = load_experiment(tuner.name) # equivalent

The results returned by load_experiment have the following schema:

class ExperimentResult:
    name: str
    results: pandas.DataFrame
    metadata: Dict
    tuner: Tuner

Here, metadata contains the metadata provided by the user ({'description': 'just an example'} in this case) as well as st_tuner_creation_timestamp, which stores the time stamp when the tuning actually started.
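
For a quick look at results without loading the full experiment, a zipped CSV file can also be read with the standard library alone. The sketch below writes a tiny made-up results file and picks the best row. The column names trial_id and epoch, the member name inside the archive, and the helper functions are assumptions for illustration; only the metric name mean_loss is taken from the example above, and the numbers are invented.

```python
import csv
import io
import os
import tempfile
import zipfile


def write_results(path, rows):
    """Write toy result rows as a zipped CSV file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["trial_id", "epoch", "mean_loss"])
    writer.writeheader()
    writer.writerows(rows)
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("results.csv", buffer.getvalue())


def best_trial(path, metric="mean_loss", mode="min"):
    """Return the row with the best metric value in a zipped CSV file."""
    with zipfile.ZipFile(path) as archive:
        # Assume the archive holds a single CSV member.
        member = archive.namelist()[0]
        text = archive.read(member).decode("utf-8")
    rows = list(csv.DictReader(io.StringIO(text)))
    pick = min if mode == "min" else max
    return pick(rows, key=lambda row: float(row[metric]))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.csv.zip")
    write_results(path, [
        {"trial_id": 0, "epoch": 1, "mean_loss": 12.5},
        {"trial_id": 1, "epoch": 1, "mean_loss": -3.0},
        {"trial_id": 2, "epoch": 1, "mean_loss": 7.25},
    ])
    best = best_trial(path)
    print(best["trial_id"])  # prints 1
```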

Output of a tuning job when running tuning on SageMaker. When the tuning runs remotely on SageMaker, the results are stored at a regular cadence to s3://{s3_bucket}/syne-tune/{tuner-name}/, where s3_bucket can be configured (defaults to default_bucket of the session). For instance, if the above experiment is run remotely, the following path is used for checkpointing results and states:

s3://sagemaker-us-west-2-{aws_account_id}/syne-tune/height-tuning-2021-07-02-10-04-37-233/results.csv.zip
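
The path template can be written as a small helper, which is handy when locating the results of several experiments. The function name results_s3_path is hypothetical, and the bucket name below uses a placeholder account id.

```python
def results_s3_path(s3_bucket, tuner_name, filename="results.csv.zip"):
    """Build the S3 location of a tuning output file from the path template."""
    return f"s3://{s3_bucket}/syne-tune/{tuner_name}/{filename}"


path = results_s3_path(
    "sagemaker-us-west-2-123456789012",  # placeholder account id
    "height-tuning-2021-07-02-10-04-37-233",
)
print(path)
```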

Multiple GPUs. If your instance has multiple GPUs, the local backend can run different trials in parallel, each on its own GPU (with the option LocalBackend(rotate_gpus=True), which is activated by default). When a new trial starts, it is assigned to a free GPU if possible. In case of ties, the GPU with the fewest prior assignments is chosen. If the number of workers is larger than the number of GPUs, several trials will run as subprocesses on the same GPU. If the number of workers is smaller than or equal to the number of GPUs, each trial occupies a GPU on its own, and trials can start without delay. Reasons to choose rotate_gpus=False include insufficient GPU memory or training evaluation code that already makes good use of multiple GPUs.
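
The assignment rule can be sketched as follows. This is an illustration of the rule as described, not the local backend's code: the function assign_gpu and its bookkeeping lists are hypothetical, and choosing the least busy GPU when none is free is an assumption of this sketch.

```python
def assign_gpu(running_per_gpu, assignments_per_gpu):
    """Pick a GPU index for a new trial and update the bookkeeping lists.

    running_per_gpu[i]: number of trials currently running on GPU i.
    assignments_per_gpu[i]: number of trials ever assigned to GPU i.
    """
    num_gpus = len(running_per_gpu)
    # Prefer the least busy GPU (a free one has 0 running trials);
    # break ties by the fewest prior assignments, then by index.
    gpu = min(
        range(num_gpus),
        key=lambda i: (running_per_gpu[i], assignments_per_gpu[i], i),
    )
    running_per_gpu[gpu] += 1
    assignments_per_gpu[gpu] += 1
    return gpu


running = [0, 0]
assigned = [0, 0]
first = assign_gpu(running, assigned)   # both free, no history: GPU 0
second = assign_gpu(running, assigned)  # GPU 0 is busy: GPU 1
running[0] -= 1                         # trial on GPU 0 finishes
running[1] -= 1                         # trial on GPU 1 finishes
assigned[0] += 2                        # pretend GPU 0 was used more in the past
third = assign_gpu(running, assigned)   # both free, GPU 1 has fewer assignments
print(first, second, third)  # prints 0 1 1
```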

Examples

Once you have a tuning script, you can call Tuner with any scheduler to perform your HPO. Examples can be found in the examples/ folder.

Running on SageMaker

If you want to launch experiments on SageMaker rather than on your local machine, you will need access to AWS and SageMaker on your machine.

Make sure that:

  • awscli is installed (see this link)
  • docker is installed and running (see this link)
  • A SageMaker role has been created (see this page for instructions; if you created a SageMaker notebook in the past, this role should already have been created for you).
  • AWS credentials have been set properly (see this link).

Note: all these conditions are already met if you run in a SageMaker notebook. They are only relevant if you run on your local machine or in another environment.

The following command should run without error if your credentials are available:

python -c "import boto3; print(boto3.client('sagemaker').list_training_jobs(MaxResults=1))"

Or run the following example that evaluates trials on SageMaker.

python examples/launch_height_sagemaker.py

Syne Tune allows you to launch HPO experiments remotely on SageMaker, instead of them running on your local machine. This is particularly interesting for running many experiments in parallel. Here is an example:

python examples/launch_height_sagemaker_remotely.py

The first time you run this, it will take a while, because a docker image with the Syne Tune dependencies is built and pushed to ECR. This has to be done only once, even if the Syne Tune source code is modified later on.

Assuming that launch_height_sagemaker_remotely.py is working for you now, you should note that the script returns immediately after starting the experiment, which is running as a SageMaker training job. This allows you to run many experiments in parallel, possibly by using the command line launcher.

If running this example fails, you are probably not set up to build docker images and push them to ECR on your local machine. Check that aws-cli is installed and that docker is running on your machine. After checking that those conditions are met (consider using a SageMaker notebook if not, since AWS access and docker are configured automatically there), you can try building the image again by running the following:

cd container
bash build_syne_tune_container.sh

To run on SageMaker, you can also use any custom docker image available on ECR. See launch_height_sagemaker_custom_image.py for an example of how to run a script with a custom docker image.

Benchmarks

Syne Tune comes with a range of benchmarks for testing and demonstration. Turning your own tuning problem into a benchmark is simple and comes with a number of advantages. As detailed in this tutorial, you can use the CL launcher launch_hpo.py in order to start one or more experiments, adjusting many parameters of benchmark, back-end, tuner, or scheduler from the command line. The simpler launch_benchmarks.py can also be used to launch experiments.

Once tuning experiments are finished, show_experiment_results.py gives an example of how results can be retrieved and plotted.

Tutorials

Do you want to know more? Here are a number of short tutorials.

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Comments
  • Allow for independence between number of trials and number of combinations.

    TL;DR: Please allow to run experiments with more (or less) trials than the number of combinations.

    Right now when running a hyperparameter tuning job with AMT I get an error message when the number of trials exceeds the number of enumerable combinations; here being the discrete combinations of integer and categorical parameter values.

    But this is limiting as the exploration of the search space is noisy and more than one trial may be needed to understand the variance and to establish a stable mean.

    As a workaround I am starting tuning jobs with the Random strategy with an additional hyper parameter "dummy" that is continuous. This way I can specify the number of trials I need. But this makes it harder to use this data basis for future warmstarts to narrow down scenarios. Further it forces me to allow the "dummy" parameter in my training script.

    Example: I want to know if adding a non-linearity and additional capacity (Pooler) on top of a BERT-like model will yield better performance as result, or if the extra capacity will make the model lazier and not use the transformer blocks below. I also want to see if this assessment changes when adding more transformer layers.

    So I have two categorical variables. Layers: [1, 4, 8] and scale-of-classifier: [0, 0.5, 1.0, 2.0]. These are just 3*4 combinations. But given the noisy nature of a NN training a single data point per combination has next to no meaning. To produce the understanding below I used about 100 data points with the workaround from above.

    If I could just specify the categorical parameters and the number of trials (for GridSearch/Random) my appreciation will follow you until the end of your hopefully long and fulfilling life.

    image image

    opened by marianokamp 23
  • Experiment Results Contain Random Rows

    In my experiment, the result data frame contains multiple rows with trial id 1 with the same content as the next row, the only difference being the config. This causes problems since sometimes the best config is now trial id 1 that shows a config which did not achieve the best performance.

    See this example: True trial id 1 performance is 81% (Row 4) but trial id 1 also shows up in row 10 with highest accuracy. I've added a simple example to reproduce this behavior.

    image

    from pathlib import Path
    
    from sagemaker.pytorch import PyTorch
    
    from syne_tune.backend import SageMakerBackend
    from sagemaker import get_execution_role
    from syne_tune.optimizer.baselines import RandomSearch
    from syne_tune import Tuner
    from syne_tune.config_space import randint
    from syne_tune import StoppingCriterion
    from syne_tune.optimizer.schedulers.fifo import FIFOScheduler
    
    entry_point = Path('examples') / "training_scripts" / "height_example" / "train_height.py"
    assert entry_point.is_file(), 'File unknown'
    mode = "min"
    metric = "mean_loss"
    instance_type = 'ml.c5.4xlarge'
    instance_count = 1
    instance_max_time = 999
    n_workers = 20
    
    config_space = {
        "steps": 1,
        "width": randint(0, 20),
        "height": randint(-100, 100)
    }
    
    backend = SageMakerBackend(
        sm_estimator=PyTorch(
            entry_point=str(entry_point),
            instance_type=instance_type,
            instance_count=instance_count,
            role=get_execution_role(),
            max_run=instance_max_time,
            py_version='py3',
            framework_version='1.6',
        ),
        metrics_names=[metric],
    )
    
    # Random search without stopping
    scheduler = FIFOScheduler(
        config_space=config_space,
        searcher='random',
        mode=mode,
        metric=metric,
    )
    
    tuner = Tuner(
        trial_backend=backend,
        scheduler=scheduler,
        stop_criterion=StoppingCriterion(max_wallclock_time=300),
        n_workers=n_workers,
    )
    
    tuner.run()
    
    
    bug 
    opened by wistuba 17
  • Promotion Logic Bug

    There seems to be a problem with the Hyperband promotion logic.

    How to reproduce: Add type="promotion" to https://github.com/awslabs/syne-tune/blob/main/benchmarking/nursery/benchmark_automl/baselines.py#L69

    Run python benchmarking/nursery/benchmark_automl/benchmark_main.py --num_seeds 1 --method ASHA --benchmark lcbench-airlines

      File "/syne-tune/benchmarking/nursery/benchmark_automl/benchmark_main.py", line 209, in <module>
        tuner.run()
      File "/syne-tune/syne_tune/tuner.py", line 240, in run
        raise e
      File "/syne-tune/syne_tune/tuner.py", line 175, in run
        new_done_trial_statuses, new_results = self._process_new_results(
      File "/syne-tune/syne_tune/tuner.py", line 345, in _process_new_results
        done_trials_statuses = self._update_running_trials(
      File "/syne-tune/syne_tune/tuner.py", line 465, in _update_running_trials
        decision = self.scheduler.on_trial_result(trial=trial, result=result)
      File "/syne-tune/syne_tune/optimizer/schedulers/hyperband.py", line 779, in on_trial_result
        task_info = self.terminator.on_task_report(trial_id, result)
      File "/syne-tune/syne_tune/optimizer/schedulers/hyperband.py", line 1124, in on_task_report
        rung_sys.on_task_report(trial_id, result, skip_rungs=skip_rungs)
      File "/syne-tune/syne_tune/optimizer/schedulers/hyperband_promotion.py", line 221, in on_task_report
        assert resource == milestone, (
    AssertionError: trial_id 1: resource = 4 > 3 milestone. Make sure to report time attributes covering all milestones
    bug 
    opened by wistuba 16
  • Refactor surrogates in blackbox repository

    Currently, surrogates may return inconsistent metric curves (e.g., elapsed_time not monotonic w.r.t. fidelity). It is also unclear how seed is treated in a surrogate.

    Will use multi-variate regression natively supported in scikit-learn. We currently already use that w.r.t. num_objectives. The input of the model will be the HP config only. The old way can still be used, but won't be the default.

    Will also sort out the situation with seed.

    enhancement 
    opened by mseeger 15
  • Grid search in syne-tune

    Hey folks, would you be interested in grid search implemented in syne-tune? I had a few offline discussions with some of you already, and it seems that you are not against grid search added to syne-tune, but want to keep a record of that here.

    Additionally, would you have any pointers as to what would be the best way to add grid search to syne-tune?

    enhancement 
    opened by iaroslav-ai 15
  • SageMaker ResourceLimitExceeded

    Hi, I have a limit of 8 ml.g5.12xlarge instances, and although I set Tuner.n_workers = 5 I still got a ResourceLimitExceeded error. Is there a way to make sure that jobs are fully stopped when using SageMakerBackend before launching new ones?

    Also, when using RemoteLauncher, in situations where the management instance does error out (for example due to ResourceLimitExceeded), is there a way to make sure the management instance sends a stop signal to all tuning jobs before exiting? Maybe something like:

    try:
        ...  # manage tuning jobs
    except Exception:
        raise  # re-raise the error
    finally:
        ...  # stop any trials still running
    
    enhancement 
    opened by austinmw 15
  • Issue with running launch_sagemaker_backend.py: No module named 'benchmarks'

    Hello! When running https://github.com/awslabs/syne-tune/blob/main/docs/tutorials/basics/scripts/launch_sagemaker_backend.py (python docs/tutorials/basics/scripts/launch_sagemaker_backend.py) on the main branch I get an error within the spawned SageMaker training jobs:

    Traceback (most recent call last):
      File "traincode_report_withcheckpointing.py", line 29, in <module>
        from benchmarks.checkpoint import resume_from_checkpointed_model, \
    ModuleNotFoundError: No module named 'benchmarks'
    

    I'm including the full log below. I’m not certain if it’s due to my AWS environment setup (although I am generally able to run SageMaker training jobs) or an issue with the code, could you please have a look?

    Best wishes, Adam

    Full log:

    showing log of sagemaker job: traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4
    bash: cannot set terminal process group (-1): Inappropriate ioctl for device
    bash: no job control in this shell
    2022-01-18 16:34:35,020 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
    2022-01-18 16:34:35,023 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
    2022-01-18 16:34:35,035 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
    2022-01-18 16:34:36,465 sagemaker_pytorch_container.training INFO     Invoking user training script.
    2022-01-18 16:34:37,061 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
    2022-01-18 16:34:37,076 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
    2022-01-18 16:34:37,090 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
    2022-01-18 16:34:37,103 sagemaker-training-toolkit INFO     Invoking user script
    Training Env:
    {
        "additional_framework_parameters": {},
        "channel_input_dirs": {},
        "current_host": "algo-1",
        "framework_module": "sagemaker_pytorch_container.training:main",
        "hosts": [
            "algo-1"
        ],
        "hyperparameters": {
            "batch_size": 126,
            "weight_decay": 0.7744002774231975,
            "st_checkpoint_dir": "/opt/ml/checkpoints",
            "st_instance_count": 1,
            "n_units_2": 322,
            "dataset_path": "./",
            "n_units_1": 107,
            "dropout_2": 0.20979101632756325,
            "dropout_1": 0.4715702331554363,
            "epochs": 81,
            "learning_rate": 0.0029903699075321814,
            "st_instance_type": "ml.m4.10xlarge"
        },
        "input_config_dir": "/opt/ml/input/config",
        "input_data_config": {},
        "input_dir": "/opt/ml/input",
        "is_master": true,
        "job_name": "traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4",
        "log_level": 20,
        "master_hostname": "algo-1",
        "model_dir": "/opt/ml/model",
        "module_dir": "s3://sagemaker-us-west-2-640549960621/traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4/source/sourcedir.tar.gz",
        "module_name": "traincode_report_withcheckpointing",
        "network_interface_name": "eth0",
        "num_cpus": 40,
        "num_gpus": 0,
        "output_data_dir": "/opt/ml/output/data",
        "output_dir": "/opt/ml/output",
        "output_intermediate_dir": "/opt/ml/output/intermediate",
        "resource_config": {
            "current_host": "algo-1",
            "hosts": [
                "algo-1"
            ],
            "network_interface_name": "eth0"
        },
        "user_entry_point": "traincode_report_withcheckpointing.py"
    }
    Environment variables:
    SM_HOSTS=["algo-1"]
    SM_NETWORK_INTERFACE_NAME=eth0
    SM_HPS={"batch_size":126,"dataset_path":"./","dropout_1":0.4715702331554363,"dropout_2":0.20979101632756325,"epochs":81,"learning_rate":0.0029903699075321814,"n_units_1":107,"n_units_2":322,"st_checkpoint_dir":"/opt/ml/checkpoints","st_instance_count":1,"st_instance_type":"ml.m4.10xlarge","weight_decay":0.7744002774231975}
    SM_USER_ENTRY_POINT=traincode_report_withcheckpointing.py
    SM_FRAMEWORK_PARAMS={}
    SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
    SM_INPUT_DATA_CONFIG={}
    SM_OUTPUT_DATA_DIR=/opt/ml/output/data
    SM_CHANNELS=[]
    SM_CURRENT_HOST=algo-1
    SM_MODULE_NAME=traincode_report_withcheckpointing
    SM_LOG_LEVEL=20
    SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
    SM_INPUT_DIR=/opt/ml/input
    SM_INPUT_CONFIG_DIR=/opt/ml/input/config
    SM_OUTPUT_DIR=/opt/ml/output
    SM_NUM_CPUS=40
    SM_NUM_GPUS=0
    SM_MODEL_DIR=/opt/ml/model
    SM_MODULE_DIR=s3://sagemaker-us-west-2-640549960621/traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4/source/sourcedir.tar.gz
    SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"batch_size":126,"dataset_path":"./","dropout_1":0.4715702331554363,"dropout_2":0.20979101632756325,"epochs":81,"learning_rate":0.0029903699075321814,"n_units_1":107,"n_units_2":322,"st_checkpoint_dir":"/opt/ml/checkpoints","st_instance_count":1,"st_instance_type":"ml.m4.10xlarge","weight_decay":0.7744002774231975},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","is_master":true,"job_name":"traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-640549960621/traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4/source/sourcedir.tar.gz","module_name":"traincode_report_withcheckpointing","network_interface_name":"eth0","num_cpus":40,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"traincode_report_withcheckpointing.py"}
    SM_USER_ARGS=["--batch_size","126","--dataset_path","./","--dropout_1","0.4715702331554363","--dropout_2","0.20979101632756325","--epochs","81","--learning_rate","0.0029903699075321814","--n_units_1","107","--n_units_2","322","--st_checkpoint_dir","/opt/ml/checkpoints","--st_instance_count","1","--st_instance_type","ml.m4.10xlarge","--weight_decay","0.7744002774231975"]
    SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
    SM_HP_BATCH_SIZE=126
    SM_HP_WEIGHT_DECAY=0.7744002774231975
    SM_HP_ST_CHECKPOINT_DIR=/opt/ml/checkpoints
    SM_HP_ST_INSTANCE_COUNT=1
    SM_HP_N_UNITS_2=322
    SM_HP_DATASET_PATH=./
    SM_HP_N_UNITS_1=107
    SM_HP_DROPOUT_2=0.20979101632756325
    SM_HP_DROPOUT_1=0.4715702331554363
    SM_HP_EPOCHS=81
    SM_HP_LEARNING_RATE=0.0029903699075321814
    SM_HP_ST_INSTANCE_TYPE=ml.m4.10xlarge
    PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages
    Invoking script with the following command:
    /opt/conda/bin/python3.6 traincode_report_withcheckpointing.py --batch_size 126 --dataset_path ./ --dropout_1 0.4715702331554363 --dropout_2 0.20979101632756325 --epochs 81 --learning_rate 0.0029903699075321814 --n_units_1 107 --n_units_2 322 --st_checkpoint_dir /opt/ml/checkpoints --st_instance_count 1 --st_instance_type ml.m4.10xlarge --weight_decay 0.7744002774231975
    Traceback (most recent call last):
      File "traincode_report_withcheckpointing.py", line 29, in <module>
        from benchmarks.checkpoint import resume_from_checkpointed_model, \
    ModuleNotFoundError: No module named 'benchmarks'
    2022-01-18 16:34:38,444 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
    Command "/opt/conda/bin/python3.6 traincode_report_withcheckpointing.py --batch_size 126 --dataset_path ./ --dropout_1 0.4715702331554363 --dropout_2 0.20979101632756325 --epochs 81 --learning_rate 0.0029903699075321814 --n_units_1 107 --n_units_2 322 --st_checkpoint_dir /opt/ml/checkpoints --st_instance_count 1 --st_instance_type ml.m4.10xlarge --weight_decay 0.7744002774231975"
    Traceback (most recent call last):
      File "traincode_report_withcheckpointing.py", line 29, in <module>
        from benchmarks.checkpoint import resume_from_checkpointed_model, \
    ModuleNotFoundError: No module named 'benchmarks'
    
    opened by talesa 14
  • Bug with Seeded Searchers

    Opened on behalf of @timyber:

    When a fixed seed is set, random sampling always behaves the same way, and the search space is exhausted if we run a large number of training jobs. This can be reproduced by using a large budget (e.g. max_training_jobs: 100, batch_size: 1) and setting the seed to a fixed value.

    bug 
    opened by wistuba 12
  • Numeric and Log-Scale Choice

    There is no equivalent of choice for numeric values. E.g., in the FCNet blackbox the learning rate is defined as 'hp_init_lr': choice([0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]). This does not allow model-based approaches to encode this hyperparameter correctly. It would be great to identify such hyperparameters as numeric and also to indicate whether a log transform is needed.
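To make the encoding concern concrete, here is a minimal sketch that contrasts a categorical (one-hot) encoding with a log-scaled numeric encoding of the learning-rate values quoted in this issue. It uses only the standard library; `one_hot` and `log_scaled` are hypothetical helper names for illustration and are not Syne Tune functions.

```python
import math

# Learning-rate values copied from the FCNet example in this issue.
LR_VALUES = [0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]


def one_hot(value, values):
    """Categorical encoding: no notion of order or distance between values."""
    return [1.0 if v == value else 0.0 for v in values]


def log_scaled(value, values):
    """Numeric encoding: map log(value) linearly onto [0, 1]."""
    lo, hi = math.log(min(values)), math.log(max(values))
    return (math.log(value) - lo) / (hi - lo)


for lr in LR_VALUES:
    # The numeric encoding preserves the ordering of the learning rates.
    print(lr, one_hot(lr, LR_VALUES), round(log_scaled(lr, LR_VALUES), 3))
```

Under the one-hot encoding, 0.0005 is as far from 0.001 as it is from 0.1, whereas the log-scaled value keeps neighbouring learning rates close together, which is the information a surrogate model would presumably benefit from.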

    enhancement 
    opened by wistuba 10
  • Gridsearcher issue 2

    Issue #, if available: #378

    Description of changes: Added support for continuous hyperparameters to Gridsearch, added a unit test for it as well

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by mina-ghashami 9
  • [WIP] Integrate YAHPO Gym

    Description of changes:

    First draft for including YAHPO Gym as a BlackBoxRecipe. This is not entirely straightforward and I might need some input from @geoalgo on how to progress.

    Currently, YAHPO has a nested structure, /scenario/instance, where a scenario is a problem set and all instances within a scenario share the same search space (for example, the scenario is xgboost and the instances are different datasets). In the current design, the user would call

    bb = load_blackbox("YAHPO")[<scenario>]
    bb.set_instance(<instance>)
    

    Unnesting this structure would yield around 850 instances in total.

    @geoalgo Could you perhaps do a pass / help me think about how to integrate the different designs? I guess we might want to have one Recipe per scenario as you do in the icml_2020 recipe? Would this bloat the Recipes?

    I will list a few open to-do's:

    • [ ] Check what needs to be serialized / How to distribute the .onnx neural networks.
    • [ ] Check where the YAHPO setup (pointer to data dir etc.) needs to happen.
    • [ ] Add an example script.
    opened by pfistfl 9
  • Different searchers suggest same initial random configs. New methods …

    …in baselines

    Issue #, if available:

    Description of changes: Ensures that all searchers return the same random initial configs when started with the same seed. Also:

    • New classes in baselines.py
    • Split searcher.py (got too large)
    • Make sure that BOHB schedulers do not return duplicates

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by mseeger 0
  • Allow get_config to return the same config more than once

    Issue #, if available: 415

    Description of changes: Introduces the flag allow_duplicates for searchers, which allows the same config to be returned more than once. Also contains a new test that checks whether searchers properly implement allow_duplicates=False (the default).
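The behaviour of such a flag can be sketched with a toy searcher over a small finite space. This is an illustration under stated assumptions, not Syne Tune's implementation: the class `ToyRandomSearcher`, its constructor arguments, and the choice to return None once the space is exhausted are all hypothetical.

```python
import random


class ToyRandomSearcher:
    """Hypothetical stand-in for a searcher; not Syne Tune's implementation."""

    def __init__(self, values, seed=0, allow_duplicates=False):
        self._values = list(values)
        self._rng = random.Random(seed)
        self._allow_duplicates = allow_duplicates
        self._seen = set()

    def get_config(self):
        """Return a config, or None once no unseen config is left."""
        if self._allow_duplicates:
            return {"x": self._rng.choice(self._values)}
        remaining = [v for v in self._values if v not in self._seen]
        if not remaining:
            return None  # finite space exhausted
        value = self._rng.choice(remaining)
        self._seen.add(value)
        return {"x": value}


unique = ToyRandomSearcher([1, 2, 3], allow_duplicates=False)
unique_configs = [unique.get_config() for _ in range(5)]
print(unique_configs)  # three distinct configs, then None twice

repeating = ToyRandomSearcher([1, 2, 3], allow_duplicates=True)
repeated_configs = [repeating.get_config() for _ in range(5)]
print(repeated_configs)  # five configs; repeats are possible
```

With allow_duplicates=False the toy searcher runs out of configs after three calls, which mirrors the search-space exhaustion reported for seeded searchers; with allow_duplicates=True it can keep suggesting configs indefinitely.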

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by mseeger 0
  • pytest: Add pytest-xdist

    feat/parallelize-tests

    Why

    With pytest-xdist we can parallelize the test suite. This results in a quick win for a faster feedback loop.

    Numbers

    Below are the results of the test suite on my machine (an i9-13900k):

    | command | run # | real | user | sys |
    | --- | --- | --- | --- | --- |
    | pytest | 1 | 0m35.141s | 1m12.942s | 2m6.306s |
    | pytest | 2 | 0m37.991s | 2m18.906s | 2m49.160s |
    | pytest | 3 | 0m36.207s | 1m36.687s | 2m39.535s |
    | pytest -n 1 --dist loadgroup | 1 | 0m35.896s | 1m25.656s | 2m17.317s |
    | pytest -n 1 --dist loadgroup | 2 | 0m38.354s | 2m2.959s | 3m7.145s |
    | pytest -n 1 --dist loadgroup | 3 | 0m38.792s | 2m14.633s | 3m8.681s |
    | pytest -n 2 --dist loadgroup | 1 | 0m25.270s | 2m24.090s | 3m14.851s |
    | pytest -n 2 --dist loadgroup | 2 | 0m28.600s | 3m51.649s | 4m6.193s |
    | pytest -n 2 --dist loadgroup | 3 | 0m28.093s | 3m43.806s | 3m49.875s |
    | pytest -n 3 --dist loadgroup | 1 | 0m22.370s | 2m59.735s | 3m32.893s |
    | pytest -n 3 --dist loadgroup | 2 | 0m19.252s | 1m55.877s | 2m37.809s |
    | pytest -n 3 --dist loadgroup | 3 | 0m22.168s | 2m56.645s | 3m25.640s |
    | pytest -n 4 --dist loadgroup | 1 | 0m20.715s | 2m50.816s | 3m20.745s |
    | pytest -n 4 --dist loadgroup | 2 | 0m20.518s | 2m48.112s | 3m36.456s |
    | pytest -n 4 --dist loadgroup | 3 | 0m20.832s | 3m2.586s | 3m28.027s |

    The average real time for each command is:

    | command | real |
    | --- | --- |
    | pytest | 0m36.446s |
    | pytest -n 1 --dist loadgroup | 0m37.681s |
    | pytest -n 2 --dist loadgroup | 0m27.321s |
    | pytest -n 3 --dist loadgroup | 0m21.263s |
    | pytest -n 4 --dist loadgroup | 0m20.688s |
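The averages can be recomputed from the per-run real times with a short script. This is a minimal sketch using only the standard library; the timings are copied from the first table above, and `to_seconds` is a hypothetical helper name.

```python
import re
from statistics import mean

# Real times copied from the per-run table above.
REAL_TIMES = {
    "pytest": ["0m35.141s", "0m37.991s", "0m36.207s"],
    "pytest -n 1 --dist loadgroup": ["0m35.896s", "0m38.354s", "0m38.792s"],
    "pytest -n 2 --dist loadgroup": ["0m25.270s", "0m28.600s", "0m28.093s"],
    "pytest -n 3 --dist loadgroup": ["0m22.370s", "0m19.252s", "0m22.168s"],
    "pytest -n 4 --dist loadgroup": ["0m20.715s", "0m20.518s", "0m20.832s"],
}


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '0m35.141s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return 60 * int(minutes) + float(seconds)


averages = {
    cmd: mean([to_seconds(s) for s in stamps])
    for cmd, stamps in REAL_TIMES.items()
}
baseline = averages["pytest"]
for cmd, avg in averages.items():
    # Report the mean wall-clock time and the speed-up over plain pytest.
    print(f"{cmd}: {avg:.3f}s ({baseline / avg:.2f}x vs. plain pytest)")
```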

    Going beyond four processes doesn't seem to yield any further improvements (likely because some portions of the test suite involve parallelized operations).

    How

    I added pytest-xdist to the dev extra requirements. I also updated the pytest.ini file with addopts to configure the test suite to run with four processes and run tests with the same group on the same worker (to avoid resource contention). The test_cholesky_factorization test was refactored into two smaller tests which both increase their timeout with respect to input size. Additionally, tests which involved parallelized operations were added to an xdist_group named parallel to ensure they are run on the same worker (to avoid resource contention).

    While the standard GitHub runners only have two cores (the unit-tests.yml workflow has been edited accordingly), local development can benefit from this.


    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by ConnorBaker 3
  • add dependabot

    Issue #, if available:

    Description of changes:

    Add dependabot to keep dependencies up-to-date

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by wesk 3
  • Files seem to be corrupted (hash mismatch)

    Hello all,

    I updated to the most recent library version and noticed an issue in using the benchmarks. I am getting the following: Files seem to be corrupted (hash mismatch), which keeps persisting across multiple code reruns. The issue was fixed in PR #428 initially.

    The hash is consistent between calls; however, it does not match the hardcoded one and differs between operating systems. For example, for PD1:

    Hardcoded one in syne-tune:

    bd5b599179b1c5163d146a26dd2d559e5cb561f491ef48a22503e821651fd4d1

    On Windows (11) I get:

    f693fed481e344267c3a897eb8e629056e93304ee21e6e90955029c9804cdfda

    On Linux (CentOS 7.9) I get:

    ca162d81cacadb1e177ec319e65d68f812140bbc5864b0dceac28bbcca328a70

    I am able to overcome the problem by hardcoding the hash to my local value, but it seems the function that calculates the hash may not be working as intended, unless I am doing something wrong.
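For context, here is a minimal sketch of one common way to compute a directory hash that does not depend on the operating system. It is an assumption offered for illustration, not a diagnosis of this issue and not the hashing code Syne Tune uses; `directory_sha256` is a hypothetical helper. The two usual sources of platform differences it avoids are file traversal order and text-mode line-ending translation.

```python
import hashlib
import tempfile
from pathlib import Path


def directory_sha256(root):
    """Hash all files under root in a platform-independent way (sketch)."""
    digest = hashlib.sha256()
    root = Path(root)
    # Sort by POSIX-style relative path so the order does not depend on the OS.
    files = sorted(
        (p for p in root.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(root).as_posix(),
    )
    for path in files:
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        # Read raw bytes: text mode could translate line endings on Windows.
        digest.update(path.read_bytes())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "b.txt").write_bytes(b"second\n")
    (Path(tmp) / "a.txt").write_bytes(b"first\n")
    first = directory_sha256(tmp)
    second = directory_sha256(tmp)
    print(first == second)  # hashing the same bytes twice gives the same digest
```

If the hashed files themselves are generated differently per platform (for example, by a serialization library whose output varies), no hashing scheme over the raw bytes will agree across systems; the sketch only removes differences introduced by the hashing step itself.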

    bug 
    opened by ArlindKadra 4
  • feat: Add `py.typed` file to package so type annotations are exposed

    Hello all!

    Would you be interested in adding a py.typed file to your package?

    Per PEP-561 (https://peps.python.org/pep-0561/), library authors who want to support type-checking of their code must add a py.typed file to their package and include it as part of the package data so it's redistributed.

    Doing so would allow downstream users of the library to benefit from the inline type annotations you have, freeing them of the need to create and maintain type stubs.

    opened by ConnorBaker 4
Releases(v0.3.3)
  • v0.3.3(Dec 19, 2022)

    [0.3.3] - 2022-12-19

    We release version 0.3.3 which you can install with pip install syne-tune[extra].

    Thanks to all contributors (sorted by chronological commit order): @mseeger, @mina-ghashami, @aaronkl, @jgolebiowski, @Valavanca, @TrellixVulnTeam, @geoalgo, @wistuba, @mlblack

    Added

    • Revamped documentation hosted at https://syne-tune.readthedocs.io
    • New tutorial: Benchmarking in Syne Tune
    • Added section on backends in Basics of Syne Tune tutorial
    • Control of re-creating of blackboxes by checking and storing hash codes
    • New benchmark: Transformer on WikiText-2
    • Support SageMaker managed warm pools in SageMaker backend
    • Improvements for benchmarking with YAHPO blackboxes
    • Support points_to_evaluate in BORE
    • SageMaker backend supports delete_checkpoints=True

    Changed

    • GridSearch supports all domain types now
    • BlackboxSurrogate of blackbox repository supports different modes
    • Add timeout to unit tests
    • New unit tests which run schedulers for longer, using simulator backend

    Fixed

    • HyperbandScheduler: does_pause_resume for all types
    • ASHA with type="promotion" did not work when checkpointing not implemented
    • Fixed preprocessing of PD1 blackbox
    • SageMaker backend reports on true number of busy workers (fixes issue #250)
    • Fix issue with uploading/syncing to S3 of YAHPO blackbox
    • Fix YAHPO surrogate evaluations in the presence of inactive hyperparameters
    • Fix treatment of Status.paused in TuningStatus and Tuner
  • v0.3.2(Oct 14, 2022)

    Added

    • New tutorial: How to Contribute a New Scheduler
    • New tutorial: Multi-Fidelity Hyperparameter Optimization
    • YAHPO benchmarks integrated into blackbox repository
    • PD1 benchmarks integrated into blackbox repository
    • New HPO algorithm: Hyper-Tune
    • New HPO algorithm: Differential Evolution Hyperband (DEHB)
    • New experimental HPO algorithm: Neuralband
    • New HPO algorithm: Grid search (categorical variables only)
    • BOTorch searcher
    • MOBSTER algorithm supports independent GPs at each rung level
    • Support for launching experiments in benchmarking/commons, for local, SageMaker, and simulator back-end
    • New benchmark: Fine-tuning Hugging Face transformers
    • Add IPython util function to display results as parallel categories plot
    • New hyperparameter types ordinal, logordinal
    • Support no checkpointing in BlackboxRepositoryBackend
    • Plateau rule as StoppingCriterion
    • Automate PyPI releases: python-publish.yml
    • Add license hook

    Changed

    • Replace PyTorch MLP by sklearn in BORE (better performance)
    • AWS dependencies moved out of core into aws
    • New dependencies yahpo

    Fixed

    • In SageMaker back-end, trials with low IDs received reports several times. This is fixed
    • Fixing issue with checkpoint_s3_uri usage
    • Fix mode in BOTorch searcher when maximizing
    • Avoid experiment abort due to throttling of SageMaker job launching
    • Surrogate model for lcbench defaults to 1-NN now
    • Fix conditional imports, so Syne Tune can be run with reduced dependencies
    • Fix lcbench blackbox (ignore first and last fidelity)
    • Fix bug in BlackboxSimulatorBackend for pause/resume scheduling (issue #304)
    • Revert wait_trial_completion_when_stopping to False
    • Terminate with error when tuning sees an exception
    • Docker Building Fixed by Adding Line Breaks At End of Requirements Files
    • Control Decision for Running Trials When Stopping Criterion is Met
    • Fix mode MSR and HB+BB
  • v0.3.1(Sep 16, 2022)

Owner
Amazon Web Services - Labs
AWS Labs