XManager: A framework for managing machine learning experiments 🧑‍🔬

XManager is a platform for packaging, running and keeping track of machine learning experiments. It currently enables one to launch experiments locally or on Google Cloud Platform (GCP). Interaction with experiments is done via XManager's APIs through Python launch scripts.

To get started, install the prerequisites and XManager itself, then follow the tutorial to create and run a launch script.

See CONTRIBUTING.md for guidance on contributions.


Prerequisites

The codebase assumes Python 3.7+.

Install Docker

If you use xmanager.xm.PythonContainer to run XManager experiments, you need to install Docker.

  1. Follow the steps to install Docker.

  2. If you are a Linux user, follow the steps to enable sudoless Docker.

Install Bazel

If you use xmanager.xm.BazelContainer or xmanager.xm.BazelBinary to run XManager experiments, you need to install Bazel.

  1. Follow the steps to install Bazel.

Create a GCP project

If you use xm_local.Caip (Cloud AI Platform) to run XManager experiments, you need a GCP project to access CAIP and run jobs.

  1. Create a GCP project.

  2. Install gcloud.

  3. Associate your Google Account (Gmail account) with your GCP project by running:

    gcloud auth login
    gcloud auth application-default login
    gcloud config set project $GCP_PROJECT

  4. Set up gcloud to work with Docker by running:

    gcloud auth configure-docker

  5. Enable Google Cloud Platform APIs.

  6. Create a staging bucket in us-central1 if you do not already have one. This bucket should be used to save experiment artifacts like TensorFlow log files, which can be read by TensorBoard. This bucket may also be used to stage files to build your Docker image if you build your images remotely.

    gsutil mb -l us-central1 gs://$GOOGLE_CLOUD_BUCKET_NAME

    Add GOOGLE_CLOUD_BUCKET_NAME to the environment variables or your .bashrc:

    export GOOGLE_CLOUD_BUCKET_NAME=<your-bucket-name>
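
A launch script can check early that the bucket variable is set rather than failing later during staging. The following is a minimal sketch, not part of XManager: the helper name staging_bucket_uri and the choice to accept either a bare bucket name or a full gs:// URI are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional


def staging_bucket_uri(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the gs:// URI of the staging bucket named in the environment.

    `staging_bucket_uri` is a hypothetical helper, not an XManager API.
    """
    env = os.environ if environ is None else environ
    name = env.get('GOOGLE_CLOUD_BUCKET_NAME', '').strip()
    if not name:
        raise RuntimeError(
            'GOOGLE_CLOUD_BUCKET_NAME is not set; export it before launching.')
    # Accept either a bare bucket name or a full gs:// URI.
    return name if name.startswith('gs://') else f'gs://{name}'


# Passing a dict keeps the example independent of the real environment.
print(staging_bucket_uri({'GOOGLE_CLOUD_BUCKET_NAME': 'my-bucket'}))
```

Called without arguments, the helper reads the real process environment.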


Install XManager

pip install git+https://github.com/deepmind/xmanager.git

Alternatively, install the PyPI package:

pip install xmanager

Writing XManager launch scripts

A snippet for the impatient 🙂
# Contains core primitives and APIs.
from xmanager import xm
# Implementation of those core concepts for what we call 'the local backend',
# which means all executables are sent for execution from this machine,
# independently of whether they are actually executed on our machine or on GCP.
from xmanager import xm_local

# Creates an experiment context and saves its metadata to the database, which we
# can reuse later via `xm_local.list_experiments`, for example. Note that
# `experiment` has tracking properties such as `id`.
with xm_local.create_experiment(experiment_title='cifar10') as experiment:
  # Packaging prepares a given *executable spec* for running with a concrete
  # *executor spec*: depending on the combination, that may involve building
  # steps and / or copying the results somewhere. For example, a
  # `xm.python_container` designed to run on `Kubernetes` will be built via
  # `docker build`, and the new image will be uploaded to the container registry.
  # But for our simple case where we have a prebuilt Linux binary designed to
  # run locally only some validations are performed -- for example, that the
  # file exists.
  # `executable` contains all the necessary information needed to launch the
  # packaged blob via `.add`, see below.
  [executable] = experiment.package([
      xm.binary(
          # What we are going to run.
          path='/home/user/project/a.out',
          # Where we are going to run it.
          executor_spec=xm_local.Local.Spec(),
      )
  ])

  # Let's find out which `batch_size` is best -- presumably our jobs write the
  # results somewhere.
  for batch_size in [64, 1024]:
    # `add` creates a new *experiment unit*, which is usually a collection of
    # semantically united jobs, and sends them for execution. To pass an actual
    # collection one may want to use `JobGroup`s (more about it later in the
    # documentation), but for our purposes we are going to pass just one job.
    experiment.add(xm.Job(
        # The `a.out` we packaged earlier.
        executable=executable,
        # We are using the default settings here, but executors have plenty of
        # arguments available to control execution.
        executor=xm_local.Local(),
        # Time to pass the batch size as a command-line argument!
        args={'batch_size': batch_size},
        # We can also pass environment variables.
        env_vars={'HEAPPROFILE': '/tmp/a_out.hprof'},
    ))

  # The context will wait for locally run things (but not for remote things such
  # as jobs sent to GCP, although they can be explicitly awaited via
  # `wait_for_completion`).

The basic structure of an XManager launch script can be summarized by these steps:

  1. Create an experiment and acquire its context.

    from xmanager import xm
    from xmanager import xm_local
    with xm_local.create_experiment(experiment_title='cifar10') as experiment:

  2. Define specifications of executables you want to run.

    spec = xm.PythonContainer(
        path='/path/to/python/folder',
        entrypoint=xm.ModuleName('cifar10'),
    )

  3. Package your executables.

    from xmanager import xm_local

    [executable] = experiment.package([
        xm.Packageable(
            executable_spec=spec,
            executor_spec=xm_local.Caip.Spec(),
        ),
    ])

  4. Define your hyperparameters.

    import itertools

    batch_sizes = [64, 1024]
    learning_rates = [0.1, 0.001]
    trials = list(
        dict([('batch_size', bs), ('learning_rate', lr)])
        for (bs, lr) in itertools.product(batch_sizes, learning_rates)
    )

  5. Define resource requirements for each job.

    requirements = xm.JobRequirements(T4=1)

  6. For each trial, add a job / job groups to launch them.

    for hyperparameters in trials:
      experiment.add(xm.Job(
          executable=executable,
          executor=xm_local.Caip(requirements=requirements),
          args=hyperparameters,
      ))
Now we should be ready to run the launch script.
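
Before launching, it can help to inspect the hyperparameter grid on its own. The sketch below uses only the standard library and the values from step 4; it shows that two batch sizes and two learning rates expand into four trials, one per job.

```python
import itertools

batch_sizes = [64, 1024]
learning_rates = [0.1, 0.001]

# One dictionary per combination; each would become the `args` of one job.
trials = [
    {'batch_size': bs, 'learning_rate': lr}
    for bs, lr in itertools.product(batch_sizes, learning_rates)
]

for trial in trials:
    print(trial)
# {'batch_size': 64, 'learning_rate': 0.1}
# {'batch_size': 64, 'learning_rate': 0.001}
# {'batch_size': 1024, 'learning_rate': 0.1}
# {'batch_size': 1024, 'learning_rate': 0.001}
```

itertools.product varies the last list fastest, so all learning rates for the first batch size come before the second batch size.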

To learn more about different executables and executors, see 'Components'.

Run XManager

xmanager launch ./xmanager/examples/cifar10_tensorflow/launcher.py

In order to run multi-job experiments, the --xm_wrap_late_bindings flag might be required:

xmanager launch ./xmanager/examples/cifar10_tensorflow/launcher.py -- --xm_wrap_late_bindings


Components

Executable specifications

XManager executable specifications define what should be packaged in the form of binaries, source files, and other input dependencies required for job execution. Executable specifications are reusable and generally platform-independent.


Container

Container defines a pre-built Docker image located at a URL (or locally).


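xm.container is shorthand for constructing a packageable.

assert xm.container(..., executor_spec=executor_spec, args=args, env_vars=env_vars) == xm.Packageable(
    executable_spec=xm.Container(...), executor_spec=executor_spec, args=args, env_vars=env_vars,
)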


BazelBinary

BazelBinary defines a Bazel binary target identified by a label.


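xm.bazel_binary is shorthand for constructing a packageable.

assert xm.bazel_binary(..., executor_spec=executor_spec, args=args, env_vars=env_vars) == xm.Packageable(
    executable_spec=xm.BazelBinary(...), executor_spec=executor_spec, args=args, env_vars=env_vars,
)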


PythonContainer

PythonContainer defines a Python project that is packaged into a Docker container.

xm.PythonContainer(
    entrypoint: xm.ModuleName('<module name>'),

    # Optionals.
    path: '/path/to/python/project/',  # Defaults to the current directory of the launch script.
    base_image: '<image>[:<tag>]',
    docker_instructions: ['RUN ...', 'COPY ...', ...],
)

The simplest form of PythonContainer just launches a Python module with the default docker_instructions.


That specification produces a Docker image that runs the following command:

python3 -m cifar10 fixed_arg1 fixed_arg2

An advanced form of PythonContainer allows you to override the entrypoint command as well as the Docker instructions.

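xm.PythonContainer(
    entrypoint=xm.CommandList(['./pre_process.sh', 'python3 -m cifar10 $@', './post_process.sh']),
    docker_instructions=[
      'COPY pre_process.sh pre_process.sh',
      'RUN chmod +x ./pre_process.sh',
      'COPY cifar10.py cifar10.py',
      'COPY post_process.sh post_process.sh',
      'RUN chmod +x ./post_process.sh',
    ],
)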

That specification produces a Docker image that runs the following commands:

./pre_process.sh
python3 -m cifar10 fixed_arg1 fixed_arg2
./post_process.sh

IMPORTANT: Note the use of $@, which forwards the command-line arguments to the command. Without it, your entrypoint ignores all command-line arguments.
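
The effect of $@ can be seen outside Docker with a plain POSIX shell. The sketch below is an illustration only: the helper run_entrypoint is hypothetical, it assumes sh is on the PATH, and it uses echo in place of the real command so that nothing needs to be installed.

```python
import subprocess


def run_entrypoint(script: str, *args: str) -> str:
    """Run `script` with POSIX sh, passing `args` as its positional parameters."""
    # The string right after the script becomes $0; the rest become $1, $2, ...
    result = subprocess.run(
        ['sh', '-c', script, 'entrypoint', *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# With "$@" the extra argument reaches the command.
with_args = run_entrypoint('echo python3 -m cifar10 "$@"', '--batch_size=64')
# Without it the argument is silently dropped.
without_args = run_entrypoint('echo python3 -m cifar10', '--batch_size=64')

print(with_args)     # python3 -m cifar10 --batch_size=64
print(without_args)  # python3 -m cifar10
```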

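xm.python_container is shorthand for constructing a packageable.

assert xm.python_container(..., executor_spec=executor_spec, args=args, env_vars=env_vars) == xm.Packageable(
    executable_spec=xm.PythonContainer(...), executor_spec=executor_spec, args=args, env_vars=env_vars,
)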


Executors

XManager executors define a platform where the job runs and resource requirements for the job.

Each executor also has a specification which describes how an executable specification should be prepared and packaged.

Cloud AI Platform (CAIP)

The Caip executor declares that an executable will be run on the CAIP platform.

The Caip executor takes in a resource requirements object.

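xm_local.Caip(requirements=xm.JobRequirements(
    cpu=1,  # Measured in vCPUs.
    ram=4 * xm.GiB,
    T4=1,  # NVIDIA Tesla T4.
))

xm_local.Caip(requirements=xm.JobRequirements(
    cpu=1,  # Measured in vCPUs.
    ram=4 * xm.GiB,
    TPU_V2=8,  # TPU v2.
))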

As of June 2021, the currently supported accelerator types are:

  • P100
  • V100
  • P4
  • T4
  • A100
  • TPU_V2
  • TPU_V3

IMPORTANT: Note that for TPU_V2 and TPU_V3 the only currently supported count is 8.
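
These constraints are easy to check in a launch script before any job is submitted. The following is a minimal sketch under stated assumptions: check_accelerator is a hypothetical helper rather than an XManager function, and the supported names are copied from the June 2021 list above, so they may be out of date.

```python
# Accelerator names taken from the list above (as of June 2021).
SUPPORTED_ACCELERATORS = ('P100', 'V100', 'P4', 'T4', 'A100', 'TPU_V2', 'TPU_V3')
TPU_TYPES = ('TPU_V2', 'TPU_V3')


def check_accelerator(name: str, count: int) -> None:
    """Raise ValueError if the accelerator request violates the documented limits."""
    if name not in SUPPORTED_ACCELERATORS:
        raise ValueError(f'Unsupported accelerator type: {name}')
    if count < 1:
        raise ValueError(f'Accelerator count must be positive, got {count}')
    if name in TPU_TYPES and count != 8:
        raise ValueError(f'{name} only supports a count of 8, got {count}')


check_accelerator('T4', 1)      # Passes silently.
check_accelerator('TPU_V2', 8)  # Passes silently.

try:
    check_accelerator('TPU_V3', 4)
except ValueError as error:
    print(error)  # TPU_V3 only supports a count of 8, got 4
```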

Caip Specification

The CAIP executor specification allows you to specify a remote image repository to push to.



Local

The local executor declares that an executable will be run on the same machine from which the launch script is invoked.

Kubernetes (experimental)

The Kubernetes executor declares that an executable will be run on a Kubernetes cluster. As of October 2021, Kubernetes is not fully supported.

The Kubernetes executor pulls from your local kubeconfig. The XManager command-line has helpers to set up a Google Kubernetes Engine (GKE) cluster.

pip install caliban==0.4.1
xmanager cluster create

# cleanup
xmanager cluster delete

You can store the GKE credentials in your kubeconfig:

gcloud container clusters get-credentials <cluster-name>

Kubernetes Specification

The Kubernetes executor specification allows you to specify a remote image repository to push to.


Job / JobGroup

A Job represents a single executable on a particular executor, while a JobGroup unites a group of Jobs and provides gang scheduling: the Jobs inside a JobGroup are scheduled / descheduled simultaneously. The same Job and JobGroup instances can be added multiple times.


Job

A Job accepts an executable and an executor along with hyperparameters which can either be command-line arguments or environment variables.

Command-line arguments can be passed in list form, [arg1, arg2, arg3]:

binary arg1 arg2 arg3

They can also be passed in dictionary form, {key1: value1, key2: value2}:

binary --key1=value1 --key2=value2

Environment variables are always passed in Dict[str, str] form:

export KEY=VALUE
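
The mapping from these Python values to a command line can be sketched as follows. This is an illustration of the conventions described above, not XManager's own implementation; render_args and render_env_vars are hypothetical helper names.

```python
import shlex
from collections.abc import Mapping
from typing import Dict, List, Sequence, Union

Args = Union[Sequence[str], Dict[str, object]]


def render_args(args: Args) -> List[str]:
    """Turn list-form or dictionary-form arguments into command-line tokens."""
    if isinstance(args, Mapping):
        # Dictionary form: each item becomes a --key=value flag.
        return [f'--{key}={value}' for key, value in args.items()]
    # List form: arguments are passed through positionally.
    return [str(arg) for arg in args]


def render_env_vars(env_vars: Dict[str, str]) -> List[str]:
    """Turn environment variables into shell `export` statements."""
    return [f'export {key}={shlex.quote(value)}' for key, value in env_vars.items()]


print(' '.join(['binary'] + render_args(['arg1', 'arg2', 'arg3'])))
# binary arg1 arg2 arg3
print(' '.join(['binary'] + render_args({'key1': 'value1', 'key2': 'value2'})))
# binary --key1=value1 --key2=value2
print(render_env_vars({'KEY': 'VALUE'}))
# ['export KEY=VALUE']
```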

Jobs are defined like this:

[executable] = experiment.package([...])

executor = xm_local.Caip(...)

xm.Job(
    executable=executable,
    executor=executor,
    args={'batch_size': 64},
    env_vars={'NCCL_DEBUG': 'INFO'},
)


JobGroup

A JobGroup accepts jobs as keyword arguments. The keyword can be any valid Python identifier. For example, you can call your jobs 'agent' and 'observer'.

agent_job = xm.Job(...)
observer_job = xm.Job(...)

xm.JobGroup(agent=agent_job, observer=observer_job)
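
The keyword-argument convention can be illustrated without XManager. In this sketch, group_jobs is a hypothetical stand-in for a constructor that accepts jobs as keyword arguments, and the string values stand in for xm.Job instances.

```python
from typing import Any, Dict


def group_jobs(**jobs: Any) -> Dict[str, Any]:
    """Collect named jobs the way a kwargs-based constructor receives them."""
    return dict(jobs)


# Hypothetical stand-ins for xm.Job instances.
group = group_jobs(agent='agent_job', observer='observer_job')
print(sorted(group))  # ['agent', 'observer']

# Only valid identifiers can be written as keywords in a call.
print('agent'.isidentifier())     # True
print('my-agent'.isidentifier())  # False
```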
  • Can't change the number of vCPUs

    I'm trying to launch a job that requires multiple CPU cores to run faster. To do that, I create the executor as follows:


    Setting vcpu_count to 1, 8, 32 or 64 doesn't change the number of vCPUs actually allocated to the task. I check the number of CPUs by running

    import multiprocessing
    multiprocessing.cpu_count()

    and by running cat /proc/cpuinfo | grep processor | wc -l in the debug terminal of the job. In all cases, both commands return 4 regardless of the requested value.


    • The job launches and executes to completion, although very slowly.
    • During build (after the image is pushed to the container registry) I get this warning message
    W0510 14:00:15.198342 140373868750400 http.py:139] Encountered 403 Forbidden with reason "PERMISSION_DENIED"

    Followed immediately by

    I0510 14:00:15.200866 140373858600512 base.py:80] Creating CustomJob
    • The launched jobs don't show up under the Training Pipelines tab but rather the Custom Jobs tab in Vertex AI -> Training
    opened by AbubakrHassan 7
  • Tensorboard instance is not found when running examples/cifar10_tensorflow

    When running examples/cifar10_tensorflow, the job launches fine and trains to completion. However, the TensorBoard link that is created shows a page that says:

    Not found: TensorboardExperiment projects/****/locations/us-central1/tensorboards/2824407877244944384/experiments/7194241469736026112 is not found.

    Logs from building the job

    I0331 10:24:09.242159 139812284593984 docker_lib.py:67] Local docker: {'Platform': {'Name': 'Docker Engine - Community'}, 'Components': [{'Name': 'Engine', 'Version': '20.10.2', 'Details': {'ApiVersion': '1.41', 'Arch': 'amd64', 'BuildTime': '2020-12-28T16:15:28.000000000+00:00', 'Experimental': 'false', 'GitCommit': '8891c58', 'GoVersion': 'go1.13.15', 'KernelVersion': '5.15.15-1rodete2-amd64', 'MinAPIVersion': '1.12', 'Os': 'linux'}}, {'Name': 'containerd', 'Version': '1.4.3', 'Details': {'GitCommit': '269548fa27e0089a8b8278fc4fc781d7f65a939b'}}, {'Name': 'runc', 'Version': '1.0.0-rc92', 'Details': {'GitCommit': 'ff819c7e9184c13b7c2607fe6c30ae19403a7aff'}}, {'Name': 'docker-init', 'Version': '0.19.0', 'Details': {'GitCommit': 'de40ad0'}}], 'Version': '20.10.2', 'ApiVersion': '1.41', 'MinAPIVersion': '1.12', 'GitCommit': '8891c58', 'GoVersion': 'go1.13.15', 'Os': 'linux', 'Arch': 'amd64', 'KernelVersion': '5.15.15-1rodete2-amd64', 'BuildTime': '2020-12-28T16:15:28.000000000+00:00'}
    FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-6
    RUN if ! id 1000; then useradd -m -u 1000 clouduser; fi
    RUN apt-get update && apt-get install -y git netcat
    RUN python -m pip install --upgrade pip
    COPY cifar10_tensorflow/requirements.txt /cifar10_tensorflow/requirements.txt
    RUN python -m pip install -r cifar10_tensorflow/requirements.txt
    COPY cifar10_tensorflow/ /cifar10_tensorflow
    RUN chown -R 1000:root /cifar10_tensorflow && chmod -R 775 /cifar10_tensorflow
    WORKDIR cifar10_tensorflow
    COPY entrypoint.sh ./entrypoint.sh
    RUN chown -R 1000:root ./entrypoint.sh && chmod -R 775 ./entrypoint.sh
    ENTRYPOINT ["./entrypoint.sh"]
    Size of Docker input: 7.1 kB
    Building Docker image, please wait...
    I0331 10:24:10.163763 139812284593984 docker_lib.py:67] Local docker: {'Platform': {'Name': 'Docker Engine - Community'}, 'Components': [{'Name': 'Engine', 'Version': '20.10.2', 'Details': {'ApiVersion': '1.41', 'Arch': 'amd64', 'BuildTime': '2020-12-28T16:15:28.000000000+00:00', 'Experimental': 'false', 'GitCommit': '8891c58', 'GoVersion': 'go1.13.15', 'KernelVersion': '5.15.15-1rodete2-amd64', 'MinAPIVersion': '1.12', 'Os': 'linux'}}, {'Name': 'containerd', 'Version': '1.4.3', 'Details': {'GitCommit': '269548fa27e0089a8b8278fc4fc781d7f65a939b'}}, {'Name': 'runc', 'Version': '1.0.0-rc92', 'Details': {'GitCommit': 'ff819c7e9184c13b7c2607fe6c30ae19403a7aff'}}, {'Name': 'docker-init', 'Version': '0.19.0', 'Details': {'GitCommit': 'de40ad0'}}], 'Version': '20.10.2', 'ApiVersion': '1.41', 'MinAPIVersion': '1.12', 'GitCommit': '8891c58', 'GoVersion': 'go1.13.15', 'Os': 'linux', 'Arch': 'amd64', 'KernelVersion': '5.15.15-1rodete2-amd64', 'BuildTime': '2020-12-28T16:15:28.000000000+00:00'}
    I0331 10:24:10.164260 139812284593984 docker_lib.py:89] Building Docker image
    [+] Building 55.8s (16/16) FINISHED                                                                                                                                          
     => [internal] load build definition from Dockerfile                                                                                                                    0.2s
     => => transferring dockerfile: 694B                                                                                                                                    0.0s
     => [internal] load .dockerignore                                                                                                                                       0.2s
     => => transferring context: 2B                                                                                                                                         0.0s
     => [internal] load metadata for gcr.io/deeplearning-platform-release/tf2-gpu.2-6:latest                                                                                0.7s
     => [ 1/11] FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-6@sha256:d9bf7c2069ff4bec9d9fc6d30fb286f1646124d04012d9932ee59d58eaca9ac4                          0.0s
     => [internal] load build context                                                                                                                                       0.1s
     => => transferring context: 8.03kB                                                                                                                                     0.0s
     => CACHED [ 2/11] RUN if ! id 1000; then useradd -m -u 1000 clouduser; fi                                                                                              0.0s
     => [ 3/11] RUN apt-get update && apt-get install -y git netcat                                                                                                        15.9s
     => [ 4/11] RUN python -m pip install --upgrade pip                                                                                                                    16.5s
     => [ 5/11] COPY cifar10_tensorflow/requirements.txt /cifar10_tensorflow/requirements.txt                                                                               0.5s
     => [ 6/11] RUN python -m pip install -r cifar10_tensorflow/requirements.txt                                                                                           17.7s
     => [ 7/11] COPY cifar10_tensorflow/ /cifar10_tensorflow                                                                                                                0.5s
     => [ 8/11] RUN chown -R 1000:root /cifar10_tensorflow && chmod -R 775 /cifar10_tensorflow                                                                              1.0s
     => [ 9/11] WORKDIR cifar10_tensorflow                                                                                                                                  0.3s
     => [10/11] COPY entrypoint.sh ./entrypoint.sh                                                                                                                          0.2s
     => [11/11] RUN chown -R 1000:root ./entrypoint.sh && chmod -R 775 ./entrypoint.sh                                                                                      0.7s
     => exporting to image                                                                                                                                                  1.4s
     => => exporting layers                                                                                                                                                 1.0s
     => => writing image sha256:1fb33a18a65d7efd4fcec00ef688ec2ac5502851be5d36bcc9a7b5cf342da775                                                                            0.0s
     => => naming to gcr.io/***/cifar10_tensorflow:20220331-102410-116512                                                                                  0.0s
     => => naming to gcr.io/***/cifar10_tensorflow:latest                                                                                                  0.0s
    I0331 10:25:06.734303 139812284593984 docker_lib.py:98] Building docker image: Done
    I0331 10:25:06.775659 139812284593984 docker_lib.py:67] Local docker: {'Platform': {'Name': 'Docker Engine - Community'}, 'Components': [{'Name': 'Engine', 'Version': '20.10.2', 'Details': {'ApiVersion': '1.41', 'Arch': 'amd64', 'BuildTime': '2020-12-28T16:15:28.000000000+00:00', 'Experimental': 'false', 'GitCommit': '8891c58', 'GoVersion': 'go1.13.15', 'KernelVersion': '5.15.15-1rodete2-amd64', 'MinAPIVersion': '1.12', 'Os': 'linux'}}, {'Name': 'containerd', 'Version': '1.4.3', 'Details': {'GitCommit': '269548fa27e0089a8b8278fc4fc781d7f65a939b'}}, {'Name': 'runc', 'Version': '1.0.0-rc92', 'Details': {'GitCommit': 'ff819c7e9184c13b7c2607fe6c30ae19403a7aff'}}, {'Name': 'docker-init', 'Version': '0.19.0', 'Details': {'GitCommit': 'de40ad0'}}], 'Version': '20.10.2', 'ApiVersion': '1.41', 'MinAPIVersion': '1.12', 'GitCommit': '8891c58', 'GoVersion': 'go1.13.15', 'Os': 'linux', 'Arch': 'amd64', 'KernelVersion': '5.15.15-1rodete2-amd64', 'BuildTime': '2020-12-28T16:15:28.000000000+00:00'}
    I0331 10:25:20.892401 139812284593984 docker_lib.py:107] {"status":"The push refers to repository [gcr.io/***/cifar10_tensorflow]"}
    {"status":"Pushing","progressDetail":{"current":512,"total":528},"progress":"[================================================\u003e  ]     512B/528B","id":"ecbb601dd983"}
    {"status":"Pushing","progressDetail":{"current":512,"total":528},"progress":"[================================================\u003e  ]     512B/528B","id":"0490c7aeabf0"}
    {"status":"Pushing","progressDetail":{"current":512,"total":7094},"progress":"[===\u003e                                               ]     512B/7.094kB","id":"ddfab15718d9"}
    {"status":"Pushing","progressDetail":{"current":512,"total":7094},"progress":"[===\u003e                                               ]     512B/7.094kB","id":"57c7da5da29e"}
    {"status":"Pushing","progressDetail":{"current":11776,"total":7094},"progress":"[==================================================\u003e]  11.78kB","id":"ddfab15718d9"}
    {"status":"Pushing","progressDetail":{"current":3072,"total":528},"progress":"[==================================================\u003e]  3.072kB","id":"0490c7aeabf0"}
    {"status":"Pushing","progressDetail":{"current":11776,"total":7094},"progress":"[==================================================\u003e]  11.78kB","id":"57c7da5da29e"}
    {"status":"Layer already exists","progressDetail":{},"id":"5f70bf18a086"}
    {"status":"Pushing","progressDetail":{"current":31984,"total":2986836},"progress":"[\u003e                                                  ]  31.98kB/2.987MB","id":"a43c37333595"}
    {"status":"Pushing","progressDetail":{"current":3072,"total":528},"progress":"[==================================================\u003e]  3.072kB","id":"ecbb601dd983"}
    {"status":"Pushing","progressDetail":{"current":1428239,"total":2986836},"progress":"[=======================\u003e                           ]  1.428MB/2.987MB","id":"a43c37333595"}
    {"status":"Pushing","progressDetail":{"current":2741798,"total":2986836},"progress":"[=============================================\u003e     ]  2.742MB/2.987MB","id":"a43c37333595"}
    {"status":"Pushing","progressDetail":{"current":3273728,"total":2986836},"progress":"[==================================================\u003e]  3.274MB","id":"a43c37333595"}
    {"status":"Pushing","progressDetail":{"current":512,"total":39},"progress":"[==================================================\u003e]     512B","id":"479d29ce9800"}
    {"status":"Pushing","progressDetail":{"current":2560,"total":39},"progress":"[==================================================\u003e]   2.56kB","id":"479d29ce9800"}
    {"status":"Pushing","progressDetail":{"current":512,"total":19820},"progress":"[=\u003e                                                 ]     512B/19.82kB","id":"f4bfb05d8c99"}
    {"status":"Pushing","progressDetail":{"current":28160,"total":19820},"progress":"[==================================================\u003e]  28.16kB","id":"f4bfb05d8c99"}
    {"status":"Pushing","progressDetail":{"current":413696,"total":38721310},"progress":"[\u003e                                                  ]  413.7kB/38.72MB","id":"f634932f0fdf"}
    {"status":"Pushing","progressDetail":{"current":2009600,"total":38721310},"progress":"[==\u003e                                                ]   2.01MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Pushing","progressDetail":{"current":3582464,"total":38721310},"progress":"[====\u003e                                              ]  3.582MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Pushing","progressDetail":{"current":5180416,"total":38721310},"progress":"[======\u003e                                            ]   5.18MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Pushing","progressDetail":{"current":6753280,"total":38721310},"progress":"[========\u003e                                          ]  6.753MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Pushing","progressDetail":{"current":8331264,"total":38721310},"progress":"[==========\u003e                                        ]  8.331MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Pushing","progressDetail":{"current":10291712,"total":38721310},"progress":"[=============\u003e                                     ]  10.29MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Pushing","progressDetail":{"current":12274176,"total":38721310},"progress":"[===============\u003e                                   ]  12.27MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Pushing","progressDetail":{"current":14240256,"total":38721310},"progress":"[==================\u003e                                ]  14.24MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Mounted from deeplearning-platform-release/tf2-gpu.2-6","progressDetail":{},"id":"e5a69fe43a97"}
    {"status":"Pushing","progressDetail":{"current":16206336,"total":38721310},"progress":"[====================\u003e                              ]  16.21MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Pushing","progressDetail":{"current":17779200,"total":38721310},"progress":"[======================\u003e                            ]  17.78MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Pushing","progressDetail":{"current":19745280,"total":38721310},"progress":"[=========================\u003e                         ]  19.75MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Mounted from deeplearning-platform-release/tf2-gpu.2-6","progressDetail":{},"id":"ed55b6190435"}
    {"status":"Pushing","progressDetail":{"current":21318144,"total":38721310},"progress":"[===========================\u003e                       ]  21.32MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Pushing","progressDetail":{"current":23284224,"total":38721310},"progress":"[==============================\u003e                    ]  23.28MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Pushing","progressDetail":{"current":25250304,"total":38721310},"progress":"[================================\u003e                  ]  25.25MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Pushing","progressDetail":{"current":27216384,"total":38721310},"progress":"[===================================\u003e               ]  27.22MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Pushing","progressDetail":{"current":29182464,"total":38721310},"progress":"[=====================================\u003e             ]  29.18MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Pushing","progressDetail":{"current":30768640,"total":38721310},"progress":"[=======================================\u003e           ]  30.77MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Pushing","progressDetail":{"current":32359424,"total":38721310},"progress":"[=========================================\u003e         ]  32.36MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Mounted from deeplearning-platform-release/tf2-gpu.2-6","progressDetail":{},"id":"87ec19f85372"}
    {"status":"Pushing","progressDetail":{"current":33952256,"total":38721310},"progress":"[===========================================\u003e       ]  33.95MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Pushing","progressDetail":{"current":35525120,"total":38721310},"progress":"[=============================================\u003e     ]  35.53MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Pushing","progressDetail":{"current":37110272,"total":38721310},"progress":"[===============================================\u003e   ]  37.11MB/38.72MB","id":"f634932f0fdf"}
    {"status":"Pushing","progressDetail":{"current":38809600,"total":38721310},"progress":"[==================================================\u003e]  38.81MB","id":"f634932f0fdf"}
    {"status":"Mounted from deeplearning-platform-release/tf2-gpu.2-6","progressDetail":{},"id":"8c3b041fd87c"}
    {"status":"Mounted from deeplearning-platform-release/tf2-gpu.2-6","progressDetail":{},"id":"0ac428b7127a"}
    {"status":"Layer already exists","progressDetail":{},"id":"b3ab95a574c8"}
    {"status":"Layer already exists","progressDetail":{},"id":"d1b010151b48"}
    {"status":"Mounted from deeplearning-platform-release/tf2-gpu.2-6","progressDetail":{},"id":"370688903f01"}
    {"status":"Layer already exists","progressDetail":{},"id":"b80bc089358e"}
    {"status":"Layer already exists","progressDetail":{},"id":"11bc9b36546a"}
    {"status":"Layer already exists","progressDetail":{},"id":"fffe44800c74"}
    {"status":"Mounted from deeplearning-platform-release/tf2-gpu.2-6","progressDetail":{},"id":"76d62c4c37cc"}
    {"status":"Layer already exists","progressDetail":{},"id":"1175e7a0a8e0"}
    {"status":"Layer already exists","progressDetail":{},"id":"992f2c95dad2"}
    {"status":"Layer already exists","progressDetail":{},"id":"91b2ad1e9845"}
    {"status":"Layer already exists","progressDetail":{},"id":"178f9673d3c0"}
    {"status":"Layer already exists","progressDetail":{},"id":"3298591378da"}
    {"status":"Layer already exists","progressDetail":{},"id":"963f45082214"}
    {"status":"Layer already exists","progressDetail":{},"id":"59edb8a95299"}
    {"status":"Layer already exists","progressDetail":{},"id":"b79b505a5328"}
    {"status":"Layer already exists","progressDetail":{},"id":"6083edd74f0c"}
    {"status":"Layer already exists","progressDetail":{},"id":"4236d5cafaa0"}
    {"status":"Layer already exists","progressDetail":{},"id":"da29c29e84ca"}
    {"status":"Layer already exists","progressDetail":{},"id":"924dcf5e7282"}
    {"status":"Layer already exists","progressDetail":{},"id":"1526a09df7d6"}
    {"status":"Layer already exists","progressDetail":{},"id":"f35a9ab279de"}
    {"status":"Layer already exists","progressDetail":{},"id":"6cd83fbc36a4"}
    {"status":"Layer already exists","progressDetail":{},"id":"a7a59823f7fd"}
    {"status":"Layer already exists","progressDetail":{},"id":"a86b3e862105"}
    {"status":"Layer already exists","progressDetail":{},"id":"9ad794ce6bea"}
    {"status":"Layer already exists","progressDetail":{},"id":"9f54eef41275"}
    {"status":"Layer already exists","progressDetail":{},"id":"d533033842c0"}
    {"status":"20220331-102410-116512: digest: sha256:c059dbb502a1b915aef8b10e0e1dd4e9a241d23adb19431ffad552a4edfeb3b9 size: 9127"}
    Your image URI is: gcr.io/***/cifar10_tensorflow:20220331-102410-116512
    E0331 10:25:27.511640750 3964344 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
    E0331 10:25:29.811606183 3964345 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
    E0331 10:25:30.695179886 3964345 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
    E0331 10:25:31.677614770 3964345 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
    E0331 10:25:32.550932107 3964345 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
    E0331 10:25:34.109284446 3964345 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
    E0331 10:25:35.012729765 3964345 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
    W0331 10:25:36.525629 139812189455936 http.py:139] Encountered 403 Forbidden with reason "PERMISSION_DENIED"
    I0331 10:25:36.528207 139812155360832 base.py:80] Creating CustomJob
    I0331 10:25:37.428091 139812155360832 base.py:127] CustomJob created. Resource name: projects/****/locations/us-central1/customJobs/1290022358253305856
    I0331 10:25:37.428335 139812155360832 base.py:128] To use this CustomJob in another session:
    I0331 10:25:37.428413 139812155360832 base.py:129] custom_job = aiplatform.CustomJob.get('projects/***/locations/us-central1/customJobs/1290022358253305856')
    I0331 10:25:37.429027 139812155360832 jobs.py:1412] View Custom Job:
    I0331 10:25:37.429559 139812155360832 jobs.py:1415] View Tensorboard:
    Job launched at: https://console.cloud.google.com/ai/platform/locations/us-central1/training/1290022358253305856?project=***
    E0331 10:25:37.646365894 3964345 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
    E0331 10:25:38.566072246 3964345 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
    E0331 10:25:39.474203776 3964345 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
    E0331 10:25:40.424329133 3964345 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
    E0331 10:25:41.956427682 3964345 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
    E0331 10:25:42.839551026 3964345 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
    I0331 10:25:43.589789 139812155360832 jobs.py:178] CustomJob projects/***/locations/us-central1/customJobs/1290022358253305856 current state:
    W0331 10:25:44.369634 139812189455936 http.py:139] Encountered 403 Forbidden with reason "PERMISSION_DENIED"
    I0331 10:25:44.371166 139812138391104 base.py:80] Creating CustomJob
    I0331 10:25:45.369434 139812138391104 base.py:127] CustomJob created. Resource name: projects/***/locations/us-central1/customJobs/7194241469736026112
    I0331 10:25:45.369657 139812138391104 base.py:128] To use this CustomJob in another session:
    I0331 10:25:45.369731 139812138391104 base.py:129] custom_job = aiplatform.CustomJob.get('projects/***/locations/us-central1/customJobs/7194241469736026112')
    I0331 10:25:45.369875 139812138391104 jobs.py:1412] View Custom Job:
    I0331 10:25:45.370020 139812138391104 jobs.py:1415] View Tensorboard:
    Job launched at: https://console.cloud.google.com/ai/platform/locations/us-central1/training/7194241469736026112?project=***
    Waiting for local jobs to complete. Press Ctrl+C to terminate them and exit
    I0331 10:25:51.482136 139812138391104 jobs.py:178] CustomJob projects/***/locations/us-central1/customJobs/7194241469736026112 current state:
    I0331 10:25:54.744579 139812155360832 jobs.py:178] CustomJob projects/***/locations/us-central1/customJobs/1290022358253305856 current state:
    I0331 10:26:02.614101 139812138391104 jobs.py:178] CustomJob projects/***/locations/us-central1/customJobs/7194241469736026112 current state:
    I0331 10:26:17.016368 139812155360832 jobs.py:178] CustomJob projects/***/locations/us-central1/customJobs/1290022358253305856 current state:
    I0331 10:26:24.918474 139812138391104 jobs.py:178] CustomJob projects/***/locations/us-central1/customJobs/7194241469736026112 current state:
    I0331 10:27:01.445312 139812155360832 jobs.py:178] CustomJob projects/***/locations/us-central1/customJobs/1290022358253305856 current state:
    I0331 10:27:08.947669 139812138391104 jobs.py:178] CustomJob projects/***/locations/us-central1/customJobs/7194241469736026112 current state:
    I0331 10:28:24.716295 139812155360832 jobs.py:178] CustomJob projects/***/locations/us-central1/customJobs/1290022358253305856 current state:
    I0331 10:28:31.778189 139812138391104 jobs.py:178] CustomJob projects/***/locations/us-central1/customJobs/7194241469736026112 current state:
    I0331 10:30:33.545330 139812138391104 jobs.py:1127] CustomJob projects/***/locations/us-central1/customJobs/7194241469736026112 access the interactive shell terminals for the custom job:
    I0331 10:30:42.651297 139812155360832 jobs.py:1127] CustomJob projects/***/locations/us-central1/customJobs/1290022358253305856 access the interactive shell terminals for the custom job:
    I0331 10:31:09.588982 139812155360832 jobs.py:178] CustomJob projects/***/locations/us-central1/customJobs/1290022358253305856 current state:
    I0331 10:31:12.343992 139812138391104 jobs.py:178] CustomJob projects/***/locations/us-central1/customJobs/7194241469736026112 current state:
    opened by AbubakrHassan 6
  • Examples Timeout

    Hi! I'm trying to use XManager, and while the setup went well, all of the examples time out before the network even starts running. Any ideas what the cause could be?

    Thank you in advance for your help

    opened by joaogui1 5
  • error: googleapiclient.errors.UnknownApiNameOrVersion: name: us-central1-aiplatform  version: v1

    When I run 'xmanager launch ./xmanager/examples/cifar10_torch/launcher.py', I get the following error: googleapiclient.errors.UnknownApiNameOrVersion: name: us-central1-aiplatform version: v1

    Please, could you guide me?

    opened by JohanSamir 5
  • Use Existing Service Account

    I am trying to use XManager with Vertex AI but do not have permissions to create a new service account. I noticed that the service account name is hard-coded to "xmanager" here:


    Is it possible to add an option or parameter so that we can specify an existing service account name for XManager to use? Thanks.

    opened by sanath-2024 2
  • Codelab instructions say "install XManager" but the command clones "Raksha"

    The codelab.ipynb currently has the following instructions:

    Download and install XManager

    !git clone https://github.com/google-research/raksha.git ~/xmanager
    !pip install ~/xmanager

    It's unclear how https://github.com/google-research/raksha.git is related to XManager; is that line supposed to be cloning https://github.com/deepmind/xmanager.git instead?

    Also, why not use one of the following commands as directed by the README.md instructions?

    pip install git+https://github.com/deepmind/xmanager.git


    pip install xmanager

    Happy to make a PR (or CL) to update this, but just wanted to get clarity if this is intentional, and if so, what the rationale is.


    opened by mbrukman 2
  • `pip install xmanager==0.2.0` yields import error (previous version works OK)

    To reproduce:

    pip install xmanager==0.2.0


    from xmanager import xm

    Yields error:

    TypeError                                 Traceback (most recent call last)
    /tmp/ipykernel_30726/3915442554.py in <module>
    ----> 1 from xmanager import xm
    ~/xmanager/xmanager/xm/__init__.py in <module>
         19 from xmanager.xm import job_operators
         20 from xmanager.xm.compute_units import *
    ---> 21 from xmanager.xm.core import *
         22 from xmanager.xm.executables import *
         23 from xmanager.xm.job_blocks import *
    ~/xmanager/xmanager/xm/core.py in <module>
    --> 531 class Experiment(abc.ABC):
        532   """Experiment contains a family of jobs run on the same snapshot of code.
    ~/xmanager/xmanager/xm/core.py in Experiment()
        659       *,  # parameters after “*” are keyword-only parameters
        660       identity: str = ''
    --> 661   ) -> asyncio.Future[ExperimentUnit]:
        662     ...
    TypeError: 'type' object is not subscriptable

    However, installing the previous version 0.1.5 works OK.

    This failed on both http://colab.research.google.com and on a GCP Vertex AI managed Jupyter notebook.
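
    A likely cause, offered here as an assumption rather than a confirmed diagnosis: the annotation `asyncio.Future[ExperimentUnit]` is evaluated when the class body runs, and `asyncio.Future` only became subscriptable at runtime in Python 3.9. On older interpreters the subscript raises the TypeError shown above. The sketch below uses a hypothetical stand-in class (not XManager's own code) to show the failing expression and a quoted annotation that avoids evaluating it.

```python
import asyncio
import sys


class ExperimentUnit:
  """Hypothetical stand-in for the real xmanager.xm.ExperimentUnit."""


def future_is_subscriptable() -> bool:
  """Returns True if asyncio.Future[...] can be evaluated on this interpreter."""
  try:
    asyncio.Future[ExperimentUnit]
    return True
  except TypeError:
    # Older interpreters raise: 'type' object is not subscriptable
    return False


# Quoting the annotation stores it as a string, so the subscript is never
# evaluated at definition time and the function can be defined everywhere.
def add() -> 'asyncio.Future[ExperimentUnit]':
  ...


if __name__ == '__main__':
  print(sys.version_info[:2], future_is_subscriptable())
```

    If this is indeed the cause, running under Python 3.9 or newer, or staying on 0.1.5 as noted above, should sidestep the import error.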

    opened by letsbuild 2
  • enable_web_access for Vertex AI jobs

    Hi, is there a way to launch Vertex AI jobs with enabled access to an interactive shell (https://cloud.google.com/vertex-ai/docs/training/monitor-debug-interactive-shell#vertexai_enable_web_access-python)?

    opened by DzvinkaYarish 2
  • ValueError when target name contains a `.`

    The BazelContainer documentation uses image.tar as an example target name, but a target name containing a `.` actually raises a ValueError.

    Traceback (most recent call last):
      File "/Users/ryo.takahashi/ghq/github.com/deepmind/xmanager/.venv/bin/xmanager", line 33, in <module>
        sys.exit(load_entry_point('xmanager', 'console_scripts', 'xmanager')())
      File "/Users/ryo.takahashi/ghq/github.com/deepmind/xmanager/xmanager/cli/cli.py", line 65, in entrypoint
      File "/Users/ryo.takahashi/ghq/github.com/deepmind/xmanager/.venv/lib/python3.9/site-packages/absl/app.py", line 312, in run
        _run_main(main, args)
      File "/Users/ryo.takahashi/ghq/github.com/deepmind/xmanager/.venv/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
      File "/Users/ryo.takahashi/ghq/github.com/deepmind/xmanager/xmanager/cli/cli.py", line 41, in main
        app.run(m.main, argv=argv)
      File "/Users/ryo.takahashi/ghq/github.com/deepmind/xmanager/.venv/lib/python3.9/site-packages/absl/app.py", line 312, in run
        _run_main(main, args)
      File "/Users/ryo.takahashi/ghq/github.com/deepmind/xmanager/.venv/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
      File "/Users/ryo.takahashi/ghq/github.com/deepmind/xmanager/launcher.py", line 7, in main
        [executable] = experiment.package(
      File "/Users/ryo.takahashi/ghq/github.com/deepmind/xmanager/xmanager/xm/core.py", line 636, in package
        return cls._async_packager.package(packageables)
      File "/Users/ryo.takahashi/ghq/github.com/deepmind/xmanager/xmanager/xm/async_packager.py", line 114, in package
        executables = self._package_batch(packageables)
      File "/Users/ryo.takahashi/ghq/github.com/deepmind/xmanager/xmanager/xm_local/packaging/router.py", line 112, in package
        bazel_kinds = bazel_service.fetch_kinds(bazel_labels)
      File "/Users/ryo.takahashi/ghq/github.com/deepmind/xmanager/xmanager/xm_local/packaging/bazel_tools.py", line 186, in fetch_kinds
        labels = [_assemble_label(_lex_label(label)) for label in labels]
      File "/Users/ryo.takahashi/ghq/github.com/deepmind/xmanager/xmanager/xm_local/packaging/bazel_tools.py", line 186, in <listcomp>
        labels = [_assemble_label(_lex_label(label)) for label in labels]
      File "/Users/ryo.takahashi/ghq/github.com/deepmind/xmanager/xmanager/xm_local/packaging/bazel_tools.py", line 160, in _lex_label
        raise ValueError(f'{label} is not an absolute Bazel label')
    ValueError: //path/to/target:label.tar is not an absolute Bazel label

    This error is caused by _LABEL_LEXER. Its regex rejects every `.` in a name, including a single `.` that is not part of an expansion (`...`), so the label name in the example does not match. https://github.com/deepmind/xmanager/blob/18652570332e284a6b2c184e6ab943ca56f6a11a/xmanager/xm_local/packaging/bazel_tools.py#L149-L152

    The immediate solution that comes to mind is to allow `.` in the regex and then reject consecutive dots (`..`) in post-processing, like the following:

    diff --git a/xmanager/xm_local/packaging/bazel_tools.py b/xmanager/xm_local/packaging/bazel_tools.py
    index 694f001..4dc52b0 100644
    --- a/xmanager/xm_local/packaging/bazel_tools.py
    +++ b/xmanager/xm_local/packaging/bazel_tools.py
    @@ -147,7 +147,7 @@ def _build_multiple_targets(
     # Expansions (`...`, `*`) are not allowed.
    -_NAME_RE = '[^:/.*]+'
    +_NAME_RE = '[^:/*]+'
     _LABEL_LEXER = re.compile(
     _LexedLabel = Tuple[List[str], str]
    @@ -156,8 +156,10 @@ _LexedLabel = Tuple[List[str], str]
     def _lex_label(label: str) -> _LexedLabel:
       """Splits the label into packages and target."""
       match = _LABEL_LEXER.match(label)
    -  if match is None:
    -    raise ValueError(f'{label} is not an absolute Bazel label')
    +  for g in match.groups('packages'):
    +    print('group:', g)
    +    if '..' in g:
    +      raise ValueError(f'{label} is not an absolute Bazel label')
       groups = match.groupdict()
       packages: Optional[str] = groups['packages']
       target: Optional[str] = groups['target']
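
    The same idea can be sketched as a standalone function. This is a minimal illustration under stated assumptions, not the code in bazel_tools.py: the regex and the name `lex_label` are hypothetical, and the `match is None` check is kept so that a non-matching label still produces a ValueError rather than an AttributeError.

```python
import re
from typing import List, Tuple

# Hypothetical name pattern: single dots are allowed; ':', '/' and '*' are not.
_NAME_RE = r'[^:/*]+'
_LABEL_LEXER = re.compile(
    rf'^//(?P<packages>{_NAME_RE}(?:/{_NAME_RE})*)?(?::(?P<target>{_NAME_RE}))?$'
)


def lex_label(label: str) -> Tuple[List[str], str]:
  """Splits an absolute Bazel label into its package parts and target name."""
  match = _LABEL_LEXER.match(label)
  # Reject labels that do not match at all, and expansions such as `...`,
  # which show up as consecutive dots.
  if match is None or '..' in label:
    raise ValueError(f'{label} is not an absolute Bazel label')
  packages = match.group('packages') or ''
  package_parts = packages.split('/') if packages else []
  # Bazel defaults the target to the last package component when omitted.
  target = match.group('target') or (package_parts[-1] if package_parts else '')
  if not target:
    raise ValueError(f'{label} is not an absolute Bazel label')
  return package_parts, target


if __name__ == '__main__':
  print(lex_label('//path/to/target:label.tar'))
```

    With this sketch, `//path/to/target:label.tar` is accepted, while `//path/...` and `//path/to:*` are still rejected.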
    opened by reiyw 1
  • Better support for Kubernetes

    This is to gauge interest if better Kubernetes support would be useful. Please comment if this would be useful to you and ideally explain your use case a bit.

    opened by dfurrer 1
  • pinned sqlalchemy and alembic dependencies are more than two years old

    sqlalchemy is pinned to 1.2.19, which was released in April of 2019.

    alembic is pinned to 1.4.3, which was released in September of 2020.

    This was already brought up in https://github.com/deepmind/xmanager/issues/28.

    Old dependencies like this make it difficult for xmanager to coexist with other packages that keep their dependencies up to date - for instance, the hyperparameter optimization package Optuna.

    opened by kalaracey 0
  • FileExistsError on launch

    I am trying to run XManager with the following script.

    from __future__ import annotations
    from xmanager import xm
    import os
    from xmanager import xm_local
    def main(_):
        with xm_local.create_experiment(experiment_title='cifar102') as experiment:
          # path = os.path.join(os.path.dirname(__file__), "learned_optimization")
          path = os.path.join(os.path.dirname(__file__), "./")
          spec = xm.PythonContainer(
          [executable] = experiment.package([
                  # What we are going to run.
                  # Where we are going to run it.
              # Time to pass the batch size as a command-line argument!
              # args={'batch_size': 16},
              args={'--cfg': "configs/run_cub.yaml"},
              # We can also pass environment variables.
              env_vars={'HEAPPROFILE': '/tmp/a_out.hprof'},
    if __name__ == '__main__':

    I get the following FileExistsError, which did not occur when I first ran the code. I am not sure how to solve this issue.

    Error:
      File "/home/gulzain/gen_39/lib/python3.9/site-packages/xmanager/xm_local/packaging/cloud.py", line 133, in _package_python_container
        image = build_image.build(
      File "/home/gulzain/gen_39/lib/python3.9/site-packages/xmanager/cloud/build_image.py", line 119, in build
        docker_lib.prepare_directory(staging, python_path, dirname, entrypoint,
      File "/home/gulzain/gen_39/lib/python3.9/site-packages/xmanager/cloud/docker_lib.py", line 56, in prepare_directory
        shutil.copytree(source_directory,
      File "/usr/lib/python3.9/shutil.py", line 568, in copytree
        return _copytree(entries=entries, src=src, dst=dst, symlinks=symlinks,
      File "/usr/lib/python3.9/shutil.py", line 467, in _copytree
        os.makedirs(dst, exist_ok=dirs_exist_ok)
      File "/usr/lib/python3.9/os.py", line 225, in makedirs
        mkdir(name, mode)
    FileExistsError: [Errno 17] File exists: '/tmp/tmpm5m_0mb_/'

    I have Python 3.9 and the latest pip version of XManager.
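
    One possible explanation, stated as an assumption because the packaging code is not shown here: if the staging destination is built from the basename of the source path, a path ending in "./" has an empty basename, so the copy targets the already-existing staging directory itself. That would match the trailing slash in '/tmp/tmpm5m_0mb_/'. The helper below is a hypothetical reduction of that behaviour, not XManager's implementation.

```python
import os
import shutil
import tempfile


def copy_into_staging(source_directory: str, staging: str) -> str:
  """Copies source_directory into staging under the source's basename."""
  # For a path ending in '/./' or '/', basename() returns an empty string.
  dirname = os.path.basename(source_directory)
  destination = os.path.join(staging, dirname)
  # copytree refuses to write into a directory that already exists.
  shutil.copytree(source_directory, destination)
  return destination


if __name__ == '__main__':
  source = tempfile.mkdtemp()
  staging = tempfile.mkdtemp()
  try:
    copy_into_staging(os.path.join(source, './'), staging)
  except FileExistsError as error:
    print('fails:', error)
  # Normalising the path first restores a non-empty basename.
  print('works:', copy_into_staging(os.path.normpath(source), staging))
```

    Under that assumption, passing a normalised path such as `os.path.dirname(os.path.abspath(__file__))` instead of joining with "./" may avoid the error.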

    opened by gulzainali98 0
  • ResourceExhausted: 429 The following quota metrics exceed quota limits

    Hi! Thanks for building this amazing project. Recently I've been running a script on XManager + Vertex AI on TPU v2 and v3, but I keep getting this error:

    google.api_core.exceptions.ResourceExhausted: 429 The following quota metrics exceed quota limits: aiplatform.googleapis.com/custom_model_training_tpu_v2

    The error is thrown at this line - https://github.com/deepmind/xmanager/blob/v0.2.0/xmanager/cloud/vertex.py#L181.

    Below are the sanity checks that I've done:

    • The service account here loads fine, though it is soon set to `None` here because I'm requesting TPU v2 or v3.
    • tensorboard is set to an empty string.
    • self.location, self.project, pools and auth.get_bucket() all look good: the location is us-central1, and pools shows:
    [machine_spec {
      machine_type: "cloud-tpu"
      accelerator_type: TPU_V2
      accelerator_count: 8

    I've enabled the three APIs mentioned in the README (IAM, Cloud AI Platform, Container Registry); additionally, the Vertex API and Cloud Resource Manager API are enabled. I also checked the Quota page on the console, which looks fine as well. It doesn't look like I'm overusing resources in the way the error message ("exceed quota limits") describes.

    This has been bugging me for quite a few days, and I would really appreciate it if anyone could suggest what might be going on. Thanks in advance!

    opened by crystina-z 1
  • Error running cifar10_tensorflow_tpu example

    I'm trying to run the cifar10_tensorflow_tpu example on GCP and got this error:

      File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 445, in result
        return self.__get_result()
      File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
        raise self._exception
      File "/home/koles/.local/lib/python3.9/site-packages/xmanager/xm/core.py", line 824, in launch
        await experiment_unit.add(job, args, identity=identity)
      File "/home/koles/.local/lib/python3.9/site-packages/xmanager/xm_local/experiment.py", line 211, in _launch_job_group
        launch_result = await self._submit_jobs_for_execution(job_group)
      File "/home/koles/.local/lib/python3.9/site-packages/xmanager/xm_local/experiment.py", line 83, in _submit_jobs_for_execution
        vertex_handles = vertex.launch(self._experiment_title,
      File "/home/koles/.local/lib/python3.9/site-packages/xmanager/cloud/vertex.py", line 335, in launch
        job_name = get_default_client().launch(
      File "/home/koles/.local/lib/python3.9/site-packages/xmanager/cloud/vertex.py", line 181, in launch
      File "/home/koles/.local/lib/python3.9/site-packages/google/cloud/aiplatform/jobs.py", line 1026, in wait_for_resource_creation
      File "/home/koles/.local/lib/python3.9/site-packages/google/cloud/aiplatform/base.py", line 1246, in _wait_for_resource_creation
      File "/home/koles/.local/lib/python3.9/site-packages/google/cloud/aiplatform/base.py", line 214, in _raise_future_exception
        raise self._exception
      File "/home/koles/.local/lib/python3.9/site-packages/google/cloud/aiplatform/base.py", line 226, in _complete_future
        future.result()  # raises
      File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 438, in result
        return self.__get_result()
      File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
        raise self._exception
      File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 52, in run
        result = self.fn(*self.args, **self.kwargs)
      File "/home/koles/.local/lib/python3.9/site-packages/google/cloud/aiplatform/base.py", line 316, in wait_for_dependencies_and_invoke
        result = method(*args, **kwargs)
      File "/home/koles/.local/lib/python3.9/site-packages/google/cloud/aiplatform/jobs.py", line 1496, in run
        self._gca_resource = self.api_client.create_custom_job(
      File "/home/koles/.local/lib/python3.9/site-packages/google/cloud/aiplatform_v1/services/job_service/client.py", line 794, in create_custom_job
        response = rpc(
      File "/home/koles/.local/lib/python3.9/site-packages/google/api_core/gapic_v1/method.py", line 154, in __call__
        return wrapped_func(*args, **kwargs)
      File "/home/koles/.local/lib/python3.9/site-packages/google/api_core/grpc_helpers.py", line 52, in error_remapped_callable
        raise exceptions.from_grpc_error(exc) from exc
    google.api_core.exceptions.NotFound: 404 custom_job.job_spec.service_account must be specified when uploading to TensorBoard.

    I followed the XManager setup instructions and then ran the example from a clean GCP VM:

    xmanager launch examples/cifar10_tensorflow_tpu/launcher.py

    Thank you for the help.

    opened by akolesnikov 0
  • installing xmanager from pip fails in colab environment

    python -m pip install xmanager fails with the following error in a Colab (18.06) environment:

    ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
    tensorflow 2.8.0 requires tf-estimator-nightly==2.8.0.dev2021122109, which is not installed.
    pandas-gbq 0.13.3 requires google-cloud-bigquery[bqstorage,pandas]<2.0.0dev,>=1.11.1, but you have google-cloud-bigquery 2.34.3 which is incompatible.
    google-cloud-translate 1.5.0 requires google-cloud-core<2.0dev,>=1.0.0, but you have google-cloud-core 2.3.0 which is incompatible.
    google-cloud-firestore 1.7.0 requires google-cloud-core<2.0dev,>=1.0.3, but you have google-cloud-core 2.3.0 which is incompatible.
    google-cloud-datastore 1.8.0 requires google-cloud-core<2.0dev,>=1.0.0, but you have google-cloud-core 2.3.0 which is incompatible.
    Successfully installed async-generator-1.10 docker-5.0.3 google-cloud-aiplatform-1.12.1 google-cloud-bigquery-2.34.3 google-cloud-core-2.3.0 google-cloud-resource-manager-1.4.1 google-cloud-storage-2.3.0 google-crc32c-1.3.0 google-resumable-media-2.3.2 grpc-google-iam-v1-0.12.4 immutabledict-2.2.1 kubernetes-23.3.0 proto-plus-1.20.3 protobuf-3.20.1 pyyaml-6.0 sqlalchemy-1.2.19 websocket-client-1.3.2 xmanager-0.1.5
    opened by proppy 2
  • JOB_STATE_FAILED for cifar10_tensorflow

    I am unable to launch an example script. The command and part of the console output, including the error, follow. I am running the command from the PyCharm terminal. The job is launched but fails immediately with a "JOB_STATE_FAILED" error.

    % sudo xmanager launch ./examples/cifar10_tensorflow/launcher.py

    Console output + Error (a part of it):

    [+] Building 0.5s (16/16) FINISHED
     => [internal] load build definition from Dockerfile 0.0s
     => => transferring dockerfile: 694B 0.0s
     => [internal] load .dockerignore 0.0s
     => => transferring context: 2B 0.0s
     => [internal] load metadata for gcr.io/deeplearning-platform-release/tf2-gpu.2-6:latest 0.4s
     => [ 1/11] FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-6@sha256:<"a bunch of HEX digits"> 0.0s
     => [internal] load build context 0.0s
     => => transferring context: 8.07kB 0.0s
     => CACHED [ 2/11] RUN if ! id 1000; then useradd -m -u 1000 clouduser; fi 0.0s
     => CACHED [ 3/11] RUN apt-get update && apt-get install -y git netcat 0.0s
     => CACHED [ 4/11] RUN python -m pip install --upgrade pip 0.0s
     => CACHED [ 5/11] COPY cifar10_tensorflow/requirements.txt /cifar10_tensorflow/requirements.txt 0.0s
     => CACHED [ 6/11] RUN python -m pip install -r cifar10_tensorflow/requirements.txt 0.0s
     => CACHED [ 7/11] COPY cifar10_tensorflow/ /cifar10_tensorflow 0.0s
     => CACHED [ 8/11] RUN chown -R 1000:root /cifar10_tensorflow && chmod -R 775 /cifar10_tensorflow 0.0s
     => CACHED [ 9/11] WORKDIR cifar10_tensorflow 0.0s
     => CACHED [10/11] COPY entrypoint.sh ./entrypoint.sh 0.0s
     => CACHED [11/11] RUN chown -R 1000:root ./entrypoint.sh && chmod -R 775 ./entrypoint.sh 0.0s
     => exporting to image 0.0s
     => => exporting layers
    ...
    {"status":"Waiting","progressDetail":{},"id": ....
    {"status":"Layer already exists","progressDetail":{},"id": ....
    Your image URI is:
    Job launched at: https://console.cloud.google.com/ai/platform/locations//training/
    current state: JobState.JOB_STATE_QUEUED
    current state: JobState.JOB_STATE_PENDING
    current state: JobState.JOB_STATE_FAILED

    opened by nayakanuj 0
