OpenDILab RL Kubernetes Custom Resource and Operator Lib

Last update: Dec 29, 2022

Overview

DI Orchestrator

DI Orchestrator is designed to manage DI (Decision Intelligence) jobs using Kubernetes Custom Resource and Operator.

Prerequisites

A working Kubernetes cluster. Follow the official instructions to create a Kubernetes cluster, or create a local single-node cluster with kind or minikube.
Cert-manager. To install it on Kubernetes, refer to the cert-manager docs, or install it with the following command.

kubectl create -f ./config/certmanager/cert-manager.yaml

Install DI Orchestrator

DI Orchestrator consists of two components: di-operator and di-server. Install di-operator and di-server with the following command.

kubectl create -f ./config/di-manager.yaml

di-operator and di-server will be installed in the di-system namespace.

$ kubectl get pod -n di-system
NAME                           READY   STATUS    RESTARTS   AGE
di-operator-57cc65d5c9-5vnvn   1/1     Running   0          59s
di-server-7b86ff8df4-jfgmp     1/1     Running   0          59s

Install global components of DIJob defined in AggregatorConfig:

kubectl create -f config/samples/agconfig.yaml -n di-system

Submit DIJob

# submit DIJob
$ kubectl create -f config/samples/dijob-cartpole.yaml

# get pod and you will see coordinator is created by di-operator
# a few seconds later, you will see collectors and learners created by di-server
$ kubectl get pod

# get logs of coordinator
$ kubectl logs cartpole-dqn-coordinator

User Guide

Refer to the user-guide. For the Chinese version, refer to 中文手册 (the Chinese manual).

Contributing

Refer to the developer-guide.

Comments

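Add cluster information inside the Pod
When a job is submitted as a dijob replica, each pod should be able to see the host information of the whole replica and its own start order. To support this, add the following environment variables:

The FQDNs of all pods in the replica, sorted by start order

The FQDN of the current pod

The order index of the current pod

DI-engine will establish the corresponding network connections based on these variables, so the attach-to generation logic can be removed from di-orchestrator.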
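As a rough illustration of how a process inside a pod might consume such variables, here is a minimal Go sketch. The issue text above does not give the variable names, so `EXAMPLE_REPLICA_FQDNS`, `EXAMPLE_POD_FQDN`, and `EXAMPLE_POD_ORDER` are placeholders invented for this example, as are the comma-separated encoding and the `PeerInfo` type. It is not DI-engine's actual implementation.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Placeholder variable names: the real names are not given in the issue.
const (
	envAllFQDNs = "EXAMPLE_REPLICA_FQDNS" // assumed comma-separated, in start order
	envSelfFQDN = "EXAMPLE_POD_FQDN"
	envOrder    = "EXAMPLE_POD_ORDER"
)

// PeerInfo (hypothetical) is what a process learns about its replica group.
type PeerInfo struct {
	All   []string // FQDNs of every pod, sorted by start order
	Self  string   // FQDN of this pod
	Order int      // position of this pod in All
}

// LoadPeerInfo reads the three variables through getenv and checks that
// they agree with each other.
func LoadPeerInfo(getenv func(string) string) (PeerInfo, error) {
	raw := getenv(envAllFQDNs)
	if raw == "" {
		return PeerInfo{}, fmt.Errorf("%s is not set", envAllFQDNs)
	}
	all := strings.Split(raw, ",")
	order, err := strconv.Atoi(getenv(envOrder))
	if err != nil {
		return PeerInfo{}, fmt.Errorf("invalid %s: %w", envOrder, err)
	}
	if order < 0 || order >= len(all) {
		return PeerInfo{}, fmt.Errorf("order %d out of range for %d pods", order, len(all))
	}
	self := getenv(envSelfFQDN)
	if all[order] != self {
		return PeerInfo{}, fmt.Errorf("pod %q is not at position %d", self, order)
	}
	return PeerInfo{All: all, Self: self, Order: order}, nil
}

// Peers returns every FQDN except this pod's own, keeping start order.
func (p PeerInfo) Peers() []string {
	var peers []string
	for i, host := range p.All {
		if i != p.Order {
			peers = append(peers, host)
		}
	}
	return peers
}

func main() {
	info, err := LoadPeerInfo(os.Getenv)
	if err != nil {
		fmt.Println("peer info unavailable:", err)
		return
	}
	fmt.Println("connect to:", info.Peers())
}
```

Passing `getenv` as a parameter keeps the parsing logic testable without touching the real process environment.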
enhancement
opened by sailxjx 3

add tasks to dijob spec

1. goal

A dijob defines only one pod template, so we cannot specify different commands or resources for the different components of di-engine, such as collector, learner, and evaluator. We therefore need a more general way to define the dijob custom resource.

2. design *

Inspired by VolcanoJob, we define spec.tasks to describe the different components of di-engine. spec.tasks is a list, so multiple tasks can be defined. Each task's task.type labels it as one of collector, learner, evaluator, or none. none, the default value, marks a general task.

After the change, a dijob can be defined as follows:

apiVersion: diengine.opendilab.org/v2alpha1
kind: DIJob
metadata:
  name: job-with-tasks
spec:
  priority: "normal"  # job priority, which is a reserved field for allocator
  backoffLimit: 0  # restart count
  cleanPodPolicy: "Running"  # the policy to clean pods after job completion
  preemptible: false  # whether the job is preemptible
  minReplicas: 2  
  maxReplicas: 5
  tasks:
  - replicas: 1
    name: "learner"
    type: learner
    template:
      metadata:
        name: di
      spec:
        containers:
        - image: registry.sensetime.com/xlab/ding:nightly
          imagePullPolicy: IfNotPresent
          name: pydi
          env:
          - name: NCCL_DEBUG
            value: "INFO"
          command: ["/bin/bash", "-c",]
          args: 
          - |
            ditask --label learner xxx
          resources:
            requests:
              cpu: "1"
              nvidia.com/gpu: 1
        restartPolicy: Never
  - replicas: 1
    name: "evaluator"
    type: evaluator
    template:
      metadata:
        name: di
      spec:
        containers:
        - image: registry.sensetime.com/xlab/ding:nightly
          imagePullPolicy: IfNotPresent
          name: pydi
          env:
          - name: NCCL_DEBUG
            value: "INFO"
          command: ["/bin/bash", "-c",]
          args: 
          - |
            ditask --label evaluator xxx
        restartPolicy: Never
  - replicas: 2
    name: "collector"
    type: collector
    template:
      metadata:
        name: di
      spec:
        containers:
        - image: registry.sensetime.com/xlab/ding:nightly
          imagePullPolicy: IfNotPresent
          name: pydi
          env:
          - name: NCCL_DEBUG
            value: "INFO"
          command: ["/bin/bash", "-c",]
          args: 
          - |
            ditask --label collector xxx
        restartPolicy: Never
status:
  conditions:
  - lastTransitionTime: "2022-05-26T07:25:11Z"
    lastUpdateTime: "2022-05-26T07:25:11Z"
    message: job created.
    reason: JobPending
    status: "False"
    type: Pending
  - lastTransitionTime: "2022-05-26T07:25:11Z"
    lastUpdateTime: "2022-05-26T07:25:11Z"
    message: job is starting since all pods are created.
    reason: JobStarting
    status: "False"
    type: Starting
  phase: Starting
  profilings: {}
  readyReplicas: 0
  replicas: 4
  taskStatus:
    learner:
      Pending: 1
    evaluator:
      Pending: 1
    collector:
      Pending: 2
  reschedules: 0
  restarts: 0

task definition:

type Task struct {
	Name string `json:"name,omitempty"`

	Type TaskType `json:"type,omitempty"`

	Replicas int32 `json:"replicas,omitempty"`

	Template corev1.PodTemplateSpec `json:"template,omitempty"`
}

type TaskType string

const (
	TaskTypeLearner TaskType = "learner"

	TaskTypeCollector TaskType = "collector"

	TaskTypeEvaluator TaskType = "evaluator"

	TaskTypeNone TaskType = "none"
)
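The defaulting rule described in the design (an unset type means none) and a few sanity checks on spec.tasks can be sketched as follows. This is a hedged illustration only: `ApplyDefaults` and `Validate` are hypothetical helpers, the validation rules are assumptions rather than the project's actual validator, and `Task` is trimmed to the fields the sketch needs. Unique names are checked because status.taskStatus is keyed by task.name.

```go
package main

import (
	"errors"
	"fmt"
)

type TaskType string

const (
	TaskTypeLearner   TaskType = "learner"
	TaskTypeCollector TaskType = "collector"
	TaskTypeEvaluator TaskType = "evaluator"
	TaskTypeNone      TaskType = "none"
)

// Task is a trimmed stand-in for the Task struct above (pod template omitted).
type Task struct {
	Name     string
	Type     TaskType
	Replicas int32
}

// ApplyDefaults (hypothetical) sets the type to none where it was left empty.
func ApplyDefaults(tasks []Task) {
	for i := range tasks {
		if tasks[i].Type == "" {
			tasks[i].Type = TaskTypeNone
		}
	}
}

// Validate (hypothetical) checks the task list and returns the total number
// of replicas requested across all tasks.
func Validate(tasks []Task) (int32, error) {
	if len(tasks) == 0 {
		return 0, errors.New("spec.tasks must not be empty")
	}
	seen := make(map[string]bool)
	var total int32
	for _, t := range tasks {
		switch t.Type {
		case TaskTypeLearner, TaskTypeCollector, TaskTypeEvaluator, TaskTypeNone:
		default:
			return 0, fmt.Errorf("task %q: unknown type %q", t.Name, t.Type)
		}
		if t.Name == "" {
			return 0, errors.New("task name must not be empty")
		}
		if seen[t.Name] {
			return 0, fmt.Errorf("duplicate task name %q", t.Name)
		}
		seen[t.Name] = true
		if t.Replicas < 0 {
			return 0, fmt.Errorf("task %q: negative replicas", t.Name)
		}
		total += t.Replicas
	}
	return total, nil
}

func main() {
	// Same tasks as the YAML example above: 1 learner, 1 evaluator, 2 collectors.
	tasks := []Task{
		{Name: "learner", Type: TaskTypeLearner, Replicas: 1},
		{Name: "evaluator", Type: TaskTypeEvaluator, Replicas: 1},
		{Name: "collector", Type: TaskTypeCollector, Replicas: 2},
	}
	ApplyDefaults(tasks)
	total, err := Validate(tasks)
	fmt.Println(total, err) // prints "4 <nil>", matching status.replicas: 4
}
```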

status.taskStatus definition:

type DIJobStatus struct {
  // Phase defines the observed phase of the job
  // +kubebuilder:default=Pending
  Phase Phase `json:"phase,omitempty"`

  // ...
  
  // map of task statuses. key: task.name, value: TaskStatus
  TaskStatus map[string]TaskStatus `json:"taskStatus,omitempty"`

  // ...
}

// count of different pod phases
type TaskStatus map[corev1.PodPhase]int32
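To make the shape of status.taskStatus concrete, the sketch below counts pods per task name and per phase. It is an illustration under assumptions: `PodPhase` is a local stand-in for corev1.PodPhase so the example runs without Kubernetes dependencies, and `Pod` and `CountTaskStatus` are hypothetical names, not the operator's own code.

```go
package main

import "fmt"

// PodPhase is a local stand-in for corev1.PodPhase.
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodRunning PodPhase = "Running"
)

// TaskStatus counts pods per phase, as in the definition above.
type TaskStatus map[PodPhase]int32

// Pod (hypothetical) carries only what the counting needs.
type Pod struct {
	TaskName string
	Phase    PodPhase
}

// CountTaskStatus builds the map keyed by task name.
func CountTaskStatus(pods []Pod) map[string]TaskStatus {
	status := make(map[string]TaskStatus)
	for _, p := range pods {
		if status[p.TaskName] == nil {
			status[p.TaskName] = TaskStatus{}
		}
		status[p.TaskName][p.Phase]++
	}
	return status
}

func main() {
	// Four pending pods, as in the status block of the YAML example.
	pods := []Pod{
		{TaskName: "learner", Phase: PodPending},
		{TaskName: "evaluator", Phase: PodPending},
		{TaskName: "collector", Phase: PodPending},
		{TaskName: "collector", Phase: PodPending},
	}
	status := CountTaskStatus(pods)
	fmt.Println(status["collector"][PodPending]) // prints 2
}
```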

enhancement

opened by konnase 1

new version for di-engine new architecture
release notes

features

v1.0.0 for DI-engine new architecture

remove webhook

manage commands with cobra

refactor orchestrator architecture inspired from adaptdl

use gin to rewrite di-server

update di-server http interface

enhancement
opened by konnase 1
v0.2.0
[x] split webhook and operator

[x] add dockerfile.dev

[x] update CleanPolicyALL to CleanPolicyAll

[x] remove k8s service related operations from server, and operator is responsible for managing services

[x] add e2e test

enhancement
opened by konnase 1
refactor job spec
refactor job spec definition and add spec.tasks to support multi tasks #20

add DI_RANK to pod env and remove engineFields in job.spec #16

add e2e test

add validator to validate the correctness of dijob spec

change job.phase to Pending when job replicas scaled to 0

implement a processor to process di-server requests

refactor project structure

enhancement
opened by konnase 0
Release/v1.0
release notes

features

v1.0.0 for DI-engine new architecture

remove webhook

manage commands with cobra

refactor orchestrator architecture inspired from adaptdl

use gin to rewrite di-server

update di-server http interface

enhancement
opened by konnase 0
fix: job failed submit when collector/learner missed

Job submission failed when collector/learner was missing: the webhook creates an empty dijob, and the Go builder adds default values to some fields of collector/learner, which results in an invalid type error. Fixed by making coordinator/collector/learner pointers.
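One general Go behavior that is likely relevant to this fix: with encoding/json, `omitempty` never omits a struct value, but it does omit a nil pointer. The sketch below shows the difference. `Component`, `SpecByValue`, and `SpecByPointer` are hypothetical types made up for the illustration and do not reproduce the project's real spec.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Component is a hypothetical stand-in for a collector/learner spec.
type Component struct {
	Replicas int32 `json:"replicas"`
}

// SpecByValue embeds optional components as plain structs.
type SpecByValue struct {
	Coordinator Component `json:"coordinator"`
	Collector   Component `json:"collector,omitempty"`
	Learner     Component `json:"learner,omitempty"`
}

// SpecByPointer embeds optional components as pointers.
type SpecByPointer struct {
	Coordinator *Component `json:"coordinator"`
	Collector   *Component `json:"collector,omitempty"`
	Learner     *Component `json:"learner,omitempty"`
}

// encode marshals v to a JSON string, or returns the error text.
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// Missing components still appear, filled with zero values.
	fmt.Println(encode(SpecByValue{Coordinator: Component{Replicas: 1}}))
	// Missing components are nil and therefore left out entirely.
	fmt.Println(encode(SpecByPointer{Coordinator: &Component{Replicas: 1}}))
}
```

With pointers, code that receives the object can distinguish "not specified" (nil) from "specified with zero values", which value fields cannot express.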
bug

opened by konnase 0
Feat/job create event
add event handler for dijob, and mark job as Created when job submitted

mark collector and learner as optional, only coordinator is required (https://github.com/opendilab/DI-orchestrator/pull/13/commits/653e64af01ec7752b08d4bf8381738d566fca224)

mark job Failed when the submitted job is incorrect (https://github.com/opendilab/DI-orchestrator/pull/13/commits/bea840a5eee3508be18b53b325168a5647daff94). This is hard to test: the client-go reflector decodes DIJob strictly, so there is no chance to handle the DIJob add event when an incorrect job is submitted

version -> v0.2.1

enhancement
opened by konnase 0
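Some questions about allocate

1. In the current allocator logic, the initial allocation for a non-preemptible job only uses minreplicas to modify the replicas attribute. Is the node on which the job's pods are deployed then decided entirely by K8S? Also, the initial-allocation part for non-preemptible jobs in allocator.go of the Release1.13 code seems not to have been written yet.
2. What exactly does it mean for a job to be preemptible? Is it equivalent to whether the job can be scheduled?
3. The Allocate and Optimize methods of the scheduling policy FitPolicy are not implemented either. When can this part be added?
4. The documentation does not match the latest code in many places. For example, the DIJob.Spec.Group attribute has already been removed from the code, and the job.spec.minreplicas attribute mentioned in the documentation does not exist in the code either; it lives in JobInfo instead. Could the documentation be updated? Thanks!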

opened by RZ-Q 3

Releases(v1.1.3)

v1.1.3(Aug 22, 2022)
bug fixes

judge which task a pod belongs to according to task name instead of task type (https://github.com/opendilab/DI-orchestrator/pull/27)
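Why matching by name matters can be seen when two tasks share a type. The sketch below is hypothetical: the `Pod` fields, the task names `collector-cpu` and `collector-gpu`, and the `GroupPods` helper are invented for illustration, and how the operator actually records a pod's task is not shown in this document.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a hypothetical, minimal view of a pod and the task it came from.
type Pod struct {
	Name     string
	TaskName string
	TaskType string
}

// GroupPods buckets pod names by the key that keyOf extracts from each pod.
func GroupPods(pods []Pod, keyOf func(Pod) string) map[string][]string {
	groups := make(map[string][]string)
	for _, p := range pods {
		k := keyOf(p)
		groups[k] = append(groups[k], p.Name)
	}
	return groups
}

// sortedKeys returns the group keys in a stable order for printing.
func sortedKeys(groups map[string][]string) []string {
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Two distinct tasks that happen to share the type "collector".
	pods := []Pod{
		{Name: "job-collector-cpu-0", TaskName: "collector-cpu", TaskType: "collector"},
		{Name: "job-collector-gpu-0", TaskName: "collector-gpu", TaskType: "collector"},
	}
	byType := GroupPods(pods, func(p Pod) string { return p.TaskType })
	byName := GroupPods(pods, func(p Pod) string { return p.TaskName })
	fmt.Println(sortedKeys(byType)) // prints [collector]: the two tasks are merged
	fmt.Println(sortedKeys(byName)) // prints [collector-cpu collector-gpu]
}
```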

v1.1.2(Jul 21, 2022)
bug fixes

global cmd flag error (https://github.com/opendilab/DI-orchestrator/pull/23)

wrong pod subdomain (https://github.com/opendilab/DI-orchestrator/pull/24)

incorrect global rank retrieval (https://github.com/opendilab/DI-orchestrator/pull/25)

v1.1.1(Jul 4, 2022)
update status replicas and task status

add volumes to job spec

update status CompletionTimestamp when job completed

see details in https://github.com/opendilab/DI-orchestrator/pull/22
v1.1.0(Jun 30, 2022)
refactor job spec definition and add spec.tasks to support multi tasks #20

add DI_RANK to pod env and remove engineFields in job.spec #16

add e2e test

add validator to validate the correctness of dijob spec

change job.phase to Pending when job replicas scaled to 0

implement a processor to process di-server requests

refactor project structure

see details in https://github.com/opendilab/DI-orchestrator/pull/21
v1.0.0(Mar 23, 2022)
features

remove webhook

manage commands with cobra

refactor orchestrator architecture inspired from adaptdl

use gin to rewrite di-server

update di-server http interface (see https://github.com/opendilab/DI-orchestrator/pull/18)

v0.2.2(Dec 15, 2021)
bug fix

resolve a bug where job submission failed when collector/learner was missing (https://github.com/opendilab/DI-orchestrator/pull/14)

v0.2.1(Oct 12, 2021)
feature

add event handler for dijob, and mark job as Created when job submitted (https://github.com/opendilab/DI-orchestrator/pull/13)

mark collector and learner as optional, only coordinator is required (https://github.com/opendilab/DI-orchestrator/pull/13/commits/653e64af01ec7752b08d4bf8381738d566fca224)

mark job Failed when the submitted job is incorrect (https://github.com/opendilab/DI-orchestrator/pull/13/commits/bea840a5eee3508be18b53b325168a5647daff94). This is hard to test: the client-go reflector decodes DIJob strictly, so there is no chance to handle the DIJob add event when an incorrect job is submitted

v0.2.0(Sep 28, 2021)
change orchestrator image repository

version -> v0.2.0

v0.2.0-rc.0(Sep 6, 2021)
split webhook and operator

add dockerfile.dev

update CleanPolicyALL to CleanPolicyAll

remove k8s service related operations from server, and operator is responsible for managing services

add e2e test

v0.1.0(Jul 8, 2021)
Features

Define DIJob CRD to support DI jobs' submission

Define AggregatorConfig CRD to support aggregator definition

Add webhook to validate DIJob submission

Provide http service for DI jobs to request for DI modules

Docs to introduce DI-orchestrator architecture


Owner

OpenDILab

Open sourced Decision Intelligence (DI)
