Best practices for using an exclusive resource pool: linking Notebook and training tasks
2022-07-18 21:33:00 【Huawei Cloud】
In this scenario, we use Notebook for the debugging stage and ModelArts training tasks for the batch-run stage.
Features of the scenario:
Notebook provides computing resources for debugging through container instances. Because of Huawei Cloud security requirements, the interactive environment cannot grant root privileges directly; you must log in and use resources as “ma-user” (user_id: 1000).
Notebook provides a set of prebuilt images that already include the base CUDA driver, conda, Python, and mainstream versions of the AI engines (TensorFlow, PyTorch) with their dependencies.
Notebook can save images, so customers can persist the modifications they make to the image environment.
Notebook supports SSH login, so customers can log in and debug remotely with tools such as the VS Code plug-in.
Notebook and the development environment can share a folder over NFS, and files in that folder are synchronized in real time.
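Because everything in the Notebook runs as uid 1000, a quick sanity check in the terminal can save confusion later. The following is a minimal sketch, not an official ModelArts tool: the function names `check_uid` and `check_writable` are made up for illustration, and the working directory in the usage comment is the one used later in this walkthrough.

```shell
# Check that the current user matches the expected unprivileged account.
# Arguments: $1 = expected numeric uid (1000 for ma-user, per the text above).
check_uid() {
    expected_uid="$1"
    actual_uid="$(id -u)"
    if [ "$actual_uid" = "$expected_uid" ]; then
        echo "ok: running as uid $actual_uid"
        return 0
    fi
    echo "warning: running as uid $actual_uid, expected $expected_uid" >&2
    return 1
}

# Check that a directory exists and is writable by the current user.
# Arguments: $1 = directory path.
check_writable() {
    if [ -d "$1" ] && [ -w "$1" ]; then
        echo "ok: $1 is writable"
        return 0
    fi
    echo "warning: $1 is missing or not writable" >&2
    return 1
}

# Typical use inside the Notebook terminal:
#   check_uid 1000 && check_writable /home/ma-user/work
```

If either check prints a warning, it is worth resolving before saving the image or creating a training task, since both later stages rely on the same user and directory.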
Prerequisites:
You have purchased a ModelArts exclusive training pool and completed the initialization of the development environment, and the NAS VPC is properly configured.
Permissions are set correctly: you have the common permissions of the SFS Turbo service and the fine-grained permission “sfsturbo:*:dataAction”.
Procedure:
Debugging stage:
1. Create a Notebook instance
Access link :https://console.huaweicloud.com/modelarts/?region=cn-north-4#/dev-container

Click the “Create” button and create a Notebook instance using the exclusive pool resources. For the image, select the public image “pytorch 1.8”; for storage, select “Elastic File Service SFS”. The configuration is shown in the screenshot:

Note: to use SSH login and VS Code for remote development, see this blog post: https://bbs.huaweicloud.com/blogs/280541
Open the Notebook instance and open a terminal.
2. Download the code and data, and debug
The code is the official PyTorch ImageNet example (GitHub link: https://github.com/pytorch/examples); the data is a public cat-and-dog dataset (trimmed version).
The download and unzip commands are as follows:
mkdir -p dog_cat_case
cd dog_cat_case
wget https://raw.githubusercontent.com/pytorch/examples/main/imagenet/main.py
wget https://ma-sa.obs.cn-north-4.myhuaweicloud.com/yangzilong/demos/dog_cat_1w.zip
unzip dog_cat_1w.zip
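The official ImageNet example reads training and validation images from `train` and `val` subfolders of the data directory, with one subfolder per class. The sketch below checks for that layout before training starts. It assumes the unpacked archive yields a `dog_cat_1w/` directory arranged this way, which the original post does not state explicitly; the function name `check_imagefolder_layout` is made up for illustration.

```shell
# Verify that a dataset directory has train/ and val/ subdirectories,
# each containing at least one class folder.
# Arguments: $1 = dataset root, e.g. dog_cat_1w (assumed layout).
check_imagefolder_layout() {
    data_dir="$1"
    for split in train val; do
        if [ ! -d "$data_dir/$split" ]; then
            echo "missing: $data_dir/$split" >&2
            return 1
        fi
        # Count immediate subdirectories (one per class).
        n_classes="$(find "$data_dir/$split" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' ')"
        if [ "$n_classes" -eq 0 ]; then
            echo "no class folders in $data_dir/$split" >&2
            return 1
        fi
        echo "$split: $n_classes class folder(s)"
    done
}

# Typical use after unzip:
#   check_imagefolder_layout dog_cat_1w
```

If the check fails, inspect the unpacked directory and adjust the data path passed to main.py accordingly.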
After setting up the environment and debugging the code, you can run the training task in the Notebook with the following commands:
cd /home/ma-user/work/dog_cat_case
/home/ma-user/anaconda3/envs/PyTorch-1.8/bin/python main.py -a resnet50 -b 128 --epochs 5 dog_cat_1w/
3. Save the runtime environment
Follow the steps below to save the current Notebook environment.

Fill in the name and remarks :

The Notebook instance may be unavailable for 3-10 minutes while the image is being saved, so please be patient. After saving, you can view the image under the corresponding organization in SWR, and you can also find it when creating a training task.
Note:
Pay particular attention to the contents of /cache and /home/ma-user/work. The /cache directory is reloaded during training, so its contents are overwritten; the contents of /home/ma-user/work live on the storage mounted in the Notebook and are not saved in the container image. It is recommended not to store any data in /cache, and to keep temporary data and code in /home/ma-user/work.
Batch-run stage:
1. Create a training task
Access link :https://console.huaweicloud.com/modelarts/?region=cn-north-4#/training
Pay attention to the following parameters when creating the task:
a. Image tag: enter the name of the Notebook image you just saved.
b. Start command: the same as the one tested in the Notebook terminal, namely:
cd /home/ma-user/work/dog_cat_case
/home/ma-user/anaconda3/envs/PyTorch-1.8/bin/python main.py -a resnet50 -b 128 --epochs 5 dog_cat_1w/
c. Cloud storage mount path: the same as the working directory mounted in the Notebook, namely “/home/ma-user/work”.
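Since the start command must match what was tested in the Notebook, one option is to keep it in a single place on the shared directory and reuse it from both sides. The sketch below is only an illustration of that idea: the variable names `WORK_DIR`, `PYTHON_BIN`, and `EPOCHS` and the function names are hypothetical, while the paths and arguments are the ones used in this walkthrough.

```shell
# Paths taken from the walkthrough; override them if your environment differs.
WORK_DIR="${WORK_DIR:-/home/ma-user/work/dog_cat_case}"
PYTHON_BIN="${PYTHON_BIN:-/home/ma-user/anaconda3/envs/PyTorch-1.8/bin/python}"

# Print the training command so it can be compared between the
# Notebook terminal and the training task's start command.
print_train_cmd() {
    echo "cd $WORK_DIR && $PYTHON_BIN main.py -a resnet50 -b 128 --epochs ${EPOCHS:-5} dog_cat_1w/"
}

# Run the training command from the shared working directory.
run_train() {
    cd "$WORK_DIR" || return 1
    "$PYTHON_BIN" main.py -a resnet50 -b 128 --epochs "${EPOCHS:-5}" dog_cat_1w/
}
```

Saved as a script under /home/ma-user/work with a final `run_train` line, the same file could be invoked from the Notebook terminal and from the training task, so the two stages cannot drift apart.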

2. View the execution details of the training task
Click the task name in the task list to open the details page, which shows the task log and GPU utilization.
If the logs are written synchronously to the SFS disk, you can analyze them in real time with TensorBoard in the Notebook (using the official PyTorch image).
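A minimal sketch of starting TensorBoard from the Notebook terminal follows. It assumes that the training script has been modified to write TensorBoard event files to a directory on the shared mount (the unmodified example main.py does not do this) and that the `tensorboard` command is available in the image. The log directory path and the names `TB_LOGDIR`, `TB_PORT`, and `start_tensorboard` are hypothetical.

```shell
# Hypothetical log directory on the shared SFS mount; the original example
# script does not write TensorBoard event files by itself.
TB_LOGDIR="${TB_LOGDIR:-/home/ma-user/work/dog_cat_case/logs}"
TB_PORT="${TB_PORT:-6006}"

# Start TensorBoard against the shared log directory, if it exists.
start_tensorboard() {
    if [ ! -d "$TB_LOGDIR" ]; then
        echo "log directory not found: $TB_LOGDIR" >&2
        return 1
    fi
    tensorboard --logdir "$TB_LOGDIR" --port "$TB_PORT"
}

# Typical use inside the Notebook terminal:
#   start_tensorboard
```

Because the training task and the Notebook mount the same directory, new event files written by a running task should show up in TensorBoard without copying anything.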
3. Rebuild the training task
As development requires, modify the code and configuration (using the Notebook, or an ECS with the SFS disk mounted; note that the Linux uid must be 1000), then rebuild the training task. The rebuild procedure is shown in the screenshot:
