K8S Cluster Provisioning on Azure
This guide shows how to build a MARO cluster in k8s mode on Azure and run your training jobs in a distributed environment.
Prerequisites
Install Docker and configure it so that it can be managed as a non-root user.
Download AzCopy, then move the AzCopy executable to a folder on your system path (such as /usr/local/bin), or add the directory containing the AzCopy executable to your system path:
# Take AzCopy version 10.6.0 as an example
# Linux
tar xvf ./azcopy_linux_amd64_10.6.0.tar.gz; cp ./azcopy_linux_amd64_10.6.0/azcopy /usr/local/bin
# macOS (may require adjusting the macOS Security & Privacy settings)
unzip ./azcopy_darwin_amd64_10.6.0.zip; cp ./azcopy_darwin_amd64_10.6.0/azcopy /usr/local/bin
# Windows
# 1. Unzip ./azcopy_windows_amd64_10.6.0.zip
# 2. Add the path of ./azcopy_windows_amd64_10.6.0 folder to your Environment Variables
# Ref: https://superuser.com/questions/949560/how-do-i-set-system-environment-variables-in-windows-10
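Once both prerequisites are installed, a quick check that the executables are reachable from your system path can save a failed provisioning run later. The following is a minimal sketch; `missing_tools` is a hypothetical helper written for this guide, not part of MARO, Docker, or AzCopy.

```shell
# Print the name of every given command that cannot be found on PATH.
# Returns non-zero if at least one command is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Usage:
#   missing_tools docker azcopy || echo "install the tools listed above first"
```

An empty output means every listed command was found.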
Cluster Management
Create a cluster with a deployment
# Create a k8s cluster
maro k8s create ./k8s-azure-create.yml
Scale the cluster
# Scale nodes with 'Standard_D4s_v3' specification to 2
maro k8s node scale my_k8s_cluster Standard_D4s_v3 2
Check VM Size to see more node specifications.
Delete the cluster
# Delete a k8s cluster
maro k8s delete my_k8s_cluster
Run Job
Push your training image
# Push image 'my_image' to the cluster
maro k8s image push my_k8s_cluster --image-name my_image
Push your training data
# Push data under './my_training_data' to a relative path '/my_training_data' in the cluster
# You can then assign your mapping location in the start-job deployment
maro k8s data push my_k8s_cluster ./my_training_data/* /my_training_data
Start a training job with a deployment
# Start a training job with a start-job deployment
maro k8s job start my_k8s_cluster ./k8s-start-job.yml
Or, schedule batch jobs with a deployment
# Start a training schedule with a start-schedule deployment
maro k8s schedule start my_k8s_cluster ./k8s-start-schedule.yml
Get the logs of the job
# Logs will be exported to current directory
maro k8s job logs my_k8s_cluster my_job_1
List the current status of the job
# List current status of jobs
maro k8s job list my_k8s_cluster my_job_1
Stop a training job
# Stop a training job
maro k8s job stop my_k8s_cluster my_job_1
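The push and start steps above are usually run together, so they can be wrapped in a small script. This is only a sketch: `run` and `submit_job` are hypothetical helpers, the `DRY_RUN` variable is an assumption of this example, and the only MARO commands used are the ones shown in this section.

```shell
# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# submit_job CLUSTER IMAGE DATA_DIR DEPLOYMENT
# Pushes the image, pushes the contents of DATA_DIR to a path of the same
# name in the cluster, then starts the job. Stops at the first failure.
submit_job() {
    cluster=$1 image=$2 data_dir=$3 deployment=$4
    run maro k8s image push "$cluster" --image-name "$image" &&
    run maro k8s data push "$cluster" "$data_dir"/* "/$(basename "$data_dir")" &&
    run maro k8s job start "$cluster" "$deployment"
}

# Usage (prints the commands without contacting the cluster):
#   DRY_RUN=1 submit_job my_k8s_cluster my_image ./my_training_data ./k8s-start-job.yml
```

Running with `DRY_RUN=1` first lets you confirm the expanded file list and target path before anything is uploaded.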
Sample Deployments
k8s-azure-create
mode: k8s
name: my_k8s_cluster
cloud:
  infra: azure
  location: eastus
  resource_group: my_k8s_resource_group
  subscription: my_subscription
user:
  admin_public_key: "{ssh public key with 'ssh-rsa' prefix}"
  admin_username: admin
master:
  node_size: Standard_D2s_v3
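The `admin_public_key` field expects a public key that begins with `ssh-rsa`. As a hedged sketch, the helper below checks that prefix before you paste a key into the deployment. `is_ssh_rsa_key` is a hypothetical function, and the key path in the usage comment is only an example location.

```shell
# Succeed only if the given public-key string starts with "ssh-rsa ".
is_ssh_rsa_key() {
    case "$1" in
        "ssh-rsa "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (the key path is an assumption; use any location you prefer):
#   ssh-keygen -t rsa -b 4096 -f ~/.ssh/maro_k8s
#   is_ssh_rsa_key "$(cat ~/.ssh/maro_k8s.pub)" && echo "key format looks right"
```

The check only looks at the prefix; it does not validate the key material itself.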
k8s-start-job
mode: k8s
name: my_job_1
components:
  actor:
    command: ["bash", "{project root}/my_training_data/actor.sh"]
    image: my_image
    mount:
      target: "{project root}"
    num: 5
    resources:
      cpu: 2
      gpu: 0
      memory: 2048m
  learner:
    command: ["bash", "{project root}/my_training_data/learner.sh"]
    image: my_image
    mount:
      target: "{project root}"
    num: 1
    resources:
      cpu: 2
      gpu: 0
      memory: 2048m
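Before starting a job, it can help to estimate how much capacity the deployment asks for and compare it with the nodes you scaled to. The sketch below assumes that `num` is the replica count and that the `cpu` and `memory` values apply to each replica; this guide does not state that explicitly. `total_request` is a hypothetical helper, not a MARO command.

```shell
# total_request ACTOR_NUM ACTOR_VALUE LEARNER_NUM LEARNER_VALUE
# Assumption: every replica requests the full per-component value.
total_request() {
    echo $(( $1 * $2 + $3 * $4 ))
}

# Usage with the sample values (5 actors, 1 learner):
#   total_request 5 2 1 2          # total CPUs
#   total_request 5 2048 1 2048    # total memory, in the deployment's own unit
```

With the sample values this gives 12 CPUs and 12288 memory units in total.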
k8s-start-schedule
mode: k8s
name: my_schedule_1
job_names:
  - my_job_2
  - my_job_3
  - my_job_4
  - my_job_5
components:
  actor:
    command: ["bash", "{project root}/my_training_data/actor.sh"]
    image: my_image
    mount:
      target: "{project root}"
    num: 5
    resources:
      cpu: 2
      gpu: 0
      memory: 2048m
  learner:
    command: ["bash", "{project root}/my_training_data/learner.sh"]
    image: my_image
    mount:
      target: "{project root}"
    num: 1
    resources:
      cpu: 2
      gpu: 0
      memory: 2048m
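A schedule lists several job names, so collecting logs means repeating the `maro k8s job logs` command once per job. The sketch below only prints those commands. It assumes that jobs created by a schedule can be queried by the names given in `job_names`, which this guide does not confirm, and `logs_commands` is a hypothetical helper.

```shell
# logs_commands CLUSTER JOB...
# Print one `maro k8s job logs` command per job name, without running it.
logs_commands() {
    cluster=$1
    shift
    for job in "$@"; do
        echo "maro k8s job logs $cluster $job"
    done
}

# Usage:
#   logs_commands my_k8s_cluster my_job_2 my_job_3 my_job_4 my_job_5
```

Review the printed commands first; once they look right, they can be run one by one or piped to `sh`.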