K8S Cluster Provisioning on Azure

This guide shows how to build a MARO cluster in k8s mode on Azure and run training jobs in a distributed environment.

Prerequisites

# Install AzCopy (version 10.6.0 is used here as an example)

# Linux
tar xvf ./azcopy_linux_amd64_10.6.0.tar.gz; cp ./azcopy_linux_amd64_10.6.0/azcopy /usr/local/bin

# macOS (may require adjusting the Security & Privacy settings)
unzip ./azcopy_darwin_amd64_10.6.0.zip; cp ./azcopy_darwin_amd64_10.6.0/azcopy /usr/local/bin

# Windows
# 1. Unzip ./azcopy_windows_amd64_10.6.0.zip
# 2. Add the path of ./azcopy_windows_amd64_10.6.0 folder to your Environment Variables
# Ref: https://superuser.com/questions/949560/how-do-i-set-system-environment-variables-in-windows-10
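
Before moving on, it can help to confirm that the tools are reachable from your shell. The sketch below is a minimal check and not part of MARO: `require_tool` is a hypothetical helper name, and the loop assumes the `maro` CLI used in the rest of this guide is already installed.

```shell
# Print "found: <tool>" if the tool is on PATH; otherwise report it on stderr
# and return a non-zero status. (Hypothetical helper, for illustration only.)
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# azcopy comes from the steps above; maro is the CLI used throughout this guide.
for tool in azcopy maro; do
    require_tool "$tool" || true
done
```

If `azcopy` is reported as missing, the usual cause is that the folder it was copied or unzipped to is not on your `PATH`.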

Cluster Management

  • Create a cluster with a deployment

    # Create a k8s cluster
    maro k8s create ./k8s-azure-create.yml
    
  • Scale the cluster

    # Scale nodes with 'Standard_D4s_v3' specification to 2
    maro k8s node scale my_k8s_cluster Standard_D4s_v3 2
    

    See the Azure VM sizes documentation for more node specifications.

  • Delete the cluster

    # Delete a k8s cluster
    maro k8s delete my_k8s_cluster
    

Run Job

  • Push your training image

    # Push image 'my_image' to the cluster
    maro k8s image push my_k8s_cluster --image-name my_image
    
  • Push your training data

    # Push data under './my_training_data' to a relative path '/my_training_data' in the cluster
    # You can then assign your mapping location in the start-job deployment
    maro k8s data push my_k8s_cluster ./my_training_data/* /my_training_data
    
  • Start a training job with a deployment

    # Start a training job with a start-job deployment
    maro k8s job start my_k8s_cluster ./k8s-start-job.yml
    
  • Or, schedule batch jobs with a deployment

    # Start a training schedule with a start-schedule deployment
    maro k8s schedule start my_k8s_cluster ./k8s-start-schedule.yml
    
  • Get the logs of the job

    # Logs will be exported to the current directory
    maro k8s job logs my_k8s_cluster my_job_1
    
  • List the current status of the job

    # List current status of jobs
    maro k8s job list my_k8s_cluster my_job_1
    
  • Stop a training job

    # Stop a training job
    maro k8s job stop my_k8s_cluster my_job_1
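
The steps above can be chained into one script. The following is a minimal sketch that reuses only the commands shown in this section; the `run` and `submit_job` helpers and the `CLUSTER`, `JOB`, and `DRY_RUN` variables are hypothetical names introduced for illustration. With `DRY_RUN=1` (the default here) it only prints the commands, so you can review them before running anything against a real cluster.

```shell
set -e  # stop at the first failing command

# Hypothetical settings for this sketch; override them from the environment.
CLUSTER="${CLUSTER:-my_k8s_cluster}"
JOB="${JOB:-my_job_1}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Push image and data, start the job, then list its status.
submit_job() {
    run maro k8s image push "$CLUSTER" --image-name my_image
    # The glob is left unquoted, exactly as in the command shown above.
    run maro k8s data push "$CLUSTER" ./my_training_data/* /my_training_data
    run maro k8s job start "$CLUSTER" ./k8s-start-job.yml
    run maro k8s job list "$CLUSTER" "$JOB"
}

submit_job
```

Setting `DRY_RUN=0` in the environment makes the same script execute the commands instead of printing them.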
    

Sample Deployments

k8s-azure-create

mode: k8s
name: my_k8s_cluster

cloud:
  infra: azure
  location: eastus
  resource_group: my_k8s_resource_group
  subscription: my_subscription

user:
  admin_public_key: "{ssh public key with 'ssh-rsa' prefix}"
  admin_username: admin

master:
  node_size: Standard_D2s_v3
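
The `admin_public_key` field expects a public key with the 'ssh-rsa' prefix. If you do not have one yet, the sketch below shows one possible way to create and check a key with the standard `ssh-keygen` tool. The `is_ssh_rsa_key` helper and the `maro_admin_key` file name are hypothetical and used only for illustration; the 4096-bit size is an assumption, not a MARO requirement.

```shell
# Succeed only if the given string starts with the 'ssh-rsa ' prefix.
# (Hypothetical helper, for illustration only.)
is_ssh_rsa_key() {
    case "$1" in
        "ssh-rsa "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Generate an RSA key pair without a passphrase in a temporary directory,
# then check the public half before pasting it into admin_public_key.
if command -v ssh-keygen >/dev/null 2>&1; then
    key_dir="$(mktemp -d)"
    ssh-keygen -q -t rsa -b 4096 -N "" -f "$key_dir/maro_admin_key"
    pub_key="$(cat "$key_dir/maro_admin_key.pub")"
    if is_ssh_rsa_key "$pub_key"; then
        echo "public key looks usable: $key_dir/maro_admin_key.pub"
    fi
fi
```

Keys of other types, such as `ssh-ed25519`, are rejected by this check because they do not carry the 'ssh-rsa' prefix the sample deployment asks for.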

k8s-start-job

mode: k8s
name: my_job_1

components:
  actor:
    command: ["bash", "{project root}/my_training_data/actor.sh"]
    image: my_image
    mount:
      target: "{project root}"
    num: 5
    resources:
      cpu: 2
      gpu: 0
      memory: 2048m
  learner:
    command: ["bash", "{project root}/my_training_data/learner.sh"]
    image: my_image
    mount:
      target: "{project root}"
    num: 1
    resources:
      cpu: 2
      gpu: 0
      memory: 2048m

k8s-start-schedule

mode: k8s
name: my_schedule_1

job_names:
  - my_job_2
  - my_job_3
  - my_job_4
  - my_job_5

components:
  actor:
    command: ["bash", "{project root}/my_training_data/actor.sh"]
    image: my_image
    mount:
      target: "{project root}"
    num: 5
    resources:
      cpu: 2
      gpu: 0
      memory: 2048m
  learner:
    command: ["bash", "{project root}/my_training_data/learner.sh"]
    image: my_image
    mount:
      target: "{project root}"
    num: 1
    resources:
      cpu: 2
      gpu: 0
      memory: 2048m