Azure ML Distributed Training: PyTorchConfiguration with DDP and NCCL

Symptoms:

Issue #1:

RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

Issue #2:

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code XXXXXXXXXX to authenticate.

Solutions:

Create control script: ctrl-train-gpu-cluster.py

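from azureml.core import Workspace, Experiment, ScriptRunConfig
from azureml.core.runconfig import PyTorchConfiguration
# load the workspace (assumes a config.json for the workspace is available locally)
ws = Workspace.from_config()
# get compute target
target = ws.compute_targets['gpu-cluster-prd']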
# get registered environment
env = ws.environments['AzureML-pytorch-1.7-ubuntu18.04-py37-cuda11-gpu']
# get/create experiment
exp = Experiment(ws, 'train-gpu-cluster-prd')
# Run one process per node: set process_count equal to the node_count of the target compute cluster
distr_config = PyTorchConfiguration(communication_backend='Nccl', process_count=10, node_count=10)
# set up script run configuration
config = ScriptRunConfig(
    source_directory='.',
    script='main_dist.py',
    compute_target=target,
    environment=env,
    distributed_job_config=distr_config,
    arguments=['--saveEpoch', 1, '--maxEpoch', 1],
)
# submit script to AML
run = exp.submit(config)
print(run.get_portal_url()) # link to ml.azure.com
run.wait_for_completion(show_output=True)
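
The control script sets process_count equal to node_count, i.e. one worker process per node. A minimal sketch of that arithmetic, assuming the usual DDP layout of one process per GPU; the helper name `ddp_process_count` and the `gpus_per_node` parameter are hypothetical and not part of the Azure ML SDK:

```python
def ddp_process_count(node_count: int, gpus_per_node: int = 1) -> int:
    """Total DDP worker processes, assuming one process per GPU (an assumption)."""
    if node_count < 1 or gpus_per_node < 1:
        raise ValueError("node_count and gpus_per_node must be positive")
    return node_count * gpus_per_node


# Single-GPU nodes: process_count equals node_count, as in the control script.
print(ddp_process_count(node_count=10))                   # 10
# Hypothetical cluster with 4 GPUs on each of 2 nodes.
print(ddp_process_count(node_count=2, gpus_per_node=4))   # 8
```

The result would be passed as process_count to PyTorchConfiguration alongside the matching node_count.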

Create main_dist.py

import os
import argparse
from collections import defaultdict
import torch
import torch.distributed as dist
from azureml.core import Workspace, Dataset
from azureml.core.authentication import ServicePrincipalAuthentication
# To resolve Issue #2, perform Service Principal Authentication here
svc_pr = ServicePrincipalAuthentication(
    tenant_id="98dd2ee3-xxxx-xxxx-xxxx-0e5369a71eb1",
    service_principal_id="ca909551-xxxx-xxxx-xxxx-b465fbf31f6f",
    service_principal_password='~.xxxxx-c~xxx-p4FAiQGfP0DbgEOg53V.')
ws = Workspace(
    subscription_id="4616643d-xxxx-xxxx-xxxx-0467f4bac6bf",
    resource_group="XXXX",
    workspace_name="ml-ws-gpu",
    auth=svc_pr
    )
dataset = Dataset.get_by_name(ws, name='rpsf-trainset')
# Alternative to mounting: download the dataset
# dataset.download(target_path='./data/train', overwrite=True)
# Create a mount context and mount the dataset
mount_ctx = dataset.mount()
mount_ctx.start()
# Get the mount point
rpsf_trainset_path = mount_ctx.mount_point
def init_DDP(opt):
    # Let DDP initialize with default options
    dist.init_process_group('nccl')


def learn_localization(rank, world_size, opt, setup_params):
    opt.rank = rank
    opt.world_size = world_size
    init_DDP(opt)
    ...


# The functions above must be defined before this block runs
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='training project')
    parser.add_argument('--maxEpoch', type=int, default=30, help='number of training epochs')
    parser.add_argument('--saveEpoch', type=int, default=3, help='save the model every saveEpoch epochs')
    parser.add_argument('--training_data_path', type=str, default=rpsf_trainset_path, help='path for training and validation data')
    opt = parser.parse_args()
    # copy the parsed arguments into setup_params
    setup_params = defaultdict(str)
    for arg in vars(opt):
        setup_params[arg] = getattr(opt, arg)
    # These environment variables are set by Azure ML for each worker process
    world_size = int(os.environ["WORLD_SIZE"])
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    learn_localization(rank, world_size, opt, setup_params)
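
main_dist.py relies on WORLD_SIZE, RANK and LOCAL_RANK being present in the environment of every worker. A minimal sketch of reading them defensively, so that a missing variable fails with one clear message instead of a bare KeyError; `DistEnv` and `read_dist_env` are hypothetical names introduced here for illustration, not part of PyTorch or the Azure ML SDK:

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    world_size: int
    rank: int
    local_rank: int


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read the three launcher-provided variables and sanity-check them."""
    required = ("WORLD_SIZE", "RANK", "LOCAL_RANK")
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError(f"missing distributed environment variables: {missing}")
    dist_env = DistEnv(*(int(env[name]) for name in required))
    # A global rank must lie in [0, world_size).
    if not 0 <= dist_env.rank < dist_env.world_size:
        raise RuntimeError(f"rank {dist_env.rank} outside world of size {dist_env.world_size}")
    return dist_env


# Example with an explicit mapping instead of the real environment.
example = read_dist_env({"WORLD_SIZE": "10", "RANK": "3", "LOCAL_RANK": "0"})
print(example.world_size, example.rank, example.local_rank)
```

Inside the script, calling `read_dist_env()` with no argument would read from `os.environ` in place of the three separate lookups.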

References:

For GPU VM selection:

To choose a VM series for an Azure ML compute cluster, see the official documentation on GPU VM sizes:

https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-gpu

The lowest unit price I found was the NC6_Promo series, at 0.4 USD/hour:

https://azure.microsoft.com/en-us/updates/gpu-and-hpc-vm-price-promotion-now-available/

The GPU-enabled NC-series is only available in certain regions, e.g. East US:

https://azure.microsoft.com/en-us/global-infrastructure/services/?products=virtual-machines

For Issue #1, a code sample that uses the AzureML SDK:

https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/ml-frameworks/pytorch/distributed-pytorch-with-distributeddataparallel

For Issue #2, Authentication:

We don't want interactive authentication to be triggered in the middle of a script run. In my testing, the Service Principal method worked best:

https://medium.com/microsoftazure/how-to-authenticate-into-azure-machine-learning-using-the-r-sdk-3ff930697bd1
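
One possible refinement, sketched under assumptions: rather than hard-coding the service principal values in main_dist.py, read them from environment variables. The variable names `AML_TENANT_ID`, `AML_SP_ID` and `AML_SP_PASSWORD` and both helper functions are hypothetical; how the variables get set on the compute target is outside this sketch.

```python
import os
from typing import Dict, Mapping

# Hypothetical environment variable names mapped to the keyword arguments
# that ServicePrincipalAuthentication takes in the script above.
_ENV_TO_KWARG = {
    "AML_TENANT_ID": "tenant_id",
    "AML_SP_ID": "service_principal_id",
    "AML_SP_PASSWORD": "service_principal_password",
}


def sp_auth_kwargs(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the service principal settings, failing early if any is unset."""
    missing = [name for name in _ENV_TO_KWARG if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing service principal variables: {missing}")
    return {kwarg: env[name] for name, kwarg in _ENV_TO_KWARG.items()}


def make_service_principal_auth(env: Mapping[str, str] = os.environ):
    # Imported lazily so the helper above can be used without the SDK installed.
    from azureml.core.authentication import ServicePrincipalAuthentication
    return ServicePrincipalAuthentication(**sp_auth_kwargs(env))


# Example with placeholder values; only the keyword names are printed.
demo = sp_auth_kwargs({"AML_TENANT_ID": "t", "AML_SP_ID": "i", "AML_SP_PASSWORD": "p"})
print(sorted(demo))
```

The object returned by `make_service_principal_auth()` would be passed as `auth=` to `Workspace(...)`, in the same place as `svc_pr` in main_dist.py.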