Symptoms:
Issue #1:
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Issue #2:
Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code XXXXXXXXXX to authenticate.
Solutions:
Create control script: ctrl-train-gpu-cluster.py
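from azureml.core import Workspace, Experiment, ScriptRunConfig
from azureml.core.runconfig import PyTorchConfiguration
# load the workspace (assumes a config.json for the workspace is available locally)
ws = Workspace.from_config()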
# get compute target
target = ws.compute_targets['gpu-cluster-prd']
# get registered environment
env = ws.environments['AzureML-pytorch-1.7-ubuntu18.04-py37-cuda11-gpu']
# get/create experiment
exp = Experiment(ws, 'train-gpu-cluster-prd')
# Launch one process per node: set process_count equal to node_count of the target compute cluster
distr_config = PyTorchConfiguration(communication_backend='Nccl', process_count=10, node_count=10)
# set up script run configuration
config = ScriptRunConfig(
    source_directory='.',
    script='main_dist.py',
    compute_target=target,
    environment=env,
    distributed_job_config=distr_config,
    arguments=[
        '--saveEpoch', 1,
        '--maxEpoch', 1],
)
# submit script to AML
run = exp.submit(config)
print(run.get_portal_url()) # link to ml.azure.com
run.wait_for_completion(show_output=True)
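The relationship between process_count and node_count in the control script can be sketched as a small helper. This is a minimal sketch, assuming one worker process per GPU; `total_process_count` and its `gpus_per_node` parameter are hypothetical names, not part of the Azure ML SDK, and the GPU count must match the VM size the cluster actually uses.

```python
def total_process_count(node_count: int, gpus_per_node: int = 1) -> int:
    """Return the process_count for one worker process per GPU.

    Hypothetical helper (not part of the Azure ML SDK); gpus_per_node is an
    assumption that depends on the VM size of the compute cluster.
    """
    if node_count < 1 or gpus_per_node < 1:
        raise ValueError("node_count and gpus_per_node must be >= 1")
    return node_count * gpus_per_node


# With one GPU per node, process_count equals node_count, as in the script above.
print(total_process_count(node_count=10))
```

The result would be passed as `process_count` to `PyTorchConfiguration`, alongside the same `node_count`.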
Create main_dist.py
import os
import argparse
from collections import defaultdict
import torch
import torch.distributed as dist
from azureml.core import Workspace, Dataset
from azureml.core.authentication import ServicePrincipalAuthentication
# To resolve Issue #2, authenticate with a service principal instead of the interactive login
svc_pr = ServicePrincipalAuthentication(
    tenant_id="98dd2ee3-xxxx-xxxx-xxxx-0e5369a71eb1",
    service_principal_id="ca909551-xxxx-xxxx-xxxx-b465fbf31f6f",
    service_principal_password='~.xxxxx-c~xxx-p4FAiQGfP0DbgEOg53V.')
ws = Workspace(
    subscription_id="4616643d-xxxx-xxxx-xxxx-0467f4bac6bf",
    resource_group="XXXX",
    workspace_name="ml-ws-gpu",
    auth=svc_pr
)
dataset = Dataset.get_by_name(ws, name='rpsf-trainset')
# Alternative to mounting: download the dataset to local disk
# dataset.download(target_path='./data/train', overwrite=True)
# Create a mount context and mount the dataset
mount_ctx = dataset.mount()
mount_ctx.start()
# Get the mount point
rpsf_trainset_path = mount_ctx.mount_point
def init_DDP(opt):
    # Let DDP initialize with default options; the default env:// method
    # reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment
    dist.init_process_group('nccl')

def learn_localization(rank, world_size, opt, setup_params):
    opt.rank = rank
    opt.world_size = world_size
    init_DDP(opt)
    ...

# The functions above must be defined before this block calls them
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='training project')
    parser.add_argument('--maxEpoch', type=int, default=30, help='number of training epochs')
    parser.add_argument('--saveEpoch', type=int, default=3, help='save the model every saveEpoch epochs')
    parser.add_argument('--training_data_path', type=str, default=rpsf_trainset_path, help='path for training and validation data')
    opt = parser.parse_args()
    # copy the parsed arguments into setup_params
    setup_params = defaultdict(str)
    for arg in vars(opt):
        setup_params[arg] = getattr(opt, arg)
    # Azure ML sets these environment variables for each launched process
    world_size = int(os.environ["WORLD_SIZE"])
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    learn_localization(rank, world_size, opt, setup_params)
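Issue #1 appears when the NCCL backend is initialized in a process that cannot see a GPU. The sketch below is an illustration, not part of the original script: it reads the rank variables with single-process defaults and picks a backend from a GPU-availability flag. `read_dist_env` and `choose_backend` are hypothetical helper names; in main_dist.py the flag would presumably come from `torch.cuda.is_available()`.

```python
import os


def read_dist_env(environ=os.environ):
    """Read rank info, defaulting to a single local process when unset."""
    return {
        "world_size": int(environ.get("WORLD_SIZE", "1")),
        "rank": int(environ.get("RANK", "0")),
        "local_rank": int(environ.get("LOCAL_RANK", "0")),
    }


def choose_backend(cuda_available: bool) -> str:
    """Use NCCL only when a GPU is visible; otherwise fall back to Gloo."""
    return "nccl" if cuda_available else "gloo"


# Example with explicit values instead of the real process environment.
info = read_dist_env({"WORLD_SIZE": "10", "RANK": "3", "LOCAL_RANK": "0"})
print(info, choose_backend(cuda_available=False))
```

Inside init_DDP the chosen name could be passed as the first argument of `dist.init_process_group`. A Gloo fallback lets the job start on a CPU-only node but hides the misconfiguration, so on a GPU cluster it may be preferable to log the chosen backend or fail loudly instead.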
References:
For GPU VM selection:
For Azure ML compute cluster VM selection, see the official documentation on GPU-optimized VM sizes:
https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-gpu
The lowest unit price found was the NC6_Promo series at 0.4 USD/hour:
https://azure.microsoft.com/en-us/updates/gpu-and-hpc-vm-price-promotion-now-available/
The GPU-enabled NC-series is only available in certain regions, e.g. East US:
https://azure.microsoft.com/en-us/global-infrastructure/services/?products=virtual-machines
For Issue #1, a code sample using the Azure ML SDK:
For Issue #2, Authentication:
We don't want interactive authentication to interrupt a script run. In testing, the Service Principal method worked best:
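One way to keep the service principal secret out of main_dist.py is to read it from environment variables. This is a minimal sketch, assuming the variable names `AML_TENANT_ID`, `AML_SP_ID` and `AML_SP_PASSWORD` (hypothetical, not defined by Azure ML) have been made available to the run; the returned dictionary uses the same keyword names that `ServicePrincipalAuthentication` takes in the script above.

```python
import os

# Hypothetical variable names; Azure ML does not define them.
REQUIRED_VARS = {
    "tenant_id": "AML_TENANT_ID",
    "service_principal_id": "AML_SP_ID",
    "service_principal_password": "AML_SP_PASSWORD",
}


def load_sp_credentials(environ=os.environ):
    """Collect service principal settings, failing early if any is missing."""
    missing = [var for var in REQUIRED_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: environ[var] for key, var in REQUIRED_VARS.items()}
```

In the script, the call would then look like `ServicePrincipalAuthentication(**load_sp_credentials())`, so a missing setting raises an error instead of falling back to the interactive device login.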