Cloud · Tech Misc

Troubleshooting an Outlook add-in error: “Please sideload your add-in to see app body.”; unable to right-click for “Inspect”

Symptom:

We developed an Outlook add-in with React.js and Node.js and deployed it through M365 to thousands of users. It works on most users’ devices, but about five users reported that the add-in was non-functional: no button clicks worked, some users saw a white screen with no front-end components loaded, and some saw the error message “Please sideload your add-in to see app body.”

Furthermore, on these users’ machines we could not even right-click to open the context menu, nor diagnose any further with the browser developer tools via “Inspect.”

Solution:

We found the crucial missing component: the WebView2 runtime needed to be installed on those users’ devices.

WebView2 – Microsoft Edge Developer

After installing it, the add-in worked properly for all affected users.

The information came from the documentation provided by Microsoft support:

Browsers and webview controls used by Office Add-ins – Office Add-ins | Microsoft Learn

Debug Office Add-ins – Office Add-ins | Microsoft Learn

Cloud · Data Engineering

Logic App unable to call a backend service API through a private endpoint (PEC), error: “BadRequest Http request failed as there is an error: The SSL connection could not be established.”

Background:

We have an Azure Logic App that calls an Azure Language Service API. With public access allowed on both sides, everything works fine. But when we set up a Private Endpoint Connection (PEC) for enterprise network isolation, we hit a number of errors. Here is our workaround.

Symptom:

BadRequest Http request failed as there is an error: The SSL connection could not be established.

The screenshot below shows that the certificate’s Subject Alternative Name only contains “*.cognitiveservices.azure.com” and does not contain “*.privatelink.cognitiveservices.azure.com”; hence the TLS handshake fails with a certificate error.
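To see why the handshake fails, here is a small illustrative sketch (the `san_matches` helper is ours, not part of any incident artifact) of single-label wildcard matching as TLS clients apply it: `*.cognitiveservices.azure.com` covers the public host name, but the extra `privatelink` label puts the private host name outside the wildcard.

```python
def san_matches(hostname: str, san_entries: list[str]) -> bool:
    """Simplified SAN check: a wildcard covers exactly one leading DNS label."""
    for san in san_entries:
        if san.startswith("*."):
            parts = hostname.split(".", 1)
            # match only if stripping one label leaves the wildcard's base domain
            if len(parts) == 2 and parts[1] == san[2:]:
                return True
        elif hostname == san:
            return True
    return False

sans = ["*.cognitiveservices.azure.com"]
print(san_matches("myres.cognitiveservices.azure.com", sans))              # True
print(san_matches("myres.privatelink.cognitiveservices.azure.com", sans))  # False
```

The second call fails because stripping one label still leaves `privatelink.cognitiveservices.azure.com`, which the certificate does not cover — exactly the handshake error we saw.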

Solution:

We set up DNS conditional forwarding on our internal DNS server so that the public name “*.cognitiveservices.azure.com” resolves to the private IP “10.xx.xx.xx”.

This way the DNS name matches the certificate and the IP routes to the private endpoint, while public access to the Language Service remains blocked.
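A quick way to confirm the forwarder took effect is to resolve the host name from inside the network and check that the answer is an RFC 1918 address. A minimal sketch using only the standard library (the function names are ours, purely illustrative):

```python
import ipaddress
import socket

def is_private_ip(ip: str) -> bool:
    """True if the address is in a private (RFC 1918 / RFC 4193) range."""
    return ipaddress.ip_address(ip).is_private

def resolves_privately(hostname: str) -> bool:
    # gethostbyname uses the OS resolver, i.e. the internal DNS server
    return is_private_ip(socket.gethostbyname(hostname))

print(is_private_ip("10.1.2.3"))    # True  (a private endpoint IP)
print(is_private_ip("20.42.65.92")) # False (a public Azure range)
# resolves_privately("myres.cognitiveservices.azure.com")  # run from inside the VNet
```

If `resolves_privately` returns False from inside the VNet, the conditional forwarder is not being consulted and the Logic App will still fail the TLS handshake against the public endpoint.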

Further follow up:

If your company does not have an internal DNS server, or no practice of setting up conditional forwarding, note that the root cause is really that Microsoft should include “*.privatelink.cognitiveservices.azure.com” in the certificate’s SAN list so that the PEC setup works natively. We have reported this to the Microsoft product team, but we are not sure how long a fix will take.

Cloud · Data Engineering

Azure ML Distributed training: PyTorchConfiguration DDP NCCL

Symptoms:

Issue #1:

RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

Issue #2:

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code XXXXXXXXXX to authenticate.

Solutions:

Create the control script ctrl-train-gpu-cluster.py:

from azureml.core import Workspace, Experiment, ScriptRunConfig, PyTorchConfiguration
# get workspace (the original listing used ws without obtaining it first)
ws = Workspace.from_config()
# get compute target
target = ws.compute_targets['gpu-cluster-prd']
# get registered environment
env = ws.environments['AzureML-pytorch-1.7-ubuntu18.04-py37-cuda11-gpu']
# get/create experiment
exp = Experiment(ws, 'train-gpu-cluster-prd')
# process_count = node_count here because each node in the target cluster has one GPU
distr_config = PyTorchConfiguration(communication_backend='Nccl', process_count=10, node_count=10)
# set up script run configuration
config = ScriptRunConfig(
    source_directory='.',
    script='main_dist.py',
    compute_target=target,
    environment=env,
    distributed_job_config=distr_config,
    arguments=['--saveEpoch', 1, '--maxEpoch', 1],
)
# submit script to AML
run = exp.submit(config)
print(run.get_portal_url())  # link to ml.azure.com
run.wait_for_completion(show_output=True)
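As a rule of thumb for PyTorchConfiguration, process_count should be node_count times the GPUs per node (one worker process per GPU); it equals node_count above only because NC6 nodes carry a single GPU. A tiny hypothetical helper (not part of the original script) makes the relationship explicit:

```python
def total_process_count(node_count: int, gpus_per_node: int) -> int:
    """One DDP worker process per GPU across the whole cluster."""
    return node_count * gpus_per_node

print(total_process_count(10, 1))  # 10 -> matches the NC6 cluster above
print(total_process_count(4, 4))   # 16 -> e.g. a 4-node cluster of 4-GPU VMs
```

If process_count is left smaller than the total GPU count, some GPUs simply sit idle; if it exceeds it, NCCL initialization can fail on the GPU-less surplus ranks.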

Create main_dist.py

import os
import argparse
from collections import defaultdict
import torch
import torch.distributed as dist
from azureml.core import Workspace, Dataset
from azureml.core.authentication import ServicePrincipalAuthentication
# To resolve Issue #2, authenticate with a Service Principal instead of interactively
svc_pr = ServicePrincipalAuthentication(
    tenant_id="98dd2ee3-xxxx-xxxx-xxxx-0e5369a71eb1",
    service_principal_id="ca909551-xxxx-xxxx-xxxx-b465fbf31f6f",
    service_principal_password='~.xxxxx-c~xxx-p4FAiQGfP0DbgEOg53V.')
ws = Workspace(
    subscription_id="4616643d-xxxx-xxxx-xxxx-0467f4bac6bf",
    resource_group="XXXX",
    workspace_name="ml-ws-gpu",
    auth=svc_pr,
    )
dataset = Dataset.get_by_name(ws, name='rpsf-trainset')
# dataset.download(target_path='./data/train', overwrite=True)
# Create mount context and mount the dataset
mount_ctx = dataset.mount()
mount_ctx.start()
# Get the mount point
rpsf_trainset_path = mount_ctx.mount_point

def init_DDP(opt):
    # Let DDP initialize with default options; NCCL needs visible GPUs (Issue #1)
    dist.init_process_group('nccl')

def learn_localization(rank, world_size, opt, setup_params):
    opt.rank = rank
    opt.world_size = world_size
    init_DDP(opt)
    ...  # rest of the training loop elided

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='training project')
    parser.add_argument('--maxEpoch', type=int, default=30, help='number of training epochs')
    parser.add_argument('--saveEpoch', type=int, default=3, help='save model every saveEpoch epochs')
    parser.add_argument('--training_data_path', type=str, default=rpsf_trainset_path, help='path for training and validation data')
    opt = parser.parse_args()
    # add parameters into setup_params
    setup_params = defaultdict(str)
    for arg in vars(opt):
        setup_params[arg] = getattr(opt, arg)
    # These environment variables are set by the Azure ML launcher
    world_size = int(os.environ["WORLD_SIZE"])
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    learn_localization(rank, world_size, opt, setup_params)

References:

For GPU VM selection:

For Azure ML compute cluster VM selection, here is the official documentation with VM series suggestions:

https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-gpu

We found the lowest unit price to be the NC6_Promo series at 0.4 USD/hour:

https://azure.microsoft.com/en-us/updates/gpu-and-hpc-vm-price-promotion-now-available/

The GPU-enabled NC-series is only available in certain regions, e.g. East US:

https://azure.microsoft.com/en-us/global-infrastructure/services/?products=virtual-machines

For Issue #1, the code sample of using AzureML SDK:

https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/ml-frameworks/pytorch/distributed-pytorch-with-distributeddataparallel

For Issue #2, Authentication:

We don’t want interactive authentication in the middle of a script run. In our tests, the Service Principal method worked best:

https://medium.com/microsoftazure/how-to-authenticate-into-azure-machine-learning-using-the-r-sdk-3ff930697bd1

Cloud · Data Engineering

Azure Synapse Workspace SQL pool: “Cannot connect to SQL Database” / “Login failed for user ''.”

To use the Synapse workspace’s native database as a data warehouse, we first need to create a SQL pool (not the “SQL on-demand” option).

After creation completes, we need to refresh the browser (Ctrl + F5) or log back into the Synapse workspace. The newly created SQL pool then automatically appears as a data store.

However, if you select it as Source or Destination, it throws error below:

Cannot connect to SQL Database: ‘tcp:xxx.sql.azuresynapse.net,1433’, Database: ‘testdb’, User: ''. Check the linked service configuration is correct, and make sure the SQL Database firewall allows the integration runtime to access. Login failed for user ''., SqlErrorNumber=18456,Class=14,State=1, . Activity ID: xxx

Solution:

While trying to create a support ticket for this issue, we finally got the hint from the known issues list:

Known issues

Creation of a new SQL pool might appear to have failed if your workspace name starts with a number. The SQL pool is actually created successfully, but the GRANT CONTROL setting is not applied correctly in this scenario. Follow the steps below to apply the proper “GRANT CONTROL” setting manually to your newly created SQL pool:

1. Click “Launch Synapse Studio” in the portal.
2. Click Data in the left navigation panel and navigate to the SQL pool that needs the setting in the Databases section.
3. Open a “New SQL Script” on the SQL pool.
4. Run the following in the pool to create a workspace managed identity user and assign control permissions to it:

CREATE USER [WorkspaceName] FROM EXTERNAL PROVIDER;
GRANT CONTROL TO [WorkspaceName];

After following these instructions, we were able to connect to the new SQL pool.