Unsupervised learning is a type of machine learning used to discover hidden patterns in data when we don't have any labels. The most common unsupervised learning method is cluster analysis, which groups data according to similarity. Some practical applications of clustering include social network analysis, document classification, rideshare data analysis and customer or market segmentation.

The most popular clustering algorithm is K-means. The goal of the K-means algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided.
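To make the iterative assignment concrete, here is a minimal pure-Python sketch of the standard K-means loop. It is only an illustration, not the implementation that Azure Machine Learning uses. The function name kmeans, the toy points, and the choice of the first K points as starting centroids are all assumptions made for this example.

```python
import math

def kmeans(points, k, iterations=100):
    """Toy K-means: returns (centroids, assignments)."""
    # Assumption for this sketch: use the first k points as starting centroids.
    centroids = [list(p) for p in points[:k]]
    assignments = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments

# Hypothetical toy data: two obvious groups in two dimensions.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
print(centroids)
```

Each pass repeats the same two steps, assign and update, until no point changes its group.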

Market or customer segmentation is a key component in any business strategy. In the following article I will show you how to deploy a market segmentation machine learning model for D365FO:


1. Data Preparation and Loading Data into Azure Machine Learning

1.1 Export data from D365FO

In this case we will use the Sales and Marketing report Customer statistics – Top 100. (For this example, we are using the standard demo data.)

1.2 Leave the parameters at their default values

1.3 Export the report as a CSV file

Import the data set into Azure Machine Learning

Now we will explore Azure Machine Learning Studio and build a clustering model using K-Means Clustering.


1.4 Go to https://studio.azureml.net

1.5 If you have used Azure ML before, log in; if not, sign up for an account (Free Workspace).

1.6 In the lower left corner, click New.

1.7 Go to the Dataset tab and select FROM LOCAL FILE.

1.8 Upload the Top 100.csv from your computer.


2. Build an Azure Machine Learning experiment

2.1   Once the upload is complete, click New in the lower left corner.

2.2   Go to the Experiment tab and select Blank Experiment.

2.3   Drag Top 100.csv onto the canvas.

To find it, expand Saved Datasets in the upper left corner and select My Datasets.

2.4   To visualize the data, right-click the small circle below the module and click Visualize.

2.5   You can then review the data. Look for the number of columns, number of rows, and what the data looks like for each of the rows.
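The same quick check can be done outside the Studio. The sketch below uses Python's standard csv module to count rows and columns. The sample text and its column names are hypothetical stand-ins, since the real columns depend on your D365FO export.

```python
import csv
import io

def summarize_csv(file_obj):
    """Return (column_names, row_count, first_row) for a CSV file object."""
    reader = csv.DictReader(file_obj)
    rows = list(reader)
    first_row = rows[0] if rows else None
    return reader.fieldnames, len(rows), first_row

# Hypothetical sample; for the real file use
# open("Top 100.csv", newline="") instead of io.StringIO(...).
sample = io.StringIO(
    "Customer,Revenue,Invoices\n"
    "C-001,1200.50,4\n"
    "C-002,310.00,1\n"
    "C-003,8450.75,12\n"
)
columns, row_count, first_row = summarize_csv(sample)
print(f"{len(columns)} columns: {columns}")
print(f"{row_count} rows; first row: {first_row}")
```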

2.6   Drag the Normalize Data module onto the canvas. As part of data preparation, we need to normalize the data, that is, put all columns on the same scale so that columns with large values do not dominate the distance calculation.

Connect the output of the Top 100 data set to the input of the Normalize Data module.


2.7 Set the Normalize Data module to the following properties:

ZScore converts all values to a z-score. The values in the column are transformed using the following formula:

z = (x - mean(x)) / stdev(x)
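As a rough illustration of what z-score normalization does to a column, each value has the column mean subtracted and is then divided by the column's standard deviation. The sketch below is not the Studio module itself. The revenue figures are hypothetical, and using the population standard deviation is an assumption of this example.

```python
import statistics

def zscore(values):
    """Rescale a column so it has mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    # Assumption: population standard deviation; a sample standard
    # deviation (statistics.stdev) would give slightly different numbers.
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(x - mean) / std for x in values]

# Hypothetical revenue column with large raw values.
revenue = [100.0, 200.0, 300.0, 400.0, 500.0]
scaled = zscore(revenue)
print([round(z, 3) for z in scaled])
```

After scaling, every column is centered on zero with a comparable spread, so no single column dominates the Euclidean distance.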

2.8   In the Launch column selector, select the following columns:

2.9   Drag two Train Clustering Model modules and the K-Means Clustering module onto the canvas. Connect the modules as follows and select the Launch column selector:

2.10   In the Launch column selector, select All columns.

2.11   On the left-side K-Means Clustering module, change the following properties:

Create trainer mode: Single Parameter

Number of Centroids: 4

Initialization: K-Means++

Random number seed: 12345

Metric: Euclidean

Iterations: 100

Note: The K-means++ algorithm is a variation of the standard K-means algorithm that uses a smarter initialization of the centroids. K-means++ also has the potential to reduce the total running time. The clusters are modelled using a measure of similarity, which is defined by a metric such as Euclidean or probabilistic distance. In this case we select the Euclidean distance, which for two points p and q is the length of the line segment connecting them. The Number of Centroids is the number of clusters. The Random number seed is used to generate the random numbers that initialize the centroids.
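The idea behind the K-means++ initialization can be sketched in a few lines of pure Python. This is a simplified illustration under my own assumptions, not the code Azure Machine Learning runs. The function name kmeans_plus_plus and the toy points are hypothetical, and the seed only makes this sketch repeatable; it will not reproduce the Studio's centroids.

```python
import math
import random

def kmeans_plus_plus(points, k, seed=12345):
    """Pick k starting centroids, favouring points far from those already chosen."""
    rng = random.Random(seed)  # a fixed seed makes the choice repeatable
    centroids = [rng.choice(points)]  # first centroid: uniformly at random
    while len(centroids) < k:
        # Squared Euclidean distance from each point to its nearest centroid.
        weights = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # Sample the next centroid with probability proportional to that weight.
        # (Assumes at least k distinct points, so the weights are not all zero.)
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

# Hypothetical toy data: four tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (0, 10), (1, 10), (10, 10), (9, 10)]
chosen = kmeans_plus_plus(points, k=4)
print(chosen)
print(math.dist((0, 0), (3, 4)))  # Euclidean distance between p and q
```

Because distant points get higher weights, the starting centroids tend to be spread across the data, which is why the main loop often needs fewer iterations.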

2.12   Drag two Convert to CSV modules onto the canvas. Connect each Train Clustering Model module to its own Convert to CSV module. Run the experiment by clicking the Run button. Your full experiment should now look like this:


3. Evaluate the Clustering

Note: Evaluating a clustering is typically tricky because there is rarely any ground truth information that we can use for testing. So how do we evaluate whether a clustering is good?

The short answer is that no one agrees. The longer answer is that researchers have developed several useful heuristics. The first, and perhaps most useful, observation is that we are often trying to find some meaningful latent pattern in our data via clustering. The most important thing is to determine whether the value of K (the number of clusters) is optimal. To do that we use the elbow method, which plots the cost J (the sum of squared distances from each point to its assigned centroid) against the number of clusters K. The cost should decrease as we increase the number of clusters, and then flatten out. Choose K at the point where the cost starts to flatten out.
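The elbow method can be sketched without any plotting library by simply printing J for a range of K values. The code below is a toy illustration under stated assumptions. The data is made up, and the deterministic farthest-first initialization is my own simplification to keep the output repeatable (the experiment above uses K-Means++).

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_cost(points, k, iterations=100):
    """Run a toy K-means and return the cost J for this k."""
    # Assumption: farthest-first initialization, used only for repeatability.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, a in zip(points, labels) if a == c]
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break  # converged
        centroids = updated
    # J = sum of squared distances from each point to its nearest centroid.
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Hypothetical toy data: three tight groups of five points each.
offsets = [(0, 0), (0.5, 0), (0, 0.5), (-0.5, 0), (0, -0.5)]
centers = [(0, 0), (10, 0), (5, 9)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

costs = {k: kmeans_cost(points, k) for k in range(1, 6)}
for k, cost in costs.items():
    print(f"K={k}  J={cost:.4f}")
```

For this toy data the cost falls sharply until K = 3 and changes very little after that, so the elbow points to three clusters. On real data the bend is usually less pronounced.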

3.1 Visualize the Train Clustering Model results dataset

Note: The ellipses that represent the clusters are almost perpendicular, and their lengths are quite different as well. This indicates that the separation between those clusters is pretty good. On the other hand, if we see an overlap between the two ellipses, this indicates a poor separation of the data into clusters.


4. View the results of the experiment and create Jupyter Notebooks

4.1 In the Azure ML workspace, right-click the right-side Convert to CSV module and, under Results dataset, choose Open in a new Notebook. Choose Python 3.

4.2 Once the Jupyter Notebook opens, click Run. In the next In[]: line, type the following (in lowercase, since Python is case-sensitive): frame

4.3 Click Run. You should see the following:

Note: The Assignments column shows the cluster to which each data row was assigned.
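A quick way to read the assignments is to count how many rows fall into each cluster. The sketch below uses a hypothetical list of assignment values in place of the real column. In the notebook, calling value_counts() on the Assignments column of the pandas DataFrame should give the same tally.

```python
from collections import Counter

# Hypothetical values of the Assignments column, one entry per data row.
assignments = [0, 2, 1, 0, 3, 0, 1, 2, 0, 1]

cluster_sizes = Counter(assignments)
for cluster, size in sorted(cluster_sizes.items()):
    print(f"Cluster {cluster}: {size} rows")
```

Very small or very large clusters are worth a second look, since they can hint that K is too high or too low.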


4.4 To plot the clusters, type the following in the next In[]:

5. Beyond Azure Machine Learning Studio

If you want to plot the clusters in different colors and run the elbow method, please use the following repo on GitHub: