You are getting close to your “Go Live” date. You have gone through multiple iterations of testing and you are now building and testing your data migration and integration processes. You have to move millions of rows quickly; in addition, you have a daily integration that needs to move a high number of rows in a very short time frame. You are able to push some data, but not fast enough.

If this sounds familiar, you may have run into this issue in the past or will run into it in the future. In this blog I’ll cover a scenario where we need to move data using ADF (Azure Data Factory) and how to optimize the data transfer. I’ll also explain the impact of different configurations such as batch size and parallelism level. I hope you find this blog helpful.

 

Challenge

We need to move data from one instance to another. The data we are moving is a custom table: a very simple table with few fields and no plugins getting triggered. The amount of data we need to move is about 36K rows, and we need to move it as fast as we can.

Solution

If we move the data row by row and each row takes 1 second to process, it may take up to 10 hours (36K seconds) to process all the rows. Luckily, Dynamics 365 supports batching and parallelism, so we can speed up the process. When we plan to utilize parallelism, we should also look at the Power Platform throttling and API call limitations (see the links below; a small retry sketch follows them).

See links:

Service Protection API Limits

Requests limits and allocations
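When these service protection limits are exceeded, Dataverse responds with HTTP 429 and a Retry-After header telling the client how long to wait before retrying. If you are calling the API yourself rather than through ADF, a minimal retry sketch (Python with the requests library; the org URL, table name and token are placeholders, not values from this scenario) could look like this:

```python
import time
import requests

def create_row_with_retry(session, url, payload, max_retries=5):
    """POST a single row, backing off when Dataverse returns a
    service protection error (HTTP 429 with a Retry-After header)."""
    for _ in range(max_retries):
        response = session.post(url, json=payload)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # Wait for the duration the server asks for before retrying.
        time.sleep(float(response.headers.get("Retry-After", "5")))
    raise RuntimeError(f"Still throttled after {max_retries} retries")

# Hypothetical usage -- the org URL, table name and token are placeholders:
# session = requests.Session()
# session.headers.update({"Authorization": "Bearer <token>",
#                         "OData-Version": "4.0",
#                         "Accept": "application/json"})
# create_row_with_retry(
#     session,
#     "https://<org>.api.crm.dynamics.com/api/data/v9.2/new_widgets",
#     {"new_name": "row 1"})
```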

 

To figure out the best configuration for moving the data, I experimented with different scenarios. In the screenshots below, you can see the Azure Data Factory configuration with the Dynamics 365 connector.

In the Sink tab, you can configure the batch size and max concurrent connections:

 

  

In the Settings tab, you can configure the degree of copy parallelism. The sketch below shows where these settings live in the copy activity definition:
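If you prefer editing the pipeline JSON over the UI, the same three knobs appear as plain properties on the copy activity: writeBatchSize and maxConcurrentConnections on the sink, and parallelCopies on the activity itself. The sketch below is written as a Python dict so the property names are easy to compare with the UI labels; the source type and the values are only example placeholders, not a recommendation.

```python
# Trimmed copy-activity definition; only the three highlighted properties
# are the point -- the source type and the values are example placeholders.
copy_activity = {
    "name": "CopyCustomTableToDynamics",
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "AzureSqlSource"},   # placeholder source
        "sink": {
            "type": "DynamicsSink",             # Dynamics 365 / Dataverse sink
            "writeBehavior": "upsert",
            "writeBatchSize": 100,              # "Batch size" in the Sink tab
            "maxConcurrentConnections": 52,     # "Max concurrent connections" in the Sink tab
        },
        "parallelCopies": 32,                   # "Degree of copy parallelism" in the Settings tab
    },
}
```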

In case you are not familiar with Azure Data Factory, here are some useful links:

Copy data from and to Dynamics 365 (Common Data Service) or Dynamics CRM by using Azure Data Factory

Azure Data Factory documentation

 

I ran multiple scenarios moving the data; you can see the results in the following table (the last column is simply rows divided by seconds):

| Batch size | Max concurrent connections | Degree of copy parallelism | Rows | Seconds | Rows created per second |
| --- | --- | --- | --- | --- | --- |
| 10 | 52 | 32 | 35985 | 98 | 367.19 |
| 50 | 52 | 32 | 35985 | 84 | 428.39 |
| 100 | 52 | 32 | 35985 | 86 | 418.43 |
| 100 | 52 | 20 | 35985 | 99 | 363.48 |
| 100 | 52 | 16 | 35985 | 113 | 318.45 |

 

 

Now consider a scenario with a larger data set: can I use the same configuration, or do I need to modify it? See the following results:

| Batch size | Max concurrent connections | Degree of copy parallelism | Rows | Seconds | Rows created per second |
| --- | --- | --- | --- | --- | --- |
| 10 | 52 | 32 | 150000 | 1147 | 130.78 |
| 50 | 30 | 30 | 150000 | 1133 | 132.39 |
| 100 | 10 | 16 | 150000 | 834 | 179.86 |
| 200 | 24 | 8 | 150000 | 717 | 209.21 |
| 200 | 24 | 12 | 150000 | 795 | 188.68 |

 

Conclusions

  1. Increasing the batch size doesn’t always help. The batch size can also have a different impact depending on the latency between the locations the process runs from and to.
  2. You can see that there is a difference between a very large data set and a small one, and in how you should spread the process to run in parallel.
  3. Increasing parallelism can help a lot, but too many parallel processes combined with a large amount of data will actually result in a slower rate of operations (see the Service Protection API Limits above).
  4. If you are encountering this problem, use the tables above as a guideline: measure how many records you are creating and review the throttling and limitation documentation. I highly recommend running a few tests to optimize your process.
  5. You can apply the same optimization process when you use other ETL tools or build your own console applications; see the sketch after this list. For more documentation and configuration, see this link: How to maximize throughput
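To make the last point concrete, here is a rough sketch of the same batching-plus-parallelism idea in a home-grown loader: rows are grouped into batches, each batch is posted to the Dataverse Web API $batch endpoint, and a thread pool limits how many batches run at once. The org URL, table name, token, and batch/worker values are placeholders, the multipart body is trimmed to the essentials, and the throttling handling shown earlier is omitted, so treat it as a starting point rather than a finished loader.

```python
import json
import uuid
from concurrent.futures import ThreadPoolExecutor

import requests

ORG_URL = "https://<org>.api.crm.dynamics.com"   # placeholder
TABLE = "new_widgets"                            # placeholder custom table
TOKEN = "<bearer token>"                         # placeholder

BATCH_SIZE = 100     # rows per $batch request
MAX_WORKERS = 16     # parallel batches; stay under the concurrency limit

def build_batch_body(rows, boundary):
    """Build a minimal multipart/mixed body with one POST per row
    (independent operations, no change set)."""
    parts = []
    for row in rows:
        parts.append(
            f"--{boundary}\r\n"
            "Content-Type: application/http\r\n"
            "Content-Transfer-Encoding: binary\r\n\r\n"
            f"POST {ORG_URL}/api/data/v9.2/{TABLE} HTTP/1.1\r\n"
            "Content-Type: application/json\r\n\r\n"
            f"{json.dumps(row)}\r\n"
        )
    parts.append(f"--{boundary}--\r\n")
    return "".join(parts)

def send_batch(rows):
    """Send one batch of creates to the $batch endpoint."""
    boundary = f"batch_{uuid.uuid4()}"
    response = requests.post(
        f"{ORG_URL}/api/data/v9.2/$batch",
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "OData-Version": "4.0",
            "Accept": "application/json",
            "Content-Type": f"multipart/mixed;boundary={boundary}",
        },
        data=build_batch_body(rows, boundary),
    )
    response.raise_for_status()   # add the 429/Retry-After handling shown earlier
    return len(rows)

def load(rows):
    """Split rows into batches and create them with a bounded thread pool."""
    batches = [rows[i:i + BATCH_SIZE] for i in range(0, len(rows), BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return sum(pool.map(send_batch, batches))

# load([{"new_name": f"row {i}"} for i in range(36000)])
```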

 

Disclaimers

Please note that I have oversimplified the process here. Things to consider when moving data:

  1. Are we triggering plugins? Can we turn them off?
  2. What type of entities are we moving and how many attributes do they have? For example, creating a contact is more expensive than creating a custom entity, and creating a row with 10 columns is more expensive than creating one with 2 columns (column type is also a factor).
  3. How much data already exists in the system can also have an impact.
  4. What type of operation are we doing: create/update/insert/delete?