Azure SQL DB Data Masking Performance Tuning

There has been an increasing interest among DataVeil users to perform static data masking on Azure SQL DB.

Although the DataVeil masking software for Azure SQL DB/MI is the same as it is for SQL Server on-prem, there are differences in how DataVeil manages the database, connection and logging settings between these platforms. This is because of the implementation differences in Azure SQL vs SQL Server on-prem.

Some differences:

  • Simple logging mode is not available on Azure SQL. This contributes significantly to logging pressure that must be considered in order to avoid performance problems on Azure.
  • Azure can drop connections on long-running queries whereas on-prem these can run indefinitely. On Azure this was once around 5 minutes. By observation, this now seems to have been extended to around 10 – 15 minutes. In any case, these are the sort of parameters and behaviors Microsoft can change from time to time that can impact performance or raise unexpected errors.
  • CLR assemblies cannot be installed on Azure SQL DB (although they can on MI). This is a very major disadvantage because it means that the DataVeil native library cannot be installed on Azure SQL DB. The use of DataVeil native library boosts masking performance in the mapping phase by around 10x (1,000%) and in some cases much more.

Considering the differences and restrictions of the Azure SQL DB platform, this article aims to provide some general guidance on how to configure and tune both DataVeil and Azure SQL DB for optimal masking throughput.

This article will begin with a summary of recommendations so that these are easy to find for busy people who just need this information quickly. Following this, the remainder of the article will provide further details and analysis with test result metrics that illustrate how performance can be affected by various factors.

Recommendations

These recommendations are based on masking performance tests performed in December 2023 on Azure SQL DB using General Purpose – Serverless Gen5 vCores.

Ensure Adequate vCores

Suggested ratio of DataVeil threads : vCores is 1 : 1.5.

For example, to support 8 DataVeil threads (the default), at least 12 vCores should be available.

It is possible to use a lower ratio of 1:1, such as 8 DataVeil threads for 8 vCores, however the masking process will run proportionately longer. Using 1:1, it is also likely that the CPU will be 100% busy for most of the time. This could become an issue for long running masking jobs that take several hours. For shorter masking jobs, taking less than an hour or two, this 1:1 ratio will probably be OK.

Conversely, if you have a lot of vCores that are sitting idle in your Azure account, then the number of configured DataVeil threads could be increased for more parallelism to achieve greater masking throughput. For example, if you have 24 vCores that are idle and available then the DataVeil thread count can be set to 16.

Ensure Adequate Free Data Space

During the masking process, DataVeil shall first generate all masked values and store these in temporary work tables (mapping tables). Therefore, there should be adequate free data space to accommodate these mapping tables.

The amount of free space required can be estimated as follows:

Each masked column will require its own primary key column and a masked value column. Additionally, if the DataVeil project has “Take snapshot of original values…” enabled in the Settings>Database>Recovery tab, then add the size of the original column that will be saved in the mapping table.

To summarize, a masked column will create a temporary mapping table containing a primary key column and either one or two columns the size of the original column, depending on whether the “Take snapshot of original values” setting is enabled.

Use Moderate DataVeil Masking and IO Buffers Sizes

Azure SQL does not like long-running queries. It is a well-known behavior that Azure will disconnect not only idle connections but even active queries that are long-running. This is typically from 5 to 15 minutes.

Therefore, it is recommended to ensure that all DataVeil masking and write queries should complete within 5 minutes. In this context, a query is the work done by DataVeil per batch of rows (buffer size).

During the masking phase, the query duration is related to the Masked Value Generation buffer size. The default is 800,000 rows and is recommended as an initial value.

During the write phase, the query duration is related to the Table IO UPDATE buffer size. The default is 500,000 rows and is recommended as an initial value.

You can determine how long each of these queries are taking by opening the settings at Settings>Tools>Options>Messages panel and enabling “Log progress of individual IO completions within masking and table writer tasks”. If it is determined that individual IO completions are taking longer than 5 minutes then it is recommended to reduce the corresponding buffer size.

log_io_progress_setting

Some example IO progress log messages are shown below:

INFO ...progress of Sentences mask for column [test].[sales].[table1_10M].[Comment]. Batch of 800,000 rows took 4,17 (mins,secs). Column mapping is 61% complete. Project is 82% complete.

INFO ...progress of writing table [test].[sales].[table1_10M]. Batch of 500,000 rows took 0,07 (mins,secs). Table write is 50% complete. Project is 74% complete.

For comparison, when using SQL Server on-prem, these buffer sizes can often be set much higher for improved throughput, provided there is sufficient log space for each query. Furthermore, for on-prem, DataVeil can use Simple logging mode that minimizes the actual log space required.

Replace Slow Masks

The DataVeil native library provides an enormous performance boost in masked value generation by 10x and sometimes much more. This is because the library is a CLR assembly that performs computational tasks efficiently.

Without the native library, as is the case with Azure SQL DB, these computations are performed by T-SQL and are very much slower.

Therefore, without the native library, the choice of masks can be a significant performance-related factor when masking:

  • Strings longer than around 15 characters, and..
  • Slower masks are used. These are described below.

Some DataVeil masks perform faster than others. Masks that perform more slowly are those that require more computation and operate on longer strings because their execution time is linearly proportional to the string length. The computational demand is further increased for masks that are format-preserving and even more if deterministic mode is used. Specifically, masks that are sensitive to these issues are (in order of slower to faster):

  • Randomize / Randomize Hex
  • Redact
  • Sentences

Alternative 1) A fast alternative to the above masks on Azure SQL DB would be to use a User SQL Value mask containing the code:

REPLICATE ('X', LEN(#_column_name))

This is a character fill that will preserve the original length so it is almost like redaction. The difference is that all characters will be redacted with the chosen char (shown as ‘X’ in the example above), including spaces, digits and special symbols. Therefore, consider whether this total redaction of all characters is acceptable for your requirements.

If the exact number of characters are not required in the masked values as in the original values and the original values are very long (say thousands of characters), then the masked value lengths could be set to a shorter fixed length. The example below generates strings with a fixed length of 200 characters:

REPLICATE ('X', 200)

Alternative 2) Use a DataSet mask

This would be a much faster alternative to the Sentences mask. You could create a CSV file containing the dummy replacement text and use a DataSet mask to reference these.

In summary, if you identify that some string type columns are taking an unacceptably long time to mask then consider:

  1. Will the REPLICATE solution above be acceptable? This would be the fastest solution.
  2. Otherwise, can you write a simple and fast T-SQL alternative to the REPLICATE function to use in a User SQL Value mask?
  3. Otherwise, will a DataSet mask using replacement text from a CSV file be acceptable?
  4. Otherwise, can the Sentences mask be used in non-deterministic mode? Sentences mask is faster than the Redact and Randomize masks because the latter two do additional processing for format-preservation. Furthermore, is it necessary to maintain the original length? For example, if the original string is 1,000 characters, would it be acceptable to instead define a Sentences mask that generates strings of only 20 – 40 characters in length? The shorter the sentences generated, the faster the Sentences mask will run.
  5. Otherwise, can you use non-deterministic mode on the masks?

If none of the above suggestions are acceptable and the performance on Azure SQL DB is unacceptable, then consider moving the database to a different secure staging platform where the masking can be performed. This could be Azure SQL MI or SQL Server on-prem. The platform should have DataVeil native library installed.

The table below shows a comparison of sample masking times of the Sentences mask and the suggested alternative masks. It also shows how the masking times differ for shorter and longer character lengths. The Write Time shown is how long it takes for Azure SQL to UPDATE the original table with the masked values.

Rows masked: 10,000,000

azure_sql_db_slow_char_mask_test_results
  1. The Sentences mask was configured as fixed length, non-deterministic.
  2. The User SQL Value mask was configured with the code: REPLICATE ('X', LEN(#_column_name))
  3. The DataSet masking used two user-supplied CSV file of 10,000 rows of sentences to be used as a masked values. One file had 15 char rows, the other had 200 char rows. The mask was configured as non-deterministic, random order.

SQL Server Cache

If you notice that queries were running quickly earlier in a masking execution, but as time progressed, the queries began to run increasingly slowly until they seemed unreasonably slow, then the cause may be related to SQL Server caching, perhaps not releasing unused space as quickly as it should. After all, DataVeil is performing massive updates to the database as quickly as possible.

This slowness affects the performance on Azure SQL, on-prem SQL Server and DataVeil native libraries.

The nuclear option to fix this is to perform the SQL Server command:

DBCC FREESYSTEMCACHE('ALL')

Caution: This command will have wide-ranging effects on the SQL Server instance and in the case of Azure elastic pools, it will affect other users of the elastic pool. Some of the effects are recompilation of all execution plans. However, if DataVeil is masking a database on a staging server or where you have exclusive access to the affected Azure/SQL Server resources, these side effects shouldn't be a problem. If other users are sharing the resources, then consult with them first. Never run DataVeil on a production server and never connect DataVeil to any production server.

Having said the above, the command comes with many options that can narrow the scope of effect. There are also other commands that have narrower scope. This is currently being investigated and this article will be updated with more specifics as they become known.

In the meantime, the command shown above will fix the issues of queries whose performance have drastically slowed over time.

Monitor the Performance Metrics

It is recommended to monitor the Azure SQL performance metrics for your masking project, at least during the masking project development and validation stage to ensure that adequate resources are allocated.

Performance metrics can be configured on Azure SQL DB by opening the SQL Database window, and then selecting the “Metrics” option under the “Monitoring” heading in the left panel. It is also convenient to save the metrics chart by pinning it to your Azure dashboard so that it can be easily referenced whenever needed.

The metrics should include at least:

  • CPU percentage (Avg)
  • Log IO percentage (Avg)

It is also useful to include:

  • Sessions count (Avg) – This will serve as confirmation of how many DataVeil threads are open.
  • Data IO percentage (Avg) – This will usually be small. It will become significant only if the masking project is very under-resourced such as when running far too many threads for too few vCores.

The Analysis section below shows several screen captures of Azure SQL metrics collected.

Analysis

A summary of the data masking throughput tests on Azure SQL DB is shown below:

Table used: 10 million rows with 24 columns
Columns masked: 22
Rows masked: 10,000,000
Total values masked: 220,000,000

azure_sql_db_vcore_masking_test_results

As a reminder, the mapping times shown are those of Azure SQL DB using T-SQL masking functions. You can expect most mapping times to be faster by around 10x on SQL Server on-prem or Azure SQL MI when using the DataVeil native library.

These tests used a variety of masks that would be typical in a real-world masking project (not just simple super-fast masks that often appear in other published benchmarks).  The masks used are shown below:

azure_sql_db_masking_test_config

Note that these masks were using non-deterministic mode except Redact mask that is always deterministic. Non-deterministic mode is recommended wherever acceptable for your requirements as it increases the degree of anonymization of the masked data.

Metrics Collected

The performance metrics for each of the tests are shown below:

DvThreads : vCores = 1 : 0.5

This result shows that 1 : 0.5 over-utilized CPU and yielded considerably degraded performance compared to 1 : 1.

Note: This test was slowest and only half of the horizontal axis of the graph would fit in the display. The horizontal scale has been kept the same for all test results for ease of comparison.

T4_1_05_chart

DvThreads : vCores = 1 : 1

This result shows that 1 : 1 fully utilized the CPU at 100% throughout the mapping phase. Performance was proportionately slower than 1 : 1.5.

T1_1_1_chart

DvThreads : vCores = 1 : 1.5

This result shows that 1 : 1.5 utilized resources well and was the best overall setting for throughput and performance vs cost.

T3_1_15_chart

DvThreads : vCores = 1 : 2

This result shows that 1 : 2 resulted in CPU that was consistently below 80%. This provided negligible improvement in masking performance or throughput over 1 : 1.5.

T2_1_2_chart

Conclusion

Recommended settings for Azure SQL DB static data masking with DataVeil:

  • Configure 1.5 vCores for every 1 DataVeil thread.
  • Use the default Masked Value Generation buffer size of 800,000 rows and Table IO UPDATE buffer size of 500,000. Reduce if individual query times exceed 5 minutes.
  • If overall masking time is excessive, even after using the above settings, then identify which individual masks are performing the slowest. If slowness is related to masking longer strings that use Randomize/Hex, Redact or Sentences masks then choose a faster masking option – either by adjusting the mask settings (e.g. use non-deterministic if acceptable) or using faster masks.
  • If masking performance seems to have consistently degraded during a masking run to the point of seeming unreasonably slow, consider clearing the SQL Server cache.

 

If you have any comments or feedback you'd like to share please contact us.