Performance

One of the most common questions we receive is 'How can performance be improved?'

This section describes the factors that can most improve the performance of a DataVeil masking project.

In this context, performance means executing a masking project in the minimum amount of time.
 

Use DataVeil version 4.8 or later

DataVeil 4.8 introduced significant performance enhancements. If you are using an earlier version, upgrading to at least 4.8 is worthwhile under the conditions described below.

The performance gains depend on several factors and are cumulative, ranging from 20% to well over 200%. They become noticeable only when masking larger tables, i.e. those with at least a few million rows.

The main factors that will determine performance improvement:

* Table size. Larger tables will see greater performance improvements. A table smaller than around 3 million rows is unlikely to see much, if any, improvement, whereas a table with 500 million rows is likely to see a very significant improvement.

* Conditional masks. A column with masks that use Where conditions will see further performance improvements. The more conditional masks on a single column, the greater the improvement for that column. Note: this does not mean that adding conditional masks will improve performance. It means that the same project with many conditional masks on large tables will perform much better in version 4.8 or later.

* Preserved rows. A column with preserved rows will perform better. This is especially true on large tables where only a single column is masked.
    

Use DataVeil Native Functions

If you are masking data on Oracle or SQL Server then you should install the corresponding DataVeil Native Library on the DBMS if at all possible.

This provides a performance boost of approximately 2x for Oracle and 10x for SQL Server during the masked value generation phase. The native libraries have no effect on table write speeds.

Using the DataVeil native libraries, together with DataVeil version 4.8 or later, is probably the biggest performance boost possible, especially for SQL Server.
 

Adjust Masking Execution Settings

These settings are described in detail in The Execution Tab.
 

Maximum Connections

In general, increasing the Maximum Connections will reduce project execution time because it means more masking queries will be executed concurrently. Each masking query uses a separate connection for the duration of the query.

If the number of connections is configured lower than optimal, server resources will sit idle and your project will take longer to execute than necessary.

If the number of connections is configured higher than optimal, resource contention will occur on your server and the masking project will also take longer to execute than necessary.

In general, a good starting point for this setting is 2x or 3x the number of processors available for the masking execution, i.e. processors not significantly busy with other tasks. For example, if you have a dedicated server with 4 processors, a good starting point is between 8 and 12.

Adjust this parameter on subsequent masking executions and observe the performance difference to find the optimal setting for your environment.
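
If your masking target is SQL Server, one quick way to check the processor count visible to the instance is the standard sys.dm_os_sys_info view. This is a SQL Server feature, not a DataVeil one, and Oracle exposes similar information through its V$ views. A minimal sketch:

    -- SQL Server: report the processor count visible to the instance,
    -- plus 2x and 3x multiples as starting points for Maximum Connections.
    SELECT cpu_count,
           cpu_count * 2 AS suggested_connections_low,
           cpu_count * 3 AS suggested_connections_high
    FROM sys.dm_os_sys_info;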
  

Masked Value Generation Batch Size

This represents how many rows shall be masked in a single query.

The default is 800,000. This may seem high, but it does not cause problems in most modern environments. In fact, it may be possible to improve performance by 10% - 30% by increasing this value. However, increasing it also puts pressure on undo log space, which remains allocated for the duration of the query.

If your environment has ample resources and log space then you could try increasing this setting to 1,500,000 and possibly more. If this value is too high then you may experience undo log related errors.
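
If you want to gauge how much headroom the log has while experimenting with larger batch sizes, you can monitor it directly on the DBMS. The command below is standard SQL Server syntax and is only a monitoring sketch, not a DataVeil setting; Oracle users would instead query undo views such as V$UNDOSTAT.

    -- SQL Server: show log size and percentage used for each database.
    -- Run this during a masking execution to observe log pressure.
    DBCC SQLPERF(LOGSPACE);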
  

Table IO

If using the UPDATE IO setting, you can try increasing the batch size to improve throughput.

The default is 500,000 rows and is generally a good starting point.

Similar undo log pressure considerations apply as with Masked Value Generation Batch Size described above.

 

Available Unique Index in Masked Tables

DataVeil requires an available unique index in any table that it masks.

If a column that is part of a unique index is masked then that index becomes unavailable for DataVeil's purposes of coordinating masking progress.

Therefore, if there are no available unique indexes then DataVeil shall need to create a temporary index. As part of this, DataVeil inserts a temporary column into the table being masked. This is a time-consuming process, so it is worthwhile considering how it can be avoided, for example:

Is a column of an existing unique index being masked when it could be left unmasked without compromising the result? For example, users occasionally mask arbitrary numeric ID columns belonging to a primary key that would otherwise have been an ideal index candidate for DataVeil. Such columns are not normally sensitive, especially once all the other sensitive columns have been masked.

Does a column already exist that contains unique values but has no unique index or constraint? If so, consider defining a unique index or constraint on this column. If it is not appropriate to alter the DBMS schema permanently, you can add a 'create unique index' command as a DataVeil pre-masking script and a 'drop index' command as a DataVeil post-masking script, as sketched below. This avoids DataVeil having to insert and drop a temporary column.
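
For example, suppose a table dbo.Customer has a CustomerCode column that already holds unique values but has no unique index (the table and column names here are purely illustrative). The pre- and post-masking scripts could be as simple as the following SQL Server statements; Oracle syntax differs slightly (its DROP INDEX does not take an ON clause).

    -- Pre-masking script: create a unique index that DataVeil can use.
    CREATE UNIQUE INDEX IX_Customer_CustomerCode_Tmp
        ON dbo.Customer (CustomerCode);

    -- Post-masking script: remove the index once masking has completed.
    DROP INDEX IX_Customer_CustomerCode_Tmp ON dbo.Customer;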
 

Choice of Appropriate Mask for Long Values

The performance of some masks varies according to the length of the original value being masked.

A good example is the Randomize mask. This mask is not only compute-intensive, but its run time also grows linearly with the length of the original value being masked. For example, the Randomize mask will take 5x as long to mask a 50 character value as it does to mask a 10 character value. The same applies to Randomize Hex, and also to Redact and Sentences, although the latter two are faster because they do not perform the many computations that the Randomize masks do.

Therefore, if you have large text columns (e.g. thousands of characters wide), avoid using the Randomize mask, as it may take an unacceptably long time (many hours or days instead of minutes). Also avoid the Redact mask on very long values if possible. If you must preserve a realistic length then the Redact mask may be necessary, but in that case it is strongly recommended to use the DataVeil Native Library to significantly reduce execution time on very long values.

Alternatives to consider:

Use the Sentences mask but configure a more manageable length for the generated masked values. For example, if you have a 5,000 character original value and it is acceptable to replace it with a shorter value (say 300 characters), configure the Sentences mask to generate random lengths of 250 - 350 characters.

Create your own specialized mask. If there is a more effective way for you to generate replacement values than the existing DataVeil masks can offer, you can write the SQL code to do this and either put it directly in a User SQL Value mask, or put it in a User Defined Function (i.e. TSQL or PL/SQL) on your DBMS and call it using a User SQL Value mask. A minimal sketch follows.
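
As an illustration only, the following T-SQL function replaces a long text value with a block of 'X' characters capped at 300 characters. The function name and length are hypothetical; you would reference it from a User SQL Value mask like any other SQL expression, and a PL/SQL equivalent could be written for Oracle.

    -- Hypothetical T-SQL helper for masking very long text values quickly.
    -- Preserves lengths up to 300 characters and truncates anything longer.
    CREATE FUNCTION dbo.fn_MaskLongText (@original NVARCHAR(MAX))
    RETURNS NVARCHAR(300)
    AS
    BEGIN
        DECLARE @len INT =
            CASE WHEN LEN(@original) > 300 THEN 300 ELSE LEN(@original) END;
        RETURN REPLICATE(N'X', @len);
    END;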

See Large Objects for more examples.

 

Skip Pre-Write Recovery Snapshot

After all masked values have been generated and before DataVeil starts to overwrite the original data, DataVeil makes a temporary copy of the original values to be masked (a 'snapshot'). This is done so that if the masking run fails, DataVeil can attempt to restore all the original values and leave the database in a consistent state.

Ordinarily, this does not take long because DataVeil copies only the columns being masked. However, users masking very large tables and wide columns may notice a significant delay before the Table Writers start. If you are unsure which steps are taking a long time, you can enable 'Log diagnostic messages' to record more detailed progress messages.

If this snapshot step is taking longer than you would like, and you are confident that the masking project will execute successfully (or don't mind performing a DBMS restore of your data if something goes wrong), then you can disable the snapshot under the tab Settings>Database>Recovery.
 

Unnecessary Use of Synchronize Duplicates

If a column has masks configured only for Deterministic masking then the determinism will implicitly preserve duplicates, i.e. duplicate values will be masked consistently. Enabling the Synchronize duplicates option therefore serves no further purpose other than to introduce an unnecessary duplicates processing step. In such a case, consider disabling the Synchronize duplicates option for the column.

The Synchronize duplicates option is mainly needed for consistent masking of values within a column if at least one mask for that column is configured for non-deterministic masking.
 

Installation on Same Server

Installing DataVeil on the same server as the DBMS where data masking will occur is preferable but not necessary, and will generally not affect performance.

Data Masking

Installing DataVeil on the same server as the DBMS should have a negligible effect on data masking performance. This is because the DataVeil process acts as a controller and only issues masking commands to the DBMS, where functions are executed to perform the actual masking work.

Note: It is not recommended to run DataVeil across a network when connecting to an Oracle DBMS installed on a Linux/Unix server. Very high latency has been observed in such cases and is believed to be related to Oracle JDBC drivers on *nix systems. The same slowness can be observed with commands unrelated to DataVeil, such as a CREATE INDEX issued from an SQL Developer window across the network: it occurs from SQL Developer, which uses JDBC, but not from SQL*Plus, which does not.

Discovery

When a Discovery task is executed, DataVeil downloads samples of the data for analysis. The data is kept temporarily in memory and then discarded. This is necessary because it was not feasible to implement the discovery analysis algorithms on the DBMS itself. There may therefore be some benefit in running DataVeil on the same server as the DBMS, as it avoids transmitting data across the network; however, the default sample size is small enough not to cause any network issues.