DataSets

A DataSet is a CSV file that contains replacement data to be used by DataSet masks to overwrite sensitive values.

There are two parts to using DataSets in DataVeil:

1. The DataSet Definition

This defines the format of a CSV file and where to find it. This can be defined as either a DataSet definition under the DataSets tab or as a DataSet Component.

2. The DataSet Mask

These are masks that use a DataSet definition for reading a CSV file as a source of masked values for data masking. Please refer to the DataSet Mask.

 

The DataSet Definition

A CSV file that is used as a DataSet source can contain a logically unlimited number of columns in any order.

You do not need to delete unused columns from CSV files because the DataSet Mask can specify precisely which columns are to be selected from a CSV file. This makes it convenient to use existing CSV files without modification.

A DataSet definition defines the location of a CSV file and how to parse it, such as whether the file contains a heading row (CSV column names), separator character, etc.

DataSets defined under the main DataSets tab (shown below) are private to the current project and are saved in the current project file. If you would like the possibility of sharing the DataSet definition among multiple project files then you should create the definition as a DataSet Component instead.

To create a private DataSet definition for the current project, click on the 'Add...' button.

After selecting a CSV file, the following dialog shall appear. Enter the file format parameters and check that the file is being parsed correctly by reviewing the CSV File Preview view.
 

 

Path Type

Absolute

CSV Path shall be an absolute path.
 

Relative to project file

CSV Path shall be relative to the folder where the current project shall be saved. The path must be in the same folder or at any folder level below the project file.
 

Relative to DataSet Home

CSV Path shall be relative to DataSet Home. The path must be in the same folder or at any folder level below it.
 

Relative to Component Home

CSV Path shall be relative to Component Home. The path must be in the same folder or at any folder level below it.

 

Lifetime

When a DataSet mask is used, DataVeil copies the required data to the target DBMS. Whether the DataSet is deleted from the DBMS or retained after the masking execution is completed is known as the Lifetime.
 

Temporary Lifetime

After a masking execution is completed, the DataSet is automatically deleted from the DBMS.

This is the default.
 

Permanent Lifetime

Although the DataSet upload process is relatively fast, if you have many DataSets that are much larger than 100,000 rows and you intend to repeat many masking executions using the same DataSets then it may be worthwhile, in the interests of efficiency, to specify that these DataSets should be retained on the DBMS after a masking execution.

You can do this is from the Lifetime dialog. Click on the "Lifetime..." button and a similar dialog as shown below appears:

 

You can now select the "Keep as Table" radio button and specify a DBMS table name where this DataSet is to be saved. This table should not initially exist.

If your DBMS requires a schema location (such as on SQL Server) then the schema must already exist prior to execution. For this purpose it is convenient to create a schema where all permanent datasets shall be kept so that it is easy to drop them all later. In this example the schema shown is "dataveil".

On repeated masking executions, during initialization, DataVeil shall see that the table exists and will therefore not attempt to upload the DataSet. If the table already exists then DataVeil will not attempt to validate that the contents are of the specified DataSet.

Note: If you use Permanent Lifetime then please remember to manually DROP these DataSet tables from your DBMS after you have completed all your masking.

 

On Database

  Encode as Unicode

If this option is enabled then the data from the CSV file shall be used with Unicode encoding on the database, otherwise the default character set on the database shall be assumed.

The default is that 'Encode as Unicode' is enabled and this is generally the recommended setting.

The following are the maximum character lengths per CSV value according to how this parameter is set:

SQL Server/Azure: Default charset: 8000. Unicode: 4000.

Oracle: Default charset: 4000. Unicode: 2000.

MySQL: Default charset: 8000. Unicode: 4000.

 

CSV Parameters

Column names on row number

If the CSV file has a row that lists the column names of the data in the file then this should specify that row number where the first row in the file is numbered from 1.

If the file does not contain a row with column names then this value must be set to 0.

This value must be less than the value for the 'Data starts on row number' parameter.
 

Data starts on row number

This is the row number in the CVS file where the first line of data appears where the first row in the file is numbered from 1.

For example, if the first row in the CSV file is a column names row and the next row is where the data starts, then this value would be 2.

This value must be greater than the value for the 'Column names on row number' parameter.
  

Ignore leading white space

If this value is 'Y' then any white space at the start of a line, or immediately following a Separator, until the next non-white space character shall be ignored.

If this value is 'N' then white space shall be considered as part of the current column value.
  

Separator character

This is the character that terminates a column value and starts the next column value. The last column value in a row is terminated by a new line.

Printable characters (such as a comma ',') can be entered as a normal character.

Unprintable characters should be entered with a '\' followed by the decimal ASCII code. For example, the ASCII Group Separator character would be entered as '\29'.

The Tab character can be entered as '\t' or '\9'.
 

Quote character

Quotes that enclose data values are optional.

If quotes are used to enclose data values then this parameter should specify the Quote character used. Example the double quote (") or single quote (').

If quotes are not used to enclose data values in the CSV file then this parameter has no effect.

Quotes that are to be considered as part of a data value can either be stuffed with an additional quote or escaped using the 'Escape character'.

Example: The value Sammy's could be represented as "Sammy""s" or "Sammy\"s" (assuming the Escape character is \ ).
 

Escape character

This defines the character to 'escape' characters that would ordinarily be considered CSV control characters but are to be considered as plain text data instead. Please refer to the example under "Quote character" above.

 

CSV File Preview

The CVS File Preview Tab shows you how DataVeil shall parse the CSV file.

It is advisable to prepare CSV files to have a uniform number of columns. In other words, the Column Names row (if included) and every Data row should contain the same number of columns. Therefore, every such row should contain the same number of Separators.

If there are two consecutive Separators (i.e. not escaped and therefore not part of a column value) then the missing value (between the consecutive Separators) shall default to an empty string. Note: Oracle shall convert empty strings to NULL, whereas SQL Server preserves empty string values.

If a data row has fewer values than the number of CSV columns then the remaining columns for that row shall receive NULL values.

The number of CSV columns for a table shall be assumed to be the number of columns found in the Column Names row or the first Data row if a Column Names row was not included.

 

The DataSet Mask

After a DataSet definition has been created for a CSV file, as described above, it is possible to create DataSet masks that will use the CSV file as a source of masking values.

Please see the DataSet Mask for details.