This article describes plugins that analyze source systems and calculate various metrics. The analysis is done based on data snapshots.
The plugins are:
This plugin scans the data sample to calculate various data quality metrics. These metrics can then be used for masking and synthetic data generation.
The purpose of this plugin is to identify fields with a limited number of distinct values (in the data sample) and save those values in a dedicated MTable, enabling their use in masking and synthetic data generation.
Once a field is identified as an Option Set, the property optionSet = true is created for it. In addition, a separate MTable is generated for each data platform and schema to store the distinct values (and their distribution). The MTable has the following naming format:
catalog_field_option_set___<dataPlatform>_<schema>.csv
(containing 3 underscores before the data platform name).
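The naming format can be illustrated with a small helper. This is only a sketch: the function name and the sample platform and schema values are hypothetical, and only the file name pattern itself comes from the text above.

```python
def option_set_mtable_name(data_platform: str, schema: str) -> str:
    """Build the MTable file name for one data platform and schema.

    Hypothetical helper; note the three underscores that separate the
    fixed prefix from the data platform name.
    """
    return f"catalog_field_option_set___{data_platform}_{schema}.csv"


# Example with made-up platform and schema names.
name = option_set_mtable_name("CRM_DB", "public")
print(name)
```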
The image below shows an example of such an MTable:
The rules for identifying fields with a limited number of distinct values are:
Absolute Threshold input parameter (which is set to 15 by default).
Additional rules apply based on the plugin input parameters, as explained below.
This parameter defines the absolute threshold for the number of distinct values. If the relative number of distinct values per field, found in a data sample, exceeds the plugin’s threshold (0.05), it is then validated against the absolute threshold (15). For example:
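A minimal sketch of this two-step check follows. It rests on assumptions the text does not confirm: that the relative number is the count of distinct values divided by the sample size, that a field at or below the relative threshold qualifies directly, and that a field above it still qualifies when its distinct count does not exceed the absolute threshold. The function and argument names are hypothetical and are not the plugin's actual code.

```python
def is_option_set(values, relative_threshold=0.05, absolute_threshold=15):
    """Decide whether a sampled field looks like an Option Set.

    Assumed logic (not confirmed by the documentation):
    1. ratio = distinct values / sample size
    2. ratio <= relative_threshold       -> qualifies
    3. otherwise, distinct <= absolute_threshold -> qualifies
    """
    if not values:
        return False  # nothing to evaluate in an empty sample
    distinct = len(set(values))
    ratio = distinct / len(values)
    if ratio <= relative_threshold:
        return True
    # The relative threshold was exceeded, so fall back to the absolute one.
    return distinct <= absolute_threshold


# 4 distinct values in 200 rows: ratio 0.02, passes the relative check.
low_ratio = is_option_set(["A", "B", "C", "D"] * 50)
# 10 distinct values in 100 rows: ratio 0.1, but 10 <= 15.
small_count = is_option_set(list(range(10)) * 10)
# 40 distinct values in 100 rows: ratio 0.4 and 40 > 15.
too_many = is_option_set(list(range(40)) * 2 + list(range(20)))
```

Under these assumptions, the absolute threshold acts as a safety net for small samples, where even a handful of distinct values produces a high ratio.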
The fieldTypeIncludeList plugin input parameter controls which field data types are considered when checking for distinct values.
By default, this parameter includes the STRING and INTEGER data types for this plugin. The valid values are STRING, INTEGER, REAL, DATETIME, DATE and BOOLEAN.
This parameter lets you set up an override list of field names. These fields will be included in the plugin's validation algorithm, even if they are PII or belong to a small table (see the minSampleSize property).
This parameter lets you set up an override list of field names. These fields will be excluded from the plugin's validation algorithm.
This parameter defines a limit to STRING size, to prevent handling text files or complex structures inside a field. The default value is 512 bytes.
This parameter lets you skip small tables by defining the minimum sample size required to verify whether a field qualifies as an Option Set. The default value is 100.
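The way these input parameters could combine into a single eligibility check is sketched below. Only fieldTypeIncludeList, minSampleSize and the default values come from the text. The other attribute names, the order of the checks, and the choice to let the exclude list win over the include list are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptionSetParams:
    # Defaults taken from the text; include_fields, exclude_fields and
    # max_string_bytes are hypothetical names for the unnamed parameters.
    field_type_include_list: frozenset = frozenset({"STRING", "INTEGER"})
    include_fields: frozenset = frozenset()
    exclude_fields: frozenset = frozenset()
    max_string_bytes: int = 512
    min_sample_size: int = 100


def is_candidate(name, data_type, is_pii, sample_size, longest_value_bytes, params):
    """Return True if a field should be checked for distinct values."""
    if name in params.exclude_fields:
        return False  # assumed: explicit exclusion always wins
    if data_type not in params.field_type_include_list:
        return False
    if data_type == "STRING" and longest_value_bytes > params.max_string_bytes:
        return False  # avoid text files or complex structures
    if name in params.include_fields:
        return True  # override: included even if PII or in a small table
    return not is_pii and sample_size >= params.min_sample_size


params = OptionSetParams(
    include_fields=frozenset({"status"}),
    exclude_fields=frozenset({"notes"}),
)
print(is_candidate("status", "STRING", True, 50, 20, params))
```

In this sketch the include list bypasses only the PII and sample size checks, since those are the two cases the text mentions explicitly.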
The purpose of this plugin is to calculate the percentage of NULL values per column, based on the data snapshot. This percentage is calculated for each column in non-empty tables. The default size of the data snapshot is configured in the plugins.discovery file as explained here.
When the calculated value exceeds the threshold, the Null Percentage property is added to the field's properties.
For example, when 30% of the values in a given field are null, the Null Percentage property is added to this field with the value 0.3. However, if 20% or less of the values in this field are null, this property is not added.
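The calculation can be sketched as follows. The function name is hypothetical, the 0.2 threshold is inferred from the example above rather than taken from a documented setting, and None stands in for a NULL value in the snapshot.

```python
def null_percentage_property(values, threshold=0.2):
    """Return the null ratio if it exceeds the threshold, else None.

    A None result means the Null Percentage property would not be added.
    Empty columns are skipped, mirroring the non-empty tables rule.
    """
    if not values:
        return None
    ratio = sum(v is None for v in values) / len(values)
    return ratio if ratio > threshold else None


# 3 nulls out of 10 values: the property is added with the value 0.3.
added = null_percentage_property([None] * 3 + ["x"] * 7)
# 2 nulls out of 10 values: at the threshold, so no property is added.
skipped = null_percentage_property([None] * 2 + ["x"] * 8)
```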
This plugin was valid until Fabric V8.1. In V8.2 it was merged into the Data Quality Metrics plugin.