TDM data generation creates synthetic entities based on either rules or AI. The synthetic data is populated into the LU tables, where an LU table can be populated with either source data or generated synthetic data. This article describes the implementation of rule-based data generation.
To support synthetic data generation, LU population must be based on Broadway flows rather than DB Queries or root functions. Hence, the sourceDbQuery Actor was enhanced (in Fabric 7.1) to support either of the following two population modes: a DB Select query from a data source or synthetic population. The population mode is set based on the ROWS_GENERATOR key, which is a session variable. When set to true, the sourceDbQuery Actor runs the data generation inner flow to generate synthetic records. The number of synthetic records created for each parent key is determined by the rowsGeneratorDistribution input argument of the sourceDbQuery Actor.
Verify that the LU tables' populations are based on Broadway flows in order to support synthetic data generation. Note that you need to use the populationRootTable.pop.flow for the main source LU table. For other LU tables, generate the default population flow.
Optional — edit the default number of generated synthetic records. The data generation process needs to 'know' how many records have to be generated on each LU table. For example, the number of addresses to be generated for a synthetic customer should be indicated.
The rowsGeneratorDistribution input argument of the sourceDbQuery Actor (named Query) in each LU table's population flow sets the number of generated records for that table. By default, one record is generated for the main LU table, and between 1 and 3 records are generated for each of the remaining LU tables. The values '1' and '3' are set in the TABLE_DEFAULT_DISTRIBUTION_MIN and TABLE_DEFAULT_DISTRIBUTION_MAX TDM general parameters.
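As a rough illustration of the default behavior described above, the following Python sketch picks the number of rows to generate for one parent key. It is not TDM code: the function name and the is_main_table flag are hypothetical, both bounds are assumed to be inclusive, and a uniform random draw stands in for the default distribution.

```python
import random

# Default values of the TDM general parameters named in the text.
TABLE_DEFAULT_DISTRIBUTION_MIN = 1
TABLE_DEFAULT_DISTRIBUTION_MAX = 3


def rows_to_generate(is_main_table, rng,
                     minimum=TABLE_DEFAULT_DISTRIBUTION_MIN,
                     maximum=TABLE_DEFAULT_DISTRIBUTION_MAX):
    """Return how many synthetic rows to create for one parent key."""
    if is_main_table:
        # The main LU table gets exactly one record per generated entity.
        return 1
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    # Assumption: both bounds are inclusive and the draw is uniform.
    return rng.randint(minimum, maximum)


rng = random.Random(42)  # seeded only to make the sketch repeatable
address_counts = [rows_to_generate(False, rng) for _ in range(5)]
contract_counts = [rows_to_generate(False, rng, 3, 6) for _ in range(5)]
```

Here address_counts uses the default 1 to 3 range, while contract_counts mimics a per-table override of 3 to 6.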
Edit options:
The following methods can be used to override the number range of generated records in a TDM implementation:
i. Edit the default number range of generated records (default is 1 to 3): Update the TABLE_DEFAULT_DISTRIBUTION_MIN and TABLE_DEFAULT_DISTRIBUTION_MAX parameters in the TDM DB. This edit will impact the number of generated records on all LU tables, except for the main source LU table.
Example: Run the following Update statements in order to set the number of generated records to be between 2 and 4:
UPDATE tdm_general_parameters SET param_value = '2' WHERE param_name = 'TABLE_DEFAULT_DISTRIBUTION_MIN';
UPDATE tdm_general_parameters SET param_value = '4' WHERE param_name = 'TABLE_DEFAULT_DISTRIBUTION_MAX';
ii. Edit the rowsGeneratorDistribution input argument in the LU population flow to set a different number range (minimum and maximum values) or a different distribution type (the default is Uniform) for the records generated in a given LU table. For example, generate customers with 3 to 6 contracts; the data generation process then randomly picks a number of records within this range:
Set the type of the distributed value to integer:
Then, edit the distribution type and/or the minimum and maximum values, as seen in the below example:
Click here for more information about distribution types.
Notes:
If the rowsGeneratorDistribution input argument is edited, but not set as an External parameter, the parameter cannot be overridden by the TDM task.
If the rowsGeneratorDistribution input argument is not edited, it is automatically generated behind the scenes as an external parameter with the following naming convention:
[lu name]_[lu table name]_number_of_records
For example: crm_address_number_of_records.
The external parameter enables the user to override the number range of generated records for each table in the TDM task. For example, customers should be generated with 2 to 4 addresses and 3 to 6 contracts each.
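The naming convention in the notes above can be sketched as a small helper. This is an illustrative sketch only: the function name and the task_overrides dictionary are hypothetical, and the lower-casing is an assumption inferred from the crm_address_number_of_records example.

```python
def number_of_records_param(lu_name, table_name):
    """Build the auto-generated external parameter name for one LU table."""
    # Lower-casing is an assumption based on the crm_address example.
    return f"{lu_name}_{table_name}_number_of_records".lower()


# Hypothetical task-level overrides: parameter name -> (min, max) range.
task_overrides = {
    number_of_records_param("CRM", "address"): (2, 4),
    number_of_records_param("CRM", "contract"): (3, 6),
}
```

The dictionary mirrors the example in the text: 2 to 4 addresses and 3 to 6 contracts per generated customer.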
The sourceDbQuery Actor (automatically added to the LU population flow and named Query) runs an inner data generation flow to generate synthetic records for data generation tasks. Data generation flows must be created on each source LU table to support synthetic data generation.
A data generation flow must have the following naming convention:
${population name}.generator
For example: activity.pop.generator
Note that a synthetic data generation task execution sets the ROWS_GENERATOR key (session variable) to true, which triggers the execution of the data generation inner flow on each LU table.
From TDM 8.1 onwards, data generation flows are integrated with the Fabric Catalog to generate synthetic data based on field types. Additionally, TDM supports synthetic data generation without using the Fabric Catalog, in cases where the Catalog is not implemented in the TDM project.
The tdmSeqList and TDMSeqSrc2TrgMapping sequence tables must be populated before generating data generation flows.
This is required in order to include sequence generation within data generation flows for fields that are defined as sequences in the TDMSeqSrc2TrgMapping table. The generated flow sets the sequenceId input argument to a sequence that is created in the TDM DB for the generated ID, using the following naming convention:
Gen_[the sequence name in TDMSeqSrc2TrgMapping]
Example:
The customer, contract and address tables of the CRM LU have the following sequence mapping:
The data generation flows of these tables create the gen_customer_id_seq, gen_address_id_seq, and gen_contract_id_seq DB sequences in the TDM DB and populate the customer_id, address_id, and contract_id fields based on the generated sequences.
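The sequence naming can be sketched as follows. This is only an illustration: the mapped sequence names (customer_id_seq and so on) are inferred from the generated names quoted above rather than taken from the mapping table, the tuple layout is hypothetical, and the lower-case prefix follows the gen_customer_id_seq example.

```python
def generated_sequence_name(mapped_sequence_name):
    """Name of the TDM DB sequence used by a data generation flow."""
    # Lower case follows the gen_customer_id_seq example in the text.
    return f"gen_{mapped_sequence_name}".lower()


# Hypothetical mapping rows: (LU table, field, mapped sequence name).
seq_mapping = [
    ("customer", "customer_id", "customer_id_seq"),
    ("address", "address_id", "address_id_seq"),
    ("contract", "contract_id", "contract_id_seq"),
]

# Field name -> DB sequence that populates it during data generation.
sequence_by_field = {
    field: generated_sequence_name(seq) for _, field, seq in seq_mapping
}
```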
In order to create data generation flows, run one of the following:
I. TDMInitFlow flow. Set the CREATE_GENERATE_FLOWS input parameter to true. Note that this flow is designed to run only once, when creating an LU, and it also adds the TDM tables to the LU. If the LU already contains the TDM tables, it is recommended to run the createAllFromTemplates flow (option II below) for adding the target tables to the LU.
II. createAllFromTemplates flow. Set the CREATE_GENERATE_FLOWS input parameter to true.
III. createGenerateDataTableFlows flow:
Deploy both the LU for which you need to create the data generation flows and the TDM LU to the Fabric debug server.
Open the createGenerateDataTableFlows flow imported from the TDM library.
Populate the LU_NAME and OVERRIDE_EXISTING_FLOWS input parameters.
Run the flow to create the data generation flows for the LU tables, except for tables that are listed in TDMFilterOutTargetTables with the generator_filterout checkbox checked (true). These data generation flows are created automatically in the GeneratorFlows subdirectory, under the Broadway directory of the LU.
The following data generation flows are created for each LU table:
Data generation flow. This flow has the following naming convention: ${population name}.generator.
For example: contract.pop.generator
From TDM 8.1 onwards, TDM templates also create inner flows that set default values for table fields based on their types. This inner flow is called when the Fabric Catalog is not implemented in the TDM project. The inner flow has the following naming convention: ${table name}.typeDefaultsGenerator.
For example: contract.typeDefaultsGenerator
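The two naming conventions above can be summarized in a short sketch. The helper names are hypothetical and exist only to show how the flow names are composed from the population name and the table name.

```python
def generator_flow_name(population_name):
    """Data generation flow: ${population name}.generator"""
    return f"{population_name}.generator"


def type_defaults_flow_name(table_name):
    """Type-based defaults inner flow: ${table name}.typeDefaultsGenerator"""
    return f"{table_name}.typeDefaultsGenerator"


def expected_flows(table_name, population_name):
    """Flows expected for one LU table from TDM 8.1 onwards."""
    return [
        generator_flow_name(population_name),
        type_defaults_flow_name(table_name),
    ]


contract_flows = expected_flows("contract", "contract.pop")
```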
Data generation flows are created with the following logic:
IDs:
The data generation flow sends the parent IDs to the child's population flow, based on the parent-child LU schema definition. For example, the Address LU table is a child of the Customer LU table and is linked to it via the customer_id field. A new customer_id sequence value is generated for the Customer LU table. The Address data generation flow gets the parent_row as its input and maps the parent customer_id to the Address record.
IDs that are not linked to a parent LU table are populated by the Sequence Actors based on the fields mapped in TDMSeqSrc2TrgMapping.
Other fields are populated with synthetic data:
By default, this process calls the CatalogGeneratorRecord Actor to generate the field values based on the Fabric Catalog. If a field's classification is set in the Catalog, the generated value is based on the classification's data generator. Otherwise, a default value is generated based on the field type.
If no data is returned by the CatalogGeneratorRecord Actor (when the Fabric Catalog is not implemented), then the flow calls the ${table name}.typeDefaultsGenerator inner flow to utilize data generation Actors according to the fields' data type. Note that these default data generation Actors are selected based on the mappings defined in the GenerateDataDefaultFieldTypeActors constTable (imported from the TDM library under the Shared Objects). This table can be edited to change the default data generators and should be updated before the data generation flows are created.
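The fallback between the Catalog-based generator and the type-based defaults can be sketched as follows. All functions here are hypothetical stand-ins for the Actors and inner flows named above, and the default values per type are invented placeholders, not the contents of the GenerateDataDefaultFieldTypeActors table.

```python
# Hypothetical per-type defaults, standing in for the Actors mapped in
# the GenerateDataDefaultFieldTypeActors constTable.
TYPE_DEFAULTS = {
    "TEXT": lambda: "text",
    "INTEGER": lambda: 0,
    "REAL": lambda: 0.0,
}


def type_defaults_generator(field_types):
    """Stand-in for the ${table name}.typeDefaultsGenerator inner flow."""
    return {name: TYPE_DEFAULTS[ftype]() for name, ftype in field_types.items()}


def catalog_generator_record(field_types, catalog):
    """Stand-in for the CatalogGeneratorRecord Actor."""
    if catalog is None:
        return {}  # no Catalog in the project, so nothing is returned
    record = {}
    for name, ftype in field_types.items():
        generator = catalog.get(name)  # the classification's generator, if any
        record[name] = generator() if generator else TYPE_DEFAULTS[ftype]()
    return record


def prepare_generated_data(field_types, catalog=None):
    """Try the Catalog first; fall back to type defaults if it returns nothing."""
    record = catalog_generator_record(field_types, catalog)
    if not record:
        record = type_defaults_generator(field_types)
    return record


fields = {"city": "TEXT", "floor": "INTEGER"}
with_catalog = prepare_generated_data(fields, {"city": lambda: "Springfield"})
without_catalog = prepare_generated_data(fields)
```

With a Catalog, the classified city field uses its own generator while floor falls back to a type default; without a Catalog, every field comes from the type defaults.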
The output of the data generation flow contains a Map that includes a list of fields. These fields are sent to the related LU population flow and loaded into the LU table as the columns of a single row. Note that the data generation flow is called in a loop and returns a single record on each call. By default, the rowsGenerator Actor handles both the loop over the parent rows and the loop over the child IDs for each parent ID.
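The nested loop described above can be pictured with the following sketch. It is a simplified model, not the rowsGenerator Actor itself: the function names, the itertools counter standing in for a Sequence Actor, and the placeholder city value are all assumptions made for illustration.

```python
import itertools
import random


def address_generator(parent_row, address_ids):
    """Stand-in for an Address data generation flow: one record per call."""
    return {
        "customer_id": parent_row["customer_id"],  # mapped from the parent row
        "address_id": next(address_ids),           # stand-in for a Sequence Actor
        "city": "text",                            # placeholder synthetic value
    }


def rows_generator(parent_rows, generator, rng, minimum=1, maximum=3):
    """Outer loop over parent rows, inner loop over child rows per parent."""
    address_ids = itertools.count(1)
    for parent_row in parent_rows:
        for _ in range(rng.randint(minimum, maximum)):
            yield generator(parent_row, address_ids)


customers = [{"customer_id": 101}, {"customer_id": 102}]
addresses = list(rows_generator(customers, address_generator, random.Random(7)))
```

Each yielded Map corresponds to one row loaded into the child LU table, with the parent's customer_id carried over.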
The following manual updates may be required for the data generation flows:
Replacement of the default data generation Actors with other data generators or custom inner flows: this needs to be done in later flow stages, after the Prepare Generated Data stage (after calling either the CatalogGeneratorRecord Actor or the ${table name}.typeDefaultsGenerator inner flow).
Overridden fields must be added to a Map. Once added, this Map should be sent as the last parameter to the Merge Maps of all Fields Actor within the data generation flow.
In the below example flow, the logic to generate the Associated_line, Associated_line_fmt, Contract_ref_id, and Description fields is overridden using MTables instead of the default generated values. These fields are added to a Map, and the Map is then sent to the Merge Maps of all Fields Actor:
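The effect of sending the override Map as the last parameter can be sketched as a simple ordered merge. This assumes that later Maps overwrite earlier ones for the same key; the merge_maps helper, the Contract_id field, and all values are hypothetical.

```python
def merge_maps(*maps):
    """Merge Maps left to right; later Maps overwrite earlier keys."""
    merged = {}
    for current in maps:
        merged.update(current)
    return merged


# Default generated values (placeholders) for a contract record.
generated = {
    "Contract_id": 1,
    "Contract_ref_id": 0,
    "Description": "text",
    "Associated_line": "text",
}

# Overridden fields, for example looked up from MTables.
overrides = {
    "Contract_ref_id": 5501,
    "Description": "Roaming special",
    "Associated_line": "+1 555 0100",
}

row = merge_maps(generated, overrides)  # the override Map goes last
```

Fields that are not overridden, such as Contract_id here, keep their default generated values.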
In general, it is recommended to populate the PII fields in the data generation flow and to avoid overriding them with the Masking Actors in the LU population flow. This allows PII fields to be exposed as external business parameters for data generation tasks, so that the values set by the user are not overridden by the Masking Actors in the LU population.
Verify that the Masking Sensitive Data checkbox is cleared for the Synthetic environment in the TDM Portal, in order to avoid masking PII fields in the LU population flows for data generation tasks.
TDM 8.1 has added a new Actor — GenerateConsistent. This Actor inherits from the Masking Actor but has its own category value — generate_consistent. Using the GenerateConsistent Actor in data generation flows ensures referential integrity across LUs for the generated field.
Notes:
The TDM execution process sets the generate_consistent key to true on data generation tasks.
The new Actor does not require having an input value since there is no original value for newly generated synthetic entities.
If a PII field exists across multiple LUs and is set in several records within the LUI, you should use the Masking Actor instead of the GenerateConsistent Actor. For example, a customer may have multiple contracts, each requiring a different name. The contracts exist in both the CRM and Billing LUs. To handle this, populate the Masking Actor's input parameters in the data generation flow as follows:
View the below example:
PII fields can differ in their occurrence across records and in the requirement for referential integrity (consistency). Each scenario requires a different implementation approach.
Example: Both the CRM and Billing LUs — representing separate systems — contain the First Name and Last Name fields. It is required to keep the same combination of the First and Last Names in both LUs for a given customer.
The following table describes the implementation recommendations for each scenario:
Click here for more information about the masking Actors.
Users may wish to override the default parameter values that are set in the TDM implementation with their own values. This is possible only if external business parameters, such as City and State, are added to the data generation flows. The editor of the parameter depends on the parameter type. Spaces and special characters, except for an underscore, are not allowed in the External Name setting.
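The External Name restriction can be checked with a small validator. This is a sketch under the assumption that only ASCII letters, digits, and underscores count as allowed characters; the function name is hypothetical and the product may apply additional rules.

```python
import re

# Assumption: ASCII letters, digits and underscores are the allowed characters.
_EXTERNAL_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_external_name(name):
    """Return True if the name has no spaces or special characters."""
    return _EXTERNAL_NAME.fullmatch(name) is not None


candidates = ["City", "billing_state", "billing state", "state-code"]
valid = [name for name in candidates if is_valid_external_name(name)]
```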
Click here for more information about integrating the TDM portal with the Broadway editors, as well as implementation guides for MTable and Distribution parameters.
There are several optional modes for executing the data generation inner flow:
Note that the data generation flow can be edited to be executed in different modes, if needed:
Click here for more information and examples related to the RowsGenerator Actor.
The rule-based data generation task runs in a dummy synthetic environment. This synthetic environment must be added to both Fabric and the TDM self-service application, and its name is defined by the SYNTHETIC_ENVIRONMENT Global. By default, this Global is set to Synthetic, but it can be modified to use a different name.
Notes: