The Discovery job is a pipeline that connects a series of steps, some executed sequentially and some in parallel. It has 2 main parts: the Crawler and the Plugin Framework.
The Crawler scans the data source, identifying the existing entities and the relationships between them. The Crawler's output is the Catalog schema.
The Plugin Framework is an internal platform for running the plugins. It is a pipeline of plugins that are executed by the Discovery job after the Crawler task completes.
The pipeline is executed based on a combination of the product configuration and the project rules.
Each plugin is a piece of business logic executed to complement the Catalog schema. The plugin's execution can result in a change to the Catalog schema, such as the creation or removal of Catalog elements. Some plugins calculate a score - a confidence level of the plugin result's accuracy.
The plugin input parameters are:

- name - the plugin's unique name
- class - the plugin's Java class
- active - whether the plugin is included in the execution ('true') or not ('false')
- threshold - the score above which the plugin result impacts the Catalog
- monitorDesc - the description displayed for each plugin in the Execution Progress area of the Catalog Monitor, under the number
- inputParameters - a key/value map of additional input parameters that differ per plugin

The K2view Discovery solution includes a constantly growing list of built-in plugins.
Click here for more details about the built-in plugins.
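To make these parameters concrete, here is a minimal sketch of how a plugin definition could be represented in Java. The class and field names are illustrative assumptions, not the product's actual API; the threshold check shows the intent of the threshold parameter.

```java
import java.util.Map;

// Hypothetical illustration of a plugin definition; the fields mirror the
// documented input parameters, but this is not the actual K2view API.
public class PluginDefinition {
    String name;                         // the plugin's unique name
    String pluginClass;                  // the plugin's Java class
    boolean active;                      // whether the plugin is included in the execution
    double threshold;                    // the score above which the result impacts the Catalog
    String monitorDesc;                  // description shown in the Catalog Monitor's Execution Progress area
    Map<String, String> inputParameters; // plugin-specific key/value parameters

    // Intent of 'threshold': a result is applied to the Catalog only when the
    // plugin's confidence score is above it.
    boolean impactsCatalog(double score) {
        return active && score > threshold;
    }
}
```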
The data sample is retrieved from the data source during the Discovery job run. The data is encrypted and used by various plugins during the job run. Once the plugins' execution is complete, the data sample is deleted.
The sample size is defined by a percentage of the table's rows, together with a minimum and a maximum number of rows.
For example, if the percentage is 10%, min is 100 and max is 500: a table with 200 rows gets a sample size of 100 (10% is 20, raised to the minimum); a table with 2,000 rows gets 200; a table with 100,000 rows gets 500 (10% is 10,000, capped at the maximum).
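A minimal sketch of the calculation this example implies, assuming the sample size is the percentage of the table's row count clamped between the min and max values (the class and method names are illustrative):

```java
class SampleSize {
    // Clamp the percentage of the row count to the range [min, max].
    static long of(long rowCount, double percentage, long min, long max) {
        long byPercentage = Math.round(rowCount * percentage);
        return Math.min(Math.max(byPercentage, min), max);
    }
}
// With percentage = 0.10, min = 100, max = 500:
// SampleSize.of(200, 0.10, 100, 500)     -> 100 (10% = 20, raised to the minimum)
// SampleSize.of(2_000, 0.10, 100, 500)   -> 200 (10% falls within the bounds)
// SampleSize.of(100_000, 0.10, 100, 500) -> 500 (10% = 10,000, capped at the maximum)
```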
By default, all the data platform's entities are scanned except for those that are in the global schema exclude list.
The global schema exclude list defines the schemas that should be excluded from a Discovery on any data platform. These excluded schemas are system schemas that are not relevant for the Discovery. The syntax supports regular expressions. For example, "SYS.*" refers to all schema names that start with 'SYS'.
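A sketch of how such an exclude list can be applied, using the documented "SYS.*" example; the class and method names are assumptions for illustration:

```java
import java.util.List;
import java.util.regex.Pattern;

// Sketch of applying the global schema exclude list.
class SchemaExcludeList {
    // "SYS.*" excludes every schema whose name starts with 'SYS'.
    static final List<Pattern> EXCLUDED = List.of(Pattern.compile("SYS.*"));

    static boolean isExcluded(String schemaName) {
        // A schema is skipped when its name fully matches any exclude pattern.
        return EXCLUDED.stream().anyMatch(p -> p.matcher(schemaName).matches());
    }
}
// SchemaExcludeList.isExcluded("SYSTEM") -> true  (starts with 'SYS')
// SchemaExcludeList.isExcluded("SALES")  -> false
```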
From Fabric V8.2 onwards, the Discovery execution is based on rules that accommodate different variations of a Discovery pipeline process. It is now possible to create rules per data platform that override the baseline configuration at the schema or even the dataset level. The job executes a combination of the baseline and user-defined rules.
For example, you can define Plugin X to be executed on Schema 1 while it is not executed on any other schema of the same data platform. You can also define a larger data sample on Schema 2 while the rest of the schemas use the default sample size.
The product's initial setup includes a Baseline rule that represents the baseline configuration, such as the sample size and the list of all product plugins with their default settings.
A Baseline rule can be overridden, for example, by deactivating a plugin that is active in the product settings. A crawler filter cannot be set on a Baseline rule, since this rule applies to all data platforms.
The user can create multiple rules per data platform. Each rule can define:
The rules are applied based on a hierarchy: when multiple rules apply to the same process element, the most specific rule takes precedence. The following illustrates the rules hierarchy:

- The Baseline rule defines Plugin X as inactive.
- A data platform (interface) rule sets Plugin X to active.
- A schema rule sets Plugin X to active and a threshold = 0.8.

Depending on which interface and/or schema the Discovery is executed on, the Plugin X settings are taken from the most specific rule.
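A minimal sketch of this precedence logic, assuming the specificity ordering Baseline < data platform < schema < dataset; the rule model and names are illustrative assumptions, not the product's implementation:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of most-specific-rule resolution.
class RuleResolution {
    // Ordered from least to most specific.
    enum Level { BASELINE, DATA_PLATFORM, SCHEMA, DATASET }

    record Rule(Level level, String pluginName, boolean active, Double threshold) {}

    // The settings are taken from the most specific rule mentioning the plugin.
    static Optional<Rule> resolve(List<Rule> applicableRules, String pluginName) {
        return applicableRules.stream()
                .filter(r -> r.pluginName().equals(pluginName))
                .max(Comparator.comparing(Rule::level));
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
                new Rule(Level.BASELINE, "Plugin X", false, null),     // Baseline: inactive
                new Rule(Level.DATA_PLATFORM, "Plugin X", true, null), // platform: active
                new Rule(Level.SCHEMA, "Plugin X", true, 0.8));        // schema: active, threshold 0.8
        // For a Discovery run on that schema, the schema-level rule wins:
        System.out.println(resolve(rules, "Plugin X"));
    }
}
```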
All the overrides are saved in the Implementation/SharedObjects/Interfaces/Discovery/ folder, in the pluginsOverride.discovery file. This file is created when the overrides are performed using the Discovery Pipeline screen in the Catalog Settings tab.
Click here to learn about baseline configuration as well as override rules that can be viewed and updated via the Discovery Pipeline screen in the Catalog Settings tab.