The article describes plugins that create refersTo relations in the Catalog schema based on various types of analysis performed on the data source.
The plugins are:
The purpose of the Reference by Name Comparison plugin (formerly known as Metadata Logical Reference) is to identify possible foreign key references between datasets by matching field names and to create refersTo relations. This plugin is useful in cases where a source does not have predefined foreign key constraints. Note that this plugin is inactive by default and must be activated via Discovery Pipeline if needed.
The matching algorithm operates by comparing the field names of two datasets at a time. Prior to the matching, the field names are normalized using the following formatting rules: underscore ‘_’ removal, conversion to lowercase letters and the addition of the table name in case the field name is 'ID'. For example, the field names customer.ID, CUSTOMER_ID and CustomerID will be normalized to the same value — customerid.
This plugin allows defining an exclusion list of field names (e.g., 'username' or 'age') and an exclusion list of field types (e.g., date, time, blob). The defined field names and types are excluded from the matching algorithm.
When the plugin finds a match based on field names, it evaluates the foreign key fields and the direction of the relation using the matching rules described below. The refersTo relation direction is childDataset refersTo parentDataset. The relation is created with a score of the matching rule.
The following matching rules are applied by the plugin. Note that each of these rules is applied only if its score exceeds the plugin's threshold; otherwise, the rule is skipped.
fieldNameIsIdAndPk - Parent Dataset ('DS1') has a PK field 'ID'; Child Dataset has a field 'DS1ID' (normalized).
fieldNameIsIdAndNotPk - Parent Dataset ('DS1') has a PK field 'ID'; Child Dataset has a field 'DS1ID' (normalized); both are non-PK.
singleFieldPkAndNotPk - Single PK field in a Parent Dataset and non-PK field in Child Dataset.
commonFieldsInBothPk - Common fields in PK of both datasets, but Child Dataset has additional PKs.
The relation Child refers to Parent is created.
Some examples of the matching rules are:
sameFieldsInBothPk - Part of PK in both datasets, and both datasets have an identical number of PKs.
Note that the sameFieldNamesNotPk rule to create relations between non-FK fields has been removed as of Fabric V8.3
The fieldTypeIncludeList plugins input parameter controls which field data types are considered when creating relations.
By default, this parameter is set to the STRING, INTEGER or REAL data type for this plugin. The valid values are STRING, INTEGER, REAL, DATETIME, DATE and BOOLEAN.
The purpose of the Reference by Query Analysis plugin (introduced in Fabric V8.3) is to identify possible foreign key references between datasets by analyzing JOIN operations in the queries of the input SQL file. Whenever a JOIN is found, its datasets are considered candidates for creating the refersTo relations.
The plugin then evaluates the candidate datasets using the matching rules described below. When one of the rules matches, a refersTo relation is created, with the direction being childDataset refersTo parentDataset. The relation is created with a score of that matching rule.
Some database management systems (such as Oracle) support automatic generation of audit files. These files record activities within the database by tracking executed SQL queries, user logins, schema changes, privilege escalations, and other events. An audit file can be used for creating an input SQL file for the plugin analysis. However, the audit file must first be transformed to remove all information except the SQL queries. This transformation can be done by creating a Broadway flow in your project, which converts the file to the required format.
This plugin is useful when a source does not have predefined foreign key constraints. Note that this plugin is inactive by default and must be activated via Discovery Pipeline if needed.
The following matching rules are applied by the plugin. Note that the rule is applied only if its score exceeds the plugin's threshold; otherwise, the rule is skipped.
singleFieldPkAndNotPk - Single PK field in Parent Dataset and non-PK field in Child Dataset.joinOnlyNoPkCheck - PKs are not checked. FK is based on JOIN only: left side of condition is the Parent Dataset, right side is the Child Dataset.:joinOnlyNoPkCheck matching rule is only available in the ANTLR mode.commonFieldsInBothPk - Common fields in PK of both datasets, but Child Dataset has additional PKs.sameFieldsInBothPk - Part of PK in both datasets, and both datasets have an identical number of PKs.The fieldTypeIncludeList plugin input parameter controls which field data types are considered when creating relations.
By default, this parameter is set to the STRING, INTEGER or REAL data type for this plugin. The valid values are STRING, INTEGER, REAL, DATETIME, DATE and BOOLEAN.
The queriesPath plugin input parameter must be set to the path to a folder containing the files with a list of queries to be analyzed.
The file name format is: <Data Platform name>.sql.
If the queriesPath is not set or contains an invalid path, the plugin throws an exception.
This plugin input parameter controls how queries are analyzed: using Fabric's internal parser only, by an LLM only, or first by Fabric with fallback to the LLM if needed (when results are insufficient). By default, analysisMode is set to ANTLR (Fabric's internal parser).
Note that the joinOnlyNoPkCheck matching rule is only available in the ANTLR mode.
The llmInterface parameter is optional. It allows overriding the project's default LLM interface to be used by the LLM plugin. This parameter should include the interface name.
llmInterface parameter is not set in the plugin definition, the plugin searches for an LLM AI interface tagged as 'discovery'. If none of the LLM AI interfaces are tagged as 'discovery', it will use an interface tagged as 'default'.llmInterface parameter is only applicable when the plugin invokes an LLM which occurs if the analysisMode parameter is set to either LLM or ANTLR & LLM.Reference by Data Comparison is a new plugin (introduced in Fabric V8.3) that examines data within data source fields to identify correlations using the probabilistic Bloom filter algorithm. Based on the analysis results, this plugin can establish FK relationships between datasets.
The data comparison is performed by comparing field values of two datasets at a time - dataset1 and dataset2. All fields in dataset2 are considered for analysis, while only the PK fields in dataset1 are used for this comparison. The data comparison results with a calculated score that represents the probability of a match between each pair of columns from two datasets.
Note that this plugin is inactive by default and must be activated via Discovery Pipeline if needed.
The following matching rules are applied by the plugin. Note that the rule is applied only if its score exceeds the plugin's threshold; otherwise, the rule is skipped.
singleFieldPkAndNotPk - Single PK field in Parent Dataset and non-PK field in Child Dataset.commonFieldsInBothPk - Common fields in PK of both datasets, but Child Dataset has additional PKs.sameFieldsInBothPk - Part of PK in both datasets, and both datasets have an identical number of PKs.The fieldTypeIncludeList plugin input parameter controls which field data types are considered when creating relations.
By default, this parameter is set to the STRING, INTEGER or REAL data type for this plugin. The valid values are STRING, INTEGER, REAL, DATETIME, DATE and BOOLEAN.
The factor refers to score multiplication, applied only when comparing columns defined by non-GUID data types. By default, the factor is set to 0.85 as it aims to reduce the score of potential matches between non-GUID columns.
The Trim Extra Relations plugin identifies and removes redundant refersTo relationships between the schema datasets to maintain only the essential connections. The plugin exclusively inspects refersTo relations created by Catalog plugins, while keeping the Crawler and Manual relations untouched. This optimization simplifies the data model and improves overall clarity.
The Problem: Redundant Relationships
Consider a scenario with three tables - Customer, Order, and Product - and the following links:
Customer ──────────────────────────────────> Product
│ ↑
│ │
└──────> Order ──────────────────────────────┘
The Customer → Product relationship is unnecessary because the path to Product is already established through Orders:
This redundant direct connection clutters the data model without adding analytical value.
The Solution: Streamlined Modeling
The plugin removes these unnecessary shortcuts while preserving the meaningful underlying relationships:
Customer ────────> Order ────────> Product
The resulting model is cleaner and accurately reflects business logic:
The plugin maps all existing relationships between datasets, similar to a transit map showing all possible routes. For every relationship, the plugin evaluates a core question: "Is there an alternative route from Dataset1 to Dataset2?"
If an alternative path exists, the direct relationship is deemed redundant and is removed, keeping the data model focused.
The plugin skips the following relationships from deletion to prevent the loss of critical metadata:
Origin = Crawler are actual database constraints and are never removed.Origin = Manual are explicitly defined by users and are preserved regardless of alternative paths.The Trim Extra Relations plugin is most effective when running discovery on data sources that lack physical foreign keys. In these cases, logical foreign keys are generated by Discovery plugins; the Trim plugin then simplifies the resulting model by removing redundant connections.
The article describes plugins that create refersTo relations in the Catalog schema based on various types of analysis performed on the data source.
The plugins are:
The purpose of the Reference by Name Comparison plugin (formerly known as Metadata Logical Reference) is to identify possible foreign key references between datasets by matching field names and to create refersTo relations. This plugin is useful in cases where a source does not have predefined foreign key constraints. Note that this plugin is inactive by default and must be activated via Discovery Pipeline if needed.
The matching algorithm operates by comparing the field names of two datasets at a time. Prior to the matching, the field names are normalized using the following formatting rules: underscore ‘_’ removal, conversion to lowercase letters and the addition of the table name in case the field name is 'ID'. For example, the field names customer.ID, CUSTOMER_ID and CustomerID will be normalized to the same value — customerid.
This plugin allows defining an exclusion list of field names (e.g., 'username' or 'age') and an exclusion list of field types (e.g., date, time, blob). The defined field names and types are excluded from the matching algorithm.
When the plugin finds a match based on field names, it evaluates the foreign key fields and the direction of the relation using the matching rules described below. The refersTo relation direction is childDataset refersTo parentDataset. The relation is created with a score of the matching rule.
The following matching rules are applied by the plugin. Note that each of these rules is applied only if its score exceeds the plugin's threshold; otherwise, the rule is skipped.
fieldNameIsIdAndPk - Parent Dataset ('DS1') has a PK field 'ID'; Child Dataset has a field 'DS1ID' (normalized).
fieldNameIsIdAndNotPk - Parent Dataset ('DS1') has a PK field 'ID'; Child Dataset has a field 'DS1ID' (normalized); both are non-PK.
singleFieldPkAndNotPk - Single PK field in a Parent Dataset and non-PK field in Child Dataset.
commonFieldsInBothPk - Common fields in PK of both datasets, but Child Dataset has additional PKs.
The relation Child refers to Parent is created.
Some examples of the matching rules are:
sameFieldsInBothPk - Part of PK in both datasets, and both datasets have an identical number of PKs.
Note that the sameFieldNamesNotPk rule to create relations between non-FK fields has been removed as of Fabric V8.3
The fieldTypeIncludeList plugins input parameter controls which field data types are considered when creating relations.
By default, this parameter is set to the STRING, INTEGER or REAL data type for this plugin. The valid values are STRING, INTEGER, REAL, DATETIME, DATE and BOOLEAN.
The purpose of the Reference by Query Analysis plugin (introduced in Fabric V8.3) is to identify possible foreign key references between datasets by analyzing JOIN operations in the queries of the input SQL file. Whenever a JOIN is found, its datasets are considered candidates for creating the refersTo relations.
The plugin then evaluates the candidate datasets using the matching rules described below. When one of the rules matches, a refersTo relation is created, with the direction being childDataset refersTo parentDataset. The relation is created with a score of that matching rule.
Some database management systems (such as Oracle) support automatic generation of audit files. These files record activities within the database by tracking executed SQL queries, user logins, schema changes, privilege escalations, and other events. An audit file can be used for creating an input SQL file for the plugin analysis. However, the audit file must first be transformed to remove all information except the SQL queries. This transformation can be done by creating a Broadway flow in your project, which converts the file to the required format.
This plugin is useful when a source does not have predefined foreign key constraints. Note that this plugin is inactive by default and must be activated via Discovery Pipeline if needed.
The following matching rules are applied by the plugin. Note that the rule is applied only if its score exceeds the plugin's threshold; otherwise, the rule is skipped.
singleFieldPkAndNotPk - Single PK field in Parent Dataset and non-PK field in Child Dataset.joinOnlyNoPkCheck - PKs are not checked. FK is based on JOIN only: left side of condition is the Parent Dataset, right side is the Child Dataset.:joinOnlyNoPkCheck matching rule is only available in the ANTLR mode.commonFieldsInBothPk - Common fields in PK of both datasets, but Child Dataset has additional PKs.sameFieldsInBothPk - Part of PK in both datasets, and both datasets have an identical number of PKs.The fieldTypeIncludeList plugin input parameter controls which field data types are considered when creating relations.
By default, this parameter is set to the STRING, INTEGER or REAL data type for this plugin. The valid values are STRING, INTEGER, REAL, DATETIME, DATE and BOOLEAN.
The queriesPath plugin input parameter must be set to the path to a folder containing the files with a list of queries to be analyzed.
The file name format is: <Data Platform name>.sql.
If the queriesPath is not set or contains an invalid path, the plugin throws an exception.
This plugin input parameter controls how queries are analyzed: using Fabric's internal parser only, by an LLM only, or first by Fabric with fallback to the LLM if needed (when results are insufficient). By default, analysisMode is set to ANTLR (Fabric's internal parser).
Note that the joinOnlyNoPkCheck matching rule is only available in the ANTLR mode.
The llmInterface parameter is optional. It allows overriding the project's default LLM interface to be used by the LLM plugin. This parameter should include the interface name.
llmInterface parameter is not set in the plugin definition, the plugin searches for an LLM AI interface tagged as 'discovery'. If none of the LLM AI interfaces are tagged as 'discovery', it will use an interface tagged as 'default'.llmInterface parameter is only applicable when the plugin invokes an LLM which occurs if the analysisMode parameter is set to either LLM or ANTLR & LLM.Reference by Data Comparison is a new plugin (introduced in Fabric V8.3) that examines data within data source fields to identify correlations using the probabilistic Bloom filter algorithm. Based on the analysis results, this plugin can establish FK relationships between datasets.
The data comparison is performed by comparing field values of two datasets at a time - dataset1 and dataset2. All fields in dataset2 are considered for analysis, while only the PK fields in dataset1 are used for this comparison. The data comparison results with a calculated score that represents the probability of a match between each pair of columns from two datasets.
Note that this plugin is inactive by default and must be activated via Discovery Pipeline if needed.
The following matching rules are applied by the plugin. Note that the rule is applied only if its score exceeds the plugin's threshold; otherwise, the rule is skipped.
singleFieldPkAndNotPk - Single PK field in Parent Dataset and non-PK field in Child Dataset.commonFieldsInBothPk - Common fields in PK of both datasets, but Child Dataset has additional PKs.sameFieldsInBothPk - Part of PK in both datasets, and both datasets have an identical number of PKs.The fieldTypeIncludeList plugin input parameter controls which field data types are considered when creating relations.
By default, this parameter is set to the STRING, INTEGER or REAL data type for this plugin. The valid values are STRING, INTEGER, REAL, DATETIME, DATE and BOOLEAN.
The factor refers to score multiplication, applied only when comparing columns defined by non-GUID data types. By default, the factor is set to 0.85 as it aims to reduce the score of potential matches between non-GUID columns.
The Trim Extra Relations plugin identifies and removes redundant refersTo relationships between the schema datasets to maintain only the essential connections. The plugin exclusively inspects refersTo relations created by Catalog plugins, while keeping the Crawler and Manual relations untouched. This optimization simplifies the data model and improves overall clarity.
The Problem: Redundant Relationships
Consider a scenario with three tables - Customer, Order, and Product - and the following links:
Customer ──────────────────────────────────> Product
│ ↑
│ │
└──────> Order ──────────────────────────────┘
The Customer → Product relationship is unnecessary because the path to Product is already established through Orders:
This redundant direct connection clutters the data model without adding analytical value.
The Solution: Streamlined Modeling
The plugin removes these unnecessary shortcuts while preserving the meaningful underlying relationships:
Customer ────────> Order ────────> Product
The resulting model is cleaner and accurately reflects business logic:
The plugin maps all existing relationships between datasets, similar to a transit map showing all possible routes. For every relationship, the plugin evaluates a core question: "Is there an alternative route from Dataset1 to Dataset2?"
If an alternative path exists, the direct relationship is deemed redundant and is removed, keeping the data model focused.
The plugin skips the following relationships from deletion to prevent the loss of critical metadata:
Origin = Crawler are actual database constraints and are never removed.Origin = Manual are explicitly defined by users and are preserved regardless of alternative paths.The Trim Extra Relations plugin is most effective when running discovery on data sources that lack physical foreign keys. In these cases, logical foreign keys are generated by Discovery plugins; the Trim plugin then simplifies the resulting model by removing redundant connections.