Starting from V8.2, the Catalog includes a data and metadata profiling plugin powered by an LLM. The plugin invokes an LLM via an LLM AI interface defined in the project. Before creating an AI interface, you must install the relevant extension (e.g., OpenAI Connector). The Discovery process uses a designated LLM AI interface tagged as 'discovery'; if no such interface exists, it falls back to one tagged as 'default'.
The LLM plugin profiles each column's metadata and data. The plugin's response depends on the user prompt, which is defined in the plugin's configuration. The pre-defined user prompt can be modified to fit the project's needs; it should correspond to the use cases explained further in this article.
The LLM plugin can perform various tasks. The following two use cases are pre-configured in the product's Discovery Pipeline:
- LLM Profiling - classification of the Catalog columns
- LLM Description - generation of a short description for each Catalog field
In addition to the above use cases, you can use the same plugin to implement your own use cases by updating the user prompt and the plugin's other input parameters. The article includes two additional use cases with configuration examples:
- Profiling based on a field property's values
- Identification of columns containing medical information
The plugin's input parameters are:
threshold is the minimum score at or above which the plugin skips a column. The threshold is set in order to minimize the number of calls to the LLM. It applies to cases where the column already has the same property created by another plugin during the same Discovery Job execution. By default, threshold is set to 0.7.

propertyName is the column property that should be created by the plugin. The default is "propertyName": "classification", which accommodates the LLM Profiling use case.

userPrompt is an LLM prompt definition. It is a dynamic string, comprised of several parts that are combined at run time. Some of these parts are taken from the framework and some from the plugin's definition, as follows:
- ${tableName}, ${columns} and ${columnName} are, respectively, the name of the table being profiled, the names of all columns in this table, and the name of the column being profiled. These 3 parameters are passed to the plugin by the framework.
- The userPrompt should be updated to fit the required use case and the project's needs.

possibleValues is a list of possible property values, for example: "possibleValues": ["FIRST_NAME","LAST_NAME","ADDRESS"]. If no list of possible values is required, update the userPrompt by removing the text that refers to the possible values.

possibleMTableValues is an alternative way to provide the possible property values; it allows the values to be retrieved from a project's MTable. Populate it using the format "<MTable name>.<Column name>", for example: "possibleMTableValues": "pii_profiling.name". Either possibleValues or possibleMTableValues should be populated in the plugin's definition, but not both; the userPrompt should be updated accordingly.

sampleSize defines the size of the data sample to be sent to the LLM. By default, it is set to 10. If you don't intend to send any sample data to the LLM, set the sample size to 0.

samplePrompt defines the part of the user prompt related to the sample data. It is included in the user prompt when sampleSize > 0 and the column is not empty in the data snapshot. ${sampleData} is the source data retrieved in the Snapshot step and added to the prompt.

incrementalMode defines whether the plugin should be executed for fields that already have the same property created by the same LLM plugin in a previous Discovery Job execution. This parameter is set in order to minimize the number of calls to the LLM. It has the following modes:
- "Keep All" - if the plugin has already been executed for this field in a previous Discovery Job execution, don't invoke it again. This applies regardless of whether the previous execution created a property; once a field has been processed, it will not be processed again. The plugin will only be invoked for new fields.
- "Keep Existing" (default) - if the plugin has already been executed for this field in a previous Discovery Job execution and created a property, don't invoke it again. The plugin will only be invoked for new fields and for fields without this property (e.g., "classification").
- "Evaluate All" - the plugin will be invoked for all fields.

llmInterface is an optional parameter. It allows overriding the project's default LLM AI interface to be used by the LLM plugin. This parameter should include the interface's name. If the llmInterface parameter is not set in the plugin definition, the plugin will search for an LLM AI interface tagged as 'discovery'. If none of the LLM AI interfaces is tagged as 'discovery', an interface with a 'default' tag will be used.

The Catalog includes 2 built-in plugins that perform profiling and classification of columns using regular expressions: Data Regex Classifier and Metadata Regex Classifier.
However, these plugins might miss some columns with sensitive data, for various reasons. For example, when a column doesn't have a meaningful name and the regular expression cannot be applied to the column's values (e.g., names of people or geographic locations), the regex-driven plugins will not classify such columns.
LLM-based plugins improve classification by analyzing column data in the context of table and column names.
The following is the default product configuration for LLM Profiling:
{
  "name": "LLM Profiling",
  "class": "com.k2view.discovery.plugins.llm.LLMDataProfilingPlugin",
  "active": true,
  "threshold": 0.7,
  "monitorDesc": "Classifications",
  "inputParameters": {
    "propertyName": "classification",
    "userPrompt": "Given the following table ${tableName} which includes the following columns ${columns}.\nPlease classify the column ${columnName} based on its name, choosing one of the following possible values: ${possibleValues}.\n${samplePrompt}\nIf none of the possible values match, return $NONE$.",
    "possibleValues": [
      "FIRST_NAME",
      "LAST_NAME",
      "ADDRESS",
      "CITY",
      "COUNTRY"
    ],
    "possibleMTableValues": "",
    "sampleSize": 10,
    "samplePrompt": "Here is a data sample from the column ${columnName} to help you classify the column: ${sampleData}.",
    "incrementalMode": "KEEP_ALL"
  }
}
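The ${...} placeholders in the configuration above are resolved at run time. As a rough illustration of that assembly (the actual substitution logic is internal to the Discovery framework; the table, column, and sample values below are made up), Python's string.Template happens to use the same ${...} placeholder syntax:

```python
from string import Template

# Illustrative sketch only: the real assembly is done by the Discovery framework.
# A literal "$NONE$" must be escaped as "$$NONE$$" in string.Template.
user_prompt_tpl = Template(
    "Given the following table ${tableName} which includes the following columns ${columns}.\n"
    "Please classify the column ${columnName} based on its name, choosing one of the "
    "following possible values: ${possibleValues}.\n${samplePrompt}\n"
    "If none of the possible values match, return $$NONE$$."
)
sample_prompt_tpl = Template(
    "Here is a data sample from the column ${columnName} "
    "to help you classify the column: ${sampleData}."
)

sample = ["John", "Mary", "Pat"]  # up to sampleSize values taken from the data snapshot
sample_part = (
    sample_prompt_tpl.substitute(columnName="cust_fname", sampleData=", ".join(sample))
    if sample else ""  # the samplePrompt part is omitted when the column is empty
)
prompt = user_prompt_tpl.substitute(
    tableName="customers",  # ${tableName}, ${columns}, ${columnName} come from the framework
    columns="cust_id, cust_fname, cust_lname",
    columnName="cust_fname",
    possibleValues="FIRST_NAME, LAST_NAME, ADDRESS, CITY, COUNTRY",
    samplePrompt=sample_part,
)
print(prompt)
```

The resolved prompt is what the plugin sends to the LLM AI interface for each profiled column.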
The LLM plugin can be used for various tasks. One of them is to generate a free-text description of the Catalog fields. To achieve that, the user prompt and other input parameters should be updated to include the required task.
The following is the default product configuration of the LLM Description plugin, which generates a short description for each field in the Catalog:
{
  "name": "LLM Description",
  "class": "com.k2view.discovery.plugins.llm.LLMDataProfilingPlugin",
  "active": true,
  "threshold": 0.8,
  "monitorDesc": "Descriptions",
  "inputParameters": {
    "propertyName": "description",
    "userPrompt": "Given the following table ${tableName} which includes the following columns ${columns}.\nPlease provide a one-line description of ${columnName} with a minimum of 5 words to be used in technical documentation.\n${samplePrompt}\nDo not include table or column names in your response.",
    "sampleSize": 10,
    "samplePrompt": "Here is a data sample from the column ${columnName} to help you describe the column: ${sampleData}.",
    "incrementalMode": "KEEP_ALL"
  }
}
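Both configurations above set a threshold and an incrementalMode to limit the number of LLM calls. A minimal sketch of the resulting skip logic, assuming a simple field record with a properties map and a processed-before flag (both hypothetical names; the real check is internal to the Discovery framework):

```python
def should_invoke_llm(field, property_name, threshold, incremental_mode):
    """Hypothetical sketch of the skip rules described in this article;
    'properties' and 'processed_before' are illustrative field attributes."""
    # threshold rule: skip when the column already carries the same property
    # with a high-enough score from the current Discovery Job execution.
    current = field.get("properties", {}).get(property_name)
    if current is not None and current.get("score", 0) >= threshold:
        return False
    if incremental_mode == "KEEP_ALL":
        # Never re-process a field handled in a previous execution,
        # whether or not that execution produced a property.
        return not field.get("processed_before", False)
    if incremental_mode == "KEEP_EXISTING":
        # Re-process only fields that still lack the property.
        return current is None
    # EVALUATE_ALL: invoke for every field (subject to the threshold rule above).
    return True
```

For example, under KEEP_EXISTING a field that already has a "classification" property is skipped, while a new field or a field without that property is sent to the LLM.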
The LLM Profiling plugin is most effective when column names are meaningful, column values provide contextual insight, or both. However, this is not always the case: sometimes the table and column names are not meaningful and the columns contain no data. In such cases, other field properties can shed more light on how to profile a column, and the LLM plugin can use a field property's values to perform the profiling.
For example, when the table and column names are not meaningful, the descriptions (or remarks) might have been included in the data source for each table and/or column explaining what is stored in them.
To include the value of the description property in the LLM analysis, use the following syntax in the userPrompt: ${property.description}.
Below is an LLM plugin configuration to support this use case:
{
  "name": "LLM Property Profiling",
  "class": "com.k2view.discovery.plugins.llm.LLMDataProfilingPlugin",
  "active": true,
  "threshold": 0.7,
  "monitorDesc": "Classifications",
  "inputParameters": {
    "propertyName": "classification",
    "userPrompt": "Please classify the column based on the information included here: \"${property.description}\" (if this information is empty, return $NONE$).\nChoose one of the following possible values: ${possibleValues}.\nIf none of the possible values match, return $NONE$.",
    "possibleValues": [
      "FIRST_NAME",
      "LAST_NAME",
      "ADDRESS",
      "CITY",
      "COUNTRY"
    ],
    "possibleMTableValues": "",
    "sampleSize": 10,
    "samplePrompt": "Here is a data sample from the column ${columnName} to help you classify the column: ${sampleData}.",
    "incrementalMode": "KEEP_ALL"
  }
}
The LLM plugin is not limited to the pre-configured use cases above. By customizing the user prompt and input parameters, you can define any use case that fits your project's needs. The following example demonstrates how to identify columns containing medical information — such as a medical condition, treatment, or drug — and tag them with a new property medicalInfo = true.
This can be achieved by setting up the relevant user prompt and updating the input parameters as follows:
{
  "name": "LLM Medical",
  "class": "com.k2view.discovery.plugins.llm.LLMDataProfilingPlugin",
  "active": true,
  "threshold": 0.8,
  "monitorDesc": "Medical Info",
  "inputParameters": {
    "propertyName": "medicalInfo",
    "userPrompt": "Given the following table ${tableName} which includes the following columns ${columns}.\nPlease verify if the column ${columnName}, based on its name, indicates a medical condition or relates to a specific medical treatment or drug.\n${samplePrompt}\nIf yes - return <true>. Otherwise, return $NONE$.",
    "sampleSize": 10,
    "samplePrompt": "Here is a data sample from the column ${columnName} to help you classify the column: ${sampleData}.",
    "incrementalMode": "KEEP_ALL"
  }
}
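The prompts in this article instruct the LLM to return $NONE$ when no answer applies. One plausible reading of that contract, written as a hypothetical post-processing helper (not product code): a $NONE$ response creates no property, while any other response becomes the property value.

```python
def apply_llm_answer(field_properties, property_name, answer):
    """Hypothetical interpretation of the response contract used in the
    prompts above: $NONE$ means 'no property is created'; any other
    answer is stored as the property's value."""
    answer = answer.strip()
    if answer == "$NONE$":
        return field_properties  # nothing to record for this column
    # e.g. "<true>" from the medical-info prompt becomes "true"
    field_properties[property_name] = answer.strip("<>")
    return field_properties
```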
Note that when using the com.k2view.discovery.plugins.llm.LLMDataProfilingPlugin class for a custom plugin, the following parameters are mandatory:
threshold - a number between 0 and 1
Input parameters:
propertyName
userPrompt
sampleSize - this can be set to 0, if not applicable.
samplePrompt - it should be referenced in the userPrompt. However, it can be empty if sampleSize is 0.
incrementalMode
The following parameters are optional:
possibleValues - this should be defined as an array and referenced in the userPrompt.
llmInterface
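The mandatory and optional parameters above lend themselves to a quick sanity check before deploying a custom plugin definition. The following is a hedged sketch of such a check (a hypothetical helper, not part of the product):

```python
def validate_llm_plugin_config(cfg):
    """Hypothetical pre-deployment check of a custom plugin definition,
    based on the mandatory/optional parameters listed in this article."""
    errors = []
    t = cfg.get("threshold")
    if not isinstance(t, (int, float)) or not (0 <= t <= 1):
        errors.append("threshold must be a number between 0 and 1")
    params = cfg.get("inputParameters", {})
    # Mandatory input parameters for the LLMDataProfilingPlugin class.
    for key in ("propertyName", "userPrompt", "sampleSize", "samplePrompt", "incrementalMode"):
        if key not in params:
            errors.append("missing mandatory input parameter: " + key)
    # samplePrompt may be empty only when sampleSize is 0.
    if params.get("sampleSize", 0) > 0 and "${samplePrompt}" not in params.get("userPrompt", ""):
        errors.append("userPrompt should reference ${samplePrompt} when sampleSize > 0")
    # Either possibleValues or possibleMTableValues, but not both.
    if params.get("possibleValues") and params.get("possibleMTableValues"):
        errors.append("populate either possibleValues or possibleMTableValues, not both")
    return errors
```

Running this against the default LLM Profiling configuration shown earlier would return an empty error list, since that configuration populates all mandatory parameters and leaves possibleMTableValues empty.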