Hash and upload the sensitive information source table for exact data match sensitive information types

This article shows you how to hash and upload your sensitive information source table.

Hash and upload the sensitive information source table

In this phase, you:

  1. Set up a custom security group and user account.
  2. Set up the Exact Data Match (EDM) Upload Agent tool.
  3. Use the EDM Upload Agent tool to hash, with a salt value, the sensitive information source table, and upload it.

You can hash and upload your sensitive data using either the two-computer method or the single-computer method, as described in Hash and upload your data. Best practice is to use two computers to separate the processes of hashing and uploading your sensitive data. Separating the steps across two computers helps ensure that your actual data is never available in clear text on a computer that might be compromised because it's connected to the internet. It also makes it easier to isolate any issues you encounter.
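
To illustrate why a salted hash can be uploaded instead of the clear-text table, here is a minimal Python sketch. It isn't the EDM Upload Agent's actual algorithm or file format, which this article doesn't describe. The function name `salted_hash`, the choice of SHA-256, and the sample values are assumptions made purely for illustration.

```python
import hashlib
import secrets


def salted_hash(value: str, salt: bytes) -> str:
    """Return a hex digest of salt + value (illustrative only, not the EDM format)."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


# A random salt is generated once and reused for every cell of the table.
salt = secrets.token_bytes(16)

# Hypothetical row from a sensitive information source table.
row = ["123-45-6789", "Tom O'Neil"]
hashed_row = [salted_hash(cell, salt) for cell in row]

# The same value with the same salt always produces the same digest,
# so hashed values can be compared without keeping the clear text.
assert salted_hash("123-45-6789", salt) == hashed_row[0]

# A different salt produces a different digest for the same value.
other_salt = secrets.token_bytes(16)
assert salted_hash("123-45-6789", other_salt) != hashed_row[0]
```

In this sketch, only `hashed_row` and `salt` would need to leave the secure computer. The clear-text `row` never would.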

Prerequisites

Technology requirements

  • A work or school account for Microsoft 365. This account must be added to the EDM_DataUploaders security group.
  • A computer with one of the following operating systems. This computer runs the EDM Upload Agent.
    • Windows 11
    • Windows 10
    • Windows Server 2016 with .NET version 4.6.2
    • Windows Server 2019
    • Windows Server 2022
  • A directory on the computer you use for uploading your data. This directory contains:
    • The EDM Upload Agent.
    • Your sensitive information data file in .csv, .tsv, or pipe (|) format. By default, the EDM Upload Agent expects your data file to be in .csv format.

      Tip

      You can use a file with data that is separated by tabs or pipes (instead of commas) by specifying either the "{Tab}" or "|" option with the /ColumnSeparator parameter. For example: EdmUploadAgent.exe /UploadData /DataStoreName PatientRecords /DataFile C:\Edm\Hash\PatientRecords.tsv /HashLocation C:\Edm\Hash /Schema edm.xml /ColumnSeparator "{Tab}" /AllowedBadLinesPercentage 5

    • The output hash and salt files that are created when you complete the hash procedure.
    • The datastore name from the edm.xml file. Our example uses PatientRecords.
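
If you aren't sure which datastore name your schema uses, you can read it from the schema file. The following Python sketch assumes that edm.xml contains a `DataStore` element with a `name` attribute. The sample XML, its namespace, and the function name `datastore_names` are illustrative assumptions, so check them against your own file.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema content; a real edm.xml may differ in detail.
SAMPLE_SCHEMA = """<EdmSchema xmlns="http://schemas.microsoft.com/office/2018/edm">
  <DataStore name="PatientRecords" description="Example schema" version="1">
    <Field name="PatientID" searchable="true" />
    <Field name="LastName" searchable="false" />
  </DataStore>
</EdmSchema>"""


def datastore_names(schema_xml: str) -> list[str]:
    """Return the name attribute of every DataStore element, ignoring namespaces."""
    root = ET.fromstring(schema_xml)
    names = []
    for element in root.iter():
        # Namespaced tags look like "{namespace}DataStore"; keep the local part.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "DataStore" and "name" in element.attrib:
            names.append(element.attrib["name"])
    return names


print(datastore_names(SAMPLE_SCHEMA))  # prints ['PatientRecords']
```

To read a schema from disk, pass the file's text to the function, for example `datastore_names(open("edm.xml", encoding="utf-8").read())`.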

Security group and user account requirements

  1. As a global administrator, go to the admin center using the appropriate link for your subscription and create a security group called EDM_DataUploaders.

  2. Add one or more users to the EDM_DataUploaders security group. (These users are the ones who manage the database of sensitive information.)

Exact Data Match Schema

If you used the EDM schema and sensitive information type tool for the new experience or the EDM sensitive information type/rule package for the classic experience, you must download that schema to hash your sensitive information source table. For more information, see Exporting the EDM schema file in XML format.

To download this EDM schema, open a Command Prompt window and run the following command:

EdmUploadAgent.exe /SaveSchema /DataStoreName <schema name> /OutputDir <path to output folder>

Data formatting requirements

Before you hash and upload your sensitive data, run a search for any special characters in the table that might cause problems in parsing the content.

You can validate that the table is in a suitable format by using the EDM Upload Agent with the following syntax:

EdmUploadAgent.exe /ValidateData /DataFile [data file] /Schema [schema file]

Common formatting issues

  1. Mismatched number of columns: This issue can be due to the presence of commas or quote characters within values in the table that EDM interprets as column delimiters. Unless they're surrounding a whole value, single and double quotes can cause the tool to misidentify the start and end of individual columns.
  2. Single quote characters or commas inside a value: For example, if a person's name includes a single quote such as Tom O'Neil or a city's name starts with an apostrophe such as 's-Gravenhage, you need to modify the data export process used to generate the sensitive information table and surround such columns with double quotes.
  3. Double quote characters inside values: Best practice is to use the tab-delimited format for the table. Tab-delimited tables are less susceptible to such issues.
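
As a rough pre-check before you run /ValidateData, the following Python sketch flags rows whose column count differs from the header row. It uses Python's `csv` module, whose parsing rules aren't guaranteed to match those of the EDM Upload Agent, so treat it as a quick screen rather than a replacement for /ValidateData. The function name `mismatched_rows` and the sample data are hypothetical.

```python
import csv
import io


def mismatched_rows(text: str, delimiter: str = ",") -> list[tuple[int, int]]:
    """Return (row_number, column_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    problems = []
    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append((row_number, len(row)))
    return problems


# Hypothetical sample. Row 2 wraps a value containing a comma in double quotes,
# so it parses as three columns. Row 3 has an unquoted comma inside the city
# value, so it parses as four columns.
sample = (
    "Name,City,PatientID\n"
    '"O\'Neil, Tom",\'s-Gravenhage,1001\n'
    "Ann Lee,Portland, OR,1002\n"
)

print(mismatched_rows(sample))  # prints [(3, 4)]
```

For a tab-delimited table, pass `delimiter="\t"`. Values that contain commas then no longer affect the column count.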

Hash and upload your data

Your sensitive information source table is in clear text. By using one computer for the hash step and a different computer for the upload step, you protect your data from being exposed in clear text on a computer with a direct connection to your Microsoft 365 tenant.

Important

This approach requires the same version of the EDM Upload Agent to be installed on both computers. You can then copy the hash file and the salt file from the secure computer to a computer that can connect directly to your Microsoft 365 tenant.

  1. On the computer in the secure environment, run the following command in a Command Prompt window:

    EdmUploadAgent.exe /CreateHash /DataFile [data file] /HashLocation [hash file location] /Schema [Schema file] /AllowedBadLinesPercentage [value]

    For example: EdmUploadAgent.exe /CreateHash /DataFile C:\Edm\Data\PatientRecords.csv /HashLocation C:\Edm\Hash /Schema edm.xml /AllowedBadLinesPercentage 5

    If you didn't specify the /Salt <saltvalue> option, this command outputs a hashed file and a salt file with these extensions:

    • EdmHash
    • EdmSalt
  2. Securely copy these files to the computer you use to upload your sensitive information source table (for example, PatientRecords.csv) to your tenant.

  3. Authorize the EDM Upload Agent:

    1. As an admin, open a Command Prompt window.
    2. Switch to the directory where the EDM Upload Agent is installed. (The recommended directory is C:\EDM\Data.)
    3. Run the following command:

    EdmUploadAgent.exe /Authorize

    Important

    You must run the EDM Upload Agent from the folder where it's installed, and you must indicate the full path to your data files.

  4. Sign in with your work or school Microsoft 365 account (the account that was added to the EDM_DataUploaders security group). Your tenant information is extracted from the user account to make the connection.

  5. To upload the hashed data, run the following command in a Command Prompt window:

    EdmUploadAgent.exe /UploadHash /DataStoreName <DataStoreName> /HashFile <HashedSourceFilePath> /ColumnSeparator ["{Tab}"|"|"]

    For example: EdmUploadAgent.exe /UploadHash /DataStoreName PatientRecords /HashFile C:\Edm\Hash\PatientRecords.EdmHash

  6. To verify that the upload of your sensitive data was successful, run the following command in a Command Prompt window:

    EdmUploadAgent.exe /GetDataStore

    If the upload was successful, the command displays a list of data stores and when they were last updated.

  7. To display all of the data uploads to a particular store, and when they were updated, run the following command in a Command Prompt window:

    EdmUploadAgent.exe /GetSession /DataStoreName <DataStoreName>
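
The hash and upload commands above can be assembled programmatically, which helps avoid typing mistakes in long paths. The following Python sketch only builds the argument lists and doesn't run the agent. The helper names are hypothetical. The assumption that the hash file is named after the data file with an .EdmHash extension is taken from this article's example and might not hold in every case.

```python
from pathlib import PureWindowsPath

AGENT = "EdmUploadAgent.exe"


def create_hash_args(data_file: str, hash_location: str, schema: str,
                     bad_lines_pct: int) -> list[str]:
    """Arguments for the hash step, run on the computer in the secure environment."""
    return [
        AGENT, "/CreateHash",
        "/DataFile", data_file,
        "/HashLocation", hash_location,
        "/Schema", schema,
        "/AllowedBadLinesPercentage", str(bad_lines_pct),
    ]


def expected_hash_file(data_file: str, hash_location: str) -> str:
    """Assumed hash file path: <hash location>\\<data file name>.EdmHash."""
    stem = PureWindowsPath(data_file).stem
    return str(PureWindowsPath(hash_location) / f"{stem}.EdmHash")


def upload_hash_args(datastore: str, hash_file: str) -> list[str]:
    """Arguments for the upload step, run on the internet-connected computer."""
    return [AGENT, "/UploadHash", "/DataStoreName", datastore, "/HashFile", hash_file]


data_file = r"C:\Edm\Data\PatientRecords.csv"
hash_location = r"C:\Edm\Hash"

print(" ".join(create_hash_args(data_file, hash_location, "edm.xml", 5)))
hash_file = expected_hash_file(data_file, hash_location)
print(" ".join(upload_hash_args("PatientRecords", hash_file)))
```

On a Windows computer where the agent is installed, you could pass each list to `subprocess.run(args, check=True)` from the agent's folder. The /Authorize step is interactive and must be completed first.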

Tip

To automate the hash and upload process after you have created it the first time, see Refresh your exact data match sensitive information source table file.

EDM and double-byte character set languages

Exact data match supports double-byte characters, such as those used in Chinese, Japanese, and Korean (CJK). However, it doesn't support string matches for corroborative evidence encoded as double-byte characters. It also doesn't match multi-token CJK text detected in the classified content unless globalization for EDM is enabled, as described later in this document. In all cases, a SIT must be mapped to any multi-token text, both for the primary field and for corroborative evidence fields.

To invoke exact data matching for double-byte characters, take the following steps:

  1. Create an EDM Sensitive Information Type (SIT) configured to match on the double-byte character set language, such as Japanese kanji.
  2. Ensure that you download and install version 17.01.0495.0 (or later) of the EDM Upload Agent.
  3. Update the EdmUploadAgent.exe.config file’s globalization parameter to true: <add key="IsGlobalizationEnabled" value="true" />
  4. Hash and upload a source table with the data to be matched.
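
The configuration change in step 3 can be made by hand in a text editor. If you prefer to script it, the following Python sketch shows one way to do so. It assumes that the setting lives in an `appSettings` element under `configuration`, which is the conventional layout for .NET application configuration files. The sample configuration and the function name `enable_globalization` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal config; the real EdmUploadAgent.exe.config holds other settings.
SAMPLE_CONFIG = """<configuration>
  <appSettings>
    <add key="IsGlobalizationEnabled" value="false" />
  </appSettings>
</configuration>"""


def enable_globalization(config_xml: str) -> str:
    """Return the config XML with IsGlobalizationEnabled set to "true"."""
    root = ET.fromstring(config_xml)
    settings = root.find("appSettings")
    if settings is None:
        # Assumption: create the section if it's missing.
        settings = ET.SubElement(root, "appSettings")
    for entry in settings.findall("add"):
        if entry.get("key") == "IsGlobalizationEnabled":
            entry.set("value", "true")
            break
    else:
        # The key wasn't present, so add it.
        ET.SubElement(settings, "add",
                      {"key": "IsGlobalizationEnabled", "value": "true"})
    return ET.tostring(root, encoding="unicode")


print(enable_globalization(SAMPLE_CONFIG))
```

ElementTree drops XML comments and the XML declaration when it rewrites a document, so back up the original file before you overwrite it with the result.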

Next steps