Credential Detection Gaps in Azure Language Service vs. CredScan

KwangJe Cho, Microsoft Employee

I am from the DevDiv Data Platform team, and we are currently exploring the Azure AI Language (Cognitive) Service to redact PII and credentials within our telemetry. While we currently use the CredScan SDK with Python libraries, we are looking to migrate to the Azure Language Service for scalability reasons.

During our evaluation, we found that several credentials successfully redacted by CredScan are not handled correctly by the Azure Language Service. We are reaching out to see if your team has a plan to consistently update the detection logic to cover a broader range of credentials, or if the service supports custom regex/rules to bridge these gaps.

Our benchmarking revealed that the Azure Language Service missed 19 out of 46 credentials. Key findings from our report include:

Critical Credential Detection Failures

General & Cloud Secrets: Missed X.509 Private Keys, ASP.NET Machine Keys, and AWS Secret Access Keys.

Azure-Specific Keys: Failed to detect Azure Management Certificates, Redis Connection Strings, and Azure Batch Shared Access Keys.

DevOps & CI/CD: Missed GitHub Personal Access Tokens (PATs), NPM Author tokens, and Slack Access Tokens.

Authentication: Failed to identify Web Authentication Cookies (FedAuth) and OAuth Client Secrets.

PII & Clean Text False Positives

We also observed issues with over-redaction where non-sensitive information was incorrectly flagged:

Invalid Formats: The service detected a partial date within an invalid IP address and flagged an invalid email TLD as a Person/Organization.

Business Language: Common phrases like "personally identifiable" and relative time references like "yesterday" were flagged as PII with high confidence.

We would appreciate your guidance on any recommendations you have for achieving CredScan-level parity, or if there is a roadmap for these improvements.

Answer 1

Hello KwangJe Cho,

Welcome to Microsoft Q&A, and thank you for posting your question here.

I understand that you are seeing credential detection gaps in Azure Language Service compared with CredScan.

To clarify some core questions:

Is Azure Language planning to expand credential detection?

No public roadmap, and model design makes parity with CredScan unlikely.

Can I add custom regex/rules to Azure Language PII detection?

Absolutely not. The model is fixed.

Why were 19/46 secrets missed?

Because Azure Language is not engineered for secret detection or pattern scanning.

How do I achieve CredScan‑level coverage?

You cannot do it with Azure Language alone. You must combine CredScan + your own regex + Azure Language PII.

Can entity Synonyms help detect tokens or secrets?

No. They have zero effect on secret detection.

In practical terms, Azure AI Language provides fixed ML‑ and pattern‑based PII detection. It cannot accept custom regex, perform secret scanning, or replace security tools such as CredScan. Its models operate “as‑is” with no rule injection or credential‑level guarantees, as shown in Microsoft’s documentation:

Azure Language Overview - https://language.cognitive.azure.com/
Language Detection & PII Model Constraints - https://learn.microsoft.com/en-us/azure/ai-services/language-service/language-detection/overview

To achieve reliable protection, run CredScan first to capture all keys, tokens, certificates, and connection strings, then apply your custom regex rules for organization‑specific secret formats, and finally use Azure AI Language strictly for traditional PII like names and emails. This layered approach maximizes recall, minimizes false positives, and ensures each component performs the task it was built for:

Microsoft CredScan Documentation - https://learn.microsoft.com/en-us/azure/devops/repos/security/credential-scanning
Azure AI Language PII Detection - https://learn.microsoft.com/en-us/azure/ai-services/language-service/personally-identifiable-information/overview
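
As a rough illustration of the layered order described above, here is a minimal Python sketch. Everything in it is an assumption made for the example: `secret_scanner_stage` and `pii_stage` are simple regex stand-ins for CredScan and for the Azure AI Language PII call, not their real APIs, and the `contoso_` token format is invented. The point is only the ordering: secrets are masked before the text ever reaches the PII stage.

```python
import re
from typing import Callable, Iterable

# A stage takes text and returns text with its findings masked.
Redactor = Callable[[str], str]

# Hypothetical organization-specific secret format, invented for this example.
ORG_TOKEN = re.compile(r"\bcontoso_[A-Za-z0-9]{24}\b")

# PEM-style private key block (covers "RSA PRIVATE KEY", "PRIVATE KEY", etc.).
PEM_KEY = re.compile(
    r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
    re.DOTALL,
)

# Deliberately simple email pattern, used only by the PII stand-in below.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def secret_scanner_stage(text: str) -> str:
    """Stand-in for a dedicated secret scanner such as CredScan.

    Assumption: the real scanner would be called here. This placeholder
    only masks PEM private key blocks.
    """
    return PEM_KEY.sub("[SECRET]", text)


def custom_regex_stage(text: str) -> str:
    """Mask organization-specific secret formats with custom regex rules."""
    return ORG_TOKEN.sub("[ORG_SECRET]", text)


def pii_stage(text: str) -> str:
    """Stand-in for the Azure AI Language PII call.

    Assumption: the real service call would go here. This placeholder
    only masks email addresses.
    """
    return EMAIL.sub("[EMAIL]", text)


def redact_layers(text: str, stages: Iterable[Redactor]) -> str:
    """Apply each stage in order; later stages see already-masked text."""
    for stage in stages:
        text = stage(text)
    return text


# Order matters: secret scanner first, custom rules second, PII last.
STAGES = [secret_scanner_stage, custom_regex_stage, pii_stage]
```

Calling `redact_layers(telemetry_line, STAGES)` runs the three stages in sequence. Keeping each stage behind the same `str -> str` signature makes it easy to swap a stand-in for the real tool later without touching the rest of the pipeline.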

I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.

Please don't forget to close out the thread by upvoting this reply and accepting it as the answer if it is helpful.

Answer 2

Hi KwangJe Cho,

Thanks for raising this.

At the moment, PII and credential detection in Azure Language services (including Text Analytics / AI Language) is based on pattern and model-driven detection, not on full semantic or contextual understanding. Because of that, the service can miss credentials in scenarios where:

The credential format does not match known or common patterns
The value is embedded in free‑form text or custom token structures
The credential looks similar to a generic string (e.g., mixed alphanumeric content without clear prefixes)
The text is truncated, obfuscated, or lacks clear separators

This behavior is expected and is documented as part of the limitations of the current PII detection models. The service is designed to minimize false positives, which means some true positives, especially edge cases, may not be detected automatically.

A few workarounds you may consider:

Combine Azure Language PII detection with custom regex or application‑level validation for known credential formats in your workload.
If you have organization‑specific credential patterns (API keys, internal tokens, etc.), handle them outside the built‑in PII categories.
For sensitive workflows, consider using multiple layers of validation (PII detection + secret scanning + logging safeguards).
Share anonymized examples through your support channel or feedback so the product team can evaluate model improvements.
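
The first workaround, custom regex for known credential formats, could look something like the sketch below. The patterns are illustrative approximations of a few formats from the question (GitHub PATs, Slack tokens, PEM private keys) based on their commonly seen prefixes. They are assumptions, not vendor-verified rules, and should be checked against each vendor's current documentation before use. The function names are also made up for this example.

```python
import re
from typing import List, Tuple

# Illustrative patterns only. Real token formats vary and change over time.
PATTERNS = {
    # Classic GitHub personal access token: "ghp_" followed by 36 characters.
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    # Slack tokens commonly start with xoxb-, xoxp-, xoxa-, xoxr- or xoxs-.
    "slack_token": re.compile(r"\bxox[abprs]-[A-Za-z0-9-]{10,}"),
    # PEM-encoded private key block.
    "pem_private_key": re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
        re.DOTALL,
    ),
}


def find_credentials(text: str) -> List[Tuple[str, int, int]]:
    """Return (label, start, end) for every match, sorted by start offset."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((label, match.start(), match.end()))
    return sorted(spans, key=lambda span: span[1])


def redact(text: str) -> str:
    """Replace each detected credential with a labeled placeholder."""
    pieces = []
    cursor = 0
    for label, start, end in find_credentials(text):
        if start < cursor:
            # Overlaps a span that was already replaced; skip it.
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"[REDACTED:{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Returning spans from `find_credentials` rather than rewriting the text directly keeps the offsets available for logging or for comparing against what the built-in PII detection reported on the same input.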

Please let me know if there are any remaining questions or additional details I can help with. I’ll be glad to provide further clarification or guidance.

Hope this helps!
