Azure AI Search: Why is OCR Reprocessing All Pages on Incremental Update?
Hello,
I'm experimenting with Azure AI Search for a new feature in our product. I'm running into a problem: although I've enabled incremental enrichment, skills that should not run are still being executed.
Our clients have PDF documents that need to be indexed. We have different kinds of content:
- Fully scanned documents
- Partially scanned documents: last page containing a signature is scanned
- Not scanned at all
I've set up an indexer with the incremental enrichment cache activated. The skillset consists of:
- document cracking
- ocr skill
- merge skill
- custom skill (extract metadata)
- split skill
- embedding skill
Once all the documents in my blob storage are indexed, I update one metadata field on a single blob. I expected that OCR would not run, but this results in 131 pages processed on cognitive services (the exact number of images in the PDF).
I checked in the cached data, and I've found that for this document I have 262 images in the binary folder.
Somehow, something has invalidated the cache, and I wonder what.
Azure AI Search
-
Shree Hima Bindu Maganti 815 Reputation points • Microsoft Vendor
2024-11-06T06:26:29.3133333+00:00 Hi @mathias Herbaux ,
Welcome to the Microsoft Q&A Platform!
The issue of Azure Cognitive Search OCR reprocessing all pages during an incremental update can have several possible causes. The following steps may help:
Limit Metadata Changes: Only update essential metadata fields. Minor changes can trigger full reprocessing.
Keep Skillset Consistent: Avoid modifying the skillset once incremental enrichment is in place. Any changes can invalidate the cache.
Use Conditional Logic for OCR: Only apply OCR to pages that need it (e.g., newly scanned or modified pages). Store OCR results in a separate field for future reference.
Verify Cache Integrity: Ensure there are no duplicate images or pages in the cache. Clean or reset the cache if needed.
Implement Document Hashing: Use a hash (checksum) for each document to detect actual content changes, so the indexer only reprocesses updated documents.
Custom Split Skill: Customize the split skill to identify and process only scanned pages, preventing unnecessary OCR on non-scanned pages.
Limit Skill Execution: Configure skills to only trigger downstream processing (like merge or embedding) if new content has been processed.
Implementing these steps should help reduce unnecessary reprocessing in Azure Cognitive Search, improving efficiency and resource usage.
Let me know if you have any queries.
-
mathias Herbaux 0 Reputation points
2024-11-06T13:01:51.56+00:00 I find this answer not relevant to my problem:
Limit Metadata Changes: as specified in my message, I only update one metadata field of the blob. It seems odd that a single metadata change triggers reprocessing when the file content hasn't changed.
Keep Skillset Consistent: it hasn't been modified; only the one metadata field has been updated.
Use Conditional Logic for OCR: how would one know if the file is a scan or partial scan without opening the file? Doesn't seem a good option.
Verify Cache Integrity: I did, as mentioned in my post. I've tested my setup multiple times (delete indexer/index/skillset/datasource).
Implement Document Hashing: I hope this is not the only solution to prevent this.
Custom Split Skill: uh? The split skill is applied on text, not pages and happens after the document cracking/ocr
Limit Skill Execution: Can't, the metadata has to be updated on documents.
-
Shree Hima Bindu Maganti 815 Reputation points • Microsoft Vendor
2024-11-07T06:44:15.1866667+00:00 Hi @mathias Herbaux ,
Thank you for your response.
It seems that your Azure AI Search index is reprocessing entire documents, including OCR, even when only a single metadata field is updated.
Double-check Indexer Configuration
Ensure the incremental update policy is set to update only changed fields.
Verify field mappings to avoid unintended reprocessing.
Optimize Skillset
Consider prioritizing skills to minimize unnecessary processing.
Explore conditional logic within skills, if applicable, to reduce processing steps.
Monitor Cache Behavior
Keep an eye on cache performance and expiration settings.
Adjust cache expiration time to balance performance and accuracy.
Review Azure Monitor Logs
Analyze logs to identify specific triggers for reprocessing.
Introducing incremental enrichment in Azure Cognitive Search: https://azure.microsoft.com/en-us/blog/introducing-incremental-enrichment-in-azure-cognitive-search/
Enable caching for incremental enrichment: https://learn.microsoft.com/en-us/azure/search/search-howto-incremental-index
Let me know if you need any assistance.
-
Shree Hima Bindu Maganti 815 Reputation points • Microsoft Vendor
2024-11-08T01:35:13.0733333+00:00 Hi @mathias Herbaux ,
Following up to see if you have had a chance to check my previous response and provide the requested information so that we can assist you further.
-
Shree Hima Bindu Maganti 815 Reputation points • Microsoft Vendor
2024-11-11T00:46:54.61+00:00 Hi @mathias Herbaux ,
Following up to see if you have had a chance to check my previous response and provide the requested information so that we can assist you further.
-
mathias Herbaux 0 Reputation points
2024-11-12T08:35:54.07+00:00 Hello,
I'm sorry, but the provided insights are not helping. It really looks like an AI-generated answer. I'd like to have somebody look into the technical details, instead of vague answers pointing to documentation that I have already read a couple of times.
Kind regards
-
Shree Hima Bindu Maganti 815 Reputation points • Microsoft Vendor
2024-11-18T06:50:05.6866667+00:00 Hi @mathias Herbaux ,
Sorry for the late response.
- Go to Azure Cognitive Search in the Azure Portal.
- Under Indexers, select your indexer and ensure Incremental Enrichment is enabled.
- Check if your skills are configured correctly.
- Ensure that the OCR skill is applied only when necessary.
- Make sure that Document Cracking or other skills aren’t triggering reprocessing of all pages unnecessarily.
- In the Indexers section, after running an incremental update, check "Logs" to see what skills are being triggered.
- If OCR is running on all pages, it will appear in the logs.
- Update the metadata of the blob and trigger an incremental update.
- If OCR processes all pages despite minimal changes, check if metadata changes are causing a full reindex.
Incremental Enrichment in Azure Cognitive Search
Create an indexer in Azure Cognitive Search
This approach should help you determine why OCR is being re-triggered on all pages.
-
mathias Herbaux 0 Reputation points
2024-11-18T16:08:35.7766667+00:00 Don't know what to say. I understand that Azure Cognitive Search was the previous name of Azure AI Search, but even the links provided are wrong, pointing to 404 web pages.
FYI: I read those links
- https://learn.microsoft.com/en-us/azure/search/search-howto-incremental-index?tabs=portal
- https://learn.microsoft.com/en-us/azure/search/cognitive-search-incremental-indexing-conceptual
The enableReprocessing is set to True.
The incremental enrichment cache is set, as I mentioned in my original question:
I checked in the cached data, and I've found that for this document I have 262 images in the binary folder.
I might not have been clear enough:
I'm referring to the container created by the incremental enrichment cache, where we can find the processed documents. I noticed that a document, after being indexed for the first time, had 131 images in the binary folder, corresponding to all images present in the PDF. After updating a single metadata field of this PDF and running the indexer again, the binary folder had 262 images, meaning that even though the PDF hadn't changed, it had been fully reprocessed.
-
Shree Hima Bindu Maganti 815 Reputation points • Microsoft Vendor
2024-11-19T08:20:33.5333333+00:00 Hi @mathias Herbaux ,
Thank you for clarifying.
It seems like the root of the issue is related to how the incremental enrichment cache handles changes. Even with metadata updates, the behavior indicates that the cache might be invalidating or not functioning as expected for your scenario.
Validate Incremental Enrichment Cache:
Ensure the cache is enabled and correctly configured in your skillset JSON
{
  "enrichmentCache": {
    "storageConnectionString": "<YourBlobStorageConnectionString>",
    "containerName": "<YourCacheContainerName>"
  }
}
Disable Reprocessing:
Set enableReprocessing to false for all skills except those requiring reprocessing (like OCR).
Add a Custom ShouldProcess Skill:
Implement a custom skill to evaluate whether OCR should run based on metadata changes.
{
  "@odata.type": "#Microsoft.Skills.Text.ConditionalSkill",
  "inputs": [
    { "name": "condition", "source": "/document/metadata/lastModified" }
  ],
  "outputs": [
    { "name": "output", "targetName": "shouldProcess" }
  ]
}
Inspect Indexer Logs:
Check Indexer Run Details for cache misses or triggers causing full reprocessing.
Go to Azure Portal > Indexer > Run Details > Logs.
Modify Metadata Handling:
Ensure metadata updates do not affect the fields mapped to skills unnecessarily, avoiding cache invalidation.
Test with Minimal Skills:
Temporarily simplify the skillset to isolate the issue.
-
mathias Herbaux 0 Reputation points
2024-11-19T08:40:30.85+00:00 Humm,
In my understanding, enableReprocessing is a property only available on the cache, not a per skill property.
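For reference, here is a minimal sketch of where the cache settings sit. It assumes the indexer REST payload shape described in the incremental enrichment how-to linked earlier in this thread, where `cache` is a property of the indexer definition and `enableReprocessing` lives inside it. The helper name `build_indexer_definition` and all resource names are placeholders, and the code only builds the payload without calling the service.

```python
import json


def build_indexer_definition(name, data_source, index, skillset, cache_connection_string):
    """Return an indexer definition dict with the enrichment cache enabled.

    Assumption: property names follow the Azure AI Search indexer REST payload
    ("cache" -> "storageConnectionString", "enableReprocessing"); check them
    against the API version you target.
    """
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "skillsetName": skillset,
        # Cache settings are attached to the indexer, not to individual skills.
        "cache": {
            "storageConnectionString": cache_connection_string,
            "enableReprocessing": True,
        },
    }


# Placeholder names, for illustration only.
definition = build_indexer_definition(
    name="pdf-indexer",
    data_source="pdf-blobs",
    index="pdf-index",
    skillset="pdf-skillset",
    cache_connection_string="<storage-connection-string>",
)
print(json.dumps(definition, indent=2))
```

Nothing in this payload exposes a per-skill switch: in this sketch `enableReprocessing` applies to the indexer's cache as a whole.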
I've already tried to conditionally activate OCR, but it does not seem to work because:
It takes only one input, which comes from context: image (https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-ocr). Moreover, as mentioned in the documentation: Currently only works with "/document/normalized_images" field. Hence we can't add a condition to perform OCR.
What are your thoughts about that?
-
Shree Hima Bindu Maganti 815 Reputation points • Microsoft Vendor
2024-11-19T13:53:54.7933333+00:00 Hi @mathias Herbaux ,
Thank you for your response.
Yes, you are correct. The enableReprocessing property is available only at the cache level, not on a per-skill basis.
Regarding conditionally activating OCR:
- The OCR skill only accepts /document/normalized_images as input, meaning you cannot add a condition directly within the OCR skill to control its execution.
- Conditional activation of OCR isn't feasible due to this input constraint. The pipeline will process all images if /document/normalized_images exists, regardless of metadata changes.
- If metadata changes trigger the indexer, the OCR skill will reprocess all images because it relies on normalized images from the pipeline, and no inherent mechanism exists to skip this step based on metadata alone.
-
mathias Herbaux 0 Reputation points
2024-11-19T13:58:55.5966667+00:00 Does it mean that there is no possible solution? Sounds weird, since this is the exact purpose of the incremental enrichment caching feature: to cache skill outputs so that they are only re-executed if their inputs have changed.
Is there somebody technical on your side that could look into it?
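The expectation described here can be illustrated with a toy model. This is not how Azure AI Search implements its cache; it is only a sketch under the assumption that a skill's cached output is keyed by a hash of that skill's own inputs. `SkillCache`, `run_ocr`, and the sample bytes are hypothetical.

```python
import hashlib


class SkillCache:
    """Toy cache keyed by skill name plus a hash of the skill's input bytes."""

    def __init__(self):
        self._store = {}
        self.executions = 0  # counts real (non-cached) skill runs

    def run(self, skill_name, skill_fn, input_bytes):
        key = (skill_name, hashlib.sha256(input_bytes).hexdigest())
        if key not in self._store:
            self.executions += 1
            self._store[key] = skill_fn(input_bytes)
        return self._store[key]


def run_ocr(image_bytes):
    # Stand-in for an OCR call; returns a fake text result.
    return f"text from {len(image_bytes)} bytes"


cache = SkillCache()
pdf_images = [b"page-1-image", b"page-2-image"]

# First indexer run: every image is a cache miss.
for image in pdf_images:
    cache.run("ocr", run_ocr, image)
first_run = cache.executions

# Second run after a metadata-only change: the image bytes are identical,
# so under this model OCR does not execute again.
blob_metadata = {"reviewed": "true"}  # changed, but not an OCR input
for image in pdf_images:
    cache.run("ocr", run_ocr, image)
second_run = cache.executions - first_run

print(first_run, second_run)
```

Under this model the first run executes OCR twice and the second run not at all, which is the behavior the 131-to-262 image count contradicts.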
-
SnehaAgrawal-MSFT 21,851 Reputation points
2024-11-20T04:39:51.7733333+00:00 @mathias Herbaux Reached you privately.