Hello Yuvraj,
Welcome to Microsoft Q&A, and thank you for posting your question here.
I understand that you are having an issue extracting text and the associated links using Content Understanding.
The core challenge is not Azure’s OCR capability, which already extracts visible text accurately, but the fact that Azure does not return the link annotations embedded inside a PDF. Therefore, you must combine Azure’s OCR output with a PDF library’s annotation extraction, and then match the two data sources using properly normalized coordinates. Microsoft’s official guidance confirms that Azure returns bounding-box coordinates in inches for PDF documents and that users must convert and align these coordinates manually when integrating with external systems (Azure Docs: units & bounding box rules – https://learn.microsoft.com/azure/ai-services/document-intelligence/concept-layout).
The first critical part of the solution is extracting the link annotations from the PDF with a library such as PyMuPDF or pypdf, both of which expose /Annots entries and their /URI targets. The PyMuPDF documentation clarifies that annotation rectangles and page rectangles are expressed in points (1 point = 1/72 inch) and follow a top-left page origin where y increases downward (PyMuPDF docs – https://pymupdf.readthedocs.io/en/latest/). A validation step is required: print the page width, height, and annotation rectangles for a sample page to confirm exactly how the PDF library expresses coordinates. This prevents mismatched rectangles during the later anchor-text matching.
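As a minimal sketch of that extraction and validation step (the file name is a placeholder), using PyMuPDF's page.get_links():

import fitz  # PyMuPDF

doc = fitz.open("sample.pdf")  # placeholder file name
page = doc[0]

# Validation: confirm the page size and coordinate convention (points, top-left origin)
print("page rect:", page.rect, "width:", page.rect.width, "height:", page.rect.height)

# get_links() returns one dict per link annotation; URI links carry a 'uri' target
# and a 'from' rectangle giving the clickable area in page coordinates
for link in page.get_links():
    if link.get("uri"):
        print("annotation rect:", link["from"], "->", link["uri"])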
Next, Azure’s OCR polygons must be converted into the same space as the PDF annotations. Because Azure uses inches for PDF output, you convert each polygon point by multiplying (x, y) by 72 to obtain points. However, the coordinate origins differ: the PDF specification uses a bottom-left origin, Azure’s “top-left listed point” is only an ordering hint, and PyMuPDF uses a top-left origin, so it is essential to validate the vertical axis. A simple test is to plot one known word on the page; if its y-coordinate appears inverted relative to the annotation rectangle, apply the vertical flip: normalized_y = page_height_in_points - azure_y_points. This practice aligns with Microsoft community guidance noting the frequent need to reconcile coordinate origins (Azure Q&A – https://learn.microsoft.com/answers/).
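One quick way to run that visual test is to draw the converted bbox of one known word onto the page and inspect the result; in this sketch the file name and coordinates are hypothetical:

import fitz

doc = fitz.open("sample.pdf")  # placeholder file name
page = doc[0]

# Hypothetical converted bbox (x0, y0, x1, y1) in points for one known word
word_bbox_pts = (72.0, 100.0, 130.0, 112.0)

page.draw_rect(fitz.Rect(*word_bbox_pts), color=(1, 0, 0), width=1)
doc.save("bbox_check.pdf")  # inspect: if the rectangle appears mirrored vertically, apply the flip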
Once units and origins are aligned, you can reliably match annotation rectangles to OCR text. Using either simple bounding-box intersection logic or a geometry library such as Shapely, you check whether each Azure word bbox overlaps or falls inside each annotation rectangle. You may also apply a small buffer of 2–6 points to handle minor OCR or rendering shifts. All matched words are then sorted left-to-right and top-to-bottom to form the final anchor text. The method is consistent with PDF parsing norms and recommended by both PyMuPDF and Azure community examples (PyMuPDF examples – https://pymupdf.readthedocs.io/en/latest/tutorial; Azure bounding box examples – https://learn.microsoft.com/azure/ai-services/document-intelligence/).
Below is a validated minimal code sketch for the coordinate conversion and matching phase:
# Convert Azure (inches) → points (1 inch = 72 points)
def azure_to_points(x_in, y_in):
    return x_in * 72.0, y_in * 72.0

# Flip y to match PyMuPDF's top-left origin if validation shows inversion
def flip_y(y_pts, page_height_pts):
    return page_height_pts - y_pts

# azure_polygon: list of (x, y) points in inches from the Azure layout result
# page_height_pts: page height in points, e.g. page.rect.height in PyMuPDF
pts = [azure_to_points(x, y) for (x, y) in azure_polygon]

# Apply the flip only if the visual test indicates a mismatch
pts = [(x, flip_y(y, page_height_pts)) for (x, y) in pts]

# Axis-aligned bounding box of the Azure word
wx0, wy0 = min(p[0] for p in pts), min(p[1] for p in pts)
wx1, wy1 = max(p[0] for p in pts), max(p[1] for p in pts)

# True if rect a overlaps rect b; both are (x0, y0, x1, y1) tuples,
# e.g. the word bbox and an annotation rect (ax0, ay0, ax1, ay1)
def intersects(a, b):
    return not (a[2] < b[0] or a[0] > b[2] or a[3] < b[1] or a[1] > b[3])
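If you opt for Shapely instead, a sketch of the buffered matching and the reading-order sort could look like the following; the annot_rect values and the words list of {"text", "bbox"} dicts are assumed inputs built from the converted Azure output:

from shapely.geometry import box

# Hypothetical inputs: the annotation rect and converted Azure words, in points
annot_rect = (70.0, 95.0, 200.0, 115.0)
words = [
    {"text": "Azure", "bbox": (72.0, 100.0, 110.0, 112.0)},
    {"text": "Docs",  "bbox": (114.0, 100.0, 145.0, 112.0)},
]

annot = box(*annot_rect).buffer(4)  # small buffer (2-6 pts) absorbs minor shifts
matched = [w for w in words if annot.intersects(box(*w["bbox"]))]

# Sort top-to-bottom, then left-to-right, to reconstruct the anchor text
matched.sort(key=lambda w: (round(w["bbox"][1]), w["bbox"][0]))
anchor_text = " ".join(w["text"] for w in matched)
print(anchor_text)  # -> "Azure Docs"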
To summarize: Azure alone cannot return hyperlinks, and PDF libraries alone cannot return the OCR text of the anchors. By aligning units, fixing the coordinate origin, and applying spatial matching, you can accurately reconstruct link–anchor pairs. This process follows best practices from Azure, the PDF specification, and PyMuPDF, and yields a dependable extraction pipeline suitable for production workloads.
I hope this is helpful! Do not hesitate to let me know if you have any other questions or need further clarification.
Please don't forget to close out the thread by upvoting and accepting this as the answer if it was helpful.