How to split document by page in Azure AI Search?

Question

Holgado Sánchez, José Luis 20

I'm working on a project where a SharePoint site holds a large number of PDF files. I need to index these files in Azure AI Search so I can run semantic search over them.

When I query my index, I need to get back not only the relevant content but also the page it came from. The Azure AI Search SplitSkill documentation says it can split documents by pages, but it actually splits them into chunks, and I cannot recover the page number from its output.

So I wrote a simple custom skill that takes the PDF binary and splits it by page using PyMuPDF, and deployed it as an Azure Function. That way I can define a WebApiSkill that splits the PDF into pages and returns both the page content (to be chunked further and embedded later) and the page number.

The code for the split is simple:

import fitz
import json

def split_pdf_by_pages(pdf_bytes):
    pdf_document = fitz.open(stream=pdf_bytes, filetype="pdf")
    pages = []
    
    for page_num in range(len(pdf_document)):
        page = pdf_document.load_page(page_num)
        text = page.get_text("text")
        pages.append({
            "page_number": page_num + 1,
            "content": text
        })
    
    return json.dumps(pages)

I still can't get this to work: it seems that when the indexer reads from SharePoint, the objects that reach my custom skill are not binary, so my code fails.

Does anyone know how to solve this? Is there any other way to split the PDFs in SharePoint by actual page and retrieve the page number?

Thank you everyone in advance.

Accepted answer

Answer 1

@Holgado Sánchez, José Luis To solve this, let's try two steps and see if we can get something going.

Ensure Binary Data from SharePoint

The issue seems to be that the data from SharePoint isn’t in binary format when it reaches your Custom Skill. Here are a few steps to ensure you’re getting the binary data:

Check the SharePoint Connector: Ensure that the connector you’re using to read PDFs from SharePoint is configured to fetch the file content as binary. With the SharePoint REST API, for example, you can call GetFileByServerRelativeUrl and append /$value to get the raw file content; with Microsoft Graph, you can request the /content endpoint of the drive item.

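# Example using the Microsoft Graph API.
# Replace the {site-id}, {item-id} and {access-token} placeholders with real values.
import requests

url = "https://graph.microsoft.com/v1.0/sites/{site-id}/drive/items/{item-id}/content"
headers = {
    "Authorization": "Bearer {access-token}"
}
response = requests.get(url, headers=headers)
response.raise_for_status()  # fail early instead of treating an error body as a PDF
pdf_bytes = response.content  # raw bytes of the file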
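
Before handing the payload to PyMuPDF, it can help to confirm that what arrived really is PDF binary data. The following is a minimal sketch, not part of the original answer: the helper name looks_like_pdf is hypothetical, and it relies only on the fact that PDF files carry a %PDF- marker at (or very near) the start of the file.

```python
def looks_like_pdf(data) -> bool:
    """Return True if data is a bytes-like object carrying the PDF header marker."""
    # Strings, dicts, or None are not binary payloads at all.
    if not isinstance(data, (bytes, bytearray)):
        return False
    # The marker normally sits at offset 0; scanning the first 1024 bytes
    # tolerates files that have a little junk before the header.
    return b"%PDF-" in bytes(data[:1024])
```

If this returns False inside the skill, logging type(data) and the first few bytes usually shows whether you received a JSON object, a base64 string, or an HTML error page instead of the file.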

Modify Your Custom Skill

Ensure your Custom Skill can handle the binary data correctly. Here’s a refined version of your function:

import fitz  # PyMuPDF
import json

def split_pdf_by_pages(pdf_bytes):
    """Return a JSON array with one {"page_number", "content"} object per PDF page."""
    pages = []
    # Open the PDF from memory; the context manager closes the document afterwards.
    with fitz.open(stream=pdf_bytes, filetype="pdf") as pdf_document:
        for page in pdf_document:
            pages.append({
                "page_number": page.number + 1,  # page.number is 0-based
                "content": page.get_text("text")
            })
    return json.dumps(pages)
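
A custom Web API skill does not receive the raw file directly: it receives a JSON body with a values array and must answer in the same shape. The sketch below shows one way to wrap a splitter such as split_pdf_by_pages in that contract. It rests on assumptions that are not in the original post: the skill input is named file_data, it arrives as a file reference object whose data field holds the base64-encoded file, and the output is named pages. The function name handle_skill_request is hypothetical, and the splitter is passed in as split_fn so the wrapper can be exercised without PyMuPDF.

```python
import base64
import json

def handle_skill_request(body: dict, split_fn) -> dict:
    """Apply split_fn to every record of a custom skill request body."""
    results = []
    for record in body.get("values", []):
        result = {
            "recordId": record["recordId"],  # must be echoed back unchanged
            "data": {},
            "errors": [],
            "warnings": [],
        }
        try:
            # Assumed input shape: {"file_data": {"$type": "file", "data": "<base64>"}}
            file_ref = record["data"]["file_data"]
            pdf_bytes = base64.b64decode(file_ref["data"])
            # split_fn is expected to return a JSON string, like split_pdf_by_pages.
            result["data"]["pages"] = json.loads(split_fn(pdf_bytes))
        except Exception as exc:
            # Report the failure for this record only; other records still succeed.
            result["errors"].append({"message": f"Could not split document: {exc}"})
        results.append(result)
    return {"values": results}
```

In an Azure Function, you would parse the HTTP request body as JSON, call handle_skill_request(body, split_pdf_by_pages), and return the result serialized with json.dumps.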

Alternatively, you could preprocess the files in Azure Functions before indexing, which you may find easier.

Holgado Sánchez, José Luis 20 Reputation points

2024-09-30T08:01:43.3166667+00:00

Hello, in the end what I had to do to get binaries from SharePoint was to set allow_skillset_to_read_file_data=True in the indexer parameters (I'm using the Python SDK).

Thank you very much!

Best,

Emily Du-MSFT 51,851 Reputation points Microsoft External Staff

2024-10-03T02:02:28.37+00:00

Helpful information!
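
For readers using the REST API rather than the Python SDK, the setting that resolved this thread can be sketched as a plain indexer definition. This is an illustrative sketch under stated assumptions, not the poster's actual configuration: the function name and all resource names are hypothetical, and allowSkillsetToReadFileData is taken to be the REST-side spelling of the allow_skillset_to_read_file_data keyword mentioned above. With it enabled, the skillset can read the original file at /document/file_data and pass it to the custom skill as an input.

```python
import json

def build_indexer_definition(name, data_source_name, index_name, skillset_name):
    """Build an indexer definition that lets skills read the original file data."""
    return {
        "name": name,
        "dataSourceName": data_source_name,
        "targetIndexName": index_name,
        "skillsetName": skillset_name,
        "parameters": {
            "configuration": {
                # Exposes the source file to the skillset at /document/file_data.
                "allowSkillsetToReadFileData": True
            }
        },
    }

# Input mapping for the custom WebApiSkill (the input name is an assumption).
skill_input = {"name": "file_data", "source": "/document/file_data"}

# Hypothetical resource names, for illustration only.
definition = build_indexer_definition(
    "sharepoint-pdf-indexer", "sharepoint-ds", "pdf-pages-index", "pdf-split-skillset"
)
print(json.dumps(definition, indent=2))
```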
