How to split document by page in Azure AI Search?

Holgado Sánchez, José Luis 20 Reputation points
2024-09-26T15:38:09.8966667+00:00

I'm currently working on a project where a SharePoint site holds a number of PDF files. I need to index these files in Azure AI Search to perform semantic search.

When I query my index, I need to retrieve not only the relevant information but also the page on which it appears. The Azure AI Search SplitSkill documentation says it can split documents into pages, but it actually splits them into chunks, so I cannot recover the page number from its output.

To work around this, I wrote a simple custom skill that takes the PDF binary and splits it into pages using PyMuPDF. I deployed it as an Azure Function so that I can call it from a WebApiSkill, which returns both the page content (to be chunked further and embedded later) and the page number.

The code for the split is simple:

import fitz
import json

def split_pdf_by_pages(pdf_bytes):
    pdf_document = fitz.open(stream=pdf_bytes, filetype="pdf")
    pages = []
    
    for page_num in range(len(pdf_document)):
        page = pdf_document.load_page(page_num)
        text = page.get_text("text")
        pages.append({
            "page_number": page_num + 1,
            "content": text
        })
    
    return json.dumps(pages)

I still can't get this to work: when the files are read from SharePoint, the objects that reach my custom skill do not appear to be binary, so my code fails.

Does anyone know how to solve this? Is there another way to split the PDFs in SharePoint by actual page and retrieve the page number?

Thank you everyone in advance.

Azure AI Search
SharePoint

Accepted answer
  1. brtrach-MSFT 16,271 Reputation points Microsoft Employee
    2024-09-26T23:31:17.2333333+00:00

    @Holgado Sánchez, José Luis Let's try two steps to see if we can resolve this.

    Ensure Binary Data from SharePoint

    The issue seems to be that the data from SharePoint isn’t in binary format when it reaches your Custom Skill. Here is how to check that you’re getting the binary data:

    Check the SharePoint connector: Ensure that the connector you’re using to read PDFs from SharePoint is configured to fetch the file content as binary. With the SharePoint REST API, that means calling GetFileByServerRelativeUrl with the $value endpoint; with Microsoft Graph, it means requesting the /content endpoint of the drive item, as in this example:

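    # Example using the Microsoft Graph API
    import requests

    # {site-id}, {item-id} and {access-token} are placeholders to fill in
    url = "https://graph.microsoft.com/v1.0/sites/{site-id}/drive/items/{item-id}/content"
    headers = {
        "Authorization": "Bearer {access-token}"
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # fail early on an error response (for example an expired token)
    pdf_bytes = response.content  # raw bytes of the PDF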
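
One way to make the skill tolerant of how the file arrives is to normalize the input before handing it to PyMuPDF. The sketch below is only an illustration: it assumes the payload is either raw bytes or a base64-encoded string, and `to_pdf_bytes` is a hypothetical helper name, not part of any SDK. It also checks for the `%PDF-` signature so that a non-PDF payload fails with a clear message rather than deep inside the parser.

```python
import base64
import binascii


def to_pdf_bytes(payload):
    """Return raw PDF bytes from bytes or a base64 string (hypothetical helper)."""
    if isinstance(payload, (bytes, bytearray)):
        data = bytes(payload)
    elif isinstance(payload, str):
        try:
            # validate=True rejects characters outside the base64 alphabet
            data = base64.b64decode(payload, validate=True)
        except (binascii.Error, ValueError) as exc:
            raise ValueError("payload is a string but not valid base64") from exc
    else:
        raise TypeError(f"unsupported payload type: {type(payload).__name__}")

    # PDF files carry a "%PDF-" signature at the start; some readers tolerate
    # a few leading bytes, so look within the first 1024 bytes.
    if b"%PDF-" not in data[:1024]:
        raise ValueError("payload does not look like a PDF")
    return data
```

If this raises for every document, the skill is probably receiving something other than the file content (for example metadata or a URL), which points back at the indexer or skillset configuration rather than at the splitting code.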
    
    
    

    Modify Your Custom Skill

    Ensure your Custom Skill can handle the binary data correctly. Here’s a refined version of your function:

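    import fitz  # PyMuPDF
    import json

    def split_pdf_by_pages(pdf_bytes):
        """Return a JSON array with the 1-based page number and text of each page."""
        pages = []
        # Open the PDF from memory; the context manager closes it afterwards.
        with fitz.open(stream=pdf_bytes, filetype="pdf") as pdf_document:
            for page_num in range(len(pdf_document)):
                page = pdf_document.load_page(page_num)
                text = page.get_text("text")
                pages.append({
                    "page_number": page_num + 1,
                    "content": text
                })
        return json.dumps(pages)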
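
Azure AI Search calls a custom Web API skill with a JSON envelope (a `values` array whose records each carry a `recordId` and a `data` object) and expects the same shape back. A minimal sketch of wrapping a page splitter in that envelope follows. It rests on several assumptions: that the skillset maps `/document/file_data` to an input named `file_data`, that the indexer is configured with `allowSkillsetToReadFileData` so the file is passed along, and that the file arrives as a base64 string under `data`. The names `handle_skill_request`, `split_pages` and the `pages` output are illustrative, and `split_pages` is assumed to return a list of dicts rather than a JSON string.

```python
import base64


def handle_skill_request(body, split_pages):
    """Build a custom-skill response from a request envelope.

    body: parsed JSON request, e.g. {"values": [{"recordId": ..., "data": {...}}]}
    split_pages: callable taking PDF bytes and returning a list of page dicts
    """
    results = []
    for record in body.get("values", []):
        result = {
            "recordId": record.get("recordId"),
            "data": {},
            "errors": [],
            "warnings": [],
        }
        try:
            # Assumed input shape: {"file_data": {"$type": "file", "data": "<base64>"}}
            encoded = record["data"]["file_data"]["data"]
            pdf_bytes = base64.b64decode(encoded)
            result["data"]["pages"] = split_pages(pdf_bytes)
        except Exception as exc:
            # Report the failure on this record instead of failing the whole batch.
            result["errors"].append({"message": f"Could not split PDF: {exc}"})
        results.append(result)
    return {"values": results}
```

In an Azure Function, `body` would be the parsed request JSON, and `split_pages` would be a variant of `split_pdf_by_pages` that returns the list directly instead of calling `json.dumps`.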
    
    

    Alternatively, consider using Azure Functions to preprocess the files; you may find this an easier solution.

