@Holgado Sánchez, José Luis To solve this, let's try two steps and see whether they get things working.
Ensure Binary Data from SharePoint
The issue seems to be that the data from SharePoint isn’t in binary format when it reaches your Custom Skill. Here are a few steps to ensure you’re getting the binary data:
Check the SharePoint Connector: Ensure that the connector you’re using to read PDFs from SharePoint is configured to fetch the file content as binary. With the SharePoint REST API, you might need to call the getFileByServerRelativeUrl method with the $value endpoint to get the binary data; with Microsoft Graph, the equivalent is the /content endpoint used in the example below.
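# Example using the Microsoft Graph API
import requests

# {site-id}, {item-id} and {access-token} are placeholders to fill in
url = "https://graph.microsoft.com/v1.0/sites/{site-id}/drive/items/{item-id}/content"
headers = {
    "Authorization": "Bearer {access-token}"
}
response = requests.get(url, headers=headers)
response.raise_for_status()  # fail early on 401/404 instead of treating an error page as a PDF
pdf_bytes = response.content  # raw bytes; do not use response.text here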
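Before passing the data on, it may help to confirm that what you have really is PDF binary. The sketch below is only an illustration: to_pdf_bytes is a hypothetical helper name, and it assumes the payload arrives either as raw bytes or as a base64-encoded string, which may not match your pipeline.

```python
import base64
import binascii


def to_pdf_bytes(value):
    """Return raw PDF bytes from bytes or from a base64-encoded string.

    Hypothetical helper: raises ValueError if the result does not
    contain the PDF signature near the start of the data.
    """
    if isinstance(value, (bytes, bytearray)):
        data = bytes(value)
    elif isinstance(value, str):
        try:
            # validate=True rejects characters outside the base64 alphabet
            data = base64.b64decode(value, validate=True)
        except (binascii.Error, ValueError) as exc:
            raise ValueError("value is not valid base64") from exc
    else:
        raise TypeError("expected bytes or a base64 string")

    # A PDF starts with "%PDF-"; allow some leading junk, as many readers do
    if b"%PDF-" not in data[:1024]:
        raise ValueError("data does not look like a PDF")
    return data
```

If this raises ValueError on the content coming from SharePoint, the problem is upstream of your Custom Skill (for example an HTML error page or a text-decoded body), not in the page-splitting code.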
Modify Your Custom Skill
Ensure your Custom Skill can handle the binary data correctly. Here’s a refined version of your function:
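import json

import fitz  # PyMuPDF


def split_pdf_by_pages(pdf_bytes):
    """Return a JSON array with the text of each page of the PDF."""
    pages = []
    # Open from memory; pdf_bytes must be raw bytes, not a base64 or text string
    with fitz.open(stream=pdf_bytes, filetype="pdf") as pdf_document:
        for page_num in range(len(pdf_document)):
            page = pdf_document.load_page(page_num)
            text = page.get_text("text")
            pages.append({
                "page_number": page_num + 1,  # 1-based page numbers
                "content": text
            })
    return json.dumps(pages)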
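If your Custom Skill is an Azure AI Search custom Web API skill, the function above still has to be wrapped in the values/recordId envelope that such skills exchange. Here is a rough sketch of that wrapper. The names build_skill_response, file_data and pages are assumptions for illustration, as is the idea that the PDF arrives base64-encoded under data["file_data"]["data"]; adjust them to the inputs and outputs declared in your skillset.

```python
import base64
import json


def build_skill_response(request_body, split_fn):
    """Apply split_fn to each record of a custom skill request.

    Assumptions (not from the original post): the PDF is a base64 string
    at record["data"]["file_data"]["data"], the output field is "pages",
    and split_fn returns a JSON string like split_pdf_by_pages does.
    """
    results = []
    for record in request_body.get("values", []):
        entry = {
            "recordId": record["recordId"],
            "data": {},
            "errors": [],
            "warnings": [],
        }
        try:
            encoded = record["data"]["file_data"]["data"]
            pdf_bytes = base64.b64decode(encoded)
            entry["data"]["pages"] = json.loads(split_fn(pdf_bytes))
        except Exception as exc:
            # Report the failure per record so other records still succeed
            entry["errors"].append({"message": f"Could not split PDF: {exc}"})
        results.append(entry)
    return {"values": results}
```

You would call it as build_skill_response(request_json, split_pdf_by_pages). Taking the splitter as a parameter keeps the envelope handling separate from PyMuPDF, so you can test it with a stub.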
Alternatively, consider using Azure Functions to preprocess the files; you may find that an easier solution.