Extracting Information from PDFs and Storing in a Database Using Azure AI Services

March 15, 2025 Dellenny

Handling documents efficiently is a critical requirement for many businesses. Extracting structured data from PDF files and storing it in a database can streamline operations in finance, legal, healthcare, and other industries. Azure AI Services (formerly Azure Cognitive Services) provides tools for automating this process, chiefly Azure AI Document Intelligence (formerly Form Recognizer).

In this blog, we’ll walk through how to:

Read a PDF document
Extract relevant data
Store the extracted information in a database

Solution Overview

The solution involves the following components:

Azure AI Document Intelligence – Extracts text, tables, and key-value pairs from PDFs.
Azure Storage (Optional) – Stores uploaded PDF documents for processing.
Azure Function or Web App – Handles document processing and extraction.
Azure SQL Database or Cosmos DB – Stores the extracted data.

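Architecture Diagram

PDF upload → Azure Blob Storage → Azure Function or Web App → Azure AI Document Intelligence → Azure SQL Database or Cosmos DB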

Step-by-Step Implementation

Step 1: Upload PDF to Azure Storage

Before processing, store the PDF in Azure Blob Storage for centralized access.

Create an Azure Storage Account.
Use Blob Storage to store PDFs.
Generate a SAS (Shared Access Signature) URL for secure access.
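A SAS URL only works while its token is valid, so it can help to check the token before handing the URL to the extraction step. The sketch below is one way to do that with the standard library alone. It assumes the token carries its expiry in the `se` query parameter as a UTC timestamp with seconds (for example `2025-03-15T12:00:00Z`) and its permissions in `sp`; the helper names and the sample URL are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def sas_expiry(sas_url: str) -> Optional[datetime]:
    """Return the expiry from the SAS token's `se` parameter, or None if absent."""
    params = parse_qs(urlsplit(sas_url).query)
    if "se" not in params:
        return None
    # Assumption: `se` uses the full form YYYY-MM-DDThh:mm:ssZ (UTC).
    expiry = datetime.strptime(params["se"][0], "%Y-%m-%dT%H:%M:%SZ")
    return expiry.replace(tzinfo=timezone.utc)


def sas_allows_read_until(sas_url: str, now: datetime, margin: timedelta) -> bool:
    """True if the token grants read ('r') and stays valid for at least `margin`."""
    params = parse_qs(urlsplit(sas_url).query)
    permissions = params.get("sp", [""])[0]
    expiry = sas_expiry(sas_url)
    return "r" in permissions and expiry is not None and expiry - now >= margin


# Hypothetical SAS URL, for illustration only.
example_url = (
    "https://acct.blob.core.windows.net/pdfs/invoice.pdf"
    "?sv=2022-11-02&sp=r&sr=b&se=2025-03-15T12%3A00%3A00Z&sig=abc"
)
```

A read-only token with a short lifetime is usually enough here, since the extraction service only needs to download the file once.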

Step 2: Extract Data Using Azure AI Document Intelligence

Azure AI Document Intelligence can extract structured data from PDFs, including invoices, receipts, business documents, and more.

Setting Up AI Document Intelligence

Go to Azure Portal and create a Document Intelligence resource.
Get the API Key and Endpoint URL.
Use prebuilt models or train a custom model for specific document formats.

Code to Extract Data from PDF

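import json
import time

import requests

endpoint = "<your_document_intelligence_endpoint>"
api_key = "<your_api_key>"
file_url = "<pdf_file_sas_url>"

headers = {
    "Ocp-Apim-Subscription-Key": api_key,
    "Content-Type": "application/json"
}

data = {"urlSource": file_url}

# The analyze call is asynchronous: the POST returns 202 and an Operation-Location header to poll.
analyze_url = f"{endpoint.rstrip('/')}/formrecognizer/documentModels/prebuilt-document:analyze?api-version=2023-07-31"
response = requests.post(analyze_url, headers=headers, json=data)
response.raise_for_status()
operation_url = response.headers["Operation-Location"]

# Poll until the operation reports a final status.
while True:
    result = requests.get(operation_url, headers=headers).json()
    if result["status"] in ("succeeded", "failed"):
        break
    time.sleep(2)

print(json.dumps(result, indent=4))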

Step 3: Process Extracted Data

The analyzeResult object in the API response contains:

Extracted Text
Tables
Key-Value Pairs
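Tables and key-value pairs come back as nested JSON rather than as ready-to-store rows. As a rough sketch, the helpers below reshape them into plain Python structures. They assume each table exposes `rowCount`, `columnCount`, and a `cells` list with `rowIndex`, `columnIndex`, and `content`, and that each key-value pair has a `key` and an optional `value`, both with `content`. The function names and the sample payload are hand-written for illustration, not real service output.

```python
from typing import Dict, List


def table_to_grid(table: dict) -> List[List[str]]:
    """Rebuild a table as a list of rows from its flat list of cells."""
    grid = [["" for _ in range(table["columnCount"])] for _ in range(table["rowCount"])]
    for cell in table["cells"]:
        grid[cell["rowIndex"]][cell["columnIndex"]] = cell.get("content", "")
    return grid


def key_value_pairs_to_dict(pairs: list) -> Dict[str, str]:
    """Map each key's text to its value's text; keys without a value map to ''."""
    out: Dict[str, str] = {}
    for pair in pairs:
        value = pair.get("value") or {}
        out[pair["key"]["content"]] = value.get("content", "")
    return out


# Hand-written sample in the assumed shape.
sample = {
    "tables": [{
        "rowCount": 2,
        "columnCount": 2,
        "cells": [
            {"rowIndex": 0, "columnIndex": 0, "content": "Item"},
            {"rowIndex": 0, "columnIndex": 1, "content": "Price"},
            {"rowIndex": 1, "columnIndex": 0, "content": "Widget"},
            {"rowIndex": 1, "columnIndex": 1, "content": "9.99"},
        ],
    }],
    "keyValuePairs": [
        {"key": {"content": "Invoice No:"}, "value": {"content": "INV-12345"}},
        {"key": {"content": "PO Number:"}},
    ],
}

grid = table_to_grid(sample["tables"][0])
pairs = key_value_pairs_to_dict(sample["keyValuePairs"])
```

Merged cells (row or column spans) are ignored in this sketch; only the top-left position of each cell is filled.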

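Simplified example of extracted fields, as a custom model might return them (the live response nests this under analyzeResult, and the general prebuilt-document model returns key-value pairs rather than named fields):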

{
    "documents": [
        {
            "fields": {
                "InvoiceNumber": {"valueString": "INV-12345"},
                "TotalAmount": {"valueNumber": 1250.75},
                "Date": {"valueDate": "2024-03-15"}
            }
        }
    ]
}
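Before storing anything, the typed field wrappers (`valueString`, `valueNumber`, `valueDate`, and so on) usually need to be unwrapped into plain values. A minimal sketch of that step, assuming the response has the shape shown above; `field_value` and `flatten_fields` are hypothetical helper names:

```python
from typing import Any, Dict


def field_value(field: dict) -> Any:
    """Return the typed value of a field, falling back to its raw text content."""
    for key, value in field.items():
        if key.startswith("value"):
            return value
    return field.get("content")


def flatten_fields(result: dict, index: int = 0) -> Dict[str, Any]:
    """Turn the fields of one extracted document into a flat name -> value dict."""
    # The live service nests the payload under "analyzeResult";
    # the simplified example above does not, so both shapes are accepted.
    body = result.get("analyzeResult", result)
    fields = body["documents"][index]["fields"]
    return {name: field_value(field) for name, field in fields.items()}


example = {
    "documents": [
        {
            "fields": {
                "InvoiceNumber": {"valueString": "INV-12345"},
                "TotalAmount": {"valueNumber": 1250.75},
                "Date": {"valueDate": "2024-03-15"},
            }
        }
    ]
}

flat = flatten_fields(example)
```

Each field in a real response also carries a confidence score, which this sketch ignores; a production pipeline would probably want to route low-confidence values to manual review.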

Step 4: Store Data in Azure SQL Database

Once data is extracted, store it in Azure SQL Database or Cosmos DB for further processing.

Example: Storing Data in SQL Database

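import pyodbc

server = "your-sql-server.database.windows.net"
database = "your-db"
username = "your-username"
password = "your-password"  # placeholder; load real credentials from configuration, not source code

driver = "{ODBC Driver 17 for SQL Server}"
conn = pyodbc.connect(f"DRIVER={driver};SERVER={server};DATABASE={database};UID={username};PWD={password};Encrypt=yes")

cursor = conn.cursor()
# Assumes an Invoices table already exists. The ? markers pass values as
# parameters instead of building SQL from extracted text.
# [Date] is bracketed because DATE is also a T-SQL type name.
cursor.execute("INSERT INTO Invoices (InvoiceNumber, TotalAmount, [Date]) VALUES (?, ?, ?)",
               ("INV-12345", 1250.75, "2024-03-15"))
conn.commit()
conn.close()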
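The insert above presupposes an Invoices table and uses hard-coded values. The sketch below shows one possible table definition and a small function that inserts a flat dictionary of extracted fields. The column types and the `store_invoice` name are assumptions, not part of the original solution. To keep the example runnable without a server, it uses an in-memory SQLite database as a local stand-in for Azure SQL; the function only relies on `cursor()`, `execute()` with `?` markers, and `commit()`, which a pyodbc connection also provides.

```python
import sqlite3

# Assumed schema; adjust types and keys to the documents being processed.
CREATE_TABLE_SQL = """
CREATE TABLE Invoices (
    InvoiceNumber NVARCHAR(50) PRIMARY KEY,
    TotalAmount   DECIMAL(18, 2),
    [Date]        DATE
)
"""

INSERT_SQL = "INSERT INTO Invoices (InvoiceNumber, TotalAmount, [Date]) VALUES (?, ?, ?)"


def store_invoice(conn, fields: dict) -> None:
    """Insert one invoice using parameter markers, then commit."""
    cursor = conn.cursor()
    cursor.execute(
        INSERT_SQL,
        (fields["InvoiceNumber"], fields["TotalAmount"], fields["Date"]),
    )
    conn.commit()


# Local stand-in for Azure SQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(CREATE_TABLE_SQL)
store_invoice(conn, {"InvoiceNumber": "INV-12345", "TotalAmount": 1250.75, "Date": "2024-03-15"})
rows = conn.execute("SELECT InvoiceNumber, TotalAmount, [Date] FROM Invoices").fetchall()
```

Making the invoice number the primary key means a document processed twice fails on the second insert instead of silently creating a duplicate row.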

Step 5: Automate with Azure Functions

To automate PDF processing:

Trigger Azure Function when a new file is uploaded.
Call Document Intelligence API to extract data.
Store extracted data in a database.

The Azure Function can be triggered through Event Grid, which raises a BlobCreated event when a PDF is uploaded to Blob Storage.
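The three steps can be sketched as plain Python, independent of the Azure Functions bindings. The code below assumes the Event Grid event is available as a dictionary with an `eventType` of `Microsoft.Storage.BlobCreated` and the blob address in `data.url`. The `extract` and `store` callables are hypothetical placeholders for the Document Intelligence call and the database insert, and the trigger wiring itself is left out.

```python
from typing import Any, Callable, Optional
from urllib.parse import urlsplit


def pdf_url_from_event(event: dict) -> Optional[str]:
    """Return the blob URL if the event is a newly created PDF, else None."""
    if event.get("eventType") != "Microsoft.Storage.BlobCreated":
        return None
    url = event.get("data", {}).get("url", "")
    if not urlsplit(url).path.lower().endswith(".pdf"):
        return None
    return url


def handle_event(
    event: dict,
    extract: Callable[[str], Any],
    store: Callable[[Any], None],
) -> bool:
    """Run extract-then-store for PDF uploads; return True if the event was processed."""
    url = pdf_url_from_event(event)
    if url is None:
        return False
    # Note: the URL in the event carries no SAS token, so `extract` has to
    # add one or use another way to authorize the download.
    store(extract(url))
    return True


# Illustrative event and stand-in callables.
event = {
    "eventType": "Microsoft.Storage.BlobCreated",
    "data": {"url": "https://acct.blob.core.windows.net/pdfs/invoice.PDF"},
}
stored = []
processed = handle_event(event, lambda url: {"source": url}, stored.append)
```

Passing the extraction and storage steps in as callables keeps the routing logic testable without any Azure resources.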

Key Takeaways:

✅ Use Azure AI Document Intelligence to extract structured data from PDFs.
✅ Store files securely in Azure Blob Storage.
✅ Automate processing using Azure Functions.
✅ Store extracted information in Azure SQL Database or Cosmos DB.

With this approach, businesses can automate data extraction, reduce manual work, improve accuracy, and enhance document management workflows.
