How to Extract Data from PDFs Using an API
Extracting structured data from PDF documents is one of the most common challenges in software development. Whether you're processing invoices, parsing resumes, or digitizing contracts, the goal is always the same: turn an unstructured document into clean, typed data your application can use.
In this guide, we'll show you how to extract data from any PDF using the ScoutExtract API — with working examples in Python and Node.js.
The Problem with Traditional PDF Parsing
Most PDF extraction approaches fall into one of these categories:
- OCR + Regex: Use Tesseract or similar to extract text, then write regex patterns to find specific fields. Breaks whenever the document layout changes.
- Template-based: Define exact coordinates where data appears. Only works for one specific document format.
- ML-based: Train a custom model on labeled examples. Requires hundreds of labeled documents and ongoing maintenance.
All of these approaches are fragile and expensive to maintain, and they require significant engineering effort.
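To make the fragility of the OCR + Regex approach concrete, here is a minimal sketch. The two text samples are made-up OCR output, and `TOTAL_PATTERN` and `find_total` are hypothetical names used only for this illustration. The pattern is written for one vendor's label and finds nothing when another vendor words the same field differently.

```python
import re

# A pattern written for one vendor's layout, where the label is "Total Due".
TOTAL_PATTERN = re.compile(r"Total Due:\s*\$?([\d,]+\.\d{2})")


def find_total(text):
    """Return the total as a float, or None if the pattern does not match."""
    match = TOTAL_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


# Made-up OCR output from two invoices that carry the same amount.
vendor_a = "Invoice INV-001\nTotal Due: $3,916.23"
vendor_b = "Invoice 7731\nAmount Payable: 3,916.23 USD"

total_a = find_total(vendor_a)  # the label matches, so a number comes back
total_b = find_total(vendor_b)  # different label, so the regex finds nothing
print(total_a, total_b)
```

Supporting the second vendor means adding another pattern, and so does every new layout after that.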
A Better Approach: Schema-Driven Extraction
ScoutExtract uses AI to understand documents the way humans do — by reading and comprehending the content. Instead of defining where data is located, you define what data you want using a JSON schema.
The API handles different layouts, handwriting, multi-page documents, tables, and varying terminology (e.g., "Total Due" vs "Amount Payable" vs "Balance").
Quick Start: Extract Invoice Data from a PDF
Step 1: Get your API key
Sign up at extract.ramlabs.dev/dashboard — free, no credit card required.
Step 2: Extract data (Python)
import base64
import requests

# Read the PDF and base64-encode it for the JSON request body
with open("invoice.pdf", "rb") as f:
    pdf_base64 = base64.b64encode(f.read()).decode()

# Send to ScoutExtract
response = requests.post(
    "https://api.ramlabs.dev/v1/extract",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "document": pdf_base64,
        "documentType": "pdf",
        "schema": "invoice"
    }
)
response.raise_for_status()  # Raise an exception on an HTTP error status

# Each extracted field carries a value and a confidence score
result = response.json()
data = result["data"]
print(f"Invoice #: {data['invoice_number']['value']}")
print(f"Total: ${data['total']['value']}")
print(f"Confidence: {data['total']['confidence']}")
Step 3: Extract data (Node.js)
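import { readFileSync } from "node:fs";

// Read the PDF and base64-encode it for the JSON request body
const pdf = readFileSync("invoice.pdf");
const pdfBase64 = pdf.toString("base64");

// Top-level await requires an ES module (a .mjs file or "type": "module")
const response = await fetch("https://api.ramlabs.dev/v1/extract", {
  method: "POST",
  headers: {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    document: pdfBase64,
    documentType: "pdf",
    schema: "invoice"
  })
});

// fetch does not reject on HTTP error statuses, so check explicitly
if (!response.ok) {
  throw new Error(`Request failed with status ${response.status}`);
}

const { data } = await response.json();
console.log(`Invoice #: ${data.invoice_number.value}`);
console.log(`Total: $${data.total.value}`);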
Understanding Confidence Scores
Every field in the response includes a confidence score between 0.0 and 1.0:
{
  "invoice_number": { "value": "INV-2024-0892", "confidence": 0.99 },
  "vendor": { "value": "Acme Corp", "confidence": 0.95 },
  "total": { "value": 3916.23, "confidence": 0.99 }
}
Use confidence scores to build smart automation:
- Above 0.9: Auto-process
- 0.7 to 0.9: Process, but flag for review
- Below 0.7: Route to a human for manual verification
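The thresholds above can be sketched as a small routing helper. This is a minimal sketch, not part of the ScoutExtract API: `route_field` and `route_document` are hypothetical names, and treating exactly 0.9 as "review" and exactly 0.7 as "review" is an assumption about how to handle the boundaries. The document-level rule (route by the least confident field) is also an assumption you may want to change.

```python
def route_field(confidence):
    """Map one confidence score to a handling decision."""
    if confidence > 0.9:
        return "auto"
    if confidence >= 0.7:
        return "review"
    return "manual"


def route_document(data):
    """Route a whole document by its least confident field.

    `data` is the per-field mapping from the response, where every
    entry has a "value" and a "confidence" key.
    """
    routes = {name: route_field(field["confidence"]) for name, field in data.items()}
    if not routes:
        # Nothing was extracted, so a person should look at the document.
        return "manual", routes
    lowest = min(field["confidence"] for field in data.values())
    return route_field(lowest), routes


# The sample response from this section.
invoice = {
    "invoice_number": {"value": "INV-2024-0892", "confidence": 0.99},
    "vendor": {"value": "Acme Corp", "confidence": 0.95},
    "total": {"value": 3916.23, "confidence": 0.99},
}

overall, per_field = route_document(invoice)
print(overall, per_field)
```

Routing by the weakest field is the conservative choice: a single uncertain value is enough to send the whole document to review.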
Custom Schemas for Any Document
The pre-built invoice schema works for most invoices, but you can define your own schema for any document type:
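import base64
import requests

# Describe each field you want: a type plus a plain-language description
custom_schema = {
    "po_number": {"type": "string", "description": "Purchase order number"},
    "department": {"type": "string", "description": "Requesting department"},
    "approved_by": {"type": "string", "description": "Approver's name"},
    "budget_code": {"type": "string", "description": "Budget or cost center code"}
}

# Read and base64-encode the document, as in the Quick Start
with open("invoice.pdf", "rb") as f:
    pdf_base64 = base64.b64encode(f.read()).decode()

# requests sets the JSON Content-Type header when the json= argument is used
response = requests.post(
    "https://api.ramlabs.dev/v1/extract",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "document": pdf_base64,
        "documentType": "pdf",
        "schema": custom_schema
    }
)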
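Before sending a custom schema, it can help to catch malformed field definitions locally. The sketch below only checks the shape used in the example above, where each field is a dict with a `type` and a `description`. Whether the API actually requires both keys, and which `type` values it accepts, is not stated here, so treat `validate_schema` as a hypothetical helper rather than the service's own validation rules.

```python
def validate_schema(schema):
    """Return a list of problems; an empty list means the shape looks fine."""
    if not isinstance(schema, dict) or not schema:
        return ["schema must be a non-empty dict of field definitions"]

    problems = []
    for name, spec in schema.items():
        if not isinstance(spec, dict):
            problems.append(f"{name}: definition must be a dict")
            continue
        # Assumption: both keys are expected, mirroring the example schema.
        for key in ("type", "description"):
            value = spec.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"{name}: missing or empty '{key}'")
    return problems


good = {
    "po_number": {"type": "string", "description": "Purchase order number"},
    "department": {"type": "string", "description": "Requesting department"},
}
bad = {"po_number": {"type": "string"}, "department": "string"}

print(validate_schema(good))
print(validate_schema(bad))
```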
Supported Document Types
| Format | How to Send |
|---|---|
| PDF | Base64-encoded, documentType: "pdf" |
| PNG/JPG/WEBP | Base64-encoded, documentType: "image" |
| Plain text | Send as-is in the document field |
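The table above can be turned into a small helper that picks the right encoding from a file extension. This is a minimal sketch: `build_payload` and `IMAGE_EXTENSIONS` are hypothetical names, treating `.jpeg` as JPG and `.txt` as plain text are assumptions, and so is leaving `documentType` out for plain text, since the table does not say which value (if any) that case expects.

```python
import base64
from pathlib import Path

# Assumption: .jpeg is handled the same way as .jpg.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}


def build_payload(filename, content, schema="invoice"):
    """Build the JSON body for an extract request from raw file bytes."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        document_type = "pdf"
    elif suffix in IMAGE_EXTENSIONS:
        document_type = "image"
    elif suffix == ".txt":
        # Plain text is sent as-is. Assumption: no documentType is needed.
        return {"document": content.decode("utf-8"), "schema": schema}
    else:
        raise ValueError(f"Unsupported file type: {suffix or filename}")

    return {
        "document": base64.b64encode(content).decode("ascii"),
        "documentType": document_type,
        "schema": schema,
    }


pdf_payload = build_payload("invoice.pdf", b"%PDF-1.4 example bytes")
text_payload = build_payload("notes.txt", "Total Due: 42.00".encode("utf-8"))
print(pdf_payload["documentType"], text_payload["document"])
```

The returned dict can be passed as the `json=` argument of the `requests.post` call shown in the Quick Start.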