How to Extract Data from PDFs Using an API

By RamLabs Team · April 2026 · 6 min read

Extracting structured data from PDF documents is one of the most common challenges in software development. Whether you're processing invoices, parsing resumes, or digitizing contracts, the goal is always the same: turn an unstructured document into clean, typed data your application can use.

In this guide, we'll show you how to extract data from any PDF using the ScoutExtract API — with working examples in Python and Node.js.

The Problem with Traditional PDF Parsing

Most PDF extraction approaches fall into one of these categories:

OCR + Regex: Use Tesseract or similar to extract text, then write regex patterns to find specific fields. Breaks whenever the document layout changes.
Template-based: Define exact coordinates where data appears. Only works for one specific document format.
ML-based: Train a custom model on labeled examples. Requires hundreds of labeled documents and ongoing maintenance.

All of these approaches are fragile and expensive to maintain, and they require significant engineering effort.
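To make the fragility concrete, here is a minimal sketch of the OCR + regex approach, assuming the OCR step has already produced plain text. The pattern, the helper name `find_total`, and both sample strings are invented for illustration; they are not taken from any real invoice.

```python
import re
from typing import Optional

# A pattern written against one vendor's layout, where the amount
# is labeled "Total Due". (Hypothetical example pattern.)
TOTAL_PATTERN = re.compile(r"Total Due:\s*\$?([\d,]+\.\d{2})")


def find_total(ocr_text: str) -> Optional[str]:
    """Return the captured amount as a string, or None if the label is absent."""
    match = TOTAL_PATTERN.search(ocr_text)
    return match.group(1) if match else None


# Two made-up OCR outputs carrying the same amount under different labels.
vendor_a = "Invoice INV-001\nTotal Due: $3,916.23"
vendor_b = "Invoice INV-002\nAmount Payable: $3,916.23"

print(find_total(vendor_a))  # 3,916.23
print(find_total(vendor_b))  # None
```

The second invoice carries the same amount under a different label, so the pattern finds nothing and a new rule has to be written for that vendor.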

A Better Approach: Schema-Driven Extraction

ScoutExtract uses AI to understand documents the way humans do — by reading and comprehending the content. Instead of defining where data is located, you define what data you want using a JSON schema.

The API handles different layouts, handwriting, multi-page documents, tables, and varying terminology (e.g., "Total Due" vs "Amount Payable" vs "Balance").

Quick Start: Extract Invoice Data from a PDF

Step 1: Get your API key

Step 2: Extract data (Python)

import base64
import requests

# Read your PDF file
with open("invoice.pdf", "rb") as f:
    pdf_base64 = base64.b64encode(f.read()).decode()

# Send to ScoutExtract
response = requests.post(
    "https://api.ramlabs.dev/v1/extract",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "document": pdf_base64,
        "documentType": "pdf",
        "schema": "invoice"
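    },
    timeout=30,  # avoid hanging indefinitely on a stalled connection
)
response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body

result = response.json()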

data = result["data"]
print(f"Invoice #: {data['invoice_number']['value']}")
print(f"Total: ${data['total']['value']}")
print(f"Confidence: {data['total']['confidence']}")

Step 2 (alternative): Extract data (Node.js)

import { readFileSync } from "node:fs";

const pdf = readFileSync("invoice.pdf");
const pdfBase64 = pdf.toString("base64");

const response = await fetch("https://api.ramlabs.dev/v1/extract", {
  method: "POST",
  headers: {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    document: pdfBase64,
    documentType: "pdf",
    schema: "invoice"
  })
});

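// Fail early on a 4xx/5xx response instead of destructuring an error body
if (!response.ok) {
  throw new Error(`Extraction failed with HTTP ${response.status}`);
}

const { data } = await response.json();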
console.log(`Invoice #: ${data.invoice_number.value}`);
console.log(`Total: $${data.total.value}`);

Understanding Confidence Scores

Every field in the response includes a confidence score between 0.0 and 1.0:

{
  "invoice_number": { "value": "INV-2024-0892", "confidence": 0.99 },
  "vendor": { "value": "Acme Corp", "confidence": 0.95 },
  "total": { "value": 3916.23, "confidence": 0.99 }
}

Use confidence scores to decide how much human review each field needs:

Above 0.9: Auto-process
0.7 to 0.9: Process, but flag for review
Below 0.7: Route to a human for manual verification
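The tiers above could be wired into a pipeline roughly as follows. This is a sketch rather than part of the ScoutExtract API: the function names and the rule that a document is routed by its least confident field are assumptions made for illustration. The input is the per-field structure shown in the sample response.

```python
from typing import Any, Dict


def route_field(confidence: float) -> str:
    """Map one confidence score to a handling tier."""
    if confidence > 0.9:
        return "auto"      # auto-process
    if confidence >= 0.7:
        return "review"    # process, but flag for review
    return "manual"        # route to a human


def route_document(data: Dict[str, Dict[str, Any]]) -> str:
    """Route a whole document by its least confident field (an assumed policy)."""
    if not data:
        return "manual"  # nothing extracted: let a human look at it
    lowest = min(field["confidence"] for field in data.values())
    return route_field(lowest)


# The sample response from this section.
sample = {
    "invoice_number": {"value": "INV-2024-0892", "confidence": 0.99},
    "vendor": {"value": "Acme Corp", "confidence": 0.95},
    "total": {"value": 3916.23, "confidence": 0.99},
}

print(route_document(sample))  # auto
```

Routing on the weakest field is the conservative choice. A pipeline that only cares about a few fields, such as the total, might check those fields alone.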

Custom Schemas for Any Document

The pre-built invoice schema works for most invoices, but you can define your own schema for any document type:

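# Each key names a field to extract, with its type and a short description
custom_schema = {
    "po_number": {"type": "string", "description": "Purchase order number"},
    "department": {"type": "string", "description": "Requesting department"},
    "approved_by": {"type": "string", "description": "Approver's name"},
    "budget_code": {"type": "string", "description": "Budget or cost center code"}
}

# Same request as in the Quick Start, with the schema object in place of "invoice".
# pdf_base64 is the Base64-encoded file from the Python example above.
response = requests.post(
    "https://api.ramlabs.dev/v1/extract",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "document": pdf_base64,
        "documentType": "pdf",
        "schema": custom_schema
    },
    timeout=30,
)
response.raise_for_status()  # raise on 4xx/5xx before reading the body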

Supported Document Types

Format	How to Send
PDF	Base64-encoded, `documentType: "pdf"`
PNG/JPG/WEBP	Base64-encoded, `documentType: "image"`
Plain text	Send as-is in the `document` field
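One way to apply the table is a small helper that picks the encoding and `documentType` from the file extension. This is a sketch under assumptions: `build_payload` is a hypothetical helper, and mapping `.jpeg` to `"image"` and `.txt` to plain text is a guess. The table does not give a `documentType` value for plain text, so the sketch leaves that key out; check the API reference before relying on this.

```python
import base64
from pathlib import Path
from typing import Any, Dict

# Extensions treated as images. ".jpeg" is an assumed alias for JPG.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def build_payload(path: str, schema: Any = "invoice") -> Dict[str, Any]:
    """Build the JSON body for an extract request from a local file."""
    file = Path(path)
    suffix = file.suffix.lower()

    if suffix == ".txt":
        # Plain text is sent as-is. ASSUMPTION: no documentType key is
        # included, because the table does not specify one for text.
        return {"document": file.read_text(encoding="utf-8"), "schema": schema}

    if suffix == ".pdf":
        document_type = "pdf"
    elif suffix in IMAGE_SUFFIXES:
        document_type = "image"
    else:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")

    # PDFs and images are Base64-encoded before sending.
    encoded = base64.b64encode(file.read_bytes()).decode("ascii")
    return {"document": encoded, "documentType": document_type, "schema": schema}
```

The returned dictionary can be passed as the `json=` argument of the `requests.post` call shown earlier.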

Get Started Free

25 extractions/month. No credit card required.