Data Extraction - VATextract

VATextract uses advanced OCR technology to extract structured data from any invoice format: PDF, scanned documents, or images.

Supported File Formats

Format	Description
PDF	Native PDFs and scanned documents
JPEG/PNG	High-resolution images
Multi-page	Documents up to 100 pages

Extracted Fields

Header Information

Field	Description
`invoiceNumber`	Unique invoice identifier
`invoiceDate`	Date the invoice was issued
`dueDate`	Payment due date
`deliveryDate`	Goods/services delivery date
`purchaseOrder`	Purchase order reference

Financial Data

Field	Description
`netAmount`	Pre-tax amount
`vatAmount`	VAT/tax amount
`vatRate`	VAT percentage
`totalAmount`	Total including tax
`currency`	ISO currency code (EUR, GBP, USD, etc.)
`freightAmount`	Shipping/freight charges
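
The financial fields above can be cross-checked after extraction. The following is a minimal sketch, not part of VATextract itself: it assumes `vatRate` is expressed as a percentage (for example `20` for 20%) and that the values arrive as plain numbers. The function name and the rounding tolerance are hypothetical.

```python
from decimal import Decimal


def vat_is_consistent(net_amount, vat_rate, vat_amount, tolerance=Decimal("0.01")):
    """Return True when vat_amount equals net_amount * vat_rate / 100 within tolerance.

    Assumption: vat_rate is a percentage such as 20, not a fraction such as 0.20.
    """
    # Convert through str() so float inputs do not carry binary rounding noise.
    expected = Decimal(str(net_amount)) * Decimal(str(vat_rate)) / Decimal("100")
    return abs(expected - Decimal(str(vat_amount))) <= tolerance


print(vat_is_consistent(1000, 20, 200))
print(vat_is_consistent(1000, 20, 250))
```

A mismatch does not necessarily mean the extraction is wrong; invoices with mixed VAT rates or rounding per line item can legitimately fail a header-level check like this one.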

Supplier Details

Field	Description
`supplierName`	Company name
`supplierTaxId`	VAT/Tax identification number
`supplierAddress`	Full address
`supplierContact`	Email, phone, IBAN, etc.

Line Items

Each line item contains:

{
  "description": "Product or service name",
  "quantity": 10,
  "unitPrice": 25.00,
  "amount": 250.00,
  "productCode": "SKU-12345",
  "vatCode": "T1"
}

VAT codes are automatically assigned based on the line item’s VAT rate and your accounting software. See VAT Codes for details.
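
A line item in the shape shown above can be sanity-checked by comparing `quantity` × `unitPrice` with `amount`. This is a sketch under the assumption that `amount` is the untaxed line total; the helper name and tolerance are hypothetical and not part of the product.

```python
import json
from decimal import Decimal

# Sample line item copied from the structure documented above.
LINE_ITEM_JSON = """
{
  "description": "Product or service name",
  "quantity": 10,
  "unitPrice": 25.00,
  "amount": 250.00,
  "productCode": "SKU-12345",
  "vatCode": "T1"
}
"""


def line_amount_matches(item, tolerance=Decimal("0.01")):
    """Return True when quantity * unitPrice equals amount within tolerance."""
    expected = Decimal(str(item["quantity"])) * Decimal(str(item["unitPrice"]))
    return abs(expected - Decimal(str(item["amount"]))) <= tolerance


# parse_float=Decimal keeps monetary values exact while parsing.
item = json.loads(LINE_ITEM_JSON, parse_float=Decimal)
print(line_amount_matches(item))
```

Line items that include discounts or per-line tax would need a looser rule than this one.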

OCR Providers

VATextract supports multiple OCR engines:

Google Document AI
AWS Textract

Default provider. Best for European invoices and complex layouts.

Excellent multi-language support
Strong table extraction
High accuracy on scanned documents

Configure your preferred OCR provider in Settings → Preferences, or set the OCR_PROVIDER environment variable for self-hosted deployments.
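
For self-hosted deployments, a startup script might read the variable before launching the service. The accepted values of `OCR_PROVIDER` are not listed on this page, so the sketch below only reads the variable and treats an unset or empty value as "use the default provider". The function name is hypothetical.

```python
import os


def configured_ocr_provider(env=os.environ):
    """Return the trimmed OCR_PROVIDER value, or None when it is unset or blank."""
    value = env.get("OCR_PROVIDER", "").strip()
    return value or None


provider = configured_ocr_provider()
if provider is None:
    print("OCR_PROVIDER is not set; the default provider applies")
else:
    print(f"OCR_PROVIDER is set to {provider!r}")
```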

Extraction Confidence

Each extracted field includes a confidence score (0-100%). Low-confidence extractions are highlighted in the review interface for manual verification.
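
An integration could apply the same idea to its own data by flagging fields whose score falls below a cutoff. The sketch below assumes each field is a dictionary with `fieldName` and `confidence` keys, as in the geometry sample on this page. The threshold of 80 is an illustrative assumption; the cutoff used by the review interface is not documented here.

```python
def fields_needing_review(fields, threshold=80.0):
    """Return the names of fields whose confidence (0-100) is below threshold."""
    return [f["fieldName"] for f in fields if f["confidence"] < threshold]


# Illustrative data, not real extraction output.
sample_fields = [
    {"fieldName": "TOTAL", "text": "$1,250.00", "confidence": 98.5},
    {"fieldName": "DUE_DATE", "text": "2O24-01-31", "confidence": 61.0},
]

print(fields_needing_review(sample_fields))
```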

Geometry Data

For advanced integrations, VATextract provides bounding box coordinates for each extracted field:

{
  "fieldName": "TOTAL",
  "text": "$1,250.00",
  "confidence": 98.5,
  "geometry": {
    "boundingBox": {
      "left": 0.72,
      "top": 0.85,
      "width": 0.15,
      "height": 0.02
    },
    "pageNumber": 1
  }
}

This enables document overlay highlighting and programmatic field location.
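
To draw an overlay, the bounding box has to be mapped onto the rendered page. The sample values (`0.72`, `0.85`, and so on) suggest coordinates normalized to the page width and height, with the origin at the top-left corner. That is an assumption, as are the function name and the page dimensions used below.

```python
def to_pixel_box(bounding_box, page_width_px, page_height_px):
    """Convert a normalized bounding box to (left, top, right, bottom) in pixels.

    Assumption: left/top/width/height are fractions of the page size.
    """
    left = round(bounding_box["left"] * page_width_px)
    top = round(bounding_box["top"] * page_height_px)
    right = round((bounding_box["left"] + bounding_box["width"]) * page_width_px)
    bottom = round((bounding_box["top"] + bounding_box["height"]) * page_height_px)
    return left, top, right, bottom


# Bounding box taken from the sample above; page size is illustrative.
box = {"left": 0.72, "top": 0.85, "width": 0.15, "height": 0.02}
print(to_pixel_box(box, page_width_px=1000, page_height_px=2000))
```

The `pageNumber` value tells the caller which rendered page the rectangle belongs to in a multi-page document.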
