Invoice Data from Scans

If you want to try this extractor, you can download it here:

Download Invoice Extractor

invoice-extractor.json

Document Input Examples

Here are some examples of the types of documents that this extractor can process.

PDF Invoices

Many invoices arrive as PDF files, either generated by accounting software or scanned from paper invoices.

Receipt Photos

Some invoices are only available as photos of receipts. These can be taken with a smartphone or a camera.

Scanned Paper Invoices

A lot of offices still receive paper invoices. These can be scanned and then processed by Data Wizard.

Example Output

This extractor will produce JSON data that looks like the following:

{
    "invoiceNumber": "INV-2022-001",
    "issueDate": "2022-01-01",
    "currency": "EUR",
    "seller": {
        "name": "ACME Inc.",
        "address": "123 Main St.",
        "postalCode": "12345",
        "city": "Springfield",
        "country": "US",
        "vatNumber": "US123456789"
    },
    "buyer": {
        "customerNumber": "CUST-123",
        "name": "Buyer Corp.",
        "address": "456 Elm St.",
        "postalCode": "54321",
        "city": "Shelbyville",
        "country": "US"
    },
    "lineItems": [
        {
            "position": 1,
            "description": "Product A",
            "unitPrice": 100.0,
            "quantity": 2,
            "vatRate": 19.0,
            "netAmount": 200.0
        },
        {
            "position": 2,
            "description": "Product B",
            "unitPrice": 50.0,
            "quantity": 3,
            "vatRate": 19.0,
            "netAmount": 150.0
        }
    ],
    "totalAmounts": {
        "netTotal": 350.0,
        "taxTotal": 66.5,
        "grossTotal": 416.5,
        "dueTotal": 416.5
    },
    "paymentDetails": {
        "paymentTerms": "Net 30 days",
        "paymentMethod": "SEPA_TRANSFER",
        "iban": "DE89370400440532013000"
    }
}
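If you process this output in your own code, it can help to read the monetary amounts as exact decimals rather than binary floats. The following is a minimal sketch using only Python's standard library; the inline string is a shortened copy of the example above, and how you actually receive the extractor's result is left open here.

```python
import json
from decimal import Decimal

# Shortened copy of the example output; a real run would use the
# extractor's JSON result instead of this inline string.
raw = """
{
    "invoiceNumber": "INV-2022-001",
    "currency": "EUR",
    "lineItems": [
        {"position": 1, "unitPrice": 100.0, "quantity": 2, "vatRate": 19.0, "netAmount": 200.0},
        {"position": 2, "unitPrice": 50.0, "quantity": 3, "vatRate": 19.0, "netAmount": 150.0}
    ],
    "totalAmounts": {"netTotal": 350.0, "taxTotal": 66.5, "grossTotal": 416.5, "dueTotal": 416.5}
}
"""

# parse_float=Decimal turns every JSON number with a fractional part into
# a Decimal; whole numbers such as "quantity": 2 stay plain ints.
invoice = json.loads(raw, parse_float=Decimal)

# Sum the net amounts of all line items as exact decimals.
net_from_items = sum(item["netAmount"] for item in invoice["lineItems"])
print(invoice["invoiceNumber"], net_from_items, invoice["currency"])
```

Working with `Decimal` avoids small rounding differences when amounts are later added up or compared.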

Extractor

Here’s a template for an invoice extractor, with explanations of why each setting was chosen.

JSON Schema

Here’s the example JSON schema to extract invoice data. It covers the invoice itself, the seller, the buyer, line items, total amounts, and payment details. Several fields also carry validation rules, such as patterns, enums, and numeric ranges, to help keep the extracted data accurate and consistent.

{
  "title": "Invoice",
  "description": "Schema for an invoice compliant with ZUGFeRD 2.0 / EN16931.",
  "type": "object",
  "properties": {
    "invoiceNumber": {
      "type": "string",
      "description": "Unique invoice identifier",
      "pattern": "^[A-Za-z0-9\\-/]+$"
    },
    "issueDate": {
      "type": "string",
      "format": "date",
      "description": "Date the invoice was issued"
    },
    "currency": {
      "type": "string",
      "description": "Currency code (ISO 4217)",
      "enum": [
        "EUR",
        "USD",
        "GBP",
        "CHF"
      ]
    },
    "seller": {
      "type": "object",
      "description": "Information about the seller",
      "properties": {
        "name": {
          "type": "string",
          "description": "Seller company name"
        },
        "address": {
          "type": "string",
          "description": "Street and number"
        },
        "postalCode": {
          "type": "string",
          "description": "Postal code"
        },
        "city": {
          "type": "string",
          "description": "City name"
        },
        "country": {
          "type": "string",
          "description": "Country code (ISO 3166-1 alpha-2)"
        },
        "vatNumber": {
          "type": "string",
          "description": "VAT identification number"
        }
      },
      "required": [
        "name",
        "address",
        "postalCode",
        "city",
        "country",
        "vatNumber"
      ]
    },
    "buyer": {
      "type": "object",
      "description": "Information about the buyer",
      "properties": {
        "customerNumber": {
          "type": "string",
          "description": "Buyer reference number"
        },
        "name": {
          "type": "string",
          "description": "Buyer company name"
        },
        "address": {
          "type": "string",
          "description": "Street and number"
        },
        "postalCode": {
          "type": "string",
          "description": "Postal code"
        },
        "city": {
          "type": "string",
          "description": "City name"
        },
        "country": {
          "type": "string",
          "description": "Country code (ISO 3166-1 alpha-2)"
        }
      },
      "required": [
        "name",
        "address",
        "postalCode",
        "city",
        "country"
      ]
    },
    "lineItems": {
      "type": "array",
      "description": "List of invoice items",
      "items": {
        "type": "object",
        "properties": {
          "position": {
            "type": "integer",
            "description": "Line item position"
          },
          "description": {
            "type": "string",
            "description": "Item description"
          },
          "unitPrice": {
            "type": "number",
            "description": "Net price per unit",
            "minimum": 0,
            "multipleOf": 0.0001
          },
          "quantity": {
            "type": "number",
            "description": "Quantity of items"
          },
          "vatRate": {
            "type": "number",
            "description": "VAT percentage",
            "minimum": 0,
            "maximum": 100
          },
          "netAmount": {
            "type": "number",
            "description": "Total price before tax"
          }
        },
        "required": [
          "position",
          "unitPrice",
          "quantity",
          "vatRate",
          "netAmount"
        ]
      }
    },
    "totalAmounts": {
      "type": "object",
      "description": "Invoice total amounts",
      "properties": {
        "netTotal": {
          "type": "number",
          "description": "Total amount before tax"
        },
        "taxTotal": {
          "type": "number",
          "description": "Total tax amount"
        },
        "grossTotal": {
          "type": "number",
          "description": "Total amount including tax"
        },
        "dueTotal": {
          "type": "number",
          "description": "Total amount due"
        }
      },
      "required": [
        "netTotal",
        "taxTotal",
        "grossTotal",
        "dueTotal"
      ]
    },
    "paymentDetails": {
      "type": "object",
      "description": "Payment information",
      "properties": {
        "paymentTerms": {
          "type": "string",
          "description": "Payment terms"
        },
        "paymentMethod": {
          "type": "string",
          "description": "Payment method",
          "enum": [
            "SEPA_TRANSFER",
            "CREDIT_CARD",
            "PAYPAL"
          ]
        },
        "iban": {
          "type": "string",
          "description": "IBAN for bank transfer"
        }
      },
      "required": [
        "paymentTerms",
        "paymentMethod"
      ]
    }
  },
  "required": [
    "invoiceNumber",
    "issueDate",
    "currency",
    "seller",
    "buyer",
    "lineItems",
    "totalAmounts"
  ]
}
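A few of the schema's rules can be spot-checked with Python's standard library alone. The sketch below is not a JSON Schema validator: it hand-codes only the top-level required fields, the `invoiceNumber` pattern, the `issueDate` format, and the `currency` enum. The function name `check_basic_rules` is made up for this illustration and is not part of Data Wizard.

```python
import re
from datetime import date

# Values copied from the schema above.
INVOICE_NUMBER = re.compile(r"[A-Za-z0-9\-/]+")
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
CURRENCIES = {"EUR", "USD", "GBP", "CHF"}
REQUIRED = ["invoiceNumber", "issueDate", "currency",
            "seller", "buyer", "lineItems", "totalAmounts"]


def check_basic_rules(invoice: dict) -> list[str]:
    """Hypothetical helper: returns a list of problems, empty if none were found."""
    problems = [f"missing required field: {key}" for key in REQUIRED if key not in invoice]

    number = invoice.get("invoiceNumber")
    if number is not None and not (isinstance(number, str) and INVOICE_NUMBER.fullmatch(number)):
        problems.append("invoiceNumber does not match the schema pattern")

    issue_date = invoice.get("issueDate")
    if issue_date is not None:
        try:
            if not (isinstance(issue_date, str) and ISO_DATE.fullmatch(issue_date)):
                raise ValueError("not in YYYY-MM-DD form")
            date.fromisoformat(issue_date)  # rejects impossible dates such as 2022-02-30
        except ValueError:
            problems.append("issueDate is not a valid YYYY-MM-DD date")

    currency = invoice.get("currency")
    if currency is not None and not (isinstance(currency, str) and currency in CURRENCIES):
        problems.append("currency is not one of the allowed codes")

    return problems


# Usage: an invoice number with a space, an impossible date, and an unlisted currency.
sample = {"invoiceNumber": "INV 001", "issueDate": "2022-02-30", "currency": "JPY"}
for problem in check_basic_rules(sample):
    print(problem)
```

For complete coverage of the schema, including nested objects and `multipleOf`, a full JSON Schema validator is the better tool; this sketch only shows what the individual rules mean in practice.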

Extraction Strategy

Strategy: `simple`

You can use any of the strategies for invoice data. It’s probably a good idea to get started with the simple strategy using a model with a large context window, for example google/gemini-2.0-flash.

Invoices belong to the kind of data that benefits from further validation, like recalculating totals or verifying tax rates. For these use cases, it can make sense to implement your own custom strategy that includes these additional validation steps.

Find out how to create custom extraction strategies
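As an illustration of the kind of validation mentioned above, here is a minimal sketch that recalculates the totals of an extracted invoice. It is plain Python and independent of how custom strategies are actually implemented in Data Wizard; the helper names `to_money` and `check_totals` are hypothetical. It also assumes a simple invoice with no discounts or surcharges and rounding to two decimal places, which will not hold for every real invoice.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def to_money(value) -> Decimal:
    # Going through str() avoids carrying binary float noise into the Decimal.
    return Decimal(str(value)).quantize(CENT, rounding=ROUND_HALF_UP)


def check_totals(invoice: dict) -> list[str]:
    """Hypothetical helper: returns a list of inconsistencies, empty if none were found."""
    problems = []
    net = Decimal("0")
    tax = Decimal("0")
    for item in invoice["lineItems"]:
        expected = to_money(Decimal(str(item["unitPrice"])) * Decimal(str(item["quantity"])))
        if expected != to_money(item["netAmount"]):
            problems.append(f"line {item['position']}: netAmount != unitPrice * quantity")
        net += to_money(item["netAmount"])
        tax += Decimal(str(item["netAmount"])) * Decimal(str(item["vatRate"])) / Decimal("100")

    totals = invoice["totalAmounts"]
    if to_money(net) != to_money(totals["netTotal"]):
        problems.append("netTotal does not match the sum of line items")
    if to_money(tax) != to_money(totals["taxTotal"]):
        problems.append("taxTotal does not match the VAT computed per line item")
    if to_money(totals["netTotal"]) + to_money(totals["taxTotal"]) != to_money(totals["grossTotal"]):
        problems.append("grossTotal does not equal netTotal + taxTotal")
    return problems


# Usage with the line items and totals from the example output.
example = {
    "lineItems": [
        {"position": 1, "unitPrice": 100.0, "quantity": 2, "vatRate": 19.0, "netAmount": 200.0},
        {"position": 2, "unitPrice": 50.0, "quantity": 3, "vatRate": 19.0, "netAmount": 150.0},
    ],
    "totalAmounts": {"netTotal": 350.0, "taxTotal": 66.5, "grossTotal": 416.5, "dueTotal": 416.5},
}
print(check_totals(example))
```

`dueTotal` is deliberately left unchecked here, since prepayments or credits can make it differ from `grossTotal`.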

LLM Recommendation: `google/gemini-2.0-flash`

This model has a large context window and is relatively cheap to use. It also has vision support, which is useful for scanned documents. You can also use its lighter variant, google/gemini-2.0-flash-lite, if you want to reduce costs.

Context Settings

Chunk Size: `75k`

A chunk size of around 75,000 tokens should be enough to cover most invoices in a single chunk.
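To get a feeling for how much text fits into a chunk of that size, you can make a rough estimate. The sketch below assumes about four characters per token, which is only a rule of thumb for English text and not the tokenizer of any particular model. It also ignores the tokens that page screenshots consume. The function names are hypothetical.

```python
CHUNK_SIZE_TOKENS = 75_000
CHARS_PER_TOKEN = 4  # assumption: rough average for English text, not a real tokenizer


def estimate_tokens(text: str) -> int:
    # Ceiling division so that short, non-empty texts never estimate to zero tokens.
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_one_chunk(text: str, chunk_size: int = CHUNK_SIZE_TOKENS) -> bool:
    return estimate_tokens(text) <= chunk_size


# Usage: a text of 12,000 characters stays far below the chunk size.
invoice_text = "x" * 12_000
print(estimate_tokens(invoice_text), fits_in_one_chunk(invoice_text))
```

Under this assumption, 75,000 tokens correspond to roughly 300,000 characters of text, far more than a typical invoice contains.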

Include Text: `true`

The text content of the invoice is the primary source for the extracted data, so it should be included.

Include Embedded Images: `false`

Images embedded in the files, such as logos, are usually not relevant to the invoice data, so we don’t need to include them.

Include Page Screenshots: `true`

On the other hand, it makes sense to include page screenshots, as they give the LLM some more context about the location of certain text blocks. It also allows the LLM to parse invoices that don’t have a text layer, like scanned documents or photos.

Mark Images with IDs: `false`

Since we’re not interested in assigning any images to data entities, we don’t need this setting.

Next Steps

Learn how to extract some data

Step by step guide to extract data from documents using Data Wizard.

Extractors

Learn how to define and configure data extraction tasks.

Strategies

Understand different data processing strategies.

LLM Provider Configuration

Set up your Large Language Model API keys.

Integration

Embed Data Wizard into other applications using iFrames or APIs.

Receipt photo by: kaboompics.com
