If you want to try this extractor, you can download it here:

Download Invoice Extractor

invoice-extractor.json

Document Input Examples

Here are some examples of the types of documents that this extractor can process.

PDF Invoices

A lot of invoices are delivered as PDF files. These can be generated by accounting software or scanned from paper invoices.

Receipt Photos

Some invoices are only available as photos of receipts. These can be taken with a smartphone or a camera.

Scanned Paper Invoices

A lot of offices still receive paper invoices. These can be scanned and then processed by Data Wizard.

Example Output

This extractor will produce JSON data that looks like the following:

{
    "invoiceNumber": "INV-2022-001",
    "issueDate": "2022-01-01",
    "currency": "EUR",
    "seller": {
        "name": "ACME Inc.",
        "address": "123 Main St.",
        "postalCode": "12345",
        "city": "Springfield",
        "country": "US",
        "vatNumber": "US123456789"
    },
    "buyer": {
        "customerNumber": "CUST-123",
        "name": "Buyer Corp.",
        "address": "456 Elm St.",
        "postalCode": "54321",
        "city": "Shelbyville",
        "country": "US"
    },
    "lineItems": [
        {
            "position": 1,
            "description": "Product A",
            "unitPrice": 100.0,
            "quantity": 2,
            "vatRate": 19.0,
            "netAmount": 200.0
        },
        {
            "position": 2,
            "description": "Product B",
            "unitPrice": 50.0,
            "quantity": 3,
            "vatRate": 19.0,
            "netAmount": 150.0
        }
    ],
    "totalAmounts": {
        "netTotal": 350.0,
        "taxTotal": 66.5,
        "grossTotal": 416.5,
        "dueTotal": 416.5
    },
    "paymentDetails": {
        "paymentTerms": "Net 30 days",
        "paymentMethod": "SEPA_TRANSFER",
        "iban": "DE89370400440532013000"
    }
}

Extractor

Here’s a template for an invoice extractor, with explanations of why each setting was chosen.

JSON Schema

Here’s the example JSON schema to extract invoice data. It includes information about the invoice, seller, buyer, line items, total amounts, and payment details. The schema also contains validation rules for each field to ensure the extracted data is accurate and consistent.
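As a rough illustration of the kind of validation rules such a schema can carry, here is a minimal sketch of a schema fragment written as a Python dictionary. The field names come from the example output above, but the specific rules (required fields, the currency pattern, the VAT rate range) are assumptions made for illustration; the schema shipped in invoice-extractor.json may differ. The `missing_required` helper is hypothetical and only checks required keys; it is not a full JSON Schema validator.

```python
# Hypothetical schema fragment inferred from the sample output; the real
# schema in invoice-extractor.json may use different rules.
INVOICE_SCHEMA = {
    "type": "object",
    "required": ["invoiceNumber", "issueDate", "currency", "lineItems", "totalAmounts"],
    "properties": {
        "invoiceNumber": {"type": "string", "minLength": 1},
        "issueDate": {"type": "string", "format": "date"},
        "currency": {"type": "string", "pattern": "^[A-Z]{3}$"},
        "lineItems": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["description", "netAmount"],
                "properties": {
                    "position": {"type": "integer", "minimum": 1},
                    "description": {"type": "string"},
                    "unitPrice": {"type": "number"},
                    "quantity": {"type": "number"},
                    "vatRate": {"type": "number", "minimum": 0, "maximum": 100},
                    "netAmount": {"type": "number"},
                },
            },
        },
        "totalAmounts": {
            "type": "object",
            "required": ["grossTotal"],
            "properties": {
                "netTotal": {"type": "number"},
                "taxTotal": {"type": "number"},
                "grossTotal": {"type": "number"},
                "dueTotal": {"type": "number"},
            },
        },
    },
}


def missing_required(data, schema, path=""):
    """Return dotted paths of required keys that are absent from data.

    Only the "required" keyword is checked; types and patterns are ignored.
    """
    missing = []
    if schema.get("type") == "object" and isinstance(data, dict):
        for key in schema.get("required", []):
            if key not in data:
                missing.append(f"{path}{key}")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in data:
                missing += missing_required(data[key], sub_schema, f"{path}{key}.")
    elif schema.get("type") == "array" and isinstance(data, list):
        for index, item in enumerate(data):
            missing += missing_required(item, schema.get("items", {}), f"{path}{index}.")
    return missing
```

Calling `missing_required(extracted, INVOICE_SCHEMA)` on an extraction result returns an empty list when every required field is present, which makes it easy to flag incomplete extractions before they reach downstream systems.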

Extraction Strategy

Strategy: simple

You can use any of the strategies for invoice data. A good starting point is the simple strategy with a model that has a large context window, for example google/gemini-2.0-flash.

Invoices belong to the kind of data that often requires further validation, like recalculating totals or verifying tax rates. For these use cases, it can make sense to implement your own custom strategy that includes these additional validation steps.
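To make the idea of recalculating totals concrete, here is a minimal sketch of such a check in Python, run against data shaped like the example output. The function name `check_invoice_totals` and the tolerance of 0.01 are assumptions for illustration, not part of Data Wizard. It also assumes tax is computed per line item from `vatRate`; real invoices may round differently, and `dueTotal` is left unchecked because it can legitimately differ from `grossTotal` (for example after a prepayment).

```python
import math


def check_invoice_totals(invoice, tolerance=0.01):
    """Return a list of inconsistencies; an empty list means the totals add up."""
    problems = []
    net_sum = 0.0
    tax_sum = 0.0

    for item in invoice["lineItems"]:
        # Each line's net amount should equal unit price times quantity.
        expected_net = item["unitPrice"] * item["quantity"]
        if not math.isclose(expected_net, item["netAmount"], abs_tol=tolerance):
            problems.append(
                f"line {item['position']}: netAmount is {item['netAmount']}, "
                f"expected {expected_net}"
            )
        net_sum += item["netAmount"]
        tax_sum += item["netAmount"] * item["vatRate"] / 100

    totals = invoice["totalAmounts"]
    expected_totals = {
        "netTotal": net_sum,
        "taxTotal": tax_sum,
        "grossTotal": totals["netTotal"] + totals["taxTotal"],
    }
    for field, expected in expected_totals.items():
        if not math.isclose(totals[field], expected, abs_tol=tolerance):
            problems.append(f"{field} is {totals[field]}, expected {expected}")

    return problems


# Sample data reduced to the fields the check needs, taken from the example output.
SAMPLE = {
    "lineItems": [
        {"position": 1, "unitPrice": 100.0, "quantity": 2, "vatRate": 19.0, "netAmount": 200.0},
        {"position": 2, "unitPrice": 50.0, "quantity": 3, "vatRate": 19.0, "netAmount": 150.0},
    ],
    "totalAmounts": {"netTotal": 350.0, "taxTotal": 66.5, "grossTotal": 416.5, "dueTotal": 416.5},
}

print(check_invoice_totals(SAMPLE))
```

A custom strategy could run a check like this after extraction and either reject the result or ask the model to retry when the list of problems is not empty.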

Find out how to create custom extraction strategies

LLM Recommendation: google/gemini-2.0-flash

This model has a large context window and is relatively cheap to use. It also has vision support, which is useful for scanned documents. You can also use its lighter variant, google/gemini-2.0-flash-lite, if you want to reduce costs.

Context Settings

Chunk Size: 75k

A chunk size of around 75,000 tokens should be enough to cover most invoices.
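For a rough sense of scale, the sketch below estimates how many chunks a document's text would need at this chunk size. It assumes about four characters per token, which is only a common rule of thumb; the actual tokenizer, the way Data Wizard splits documents, and the tokens consumed by page screenshots are not modeled here. The names `estimate_chunks` and `CHARS_PER_TOKEN` are hypothetical.

```python
import math

CHUNK_SIZE_TOKENS = 75_000
CHARS_PER_TOKEN = 4  # assumption: rough average, varies by model and language


def estimate_chunks(text, chunk_size=CHUNK_SIZE_TOKENS, chars_per_token=CHARS_PER_TOKEN):
    """Estimate how many chunks the text needs, based on character count alone."""
    estimated_tokens = math.ceil(len(text) / chars_per_token)
    # Even an empty document is treated as a single chunk.
    return max(1, math.ceil(estimated_tokens / chunk_size))
```

Under this assumption, a multi-page invoice with 12,000 characters of text comes to roughly 3,000 tokens, far below the chunk size, so it would be processed in a single chunk.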

Include Text: true

The text content of the invoice carries most of the data we want to extract, so it should be included.

Include Embedded Images: false

Images embedded in the files, such as company logos, are not relevant to the invoice data, so we don’t need to include them.

Include Page Screenshots: true

On the other hand, it makes sense to include page screenshots, as they give the LLM some more context about the location of certain text blocks. It also allows the LLM to parse invoices that don’t have a text layer, like scanned documents or photos.

Mark Images with IDs: false

Since we’re not interested in assigning any images to data entities, we don’t need this setting.




Next Steps

Learn how to extract some data

Step by step guide to extract data from documents using Data Wizard.


Receipt photo by: kaboompics.com