# Invoice Data from Scans

> This example demonstrates how to extract structured data from scanned invoices. It is useful if your system accepts invoice data and your source files are photos, scans, or PDF files.

If you want to try this extractor, you can download it here:

<Card icon="download" title="Download Invoice Extractor" href="#" horizontal>
  `invoice-extractor.json`
</Card>

<Tip>
  [How do I import extractors?](/docs/import-export)
</Tip>

## Document Input Examples

Here are some examples of the types of documents that this extractor can process.

<CardGroup cols={2}>
  <Card title="PDF Invoices" icon="file-pdf" img="../images/examples/invoice.png">
    A lot of invoices are delivered as PDF files. These can be generated by accounting software or scanned from paper invoices.
  </Card>

  <Card title="Receipt Photos" icon="camera" img="../images/examples/receipt.jpg">
    Some invoices are only available as photos of receipts. These can be taken with a smartphone or a camera.
  </Card>

  <Card title="Scanned Paper Invoices" icon="print">
    A lot of offices still receive paper invoices. These can be scanned and then processed by Data Wizard.
  </Card>
</CardGroup>

## Example Output

This extractor will produce JSON data that looks like the following:

<CodeGroup>
  ```json Simple Invoice theme={null}
  {
      "invoiceNumber": "INV-2022-001",
      "issueDate": "2022-01-01",
      "currency": "EUR",
      "seller": {
          "name": "ACME Inc.",
          "address": "123 Main St.",
          "postalCode": "12345",
          "city": "Springfield",
          "country": "US",
          "vatNumber": "US123456789"
      },
      "buyer": {
          "customerNumber": "CUST-123",
          "name": "Buyer Corp.",
          "address": "456 Elm St.",
          "postalCode": "54321",
          "city": "Shelbyville",
          "country": "US"
      },
      "lineItems": [
          {
              "position": 1,
              "description": "Product A",
              "unitPrice": 100.0,
              "quantity": 2,
              "vatRate": 19.0,
              "netAmount": 200.0
          },
          {
              "position": 2,
              "description": "Product B",
              "unitPrice": 50.0,
              "quantity": 3,
              "vatRate": 19.0,
              "netAmount": 150.0
          }
      ],
      "totalAmounts": {
          "netTotal": 350.0,
          "taxTotal": 66.5,
          "grossTotal": 416.5,
          "dueTotal": 416.5
      },
      "paymentDetails": {
          "paymentTerms": "Net 30 days",
          "paymentMethod": "SEPA_TRANSFER",
          "iban": "DE89370400440532013000"
      }
  }
  ```
</CodeGroup>

## Extractor

Here's a template for an invoice extractor, with explanations of why each setting was chosen.

### JSON Schema

Here's the example JSON schema to extract invoice data. It includes information about the invoice, seller, buyer, line items, total amounts, and payment details.
The schema also contains validation rules for several fields, such as patterns, enums, numeric ranges, and required properties, to help keep the extracted data accurate and consistent.

<Accordion title="Show Invoice Schema">
  ```json theme={null}
  {
    "title": "Invoice",
    "description": "Schema for an invoice compliant with ZUGFeRD 2.0 / EN16931.",
    "type": "object",
    "properties": {
      "invoiceNumber": {
        "type": "string",
        "description": "Unique invoice identifier",
        "pattern": "^[A-Za-z0-9\\-/]+$"
      },
      "issueDate": {
        "type": "string",
        "format": "date",
        "description": "Date the invoice was issued"
      },
      "currency": {
        "type": "string",
        "description": "Currency code (ISO 4217)",
        "enum": [
          "EUR",
          "USD",
          "GBP",
          "CHF"
        ]
      },
      "seller": {
        "type": "object",
        "description": "Information about the seller",
        "properties": {
          "name": {
            "type": "string",
            "description": "Seller company name"
          },
          "address": {
            "type": "string",
            "description": "Street and number"
          },
          "postalCode": {
            "type": "string",
            "description": "Postal code"
          },
          "city": {
            "type": "string",
            "description": "City name"
          },
          "country": {
            "type": "string",
            "description": "Country code (ISO 3166-1 alpha-2)"
          },
          "vatNumber": {
            "type": "string",
            "description": "VAT identification number"
          }
        },
        "required": [
          "name",
          "address",
          "postalCode",
          "city",
          "country",
          "vatNumber"
        ]
      },
      "buyer": {
        "type": "object",
        "description": "Information about the buyer",
        "properties": {
          "customerNumber": {
            "type": "string",
            "description": "Buyer reference number"
          },
          "name": {
            "type": "string",
            "description": "Buyer company name"
          },
          "address": {
            "type": "string",
            "description": "Street and number"
          },
          "postalCode": {
            "type": "string",
            "description": "Postal code"
          },
          "city": {
            "type": "string",
            "description": "City name"
          },
          "country": {
            "type": "string",
            "description": "Country code (ISO 3166-1 alpha-2)"
          }
        },
        "required": [
          "name",
          "address",
          "postalCode",
          "city",
          "country"
        ]
      },
      "lineItems": {
        "type": "array",
        "description": "List of invoice items",
        "items": {
          "type": "object",
          "properties": {
            "position": {
              "type": "integer",
              "description": "Line item position"
            },
            "description": {
              "type": "string",
              "description": "Item description"
            },
            "unitPrice": {
              "type": "number",
              "description": "Net price per unit",
              "minimum": 0,
              "multipleOf": 0.0001
            },
            "quantity": {
              "type": "number",
              "description": "Quantity of items"
            },
            "vatRate": {
              "type": "number",
              "description": "VAT percentage",
              "minimum": 0,
              "maximum": 100
            },
            "netAmount": {
              "type": "number",
              "description": "Total price before tax"
            }
          },
          "required": [
            "position",
            "unitPrice",
            "quantity",
            "vatRate",
            "netAmount"
          ]
        }
      },
      "totalAmounts": {
        "type": "object",
        "description": "Invoice total amounts",
        "properties": {
          "netTotal": {
            "type": "number",
            "description": "Total amount before tax"
          },
          "taxTotal": {
            "type": "number",
            "description": "Total tax amount"
          },
          "grossTotal": {
            "type": "number",
            "description": "Total amount including tax"
          },
          "dueTotal": {
            "type": "number",
            "description": "Total amount due"
          }
        },
        "required": [
          "netTotal",
          "taxTotal",
          "grossTotal",
          "dueTotal"
        ]
      },
      "paymentDetails": {
        "type": "object",
        "description": "Payment information",
        "properties": {
          "paymentTerms": {
            "type": "string",
            "description": "Payment terms"
          },
          "paymentMethod": {
            "type": "string",
            "description": "Payment method",
            "enum": [
              "SEPA_TRANSFER",
              "CREDIT_CARD",
              "PAYPAL"
            ]
          },
          "iban": {
            "type": "string",
            "description": "IBAN for bank transfer"
          }
        },
        "required": [
          "paymentTerms",
          "paymentMethod"
        ]
      }
    },
    "required": [
      "invoiceNumber",
      "issueDate",
      "currency",
      "seller",
      "buyer",
      "lineItems",
      "totalAmounts"
    ]
  }
  ```
</Accordion>
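
The validation rules in the schema can also be spot-checked in your own code after extraction. The following is a minimal sketch in Python using only the standard library. It checks just a handful of the constraints from the schema above (top-level required fields, the `invoiceNumber` pattern, the `currency` enum, and the `issueDate` format). The function name `spot_check` is hypothetical and not part of Data Wizard; a full JSON Schema validator would cover far more than this.

```python
import re
from datetime import date

# Constraints copied by hand from the schema above.
INVOICE_NUMBER_PATTERN = re.compile(r"^[A-Za-z0-9\-/]+$")
ALLOWED_CURRENCIES = {"EUR", "USD", "GBP", "CHF"}
REQUIRED_TOP_LEVEL = [
    "invoiceNumber",
    "issueDate",
    "currency",
    "seller",
    "buyer",
    "lineItems",
    "totalAmounts",
]


def spot_check(invoice: dict) -> list[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    errors = [
        f"missing required field: {key}"
        for key in REQUIRED_TOP_LEVEL
        if key not in invoice
    ]

    number = invoice.get("invoiceNumber")
    if isinstance(number, str) and not INVOICE_NUMBER_PATTERN.search(number):
        errors.append(f"invoiceNumber has unexpected characters: {number!r}")

    currency = invoice.get("currency")
    if currency is not None and currency not in ALLOWED_CURRENCIES:
        errors.append(f"currency not in schema enum: {currency!r}")

    issue_date = invoice.get("issueDate")
    if isinstance(issue_date, str):
        try:
            date.fromisoformat(issue_date)
        except ValueError:
            errors.append(f"issueDate is not an ISO date: {issue_date!r}")

    return errors
```

Calling `spot_check` on the parsed JSON output returns an empty list when none of these checks fail, which makes it easy to route suspicious invoices to manual review.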

### Extraction Strategy

#### Strategy: `simple`

Any of the strategies works for invoice data.
A good starting point is the `simple` strategy with a model that has a large context window, for example `google/gemini-2.0-flash`.

<Warning>
  Invoices belong to the kind of data that requires further validation, like recalculating totals or verifying tax rates.
  For these use cases, it can make sense to implement your own custom strategy that includes these additional validation steps.

  [Find out how to create custom extraction strategies](../custom-strategies)
</Warning>
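
To illustrate the kind of validation meant here, the following Python sketch recalculates the totals of an extracted invoice and compares them with the extracted values. It is a standalone post-processing check, not a Data Wizard custom strategy, and the names `to_money` and `check_invoice_totals` are hypothetical. It assumes amounts are rounded to two decimal places and that there are no discounts, surcharges, or prepayments, which real invoices may well contain.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def to_money(value) -> Decimal:
    """Convert a JSON number to a Decimal rounded to two places."""
    return Decimal(str(value)).quantize(CENT, rounding=ROUND_HALF_UP)


def check_invoice_totals(invoice: dict) -> list[str]:
    """Return a list of inconsistencies found in the extracted amounts."""
    problems = []
    net_sum = Decimal("0")
    tax_sum = Decimal("0")

    for item in invoice["lineItems"]:
        unit_price = Decimal(str(item["unitPrice"]))
        quantity = Decimal(str(item["quantity"]))
        expected_net = to_money(unit_price * quantity)
        net = to_money(item["netAmount"])
        if net != expected_net:
            problems.append(
                f"line {item['position']}: netAmount {net} "
                f"!= unitPrice * quantity {expected_net}"
            )
        net_sum += net
        tax_sum += net * Decimal(str(item["vatRate"])) / Decimal("100")

    totals = invoice["totalAmounts"]
    net_total = to_money(totals["netTotal"])
    tax_total = to_money(totals["taxTotal"])
    gross_total = to_money(totals["grossTotal"])

    if to_money(net_sum) != net_total:
        problems.append(f"netTotal {net_total} != sum of line items {to_money(net_sum)}")
    if to_money(tax_sum) != tax_total:
        problems.append(f"taxTotal {tax_total} != computed tax {to_money(tax_sum)}")
    if net_total + tax_total != gross_total:
        problems.append(f"grossTotal {gross_total} != netTotal + taxTotal")

    return problems
```

For the example output shown earlier on this page, the line items add up to a net total of 350.00 and a tax of 66.50, so the function reports no problems. `Decimal` is used instead of floats to avoid rounding noise when comparing monetary amounts.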

#### LLM Recommendation: `google/gemini-2.0-flash`

This model has a large context window and is relatively cheap to use. It also has vision support, which is useful for scanned documents.
You can also use its lighter variant `google/gemini-2.0-flash-lite` if you want to reduce costs.

### Context Settings

#### Chunk Size: `75k`

A chunk size of around 75,000 tokens should be enough to cover most invoices.
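
If you want a rough feeling for whether a document's text fits into a single chunk, a back-of-the-envelope estimate can help. The sketch below assumes about four characters per token, which is only a common rule of thumb for English text and not how Data Wizard or any specific model counts tokens. It also ignores page screenshots, which consume tokens as well. The function names are hypothetical.

```python
import math

CHUNK_SIZE_TOKENS = 75_000  # the chunk size suggested above
CHARS_PER_TOKEN = 4         # assumption: rough heuristic, varies by model and language


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def chunks_needed(text: str, chunk_size: int = CHUNK_SIZE_TOKENS) -> int:
    """Estimate how many chunks the text would be split into (at least one)."""
    return max(1, math.ceil(estimate_tokens(text) / chunk_size))
```

With these assumptions, a single chunk holds roughly 300,000 characters of text, which is far more than a typical invoice contains.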

#### Include Text: `true`

The text layer of the invoice, when there is one, carries most of the data we want to extract, so it should be included.

#### Include Embedded Images: `false`

Images embedded in the files are typically not relevant to the invoice data, so we don't need to include them.

#### Include Page Screenshots: `true`

On the other hand, it makes sense to include page screenshots, as they give the LLM some more context about the location of certain text blocks.
It also allows the LLM to parse invoices that don't have a text layer, like scanned documents or photos.

#### Mark Images with IDs: `false`

Since we're not interested in assigning any images to data entities, we don't need this setting.

<br />

<br />

<br />

**Next Steps**

<Card title="Learn how to extract some data" icon="list-ol" href="./extracting-data">
  Step by step guide to extract data from documents using Data Wizard.
</Card>

<CardGroup>
  <Card title="Extractors" icon="laptop-code" href="./extractors">
    Learn how to define and configure data extraction tasks.
  </Card>

  <Card title="Strategies" icon="code-branch" href="./strategies">
    Understand different data processing strategies.
  </Card>

  <Card title="LLM Provider Configuration" icon="sliders" href="./configure-llm">
    Set up your Large Language Model API keys.
  </Card>

  <Card title="Integration" icon="code" href="./integrate">
    Embed Data Wizard into other applications using iFrames or APIs.
  </Card>
</CardGroup>

<br />

> Receipt photo by: [kaboompics.com](https://www.pexels.com/photo/close-up-of-woman-hands-holding-bill-4959926/)
