Products from Brochures
This example extracts many small pieces of information from a single document, in this case product names and prices. It is useful for automatically importing product and pricing information from online brochures, for example for competitor analysis.
If you want to try this extractor, you can download it here:
Download Product Brochure Extractor
product-brochure-extractor.json
Document Input Examples
Here are some examples of the types of documents that this extractor can process.
PDF Brochures
Supermarket brochures are often available as PDF files online. These brochures contain product information and pricing data.
Online Advertisements
If you take screenshots of competitor advertisements, you could use this tool to extract product and pricing information for better analysis.
Example Output
This extractor will produce JSON data that looks like the following:
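As a rough illustration, output for a brochure with two products might have the shape shown in the sketch below. The field names (`products`, `name`, `original_price`, `discounted_price`) and all values are assumptions based on the schema description on this page, not actual extractor output. The small helper `discount_percent` is hypothetical and only shows how the data could be consumed downstream.

```python
import json

# Hypothetical sample output: field names and values are illustrative assumptions.
sample_output = """
{
  "products": [
    {"name": "Organic Bananas 1 kg", "original_price": 2.49, "discounted_price": 1.99},
    {"name": "Whole Milk 1 L", "original_price": 1.29, "discounted_price": null}
  ]
}
"""

data = json.loads(sample_output)


def discount_percent(product):
    """Return the discount as a percentage, or None if a price is missing."""
    original = product["original_price"]
    discounted = product["discounted_price"]
    if original is None or discounted is None or original <= 0:
        return None
    return round((original - discounted) / original * 100, 1)


for product in data["products"]:
    print(product["name"], discount_percent(product))
```

A `null` discounted price is used here to represent a product that is not on offer; whether the real extractor emits `null` or omits the field depends on the schema you configure.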
Extractor
Here’s a template for a product brochure extractor, with explanations of why each setting was chosen.
JSON Schema
Here’s the example JSON schema to extract product data from brochures. It extracts a list of products, each with a name, original price, and discounted price. The schema also contains validation rules for each field to ensure the extracted data is accurate and consistent.
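A minimal sketch of what such a schema could look like, written as a Python dictionary so it can be printed as JSON. The property names and the specific validation rules (`minLength`, `minimum`, nullable prices) are assumptions for illustration and may differ from the downloadable extractor. The `validate_product` helper is a hypothetical stdlib-only check covering the same rules, not a full JSON Schema validator.

```python
import json

# Hypothetical schema sketch: property names and limits are assumptions.
product_schema = {
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string", "minLength": 1},
                    "original_price": {"type": ["number", "null"], "minimum": 0},
                    "discounted_price": {"type": ["number", "null"], "minimum": 0},
                },
                "required": ["name"],
            },
        }
    },
    "required": ["products"],
}


def validate_product(product):
    """Return a list of problems for one product (empty list means valid)."""
    problems = []
    name = product.get("name")
    if not isinstance(name, str) or len(name) < 1:
        problems.append("name must be a non-empty string")
    for field in ("original_price", "discounted_price"):
        value = product.get(field)
        if value is None:
            continue  # missing or null prices are allowed in this sketch
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)) or value < 0:
            problems.append(f"{field} must be a non-negative number or null")
    return problems


print(json.dumps(product_schema, indent=2))
print(validate_product({"name": "Whole Milk 1 L", "original_price": 1.29}))
print(validate_product({"name": "", "discounted_price": -1}))
```

Allowing `null` for prices lets the model report a product that has only one price printed, rather than forcing it to invent the missing one.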
Extraction Strategy
Strategy: parallel-auto-merge
For supermarket brochures with many independent products, the parallel-auto-merge strategy is the most effective. It provides fast extraction times and ensures all products are extracted, even from large brochures.
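The general idea behind a parallel strategy with automatic merging can be sketched as follows: each chunk is processed independently, and the per-chunk product lists are combined into one result. This is a conceptual illustration only, not Data Wizard's implementation. The function `extract_products_from_chunk` is a hypothetical stand-in for an LLM call, and deduplicating by product name is an assumption made for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_products_from_chunk(chunk):
    """Hypothetical stand-in for an LLM call: parses 'name|original|discounted' lines."""
    products = []
    for line in chunk.splitlines():
        name, original, discounted = line.split("|")
        products.append({
            "name": name.strip(),
            "original_price": float(original),
            "discounted_price": float(discounted),
        })
    return products


def merge_results(per_chunk_results):
    """Concatenate chunk results in order, dropping repeated names (case-insensitive)."""
    merged, seen = [], set()
    for products in per_chunk_results:
        for product in products:
            key = product["name"].lower()
            if key not in seen:
                seen.add(key)
                merged.append(product)
    return merged


def extract_parallel(chunks, max_workers=4):
    # executor.map preserves chunk order even though chunks run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        per_chunk = list(executor.map(extract_products_from_chunk, chunks))
    return merge_results(per_chunk)


chunks = [
    "Bananas 1 kg|2.49|1.99\nWhole Milk 1 L|1.29|0.99",
    "Whole Milk 1 L|1.29|0.99\nButter 250 g|2.79|2.29",
]
products = extract_parallel(chunks)
print([p["name"] for p in products])
```

Because products in a brochure rarely depend on each other, no chunk needs the results of another, which is what makes parallel processing a natural fit here.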
LLM Recommendation: google/gemini-2.0-flash-lite
For product extraction from brochures, google/gemini-2.0-flash-lite is a cost-effective choice. It balances performance and cost well and is well suited to extracting data from structured documents like brochures.
Context Settings
Chunk Size: 15k-50k
A chunk size of around 15,000-50,000 tokens is sufficient for product brochures, as each chunk typically contains multiple independent product listings. We don’t want to use a chunk size that is too large, as it may cause the extractor to miss some products.
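To get a feel for what the chunk size means in practice, the number of chunks can be estimated with simple arithmetic. The sketch below assumes chunks do not overlap and uses a made-up brochure size of 120,000 tokens; `estimate_chunk_count` is a hypothetical helper, not part of Data Wizard.

```python
import math


def estimate_chunk_count(total_tokens, chunk_size):
    """Number of chunks needed to cover total_tokens, ignoring any overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return max(1, math.ceil(total_tokens / chunk_size))


brochure_tokens = 120_000  # hypothetical brochure size
for chunk_size in (15_000, 50_000):
    print(chunk_size, estimate_chunk_count(brochure_tokens, chunk_size))
```

With these assumed numbers, the lower end of the range yields 8 chunks and the upper end yields 3, so smaller chunks mean more parallel requests with fewer products each.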
Include Text: true
Including text is essential for extracting product names, descriptions, and prices.
Include Embedded Images: false
Embedded images are not necessary for extracting product data from brochures and can be excluded to reduce token usage and processing time.
Enable this option if you want the LLM to associate images with products, for example when importing product images into ERP systems or online shops.
Include Page Screenshots: true
Page screenshots are helpful if the input has not been OCR’d or if the layout of the brochure is needed to determine which price belongs to which product.
Mark Images with IDs: false
Since we are not associating images with specific product data in this example, marking images with IDs is not needed.
Next Steps
Learn how to extract some data
Step by step guide to extract data from documents using Data Wizard.