Tax Forms to JSON
Convert data from paper-based tax forms into structured JSON for efficient tax processing, data entry, and compliance management.
If you want to try this extractor, you can download it here:
Download Tax Form Extractor
tax-form-extractor.json
Document Input Examples
Here are some examples of the types of documents that this extractor can process.

Tax Forms (PDF)
Official tax forms in PDF format can be processed directly, extracting financial data and taxpayer information.
Example Output
This extractor will produce JSON data that looks like the following:
Extractor
Here’s a template for a tax form extractor, optimized for extracting financial and identification information from tax documents.
JSON Schema
Here’s a JSON schema tailored for tax forms, focusing on common tax-related fields such as taxpayer identification, income, deductions, and tax liabilities.
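As a rough illustration of the kind of schema described here, the sketch below defines a small schema as a Python dictionary and checks a sample record against its required fields and basic types. The field names (taxpayer_id, tax_year, total_income, total_deductions, tax_liability), the helper check_record, and the sample values are all hypothetical; they are not taken from the extractor file.

```python
import json

# Hypothetical, simplified schema; the real extractor's schema may differ.
TAX_FORM_SCHEMA = {
    "type": "object",
    "properties": {
        "taxpayer_id": {"type": "string"},
        "tax_year": {"type": "integer"},
        "total_income": {"type": "number"},
        "total_deductions": {"type": "number"},
        "tax_liability": {"type": "number"},
    },
    "required": ["taxpayer_id", "tax_year", "total_income"],
}

# Map the schema's type names to Python types for a basic check.
_PY_TYPES = {"string": str, "integer": int, "number": (int, float)}


def check_record(record, schema=TAX_FORM_SCHEMA):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    for name in schema["required"]:
        if name not in record:
            problems.append(f"missing required field: {name}")
    for name, value in record.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unexpected field: {name}")
            continue
        expected = _PY_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    return problems


# Made-up sample record standing in for extractor output.
sample = json.loads(
    '{"taxpayer_id": "000-00-0000", "tax_year": 2023, "total_income": 52000.0}'
)
print(check_record(sample))
```

This covers only required fields and primitive types. A full JSON Schema validator would be the more robust choice for real use.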
Extraction Strategy
Strategy: simple
Given the structured nature of most tax forms and the importance of extracting precise data, the simple strategy is generally appropriate and efficient.
Tax data often needs additional validation, such as recalculating totals or verifying tax rates. For these use cases, it can make sense to implement your own custom strategy that includes these validation steps.
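A validation step of this kind might look like the minimal sketch below, which rechecks one total and one rate on an already extracted record. The function validate_totals and the field names (total_income, total_deductions, taxable_income, tax_rate) are hypothetical. The arithmetic is deliberately simplified and does not reflect the rules of any particular tax form, and the sketch does not show how a custom strategy is registered in Data Wizard.

```python
from decimal import Decimal


def validate_totals(form, tolerance=Decimal("0.01")):
    """Return a list of validation errors for a hypothetical extracted record."""
    errors = []

    # Convert through str() so float inputs do not carry binary rounding noise.
    income = Decimal(str(form["total_income"]))
    deductions = Decimal(str(form["total_deductions"]))
    taxable = Decimal(str(form["taxable_income"]))

    # Simplified assumption: taxable income = income - deductions, floored at zero.
    expected_taxable = max(income - deductions, Decimal("0"))
    if abs(taxable - expected_taxable) > tolerance:
        errors.append(
            f"taxable_income {taxable} does not match "
            f"income minus deductions {expected_taxable}"
        )

    # Assumption: the rate is stored as a fraction (0.2 for 20%).
    rate = Decimal(str(form["tax_rate"]))
    if not (Decimal("0") <= rate <= Decimal("1")):
        errors.append("tax_rate should be a fraction between 0 and 1")

    return errors
```

Records that come back with a non-empty error list could be flagged for manual review rather than passed on to data entry.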
LLM Recommendation: google/gemini-2.0-flash
google/gemini-2.0-flash (or the lite variant) is again recommended: its vision capabilities handle scanned forms, and its context window is usually large enough to process an individual tax form in a single call.
Context Settings
Chunk Size: 50k-75k
As with feedback forms, a chunk size of 50,000 to 75,000 tokens should be adequate for most tax forms, allowing a form to be processed completely in a single LLM call.
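To get a feel for whether a given form fits in one chunk, the sketch below estimates a token count from text length. The helpers estimate_tokens and chunks_needed are hypothetical, and the ratio of four characters per token is only a rough rule of thumb for English text; the model's real tokenizer will give different numbers, and page screenshots add tokens that this estimate ignores.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; chars_per_token is an assumed average."""
    return math.ceil(len(text) / chars_per_token)


def chunks_needed(text, chunk_size_tokens=50_000, chars_per_token=4):
    """Estimate how many chunks of the given size the text would occupy."""
    tokens = estimate_tokens(text, chars_per_token)
    return max(1, math.ceil(tokens / chunk_size_tokens))


# Stand-in for the extracted text of a form: about 40,000 characters.
form_text = "x" * 40_000
print(estimate_tokens(form_text), chunks_needed(form_text))
```

If the estimate comes out well above one chunk, the document is probably a multi-form bundle rather than a single tax form.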
Include Text: true
Crucial for extracting all textual and numerical data from the tax form, including income figures, deductions, and identification numbers.
Include Embedded Images: false
Embedded images are not typically relevant in tax forms and can be excluded to optimize processing.
Include Page Screenshots: true
Page screenshots are highly recommended to provide the LLM with layout context, which is vital for accurate data extraction from tax forms, especially when dealing with scanned documents where field placement and visual cues are important.
Mark Images with IDs: false
Not necessary for standard tax form data extraction unless you specifically need to identify and reference elements such as signatures or seals, which is uncommon in typical tax data processing.
Next Steps
Learn how to extract some data
Step by step guide to extract data from documents using Data Wizard.