Learn about the built-in and custom extraction strategies available in Data Wizard.
Strategies determine how Data Wizard processes the documents and interacts with the LLM. Data Wizard provides multiple built-in strategies, and you can also create custom strategies for specific needs.
Simple: Sends as much of the document as possible within the token limit to the LLM in a single call. Suitable for small documents.
Sequential: Splits the document into smaller parts (based on the chunk size), processes each part sequentially, and includes the results of the previous extraction in the prompt for the next part. Maintains contextual continuity.
Parallel: Splits the document into independent parts and processes each part in isolation. Suitable for extracting multiple independent data points that aren’t interconnected across pages.
Auto-Merging: Works like the sequential and parallel strategies, but additionally merges the partial results by concatenating the items of each top-level property, then runs a final LLM call to deduplicate the merged result. This reduces the number of entities the model drops when an extraction requires multiple calls.
Double-Pass: Processes the document twice. The first pass uses the parallel strategy; the second pass reviews and refines the first-pass results with the sequential strategy, combining the benefits of both for increased accuracy and efficiency. This strategy also supports auto-merging.
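To make the difference between the simple, sequential, and parallel strategies concrete, here is a minimal Python sketch. It is not Data Wizard's API: every name in it (`split_into_chunks`, `run_simple`, `run_sequential`, `run_parallel`, `toy_extract`) is hypothetical, word counts stand in for tokens, and the LLM call is replaced by a toy function that collects capitalized words. It also assumes that in the sequential case each call returns the updated cumulative result.

```python
from typing import Callable

# Hypothetical stand-in for an LLM call (not part of Data Wizard):
# receives a chunk of text plus the results so far, returns updated results.
Extractor = Callable[[str, list[str]], list[str]]


def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def run_simple(text: str, extract: Extractor, token_limit: int) -> list[str]:
    """Single call: send as much as fits in the limit, ignore the rest."""
    truncated = " ".join(text.split()[:token_limit])
    return extract(truncated, [])


def run_sequential(text: str, extract: Extractor, chunk_size: int) -> list[str]:
    """Process chunks in order, feeding earlier results into each later call."""
    results: list[str] = []
    for chunk in split_into_chunks(text, chunk_size):
        results = extract(chunk, results)
    return results


def run_parallel(text: str, extract: Extractor, chunk_size: int) -> list[list[str]]:
    """Process each chunk in isolation; no chunk sees another chunk's results."""
    return [extract(chunk, []) for chunk in split_into_chunks(text, chunk_size)]


def toy_extract(chunk: str, previous: list[str]) -> list[str]:
    """Toy extractor: collect capitalized words that are not already known."""
    found = list(previous)
    for word in chunk.split():
        if word.istitle() and word not in found:
            found.append(word)
    return found


if __name__ == "__main__":
    doc = "Alice met Bob then Bob met Carol and Alice left"
    print(run_simple(doc, toy_extract, 3))
    print(run_sequential(doc, toy_extract, 5))
    print(run_parallel(doc, toy_extract, 5))
```

In this toy run the sequential variant avoids re-reporting a name it has already seen, because each call receives the earlier results. The parallel variant reports the same name once per chunk in which it appears, which is the kind of duplication that auto-merging is meant to clean up.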
Simple
Sends as much of the document as possible within the token limit to the LLM in a single call. Suitable for small documents.
Sequential
Splits the document into smaller parts (based on the chunk size), processes each part sequentially, and includes the results of the previous extraction in the prompt for the next part. Maintains contextual continuity.
Parallel
Splits the document into independent parts and processes each part in isolation. Suitable for extracting multiple independent data points that aren’t interconnected across pages.
Sequential with Auto-Merging
Splits the document into smaller parts (based on the chunk size), processes each part sequentially, and includes the results of the previous extraction in the prompt for the next part. Maintains contextual continuity. This strategy also applies auto-merging.
Parallel with Auto-Merging
Splits the document into independent parts and processes each part in isolation. Suitable for extracting multiple independent data points that aren’t interconnected across pages. This strategy also applies auto-merging.
Double-Pass
Processes the document twice. The first pass uses the parallel strategy; the second pass reviews and refines the first-pass results with the sequential strategy, combining the benefits of both for increased accuracy and efficiency. This strategy also supports auto-merging.
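The auto-merging and double-pass ideas described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Data Wizard's implementation: all function names are hypothetical, partial results are assumed to be dictionaries whose top-level properties hold lists, and the final deduplicating LLM call is replaced by a simple exact-match filter (a real LLM call could also merge near-duplicates).

```python
from typing import Any, Callable

# Assumed result shape: top-level property name -> list of extracted items.
Result = dict[str, list[Any]]


def merge_top_level(partials: list[Result]) -> Result:
    """Concatenate the items of each top-level property across partial results."""
    merged: Result = {}
    for partial in partials:
        for key, items in partial.items():
            merged.setdefault(key, []).extend(items)
    return merged


def dedupe_exact(result: Result) -> Result:
    """Stand-in for the final LLM call: drop exact repeats, keep first-seen order."""
    deduped: Result = {}
    for key, items in result.items():
        unique: list[Any] = []
        for item in items:
            if item not in unique:
                unique.append(item)
        deduped[key] = unique
    return deduped


def auto_merge(partials: list[Result],
               deduplicate: Callable[[Result], Result] = dedupe_exact) -> Result:
    """Merge the partial results, then run one deduplication step at the end."""
    return deduplicate(merge_top_level(partials))


def double_pass(chunks: list[str],
                extract_chunk: Callable[[str], Result],
                refine_chunk: Callable[[str, Result], Result]) -> Result:
    """Hypothetical double pass: isolated first pass, ordered refining second pass."""
    # Pass 1: extract from each chunk in isolation, then auto-merge the drafts.
    result = auto_merge([extract_chunk(chunk) for chunk in chunks])
    # Pass 2: revisit the chunks in order, refining the running result.
    for chunk in chunks:
        result = refine_chunk(chunk, result)
    return result


if __name__ == "__main__":
    partials = [
        {"people": ["Alice", "Bob"]},
        {"people": ["Bob", "Carol"], "places": ["Paris"]},
    ]
    print(auto_merge(partials))
```

Keeping the deduplication step as a replaceable callable mirrors the description: concatenation is mechanical, while deciding which items are duplicates is delegated to a final call.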