metadata.json
: Metadata about the source file.
contents.json
: A JSON file holding the structured content of the document, broken down into Slices.
source.<extension>
: The original uploaded file.
[marked.pdf] (for PDFs)
: A marked version of the PDF with embedded images highlighted.
images/
: Directory containing embedded images extracted from the document.
images_marked/
: Directory containing marked versions of embedded images.
pages/
: Directory containing screenshots of each page of the document.
pages_marked/
: Directory containing marked screenshots of each page with highlighted embedded images.

contents.json Structure

The contents.json file is crucial. It structures the document content into Slices. A Slice represents a logical part of the document, such as:
text
: Textual content, like paragraphs.
image
: Embedded images.
page-image
: Screenshots of entire pages.
page-image-marked
: Marked page screenshots with highlighted embedded images.

Each Slice in contents.json includes:
page
: The page number the slice belongs to.
type
: The slice type (text, image, etc.).
text (for text slices)
: The text content.
path (for image slices)
: Path to the image file within the Artifact directory.
mimetype (for image slices)
: Image MIME type.
x, y, width, height (for image slices)
: Image position and dimensions on the page (if available).

contents.json structure
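
As an illustration of how the Slice fields might be consumed, here is a minimal Python sketch. It assumes, without confirmation from the text, that contents.json is a JSON array of Slice objects; the real top-level layout may differ. The sample data, file names, and the helper names `load_slices`, `text_by_page`, and `image_paths` are all hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical sample. ASSUMPTION: contents.json is a JSON array of Slice
# objects carrying the fields described above (page, type, text, path, ...).
SAMPLE_CONTENTS = """
[
  {"page": 1, "type": "text", "text": "Introduction"},
  {"page": 1, "type": "image", "path": "images/img-1.png",
   "mimetype": "image/png", "x": 40, "y": 120, "width": 300, "height": 200},
  {"page": 2, "type": "text", "text": "Second page"}
]
"""


def load_slices(raw):
    """Parse JSON text and return the list of Slice dicts."""
    slices = json.loads(raw)
    if not isinstance(slices, list):
        # The array layout is an assumption; fail loudly if it does not hold.
        raise ValueError("expected a JSON array of slices")
    return slices


def text_by_page(slices):
    """Collect the text of every 'text' slice, keyed by page number."""
    pages = defaultdict(list)
    for s in slices:
        if s.get("type") == "text":
            pages[s["page"]].append(s.get("text", ""))
    return dict(pages)


def image_paths(slices):
    """Return the paths of all 'image' slices, in document order."""
    return [s["path"] for s in slices
            if s.get("type") == "image" and "path" in s]


slices = load_slices(SAMPLE_CONTENTS)
print(text_by_page(slices))
print(image_paths(slices))
```

To try this against a real Artifact directory, one would read the contents.json file from disk and pass its text to `load_slices`. The position fields (x, y, width, height) are read with `.get` semantics in mind, since the description marks them as available only in some cases.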