pdf-to-html


Extract

convertPDFToHTML()

function convertPDFToHTML(pdfURLOrBuffer: string, options?: object): any;

Defined in: packages/ai-research-agent/src/extractor/pdf-to-html/pdf-to-html.js:46

Convert PDF to HTML

Extracts formatted text from a PDF, parsing line breaks, page headers, footnotes, and section headings. Supports fonts, links, bold, italics, lists, headings, headers, footnotes, tables of contents, quotes, and code blocks. Removes repeated headers, links each footnote anchor to its footnote, and preserves the number of each PDF page with an invisible I element.

This function uses pdfjs-serverless, so it works in more environments than other PDF.js-based tools: Cloudflare Workers, serverless platforms, Node.js, and front-end-only apps.
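
The repeated-header removal described above can be pictured with a small sketch. This is an illustration only and not the library's implementation: the names `removeRepeatedHeaders`, `normalizeHeader`, and `minShare` are hypothetical, and it assumes each page has already been split into an array of text lines.

```javascript
// Illustrative sketch only: NOT the code used by convertPDFToHTML.
// Each page is an array of text lines. The first line of a page is treated
// as a repeated header when the same (normalized) line opens more than
// `minShare` of all pages.
function removeRepeatedHeaders(pages, minShare = 0.5) {
  // Count how often each normalized first line occurs across pages.
  const counts = new Map();
  for (const lines of pages) {
    if (lines.length === 0) continue;
    const key = normalizeHeader(lines[0]);
    counts.set(key, (counts.get(key) || 0) + 1);
  }

  // Drop the first line of every page whose header is frequent enough.
  return pages.map((lines) => {
    if (lines.length === 0) return lines;
    const share = counts.get(normalizeHeader(lines[0])) / pages.length;
    return share > minShare ? lines.slice(1) : lines;
  });
}

// Collapse whitespace and mask digit runs so that headers differing only
// by a page number ("My Paper 2" vs "My Paper 3") compare as equal.
function normalizeHeader(line) {
  return line.trim().replace(/\s+/g, " ").replace(/\d+/g, "#");
}
```

Masking digits is one possible design choice: running headers often carry the page number, so an exact string comparison would miss them.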

Parameters

Parameter	Type	Description
`pdfURLOrBuffer`	`string`	URL of a PDF file, or a buffer from fs.readFile
`options?`	{ `addPageNumbers`: `boolean`; `removePageHeaders`: `boolean`; }	Optional settings
`options.addPageNumbers?`	`boolean`	Default: false. Adds # to the end of each page
`options.removePageHeaders?`	`boolean`	Default: true. Removes repeated headers found on each page

Returns

any

HTML formatted text
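
A minimal calling sketch, under stated assumptions: `withDefaults` and `pdfToHtml` are hypothetical helpers, and the converter is passed in as an argument because this page does not give the import path. `await` is used so the sketch works whether or not the function returns a promise, which the `any` return type leaves open.

```javascript
// Defaults as listed in the Parameters table above.
const DEFAULT_OPTIONS = { addPageNumbers: false, removePageHeaders: true };

// Hypothetical helper: caller-supplied options override the documented defaults.
function withDefaults(options = {}) {
  return { ...DEFAULT_OPTIONS, ...options };
}

// Hypothetical wrapper: `convert` is expected to be convertPDFToHTML.
// It is injected rather than imported, since the import path is an assumption.
async function pdfToHtml(convert, pdfURLOrBuffer, options) {
  // `await` is harmless if `convert` turns out to be synchronous.
  return await convert(pdfURLOrBuffer, withDefaults(options));
}

// Example call (the URL is a placeholder):
// const html = await pdfToHtml(convertPDFToHTML, "https://example.com/paper.pdf", {
//   addPageNumbers: true,
// });
```

Passing the options explicitly, even when they match the defaults, keeps the call self-documenting if the defaults change later.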

Author

ai-research-agent (2024), pdf-to-markdown (2017), pdf.js (2012-)
