pdf-to-html

Extract

convertPDFToHTML()

function convertPDFToHTML(pdfURLOrBuffer, options?): string | Object

Convert PDF to HTML

Extracts formatted text from a PDF, parsing line breaks, page headers, footnotes, and section headings. Supports fonts, links, bold, italics, lists, headings, headers, footnotes, tables of contents, quotes, and code blocks. Removes repeated headers, links footnote anchors to their footnotes, and preserves the PDF page number with an invisible I element.

This function uses pdfjs-serverless so that it runs in more environments than PDF.js-based tools: Cloudflare Workers, serverless platforms, Node.js, and front-end-only apps.

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| pdfURLOrBuffer | string | URL to a PDF file or buffer from fs.readFile |
| options? | { addPageNumbers: boolean; removePageHeaders: boolean; } | |
| options.addPageNumbers? | boolean | default=false - Adds # to the end of each page |
| options.removePageHeaders? | boolean | default=true - Removes repeated headers found on each page |
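The documented defaults can be made explicit on the caller's side. The following is a minimal sketch, not part of the library: `ConvertOptions`, `DOCUMENTED_DEFAULTS`, and `resolveOptions` are hypothetical names introduced here for illustration, and the only facts taken from this page are the two option names and their defaults.

```typescript
// Option shape copied from the parameters listed above.
// The interface name is hypothetical; the library may not export such a type.
interface ConvertOptions {
  addPageNumbers?: boolean;
  removePageHeaders?: boolean;
}

// Defaults as documented: addPageNumbers=false, removePageHeaders=true.
const DOCUMENTED_DEFAULTS: Required<ConvertOptions> = {
  addPageNumbers: false,
  removePageHeaders: true,
};

// Hypothetical helper: fills in any omitted (or undefined) option with its
// documented default, so the effective settings are visible before the call.
function resolveOptions(options: ConvertOptions = {}): Required<ConvertOptions> {
  return {
    addPageNumbers: options.addPageNumbers ?? DOCUMENTED_DEFAULTS.addPageNumbers,
    removePageHeaders: options.removePageHeaders ?? DOCUMENTED_DEFAULTS.removePageHeaders,
  };
}

// Example: keep repeated headers but leave page numbering at its default.
const effective = resolveOptions({ removePageHeaders: false });
console.log(effective);
```

The resolved object could then be passed as the second argument to `convertPDFToHTML`. Because the function already applies these defaults itself, the helper is only useful for logging or for keeping configuration explicit.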

Returns

string | Object

HTML formatted text
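Because the declared return type is `string | Object`, a caller that needs an HTML string may want to narrow the result first. The sketch below assumes nothing about what the non-string case contains, since this page does not describe it. `expectHtmlString` is a hypothetical helper, and the import path in the commented usage is likewise an assumption.

```typescript
// Hypothetical helper: narrows the documented `string | Object` result to a
// string, and fails loudly otherwise instead of guessing at the object's shape.
function expectHtmlString(result: unknown): string {
  if (typeof result === "string") {
    return result;
  }
  throw new TypeError(
    "convertPDFToHTML returned a non-string value; inspect it before using it as HTML",
  );
}

// Usage sketch (commented out so this block runs on its own).
// Assumptions: the package name and import path, and the example URL.
// If the call returns a Promise in your version, await it before narrowing.
//
// import { convertPDFToHTML } from "ai-research-agent";
// const result = convertPDFToHTML("https://example.com/paper.pdf", {
//   addPageNumbers: true,
// });
// const html = expectHtmlString(result);

// Stand-in value so the helper can be exercised without the library.
const sample = expectHtmlString("<h1>Title</h1><p>Body text</p>");
console.log(sample.length);
```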

Author

ai-research-agent (2024), pdf-to-markdown (2017), pdf.js (2012-)