Function convertPDFToHTML

  • Extracts formatted text from a PDF, parsing line breaks, page headers, and footnotes, and inferring section headings from how far a line's text height deviates, in standard deviations, from the average text height.
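
The heading inference mentioned above can be illustrated with a minimal sketch. It is not the library's implementation: the `TextLine` shape, the `inferHeadings` name, and the threshold factor `k` are all hypothetical, chosen only to show the idea of flagging lines whose text height sits well above the average.

```typescript
// Hypothetical line record; the library's internal representation is not documented.
interface TextLine {
  text: string;
  height: number; // rendered text height of the line
}

// Flags lines whose height exceeds the mean by more than k standard deviations.
// The default k = 1 is an assumption made for illustration only.
function inferHeadings(lines: TextLine[], k: number = 1): TextLine[] {
  if (lines.length === 0) return [];

  const mean = lines.reduce((sum, l) => sum + l.height, 0) / lines.length;
  const variance =
    lines.reduce((sum, l) => sum + (l.height - mean) ** 2, 0) / lines.length;
  const std = Math.sqrt(variance);

  // Uniform text height: nothing stands out, so no headings are inferred.
  if (std === 0) return [];

  return lines.filter((l) => l.height > mean + k * std);
}
```

With four body lines of height 10 and one line of height 24, only the taller line is returned as a heading candidate.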

    Parameters

    • pdfURLOrBuffer: string

      URL of a PDF file, or a buffer returned by fs.readFile

    • Optional options: {
          addHeadingsTags: boolean;
          addPageNumbers: boolean;
          addSentenceLineBreaks: boolean;
          removePageHeaders: boolean;
          moveFootnotes: boolean;
          timeout: number;
      } = {}
      • addHeadingsTags: boolean

        default=true - Adds H1 tags to heading titles in document

      • addPageNumbers: boolean

        default=true - Adds a page number marker (#) to the end of each page

      • addSentenceLineBreaks: boolean

        default=true - Inserts line breaks at the end of sentence ranges

      • removePageHeaders: boolean

        default=true - Removes repeated headers found on each page

      • moveFootnotes: boolean

        default=false - Moves footnotes to end of document

      • timeout: number

        default=10 - HTTP request timeout

    Returns any

    HTML-formatted text, or {error} if parsing fails
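
A usage sketch follows, under stated assumptions. The import path of convertPDFToHTML is not given here, so the function is passed in as a parameter rather than imported. The `ConvertPDFToHTMLOptions` type, `documentedDefaults`, `resolveOptions`, and `convertOrThrow` are hypothetical helpers, not part of the library. `timeout` is typed as a number because its documented default is 10, and `await` is used so the sketch works whether the function returns a value or a promise.

```typescript
// Option names and defaults copied from the parameter list above.
// Typing `timeout` as a number is an assumption based on its default of 10.
interface ConvertPDFToHTMLOptions {
  addHeadingsTags?: boolean;
  addPageNumbers?: boolean;
  addSentenceLineBreaks?: boolean;
  removePageHeaders?: boolean;
  moveFootnotes?: boolean;
  timeout?: number;
}

// Assumed call signature; the real function is supplied by the caller.
type ConvertPDFToHTML = (
  pdfURLOrBuffer: string,
  options?: ConvertPDFToHTMLOptions
) => unknown;

const documentedDefaults: Required<ConvertPDFToHTMLOptions> = {
  addHeadingsTags: true,
  addPageNumbers: true,
  addSentenceLineBreaks: true,
  removePageHeaders: true,
  moveFootnotes: false,
  timeout: 10,
};

// Hypothetical helper: shows the effective options after the documented defaults apply.
function resolveOptions(
  options: ConvertPDFToHTMLOptions = {}
): Required<ConvertPDFToHTMLOptions> {
  return { ...documentedDefaults, ...options };
}

// Hypothetical helper: turns the documented "HTML or {error}" result into
// either a string or a thrown Error, so callers handle one failure path.
async function convertOrThrow(
  convert: ConvertPDFToHTML,
  source: string,
  options: ConvertPDFToHTMLOptions = {}
): Promise<string> {
  const result = await convert(source, resolveOptions(options));
  if (typeof result === "string") return result;
  if (typeof result === "object" && result !== null && "error" in result) {
    const detail = (result as { error?: unknown }).error;
    throw new Error(`PDF conversion failed: ${String(detail)}`);
  }
  throw new Error("Unexpected return value from convertPDFToHTML");
}
```

A call might then look like `await convertOrThrow(convertPDFToHTML, url, { moveFootnotes: true })`, where `url` points at a PDF file.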