Function extractContent

    1. Main Content Detection: Extract the main content from a URL by combining the Mozilla Readability and Postlight Mercury algorithms, with over 100 custom adapters for major sites that target each site's HTML classes for article body, author, and date.
    2. Basic HTML Standardization: Transform complex HTML into a simplified reading-mode format of basic HTML that keeps headings, images, and links, making it suitable for research note archival and focused reading.
    3. YouTube Transcript Processing: When a YouTube video URL is detected, retrieve the complete video transcript including both manual captions and auto-generated subtitles, maintaining proper timestamp synchronization and speaker identification where available.
    4. PDF Text Extraction and Structure: Process PDF documents by extracting formatted text while handling line breaks, page headers, and footnotes. The system analyzes text height statistics to infer heading levels automatically, building the document hierarchy from how far each line's text height deviates from the mean, measured in standard deviations.
    5. Citation Information Extraction: Identify and extract citation metadata including author names, publication dates, sources, and titles using HTML meta tags and common class name patterns. The system validates author names against a comprehensive database of 90,000 first and last names, distinguishing between personal and organizational authors to properly format citations.
    6. Author Name Formatting: Process author names by checking against known name databases, handling affixes and titles correctly, and determining whether to reverse the name order based on whether it's a personal or organizational author, ensuring proper citation formatting.
    7. Content Validation: Verify the extracted content's accuracy by comparing results from multiple extraction methods, ensuring all essential elements are preserved and properly formatted for the intended use case.
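
The heading inference described in step 4 can be illustrated with a small sketch. This is not the library's implementation: the `TextLine` shape, the `inferHeadingLevels` name, and the threshold values are assumptions chosen for illustration. It assigns a heading level to each line according to how many standard deviations its text height lies above the mean.

```typescript
// Hypothetical input shape: one entry per extracted line of PDF text.
interface TextLine {
  text: string;
  height: number; // text height as reported by the PDF parser
}

// Assumed thresholds: [minimum z-score, heading level]. Purely illustrative.
const LEVEL_THRESHOLDS: Array<[number, number]> = [
  [3, 1], // at least 3 standard deviations above the mean: h1
  [2, 2], // at least 2: h2
  [1, 3], // at least 1: h3
];

// Returns one level per line; 0 means body text.
function inferHeadingLevels(lines: TextLine[]): number[] {
  if (lines.length === 0) return [];
  const mean = lines.reduce((sum, l) => sum + l.height, 0) / lines.length;
  const variance =
    lines.reduce((sum, l) => {
      const d = l.height - mean;
      return sum + d * d;
    }, 0) / lines.length;
  const sd = Math.sqrt(variance);
  return lines.map((l) => {
    if (sd === 0) return 0; // uniform text size: no headings can be inferred
    const z = (l.height - mean) / sd;
    for (const [minZ, level] of LEVEL_THRESHOLDS) {
      if (z >= minZ) return level;
    }
    return 0;
  });
}
```

Because body text dominates most documents, the mean stays close to the body size, so even modestly larger lines stand out. A real implementation would likely also consider font weight and line length.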

    Parameters

    • urlOrDoc: string | Document

      URL or DOM object with article content

    • Optional options: {
          images: boolean;
          links: boolean;
          formatting: boolean;
          absoluteURLs: boolean;
          timeout: number;
      } = {}
      • images: boolean

        default=true - include images

      • links: boolean

        default=true - include links

      • formatting: boolean

        default=true - preserve formatting

      • absoluteURLs: boolean

        default=true - convert URLs to absolute

      • timeout: number

        HTTP request timeout
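
To make the documented defaults concrete, here is a minimal sketch of how caller options could be overlaid on them. The `resolveOptions` helper and the `ExtractOptions` name are hypothetical and not part of the library. No default is documented for `timeout`, so the sketch leaves it unset, and the millisecond unit in the usage line is an assumption.

```typescript
// Option shape copied from the signature above; the interface name is assumed.
interface ExtractOptions {
  images: boolean;
  links: boolean;
  formatting: boolean;
  absoluteURLs: boolean;
  timeout: number;
}

// Defaults as documented. `timeout` has no documented default, so it is omitted.
const DOCUMENTED_DEFAULTS = {
  images: true,
  links: true,
  formatting: true,
  absoluteURLs: true,
};

// Hypothetical helper: caller-supplied values win over the documented defaults.
function resolveOptions(
  options: Partial<ExtractOptions> = {}
): Partial<ExtractOptions> {
  return { ...DOCUMENTED_DEFAULTS, ...options };
}

// Drop images and set a timeout (unit assumed to be milliseconds).
const resolved = resolveOptions({ images: false, timeout: 5000 });
```

Passing only the fields you want to change keeps call sites short, for example `{ links: false }` for a text-only archive copy.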

    Returns {
        title: string;
        author_cite: string;
        author_short: string;
        author: string;
        date: string;
        source: string;
        html: string;
        word_count: number;
    }

    • url - The URL of the article
    • html - The HTML content of the article
    • author - The author of the article
    • author_cite - Author name in Last, First Middle format
    • author_short - Author name in Last format
    • author_type - Author type ["single", "two-author", "more-than-two", "organization"]
    • date - The publication date of the article
    • title - The title of the article
    • source - The source or origin of the article
    • word_count - The word count of the full text (without HTML tags)
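
The relationship between `author`, `author_cite`, and `author_short` can be sketched as follows. This is an illustrative simplification, not the library's code: `formatAuthor` and `AuthorKind` are hypothetical names, the caller is assumed to have already classified the author as a person or an organization, and affixes and titles (such as "Jr." or "van") are not handled.

```typescript
// Assumed classification; the library describes distinguishing personal
// from organizational authors, but this type is hypothetical.
type AuthorKind = "person" | "organization";

interface FormattedAuthor {
  author: string;
  author_cite: string; // "Last, First Middle" for people
  author_short: string; // "Last" for people
}

function formatAuthor(name: string, kind: AuthorKind): FormattedAuthor {
  // Normalize whitespace before splitting into name parts.
  const author = name.trim().replace(/\s+/g, " ");
  const parts = author.split(" ");
  // Organizations and single-word names are never reversed.
  if (kind === "organization" || parts.length < 2) {
    return { author, author_cite: author, author_short: author };
  }
  const last = parts[parts.length - 1] ?? author;
  const rest = parts.slice(0, -1).join(" ");
  return { author, author_cite: `${last}, ${rest}`, author_short: last };
}
```

Keeping organization names in their original order avoids citations such as "Organization, World Health", which is the main reason the personal versus organizational distinction matters.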