Function extractContent

extractContent(urlOrDoc, options?): {
    title: string;
    author_cite: string;
    author_short: string;
    author: string;
    date: string;
    source: string;
    html: string;
    word_count: number;
}
🚜📜 Tractor the Text Extractor
1. Main Content Detection: Extract the main content from a URL by combining the Mozilla Readability and Postlight Mercury algorithms, with over 100 custom adapters for major sites that specify the HTML classes for the article body, author, and date.
2. Basic HTML Standardization: Transform complex HTML into a simplified reading-mode format that keeps headings, images, and links, making it well suited to research note archival and focused reading.
3. YouTube Transcript Processing: When a YouTube video URL is detected, retrieve the complete video transcript including both manual captions and auto-generated subtitles, maintaining proper timestamp synchronization and speaker identification where available.
4. PDF Text Extraction and Structure: Process PDF documents by extracting formatted text while handling line breaks, page headers, and footnotes. Text height statistics are used to infer heading levels: a line's level is based on how many standard deviations its text size lies from the mean, which yields a structured document hierarchy.
5. Citation Information Extraction: Identify and extract citation metadata including author names, publication dates, sources, and titles using HTML meta tags and common class name patterns. The system validates author names against a comprehensive database of 90,000 first and last names, distinguishing between personal and organizational authors to properly format citations.
6. Author Name Formatting: Process author names by checking against known name databases, handling affixes and titles correctly, and determining whether to reverse the name order based on whether it's a personal or organizational author, ensuring proper citation formatting.
7. Content Validation: Verify the extracted content's accuracy by comparing results from multiple extraction methods, ensuring all essential elements are preserved and properly formatted for the intended use case.
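
The heading inference described in step 4 can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the function name `inferHeadingLevels`, the `{ text, height }` input shape, and the cut-offs of 1, 2, and 3 standard deviations are all assumptions made for the example.

```javascript
// Hypothetical sketch: function name, input shape, and thresholds are
// assumptions for illustration, not taken from the library source.
function inferHeadingLevels(lines) {
  if (lines.length === 0) return [];

  // Mean and (population) standard deviation of the text heights.
  const heights = lines.map((line) => line.height);
  const mean = heights.reduce((sum, h) => sum + h, 0) / heights.length;
  const variance =
    heights.reduce((sum, h) => sum + (h - mean) ** 2, 0) / heights.length;
  const std = Math.sqrt(variance);

  return lines.map((line) => {
    // z-score: how many standard deviations this line sits above the mean.
    const z = std === 0 ? 0 : (line.height - mean) / std;
    let level = 0; // 0 means body text
    if (z >= 3) level = 1; // far larger than average: top-level heading
    else if (z >= 2) level = 2;
    else if (z >= 1) level = 3;
    return { text: line.text, level };
  });
}
```

A document in which every line has the same height has a standard deviation of zero, so the sketch treats all of it as body text rather than dividing by zero.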
Parameters
- urlOrDoc: string | Document
  URL or DOM Document object containing the article content
- Optional options: {
      images: boolean;
      links: boolean;
      formatting: boolean;
      absoluteURLs: boolean;
      timeout: number;
  } = {}
  - images: boolean
    default=true - include images
  - links: boolean
    default=true - include links
  - formatting: boolean
    default=true - preserve formatting
  - absoluteURLs: boolean
    default=true - convert URLs to absolute
  - timeout: number
    HTTP request timeout
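
As a sketch of how the documented defaults combine with caller-supplied options, the helper below merges the two. `resolveOptions` and `DEFAULT_OPTIONS` are hypothetical names introduced for this example, not part of the library. No default or unit is documented for `timeout`, so it is left undefined unless the caller sets it.

```javascript
// Defaults copied from the parameter list above. `timeout` has no
// documented default, so it stays undefined here.
const DEFAULT_OPTIONS = {
  images: true,
  links: true,
  formatting: true,
  absoluteURLs: true,
  timeout: undefined,
};

// Hypothetical helper (not a library function): caller-supplied values
// override the defaults, and anything omitted keeps its default.
function resolveOptions(options = {}) {
  return { ...DEFAULT_OPTIONS, ...options };
}

// Drop images and set a timeout; links, formatting, and absoluteURLs
// keep their default of true.
const resolved = resolveOptions({ images: false, timeout: 5000 });

// The resolved object has the shape passed as the second argument, e.g.:
//   extractContent("https://example.com/article", resolved)
// (example URL only; the import path and the unit of `timeout` are not
// stated on this page.)
```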
Returns {
    title: string;
    author_cite: string;
    author_short: string;
    author: string;
    date: string;
    source: string;
    html: string;
    word_count: number;
}
- url - The URL of the article
- html - The HTML content of the article
- author - The author of the article
- author_cite - Author name in "Last, First Middle" format
- author_short - Author's last name only
- author_type - Author type ["single", "two-author", "more-than-two", "organization"]
- date - The publication date of the article
- title - The title of the article
- source - The source or origin of the article
- word_count - The word count of the full text (without HTML tags)
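
The relationship between `author`, `author_cite`, and `author_short` can be illustrated with a simplified sketch. The function `formatAuthor` and its `isOrganization` flag are hypothetical; the library is described as deciding this by checking name databases and handling affixes and titles, which this example does not attempt. It also handles only a single author.

```javascript
// Hypothetical sketch: a caller-supplied flag stands in for the library's
// name-database lookup; affixes, titles, and multiple authors are ignored.
function formatAuthor(author, isOrganization = false) {
  // Normalize whitespace so "Jane   Q.  Public" splits cleanly.
  const name = author.trim().replace(/\s+/g, " ");

  // Organizations and single-word names are not reversed.
  if (isOrganization || !name.includes(" ")) {
    return { author: name, author_cite: name, author_short: name };
  }

  const parts = name.split(" ");
  const last = parts[parts.length - 1];
  const firstAndMiddle = parts.slice(0, -1).join(" ");

  return {
    author: name,
    author_cite: `${last}, ${firstAndMiddle}`, // "Last, First Middle"
    author_short: last, // "Last"
  };
}
```

Under these assumptions, a personal name such as "Jane Q. Public" is reversed for citation, while an organizational author passed with `isOrganization` set to true is returned unchanged in all three fields.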
Author
ai-research-agent (2024)
- Defined in extractor/url-to-content/url-to-content.js:84
