url-to-content

Extract

extractContent()

function extractContent(urlOrDoc: string | Document, options?: object): object;

Defined in: extractor/url-to-content/url-to-content.js:87

🚜📜 Tractor the Text Extractor

  1. Main Content Detection: Extracts the main content from a URL by combining the Mozilla Readability and Postlight Mercury algorithms, with over 100 custom adapters for major sites that target the HTML classes holding the article body, author, and date.
  2. Basic HTML Standardization: Transforms complex HTML into a simplified reading-mode format of basic HTML that keeps headings, images, and links, suited to archiving research notes and focused reading.
  3. YouTube Transcript Processing: When a YouTube video URL is detected, retrieves the complete video transcript, including both manual captions and auto-generated subtitles, keeping timestamps synchronized and speakers identified where available.
  4. PDF to HTML: Extracts formatted text from PDF documents while handling line breaks, page headers, and footnotes. Heading levels are inferred from text-height statistics: each line's level depends on how far its text height deviates from the mean text size, measured in standard deviations, which yields a structured document hierarchy.
  5. Citation Information Extraction: Identifies and extracts citation metadata (author names, publication dates, sources, and titles) from HTML meta tags and common class-name patterns. Author names are validated against a database of 90,000 first and last names to distinguish personal from organizational authors so that citations are formatted correctly.
  6. Author Name Formatting: Checks author names against known name databases, handles affixes and titles, and decides whether to reverse the name order depending on whether the author is a person or an organization.
  7. Content Validation: Compares the results of the multiple extraction methods to check that all essential elements are preserved and properly formatted for the intended use case.
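
The heading inference described in item 4 can be illustrated with a small sketch. It assumes each extracted PDF line arrives as a `{ text, height }` object; the function name `inferHeadingLevels` and the thresholds of 1, 2, and 3 standard deviations are illustrative assumptions, not the values used by the library.

```javascript
// Hypothetical sketch: assign heading tags to PDF text lines based on how far
// each line's text height lies above the mean height, in standard deviations.
// The input shape and the thresholds are assumptions made for illustration.
function inferHeadingLevels(lines) {
  if (lines.length === 0) return [];
  const heights = lines.map((line) => line.height);
  const mean = heights.reduce((sum, h) => sum + h, 0) / heights.length;
  const variance =
    heights.reduce((sum, h) => sum + (h - mean) ** 2, 0) / heights.length;
  const stdDev = Math.sqrt(variance);

  return lines.map((line) => {
    // z-score: standard deviations above the mean (0 when all heights match)
    const z = stdDev === 0 ? 0 : (line.height - mean) / stdDev;
    let tag = "p";
    if (z >= 3) tag = "h1";
    else if (z >= 2) tag = "h2";
    else if (z >= 1) tag = "h3";
    return { ...line, tag };
  });
}
```

Because body text dominates most documents, the mean stays close to the body text height, so only noticeably larger lines cross the thresholds and become headings.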

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| urlOrDoc | string \| Document | URL or DOM object with article content |
| options? | { absoluteURLs: boolean; formatting: boolean; images: boolean; links: boolean; timeout: number; } | |
| options.absoluteURLs? | boolean | default=true - convert URLs to absolute |
| options.formatting? | boolean | default=true - preserve formatting |
| options.images? | boolean | default=true - include images |
| options.links? | boolean | default=true - include links |
| options.timeout? | number | HTTP request timeout |
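
As a rough illustration of how these options might be passed, the sketch below merges overrides over the documented defaults and calls the extractor. `withDefaults` and `extractTextOnly` are hypothetical helper names. `extractContent` is injected as an argument because this page does not show its import path, and `await` is used so the wrapper works whether the function returns a plain object or a Promise.

```javascript
// Documented defaults for the boolean options; timeout has no documented default.
const DOCUMENTED_DEFAULTS = {
  absoluteURLs: true,
  formatting: true,
  images: true,
  links: true,
};

// Hypothetical helper: merge caller overrides over the documented defaults.
function withDefaults(overrides = {}) {
  return { ...DOCUMENTED_DEFAULTS, ...overrides };
}

// Hypothetical wrapper: extract an article without images or links.
// `extractContent` is passed in because its import path is not shown here.
async function extractTextOnly(extractContent, url, timeout) {
  const options = withDefaults({ images: false, links: false });
  if (timeout !== undefined) options.timeout = timeout;
  return await extractContent(url, options);
}
```

A caller would invoke it as `await extractTextOnly(extractContent, url, timeout)`. The unit of `timeout` is not stated on this page, so the value is left to the caller.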

Returns

object

  • cite - Cite in APA Format with Author name in Last, First Initial format
  • url - The URL of the article
  • html - The HTML content of the article
  • author - The author of the article
  • author_cite - Author name in Last, First Middle format
  • author_short - Author name in Last format
  • author_type - Author type ["single", "two-author", "more-than-two", "organization"]
  • date - The publication date of the article
  • title - The title of the article
  • source - The source or origin of the article
  • word_count - The word count of the full text (without HTML tags)
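
To make the author name fields concrete, here is a minimal sketch for a single personal author. It reads "Last, First Initial" as the surname followed by the initials of the given names, which is an interpretation. It also assumes a space-separated name whose last token is the surname, and it ignores the affixes, titles, and organizational authors that the library handles with its name databases. `formatAuthor` is a hypothetical name.

```javascript
// Hypothetical sketch: derive citation-style name forms from a full personal name.
// Assumes "Given [Middle ...] Surname" order with a single-token surname.
function formatAuthor(fullName) {
  const parts = fullName.trim().split(/\s+/).filter(Boolean);
  if (parts.length === 0) return { author_cite: "", author_short: "" };
  const last = parts[parts.length - 1];
  if (parts.length === 1) return { author_cite: last, author_short: last };
  // Initials of every given name, e.g. ["Jane", "Quincy"] -> "J. Q."
  const initials = parts
    .slice(0, -1)
    .map((name) => name[0].toUpperCase() + ".")
    .join(" ");
  return { author_cite: `${last}, ${initials}`, author_short: last };
}
```

With this sketch, `formatAuthor("Jane Quincy Doe")` yields `author_cite` "Doe, J. Q." and `author_short` "Doe".
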

| Name | Type | Defined in |
| --- | --- | --- |
| author | string | extractor/url-to-content/url-to-content.js:67 |
| author_cite | string | extractor/url-to-content/url-to-content.js:65 |
| cite | string | extractor/url-to-content/url-to-content.js:66 |
| date | string | extractor/url-to-content/url-to-content.js:68 |
| html | string | extractor/url-to-content/url-to-content.js:70 |
| source | string | extractor/url-to-content/url-to-content.js:69 |
| title | string | extractor/url-to-content/url-to-content.js:64 |
| word_count | number | extractor/url-to-content/url-to-content.js:71 |

Author

ai-research-agent (2024)

Other

Article

Defined in: extractor/url-to-content/url-to-content.js:9

Properties

| Property | Type | Description | Defined in |
| --- | --- | --- | --- |
| author | string | The full name of the author of the article | extractor/url-to-content/url-to-content.js:13 |
| author_cite | string | Author name in Last, First Initial format | extractor/url-to-content/url-to-content.js:14 |
| author_short | string | Author name in Last format | extractor/url-to-content/url-to-content.js:15 |
| author_type | number | Author type ["single", "two-author", "more-than-two", "organization"] | extractor/url-to-content/url-to-content.js:16 |
| cite | string | Cite in APA Format with Author name in Last, First Initial format | extractor/url-to-content/url-to-content.js:10 |
| date | string | The publication date of the article | extractor/url-to-content/url-to-content.js:17 |
| html | string | The Basic HTML content of the article | extractor/url-to-content/url-to-content.js:11 |
| source | string | The source or publisher of the article | extractor/url-to-content/url-to-content.js:19 |
| title | string | The title of the article | extractor/url-to-content/url-to-content.js:18 |
| url | string | The URL of the article | extractor/url-to-content/url-to-content.js:12 |
| word_count | number | The word count of the full text (without HTML tags) | extractor/url-to-content/url-to-content.js:20 |
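
The `word_count` property counts words after HTML tags are removed. A minimal sketch of one way to compute such a count is shown below. `countWords` is a hypothetical name, and the regex-based tag stripping is an assumption chosen for brevity rather than the library's actual method.

```javascript
// Hypothetical sketch: count words in an HTML string after removing tags.
// A regex tag-stripper is adequate for simplified reading-mode HTML, but it is
// not a general HTML parser (e.g. a ">" inside an attribute value breaks it).
function countWords(html) {
  const text = html
    .replace(/<[^>]*>/g, " ") // replace tags with spaces so words don't fuse
    .replace(/&nbsp;/g, " ") // treat non-breaking space entities as whitespace
    .trim();
  return text === "" ? 0 : text.split(/\s+/).length;
}
```

Replacing each tag with a space rather than an empty string keeps adjacent block elements such as `</h1><p>` from merging the last word of one element with the first word of the next.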