url-to-content


Extract

extractContent()

function extractContent(urlOrDoc: string | Document, options?: object): object;

Defined in: packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:88

🚜📜 Tractor the Text Extractor

Main Content Detection: Extract the main content from a URL by combining the Mozilla Readability and Postlight Mercury algorithms, using over 100 custom adapters that target the article, author, and date HTML classes of major sites.
Basic HTML Standardization: Transform complex HTML into a simplified reading-mode format of basic HTML, making it ideal for research note archival and focused reading, with headings, images and links.
YouTube Transcript Processing: When a YouTube video URL is detected, retrieve the complete video transcript including both manual captions and auto-generated subtitles, maintaining proper timestamp synchronization and speaker identification where available.
PDF to HTML: Process PDF documents by extracting formatted text while handling line breaks, page headers, and footnotes. The system analyzes text-height statistics to infer heading levels automatically, building the document hierarchy from how far each text height deviates from the mean, measured in standard deviations.
Citation Information Extraction: Identify and extract citation metadata including author names, publication dates, sources, and titles using HTML meta tags and common class name patterns. The system validates author names against a comprehensive database of 90,000 first and last names, distinguishing between personal and organizational authors to properly format citations.
Author Name Formatting: Process author names by checking against known name databases, handling affixes and titles correctly, and determining whether to reverse the name order based on whether it's a personal or organizational author, ensuring proper citation formatting.
Content Validation: Verify the extracted content's accuracy by comparing results from multiple extraction methods, ensuring all essential elements are preserved and properly formatted for the intended use case.
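
The heading inference described under "PDF to HTML" can be illustrated with a small sketch. This is not the library's implementation: the function name `inferHeadingLevels`, the `{ text, height }` line shape, and the z-score thresholds are all assumptions chosen for illustration. It only shows the general idea of ranking text heights by their distance from the mean in standard deviations.

```javascript
// Hypothetical helper (not part of the library): assign a heading level to each
// text line based on how many standard deviations its height is above the mean.
// thresholds[0] is the minimum z-score for level 1 (h1), thresholds[1] for level 2, etc.
function inferHeadingLevels(lines, thresholds = [3, 2, 1]) {
  const heights = lines.map((line) => line.height);
  const mean = heights.reduce((sum, h) => sum + h, 0) / heights.length;
  const variance =
    heights.reduce((sum, h) => sum + (h - mean) ** 2, 0) / heights.length;
  const std = Math.sqrt(variance);

  return lines.map((line) => {
    // With no variation in height, nothing stands out as a heading.
    const z = std === 0 ? 0 : (line.height - mean) / std;
    const index = thresholds.findIndex((minZ) => z >= minZ);
    // level 0 means body text; 1..n map to h1..hn.
    return { ...line, level: index === -1 ? 0 : index + 1 };
  });
}
```

The thresholds are assumed values; real documents would likely need tuning, for example when a page contains very few body lines.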

Parameters

Parameter	Type	Description
`urlOrDoc`	`string` \| `Document`	URL or DOM `Document` object with article content
`options?`	{ `images`: `boolean`; `links`: `boolean`; `formatting`: `boolean`; `absoluteURLs`: `boolean`; `timeout`: `number`; }	Optional extraction settings; each field is described in the rows below
`options.images?`	`boolean`	default=true - include images
`options.links?`	`boolean`	default=true - include links
`options.formatting?`	`boolean`	default=true - preserve formatting
`options.absoluteURLs?`	`boolean`	default=true - convert URLs to absolute
`options.timeout?`	`number`	default=5 - HTTP request timeout
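
The defaults in the table above can be written out as an options object. The sketch below is a minimal illustration: `DEFAULT_OPTIONS` and `resolveOptions` are hypothetical names, not part of the library, and the table does not state the unit of `timeout`, so the value is passed through unchanged.

```javascript
// Hypothetical mirror of the documented defaults (names are assumptions).
const DEFAULT_OPTIONS = Object.freeze({
  images: true, // include images
  links: true, // include links
  formatting: true, // preserve formatting
  absoluteURLs: true, // convert URLs to absolute
  timeout: 5, // HTTP request timeout (unit not stated in the table)
});

// Merge caller overrides over the documented defaults.
function resolveOptions(options = {}) {
  return { ...DEFAULT_OPTIONS, ...options };
}

// Possible usage, assuming extractContent is in scope. Using `await` is safe
// whether or not the function returns a promise:
// const article = await extractContent(
//   "https://example.com/post",
//   resolveOptions({ images: false })
// );
```

Since every option is documented as optional, passing only the overrides directly to `extractContent` should have the same effect; the helper just makes the effective settings visible.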

Returns

object

cite - Citation in APA format, with the author name in Last, First Initial format
url - The URL of the article
html - The HTML content of the article
author - The author of the article
author_cite - Author name in Last, First Middle format
author_short - Author name in Last format
author_type - Author type ["single", "two-author", "more-than-two", "organization"]
date - The publication date of the article
title - The title of the article
source - The source or origin of the article
word_count - The word count of the full text (without HTML tags)

Name	Type	Defined in
`title`	`string`	packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:65
`author_cite`	`string`	packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:66
`cite`	`string`	packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:67
`author`	`string`	packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:68
`date`	`string`	packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:69
`source`	`string`	packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:70
`html`	`string`	packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:71
`word_count`	`number`	packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:72
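
To make the `word_count` field concrete, here is a rough sketch of counting words in the returned `html` with tags removed. `countWords` is a hypothetical helper, and the regex-based tag stripping is an assumption made for brevity; the library may tokenize differently.

```javascript
// Hypothetical helper (not part of the library): count words in an HTML string
// after removing tags, approximating "word count of the full text".
function countWords(html) {
  const text = html
    .replace(/<[^>]*>/g, " ") // replace each tag with a space so words stay separated
    .replace(/&nbsp;/g, " "); // treat non-breaking space entities as whitespace
  return text.trim().split(/\s+/).filter(Boolean).length;
}
```

A regex is adequate for the simplified HTML described above, but it is not a general HTML parser; for arbitrary markup, reading `textContent` from a parsed DOM would be more robust.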

Author

ai-research-agent (2024)

Other

Article

Defined in: packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:9

Properties

Property	Type	Description	Defined in
`cite`	`string`	Cite in APA Format with Author name in Last, First Initial format	packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:10
`html`	`string`	The Basic HTML content of the article	packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:11
`url`	`string`	The URL of the article	packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:12
`author`	`string`	The full name of the author of the article	packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:13
`author_cite`	`string`	Author name in Last, First Initial format	packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:14
`author_short`	`string`	Author name in Last format	packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:15
`author_type`	`number`	Author type ["single", "two-author", "more-than-two", "organization"]	packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:16
`date`	`string`	The publication date of the article	packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:17
`title`	`string`	The title of the article	packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:18
`source`	`string`	The source or publisher of the article	packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:19
`word_count`	`number`	The word count of the full text (without HTML tags)	packages/ai-research-agent/src/extractor/url-to-content/url-to-content.js:20
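
The relationship between `author`, `author_cite`, and `author_short` can be sketched as follows. This is a simplified illustration under stated assumptions: `formatAuthor` is a hypothetical name, the caller supplies the personal-versus-organization decision (the library is described as checking a name database instead), and affixes and titles such as "Jr." or "Dr." are not handled.

```javascript
// Hypothetical helper (not part of the library): derive citation forms of a name.
// isOrganization is supplied by the caller in this sketch.
function formatAuthor(fullName, isOrganization = false) {
  const name = fullName.trim().replace(/\s+/g, " ");

  // Organizations and single-word names are not reversed.
  if (isOrganization || !name.includes(" ")) {
    return { author: name, author_cite: name, author_short: name };
  }

  const parts = name.split(" ");
  const last = parts[parts.length - 1];
  // Reduce each given name to an initial, e.g. "Jane Quinn" -> "J. Q."
  const initials = parts
    .slice(0, -1)
    .map((part) => part[0].toUpperCase() + ".")
    .join(" ");

  return { author: name, author_cite: `${last}, ${initials}`, author_short: last };
}
```

For example, `formatAuthor("Jane Quinn Doe")` yields an `author_cite` of `Doe, J. Q.` and an `author_short` of `Doe`, while `formatAuthor("World Health Organization", true)` leaves the name unchanged in all three fields.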
