url-to-content
Extract​
extractContent()​
function extractContent(urlOrDoc: string | Document, options?: object): object;
Defined in: extractor/url-to-content/url-to-content.js:87
🚜📜 Tractor the Text Extractor​

- Main Content Detection: Extract the main content from a URL by combining the Mozilla Readability and Postlight Mercury algorithms, with over 100 custom adapters for major sites that target the HTML classes used for the article body, author, and date.
- Basic HTML Standardization: Transform complex HTML into a simplified reading-mode format of basic HTML that keeps headings, images, and links, which makes it well suited to research note archival and focused reading.
- YouTube Transcript Processing: When a YouTube video URL is detected, retrieve the complete video transcript including both manual captions and auto-generated subtitles, maintaining proper timestamp synchronization and speaker identification where available.
- PDF to HTML: Process PDF documents by extracting formatted text while intelligently handling line breaks, page headers, footnotes. The system analyzes text height statistics to automatically infer heading levels, creating a properly structured document hierarchy based on standard deviation from mean text size.
- Citation Information Extraction: Identify and extract citation metadata including author names, publication dates, sources, and titles using HTML meta tags and common class name patterns. The system validates author names against a comprehensive database of 90,000 first and last names, distinguishing between personal and organizational authors to properly format citations.
- Author Name Formatting: Process author names by checking against known name databases, handling affixes and titles correctly, and determining whether to reverse the name order based on whether it's a personal or organizational author, ensuring proper citation formatting.
- Content Validation: Verify the extracted content's accuracy by comparing results from multiple extraction methods, ensuring all essential elements are preserved and properly formatted for the intended use case.
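The heading inference described under "PDF to HTML" can be illustrated with a minimal sketch. It is not the library's implementation: the function name `inferHeadingLevels`, the `{ text, height }` input shape, and the z-score thresholds are all assumptions chosen for illustration. The idea is to measure how many standard deviations a line's text height sits above the mean and map larger deviations to higher-ranked headings.

```javascript
// Hypothetical helper, not part of the documented API.
// lines: array of { text, height } objects (assumed shape of PDF text lines).
// thresholds: z-score cutoffs for h1, h2, h3 (assumed values).
function inferHeadingLevels(lines, thresholds = [3, 2, 1]) {
  if (lines.length === 0) return [];

  const heights = lines.map((line) => line.height);
  const mean = heights.reduce((sum, h) => sum + h, 0) / heights.length;
  const variance =
    heights.reduce((sum, h) => sum + (h - mean) ** 2, 0) / heights.length;
  const stdDev = Math.sqrt(variance);

  return lines.map((line) => {
    // Number of standard deviations this line is above the mean height.
    const z = stdDev === 0 ? 0 : (line.height - mean) / stdDev;
    // The first cutoff the line reaches decides its level; 0 means body text.
    const index = thresholds.findIndex((cutoff) => z >= cutoff);
    return { ...line, level: index === -1 ? 0 : index + 1 };
  });
}
```

Because the cutoffs are relative to each document's own height distribution, the same function can handle PDFs set in different base font sizes, though the thresholds would likely need tuning on real files.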
Parameters​
Parameter | Type | Description |
---|---|---|
urlOrDoc | string \| Document | URL or DOM object with the article content |
options? | object | Optional settings, described in the rows below |
| | default=true - convert URLs to absolute |
| | default=true - preserve formatting |
| | default=true - include images |
| | default=true - include links |
| | HTTP request timeout |
Returns​
object
- cite - Cite in APA Format with Author name in Last, First Initial format
- url - The URL of the article
- html - The HTML content of the article
- author - The author of the article
- author_cite - Author name in Last, First Middle format
- author_short - Author name in Last format
- author_type - Author type ["single", "two-author", "more-than-two", "organization"]
- date - The publication date of the article
- title - The title of the article
- source - The source or origin of the article
- word_count - The word count of the full text (without HTML tags)
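As a rough illustration of how the returned fields might be consumed, the sketch below builds a one-line summary from an object shaped like the list above. The helper name `summarizeArticle`, the sample values, and the choice to show the full `author` for organizations and `author_short` otherwise are assumptions for illustration; only the field names come from this page.

```javascript
// Hypothetical helper: builds a short summary line from the documented fields.
function summarizeArticle(article) {
  // Assumption: organizations read better with their full name,
  // personal authors with the short (last name) form.
  const byline =
    article.author_type === "organization"
      ? article.author
      : article.author_short;
  // Skip any field that is missing or empty.
  const parts = [article.title, byline, article.date].filter(Boolean);
  return `${parts.join(" | ")} (${article.word_count} words)`;
}

// Made-up sample values, shaped like the return object described above.
const sample = {
  title: "Example Title",
  author: "Jane Doe",
  author_short: "Doe",
  author_type: "single",
  date: "2024-01-15",
  word_count: 1200,
};

console.log(summarizeArticle(sample));
```

In practice the argument would be the object returned by `extractContent()` rather than a hand-written sample.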
Name | Type | Defined in |
---|---|---|
| | extractor/url-to-content/url-to-content.js:67 |
| | extractor/url-to-content/url-to-content.js:65 |
| | extractor/url-to-content/url-to-content.js:66 |
| | extractor/url-to-content/url-to-content.js:68 |
| | extractor/url-to-content/url-to-content.js:70 |
| | extractor/url-to-content/url-to-content.js:69 |
| | extractor/url-to-content/url-to-content.js:64 |
| | extractor/url-to-content/url-to-content.js:71 |
Other​
Article​
Defined in: extractor/url-to-content/url-to-content.js:9
Properties​
Property | Type | Description | Defined in |
---|---|---|---|
author | | The full name of the author of the article | extractor/url-to-content/url-to-content.js:13 |
author_cite | | Author name in Last, First Initial format | extractor/url-to-content/url-to-content.js:14 |
author_short | | Author name in Last format | extractor/url-to-content/url-to-content.js:15 |
author_type | | Author type ["single", "two-author", "more-than-two", "organization"] | extractor/url-to-content/url-to-content.js:16 |
cite | | Cite in APA Format with Author name in Last, First Initial format | extractor/url-to-content/url-to-content.js:10 |
date | | The publication date of the article | extractor/url-to-content/url-to-content.js:17 |
html | | The Basic HTML content of the article | extractor/url-to-content/url-to-content.js:11 |
source | | The source or publisher of the article | extractor/url-to-content/url-to-content.js:19 |
title | | The title of the article | extractor/url-to-content/url-to-content.js:18 |
url | | The URL of the article | extractor/url-to-content/url-to-content.js:12 |
word_count | | The word count of the full text (without HTML tags) | extractor/url-to-content/url-to-content.js:20 |
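To make the `word_count` description concrete, here is one simple way a word count "without HTML tags" could be computed. This is a sketch under stated assumptions, not the library's method: `countWords` is a hypothetical name, and the regular expression is a simplification that would miscount markup containing a literal `>` inside an attribute value, where a real HTML parser would be more robust.

```javascript
// Hypothetical helper: counts words in an HTML string, ignoring tags.
function countWords(html) {
  const text = html
    .replace(/<[^>]*>/g, " ") // replace each tag with a space so adjacent words stay separate
    .trim();
  // An empty string would otherwise split into [""] and count as one word.
  return text === "" ? 0 : text.split(/\s+/).length;
}
```

Replacing tags with a space rather than an empty string matters for markup like `<h1>Title</h1><p>Body</p>`, where removing the tags outright would fuse "Title" and "Body" into one word.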