Extract structured content and citation data from any URL

🚜📜 Tractor the Text Extractor
- Main Content Detection: Extract the main content from a URL by combining the Mozilla Readability and Postlight Mercury algorithms, with over 100 custom adapters that map the article, author, and date HTML classes of major sites.
- Basic HTML Standardization: Transform complex HTML into a simplified reading-mode format of basic HTML that keeps headings, images, and links, making it well suited to research note archival and focused reading.
- YouTube Transcript Processing: When a YouTube video URL is detected, retrieve the complete video transcript including both manual captions and auto-generated subtitles, maintaining proper timestamp synchronization and speaker identification where available.
- PDF to HTML: Process PDF documents by extracting formatted text while handling line breaks, page headers, and footnotes. The system analyzes text height statistics to infer heading levels automatically, building a structured document hierarchy from how far each line's text size lies, in standard deviations, from the mean text size.
- Citation Information Extraction: Identify and extract citation metadata including author names, publication dates, sources, and titles using HTML meta tags and common class name patterns. The system validates author names against a comprehensive database of 90,000 first and last names, distinguishing between personal and organizational authors to properly format citations.
- Author Name Formatting: Process author names by checking against known name databases, handling affixes and titles correctly, and determining whether to reverse the name order based on whether it's a personal or organizational author, ensuring proper citation formatting.
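
The PDF heading inference described in the list above can be illustrated with a minimal sketch. It assumes each extracted line arrives as a (text, height) pair and that a line's z-score against the mean height decides its heading level. The function name `infer_heading_levels` and the cutoff values are hypothetical, chosen only for illustration; the service's actual thresholds are not documented here.

```python
from statistics import mean, pstdev


def infer_heading_levels(lines, thresholds=(3.0, 1.5, 0.75)):
    """Tag each (text, height) pair as 'h1', 'h2', 'h3', or 'p'.

    `thresholds` are hypothetical z-score cutoffs for h1, h2, h3;
    they are an assumption, not values taken from the service.
    """
    heights = [height for _, height in lines]
    mu = mean(heights)
    sigma = pstdev(heights)  # population standard deviation of text heights

    tagged = []
    for text, height in lines:
        tag = "p"
        if sigma > 0:  # all lines the same size: nothing stands out
            z = (height - mu) / sigma
            for level, cutoff in enumerate(thresholds, start=1):
                if z >= cutoff:
                    tag = f"h{level}"
                    break
        tagged.append((tag, text))
    return tagged


# Made-up sample: one large title, two mid-sized section lines, 30 body lines.
body = [(f"Body line {i}", 10.0) for i in range(15)]
sample = [("Title", 24.0), ("Section A", 16.0)] + body + [("Section B", 16.0)] + body
tagged = infer_heading_levels(sample)
```

Because the cutoffs are relative to the document's own height distribution, the same function works for PDFs set in different base font sizes; the trade-off is that a document with very few body lines gives unreliable statistics.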
Query Parameters
- url (string, required, format: uri): URL to extract content from (supports articles, PDFs, YouTube).
- images (boolean, optional, default: true): Include images in output.
- links (boolean, optional, default: true): Include hyperlinks in output.
- formatting (boolean, optional, default: true): Preserve text formatting.
- absoluteURLs (boolean, optional, default: true): Convert relative URLs to absolute.
- timeout (integer, optional, default: 5, range: 1 <= value <= 30): HTTP request timeout in seconds.

Response Body
Example request:
curl -X GET "https://qwksearch.com/api/extract?url=http%3A%2F%2Fexample.com"

Success response (application/json):
{
  "title": "string",
  "html": "string",
  "cite": "string",
  "author_cite": "string",
  "author_short": "string",
  "author_type": "single",
  "author": "string",
  "date": "2019-08-24",
  "source": "string",
  "word_count": 0,
  "url": "http://example.com"
}

Error response (application/json):
{
  "error": "string"
}
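
As a rough illustration of calling the endpoint from code, the sketch below builds the request URL from the documented query parameters and separates the success payload from the error payload. The helper names (`build_extract_url`, `parse_extract_response`, `extract`) are hypothetical. Sending booleans as lowercase `false`, omitting parameters left at their defaults, and the client-side timeout margin are assumptions, as is the idea that error bodies may arrive with a non-2xx status.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

API_ENDPOINT = "https://qwksearch.com/api/extract"


def build_extract_url(url, images=True, links=True, formatting=True,
                      absolute_urls=True, timeout=5):
    """Build the GET URL, sending only options that differ from the defaults."""
    if not 1 <= timeout <= 30:
        raise ValueError("timeout must be between 1 and 30 seconds")
    params = {"url": url}
    options = {"images": images, "links": links,
               "formatting": formatting, "absoluteURLs": absolute_urls}
    for name, value in options.items():
        if value is not True:
            params[name] = "false"  # assumption: booleans are sent lowercase
    if timeout != 5:
        params["timeout"] = str(timeout)
    return f"{API_ENDPOINT}?{urlencode(params)}"


def parse_extract_response(body):
    """Return the parsed payload, raising if it is the {"error": ...} shape."""
    data = json.loads(body)
    if "error" in data:
        raise RuntimeError(data["error"])
    return data


def extract(url, **options):
    """Perform the request over the network (not exercised in this sketch)."""
    # Assumption: allow the client a few seconds more than the server timeout.
    client_timeout = options.get("timeout", 5) + 5
    try:
        with urlopen(build_extract_url(url, **options), timeout=client_timeout) as resp:
            body = resp.read().decode("utf-8")
    except HTTPError as err:
        # Assumption: error responses may carry a JSON body with a non-2xx status.
        body = err.read().decode("utf-8")
    return parse_extract_response(body)
```

A call such as `extract("http://example.com", images=False)` would then return a dictionary with the fields shown in the success response, for example `result["cite"]` and `result["word_count"]`.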