Extract structured content and citation metadata from any URL

Extractor

🚜📜 Tractor the Text Extractor

  1. Main Content Detection: Extract the main content from a URL by combining the Mozilla Readability and Postlight Mercury algorithms, with over 100 custom adapters for major sites that target each site's article, author, and date HTML classes.
  2. Basic HTML Standardization: Transform complex HTML into a simplified reading-mode format of basic HTML that keeps headings, images, and links, which makes it well suited to research note archival and focused reading.
  3. YouTube Transcript Processing: When a YouTube video URL is detected, retrieve the complete video transcript including both manual captions and auto-generated subtitles, maintaining proper timestamp synchronization and speaker identification where available.
  4. PDF to HTML: Process PDF documents by extracting formatted text while handling line breaks, page headers, and footnotes. The system analyzes text height statistics to infer heading levels automatically, building the document hierarchy from each line's deviation from the mean text size, measured in standard deviations.
  5. Citation Information Extraction: Identify and extract citation metadata including author names, publication dates, sources, and titles using HTML meta tags and common class name patterns. The system validates author names against a comprehensive database of 90,000 first and last names, distinguishing between personal and organizational authors to properly format citations.
  6. Author Name Formatting: Process author names by checking them against known name databases, handling affixes and titles, and deciding whether to reverse the name order depending on whether the author is a person or an organization, so that citations are formatted correctly.
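
The heading inference described in step 4 can be sketched as follows. This is a minimal illustration, not the service's actual implementation: the function name `infer_heading_levels` and the z-score thresholds are assumptions chosen for the example.

```python
from statistics import mean, pstdev


def infer_heading_levels(heights, thresholds=(3.0, 2.0, 1.0)):
    """Return a heading level (1-3) for each text height, or 0 for body text.

    A line counts as a level-N heading when its height lies at least
    thresholds[N-1] standard deviations above the mean height.
    The threshold values are hypothetical, picked only for illustration.
    """
    mu = mean(heights)
    sigma = pstdev(heights)
    if sigma == 0:
        # All lines share one height, so nothing stands out as a heading.
        return [0] * len(heights)

    levels = []
    for height in heights:
        z = (height - mu) / sigma
        level = 0
        # Check the largest threshold first so the biggest text gets level 1.
        for candidate, threshold in enumerate(thresholds, start=1):
            if z >= threshold:
                level = candidate
                break
        levels.append(level)
    return levels


if __name__ == "__main__":
    # Twenty body lines at 10pt, one 24pt title, two 16pt subheadings.
    sample = [10] * 20 + [24, 16, 16]
    print(infer_heading_levels(sample))
```

Using standard deviations rather than fixed point sizes lets the same rule work across PDFs that use different base font sizes.
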
GET /extract

Query Parameters

url: string (format: uri)

URL to extract content from (supports articles, PDFs, YouTube)

images?: boolean

Include images in output (default true)

links?: boolean

Include hyperlinks in output (default true)

formatting?: boolean

Preserve text formatting (default true)

absoluteURLs?: boolean

Convert relative URLs to absolute (default true)

timeout?: integer

HTTP request timeout in seconds (default 5; range 1 <= value <= 30)
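
As a rough sketch of how these query parameters combine into a request URL, the helper below builds the query string and enforces the documented timeout range on the client side. The helper name `build_extract_url` is hypothetical, and serializing booleans as lowercase `true`/`false` is an assumption; the base URL is the one that appears in the curl example.

```python
from urllib.parse import urlencode

# Base URL as it appears in the curl example for this endpoint.
BASE_URL = "https://qwksearch.com/api/extract"


def build_extract_url(url, images=True, links=True, formatting=True,
                      absolute_urls=True, timeout=5):
    """Build a GET /extract URL from the documented query parameters."""
    if not 1 <= timeout <= 30:
        raise ValueError("timeout must be between 1 and 30 seconds")

    params = {
        "url": url,
        # Assumption: booleans are sent as lowercase "true"/"false".
        "images": str(images).lower(),
        "links": str(links).lower(),
        "formatting": str(formatting).lower(),
        "absoluteURLs": str(absolute_urls).lower(),
        "timeout": timeout,
    }
    # urlencode percent-encodes the target URL, including ":" and "/".
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_extract_url("http://example.com", images=False, timeout=10))
```

Since every parameter except `url` is optional, a caller could also omit the defaults entirely and send only `url`.
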

Response Body

application/json

curl -X GET "https://qwksearch.com/api/extract?url=http%3A%2F%2Fexample.com"
{
  "title": "string",
  "html": "string",
  "cite": "string",
  "author_cite": "string",
  "author_short": "string",
  "author_type": "single",
  "author": "string",
  "date": "2019-08-24",
  "source": "string",
  "word_count": 0,
  "url": "http://example.com"
}
{
  "error": "string"
}
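
The two sample bodies above suggest a success shape and an error shape. A client might tell them apart as in the sketch below, which assumes that an error response is recognizable by the presence of an `error` key; the names `Extraction`, `ExtractError`, and `parse_extract_response` are hypothetical, and HTTP status codes are not covered because the samples do not show them.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Extraction:
    """Fields mirrored from the sample success body."""
    title: str
    html: str
    cite: str
    author_cite: str
    author_short: str
    author_type: str
    author: str
    date: str
    source: str
    word_count: int
    url: str


class ExtractError(Exception):
    """Raised when the response body carries an "error" message."""


def parse_extract_response(body: str) -> Extraction:
    data = json.loads(body)
    # Assumption: an error payload is identified by its "error" key.
    if "error" in data:
        raise ExtractError(data["error"])
    # Missing fields become None rather than raising, to stay tolerant.
    return Extraction(**{f.name: data.get(f.name) for f in fields(Extraction)})


if __name__ == "__main__":
    try:
        parse_extract_response('{"error": "string"}')
    except ExtractError as exc:
        print("extraction failed:", exc)
```
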