ai-research-agent

JavaScript Docs (airesearch.js.org)

Live Demo (qwksearch.com)

Join Discord GitHub Discussions

NPM Downloads Build Status PRs Welcome

Being is Becoming
Whatever Research Can Be,
That is What It Must Become.
If AI is Humanity's Last Invention,
Then Vector Space is the Final Frontier.

searchSTREAM Docs

  1. Search Web for query via metasearch of major engines or your custom data
  2. Extract text of top results with Tractor the Text Extractor.
  3. SEEKTOPIC: Extract Keyphrase Topics and Top Sentences that centralize those topics
  4. Rerank each document's chunks by relevance to the query: embeddings convert text to concept vectors of numbers within the LLM "concept space", and cosine similarity of query to topic returns the sentences central to the key relevant parts of the article (see the sketch after this list).
  5. Prompt the Research Agent with key sentences from relevant sources to answer via a Groq Llama, OpenAI, or Anthropic API key and suggest follow-ups.
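
The cosine-similarity rerank in step 4 can be pictured with a minimal sketch. This is not the library's internal code: `embed(text)` is a placeholder for whatever embedding model converts text into a vector.

```javascript
// Sketch of the step-4 rerank: score each chunk by cosine similarity to the query.
// `embed` is a placeholder for any embedding function that returns a numeric vector.
function cosineSimilarity(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

async function rerankChunks(query, chunks, embed) {
  const queryVector = await embed(query);
  const scored = await Promise.all(
    chunks.map(async (chunk) => ({
      chunk,
      score: cosineSimilarity(queryVector, await embed(chunk)),
    }))
  );
  return scored.sort((a, b) => b.score - a.score); // most relevant chunks first
}
```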

extractSEEKTOPIC Docs

SEEKTOPIC Sample Output

SEEKTOPIC can be used to find unique, domain-specific keyphrases using noun n-grams. The user can click on keyphrases, or an LLM can suggest questions based on them. The user can also see just the most important sentences highlighted: those that centralize and tie in the core topics. It is also possible to vectorize the query and compare its dot-product similarity to keyphrases, which are then mapped to parts of the document such as section labels. This is closer to how humans think of article organization: section headings and lead sentences that tie in concepts from the others.

SEEKTOPIC extracts unique, domain-specific key phrases from a document using noun n-grams and ranks sentences based on their centrality to the most frequently referenced key phrase concepts, enabling efficient extraction of domain-specific content and providing a flexible framework for summarization.

  1. Sentence Segmentation: Split the text into sentences, accounting for common abbreviations, numbers, URLs, and other exceptions.
  2. Tokenization and Phrase Extraction: Employ a Wiki Phrases tokenizer to identify wiki topics, phrases, and nouns. This includes spell-checking and root word verification using Porter Stemmer.
  3. Noun N-gram Extraction: Generate noun edge-grams, allowing for stop words in the middle (e.g., "state of the art").
  4. Key Phrase Consolidation: Merge smaller n-grams that are subsets of larger ones by comparing weights.
  5. Domain Specificity Calculation: Determine named entities and phrase domain specificity using WikiIDF. This rewards unique key phrases specific to the document's field (e.g., "endocrinology" in medical texts or "thou shall" in religious texts).
  6. Key Phrase Filtering: Select top key phrases based on a combination of frequency and word count.
  7. Graph Construction: Create a double-ring weighted graph with key phrases in the central ring and sentences in the outer ring. Assign weights to links based on concept usage probability.
  8. Sentence Weighting: Apply the TextRank algorithm to weight sentences, identifying those that centralize and connect the key phrase concepts most referenced by other sentences. This process, derived from TextRank and PageRank, includes random surfing and jumping to avoid loops (a simplified sketch follows this list).
  9. Top Results Selection: Select top sentences and key phrases based on overall weight and graph centrality, using either a fixed number or percentage for larger documents.
  10. Output Generation: Return top sentences (with associated key phrases) and top key phrases (with associated sentences).
  11. Dynamic Reranking: If a user interacts with a key phrase or if there's a search query leading to the document, compare query similarity to key phrases, heavily weight the most similar key phrase, and reapply TextRank from step 8.
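
Steps 7 and 8 can be pictured as a PageRank-style random walk over the sentence/key-phrase graph. The sketch below is a simplified illustration, not the library's implementation: `links` is an assumed structure mapping each sentence to the key phrases it uses and the weight of each link.

```javascript
// Simplified sketch of steps 7–8: rank sentences by how strongly they share
// key phrases with other sentences, with a damping ("random jump") term.
function rankSentences(links, { damping = 0.85, iterations = 30 } = {}) {
  // links: { [sentenceId]: { [phraseId]: weight } }
  const sentenceIds = Object.keys(links);
  let scores = Object.fromEntries(sentenceIds.map((id) => [id, 1 / sentenceIds.length]));

  for (let step = 0; step < iterations; step++) {
    const next = {};
    for (const s of sentenceIds) {
      let incoming = 0;
      for (const other of sentenceIds) {
        if (other === s) continue;
        // Overlap = how much the two sentences rely on the same key phrases
        const overlap = Object.keys(links[s])
          .reduce((sum, p) => sum + (links[other][p] || 0) * links[s][p], 0);
        incoming += scores[other] * overlap;
      }
      // Random-jump term keeps the walk from getting trapped in loops
      next[s] = (1 - damping) / sentenceIds.length + damping * incoming;
    }
    scores = next;
  }
  return Object.entries(scores).sort((a, b) => b[1] - a[1]); // most central sentences first
}
```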

extract Docs

  1. Extract a URL or HTML to its main content: an improved version combining Mozilla Readability and Postlight Mercury, using 100+ custom adapters for major websites.
  2. Strips the page down to basic HTML for reading mode or for saving research notes.
  3. YouTube - gets the full transcript if a YouTube video is detected.
  4. PDF - extracts formatted text from PDFs, parsing line breaks, page headers, and footnotes, and inferring headings based on standard deviation from the average text height.
  5. Cite - extracts author, date, source, and title from HTML using meta tags and common class names. Validates that the author string is a human name by checking it against a common list of 90k first names, last names, and organizations, to infer whether a string that starts with the author's last name should be reversed (accounting for affixes/titles), since organization names are never reversed (see the sketch after this list).
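
The author-name check in the Cite step can be sketched roughly as below. The name list, organization keywords, and logic here are illustrative assumptions, not the library's actual data or implementation.

```javascript
// Rough sketch of the citation heuristic: decide whether an author string like
// "Smith, Jane" is a reversed personal name or an organization (never reversed).
const FIRST_NAMES = new Set(["jane", "john", "maria"]); // ~90k entries in practice
const ORG_KEYWORDS = ["institute", "university", "press", "news", "foundation"];

function normalizeAuthor(author) {
  const lower = author.toLowerCase();
  if (ORG_KEYWORDS.some((k) => lower.includes(k))) return author; // organization: keep as-is
  const parts = author.split(",").map((s) => s.trim());
  if (parts.length === 2 && FIRST_NAMES.has(parts[1].toLowerCase())) {
    return `${parts[1]} ${parts[0]}`; // "Smith, Jane" -> "Jane Smith"
  }
  return author;
}

normalizeAuthor("Smith, Jane");         // "Jane Smith"
normalizeAuthor("Stanford University"); // unchanged
```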

compileTopicModel Docs

Search and outline a research base using Wikipedia's 100K popular pages as the core topic-phrase graph for LLM Research Agents. Most documents online (and, by extension, thinking in the collective consciousness) revolve around core topic phrases that can be linked as a graph. If all of the available docs are nodes, the links in the graph can be extracted Wiki page entities and mappings of dictionary phrases to their Wiki pages. These can serve as topic labels, keywords, and suggestions for LLM follow-up questions. Documents can be linked in a graph with: 1. wiki page entity recognition, 2. frequent keyphrases, 3. HTML links, 4. research paper references, 5. keyphrases to query in global web search, and 6. site-specific recommendations. These can lay the foundation for LLM Research Agents to fully grok, summarize, and outline a research base.
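
As a rough illustration of linking documents by shared Wikipedia page entities, the sketch below builds edges between documents that mention the same wiki topics. The field names (`id`, `wikiEntities`) are assumptions for the example, not the library's output format.

```javascript
// Sketch: connect documents into a topic graph by the wiki page entities they share.
function buildTopicGraph(docs) {
  // docs: [{ id, wikiEntities: ["Vector space", "PageRank", ...] }]
  const edges = [];
  for (let i = 0; i < docs.length; i++) {
    for (let j = i + 1; j < docs.length; j++) {
      const shared = docs[i].wikiEntities.filter((e) => docs[j].wikiEntities.includes(e));
      if (shared.length) edges.push({ from: docs[i].id, to: docs[j].id, topics: shared });
    }
  }
  return edges; // each edge lists the core topic phrases two documents share
}
```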

  • 240K total words & phrases, of which the first 117K are first-word or single-word entries to check every token against. Includes 100K Wikipedia page titles and links from Wikipedia's most popular pages, plus a domain specificity score and which letters should be capitalized.
  • 84K words and 67K phrases in the dictionary lexicon OpenEnglishWordNet, an updated and improved version of WordNet, with multiple definitions per term (120K definitions, 45 concept categories).
  • JSON Prefix Trie - words and phrases sorted for lookup by first word: tokenize by word, then check whether that word starts a phrase listed in the entries, for phrase extraction from a text. Prefix trie lookups are O(1) per token (instead of having to loop through the index for each lookup), which makes it well suited to this task (a lookup sketch follows this list).
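
A first-word trie lookup along these lines could look like the sketch below. The trie layout (first word mapping to lists of possible continuations) is an assumption for illustration, not the exact JSON structure shipped with the library.

```javascript
// Sketch of first-word phrase lookup: tokenize by word, then check whether the
// word starts a known multi-word phrase, preferring the longest match.
const trie = {
  state: [["of", "the", "art"], ["of", "emergency"]],
  white: [["house"]],
};

function matchPhrase(tokens, start) {
  const continuations = trie[tokens[start]];
  if (!continuations) return null; // this token cannot start a known phrase
  const longestFirst = [...continuations].sort((a, b) => b.length - a.length);
  for (const rest of longestFirst) {
    if (rest.every((word, i) => tokens[start + 1 + i] === word)) {
      return [tokens[start], ...rest].join(" ");
    }
  }
  return tokens[start]; // fall back to the single word
}

matchPhrase(["state", "of", "the", "art", "model"], 0); // "state of the art"
```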

weighRelevanceTermFrequency Docs

Calculate term specificity for a single doc with the BM25 formula, using Wikipedia term frequencies as the baseline inverse document frequency. WikiBM25 removes the need to pass in all docs and compute against every document in a database. The problem with BM25 and TF-IDF is that a large set of documents is needed to find which words are repeated often across all of them. These overused words are largely the same list of words, so using Wikipedia's term frequencies provides a common-sense baseline against a neutral corpus.

Use this list to replace or combine with an all-documents IDF. Many websites have fewer than a hundred pages to search through, which is not enough to find which terms are domain-specific. They can instead score a single doc at a time to find the weight each word in the query gets. Wikipedia IDF can serve as a baseline to average with the all-docs IDF, capturing uniqueness both for the general public and within the specific domain (a hypothetical blend is sketched below).
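
A blend of the two IDF sources could look like the following; the 50/50 weighting is an assumption for illustration, not a recommended value.

```javascript
// Hypothetical blend of a site's own IDF with the Wikipedia baseline IDF.
// siteIdf and wikiIdf are plain { term: idf } maps; the default 50/50 split is an assumption.
function blendedIdf(term, siteIdf, wikiIdf, weight = 0.5) {
  return weight * (siteIdf[term] ?? 0) + (1 - weight) * (wikiIdf[term] ?? 0);
}
```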

Example: given the query "Superbowl wins by year", we do not want to simply return docs filled with common words like "year", but rather recognize that "Superbowl" is more domain-specific. Normally this requires precomputing IDF values across all docs; websites without many docs to start with may instead average their precomputed scores with WikiIDF values to ensure the most unique words still get a score.

LLM RAG Chunk to Query Similarity - When we chunk a document into parts to decide which to pass into an LLM prompt, the chunks need to be weighted by relevance to the query. Semantic embedding with an LLM not only takes resources to compute and store the vectors, it also performs worse than BM25 on its own. Hybrid BM25 & embeddings RAG is best, but there may not be time to compute BM25 IDF scores across all doc chunks. We need a fast way to give more weight to unique words than to common short words that happen to repeat a lot in an edge-case paragraph. WikiBM25 is the best fit in use cases like realtime web search, where the text cannot be chunked beforehand.

$$\text{score}(D,Q) = \sum_{i=1}^{N} \text{Wiki-IDF}(q_i) \times \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$
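
Translating the formula directly into code gives a sketch like the following, where `wikiIdf` maps each term to its Wikipedia-derived IDF; the k1 and b defaults are common BM25 values assumed for the example, not necessarily the library's.

```javascript
// Direct sketch of the WikiBM25 score above for one document D and query Q.
function wikiBM25(queryTerms, docTerms, wikiIdf, avgDocLength, k1 = 1.2, b = 0.75) {
  const docLength = docTerms.length;
  const termFrequency = {};
  for (const term of docTerms) termFrequency[term] = (termFrequency[term] || 0) + 1;

  let score = 0;
  for (const term of queryTerms) {
    const f = termFrequency[term] || 0; // f(q_i, D)
    const idf = wikiIdf[term] || 0;     // Wiki-IDF(q_i); unknown terms contribute nothing
    score += idf * (f * (k1 + 1)) /
      (f + k1 * (1 - b + b * (docLength / avgDocLength)));
  }
  return score;
}
```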

WikiIDF Term Specificity Statistics Metadata

  • Phrase Starter Words and Single Words: 104556

  • Total Terms: 204169

  • Dict single words: 84493

  • Dict phrases: 67444

All words in English Wikipedia are sorted by the number of pages they appear in, yielding 325K words that appear on at least 32 wiki pages, are 3 to 23 characters long, and consist of Latin alphanumerics (a-z, 0-9), punctuation like .-, and diacritics like éï, with numbers and foreign-language terms filtered out.

suggestNextWordCompletions Docs

Search on keystroke and load this JSON index for word and phrase completion, sorted by how common the terms are via IDF, for a search autocomplete dropdown. A token taken as a single word can have a meaning widely different from the same word as part of a phrase, so it is better to extract phrases by first-word/next-words pairings. Search results will be more accurate if we infer likely phrases and search for those words occurring together, rather than just splitting into words and counting frequency. Examples are "white house" or "state of the art", which should be searched as phrases but would return different context if split into words (a completion-lookup sketch follows). As Led Zeppelin famously put it: ♫ "'Cause you know sometimes words have two meanings."
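
A completion lookup along these lines could work as in the sketch below; the index shape (first word mapping to continuation/score pairs) and the scores themselves are illustrative assumptions.

```javascript
// Sketch of first-word → next-words completion for an autocomplete dropdown.
const completionIndex = {
  white: [["house", 9.2], ["noise", 6.4]],
  state: [["of the art", 8.8], ["department", 7.1]],
};

function suggestCompletions(typed) {
  const words = typed.toLowerCase().trim().split(/\s+/);
  const entries = completionIndex[words[0]] || [];
  return entries
    .filter(([phrase]) => !words[1] || phrase.startsWith(words.slice(1).join(" ")))
    .sort((a, b) => b[1] - a[1])                 // highest-scoring completions first
    .map(([phrase]) => `${words[0]} ${phrase}`);
}

suggestCompletions("white h"); // ["white house"]
```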

PRs Welcome