seektopic-keyphrases
extractSEEKTOPIC()
function extractSEEKTOPIC(docText, options?): object
🔤📊 SEEKTOPIC: Summarization by Extracting Entities, Keyword Tokens, and Outline Phrases Important to Context
Extracts unique, domain-specific key phrases from a document using noun n-grams, then ranks sentences by how central they are to the most frequently referenced key phrase concepts. The resulting key sentences and topics can serve as a first step for vectorizing documents, fitting more documents into a context limit, or visualizing them in vector space.
1. Sentence Segmentation: Split the text into sentences, accounting for common abbreviations, numbers, URLs, and other exceptions.
2. Tokenization and Phrase Extraction: Use a Wiki Phrases tokenizer to identify wiki topics, phrases, and nouns. This includes spell-checking and root word verification using the Porter stemmer.
3. Noun N-gram Extraction: Generate noun edge-grams, allowing stop words in the middle (e.g., "state of the art").
4. Key Phrase Consolidation: Merge smaller n-grams that are subsets of larger ones by comparing their weights.
5. Domain Specificity Calculation: Determine named entities and phrase domain specificity using WikiIDF. This rewards unique key phrases specific to the document's field (e.g., "endocrinology" in medical texts or "thou shall" in religious texts).
6. Key Phrase Filtering: Select the top key phrases based on a combination of frequency and word count.
7. Graph Construction: Create a double-ring weighted graph with key phrases in the central ring and sentences in the outer ring. Assign weights to links based on concept usage probability.
8. Sentence Weighting: Apply the TextRank algorithm to weight sentences, identifying those that centralize and connect the key phrase concepts most referenced by other sentences. As in PageRank, on which TextRank is based, the walk combines random surfing with random jumps to avoid getting stuck in loops.
9. Top Results Selection: Select the top sentences and key phrases based on overall weight and graph centrality, using either a fixed number or, for larger documents, a percentage.
10. Output Generation: Return the top sentences (with their associated key phrases) and the top key phrases (with their associated sentences).
11. Dynamic Reranking: If a user interacts with a key phrase, or if a search query led to the document, compare the query's similarity to the key phrases, heavily weight the most similar key phrase, and reapply TextRank from step 8.
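
The graph construction, sentence weighting, and dynamic reranking steps above can be illustrated with a minimal sketch. This is not the library's implementation: `rankSentences`, `boostPhrase`, `damping`, and `iterations` are hypothetical names, key phrases are matched by plain substring counting rather than the Wiki Phrases tokenizer, and the 0.85 damping factor is simply the value conventionally used with PageRank. Key phrase nodes form the inner ring, sentence nodes the outer ring, and a random walk with jumps scores both.

```javascript
// Hypothetical sketch, not the library's code: PageRank-style ranking on a
// two-ring graph (key phrases in the center, sentences on the outside).
function rankSentences(sentences, keyphrases, options = {}) {
  const { boostPhrase = null, damping = 0.85, iterations = 50 } = options;
  const nK = keyphrases.length;
  const n = nK + sentences.length; // nodes 0..nK-1 are phrases, the rest sentences

  // Symmetric edges phrase <-> sentence, weighted by occurrence count
  // (a simplification of "concept usage probability").
  const weights = Array.from({ length: n }, () => new Array(n).fill(0));
  sentences.forEach((sentence, s) => {
    const text = sentence.toLowerCase();
    keyphrases.forEach((phrase, k) => {
      const count = text.split(phrase.toLowerCase()).length - 1;
      if (count > 0) {
        weights[k][nK + s] = count;
        weights[nK + s][k] = count;
      }
    });
  });

  // Jump distribution: uniform, or tilted heavily toward one boosted phrase.
  const jump = new Array(n).fill(1);
  const boostIndex = keyphrases.indexOf(boostPhrase);
  if (boostIndex >= 0) jump[boostIndex] = n * 10; // arbitrary heavy weight
  const jumpTotal = jump.reduce((a, b) => a + b, 0);
  const jumpProb = jump.map((w) => w / jumpTotal);

  const outTotals = weights.map((row) => row.reduce((a, b) => a + b, 0));
  let scores = new Array(n).fill(1 / n);
  for (let it = 0; it < iterations; it++) {
    const next = jumpProb.map((p) => (1 - damping) * p);
    for (let i = 0; i < n; i++) {
      for (let j = 0; j < n; j++) {
        // Nodes without links redistribute their score via the jump distribution.
        const step = outTotals[i] === 0 ? jumpProb[j] : weights[i][j] / outTotals[i];
        next[j] += damping * scores[i] * step;
      }
    }
    scores = next;
  }

  return {
    keyphraseScores: keyphrases.map((phrase, k) => ({ phrase, score: scores[k] })),
    sentenceScores: sentences.map((text, s) => ({ index: s, text, score: scores[nK + s] })),
  };
}

// Usage: rank once, then rerank with one key phrase heavily weighted.
const demoSentences = [
  "Self attention relates tokens in a sequence.",
  "Transformers stack self attention and feed forward layers.",
  "The weather was pleasant.",
];
const demoPhrases = ["self attention", "transformers"];
const baseRanking = rankSentences(demoSentences, demoPhrases);
const boostedRanking = rankSentences(demoSentences, demoPhrases, { boostPhrase: "transformers" });
console.log(baseRanking.sentenceScores, boostedRanking.keyphraseScores);
```

Tilting the jump distribution toward a single phrase is one common way to personalize a PageRank-style walk. Whether the library boosts the query phrase through jump probabilities or through edge weights is not stated in this page.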
Parameters
Parameter | Type | Description |
---|---|---|
docText | string | input text to analyze |
options? | object | |
options.heavyWeightQuery | string | query to give heavy weight to |
 | number | default=10 - maximum number of top keyphrases to return |
options.limitTopSentences | number | default=5 - maximum number of top sentences to return |
 | number | default=5 - maximum words in a keyphrase |
 | number | default=6 - minimum length of a keyphrase |
 | number | default=3 - minimum length of a word |
 | number | default=1 - minimum words in a keyphrase |
options.phrasesModel | | phrases model |
 | number | default=0.2 - fraction of top keyphrases to consider (0.2 = 20%) |
Returns
object
Name | Type |
---|---|
keyphrases | Object[] |
sentences | string[] |
topSentences | Object[] |
Example
// Assumes testDoc (the input text) and phrasesModel (the phrases model) are already defined.
const result = extractSEEKTOPIC(testDoc, { phrasesModel, heavyWeightQuery: "self attention", limitTopSentences: 10 });
console.log(result.topSentences); // Array of top sentences with their keyphrases and weights
console.log(result.keyphrases); // Array of top keyphrases with their weights and associated sentence indices
console.log(result.sentences); // Array of all sentences in the input text