Function extractSEEKTOPIC

  • Extracts unique, domain-specific key phrases from a document using noun n-grams, then ranks sentences by how central they are to the most frequently referenced key phrase concepts. The resulting key sentences and topics can serve as a first step toward vectorizing documents, fitting more documents into a context limit, or visualizing them in vector space.

    1. Sentence Segmentation: Split the text into sentences, accounting for common abbreviations, numbers, URLs, and other exceptions.
    2. Tokenization and Phrase Extraction: Use a Wiki Phrases tokenizer to identify wiki topics, phrases, and nouns. This includes spell-checking and root word verification using the Porter stemmer.
    3. Noun N-gram Extraction: Generate noun edge-grams, allowing for stop words in the middle (e.g., "state of the art").
    4. Key Phrase Consolidation: Merge smaller n-grams that are subsets of larger ones by comparing weights.
    5. Domain Specificity Calculation: Determine named entities and phrase domain specificity using WikiIDF. This rewards unique key phrases specific to the document's field (e.g., "endocrinology" in medical texts or "thou shall" in religious texts).
    6. Key Phrase Filtering: Select top key phrases based on a combination of frequency and word count.
    7. Graph Construction: Create a double-ring weighted graph with key phrases in the central ring and sentences in the outer ring. Assign weights to links based on concept usage probability.
    8. Sentence Weighting: Apply the TextRank algorithm to weight sentences, identifying those that centralize and connect the key phrase concepts most referenced by other sentences. As in PageRank, on which TextRank is based, the walk includes random surfing with occasional jumps to avoid getting stuck in loops.
    9. Top Results Selection: Select top sentences and key phrases based on overall weight and graph centrality, using either a fixed number or percentage for larger documents.
    10. Output Generation: Return top sentences (with associated key phrases) and top key phrases (with associated sentences).
    11. Dynamic Reranking: If a user interacts with a key phrase or if there's a search query leading to the document, compare query similarity to key phrases, heavily weight the most similar key phrase, and reapply TextRank from step 8.
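Steps 7, 8, and 11 can be illustrated with a small PageRank-style power iteration over a key phrase and sentence graph. This is a minimal sketch under stated assumptions, not the library's implementation: the function name `rankDoubleRing`, the dense `weights` matrix, the damping value of 0.85, and the choice to move half of the jump probability onto a boosted key phrase are all hypothetical.

```typescript
// Illustrative sketch only: names, damping value, and boost share are assumptions.
type Ranking = { phrases: number[]; sentences: number[] };

function rankDoubleRing(
  weights: number[][], // weights[p][s] = link weight between key phrase p and sentence s
  boostPhrase: number = -1, // key phrase index to weight heavily, or -1 for none
  damping: number = 0.85,
  iterations: number = 100
): Ranking {
  const numPhrases = weights.length;
  const numSentences = numPhrases > 0 ? weights[0].length : 0;
  const n = numPhrases + numSentences; // node order: phrases first, then sentences

  // Total link weight per node, used to turn weights into transition probabilities.
  const total: number[] = [];
  for (let i = 0; i < n; i++) total.push(0);
  for (let p = 0; p < numPhrases; p++) {
    for (let s = 0; s < numSentences; s++) {
      total[p] += weights[p][s];
      total[numPhrases + s] += weights[p][s];
    }
  }

  // Random-jump distribution: uniform, with half the mass moved to the boosted phrase.
  const hasBoost = boostPhrase >= 0 && boostPhrase < numPhrases;
  const jump: number[] = [];
  for (let i = 0; i < n; i++) jump.push((hasBoost ? 0.5 : 1) / n);
  if (hasBoost) jump[boostPhrase] += 0.5;

  let score: number[] = [];
  for (let i = 0; i < n; i++) score.push(1 / n);

  for (let it = 0; it < iterations; it++) {
    // Score held by nodes without links is redistributed through the jump distribution.
    let dangling = 0;
    for (let i = 0; i < n; i++) if (total[i] === 0) dangling += score[i];

    const next: number[] = [];
    for (let i = 0; i < n; i++) next.push((1 - damping + damping * dangling) * jump[i]);

    for (let p = 0; p < numPhrases; p++) {
      for (let s = 0; s < numSentences; s++) {
        const w = weights[p][s];
        if (w === 0) continue;
        const sNode = numPhrases + s;
        next[sNode] += damping * score[p] * (w / total[p]); // phrase -> sentence
        next[p] += damping * score[sNode] * (w / total[sNode]); // sentence -> phrase
      }
    }
    score = next;
  }
  return { phrases: score.slice(0, numPhrases), sentences: score.slice(numPhrases) };
}

// Three key phrases, four sentences; sentence 1 mentions every key phrase.
const demoWeights = [
  [1, 1, 0, 0],
  [0, 1, 1, 0],
  [0, 1, 0, 1],
];
console.log(rankDoubleRing(demoWeights).sentences); // uniform jumps
console.log(rankDoubleRing(demoWeights, 0).sentences); // key phrase 0 weighted heavily
```

The jump term matters here because a graph that only links key phrases to sentences alternates between the two rings, so a pure random walk would oscillate instead of settling. Passing a `boostPhrase` index mimics the reranking in step 11 by biasing jumps toward one key phrase and rerunning the same iteration.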

    Parameters

    • docText: string

      input text to analyze

    • Optional options: {
          phrasesModel: any;
          maxWords: number;
          minWords: number;
          minWordLength: number;
          topKeyphrasesPercent: number;
          limitTopSentences: number;
          limitTopKeyphrases: number;
          minKeyPhraseLength: number;
          heavyWeightQuery: string;
      } = {}
      • phrasesModel: any

        phrases model used for Wiki Phrases tokenization (step 2)

      • maxWords: number

        default=5 - maximum words in a keyphrase

      • minWords: number

        default=1 - minimum words in a keyphrase

      • minWordLength: number

        default=3 - minimum length of a word

      • topKeyphrasesPercent: number

        default=0.2 - fraction of top keyphrases to consider (0.2 = top 20%)

      • limitTopSentences: number

        default=5 - maximum number of top sentences to return

      • limitTopKeyphrases: number

        default=10 - maximum number of top keyphrases to return

      • minKeyPhraseLength: number

        default=6 - minimum length of a keyphrase

      • heavyWeightQuery: string

        query to give heavy weight to
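
The documented defaults can be summarized as destructuring defaults on an options object. The sketch below is only an illustration of how a caller-supplied object might combine with those defaults; the interface name `SeekTopicOptions` and the helper `resolveOptions` are hypothetical and are not exported by the library as far as this page shows.

```typescript
// Hypothetical mirror of the documented options; not the library's own type.
interface SeekTopicOptions {
  phrasesModel?: unknown;
  maxWords?: number;
  minWords?: number;
  minWordLength?: number;
  topKeyphrasesPercent?: number;
  limitTopSentences?: number;
  limitTopKeyphrases?: number;
  minKeyPhraseLength?: number;
  heavyWeightQuery?: string;
}

// Fill in the defaults listed above for any option the caller leaves out.
function resolveOptions({
  phrasesModel,
  maxWords = 5,
  minWords = 1,
  minWordLength = 3,
  topKeyphrasesPercent = 0.2,
  limitTopSentences = 5,
  limitTopKeyphrases = 10,
  minKeyPhraseLength = 6,
  heavyWeightQuery,
}: SeekTopicOptions = {}) {
  return {
    phrasesModel,
    maxWords,
    minWords,
    minWordLength,
    topKeyphrasesPercent,
    limitTopSentences,
    limitTopKeyphrases,
    minKeyPhraseLength,
    heavyWeightQuery,
  };
}

// Only limitTopSentences is overridden; every other option keeps its default.
console.log(resolveOptions({ limitTopSentences: 10 }));
```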

    Returns any[]

    • [{text, keyphrases, weight}] array of sentences

    Example

    extractSEEKTOPIC(testDoc, { phrasesModel, heavyWeightQuery: "self attention", limitTopSentences: 10 })
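
Since the function returns an array of `{text, keyphrases, weight}` objects, the result can be filtered and sorted like any list of ranked sentences. The following is a minimal sketch that assumes this shape; the `RankedSentence` interface, the `topSentenceTexts` helper, and the mock data are hypothetical, and the element type of `keyphrases` is left as `unknown` because it is not documented here.

```typescript
// Assumed result shape, based on the documented [{text, keyphrases, weight}].
interface RankedSentence {
  text: string;
  keyphrases: unknown[];
  weight: number;
}

// Hypothetical helper: keep sentences at or above a weight threshold, heaviest first.
function topSentenceTexts(results: RankedSentence[], minWeight: number = 0): string[] {
  return results
    .filter((r) => r.weight >= minWeight) // filter returns a copy, so the input is not mutated
    .sort((a, b) => b.weight - a.weight)
    .map((r) => r.text);
}

// Mock data standing in for a real extractSEEKTOPIC result.
const mockResults: RankedSentence[] = [
  { text: "Attention weighs token pairs.", keyphrases: ["attention"], weight: 0.4 },
  { text: "Self attention relates positions in one sequence.", keyphrases: ["self attention"], weight: 0.9 },
  { text: "See the appendix.", keyphrases: [], weight: 0.05 },
];
console.log(topSentenceTexts(mockResults, 0.1));
```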
    

    Gulakov, A. (2024)