compile-topic-model


Other

weightWikiWordSpecificity()

function weightWikiWordSpecificity(query): number

Find domain-specific unique words in a single doc with the BM25 formula, using Wikipedia term frequencies as the common-words corpus. All words in English Wikipedia are sorted by the number of pages they appear on, yielding 325K words with frequencies of at least 32 wiki pages, 3 to 23 characters long, made of Latin alphanumerics (a-z, 0-9), punctuation (.-), and diacritics (éï), with numbers and foreign-language words filtered out.
Total Terms (frequency ≥ 32): 324,896
Filesize (JSON, frequency ≥ 32): 4 MB
Total Articles (Wiki-en-2020): 5,989,879

Galkin, M., Malykh, V. (2020). Wikipedia TF-IDF Dataset Release (v1.0). Zenodo. https://doi.org/10.5281/zenodo.3631674 https://github.com/SmartDataAnalytics/Wikipedia_TF_IDF_Dataset
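
To make the scoring concrete, here is a minimal sketch of the BM25-style IDF computation described above, assuming the dataset has been loaded into a word-to-page-count map (the names, the loading step, and the fallback for unknown words are illustrative, not this module's actual implementation):

```typescript
// Sketch only: assumes the Wikipedia frequency dataset is loaded as a
// Map of word -> number of wiki pages containing that word.
const TOTAL_WIKI_ARTICLES = 5_989_879; // Wiki-en-2020 article count

function specificityScore(
  query: string,
  wikiPageCounts: Map<string, number>
): number {
  const words = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (words.length === 0) return 0;

  let total = 0;
  for (const word of words) {
    // Words below the frequency-32 cutoff are absent from the dataset;
    // treat them as maximally rare (an assumption, not the module's behavior).
    const pagesWithWord = wikiPageCounts.get(word) ?? 32;
    // BM25 IDF: rare words score high, ubiquitous words score near 0.
    const idf = Math.log(
      (TOTAL_WIKI_ARTICLES - pagesWithWord + 0.5) / (pagesWithWord + 0.5) + 1
    );
    total += idf;
  }
  return total / words.length; // average specificity across the query's words
}
```

With these constants, a word found on only 32 pages scores about ln(5,989,879 / 32.5) ≈ 12, while the most common words score near 0, matching the 0 to 12 range noted under Returns.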

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| query | string | phrase to search wiki-idf for each word |

Returns

number

score for term specificity, roughly in the range 0 to 12
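
A hypothetical usage (the import path is assumed, not taken from these docs):

```typescript
import { weightWikiWordSpecificity } from "./compile-topic-model"; // path assumed

weightWikiWordSpecificity("the and of");               // common words score near 0
weightWikiWordSpecificity("eigenvalue decomposition"); // rare domain terms score toward 12
```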

Topics

compileTopicModel()

function compileTopicModel(options?): Promise<void>

Compile a topic phrases model from a dictionary and Wikipedia page titles.
Search and outline a research base using Wikipedia's 100K popular pages as the core topic-phrases graph for LLM Research Agents. Most documents online (and, by extension, thinking in the collective consciousness) revolve around core topic phrases linked as a graph.
If all the available docs are nodes, the links in the graph can be extracted from Wiki page entities and from mappings of dictionary phrases to their wiki pages. These can serve as topic labels, keywords, and suggestions for LLM follow-up questions. Documents can be linked in a graph with (see the sketch after this list):

  1. wiki page entity recognition
  2. frequent keyphrases
  3. html links
  4. research paper references
  5. keyphrases to query in global web search
  6. site-specific recommendations

These can lay the foundation for LLM Research Agents to fully grok, summarize, and outline a research base.
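
A sketch of how such a research graph might be typed, covering the six link kinds above (these types are illustrative, not exported by this module):

```typescript
// Illustrative types for the document graph described above.
type LinkKind =
  | "wiki-entity"          // 1. wiki page entity recognition
  | "keyphrase"            // 2. frequent keyphrases
  | "html-link"            // 3. html links
  | "paper-reference"      // 4. research paper references
  | "web-search"           // 5. keyphrases to query in global web search
  | "site-recommendation"; // 6. site-specific recommendations

interface DocNode {
  id: string;
  title: string;
  topics: string[]; // topic labels from wiki page mappings
}

interface DocEdge {
  from: string;   // source DocNode id
  to: string;     // target DocNode id or wiki page title
  kind: LinkKind;
}

interface ResearchGraph {
  nodes: Map<string, DocNode>;
  edges: DocEdge[];
}
```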

- 240K total words & phrases; the first 117K are single words or phrase-starting first words, so every token can be checked against them. Includes 100K Wikipedia page titles and links (Wikipedia's most popular pages), along with a domain-specificity score and which letters should be capitalized.
- 84K words and 67K phrases from the dictionary lexicon Open English WordNet, a better-maintained update of WordNet: multiple definitions per term, 120K definitions, 45 concept categories.
- JSON Prefix Trie, arranged by sorting words and phrases for lookup by first word: tokenize a text by word, then check whether each token starts a phrase based on the entries, for phrase extraction (see the sketch below).
- Prefix trie lookups take effectively constant time per token (no need to loop through the whole index for each lookup), which is widely agreed to make it the best data structure for this task.

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| options? | { addJSONLineBreaks: number; addWikiPageTitles: boolean; maxSynonymsPerTerm: number; minTermCharCount: number; sortInFirstTwoLettersTrie: boolean; } | |
| options.addJSONLineBreaks? | number | include line breaks in JSON output for debugging |
| options.addWikiPageTitles? | boolean | true to add wiki page titles, false for dictionary only |
| options.maxSynonymsPerTerm? | number | max synonyms per term |
| options.minTermCharCount? | number | min length of term to include |
| options.sortInFirstTwoLettersTrie? | boolean | sort the first words by a first-two-letters trie, needed for autocomplete after 2 letters typed |

Returns

Promise<void>
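
A hypothetical invocation with the documented options (the import path and option values are assumptions, not taken from these docs):

```typescript
import { compileTopicModel } from "./compile-topic-model"; // path assumed

await compileTopicModel({
  addWikiPageTitles: true,         // include the 100K popular wiki page titles
  maxSynonymsPerTerm: 3,           // cap synonyms kept per term
  minTermCharCount: 3,             // drop terms shorter than 3 characters
  sortInFirstTwoLettersTrie: true, // needed for autocomplete after 2 letters typed
  addJSONLineBreaks: 1,            // line breaks in the JSON output, for debugging
});
```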