
term-frequency-import

Functions

importTermFrequency()

function importTermFrequency(): object

Downloads, decompresses, parses, and processes the Wikipedia term frequency dataset compiled by SmartDataAnalytics in 2020. The dataset ranks words in English Wikipedia by the number of pages they appear in. The processed output keeps about 325K terms that appear in at least 32 pages and are 3 to 23 characters long, made up of Latin alphanumerics (a-z, 0-9), punctuation such as . and -, and diacritics such as é and ï. Numbers and foreign-language terms are filtered out.
Total Terms (frequency>=32): 324,896
File size (JSON, frequency>=32): 4 MB
Total Articles (Wiki-en-2020): 5,989,879
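
The filtering rules described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual implementation: the `TermRow` shape and the names `keepTerm` and `filterTerms` are hypothetical, and the exact character ranges and the way the original script detects numbers or foreign-language terms are assumptions.

```typescript
// Hypothetical shape of one parsed row; the real dataset columns may differ.
interface TermRow {
  term: string;
  pageCount: number; // number of Wikipedia pages that contain the term
}

const MIN_PAGES = 32;
const MIN_LENGTH = 3;
const MAX_LENGTH = 23;

// Assumed character set: a-z, 0-9, "." and "-", plus accented Latin letters
// from the Latin-1 Supplement and Latin Extended-A/B ranges.
const ALLOWED_CHARS = /^[a-z0-9.\-\u00C0-\u024F]+$/i;

// Assumption: a term made only of digits and punctuation counts as a number.
const NUMERIC_ONLY = /^[0-9.\-]+$/;

// Returns true when a row passes every filter named in the description.
function keepTerm(row: TermRow): boolean {
  const length = row.term.length;
  return (
    row.pageCount >= MIN_PAGES &&
    length >= MIN_LENGTH &&
    length <= MAX_LENGTH &&
    ALLOWED_CHARS.test(row.term) &&
    !NUMERIC_ONLY.test(row.term)
  );
}

// Filters the rows and sorts them by page count, most frequent first.
function filterTerms(rows: TermRow[]): TermRow[] {
  return rows.filter(keepTerm).sort((a, b) => b.pageCount - a.pageCount);
}
```

Under these assumptions, a term such as "café" with 100 pages is kept, while "2020" (numeric), "ab" (too short), and any term found in fewer than 32 pages are dropped.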

Returns

object

Author

Galkin, M., Malykh, V. (2020). Wikipedia TF-IDF Dataset Release (v1.0). Zenodo. https://doi.org/10.5281/zenodo.3631674 https://github.com/SmartDataAnalytics/Wikipedia_TF_IDF_Dataset