Keyword extraction by entropy difference between the intrinsic and extrinsic mode
We propose a new metric to evaluate and rank the relevance of words in a text. The method uses the Shannon entropy difference between the intrinsic and extrinsic modes. It rests on the observation that relevant words reflect the author’s writing intention, i.e., their occurrences are modulated by the author’s purpose, while irrelevant words are distributed randomly in the text. Using The Origin of Species by Charles Darwin as a representative text sample, we demonstrate the performance of our detector and compare it to previous proposals. Because no reference corpus (for example, an author’s collected writings, books, and papers) is needed, our approach is especially suitable for single documents for which no a priori information is available.
Project Members
- Zhen Yang
- Weitong Chen
- Hanchen Li
- Chaoyang Li
- Ning Lu
- Longbo Zhang
- Youjun E
Highlights
- We propose a new metric to evaluate and rank the relevance of words in a text.
- The metric uses the Shannon entropy difference between the intrinsic and extrinsic modes.
- We believe that this work is a new result in keyword extraction and ranking.
- Our approach is especially suitable for single documents for which no a priori information is available.
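The idea behind the metric can be illustrated with a minimal sketch. It is not the published implementation, and it rests on several assumptions. It assumes the gaps between consecutive occurrences of a word are split at their mean: gaps no longer than the mean form the "intrinsic" mode and longer gaps form the "extrinsic" mode. It also assumes the score is the entropy of the intrinsic gaps minus the entropy of the extrinsic gaps. The function names `shannon_entropy`, `entropy_difference` and `rank_words` are hypothetical, and any normalization used in the paper is omitted.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    if not values:
        return 0.0
    total = len(values)
    counts = Counter(values)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def entropy_difference(tokens, word):
    """Illustrative score: H(intrinsic gaps) - H(extrinsic gaps).

    Assumption: the two modes are separated at the mean gap between
    consecutive occurrences of `word`. The sign convention is also assumed.
    """
    positions = [i for i, t in enumerate(tokens) if t == word]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if not gaps:
        return 0.0
    mean_gap = sum(gaps) / len(gaps)
    intrinsic = [g for g in gaps if g <= mean_gap]  # occurrences inside clusters
    extrinsic = [g for g in gaps if g > mean_gap]   # jumps between clusters
    return shannon_entropy(intrinsic) - shannon_entropy(extrinsic)


def rank_words(tokens, min_count=3):
    """Rank words that occur at least `min_count` times by the score above."""
    counts = Counter(tokens)
    scores = {
        w: entropy_difference(tokens, w)
        for w, c in counts.items()
        if c >= min_count
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because the sketch needs only the token sequence of one document, it reflects the point made above that no reference corpus is required. A perfectly evenly spaced word has a single gap value and therefore scores zero under these assumptions.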
Publication
Code & Toolbox
- Online Demo 1!
- Online Demo 2!
- Github page
- Codeproject page
- SPROUT toolbox, developed by Prof. Zhiqiang Cai, which uses our algorithm to extract keywords from a target corpus and then uses those keywords to find additional articles on wiki to expand the corpus.
Update
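- Longbo Zhang (张龙伯). Keyword Detection Algorithm Based on Multi-Scale Partitioning. Master's thesis, Beijing University of Technology, 2014.
- Longbo Zhang (张龙伯). Keyword Detection System Based on Multi-Scale Partitioning. Computer software copyright (registration no. 2014SRBJ0226), 2014.
- Adds a multi-scale analysis method on top of the YANG’13 algorithm.
- Partitions the article at multiple scales and jointly considers how each word is distributed at every granularity to compute its topic relevance, which allows the keywords in the text to be detected effectively.
- Keyword detection on The Origin of Species improves markedly, reaching 100% accuracy for the top 19 words.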
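One possible reading of the multi-scale idea is sketched below. It is an illustration only and not the algorithm from the thesis. It assumes the text is cut into equal-width segments at several granularities. It also assumes that, at each scale, a word's clustering is measured as one minus the normalized entropy of its counts over the segments, and that the per-scale values are simply averaged. The function names and the default scales `(2, 4, 8)` are hypothetical.

```python
import math


def partition_counts(positions, text_length, parts):
    """Count the occurrences that fall into each of `parts` equal-width segments."""
    counts = [0] * parts
    for pos in positions:
        counts[min(pos * parts // text_length, parts - 1)] += 1
    return counts


def clustering_at_scale(positions, text_length, parts):
    """1 - normalized entropy of the word's distribution over the segments.

    Returns 1.0 when all occurrences sit in one segment and 0.0 when they
    are spread evenly (assumed measure, chosen for illustration).
    """
    counts = partition_counts(positions, text_length, parts)
    total = sum(counts)
    if total == 0 or parts < 2:
        return 0.0
    entropy = sum(-(c / total) * math.log2(c / total) for c in counts if c)
    return 1.0 - entropy / math.log2(parts)


def multiscale_relevance(tokens, word, scales=(2, 4, 8)):
    """Average the per-scale clustering values (assumed way of combining scales)."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    values = [clustering_at_scale(positions, len(tokens), k) for k in scales]
    return sum(values) / len(values)
```

Under these assumptions, a word concentrated in one part of the text scores close to 1, while a word spread evenly through the text scores close to 0 at every scale.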