Journal of Advances in Applied Mathematics

The Exploration and Application of K-medoids in Text Clustering

PP. 93-102 Pub. Date: July 1, 2019

DOI: 10.22606/jaam.2019.43001

Author(s)

Qiongjie Dai
1 School of Economics and Management, North China Electric Power University, Beijing, China
2 School of Mathematics and Computer Engineering, Ordos Institute of Technology, Ordos, Inner Mongolia, China
Jicheng Liu*
School of Economics and Management, North China Electric Power University, Beijing, China

Abstract

Cluster analysis is a statistical method for grouping samples or indexes. Traditional text clustering algorithms are complicated and inconvenient for data processing, and the traditional method for calculating the weights of feature items has shortcomings. To address them, we propose a text clustering algorithm based on K-medoids. The algorithm improves the frequency and inverse document frequency and also takes into account the influence of document category on feature weight. Because a standard classified dataset may not be available in practice, we propose a new weight calculation method that combines document category with semantic contribution. First, the semantic contribution is defined and combined with fuzzy clustering, so that a rough clustering of a text set without category information yields a text set with category information. Then, the category information entropy is defined and combined with the semantic contribution to improve the traditional TF-IDF weight calculation method, which gives a more effective weight calculation method. The method was tested on the Chinese text categorization corpus dataset from the open platform for Chinese natural language processing of Fudan University. The results showed that the new method for calculating feature-item weights was superior to the traditional weight calculation method. We conclude that the improved text clustering algorithm may be applicable in a wider range of settings.
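
As a point of reference only, the baseline that the proposed method modifies can be sketched as follows. The Python below computes traditional TF-IDF weights and clusters the resulting vectors with a basic alternating K-medoids. It does not implement the category information entropy or the semantic contribution, whose definitions are not given in this abstract. The function names, the cosine distance, the fixed initial medoids, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Traditional TF-IDF weights for tokenized documents (a list of token lists)."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0.0 or nv == 0.0:
        return 1.0  # assumption: an all-zero vector is maximally distant
    return 1.0 - dot / (nu * nv)


def k_medoids(vectors, initial_medoids, max_iter=100):
    """Alternating K-medoids: assign to the nearest medoid, then re-pick each medoid."""
    n = len(vectors)
    # Symmetric pairwise distance matrix with a zero diagonal.
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = cosine_distance(vectors[i], vectors[j])

    def assign(meds):
        # Each document joins the cluster of its nearest medoid.
        return [min(range(len(meds)), key=lambda c: dist[i][meds[c]]) for i in range(n)]

    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(old)  # keep the old medoid for an empty cluster
                continue
            # The new medoid minimizes the total distance to its cluster members.
            new_medoids.append(min(members, key=lambda m: sum(dist[m][i] for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Toy usage with hypothetical documents and hand-picked initial medoids.
docs = [
    ["stock", "market", "price"],
    ["stock", "price", "trade"],
    ["football", "match", "goal"],
    ["football", "goal", "team"],
]
medoids, labels = k_medoids(tfidf_vectors(docs), initial_medoids=[0, 2])
print(medoids, labels)
```

Unlike K-means, the cluster centers here are always actual documents, so only pairwise distances are needed. An improved weighting scheme of the kind described in the abstract would replace `tfidf_vectors` and leave the clustering step unchanged.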

Keywords

K-medoids, XML document clustering, UCI dataset, cluster center
