Thai Text Clustering with K-Means and TF-IDF in Python for Educational Applications

Authors

  • Yotanut Boonyo, Ph.D. student in the Methodology for Innovation Development in Education program, Department of Educational Research and Psychology, Faculty of Education, Chulalongkorn University

Keywords:

Thai Text Clustering, Qualitative Data, TF-IDF, K-Means, Python, Text Mining

Abstract

Qualitative text data—such as student feedback, interview transcripts, and online content—are increasingly available from web-based sources and institutional repositories. However, their unstructured nature makes large-scale analysis difficult. Text clustering helps organize such data by grouping documents with similar content. This tutorial presents a step-by-step workflow for clustering Thai-language texts using TF-IDF and the K-means algorithm in Python. It covers preprocessing, vectorization, clustering, and evaluation, with code examples based on Thai-language documents. The tutorial concludes with examples of educational applications, including analyzing open-ended survey responses, exploring curriculum topics, and identifying emerging themes in academic writing.
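The workflow named in the abstract (preprocessing, TF-IDF vectorization, K-means clustering, evaluation) can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the tutorial's own code. The four sample sentences are invented. Overlapping character trigrams stand in for proper Thai word segmentation, which a real pipeline would more likely delegate to a tokenizer such as PyThaiNLP. The smoothed IDF formula is one common variant. All function names (`char_ngrams`, `tfidf_vectors`, `kmeans`) are hypothetical, and within-cluster sum of squares (inertia) is used as a simple evaluation measure.

```python
import math
import random
from collections import Counter

# Hypothetical sample responses, invented for illustration only.
docs = [
    "อาจารย์สอนเข้าใจง่ายและอธิบายชัดเจน",  # about teaching
    "อาจารย์อธิบายเนื้อหาชัดเจนมาก",        # about teaching
    "ห้องสมุดมีหนังสือน้อยเกินไป",           # about the library
    "ห้องสมุดควรเพิ่มหนังสือใหม่",           # about the library
]

def char_ngrams(text, n=3):
    """Preprocessing stand-in: strip whitespace, emit character n-grams."""
    text = "".join(text.split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_vectors(texts):
    """Return L2-normalised TF-IDF vectors as dicts mapping term -> weight."""
    counts_per_doc = [Counter(char_ngrams(t)) for t in texts]
    n_docs = len(texts)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for counts in counts_per_doc for term in counts)
    vectors = []
    for counts in counts_per_doc:
        # Smoothed IDF (assumed variant): log((1 + N) / (1 + df)) + 1
        vec = {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def kmeans(vectors, k, n_iter=20, seed=0):
    """Plain Lloyd iterations with randomly chosen documents as initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each document goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels, centroids

k = 2
vectors = tfidf_vectors(docs)
labels, centroids = kmeans(vectors, k)
# Evaluation: within-cluster sum of squared distances (lower is tighter).
inertia = sum(sq_dist(v, centroids[l]) for v, l in zip(vectors, labels))

for doc, label in zip(docs, labels):
    print(label, doc)
print("inertia:", round(inertia, 3))
```

Because K-means depends on its starting centroids, the grouping can change with `seed`. In practice one would run several initializations and compare inertia or a silhouette score across candidate values of `k`.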

Published

2025-06-25

How to Cite

Boonyo, Y. (2025). Thai Text Clustering with K-Means and TF-IDF in Python for Educational Applications. Journal of Research Methodology, 38(1), 69–90. Retrieved from https://so12.tci-thaijo.org/index.php/jrm/article/view/2972

Section

Tutorial