A Model for Clustering and Predicting Customer Behavior in a Structured Data Set
Nataliya Boyko
Lviv Polytechnic National University, Lviv, 79013 Ukraine
https://doi.org/10.47191/jefms/v6-i3-06
ABSTRACT:
The main goal of this work is to study the process of lead scoring. Two ways of lead generation are explained: manual and using predictive models. The latter, per the investigation, is the most efficient way to score a lead. The analytical review starts with an overview of existing lead scoring methods, describing each and comparing their pros and cons. It also compares products that already exist on the market, such as HubSpot, Infer, PipeCandy and Maroon.ai. This comparison shows that the product of this study has the benefit of being a lightweight plugin with a significantly lower price than the market median. The paper lists the methods and algorithms used and describes why each was selected for a specific goal. Four aggregations were made to choose the best-fitting model, resulting in the selection of LightGBM. There is a comprehensive description of the LightGBM features and mathematical background, along with an explanation of GOSS (Gradient-based One-Side Sampling). After the model was selected, a set of data manipulations was performed, such as aggregation, class weight balancing, tuning and exhaustive analysis, and correction of class imbalance. The results section gives an extensive analysis of the model's performance after training on the data set and its use in a real-case CRM, Salesforce. The results show that the created plugin can be easily integrated into any CRM solution using native marketplaces or package-delivery systems.
KEYWORDS:
model, customer generation, random forest, software product, marketing.