Year: 2017 Volume: 25 Issue: 4 Page Range: 2614 - 2634 Text Language: English Indexing Date: 29-07-2022

A novel approach for extracting ideal exemplars by clustering for massive time-ordered datasets

Abstract:
The number and length of massive datasets increase day by day, and this makes the machine learning stages more complex owing to high computational costs. Many methods have been proposed in the literature to decrease this cost, such as data condensing, feature selection, and filtering. Although clustering methods are generally employed to divide samples into groups, clustering can also be used for data condensing by determining ideal exemplars (or prototypes) that stand in for the whole dataset. In this study, the efficiency of the traditional data condensing by clustering approach was first confirmed in terms of the accuracies and condensing ratios obtained on 9 different synthetic or real batch datasets. The approach was then improved so that it can be employed on time-ordered datasets. To validate the proposed approach, 23 different real time-ordered datasets were used in the experiments. The mean RMSE achieved was 0.27 with the condensed datasets (mean condensing ratio: 97.17%) versus 0.29 with the whole datasets. These results show that the proposed approach achieves both higher accuracy rates and higher condensing ratios.
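To make the condensing-by-clustering idea in the abstract concrete: cluster the training samples, keep one exemplar (centroid) per cluster, and train on the exemplars instead of the full set. The following is a minimal Python sketch under stated assumptions; the clustering method (k-means), the regressor (k-nearest neighbors), the number of clusters, and the toy lagged time series are illustrative choices, not the paper's exact configuration or datasets.

```python
# Minimal sketch of data condensing by clustering for a time-ordered
# regression task. Assumptions (not from the paper): k-means clustering,
# a k-NN regressor, k = 50 exemplars, and synthetic lagged sine data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Toy time-ordered data: predict x(t) from two lagged values.
t = np.linspace(0, 20, 2000)
x = np.sin(t) + 0.1 * rng.standard_normal(t.size)
X = np.column_stack([x[:-2], x[1:-1]])  # two lagged inputs
y = x[2:]                               # one-step-ahead target

# Chronological split: never shuffle time-ordered data.
split = int(0.8 * X.shape[0])
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Condense: cluster the joint (input, target) vectors and keep one
# exemplar (the centroid) per cluster instead of every sample.
k = 50  # illustrative; far fewer exemplars than training samples
km = KMeans(n_clusters=k, n_init=10, random_state=0)
km.fit(np.column_stack([X_train, y_train]))
exemplars = km.cluster_centers_
X_cond, y_cond = exemplars[:, :-1], exemplars[:, -1]

condensing_ratio = 1.0 - k / X_train.shape[0]
print(f"condensing ratio: {condensing_ratio:.2%}")

# Compare a model trained on the exemplars against one trained on
# the whole training set, mirroring the paper's RMSE comparison.
for name, (Xtr, ytr) in {"condensed": (X_cond, y_cond),
                         "whole": (X_train, y_train)}.items():
    model = KNeighborsRegressor(n_neighbors=3).fit(Xtr, ytr)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(f"{name}: RMSE = {rmse:.3f}")
```

With a high condensing ratio (here about 97%, comparable to the 97.17% the paper reports), the condensed model sees only a small fraction of the training samples yet can match the full model's error when the exemplars cover the data distribution well.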
Keywords:

Subjects: Engineering, Electrical and Electronics
Document Type: Article Article Type: Research Article Access Type: Open Access
APA Ertuğrul, Ö. (2017). A novel approach for extracting ideal exemplars by clustering for massive time-ordered datasets. Turkish Journal of Electrical Engineering and Computer Sciences, 25(4), 2614-2634.
Chicago Ertuğrul, Ömer Faruk. "A novel approach for extracting ideal exemplars by clustering for massive time-ordered datasets." Turkish Journal of Electrical Engineering and Computer Sciences 25, no. 4 (2017): 2614-2634.
MLA Ertuğrul, Ömer Faruk. "A novel approach for extracting ideal exemplars by clustering for massive time-ordered datasets." Turkish Journal of Electrical Engineering and Computer Sciences, vol. 25, no. 4, 2017, pp. 2614-2634.
AMA Ertuğrul Ö. A novel approach for extracting ideal exemplars by clustering for massive time-ordered datasets. Turkish Journal of Electrical Engineering and Computer Sciences. 2017; 25(4): 2614-2634.
Vancouver Ertuğrul Ö. A novel approach for extracting ideal exemplars by clustering for massive time-ordered datasets. Turkish Journal of Electrical Engineering and Computer Sciences. 2017; 25(4): 2614-2634.
IEEE Ö. Ertuğrul, "A novel approach for extracting ideal exemplars by clustering for massive time-ordered datasets," Turkish Journal of Electrical Engineering and Computer Sciences, vol. 25, no. 4, pp. 2614-2634, 2017.
ISNAD Ertuğrul, Ömer Faruk. "A novel approach for extracting ideal exemplars by clustering for massive time-ordered datasets". Turkish Journal of Electrical Engineering and Computer Sciences 25/4 (2017), 2614-2634.