XII-th International Conference "Knowledge - Dialogue - Solution"

It is well known [5] that SVM does not support online learning, requires solving computationally expensive nonlinear programming problems, and constructs an optimal separating hyperplane for two-class problems only. In this work we have also used a perceptron with an enlarged margin and a multi-class learning rule, which we developed in order to overcome these SVM drawbacks. In the developed perceptron, the outputs of the neurons that correspond to classes are determined as $y_c = \sum_i x_i w_{ic}$, where $w_{ic}$ are the weights of modifiable connections and $x_i$ is the value of the i-th element of the vector that is input to those connections (in the present context it was the binary vector of distributed representations, but the original data vector could be used for linear tasks as well). For the "true" class neuron, $y_{c_{true}} = y_{c_{true}}(1 - T)$, where 0 [...] $y_c > y_{c_{true}}$, where $c_{true}$ is the index of the correct class. E.g., $f(w) = w/|c|$. Our previous version of the enlarged margin perceptron had a single-class (not multi-class) learning rule: unlearning with the single class $c^* = \arg\max_c y_c$ was performed in case of error, and $f(w) = w$. For T=0 and the single-class learning rule one obtains the usual perceptron, while for T=0 and the multi-class learning rule one obtains the usual perceptron with multi-class learning. Multi-class learning extracts and uses more information from a single error and so provides a potential for faster learning and better generalization in essentially multi-class tasks, especially during the early learning iterations over the training set. This could be critical for online learning tasks.
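The following is a minimal sketch of a perceptron of this kind, not the authors' implementation. Only the output rule y_c = sum_i x_i w_ic and the margin scaling of the true-class output by (1 - T) are taken from the text. The exact update rule is an assumption: on an error, the input is added to the true-class weights and subtracted, divided by the number of offending classes, from each offending class (one reading of f(w) = w/|c|). Ties are treated as errors so that learning can start from zero weights, which is also an assumption. The names `train_perceptron`, `outputs`, `predict` and `SAMPLES` are hypothetical.

```python
def outputs(x, w, n_classes):
    """y_c = sum_i x_i * w[i][c] for every class c."""
    return [sum(xi * wi[c] for xi, wi in zip(x, w)) for c in range(n_classes)]


def predict(x, w, n_classes):
    """Return the class with the largest (unscaled) output."""
    y = outputs(x, w, n_classes)
    return max(range(n_classes), key=lambda c: y[c])


def train_perceptron(samples, n_inputs, n_classes, T=0.75, epochs=20,
                     multi_class=True):
    """Train on (x, c_true) pairs; x is a list of numbers (e.g. 0/1 codes)."""
    w = [[0.0] * n_classes for _ in range(n_inputs)]
    for _ in range(epochs):
        for x, c_true in samples:
            y = outputs(x, w, n_classes)
            y[c_true] *= (1.0 - T)  # enlarged margin: shrink the true output
            # Assumption: classes that tie with or beat the true class are errors.
            rivals = [c for c in range(n_classes)
                      if c != c_true and y[c] >= y[c_true]]
            if not rivals:
                continue
            if not multi_class:
                # Single-class rule: unlearn only the strongest rival.
                rivals = [max(rivals, key=lambda c: y[c])]
            step = 1.0 / len(rivals)  # assumed reading of f(w) = w/|c|
            for i, xi in enumerate(x):
                if xi:
                    w[i][c_true] += xi
                    for c in rivals:
                        w[i][c] -= xi * step
    return w


# Toy three-class data: each class owns two of the six binary inputs.
SAMPLES = [
    ([1, 1, 0, 0, 0, 0], 0), ([1, 0, 0, 0, 0, 0], 0),
    ([0, 0, 1, 1, 0, 0], 1), ([0, 0, 0, 1, 0, 0], 1),
    ([0, 0, 0, 0, 1, 1], 2), ([0, 0, 0, 0, 1, 0], 2),
]

if __name__ == "__main__":
    weights = train_perceptron(SAMPLES, n_inputs=6, n_classes=3)
    print([predict(x, weights, 3) for x, _ in SAMPLES])
```

Setting `T=0` with `multi_class=False` reduces the sketch to an ordinary perceptron update, which matches the special cases named in the text.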

Experimental Results for Numerical Data

Figure 1 demonstrates the results for the Leonard-Kramer problem: the dependence of the classification error percentage %err, the elementary cell size cell (the smaller the cell, the higher the resolution), and the average dimensionality of the receptive fields E{s} on the code density p (the fraction of 1s in the code). For Prager and RSC coding, the results of SVM and of the enlarged margin perceptron (T=0.75) with multi-class learning were averaged over 10 realizations of codes at N=100. Results for SVM with kernels (Kernel) are also shown. In all cases (as well as for large N, see Figure 2) the classification error reaches its minimum near p=0.25, which corresponds to the minimum cell and E{s} = 2.

[Figure 1 and Figure 2: %err, cell and E{S}/20 for Prager and RSC codes (curves Kernel, SVM, Perc); horizontal axis p in Figure 1 and N in Figure 2.]

Figure 2 demonstrates %err and cell vs N at p=0.25. The results were averaged over 10 realizations of codes. At N=500 the SVM result is already close to the kernel result. For the enlarged margin perceptron (T=0.75) with multi-class learning the error for N>(300–1000) was lower than that of SVM. Training of the perceptron was 20 times faster than that of SVM, while testing was more than 100 times faster.

The experimental results for the DataGen data are given in Figure 3 (A=4, S=3, C=4, R=4, where R determines the complexity of the class regions [6]); the number of samples per class is 100. The results were averaged over 5 realizations of the DataGen samples and 5 realizations of codes. For these parameters the minimum cell value corresponds to p~0.3 (and the cell is close to its minimum for p=0.125...0.5), and the error minimum for both SVM and the enlarged margin perceptron is also reached in this interval. For N=100 the minimum is shifted towards larger p values (which ensure a more stable number of 1s). For N=1000 the minimum is shifted towards smaller p, which corresponds to a larger mean dimensionality of the receptive fields, while the number of 1s remains large enough and the cell is small enough. The training time for the perceptron is ~20 times less than for SVM, and the testing time is ~500 times less.

Neural and Growing Networks

[Figure 3: %err and E{S} vs p for Prager and RSC codes; curves Kernel, SVM and Perc at N=100 and N=1000.]

We have also obtained and compared experimental results for multi-class and single-class learning perceptrons.

The error for the multi-class learning perceptron was up to 1.5 times lower than for the usual one, whereas the error for the multi-class perceptron with the enlarged margin was still lower and comparable with the error for the single-class perceptron with the enlarged margin. The learning curves (test error vs training iteration number) were typically lower for multi-class learning than for single-class learning, and the best results for multi-class perceptrons were typically higher than those for single-class perceptrons.

For the artificial data of the Elena database the code parameters were N=1000, A=S=2, p=0.25; for the real data (Iris, Phoneme, Satimage, Texture) they were N=10000, S=2, 5 (4), p=0.1 and 0.25. Table 1 shows the percentage of classification errors. For SVM and the perceptron the results were obtained by averaging over 10 realizations of RSC and Prager codes. The best results of the known methods kNN, MLP, IRVQ are also given [7]. The comparison shows that RSC and Prager coding provided the best result for Concentric, Phoneme and Texture, and the second best result for Satimage and Gaussian 7D. Perceptron training time is, on average, several times less than that of SVM, and test time is dozens of times less.

Table 1
RSC RSC RSC Prager Prager
Database kNN MLP IRVQ SVM kern. perc. SVM kern.
Clouds 12.68 14.84 – 12.4 14.8 11.8 12.2 11.
Concentric 1.36 1.2 – 1.17 1.04 1.7 2.8 1.
Gaussian2 S=2 28.12 35.12 – 27.83 35.64 27.4 26.8 27.
Gaussian7 S=2 14.35 15.68 – 14.36 15.76 15.9 15.3 11.
Gaussian7 S=5 14.69 14.64 – 13.36 15.12 – – –
Iris S=2 6.53 6.67 5.33 5.59 6.67 4 4.3 6.
Iris S=4 4.27 6.67 5.73 6.13 6.67 – – –
Phoneme S=2 14.12 11.51 13.7 15.79 14.47 12.3 16.3 16.
Phoneme S=5 13.61 11.62 13.19 14.82 12.62 – – –
Satimage S=2 10.06 10.13 9.15 10.82 10.79 9.9 12.3 11.
Satimage S=5 10.11 – 9.1 10.64 – – – –
Texture S=2 0.82 0.76 1.13 0.82 0.80 1.9 2.0 3.
Texture S=5 0.73 – 1.07 0.74 – – – –

Classification of Texts and Images

Traditional approaches to text classification use functions of word occurrence frequencies as elements of their vector representations. To reduce vectors’ dimensionality, methods for informative feature selection can be used [4]; however, even simplified methods that consider features as independent have quadratic computational complexity. We propose and investigate usage of distributed representations for dimensionality reduction of vector text representation. N-dimensional binary code with m 1s in random positions is used to represent each word. N-dimensional text representation is formed by summation of its word vectors, and transformation back to binary space is performed by a threshold operation, or by context-dependent thinning CDT (see [3]).

Testing in the classification task has been conducted using Reuters-21578 text collection [3] by means of SVM.

For the TOP-10 categories, the BEP (break-even point of the recall/precision characteristic) for the initial vector representation with N*=20000 was 0.920/0.863 (micro/macro averaging). Using the distributed representations with N=1000, m=2 made it possible to obtain 0.861/0.775 (micro/macro averaging), and the use of CDT in some experiments increased it by several percent.
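The construction of the distributed text representation (an N-dimensional binary word code with m 1s in random positions, summation over the words of a text, and a threshold to return to binary space) can be illustrated with a short sketch. This is not the authors' code: the function names, the way the random positions are derived from the word, and the threshold value of 1 are assumptions, and context-dependent thinning (CDT) is not shown.

```python
import random


def word_code(word, N=1000, m=2, seed=0):
    """Positions of the m ones in the word's N-dimensional binary code.

    Assumption: positions are drawn from a generator seeded by the word,
    so the same word always receives the same code.
    """
    rng = random.Random(f"{seed}:{word}")
    return frozenset(rng.sample(range(N), m))


def text_code(words, N=1000, m=2, threshold=1, seed=0):
    """Sum the word codes of a text and binarize the sum by a threshold."""
    counts = [0] * N
    for word in words:
        for pos in word_code(word, N, m, seed):
            counts[pos] += 1
    return [1 if c >= threshold else 0 for c in counts]


if __name__ == "__main__":
    code = text_code("grain wheat export grain".split())
    print(len(code), sum(code))
```

With threshold 1 the text code is simply the union of its word codes; a higher threshold keeps only positions shared by several word occurrences.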

Distributed representations formed in the same way have been studied for the classification of handwritten digit images from the MNIST database [4]. Images were coded by extracting binary features. The presence of each feature corresponded to a combination of white and black points in certain positions of the retina (LIRA features [4]). As a result, a "primary" binary code was obtained. It was then transformed to the "secondary" representation using the same procedures as for text information.
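As an illustration of a binary feature defined by a combination of white and black points, the sketch below builds random features and checks them against a binarized image. It is a sketch under stated assumptions, not the LIRA implementation from [4]: the number of points per feature, their unrestricted random placement over the whole image, the convention that 1 means white, and the names `make_features` and `encode` are all hypothetical.

```python
import random


def make_features(n_features, n_pixels, n_white=2, n_black=2, seed=0):
    """Each feature is a pair (white positions, black positions)."""
    rng = random.Random(seed)
    features = []
    for _ in range(n_features):
        positions = rng.sample(range(n_pixels), n_white + n_black)
        features.append((positions[:n_white], positions[n_white:]))
    return features


def encode(image, features):
    """Primary binary code of a flat 0/1 image (assumption: 1 = white).

    A feature is present (1) only if all its white positions are white
    and all its black positions are black.
    """
    return [
        1 if all(image[p] == 1 for p in white)
        and all(image[p] == 0 for p in black) else 0
        for white, black in features
    ]


if __name__ == "__main__":
    feats = make_features(n_features=50, n_pixels=28 * 28)
    blank = [0] * (28 * 28)
    print(sum(encode(blank, feats)))
```

The resulting primary code is a plain list of 0s and 1s and could be passed to any further binary transformation.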

Classification results with dimensionality reduction from N* to N are shown in Table 2. Line “sel” contains the error percent obtained using selection of informative features [4]. Line “distr” contains classification results for the "secondary" binary distributed representations. Results for the distributed representations considerably exceed the results of initial representations for the same N and are similar to the results of feature selection methods [4].

We have also obtained and compared MNIST experimental results for multi-class and single-class learning perceptrons. Here we used the original LIRA features without transformation to secondary distributed representations, for N={1000, 10000, 50000}, both with and without feature selection. We observed the same tendencies as for numerical data; however, the advantage of multi-class learning was more pronounced for weaker classifiers (at N=1000) than for better ones (at N=50000).

Table 2
N (err)        N*       sel    distr
5000 (667)     1000     820    904
10000 (407)    1000     578    727
10000 (407)    5000     420    415
50000 (195)    1000     492    632
50000 (195)    5000     264    274
50000 (195)    10000    242    213
128000 (160)   1000     474    826
128000 (160)   5000     261    264

Conclusions

The developed binary distributed representations of vector data (numeric, text, images) were investigated in classification tasks. A comparative analysis of the results of various methods for tasks with artificial and real data was carried out. The study showed that the analytical expressions for the characteristics of the RSC-Prager codes of numerical vectors obtained in [2] make it possible to select code parameters that provide high results in nonlinear classification tasks using linear classifiers. Results obtained with the proposed perceptron with an enlarged margin are comparable with the results of state-of-the-art SVM classifiers, while a significant decrease in training and recognition time has been observed. The results obtained with the RSC-Prager kernels also make it possible to reduce training and test time for small S.

Application of distributed encoding to the representation of binary features in texts and images also made it possible to obtain computationally efficient solutions to classification tasks while preserving classification quality. A promising direction for further studies could be the development of computationally efficient RSC and Prager kernels, as well as of distributed representations and kernels that take better account of structural information in the input data.

Bibliography

[1] R. Duda, P. Hart, D. Stork. Pattern Classification, 2nd ed. – New York: John Wiley & Sons, 2000.

[2] S.V. Slipchenko, I.S. Misuno, D.A. Rachkovskij. Properties of coarse coding with random hyperrectangle receptive fields. Mathematical machines and systems, N 4, pp. 15–29, 2005 (in Russian).

[3] I.S. Misuno. Distributed vector representation and classification of texts. USiM, N 1, pp. 85–91, 2006 (in Russian).

[4] S.V. Slipchenko, D.A. Rachkovskij, I.S. Misuno. The experimental research of handwritten digits classification. System technologies, N 4 (39), pp. 110–133, 2005 (in Russian).

[5] V.N. Vapnik. Statistical Learning Theory. – New York: John Wiley & Sons, 1998.

[6] I.S. Misuno, D.A. Rachkovskij, E.G. Revunova, S.V. Slipchenko, A.M. Sokolov, A.E. Teteryuk. Modular software neurocomputer SNC - implementation and applications. USiM, N 2, pp. 74–85, 2005 (in Russian).

[7] D. Zhora. Evaluating Performance of Random Subspace Classifier on ELENA Classification Database. Artificial Neural Networks: Biological Inspirations – ICANN 2005 – Springer–Verlag Berlin Heidelberg, pp. 343–349, 2005.

Authors' Information

Ivan S. Misuno – e-mail: i.misuno@longbow.kiev.ua
Dmitri A. Rachkovskij – e-mail: dar@infrm.kiev.ua
Sergey V. Slipchenko – e-mail: slipchenko_serg@ukr.net
International Research and Training Center of Information Technologies and Systems; Pr. Acad. Glushkova, 40, Kiev, 03680, Ukraine

SELECTING CLASSIFIERS TECHNIQUES FOR OUTCOME PREDICTION USING NEURAL NETWORKS APPROACH

Tatiana Shatovskaya

Abstract: This paper presents an analysis of different techniques, designed to aid a researcher in determining which classification technique is most appropriate for choosing among the ridge, robust and linear regression methods when predicting outcomes for a specific quasi-stationary process.

Keywords: classification techniques, neural network, composite classifier

ACM Classification Keywords: F.2.1 Numerical Algorithms and Problems

1. Introduction

There are many approaches to building mathematical models of a quasi-stationary process with multicollinearity and noise. For example, ridge regression is a linear-regression variant that is used for highly correlated independent variables, as is often the case for a set of predictors that are designed to approximate the same function [1]. Ridge regression adds a constraint that the sum of the squares of the regression coefficients be equal to a constant. Varying this parameter produces a set of predictors. Robust methods for estimating the parameters of a mathematical model are stable with respect to violations of the normality assumption for the model residuals. They are insensitive to errors in the dependent variable and also take into account the degree of influence of points in the factor space, that is, they reveal outliers in the independent variables, which makes it possible to obtain efficient estimates of the regression coefficients. For all methods, a necessary condition for the consistency of their estimates is symmetry of the distribution of the regression model errors.
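To make the ridge regression idea concrete, the sketch below uses the commonly used penalized formulation, in which a parameter lam times the sum of squared coefficients is added to the least-squares criterion; this corresponds to constraining the size of the coefficients, with each value of lam giving one predictor. It is a minimal illustration, not the method used in this paper: the names `ridge`, `solve` and `lam`, the absence of an intercept term, and the toy data with two nearly collinear predictors are all assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge(X, y, lam):
    """Coefficients minimizing ||X b - y||^2 + lam * ||b||^2 (no intercept)."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)]
           for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    for i in range(p):
        XtX[i][i] += lam  # the ridge term stabilizes the normal equations
    return solve(XtX, Xty)


# Toy data: two highly correlated predictors.
X = [[1.0, 1.01], [2.0, 1.98], [3.0, 3.02], [4.0, 3.99]]
y = [2.0, 4.0, 6.0, 8.0]

if __name__ == "__main__":
    for lam in (0.1, 1.0, 10.0):
        print(lam, ridge(X, y, lam))
```

Increasing `lam` shrinks the coefficient vector, which is what makes the estimates less sensitive to multicollinearity.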

The main problem for the researcher, however, is how to select an appropriate method for a given task. In some cases, using only one classification method for choosing the estimation method may not solve the problem. A multitude of techniques exists for modeling process outcomes, and selecting the modeling technique to use for a given class of process is a nontrivial problem, as there are many techniques from which to choose. It could be that the modeling technique used is not the most appropriate for the task and that accuracy can be increased through the use of a more appropriate model. There are many reasons why a model may have low predictive value.
