A note of caution: comparing different classifiers is not an easy task. Before ranking methods by the numbers presented in the tables below, please note the following facts.
Many results we have collected give only a single number (even results from the StatLog project!), without a standard deviation. Since most classifiers may give results that differ by several percent on slightly different data partitions, single numbers do not mean much.
Leave-one-out tests have been criticized as a basis for accuracy evaluation; the conclusion is that cross-validation is safer, cf.: Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proc. of the 14th Int. Joint Conference on Artificial Intelligence, Morgan Kaufmann, pp. 1137-1143.
Cross-validation (CV) tests are also not ideal. Theoretically about 2/3 of the results should be within a single standard deviation from the average, and 95% of the results should be within two standard deviations, so in a 10-fold cross-validation you should very rarely see results that are better or worse than the average by more than 2 STDs. Running CV several times may also give you different answers. The search for the best estimator continues. Cf.: Dietterich, T. (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10 (7), 1895-1924; Nadeau C, Bengio Y. (1999) Inference for the Generalization Error. Tech. rep. 99s-25, CIRANO, J. Machine Learning (Kluwer, in print).
Even the best accuracy and variance estimation is not sufficient, since performance cannot be characterized by a single number. It would be much better to provide full Receiver Operating Characteristic (ROC) curves. Combining ROC with variance estimation would be ideal. Unfortunately this still remains to be done.
All we can do now is to collect some numbers in tables. Our results are usually obtained with the GhostMiner package, developed in our group. Some publications with results are on my page.
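The point about single numbers can be illustrated with a minimal sketch of a stratified 10-fold cross-validation that reports the mean accuracy together with its fold-to-fold standard deviation. The helper names (`stratified_folds`, `majority_accuracy`, `cross_validate`) are hypothetical, and the "classifier" is only the majority-class baseline, applied to the Appendicitis class counts (85 + 21) quoted below; this is an illustration, not the procedure used for the tables.

```python
import random
import statistics

def stratified_folds(labels, k, rng):
    """Deal the indices of each class round-robin into k folds."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds

def majority_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training class."""
    majority = max(set(train_labels), key=train_labels.count)
    return sum(label == majority for label in test_labels) / len(test_labels)

def cross_validate(labels, k=10, seed=0):
    """Return (mean, standard deviation) of the k fold accuracies."""
    rng = random.Random(seed)
    scores = []
    for fold in stratified_folds(labels, k, rng):
        held_out = set(fold)
        train = [y for i, y in enumerate(labels) if i not in held_out]
        test = [labels[i] for i in fold]
        scores.append(majority_accuracy(train, test))
    return statistics.mean(scores), statistics.stdev(scores)

# Class counts taken from the Appendicitis description: 85 + 21 vectors.
labels = [0] * 85 + [1] * 21
mean_acc, std_acc = cross_validate(labels)
print(f"{100 * mean_acc:.1f} ± {100 * std_acc:.1f} %")
```

Even this trivial baseline shows a non-zero spread between folds, simply because the folds differ slightly in size and class composition; a real classifier would add its own variance on top of that.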
TuneIT, Testing Machine Learning & Data Mining Algorithms - Automated Tests, Repeatable Experiments, Meaningful Results. Results of handwritten signs and numbers classification are here.
Appendicitis.
106 vectors, 8 attributes, two classes (85 acute a. + 21 other, or 80.2% + 19.8%), data from Shalom Weiss. Results obtained with the leave-one-out test, % of accuracy given.
Attribute names: WBC1, MNEP, MNEA, MBAP, MBAA, HNEP, HNEA
For 90% accuracy and p=0.95 confidence level, 2-tailed bounds are: [82.8%,94.4%]
S.M. Weiss, I. Kapouleas, "An empirical comparison of pattern recognition, neural nets and machine learning classification methods", in: J.W. Shavlik and T.G. Dietterich, Readings in Machine Learning, Morgan Kaufmann Publ, CA 1990
H.J. Hamilton, N. Shan, N. Cercone, RIAC: a rule induction algorithm based on approximate classification, Tech. Rep. CS 96-06, Regina University 1996.
CMLP2LN (logical rules): only the leave-one-out accuracy was estimated, since the rules are like PVM. 3 crisp logical rules, overall 91.5% accuracy.
Results for 10-fold stratified cross-validation:
Method | Accuracy % | Reference
NBC+WX+G(WX) | ??.5±7.7 | TMGM
NBC+G(WX) | ??.2±6.7 | TMGM
kNN auto+G(WX) Eucl. | ??.2±6.7 | TMGM
CMLP2LN | 89.6 | our logical rules
20NN, stand. Eucl. f 4,1,7 | 89.3±8.6 | our (KG); feature sel. from CV on the whole data set
From the UCI repository: 699 cases, 9 attributes, two classes, 458 (65.5%) & 241 (34.5%). Results obtained with the leave-one-out test, % of accuracy given.
F6 has 16 missing values, removing these vectors leaves 683 examples.
Method | Accuracy % | Reference
FSM | 98.3 | our (RA)
3NN stand. Manhattan | 97.1 | our (KG)
21NN stand. Euclidean | 96.9 | our (KG)
C4.5 (decision tree) | 96.0 | Hamilton et al.
RIAC (prob. inductive) | 95.0 | Hamilton et al.
H.J. Hamilton, N. Shan, N. Cercone, RIAC: a rule induction algorithm based on approximate classification, Tech. Rep. CS 96-06, Regina University 1996.
Results obtained with 10-fold cross-validation, 16 vectors with missing F6 values removed, 683 samples left, % of accuracy given.
Method | Accuracy % | Reference
Naive MFT | 97.1 | Opper, Winther, L1O est. 97.3
SVM Gauss, C=1, s=0.1 | 97.0±2.3 | WDGM
SVM (10xCV) | 96.9 | Opper, Winther
SVM lin, opt C | 96.9±2.2 | WDGM, same with Minkowski kernel
Cluster means, 2 prototypes | 96.5±2.2 | MB
Default, majority | 65.5 |
Results obtained with 10-fold cross-validation, % of accuracy given, all data, missing values handled in different ways.
Method | Accuracy % | Reference
NB + kernel est | 97.5±1.8 | WD, WEKA, 10X10CV
SVM (5xCV) | 97.2 | Bennett and Blue
kNN with DVDM distance | 97.1 | our (KG)
GM kNN, k=3, raw, Manh | 97.0±2.1 | WD, 10X10CV
GM kNN, k=opt, raw, Manh | 97.0±1.7 | WD, 10CV only
VSS, 8 it/2 neurons | 96.9±1.8 | WD/MK; 98.1% train
FSM - Feature Space Mapping | 96.9±1.4 | RA/WD, a=.99 Gaussian
Fisher linear discr. anal | 96.8 | Ster, Dobnikar
MLP+BP | 96.7 | Ster, Dobnikar
MLP+BP (Tooldiag) | 96.6 | Rafał Adamczak
LVQ | 96.6 | Ster, Dobnikar
kNN, Euclidean/Manhattan f. | 96.6 | Ster, Dobnikar
SNB, semi-naive Bayes (pairwise dependent) | 96.6 | Ster, Dobnikar
SVM lin, opt C | 96.4±1.2 | WDGM, 16 missing with 10
VSS, 8 it/1 neuron! | 96.4±2.0 | WD/MK, train 98.0%
GM IncNet | 96.4±2.1 | NJ/WD; FKF, max. 3 neurons
NB - naive Bayes (completely independent) | 96.4 | Ster, Dobnikar
SSV opt nodes, 3CV int | 96.3±2.2 | WD/GM; training 96.6±0.5
IB1 | 96.3±1.9 | Zarndt
DBCART (decision tree) | 96.2 | Shang, Breiman
GM SSV Tree, opt nodes BFS | 96.0±2.9 | WD/KG (beam search 94.0)
LDA - linear discriminant analysis | 96.0 | Ster, Dobnikar
OC1 DT (5xCV) | 95.9 | Bennett and Blue
RBF (Tooldiag) | 95.9 | Rafał Adamczak
GTO DT (5xCV) | 95.7 | Bennett and Blue
ASI - Assistant I tree | 95.6 | Ster, Dobnikar
MLP+BP (Weka) | 95.4±0.2 | TW/WD
OCN2 | 95.2±2.1 | Zarndt
IB3 | 95.0±4.0 | Zarndt
MML tree | 94.8±1.8 | Zarndt
ASR - Assistant R (RELIEF criterion) tree | 94.7 | Ster, Dobnikar
C4.5 tree | 94.7±2.0 | Zarndt
LFC, Lookahead Feature Constr binary tree | 94.4 | Ster, Dobnikar
CART tree | 94.4±2.4 | Zarndt
ID3 | 94.3±2.6 | Zarndt
C4.5 (5xCV) | 93.4 | Bennett and Blue
C 4.5 rules | 86.7±5.9 | Zarndt
Default, majority | 65.5 |
QDA - quadratic discr anal | 34.5 | Ster, Dobnikar
For 97% accuracy and p=0.95 confidence level, 2-tailed bounds are: [95.5%,98.0%]
K.P. Bennett, J. Blue, A Support Vector Machine Approach to Decision Trees, R.P.I Math Report No. 97-100, Rensselaer Polytechnic Institute, Troy, NY, 1997
N. Shang, L. Breiman, ICONIP'96, p.133
B. Ster and A. Dobnikar, Neural networks in medical diagnosis: Comparison with other methods. In A. Bulsari et al., editor, Proceedings of the International Conference EANN '96, pages 427-430, 1996.
F. Zarndt, A Comprehensive Case Study: An Examination of Machine Learning and Connectionist Algorithms, MSc Thesis, Dept. of Computer Science, Brigham Young University, 1995
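The 2-tailed bounds quoted on this page for the appendicitis (106 cases), breast cancer (699 cases) and thyroid training (3772 cases) data are reproduced by the Wilson score interval for a binomial proportion. That this is the formula originally used is an assumption; the function name `wilson_interval` is hypothetical. A minimal sketch:

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(accuracy, n, confidence=0.95):
    """Two-tailed Wilson score interval for an accuracy measured on n cases."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    denom = 1 + z * z / n
    center = (accuracy + z * z / (2 * n)) / denom
    half_width = z * sqrt(accuracy * (1 - accuracy) / n
                          + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Accuracy and number of cases as quoted in the text.
for acc, n in [(0.90, 106), (0.97, 699), (0.999, 3772)]:
    low, high = wilson_interval(acc, n)
    print(f"{100 * acc:.2f}% on {n} cases: [{100 * low:.2f}%, {100 * high:.2f}%]")
```

The interval is asymmetric around the measured accuracy, which is why the quoted bounds are not simply accuracy ± a constant.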
From the UCI repository (restricted): 286 instances, 201 no-recurrence-events (70.3%), 85 recurrence-events (29.7%); 9 attributes, between 2 and 13 values each, 9 missing values. Results: 10xCV? Sometimes the methodology was unclear; this is difficult, noisy data, and some methods are below the base rate (70.3%).

For 78% accuracy and p=0.95 confidence level, 2-tailed bounds are: [72.9%,82.4%]
Assistant-86 achieved 78%, but this seems to be the best result obtained in some cross-validation runs, not the average.
Cestnik, G., Kononenko, I., & Bratko, I. (1987). Assistant-86: A Knowledge-Elicitation Tool for Sophisticated Users. In I. Bratko & N. Lavrac (Eds.) Progress in Machine Learning, 31-45, Sigma Press.
Clark, P. & Niblett, T. (1987). Induction in Noisy Domains. In: Progress in Machine Learning (from the Proceedings of the 2nd European Working Session on Learning), 11-30, Bled, Yugoslavia: Sigma Press.
Porter R.B., G. Beate Zimmer, Don R. Hush: Stack Filter Classifiers. ISMM 2009: 282-294
Michalski, R.S., Mozetic, I., Hong, J., & Lavrac, N. (1986). The Multi-Purpose Incremental Learning System AQ15 and its Testing Application to Three Medical Domains. In Proceedings of the Fifth National Conference on Artificial Intelligence, 1041-1045, Philadelphia, PA: Morgan Kaufmann.
Tan, M., & Eshelman, L. (1988). Using weighted networks to represent classification knowledge in noisy domains. Proceedings of the Fifth International Conference on Machine Learning, 121-134, Ann Arbor, MI.
F. Zarndt, A Comprehensive Case Study: An Examination of Machine Learning and Connectionist Algorithms, MSc Thesis, Dept. of Computer Science, Brigham Young University, 1995
S.M. Weiss, I. Kapouleas. An empirical comparison of pattern recognition, neural nets and machine learning classification methods, in: J.W. Shavlik and T.G. Dietterich, Readings in Machine Learning, Morgan Kaufmann Publ, CA 1990
They used leave-one-out tests and obtained: MLP+backprop: 75.7% train, 71.5% test; Bayes: 75.9% train, 71.8% test; CART & PVM: 77.4% train, 77.1% test; kNN: 65.3% test.
From the UCI repository: 155 vectors, 19 attributes, two classes, die with 32 (20.6%), live with 123 (79.4%). Many missing values! F18 has 67 missing values, F15 has 29, F17 has 16, and the other features between 0 and 11. Results obtained with the leave-one-out test, % of accuracy given.
Method | Accuracy, % test | Reference
CMLP2LN/SSV single rule | 76.2±0.0 | WD/K. Grabczewski, stable rule
SSV Tree rule | 75.7±1.1 | WD, av. from 10x10CV
MML Tree | 75.3±7.8 | Zarndt
SVM Gauss, C=1, s=0.1 | 73.8±4.3 | WD, GM
MLP+backprop | 73.5±9.4 | Zarndt
SVM Gauss, C, s opt | 72.4±5.1 | WD, GM
IB1 | 71.8±7.5 | Zarndt
CART | 71.4±5.0 | Zarndt
ODT trees | 71.3±4.2 | Blanchard
SVM lin, C=opt | 71.0±4.7 | WD, GM
UCN 2 | 70.7±7.8 | Zarndt
SFC, Stack filters | 70.6±4.2 | Porter
Default, majority | 70.3±0.0 | ==
SVM lin, C=1 | 70.0±5.6 | WD, GM
C 4.5 rules | 69.7±7.2 | Zarndt
Bayes rule | 69.3±10.0 | Zarndt
C 4.5 | 69.2±4.9 | Blanchard
Weighted networks | 68-73.5 | Tan, Eshelman
IB3 | 67.9±7.7 | Zarndt
ID3 rules | 66.2±8.5 | Zarndt
AQ15 | 66-72 | Michalski e.a.
Inductive | 65-72 | Clark, Niblett
Method | Accuracy % | Reference
21NN, stand Manhattan | 90.3 | our (KG)
FSM | 90.0 | our (RA)
14NN, stand. Euclid | 89.0 | our (KG)
LDA | 86.4 | Weiss & K
CART (decision tree) | 82.7 | Weiss & K
MLP+backprop | 82.1 | Weiss & K
MLP, CART, LDA results from (check it ?) S.M. Weiss, I. Kapouleas, "An empirical comparison of pattern recognition, neural nets and machine learning classification methods", in: J.W. Shavlik and T.G. Dietterich, Readings in Machine Learning, Morgan Kaufmann Publ, CA 1990. Other results are our own.
Results obtained with 10-fold cross-validation, % of accuracy given; our results use stratified cross-validation, other results: who knows? Differences for this dataset are rather small, 0.1-0.2%.
Method | Accuracy % | Reference
Weighted 9NN | 92.9±? | Karol Grudziński
18NN, stand. Manhattan | 90.2±0.7 | Karol Grudziński
FSM with rotations | 89.7±? | Rafał Adamczak
15NN, stand. Euclidean | 89.0±0.5 | Karol Grudziński
VSS 4 neurons, 5 it | 86.5±8.8 | WD/MK, train 97.1
FSM without rotations | 88.5 | Rafał Adamczak
LDA, linear discriminant analysis | 86.4 | Ster & Dobnikar
Naive Bayes and SemiNB | 86.3 | Ster & Dobnikar
IncNet | 86.0 | Norbert Jankowski
QDA, quadratic discriminant analysis | 85.8 | Ster & Dobnikar
1NN | 85.3±5.4 | Ster & Dobnikar, std added by WD
VSS 2 neurons, 5 it | 85.1±7.4 | WD/MK, train 95.0
ASR | 85.0 | Ster & Dobnikar
Fisher discriminant analysis | 84.5 | Ster & Dobnikar
LVQ | 83.2 | Ster & Dobnikar
CART (decision tree) | 82.7 | Ster & Dobnikar
MLP with BP | 82.1 | Ster & Dobnikar
ASI | 82.0 | Ster & Dobnikar
LFC | 81.9 | Ster & Dobnikar
RBF (Tooldiag) | 79.0 | Rafał Adamczak
MLP+BP (Tooldiag) | 77.4 | Rafał Adamczak
Results on BP, LVQ, ..., SNB are from: B. Ster and A. Dobnikar, Neural networks in medical diagnosis: Comparison with other methods. In A. Bulsari et al., editor, Proceedings of the International Conference EANN '96, pages 427-430, 1996.
Our good results reflect superior handling of missing values?
Duch W, Grudziński K (1998) A framework for similarity-based methods. Second Polish Conference on Theory and Applications of Artificial Intelligence, Lodz, 28-30 Sept. 1998, pp. 33-60
Weighted kNN: Duch W, Grudziński K and Diercksen G.H.F (1998) Minimal distance neural methods. World Congress of Computational Intelligence, May 1998, Anchorage, Alaska, IJCNN'98 Proceedings, pp. 1299-1304
Attribute types: Real: 1,4,5,8,10,12; Ordered: 11; Binary: 2,6,9; Nominal: 7,3,13. Classes: absence (1) or presence (2) of heart disease. In the Statlog experiments on heart data a cost or risk matrix was used with 9-fold cross-validation, and only cost values are given. Results below are obtained with 10-fold cross-validation, % of accuracy given, no risk matrix.
From the UCI repository: 303 cases, 13 attributes (4 cont, 9 nominal), 7 vectors with missing values? 2 (no, yes) or 5 classes (no, degree 1, 2, 3, 4). Class distribution: 164 (54.1%) no, 55+36+35+13 yes (45.9%) with disease degree 1-4. Results obtained with the leave-one-out test, % of accuracy given, 2 classes used.
Method | Accuracy % | Reference
LDA | 84.5 | Weiss ?
25NN, stand, Euclid | 83.6±0.5 | WD/KG repeat??
CMLP2LN | 82.5 | RA, estimated?
FSM | 82.2 | Rafał Adamczak
MLP+backprop | 81.3 | Weiss ?
CART | 80.8 | Weiss ?
MLP, CART, LDA: where are these results from??? Other results are our own.
Results obtained with 10-fold cross-validation, % of accuracy given. Ster & Dobnikar reject 6 vectors with missing values (leaving 297). We use all 303 vectors, replacing missing values by the means for their class; in kNN we have used the Statlog convention, 297 vectors.
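Replacing missing values by the mean for their class can be sketched as follows. This is only an illustration under stated assumptions: missing entries are encoded as `None`, the function name `impute_by_class_mean` is hypothetical, and every class is assumed to have at least one observed value per attribute. It is not the GhostMiner implementation.

```python
def impute_by_class_mean(rows, labels):
    """Fill None entries with the mean of that attribute within the same class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            if value is not None:
                sums[(label, j)] = sums.get((label, j), 0.0) + value
                counts[(label, j)] = counts.get((label, j), 0) + 1
    return [
        [sums[(label, j)] / counts[(label, j)] if value is None else value
         for j, value in enumerate(row)]
        for row, label in zip(rows, labels)
    ]

# Toy data (hypothetical): the first vector of class 1 lacks its second attribute.
rows = [[1.0, None], [3.0, 4.0], [10.0, 20.0]]
labels = [1, 1, 2]
filled = impute_by_class_mean(rows, labels)
print(filled)
```

Because the class label is used for imputation, the means would normally be computed from the training part only inside each cross-validation fold; otherwise label information leaks into the test vectors.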
For 85% accuracy and p=0.95 confidence level, 2-tailed bounds are: [80.5%,88.6%]
Results obtained with BP, LVQ, ..., SNB are from: B. Ster and A. Dobnikar, Neural networks in medical diagnosis: Comparison with other methods. In: A. Bulsari et al., editor, Proceedings of the International Conference EANN '96, pages 427-430, 1996.
Magnus Stensmo and Terrence J. Sejnowski, A Mixture Model System for Medical and Machine Diagnosis, Advances in Neural Information Processing Systems 7 (1995) 1077-1084
Kristin P. Bennett, J. Blue, A Support Vector Machine Approach to Decision Trees, R.P.I Math Report No. 97-100, Rensselaer Polytechnic Institute, Troy, NY, 1997
Other results for this dataset (methodology sometimes uncertain):
D. Wettschereck, averaging 25 runs with 70% train and 30% test, variants of kNN with different metric functions and scaling.
David Aha & Dennis Kibler - from UCI repository past usage
Method | Accuracy % | Reference
kNN, Value Distance Metric (VDM) | 82.6 | D. Wettschereck
kNN, Euclidean | 82.4±0.8 | D. Wettschereck
kNN, Variable Similarity Metric | 82.4 | D. Wettschereck
kNN, Modified VDM | 83.1 | D. Wettschereck
Other kNN variants | < 82.4 | D. Wettschereck
kNN, Mutual Information | 81.8 | D. Wettschereck
CLASSIT (hierarchical clustering) | 78.9 | Gennari, Langley, Fisher
NTgrowth (instance-based) | 77.0 | Aha & Kibler
C4 | 74.8 | Aha & Kibler
Naive Bayes | 82.8±1.3 | Friedman et al., 5xCV, 296 vectors
Gennari, J.H., Langley, P, Fisher, D. (1989). Models of incremental concept formation. Artificial Intelligence, 40, 11-61.
Friedman N, Geiger D, Goldszmidt M (1997). Bayesian network classifiers. Machine Learning 29: 131-163
From the UCI repository, dataset "Pima Indian diabetes": 2 classes, 8 attributes, 768 instances, 500 (65.1%) negative (class 1) and 268 (34.9%) positive (class 2) tests for diabetes. All patients were females at least 21 years old of Pima Indian heritage. Attributes used:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
Results obtained with 10-fold cross-validation, % of accuracy given; Statlog results are with 12-fold cross-validation.
For 77.7% accuracy and p=0.95 confidence level, 2-tailed bounds are: [74.6%,80.5%]
Results on BP, LVQ, ..., SNB are from: B. Ster and A. Dobnikar, Neural networks in medical diagnosis: Comparison with other methods. In A. Bulsari et al., editor, Proceedings of the International Conference EANN '96, pages 427-430, 1996.
Porter R.B., G. Beate Zimmer, Don R. Hush: Stack Filter Classifiers. ISMM 2009: 282-294
Shang N, L. Breiman, ICONIP'96, p.133
Other results (with different tests):
Method | Accuracy % | Reference
SVM (5xCV) | 77.6 | Bennett and Blue
C4.5 | 76.0±0.9 | Friedman, 5xCV
SemiNaive Bayes | 76.0±0.8 | Friedman, 5xCV
Naive Bayes | 74.5±0.9 | Friedman, 5xCV
Default, majority | 65.1 |
Friedman N, Geiger D, Goldszmidt M (1997). Bayesian network classifiers. Machine Learning 29: 131-163
Opper/Winther use 200 training and 332 test examples (following Ripley), with TAP MFT results on the test set at 81%, SVS at 80.1% and the best NN at 77.4%.
Thyroid. From the UCI repository, dataset "ann-train.data": a thyroid database suited for training ANNs. 3772 learning and 3428 testing examples; 3 classes: primary hypothyroid, compensated hypothyroid, normal.
Training: 93+191+3488 or 2.47%, 5.06%, 92.47%
Test: 73+177+3178 or 2.13%, 5.16%, 92.71%
21 attributes (15 binary, 6 continuous).
The problem is to determine whether a patient referred to the clinic is hypothyroid. Therefore three classes are built: normal (not hypothyroid), hyperfunction and subnormal functioning. Because 92 percent of the patients are not hyperthyroid, a good classifier must be significantly better than 92%.
Note: These are the data Quinlan used in the case study of his article "Simplifying Decision Trees" (International Journal of Man-Machine Studies (1987) 221-234).
Names: I (W.D.) have investigated this issue, and after some mail exchange with Chris Merz, who maintains the UCI repository, here is the conclusion:
1 age: continuous
2 sex: {M, F}
3 on thyroxine: logical
4 maybe on thyroxine: logical
5 on antithyroid medication: logical
6 sick - patient reports malaise: logical
7 pregnant: logical
8 thyroid surgery: logical
9 I131 treatment: logical
10 test hypothyroid: logical
11 test hyperthyroid: logical
12 on lithium: logical
13 has goitre: logical
14 has tumor: logical
15 hypopituitary: logical
16 psychological symptoms: logical
17 TSH: continuous
18 T3: continuous
19 TT4: continuous
20 T4U: continuous
21 FTI: continuous
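The class percentages and the majority-class baseline quoted for the thyroid data follow directly from the class counts. A small sketch (the helper name `class_shares` is hypothetical) that recomputes them:

```python
def class_shares(counts):
    """Percentage of each class, rounded to two decimals."""
    total = sum(counts)
    return [round(100 * c / total, 2) for c in counts]

# Class counts as given in the dataset description:
# primary hypothyroid, compensated hypothyroid, normal.
train_counts = [93, 191, 3488]
test_counts = [73, 177, 3178]

train_shares = class_shares(train_counts)
test_shares = class_shares(test_counts)

# A classifier that always answers "normal" errs on every non-normal test case.
majority_test_errors = sum(test_counts) - max(test_counts)

print(train_shares, test_shares, majority_test_errors)
```

The 250 test errors of the majority classifier give the 92.7% default accuracy that any useful method has to beat on this data.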
Results:
Method | % training | % test | Reference
CMLP2LN rules+ASA | 99.90 | 99.36 | Rafał/Krzysztof/Grzegorz
CART | 99.80 | 99.36 | Weiss
PVM | 99.80 | 99.33 | Weiss
SSV beam search | 99.80 | 99.33 | WD
IncNet | 99.68 | 99.24 | Norbert
SSV opt leaves or pruning | 99.7 | 99.1 | WD
MLP init+ a,b opt. | 99.5 | 99.1 | Rafał
CMLP2LN rules | 99.7 | 99.0 | Rafał/Krzysztof
Cascade correlation | 100.0 | 98.5 | Schiffmann
Local adapt. rates | 99.6 | 98.5 | Schiffmann
BP+genetic opt. | 99.4 | 98.4 | Schiffmann
Quickprop | 99.6 | 98.3 | Schiffmann
RPROP | 99.6 | 98.0 | Schiffmann
3NN, Euclidean, with 3 features | 98.7 | 97.9 | W.D./Karol
1NN, Euclidean, with 3 features | 98.4 | 97.7 | W.D./Karol
Best backpropagation | 99.1 | 97.6 | Schiffmann
1NN, Euclidean, 8 features used | | 97.3 | Karol/W.D.
SVM Gauss, C=8 s=0.1 | 98.3 | 96.1 | WD
Bayesian classif. | 97.0 | 96.1 | Weiss?
SVM Gauss, C=1 s=0.1 | 95.4 | 94.7 | WD
BP+conj. gradient | 94.6 | 93.8 | Schiffmann
1NN Manhattan, std data | | 93.8 | Karol G./WD
SVM lin, C=1 | 94.1 | 93.3 | WD
SVM Gauss, C=8 s=5 | 100 | 92.8 | WD
Default, majority, 250 test errors | | 92.7 |
1NN Manhattan, raw data | | 92.2 | Karol G./WD
For 99.90% accuracy on training and p=0.95 confidence level, 2-tailed bounds are: [99.74%,99.96%]
Most NN results are from W. Schiffmann, M. Joost, R. Werner, 1993; MLP2LN and Init+a,b are ours.
kNN, PVM and CART from S.M. Weiss, I. Kapouleas, "An empirical comparison of pattern recognition, neural nets and machine learning classification methods", in: J.W. Shavlik and T.G. Dietterich, Readings in Machine Learning, Morgan Kaufmann Publ, CA 1990
SVM with linear and Gaussian kernels gives quite poor results on this data.
3 crisp logical rules using TSH, FTI, T3, on_thyroxine, thyroid_surgery, TT4 give 99.3% accuracy on the test set.
Hepatobiliary disorders
Contains medical records of 536 patients admitted to a university-affiliated Tokyo-based hospital, with four types of hepatobiliary disorders: alcoholic liver damage, primary hepatoma, liver cirrhosis and cholelithiasis. The records include the results of 9 biochemical tests and the sex of the patient. The same 163 cases as in [Hayashi et al.] were used as the test data.
FSM gives about 60 Gaussian or triangular membership functions, achieving accuracy of 75.5-75.8%. Rotation of these functions (i.e. introducing linear combinations of inputs to the rules) does not improve this accuracy. 10-fold cross-validation tests on the mixed training plus test data give similar results.
The best results were obtained with the K* method based on algorithmic complexity optimization, giving 78.5% on the test set, and with kNN with the Manhattan distance function, k=1 and selection of features (using the leave-one-out method on the training data, features 2, 5, 6 and 9 were removed), giving 80.4% accuracy. Simulated annealing optimization of the scaling factors for the remaining 5 features gives 81.0%, and optimizing the scaling factors using all input features gives 82.8%. The scaling factors are: 0.92, 0.60, 0.91, 0.92, 0.07, 0.41, 0.55, 0.86, 0.30. Similar accuracy is obtained using the multisimplex method for optimization of the scaling factors.
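How per-feature scaling factors enter a nearest-neighbor classifier with the Manhattan distance can be sketched as below. The scaling factors are the nine values quoted in the text; the function names and the toy vectors are hypothetical, and the optimization of the factors (simulated annealing or multisimplex) is not shown.

```python
# Scaling factors quoted in the text, one per input feature.
SCALE = [0.92, 0.60, 0.91, 0.92, 0.07, 0.41, 0.55, 0.86, 0.30]

def weighted_manhattan(a, b, scale):
    """Manhattan distance with each feature difference multiplied by its factor."""
    return sum(s * abs(x - y) for s, x, y in zip(scale, a, b))

def nearest_neighbor(query, train_x, train_y, scale):
    """Label of the training vector closest to the query (k=1)."""
    best = min(range(len(train_x)),
               key=lambda i: weighted_manhattan(query, train_x[i], scale))
    return train_y[best]

# Toy vectors (hypothetical): "A" differs from the query only in feature 5,
# which has a small factor; "B" differs in feature 1, which has a large one.
train_x = [[0, 0, 0, 0, 5, 0, 0, 0, 0],
           [1, 0, 0, 0, 0, 0, 0, 0, 0]]
train_y = ["A", "B"]
query = [0] * 9

unweighted = nearest_neighbor(query, train_x, train_y, [1.0] * 9)
weighted = nearest_neighbor(query, train_x, train_y, SCALE)
print(unweighted, weighted)
```

A factor close to zero, such as 0.07 for the fifth feature, nearly removes that feature from the distance, so scaling acts as a soft form of feature selection.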
Method | Training set | Test set | Reference
IB2-IB4 | 81.2-85.5 | 43.6-44.6 | WEKA, our calculation
Naive Bayes | | 46.6 | WEKA, our calculation
1R (rules) | 58.4 | 50.3 | WEKA, our calculation
T2 (rules from decision tree) | 67.5 | 53.3 | WEKA, our calculation
FOIL (inductive logic) | 99 | 60.1 | WEKA, our calculation
FSM, initial 49 crisp logical rules | 83.5 | 63.2 | FSM, our calculation
LDA (statistical) | 68.4 | 65.0 | our calculation
DLVQ (38 nodes) | 100 | 66.0 | our calculation
C4.5 decision rules | 64.5 | 66.3 | our calculation
Best fuzzy MLP model | 75.5 | 66.3 | Mitra et al.
MLP with RPROP | | 68.0 | our calculation
Cascade Correlation | | 71.0 | our calculation
Fuzzy neural network | 100 | 75.5 | Hayashi
C4.5 decision tree | 94.4 | 75.5 | our calculation
FSM, Gaussian functions | 93 | 75.6 | our calculation
FSM, 60 triangular functions | 93 | 75.8 | our calculation
IB1c (instance-based) | | 76.7 | WEKA, our calculation
kNN, k=1, Canberra, raw | 76.1 | 80.4 | WD/SBL
K* method | | 78.5 | WEKA, our calculation
1NN, 4 features removed, Manhattan | 76.9 | 80.4 | our calculation, KG
1NN, Canberra, raw, removed f2, 6, 8, 9 | 77.2 | 83.4 | our calculation, KG
Y. Hayashi, A. Imura, K. Yoshida, "Fuzzy neural expert system and its application to medical diagnosis", in: 8th International Congress on Cybernetics and Systems, New York City 1990, pp. 54-61
S. Mitra, R. De, S. Pal, "Knowledge based fuzzy MLP for classification and rule generation", IEEE Transactions on Neural Networks 8, 1338-1350, 1997; a knowledge-based fuzzy MLP system gives results on the test set in the range from 33% to 66.3%, depending on the actual fuzzy model used.
W. Duch and K. Grudzinski, "Prototype Based Rules - New Way to Understand the Data", Int. Joint Conference on Neural Networks, Washington D.C., pp. 1858-1863, 2001. Contains the best results with 1NN, Canberra distance and feature selection, 83.4% on the test set.
4435 training and 2000 test cases, 36 semi-continuous [0 to 255] attributes (= 4 spectral bands x 9 pixels in the neighbourhood) and 6 decision classes: 1, 2, 3, 4, 5 and 7 (class 6 has been removed because of doubts about the validity of this class). The StatLog database consists of the multi-spectral values of pixels in 3x3 neighbourhoods in a satellite image, and the classification associated with the central pixel in each neighbourhood. The aim is to predict this classification, given the multi-spectral values. In the sample database, the class of a pixel is coded as a number.


Method | % training | % test | Time train / settings | Time test / notes
MLP+SCG | 96.0 | 91.0 | reg alfa=0.5, 36 hidden nodes, 1400 it | fast; WD
kNN | | 90.9 | autok=3, Manhattan, std data | GM 2.0
kNN | 91.1 | 90.6 | 2105, Statlog | 944; parameters?
kNN | | 90.4 | autok=5, Euclidean, std data | GM 2.0
kNN | | 90.0 | k=1, Manhattan, std data, no training | fast, GM 2.0
FSM | 95.1 | 89.7 | std data, a=0.95 | fast, GM 2.0; best NN result
LVQ | 95.2 | 89.5 | 1273 | 44
kNN | | 89.4 | k=1, Euclidean, std data, no training | fast, GM 2.0
Dipol92 | 94.9 | 88.9 | 746 | 111
MLP+SCG | 94.4 | 88.5 | 5000 it; active learning+reg a=0.5, 812 hidden | fast; WD
SVM | 91.6 | 88.4 | std data, Gaussian kernel | fast, GM 2.0; unclassified 4.3%
Radial | 88.9 | 87.9 | 564 | 74
Alloc80 | 96.4 | 86.8 | 63840 | 28757
IndCart | 97.7 | 86.2 | 2109 | 9
CART | 92.1 | 86.2 | 330 | 14
MLP+BP | 88.8 | 86.1 | 72495 | 53
Bayesian Tree | 98.0 | 85.3 | 248 | 10
C4.5 | 96.0 | 85.0 | 434 | 1
New ID | 93.3 | 85.0 | 226 | 53
QuaDisc | 89.4 | 84.5 | 157 | 53
SSV | 90.9 | 84.3 | default par. | very fast, GM 2.0
Cascade | 88.8 | 83.7 | 7180 | 1
Log DA, Disc | 88.1 | 83.7 | 4414 | 41
LDA, Discrim | 85.1 | 82.9 | 68 | 12
Kohonen | 89.9 | 82.1 | 12627 | 129
Bayes | 69.2 | 71.3 | 75 | 17
The original database was generated from Landsat Multi-Spectral Scanner image data. The sample database was generated taking a small section (82 rows and 100 columns) from the original data. One frame of Landsat MSS imagery consists of four digital images of the same scene in different spectral bands. Two of these are in the visible region (corresponding approximately to green and red regions of the visible spectrum) and two are in the (near) infra-red. Each pixel is an 8-bit binary word, with 0 corresponding to black and 255 to white. The spatial resolution of a pixel is about 80m x 80m. Each image contains 2340 x 3380 such pixels.
The database is a (tiny) sub-area of a scene, consisting of 82 x 100 pixels. Each line of data corresponds to a 3x3 square neighbourhood of pixels completely contained within the 82x100 sub-area. Each line contains the pixel values in the four spectral bands (converted to ASCII) of each of the 9 pixels in the 3x3 neighbourhood and a number indicating the classification label of the central pixel.
In each line of data the four spectral values for the top-left pixel are given first, followed by the four spectral values for the top-middle pixel and then those for the top-right pixel, and so on, with the pixels read out in sequence left-to-right and top-to-bottom. Thus, the four spectral values for the central pixel are given by attributes 17, 18, 19 and 20. If you like you can use only these four attributes, while ignoring the others. This avoids the problem which arises when a 3x3 neighbourhood straddles a boundary.
All results are from the Statlog book, except GM - GhostMiner calculations, W. Duch.
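The attribute layout described above (pixels read left-to-right and top-to-bottom, four spectral values per pixel) can be turned into a small indexing helper. The function names are hypothetical; attribute numbers are 1-based, as in the text.

```python
BANDS = 4  # spectral values per pixel
SIDE = 3   # the neighbourhood is SIDE x SIDE pixels

def pixel_attributes(row, col):
    """1-based attribute numbers of the pixel at (row, col), both 0-based."""
    pixel = SIDE * row + col       # position in reading order
    first = BANDS * pixel + 1      # attribute number of its first band
    return list(range(first, first + BANDS))

def central_pixel_values(attributes):
    """Pick the four spectral values of the central pixel from a 36-value line."""
    return [attributes[n - 1] for n in pixel_attributes(1, 1)]

print(pixel_attributes(0, 0))  # top-left pixel
print(pixel_attributes(1, 1))  # central pixel

# Hypothetical data line: attribute n simply holds the value n.
line = list(range(1, 37))
print(central_pixel_values(line))
```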
351 data records, with class division 224 (63.8%) + 126 (35.9%). Usually the first 200 vectors are taken for training and the last 151 for the test, but this is very unbalanced: in the training set 101 (50.5%) and 99 (49.5%) vectors are from class 1/2, in the test set 123 (82%) and 27 (18%) are from class 1/2. 34 attributes, but f2=0 always and should be removed; f1 is binary, the remaining 32 attributes are continuous. 2 classes - different types of radar signals reflected from the ionosphere. Some vectors: 8, 18, 20, 22, 24, 30, 38, 52, 76, 78, 80, 82, 103, 163, 169, 171, 183, 187, 189, 191, 201, 215, 219, 221, 223, 225, 227, 229, 231, 233, 249, are either binary 0, 1 or have only 3 values -1, 0, +1. For example, vector 169 has only one component = 1, all others are 0.
Method | Accuracy % | Reference
3NN + simplex | 98.7 | Our own weighted kNN
VSS 2 epochs | 96.7 | MLP with numerical gradient
3NN | 96.7 | KG, GM with or without weights
IB3 | 96.7 | Aha, 5 errors on test
1NN, Manhattan | 96.0 | GM kNN (our)
MLP+BP | 96.0 | Sigillito
SVM Gaussian | 94.9±2.6 | GM (our), defaults, similar for C=1100
C4.5 | 94.9 | Hamilton
3NN Canberra | 94.7 | GM kNN (our)
RIAC | 94.6 | Hamilton
C4 (no windowing) | 94.0 | Aha
C4.5 | 93.7 | Bennett and Blue
SVM | 93.2 | Bennett and Blue
Nonlin perceptron | 92.0 | Sigillito
FSM + rotation | 92.8 | our
1NN, Euclidean | 92.1 | Aha, GM kNN (our)
DBCART | 91.3 | Shang, Breiman
Linear perceptron | 90.7 | Sigillito
OC1 DT | 89.5 | Bennett and Blue
CART | 88.9 | Shang, Breiman
SVM linear | 87.1±3.9 | GM (our), defaults
GTO DT | 86.0 | Bennett and Blue
Perceptron+MLP results: Sigillito, V. G., Wing, S. P., Hutton, L. V., & Baker, K. B. (1989) Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest, 10, 262-266.
N. Shang, L. Breiman, ICONIP'96, p.133
David Aha: kNN+C4+IB3, from Aha, D. W., & Kibler, D. (1989). Noise-tolerant instance-based learning algorithms. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (pp. 794-799). Detroit, MI: Morgan Kaufmann. IB3 parameter settings: 70% and 80% for acceptance and dropping respectively.
RIAC, C4.5 from: H.J. Hamilton, N. Shan, N. Cercone, RIAC: a rule induction algorithm based on approximate classification, Tech. Rep. CS 96-06, Regina University 1996.
K.P. Bennett, J. Blue, A Support Vector Machine Approach to Decision Trees, R.P.I Math Report No. 97-100, Rensselaer Polytechnic Institute, Troy, NY, 1997
The training/test division is not too good in this case; the distributions are a bit different. In 10xCV the results are:
Method | Accuracy % | Reference
SFM+G+G(WX) | ??±2.6 | GM (our), C=1, s=25
kNN auto+WX+G(WX) | ??.4±3.6 | GM (our)
SVM Gaussian | 94.6±4.3 | GM (our), C=1, s=25
VSSMKNN | 91.5±4.3 | MK, 12 neurons (similar 817)
SVM lin | 89.5±3.8 | GM (our), C=1, s=25
SSV tree | 87.8±4.5 | GM (our), default
1NN | 85.8±4.9 | GM std, Euclid
3NN | 84.0±5.4 | GM std, Euclid
VSS is an MLP with search, implemented by Mirek Kordos, used with 3 epochs; neurons may be sigmoidal or stepwise (64 values). Maszczyk T, Duch W, Support Feature Machine, WCCI 2010 (submitted).
208 cases, 60 continuous attributes, 2 classes, 111 metal, 97 rock. From the CMU benchmark repository. This dataset has been used in two kinds of experiments:
1. The "aspect-angle independent" experiments use all 208 cases with 13-fold cross-validation, averaged over 10 runs to get the std.
2. The "aspect-angle dependent" experiments use training / test sets with 104 vectors each. Class distribution in training is 49 + 55, in test 62 + 42.
Estimation of L1O on the whole dataset (Opper and Winther) gives only 78.2%; is the test so easy? Some of these results were obtained without standardization of the data, which is very important here!
The "aspect-angle dependent" experiments with training / test sets:
Method | Train % | Test % | Reference
1NN, 5D from MDS, Euclid, std | | 97.1 | our, GM (WD)
1NN, Manhattan std | | 97.1 | our, GM (WD)
1NN, Euclid std | | 96.2 | our, GM (WD)
TAP MFT Bayesian | | 92.3 | Opper, Winther
Naive MFT Bayesian | | 90.4 | Opper, Winther
SVM | | 90.4 | Opper, Winther
MLP+BP, 12 hidden, best MLP | | 90.4 | Gorman, Sejnowski
1NN, Manhattan raw | | 92.3 | our, GM (WD)
1NN, Euclid raw | | 91.3 | our, GM (WD)
FSM - methodology ? | | 83.6 | our (RA)
The "aspect-angle independent" experiments with 13 CV on all data:
Method | Train % | Test % | Reference
1NN Euclid on 5D MDS input | | 87.5±0.8 | our GM (WD)
1NN Euclidean, std data | | 86.8±1.2 | our GM (WD)
1NN Manhattan, std data | | 86.3±0.3 | our GM (WD)
MLP+BP, 12 hidden | 99.8±0.1 | 84.7±5.7 | Gorman, Sejnowski
1NN Manhattan, raw data | | 84.5±0.4 | our GM (WD)
MLP+BP, 24 hidden | 99.8±0.1 | 84.5±5.7 | Gorman, Sejnowski
MLP+BP, 6 hidden | 99.7±0.2 | 83.5±5.6 | Gorman, Sejnowski
SVM linear, C=0.1 | | 82.7±8.5 | our GM (WD), std data
1NN Euclidean, raw data | | 82.1±0.9 | our GM (WD)
SVM Gauss, C=1, s=0.1 | | 77.4±10.1 | our GM (WD), std data
SVM linear, C=1 | | 76.9±11.9 | our GM (WD), raw data
SVM linear, C=1 | | 76.0±9.8 | our GM (WD), std data
DBCART, 10xCV | | 81.8 | Shang, Breiman
CART, 10xCV | | 67.9 | Shang, Breiman
M. Opper and O. Winther, Gaussian Processes and SVM: Mean Field Results and Leave-One-Out. In: Advances in Large Margin Classifiers, Eds. A. J. Smola, P. Bartlett, B. Schölkopf, D. Schuurmans, MIT Press, 311-326, 2000; same methodology as Gorman with Sejnowski.
N. Shang, L. Breiman, ICONIP'96, p.133, 10xCV
Gorman, R. P., and Sejnowski, T. J. (1988). "Analysis of Hidden Units in a Layered Network Trained to Classify Sonar Targets", Neural Networks 1, pp. 75-89, 13xCV
Our results: kNN results from 10xCV and from 13xCV are quite similar, so Shang and Breiman should not differ much from 13 CV. WD
Leave-one-out (L1O) estimations on std data: L1O with k=1, Euclidean distance, for all data gives 87.50%; other k and distance functions do not give significant improvement. SVM linear, C=1, L1O 75.0%; for the Gaussian kernel, C=1, L1O is 78.8%.
Other L1O results are taken from C. Domeniconi, J. Peng, D. Gunopulos, "An adaptive metric for pattern classification".
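A leave-one-out estimate for a 1NN classifier with Euclidean distance, with and without standardization, can be sketched as follows. The function names and the four toy vectors are hypothetical; the example only illustrates why standardization matters when attributes have very different ranges, and it standardizes on the whole data set for brevity.

```python
import math
import statistics

def standardize(rows):
    """Rescale every attribute to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]  # guard constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def loo_1nn_accuracy(rows, labels):
    """Classify each vector by its nearest other vector; return the hit rate."""
    correct = 0
    for i, query in enumerate(rows):
        nearest = min((j for j in range(len(rows)) if j != i),
                      key=lambda j: math.dist(query, rows[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(rows)

# Toy data (hypothetical): the first attribute separates the classes but has
# a small range; the second has a large range and dominates the raw distance.
raw = [[0.0, 100.0], [0.1, 200.0], [1.0, 150.0], [1.1, 250.0]]
labels = [0, 0, 1, 1]

raw_accuracy = loo_1nn_accuracy(raw, labels)
std_accuracy = loo_1nn_accuracy(standardize(raw), labels)
print(raw_accuracy, std_accuracy)
```

On this toy set the raw distances are dominated by the wide-range attribute, while after standardization the informative attribute decides the nearest neighbor.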
528 training, 462 test cases, 10 continuous attributes, 11 classes. From the UCI benchmark repository. Speaker-independent recognition of the eleven steady-state vowels of British English using a specified training set of LPC-derived log area ratios. Results on the total set:
Method | Train | Test | Reference
CARTDB, 10xCV on total set !!! | | 90.0 | Shang, Breiman
CART, 10xCV on total set | | 78.2 | Shang, Breiman
Method | Train | Test | Reference
Square node network, 88 units | | 54.8 | UCI
Gaussian node network, 528 units | | 54.6 | UCI
1NN, Euclidean, raw | 99.24 | 56.3 | WD/KG
Radial Basis Function, 528 units | | 53.5 | UCI
Gaussian node network, 88 units | | 53.5 | UCI
FSM Gauss, 10CV on the training set | 92.60 | 51.94 | our (RA)
Square node network, 22 | | 51.1 | UCI
Multi-layer perceptron, 88 hidden | | 50.6 | UCI
Modified Kanerva Model, 528 units | | 50.0 | UCI
Radial Basis Function, 88 units | | 47.6 | UCI
Single-layer perceptron, 88 hidden | | 33.3 | UCI
N. Shang, L. Breiman, ICONIP'96, p.133, used 10xCV instead of the test set.
Parameters in SVM were optimized, that is, different parameters were used in each CV partition, so only an approximate value can be quoted. If they are fixed to C=1000, s=1, results are a bit worse.
Papers using this data:
S. K. Pal and D. Dutta Majumder, "Fuzzy sets and decision making approaches in vowel and speaker recognition", IEEE Transactions on Systems, Man, and Cybernetics, Vol. 7, pp. 625-629, 1977.
S. Mitra, M. Banerjee and S. K. Pal, Rough knowledge-based network, fuzziness and classification, Neural Computing & Applications 7, 17-25, 1998.
Duch W and Hayashi Y, Computational intelligence methods and data understanding. In: Quo Vadis computational Intelligence? New trends and approaches in computational intelligence. Eds. P. Sincak, J. Vascak, Springer studies in fuzziness and soft computing, Vol. 54 (2000), pp. 256-270.
Chaoshun Li, Jianzhong Zhou, Qingqing Li and Xiuqiao Xiang, A Fuzzy Cluster Algorithm Based on Mutative Scale Chaos Optimization, LNCS 5264, 259-267, 2008.
Source: UCI, described in Forina, M. et al, PARVUS: An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
Class distribution: 178 cases = [59, 71, 48] in classes 1-3;
13 continuous attributes: alcohol, malic acid, ash, alkalinity, magnesium, phenols, flavanoids, nonanthocyanins, proanthocyanins, color, hue, OD280/OD315, proline.
| Method | Test | Reference |
|---|---|---|
| Leave-one-out test results | | |
| RDA | 100 | [1] |
| QDA | 99.4 | [1] |
| LDA | 98.9 | [1] |
| kNN, Manhattan, k=1 | 98.7 | GMWD, std data |
| 1NN | 96.1 | [1], z-transformed data |
| kNN, Euclidean, k=1 | 95.5 | GMWD, std data |
| kNN, Chebyshev, k=1 | 93.3 | GMWD, std data |
| 10xCV tests below | | |
| kNN, Manhattan, auto k=1-10 | 98.9±2.3 | GMWD, 2D data, after MDS/PCA |
| IncNet, 10CV, def, Gauss | 98.9±2.4 | GMWD, std data, up to 3 neurons |
| 10 CV SSV, opt prune | 98.3±2.7 | GMWD, 2D data, after MDS/PCA |
| 10 CV SSV, node count 7 | 98.3±2.7 | GMWD, 2D data, after MDS/PCA |
| kNN, Euclidean, k=1 | 97.8±2.8 | GMWD, 2D data, after MDS/PCA |
| kNN, Manhattan, k=1 | 97.8±2.9 | GMWD, 2D data, after MDS/PCA |
| kNN, Manhattan, auto k=1-10 | 97.8±3.9 | GMWD |
| kNN, Euclidean, k=3, weighted features | 97.8±4.7 | GMWD |
| IncNet, 10CV, def, bicentral | 97.2±2.9 | GMWD, std data, up to 3 neurons |
| kNN, Euclidean, auto k=1-10 | 97.2±4.0 | GMWD |
| 10 CV SSV, opt node | 97.2±5.4 | GMWD, 2D data, after MDS/PCA |
| FSM a=.99, def | 96.1±3.7 | GMWD, 2D data, after MDS/PCA |
| FSM 10CV, Gauss, a=.999 | 96.1±4.7 | GMWD, std data, 8-11 neurons |
| FSM 10CV, triang, a=.99 | 96.1±5.9 | GMWD, raw data |
| kNN, Euclidean, k=1 | 95.5±4.4 | GMWD |
| 10 CV SSV, opt node, BFS | 92.8±3.7 | GMWD |
| 10 CV SSV, opt node, BS | 91.6±6.5 | GMWD |
| 10 CV SSV, opt prune, BFS | 90.4±6.1 | GMWD |
UCI past usage:
[1] S. Aeberhard, D. Coomans and O. de Vel, Comparison of Classifiers in High Dimensional Settings, Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland (submitted to Technometrics).
[2] S. Aeberhard, D. Coomans and O. de Vel, "The classification performance of RDA", Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland (submitted to Journal of Chemometrics).
Shang, Breiman: CART 71.4% accuracy, DBCART 70.6%.
Leave-one-out results taken from C. Domeniconi, J. Peng, D. Gunopulos, "An adaptive metric for pattern classification".
Statlog data: splice junctions are points on a DNA sequence at which "superfluous" DNA is removed during the process of protein creation in higher organisms. The problem posed in this dataset is to recognize, given a sequence of DNA, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). This problem consists of two subtasks: recognizing exon/intron boundaries (referred to as EI sites), and recognizing intron/exon boundaries (IE sites). (In the biological community, IE borders are referred to as "acceptors" while EI borders are referred to as "donors".)
Number of Instances: 3190. Class distribution:
| Class | Train | Test |
|---|---|---|
| 1 | 464 (23.20%) | 303 (25.55%) |
| 2 | 485 (24.25%) | 280 (23.61%) |
| 3 | 1051 (52.55%) | 603 (50.84%) |
| All | 2000 (100%) | 1186 (100%) |
Number of attributes: originally 60 attributes {a,c,t,g}, usually converted to 180 binary indicator variables {(0,0,0), (0,0,1), (0,1,0), (1,0,0)}, or 240 binary variables. Much better performance is generally observed if attributes closest to the junction are used (middle). In the StatLog version (180 variables), this means using attributes A61 to A120 only.
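The encoding described above can be sketched as follows. This is a minimal illustration only: the source lists the four 3-bit codes but does not say which base receives which code, so the `CODES` mapping below is an assumption, and the 60-base sequence is made up.

```python
# Assumed mapping of bases to 3-bit codes; the source gives the code set
# {(0,0,0), (0,0,1), (0,1,0), (1,0,0)} but not the assignment to a, c, g, t.
CODES = {"a": (1, 0, 0), "c": (0, 1, 0), "g": (0, 0, 1), "t": (0, 0, 0)}

def encode(sequence):
    """Flatten a 60-base sequence into 180 binary indicator variables."""
    bits = []
    for base in sequence.lower():
        bits.extend(CODES[base])
    return bits

def middle_attributes(bits):
    """Keep attributes A61..A120 (1-based), i.e. list indices 60..119."""
    return bits[60:120]

seq = "acgt" * 15               # hypothetical 60-base sequence
bits = encode(seq)
middle = middle_attributes(bits)
print(len(bits), len(middle))
```

With three indicators per base, attributes A61 to A120 correspond to the 20 bases in the middle of the 60-base window.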
| Method | % in training | % on test | Time train | Time test |
|---|---|---|---|---|
| RBF, 720 nodes | 98.5 | 95.9 | | |
| kNN GM, p(XC), k=6, Euclid, raw | 96.8 | 95.5 | 0 | short |
| Dipol92 | 99.3 | 95.2 | 213 | 10 |
| Alloc80 | 93.7 | 94.3 | 14394 | |
| QuaDisc | 100.0 | 94.1 | 1581 | 809 |
| LDA, Discrim | 96.6 | 94.1 | 929 | 31 |
| FSM, 8 Gaussians, 180 binary | 95.4 | 94.0 | | |
| Log DA, Disc | 99.2 | 93.9 | 5057 | 76 |
| SSV Tree, p(XC), opt node, 4CV | 94.8 | 93.4 | short | short |
| Naive Bayes | 94.8 | 93.2 | 52 | 15 |
| Castle, middle 90 binary var | 93.9 | 92.8 | 397 | 225 |
| IndCart, 180 binary | 96.0 | 92.7 | 523 | 516 |
| C4.5, on 60 features | 96.0 | 92.4 | 9 | 2 |
| CART, middle 90 binary var | 92.5 | 91.5 | 615 | 9 |
| MLP+BP | 98.6 | 91.2 | 4094 | 9 |
| Bayesian Tree | 99.9 | 90.5 | 82 | 11 |
| CN2 | 99.8 | 90.5 | 869 | 74 |
| New ID | 100.0 | 90.0 | 698 | 1 |
| Ac2 | 100.0 | 90.0 | 12378 | 87 |
| Smart | 96.6 | 88.5 | 79676 | 16 |
| Cal5 | 89.6 | 86.9 | 1616 | 8 |
| Itrule | 86.9 | 86.5 | 2212 | 6 |
| kNN | 91.1 | 85.4 | 2428 | 882 |
| Kohonen | 89.6 | 66.1 | | |
| Default, majority | 52.5 | 50.8 | | |
kNN GM: GhostMiner version of kNN (our group)
SSV Decision Tree: our results
Datasets used for classification: comparison of results
Before using any new dataset it should be described here!
Results from the Statlog project are here.
Logical rules derived for data are here.
Medical:
Appendicitis 
Breast cancer (Wisconsin) 
Breast Cancer (Ljubljana) 
Diabetes (Pima Indian) 
Heart disease (Cleveland) 
Heart disease (Statlog version) 
Hepatitis 
Hypothyroid 
Hepatobiliary disorders 
Other datasets:
Ionosphere 
Satellite image dataset (Statlog version) 
Sonar 
Telugu Vowel
Vowel
Wine 
Other data: Glass, DNA 
More results for Statlog datasets.
A note of caution: comparison of different classifiers is not an easy task. Before you get into ranking of methods using the numbers presented in tables below please note the following facts.
Many results we have collected give only a single number (even results from the StatLog project!), without standard deviation. Since most classifiers may give results that differ by several percent on slightly different data partitions single numbers do not mean much.
Leave-one-out tests have been criticized as a basis for accuracy evaluation; the conclusion is that cross-validation is safer, cf:
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proc. of the 14th Int. Joint Conference on Artificial Intelligence, Morgan Kaufmann, pp. 1137-1143.
Cross-validation (CV) tests are also not ideal. Theoretically about 2/3 of results should be within a single standard deviation of the average, and 95% of results should be within two standard deviations, so in a 10-fold cross-validation you should very rarely see results that are more than two standard deviations better or worse than the average. Running CV several times may also give you different answers. The search for the best estimator continues. Cf:
Dietterich, T. (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10 (7), 1895-1924;
Nadeau C, Bengio Y. (1999) Inference for the Generalization Error. Tech. rep. 99s-25, CIRANO, J. Machine Learning (Kluwer, in print).
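As a rough illustration of the point about standard deviations, the sketch below summarizes the per-fold accuracies of one cross-validation run and lists any folds lying more than two standard deviations from the mean. The fold accuracies are hypothetical and are not taken from any table on this page; the function name is likewise only illustrative.

```python
import statistics

def summarize_folds(fold_accuracies):
    """Return mean, sample std, and the folds more than 2 std from the mean."""
    mean = statistics.mean(fold_accuracies)
    std = statistics.stdev(fold_accuracies)  # sample standard deviation (n-1)
    outliers = [a for a in fold_accuracies if abs(a - mean) > 2 * std]
    return mean, std, outliers

# Hypothetical per-fold accuracies (%) from a single 10-fold run.
folds = [84.0, 86.5, 81.0, 88.0, 85.5, 83.0, 87.0, 84.5, 82.5, 86.0]
mean, std, outliers = summarize_folds(folds)
print(f"{mean:.1f}±{std:.1f}, folds beyond 2 std: {outliers}")
```

Reporting the result in the "mean±std" form used in the tables makes it possible to judge whether a difference between two methods is larger than the spread across partitions.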
Even the best accuracy and variance estimation is not sufficient, since performance cannot be characterized by a single number. It would be much better to provide full Receiver Operating Characteristic (ROC) curves. Combining ROC with variance estimation would be ideal.
Unfortunately this still remains to be done. All we can do now is to collect some numbers in tables.
Our results are obtained usually with the GhostMiner package, developed in our group.
Some publications with results are on my page.
TuneIT, Testing Machine Learning & Data Mining Algorithms  Automated Tests, Repeatable Experiments, Meaningful Results.
Results of handwritten signs and numbers classification are here.
Appendicitis.
106 vectors, 8 attributes, two classes (85 acute appendicitis + 21 other, or 80.2% + 19.8%), data from Shalom Weiss.
Results obtained with the leave-one-out test, % of accuracy given.
Attribute names: WBC1, MNEP, MNEA, MBAP, MBAA, HNEP, HNEA
k=4,5, stand. Euclid, f2+f4 removed
S.M. Weiss, I. Kapouleas, "An empirical comparison of pattern recognition, neural nets and machine learning classification methods", in: J.W. Shavlik and T.G. Dietterich, Readings in Machine Learning, Morgan Kaufmann Publ, CA 1990
H.J. Hamilton, N. Shan, N. Cercone, RIAC: a rule induction algorithm based on approximate classification, Tech. Rep. CS 96-06, Regina University 1996.
C-MLP2LN (logical rules): leave-one-out accuracy is only estimated, since the rules are like PVM.
3 crisp logical rules, overall 91.5% accuracy
Results for 10-fold stratified cross-validation
Wisconsin breast cancer.
From UCI repository, 699 cases, 9 attributes, two classes, 458 (65.5%) & 241 (34.5%).
Results obtained with the leave-one-out test, % of accuracy given.
F6 has 16 missing values; removing these vectors leaves 683 examples.
Results obtained with 10-fold cross-validation, 16 vectors with missing F6 values removed, 683 samples left, % of accuracy given.
K.P. Bennett, J. Blue, A Support Vector Machine Approach to Decision Trees, R.P.I Math Report No. 97-100, Rensselaer Polytechnic Institute, Troy, NY, 1997
N. Shang, L. Breiman, ICONIP'96, p.133
B. Ster and A. Dobnikar, Neural networks in medical diagnosis: Comparison with other methods. In A. Bulsari et al., editor, Proceedings of the International Conference EANN '96, pages 427-430, 1996.
F. Zarndt, A Comprehensive Case Study: An Examination of Machine Learning and Connectionist Algorithms, MSc Thesis, Dept. of Computer Science, Brigham Young University, 1995
Breast Cancer (Ljubljana data)
From UCI repository (restricted): 286 instances, 201 no-recurrence-events (70.3%), 85 recurrence-events (29.7%); 9 attributes, between 2 and 13 values each, 9 missing values.
Results: 10xCV? Sometimes the methodology was unclear;
difficult, noisy data; some methods are below the base rate (70.3%).


For 78% accuracy and a p=0.95 confidence level, the two-tailed bounds are: [72.9%, 82.4%]
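The source does not say which formula or sample size produced these bounds. As a hedged check, a Wilson score interval computed under two assumptions, n = 286 (the number of instances quoted for this dataset) and z = 1.96, gives bounds within about 0.1 percentage point of the quoted ones. The function name below is illustrative.

```python
import math

def wilson_interval(accuracy, n, z=1.96):
    """Two-tailed Wilson score interval for a proportion observed on n cases."""
    denom = 1.0 + z * z / n
    center = (accuracy + z * z / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Assumptions: n = 286 cases, z = 1.96 for a 95% confidence level.
low, high = wilson_interval(0.78, 286)
print(f"[{100 * low:.1f}%, {100 * high:.1f}%]")
```

An interval this wide (roughly ten percentage points) shows why single accuracy figures on a dataset of this size should not be used to rank methods.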
They used leave-one-out tests and obtained:
MLP+backprop: 75.7% train, 71.5% test;
Bayes: 75.9% train, 71.8% test;
CART & PVM: 77.4% train, 77.1% test;
kNN: 65.3% test.
Hepatitis.
From UCI repository, 155 vectors, 19 attributes, two classes: die with 32 (20.6%), live with 123 (79.4%).
Many missing values! F18 has 67 missing values, F15 has 29, F17 has 16 and other features between 0 and 11.
Results obtained with the leave-one-out test, % of accuracy given
Other results: our own;
Results obtained with 10-fold cross-validation, % of accuracy given; our results are with stratified cross-validation, other results: who knows? Differences for this dataset are rather small, 0.1-0.2%.
Do our good results reflect superior handling of missing values?
Duch W, Grudziński K (1998) A framework for similarity-based methods. Second Polish Conference on Theory and Applications of Artificial Intelligence, Lodz, 28-30 Sept. 1998, pp. 33-60
Weighted kNN: Duch W, Grudziński K and Diercksen G.H.F (1998) Minimal distance neural methods. World Congress of Computational Intelligence, May 1998, Anchorage, Alaska, IJCNN'98 Proceedings, pp. 1299-1304
Statlog version of Cleveland Heart disease.
13 attributes (extracted from 75), no missing values.
270 = 150 + 120 observations selected from the 303 cases (Cleveland Heart).
Attribute Information:
in mg/dl
by fluoroscopy
Classes: Absence (1) or presence (2) of heart disease;
In Statlog experiments on heart data a cost (risk) matrix was used with 9-fold cross-validation; only cost values are given.
Results below are obtained with 10-fold cross-validation, % of accuracy given, no risk matrix.
Cleveland heart disease.
From UCI repository, 303 cases, 13 attributes (4 cont, 9 nominal), 7 vectors with missing values?
2 (no, yes) or 5 classes (no, degree 1, 2, 3, 4).
Class distribution: 164 (54.1%) no, 55+36+35+13 yes (45.9%) with disease degree 1-4.
Results obtained with the leave-one-out test, % of accuracy given, 2 classes used.
Other results: our own.
Results obtained with 10-fold cross-validation, % of accuracy given.
Ster & Dobnikar reject 6 vectors with missing values (leaving 297).
We use all 303 vectors, replacing missing values by the means for their class; in kNN we have used the Statlog convention, 297 vectors.
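Replacing missing values by class means can be sketched as below. This is an illustration under assumptions, not the GhostMiner implementation: missing entries are represented by `None`, the toy rows and labels are invented, and every class is assumed to have at least one observed value per feature.

```python
def impute_class_means(rows, labels):
    """Replace None entries by the mean of that feature within the row's class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            if value is not None:
                sums[(label, j)] = sums.get((label, j), 0.0) + value
                counts[(label, j)] = counts.get((label, j), 0) + 1
    # A KeyError here would mean a class has no observed value for a feature.
    return [
        [sums[(label, j)] / counts[(label, j)] if value is None else value
         for j, value in enumerate(row)]
        for row, label in zip(rows, labels)
    ]

# Hypothetical data: two features, two classes, two missing entries.
rows = [[1.0, None], [3.0, 4.0], [10.0, 20.0], [None, 40.0]]
labels = ["no", "yes", "yes", "no"]
labels = ["no", "no", "yes", "yes"]  # class of each row
filled = impute_class_means(rows, labels)
print(filled)
```

In a cross-validation setting the class means would normally be computed from the training folds only, since the class label of a test vector is not available at prediction time.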
baserate
Results obtained with BP, LVQ, ..., SNB are from: B. Ster and A. Dobnikar, Neural networks in medical diagnosis: Comparison with other methods. In: A. Bulsari et al., editor, Proceedings of the International Conference EANN '96, pages 427-430, 1996.
Magnus Stensmo and Terrence J. Sejnowski, A Mixture Model System for Medical and Machine Diagnosis, Advances in Neural Information Processing Systems 7 (1995) 1077-1084
Kristin P. Bennett, J. Blue, A Support Vector Machine Approach to Decision Trees, R.P.I Math Report No. 97-100, Rensselaer Polytechnic Institute, Troy, NY, 1997
Other results for this dataset (methodology sometimes uncertain):
D. Wettschereck, averaging 25 runs with 70% train and 30% test, variants of kNN with different metric functions and scaling.
David Aha & Dennis Kibler: from UCI repository past usage
Friedman N, Geiger D, Goldszmidt M (1997). Bayesian network classifiers. Machine Learning 29: 131-163
Diabetes.
From the UCI repository, dataset "Pima Indian diabetes": 2 classes, 8 attributes, 768 instances, 500 (65.1%) negative (class 1), and 268 (34.9%) positive (class 2) tests for diabetes.
All patients were females at least 21 years old of Pima Indian heritage.
Attributes used:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
Results obtained with 10-fold cross-validation, % of accuracy given; Statlog results are with 12-fold cross-validation.
Results on BP, LVQ, ..., SNB are from: B. Ster and A. Dobnikar, Neural networks in medical diagnosis: Comparison with other methods. In A. Bulsari et al., editor, Proceedings of the International Conference EANN '96, pages 427-430, 1996.
Bennett K.P., J. Blue, A Support Vector Machine Approach to Decision Trees, R.P.I Math Report No. 97-100, Rensselaer Polytechnic Institute, Troy, NY, 1997
Blanchard, G., Schafer, C., Rozenholc, Y., & Muller, K.R. (2007) Optimal dyadic decision trees. Machine Learning 66: 709-717.
Michie D, D.J. Spiegelhalter, C.C. Taylor (eds), Machine Learning, Neural and Statistical Classification, Statlog project book.
Porter R.B., G. Beate Zimmer, Don R. Hush: Stack Filter Classifiers. ISMM 2009: 282-294
Shang N, L. Breiman, ICONIP'96, p.133
Other results (with different tests): Opper/Winther use 200 training and 332 test examples (following Ripley), with TAP MFT results on test 81%, SVS at 80.1% and best NN at 77.4%.
Hypothyroid.
Thyroid, from UCI repository, dataset "ann-train.data": a thyroid database suited for training ANNs.
3772 learning and 3428 testing examples; primary hypothyroid, compensated hypothyroid, normal.
Training: 93+191+3488 or 2.47%, 5.06%, 92.47%
Test: 73+177+3178 or 2.13%, 5.16%, 92.71%
21 attributes (15 binary, 6 continuous); 3 classes
The problem is to determine whether a patient referred to the clinic is hypothyroid. Therefore three classes are built: normal (not hypothyroid), hyperfunction and subnormal functioning. Because 92 percent of the patients are not hyperthyroid, a good classifier must be significantly better than 92%.
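The 92% figure is the accuracy of a classifier that always predicts the majority class. A small sketch using the class counts quoted above for the training and test sets (the function name is illustrative):

```python
def majority_baseline(class_counts):
    """Accuracy (%) of always predicting the most frequent class."""
    return 100.0 * max(class_counts) / sum(class_counts)

train_base = majority_baseline([93, 191, 3488])   # training set class counts
test_base = majority_baseline([73, 177, 3178])    # test set class counts
print(f"train {train_base:.2f}%, test {test_base:.2f}%")
```

Any reported accuracy on this dataset should be read against these base rates rather than against 50%.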
Note: These are the data Quinlan used in the case study of his article "Simplifying Decision Trees" (International Journal of Man-Machine Studies (1987) 221-234)
Names: I (W.D.) have investigated this issue, and after some mail exchange with Chris Merz, who maintains the UCI repository, here is the conclusion:
Results:
Most NN results from W. Schiffmann, M. Joost, R. Werner, 1993; MLP2LN and Init+a,b ours.
kNN, PVM and CART from S.M. Weiss, I. Kapouleas, "An empirical comparison of pattern recognition, neural nets and machine learning classification methods", in: J.W. Shavlik and T.G. Dietterich, Readings in Machine Learning, Morgan Kaufmann Publ, CA 1990
SVM with linear and Gaussian kernels gives quite poor results on this data.
3 crisp logical rules using TSH, FTI, T3, on_thyroxine, thyroid_surgery, TT4 give 99.3% accuracy on the test set.
Hepatobiliary disorders
Contains medical records of 536 patients admitted to a university-affiliated Tokyo-based hospital, with four types of hepatobiliary disorders: alcoholic liver damage, primary hepatoma, liver cirrhosis and cholelithiasis. The records included results of 9 biochemical tests and the sex of the patient. The same 163 cases as in [Hayashi et al.] were used as the test data.
FSM gives about 60 Gaussian or triangular membership functions, achieving accuracy of 75.5-75.8%. Rotation of these functions (i.e. introducing linear combinations of inputs to the rules) does not improve this accuracy. 10-fold cross-validation tests on the mixed, training plus test data, give similar results.
The best results were obtained with the K* method based on algorithmic complexity optimization, giving 78.5% on the test set, and kNN with Manhattan distance function, k=1 and selection of features (using the leave-one-out method on the training data, features 2, 5, 6 and 9 were removed), giving 80.4% accuracy. Simulated annealing optimization of the scaling factors for the remaining 5 features gives 81.0%, and optimizing scaling factors using all input features gives 82.8%. The scaling factors are: 0.92, 0.60, 0.91, 0.92, 0.07, 0.41, 0.55, 0.86, 0.30. Similar accuracy is obtained using the multisimplex method for optimization of the scaling factors.
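A nearest-neighbour rule with per-feature scaling factors can be sketched as follows. The source does not state exactly how the scaling factors enter the distance; the sketch assumes they multiply the per-feature absolute differences in the Manhattan distance. The two-feature toy data and all function names are hypothetical.

```python
def weighted_manhattan(x, y, scales):
    """Manhattan distance with each feature difference multiplied by its scale."""
    return sum(s * abs(a - b) for a, b, s in zip(x, y, scales))

def predict_1nn(train_x, train_y, query, scales):
    """Return the label of the training vector closest to the query (k=1)."""
    best = min(range(len(train_x)),
               key=lambda i: weighted_manhattan(train_x[i], query, scales))
    return train_y[best]

# Hypothetical two-feature example.
train_x = [[0.0, 0.0], [1.0, 5.0]]
train_y = ["A", "B"]
query = [0.9, 0.5]

print(predict_1nn(train_x, train_y, query, scales=[1.0, 1.0]))  # both features count
print(predict_1nn(train_x, train_y, query, scales=[1.0, 0.0]))  # second feature ignored
```

A scaling factor of zero removes a feature entirely, so feature selection is a special case of optimizing the scaling factors; a small factor such as the 0.07 quoted above nearly switches a feature off.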
S. Mitra, R. De, S. Pal, "Knowledge based fuzzy MLP for classification and rule generation", IEEE Transactions on Neural Networks 8, 1338-1350, 1997: a knowledge-based fuzzy MLP system gives results on the test set in the range from 33% to 66.3%, depending on the actual fuzzy model used.
W. Duch and K. Grudzinski, "Prototype Based Rules: New Way to Understand the Data", Int. Joint Conference on Neural Networks, Washington D.C., pp. 1858-1863, 2001. Contains the best results, with 1NN, Canberra distance and feature selection: 83.4% on the test set.
Other, nonmedical data
Landsat Satellite image dataset (STATLOG version)
Training 4435, test 2000 cases, 36 semi-continuous [0 to 255] attributes (= 4 spectral bands x 9 pixels in neighbourhood) and 6 decision classes: 1, 2, 3, 4, 5 and 7 (class 6 has been removed because of doubts about the validity of this class).
The StatLog database consists of the multispectral values of pixels in 3x3 neighbourhoods in a satellite image, and the classification associated with the central pixel in each neighbourhood. The aim is to predict this classification, given the multispectral values. In the sample database, the class of a pixel is coded as a number.


The database is a (tiny) sub-area of a scene, consisting of 82 x 100 pixels. Each line of data corresponds to a 3x3 square neighbourhood of pixels completely contained within the 82x100 sub-area. Each line contains the pixel values in the four spectral bands (converted to ASCII) of each of the 9 pixels in the 3x3 neighbourhood and a number indicating the classification label of the central pixel. In each line of data the four spectral values for the top-left pixel are given first, followed by the four spectral values for the top-middle pixel and then those for the top-right pixel, and so on, with the pixels read out in sequence left-to-right and top-to-bottom. Thus, the four spectral values for the central pixel are given by attributes 17, 18, 19 and 20. If you like you can use only these four attributes, while ignoring the others. This avoids the problem which arises when a 3x3 neighbourhood straddles a boundary.
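The attribute layout described above can be indexed as in this short sketch. The function names and the test line (whose values simply equal their attribute numbers) are illustrative assumptions; only the ordering of pixels and bands comes from the description.

```python
def pixel_bands(attributes, row, col):
    """Four spectral values of the pixel at (row, col), 0-based, in the 3x3 block."""
    start = 4 * (3 * row + col)   # pixels are stored left-to-right, top-to-bottom
    return attributes[start:start + 4]

def central_pixel(attributes):
    """Spectral values of the central pixel: attributes 17-20 in 1-based numbering."""
    if len(attributes) != 36:
        raise ValueError("expected 36 attributes (4 bands x 9 pixels)")
    return pixel_bands(attributes, 1, 1)

line = list(range(1, 37))   # hypothetical line: value equals attribute number
print(central_pixel(line))
```

Using only the central pixel reduces the input from 36 to 4 attributes, at the cost of discarding the spatial context.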
All results are from the Statlog book, except GM (GhostMiner calculations, W. Duch).
Ionosphere
351 data records, with class division 224 (63.8%) + 126 (35.9%). Usually the first 200 vectors are taken for training, and the last 151 for the test, but this is very unbalanced: in the training set 101 (50.5%) and 99 (49.5%) are from class 1/2, in the test set 123 (82%) and 27 (18%) are from class 1/2.
34 attributes, but f2=0 always and should be removed; f1 is binary, the remaining 32 attributes are continuous.
2 classes: different types of radar signals reflected from the ionosphere.
Some vectors: 8, 18, 20, 22, 24, 30, 38, 52, 76, 78, 80, 82, 103, 163, 169, 171, 183, 187, 189, 191, 201, 215, 219, 221, 223, 225, 227, 229, 231, 233, 249, are either binary 0, 1 or have only 3 values -1, 0, +1.
For example, vector 169 has only one component = 1, all others are 0.
Sigillito, V. G., Wing, S. P., Hutton, L. V., & Baker, K. B. (1989) Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest, 10, 262-266.
N. Shang, L. Breiman, ICONIP'96, p.133
David Aha: kNN+C4+IB3, from Aha, D. W., & Kibler, D. (1989). Noise-tolerant instance-based learning algorithms. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (pp. 794-799). Detroit, MI: Morgan Kaufmann.
IB3 parameter settings: 70% and 80% for acceptance and dropping respectively.
RIAC, C4.5 from: H.J. Hamilton, N. Shan, N. Cercone, RIAC: a rule induction algorithm based on approximate classification, Tech. Rep. CS 96-06, Regina University 1996.
K.P. Bennett, J. Blue, A Support Vector Machine Approach to Decision Trees, R.P.I Math Report No. 97-100, Rensselaer Polytechnic Institute, Troy, NY, 1997
The training/test division is not very good in this case; the class distributions are somewhat different.
In 10xCV results are:
Maszczyk T, Duch W, Support Feature Machine, WCCI 2010 (submitted).
Sonar: Mines vs Rocks
208 cases, 60 continuous attributes, 2 classes, 111 metal, 97 rock. From the CMU benchmark repository.
This dataset has been used in two kinds of experiments:
1. The "aspect-angle independent" experiments use all 208 cases with 13-fold cross-validation, averaged over 10 runs to get std.
2. The "angle independent experiments" use training / test sets with 104 vectors each. Class distribution in training is 49 + 55, in test 62 + 42.
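The first protocol (13-fold cross-validation repeated over 10 runs) can be sketched as below. This is a hedged illustration: it uses random partitions, whereas the original experiments may have used specifically chosen ones; `evaluate` is a hypothetical callback standing in for training and testing a classifier, and the dummy evaluator exists only so that the sketch runs.

```python
import random
import statistics

def kfold_indices(n_cases, k, seed):
    """Shuffle case indices and deal them into k folds of near-equal size."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def repeated_cv(evaluate, n_cases, k=13, runs=10):
    """Mean and std, over runs, of the per-run average CV accuracy (%)."""
    run_means = []
    for run in range(runs):
        folds = kfold_indices(n_cases, k, seed=run)
        scores = []
        for i, test_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(evaluate(train_idx, test_idx))
        run_means.append(statistics.mean(scores))
    return statistics.mean(run_means), statistics.stdev(run_means)

# Dummy evaluator (not a classifier): returns the training fraction in %.
mean_acc, std_acc = repeated_cv(lambda tr, te: 100.0 * len(tr) / 208, n_cases=208)
folds = kfold_indices(208, 13, seed=0)
print(len(folds), len(folds[0]), round(mean_acc, 2), std_acc)
```

With 208 cases and 13 folds every test fold holds exactly 16 cases, so each classifier is trained on 192 cases.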
Estimation of L1O on the whole dataset (Opper and Winther) gives only 78.2%; is the test set so easy? Some of these results were obtained without standardization of the data, which is very important here!
The "angle independent experiments" with training / test sets.
N. Shang, L. Breiman, ICONIP'96, p.133, 10xCV
Gorman, R. P., and Sejnowski, T. J. (1988). "Analysis of Hidden Units in a Layered Network Trained to Classify Sonar Targets", Neural Networks 1, pp. 75-89, 13xCV
Our results: kNN results from 10xCV and from 13xCV are quite similar, so the Shang and Breiman results should not differ much from 13xCV.
WD leave-one-out (L1O) estimations on std data:
L1O with k=1, Euclidean distance, for all data gives 87.50%; other k and distance functions do not give significant improvement.
SVM linear, C=1, L1O 75.0%; for Gaussian kernel, C=1, L1O is 78.8%.
Other L1O results taken from C. Domeniconi, J. Peng, D. Gunopulos, "An adaptive metric for pattern classification".
Vowel
528 training, 462 test cases, 10 continuous attributes, 11 classes. From the UCI benchmark repository.
Speaker-independent recognition of the eleven steady-state vowels of British English using a specified training set of LPC-derived log area ratios.
Results on the total set
Telugu Vowel
871 patterns, 6 overlapping vowel classes (Indian Telugu vowel sounds), 3 features (formant frequencies).
Papers using this data:
Wine data
Source: UCI, described in Forina, M. et al, PARVUS: An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
Class distribution: 178 cases = [59, 71, 48] in classes 1-3;
13 continuous attributes: alcohol, malic acid, ash, alkalinity, magnesium, phenols, flavanoids, nonanthocyanins, proanthocyanins, color, hue, OD280/OD315, proline.
[1] S. Aeberhard, D. Coomans and O. de Vel, Comparison of Classifiers in High Dimensional Settings, Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland (submitted to Technometrics).
[2] S. Aeberhard, D. Coomans and O. de Vel, "The classification performance of RDA", Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland (submitted to Journal of Chemometrics).
Other Data
Glass identification
Shang, Breiman: CART 71.4% accuracy, DBCART 70.6%.
Leave-one-out results taken from C. Domeniconi, J. Peng, D. Gunopulos, "An adaptive metric for pattern classification".
DNA: Primate splice-junction gene sequences, with associated imperfect domain theory.
Statlog data: splice junctions are points on a DNA sequence at which "superfluous" DNA is removed during the process of protein creation in higher organisms. The problem posed in this dataset is to recognize, given a sequence of DNA, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out).
This problem consists of two subtasks: recognizing exon/intron boundaries (referred to as EI sites), and recognizing intron/exon boundaries (IE sites). (In the biological community, IE borders are referred to as "acceptors" while EI borders are referred to as "donors".)
Number of Instances: 3190. Class distribution:
Much better performance is generally observed if attributes closest to the junction are used (middle). In the StatLog version (180 variables), this means using attributes A61 to A120 only.
SSV Decision Tree: our results
Włodzisław Duch, last modification 26.08.2012