Tao Li, Chengliang Zhang and Mitsunori Ogihara
A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression
Bioinformatics 2004 20(15):2429-2437
[PDF] [Web Site]
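・Compares performance across exhaustive combinations of existing gene selection methods, classification methods, and microarray datasets.
・Microarray datasets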
1.ALL-AML-3 [Golub]
2.ALL-AML-4 [Golub]
3.ALL [Yeoh]
4.GCM [Ramaswamy]
5.SRBCT [Khan]
6.MLL-leukemia [Armstrong]
7.Lymphoma [Alizadeh]
8.NCI60 [Ross]
9.HBC [Hedenfalk]
・Gene selection methods (ranking methods) (software: Rankgene)
1.Information gain
2.Twoing rule
3.Sum minority
4.Max minority
5.Gini index
6.Sum of variances
7.One-dimensional SVM
8.t-statistics
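
As a rough illustration of what a ranking method does, the sketch below scores each gene independently by information gain (the first criterion in the list) over its best single-threshold split, then keeps the highest-scoring genes. This is a minimal sketch and not necessarily how Rankgene computes it; the names `entropy`, `information_gain`, `rank_genes` and the toy data are assumptions made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Largest entropy reduction over all single-threshold splits of one gene."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        split = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        best = max(best, base - split)
    return best

def rank_genes(expr, labels, top_k):
    """expr[s][g] is the expression of gene g in sample s; return top_k gene indices."""
    n_genes = len(expr[0])
    scores = [information_gain([row[g] for row in expr], labels) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:top_k]

# Toy data (made up): gene 0 separates the two classes, gene 1 does not.
expr = [[0.1, 5.0], [0.2, 1.0], [0.9, 4.0], [1.0, 2.0]]
labels = ["ALL", "ALL", "AML", "AML"]
top = rank_genes(expr, labels, top_k=1)
```

Because every gene is scored on its own, a ranking of this kind cannot account for correlations between genes.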
・Classification methods
1.SVM (one-versus-the-rest method)
2.SVM (pairwise comparison method)
3.SVM (ECOC method - Random coding)
4.SVM (ECOC method - Exhaustive coding)
5.Naive Bayes
6.K-nearest neighbor (KNN)
7.Decision Tree
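
The first four entries differ only in how a multiclass problem is broken into binary SVM problems. The sketch below builds the class-by-problem code matrix for one-versus-the-rest, pairwise comparison, and exhaustive ECOC, and decodes a vector of binary outputs by counting disagreements. It is a minimal sketch under assumptions: the binary classifiers themselves are left out, random coding is not shown, and all function names are hypothetical rather than taken from the paper.

```python
from itertools import combinations, product

def one_vs_rest_code(k):
    """k binary problems: problem j labels class j as +1 and every other class as -1."""
    return [[1 if c == j else -1 for j in range(k)] for c in range(k)]

def pairwise_code(k):
    """k*(k-1)/2 binary problems: +1 and -1 for the two classes in a pair, 0 = not used."""
    pairs = list(combinations(range(k), 2))
    return [[1 if c == a else -1 if c == b else 0 for a, b in pairs] for c in range(k)]

def exhaustive_code(k):
    """Every distinct non-trivial split of the classes: 2**(k-1) - 1 binary problems."""
    cols = [(1,) + rest for rest in product((1, -1), repeat=k - 1)]
    cols = [col for col in cols if -1 in col]  # drop the all-(+1) column
    return [[col[c] for col in cols] for c in range(k)]

def decode(code, outputs):
    """Return the class whose codeword disagrees with the binary outputs least often."""
    def distance(word):
        # Entries equal to 0 mean the class took no part in that binary problem.
        return sum(1 for w, o in zip(word, outputs) if w != 0 and w != o)
    return min(range(len(code)), key=lambda c: distance(code[c]))

# Usage: with 4 classes, flip one binary output of class 2's codeword and decode it.
code = exhaustive_code(4)
noisy = list(code[2])
noisy[0] = -noisy[0]
predicted = decode(code, noisy)
```

With the exhaustive code, any two codewords for 4 classes differ in 4 of the 7 positions, so a single wrong binary output can still be decoded; the one-versus-the-rest code has no such margin.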
・Evaluation: classify using the top 150 genes from each ranking result; compute classification accuracy by 4-fold cross validation.
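
The evaluation loop can be sketched as follows, using a small k-nearest-neighbor classifier as a stand-in for any of the classifiers. This is a minimal sketch under assumptions: the input is taken to be already reduced to the selected genes (these notes do not say whether the ranking is redone inside each fold), and `knn_predict`, `cross_val_accuracy`, and the toy data are hypothetical names and values for illustration.

```python
import random
from collections import Counter

def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k training samples closest to x (squared Euclidean)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def cross_val_accuracy(x, y, n_folds=4, k=3, seed=0):
    """Shuffle once, split into n_folds parts, return the fraction of held-out samples classified correctly."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        train_x = [x[i] for i in train]
        train_y = [y[i] for i in train]
        correct += sum(knn_predict(train_x, train_y, x[i], k) == y[i] for i in held_out)
    return correct / len(x)

# Toy data (made up): 8 samples, 2 "genes", two well-separated classes.
x = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.1, 0.2],
     [5.0, 5.1], [5.1, 5.0], [5.2, 5.1], [5.1, 5.2]]
y = ["A"] * 4 + ["B"] * 4
accuracy = cross_val_accuracy(x, y)
```

If genes are ranked on the full dataset before the split, the held-out samples have already influenced the selection, which can make the accuracy estimate optimistic.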
・Overview: "This paper compares various feature selection methods as well as various state-of-the-art classification methods on various multiclass gene expression datasets."
・"While increasing the number of samples is a plausible solution to the problem of accuracy degradation, it is important to develop algorithms that are able to analyze effectively multi-class expression data for these special datasets."
・Result: "It is difficult to select the best feature selection method. There does not seem to exist a clear winner."
・Result: "The accuracy of classification is highly dependent on the choice of the classification method. The choice is more important than the choice of feature selection method."
・Result: "These two datasets have smaller sample sizes than the other datasets, so one may conclude that multiclass classification based on gene expression can be effectively solved when sample size is large."
・Result: "The study suggests that multiclass classification problems are more difficult than binary ones in general."
・"Is it possible to design a feature selection method that takes into consideration correlations between features?"
・The English is easy to read.
・The Publications page on the author's website lists a paper with the intriguing title "Music Artist Style Identification by Semisupervised Learning from both Lyrics and Content."