I. Inza, P. Larrañaga, R. Blanco, A. Cerrolaza
Filter versus wrapper gene selection approaches in DNA microarray domains.
Artificial Intelligence in Medicine, Volume 31, Issue 2, Pages 91-103
・Compares the performance of the filter method and the wrapper method, two ways of selecting the genes used for sample classification, across multiple datasets and classifiers.
・Data
1.Colon dataset, 62 samples (40 tumor/22 normal), selected 2000 genes [Alon]
2.Leukemia dataset, 72 samples (25 AML/47 ALL), 7129 genes [Golub]
・Gene selection methods
[A]Filter approach
(a)For continuous data
1.P-metric
2.t-score
(b)For discrete data
1.Shannon-entropy
2.Euclidean-distance
3.Kolmogorov-dependence
4.Kullback-Leibler
[B]Wrapper approach: Sequential forward selection (SFS): {3,5,10,20} genes of highest scoring value
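The two continuous filter metrics listed above can be sketched as a univariate ranking step. This is a minimal illustration, not the paper's implementation: it assumes the P-metric is the signal-to-noise form (difference of class means over the sum of class standard deviations) and the t-score is the unequal-variance two-sample statistic. The function names and the toy data are made up for the example.

```python
import math
from statistics import mean, stdev

def p_metric(a, b):
    """Assumed P-metric: (mean_a - mean_b) / (sd_a + sd_b)."""
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))

def t_score(a, b):
    """Assumed t-score: two-sample t statistic with unequal variances."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / math.sqrt(va / len(a) + vb / len(b))

def rank_genes(X, y, metric, k):
    """Return the indices of the k genes with the largest absolute score.

    X: list of samples, each a list of expression values (one per gene).
    y: list of 0/1 class labels, one per sample.
    """
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        scores.append(abs(metric(a, b)))
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]

# Toy data (made up): gene 0 separates the classes clearly,
# gene 1 is mostly noise, gene 2 separates them weakly.
X = [[1.0, 5.0, 2.0], [1.2, 3.0, 2.5], [0.9, 4.0, 1.8],
     [3.0, 4.5, 2.6], [3.1, 3.5, 3.0], [2.9, 5.0, 2.4]]
y = [0, 0, 0, 1, 1, 1]

top_p = rank_genes(X, y, p_metric, 2)
top_t = rank_genes(X, y, t_score, 2)
print(top_p, top_t)
```

Each gene is scored on its own, without reference to any classifier; this is what makes the filter approach cheap and univariate.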
・Sample classification methods (supervised classifiers)
1.IB1 : Nearest-neighbor (K-NN) classifier
2.Naive-Bayes (NB) rule : Bayes theorem
3.C4.5 : Decision tree
4.CN2 : Set of IF-THEN rules
・Evaluation of classification results: LOOCV (leave-one-out cross-validation)
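As a rough sketch of how a wrapper combines these pieces, the code below runs sequential forward selection around a 1-nearest-neighbor classifier and scores each candidate gene subset by leave-one-out accuracy. It rests on assumptions: the stopping rule (stop when no gene improves accuracy, or when `max_genes` is reached), the squared Euclidean distance, the function names, and the toy data are illustrative choices, not details taken from the paper.

```python
def loocv_accuracy_1nn(X, y, genes):
    """Leave-one-out accuracy of a 1-NN classifier using only `genes`."""
    correct = 0
    for i in range(len(X)):
        best_dist, best_label = None, None
        for j in range(len(X)):
            if i == j:
                continue  # the held-out sample cannot be its own neighbor
            d = sum((X[i][g] - X[j][g]) ** 2 for g in genes)
            if best_dist is None or d < best_dist:
                best_dist, best_label = d, y[j]
        correct += (best_label == y[i])
    return correct / len(X)

def sequential_forward_selection(X, y, max_genes):
    """Greedily add the gene that most improves LOOCV accuracy."""
    selected, best_acc = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_genes:
        candidate, candidate_acc = None, best_acc
        for g in remaining:
            acc = loocv_accuracy_1nn(X, y, selected + [g])
            if acc > candidate_acc:
                candidate, candidate_acc = g, acc
        if candidate is None:
            break  # no remaining gene improves accuracy
        selected.append(candidate)
        remaining.remove(candidate)
        best_acc = candidate_acc
    return selected, best_acc

# Toy data (made up): only gene 0 separates the two classes cleanly.
X = [[1.0, 5.0, 2.0], [1.2, 3.0, 2.5], [0.9, 4.0, 1.8],
     [3.0, 4.5, 2.6], [3.1, 3.5, 3.0], [2.9, 5.0, 2.4]]
y = [0, 0, 0, 1, 1, 1]

selected, accuracy = sequential_forward_selection(X, y, max_genes=3)
print(selected, accuracy)
```

Every candidate subset costs a full LOOCV pass over the classifier, which gives a feel for why the wrapper approach carries a much heavier computational load than filter ranking.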
・Summary:「
In this work, a comparison between a group of different filter metrics and a wrapper sequential search procedure is carried out.」
・「
Although the wrapper approach mainly shows a more accurate behavior than filter metrics, this improvement is coupled with considerable computer-load necessities.」
・Purpose:「
By an extensive comparison with more popular filter techniques, we would like to make contributions in the expansion and study of the wrapper approach in this type of domains.」
・Issue:「
For most biological problems, information about the class (or type) of each cell-line exists: reflecting whether the tissue is diseased or healthy, the distinction of the specific tumor type, etc.」
・「
To avoid this 'curse of dimensionality' [12], feature selection plays a crucial role in DNA microarray analysis.」
・Issue:「
Most of the supervised learning algorithms perform rather poorly when faced with many irrelevant or redundant (depending on the specific characteristics of the classifier) features.」
・Note:「
It must be noted that there are few coincidences in both datasets among the genes selected by the filter and wrapper approaches. It seems that the wrapper approach, by its multivariate selection search procedure, prefers genes which directly cause high accuracy levels in the induced classifiers. On the other hand, the filter approach does not directly take the predictive power of the genes into account, and it univariately selects the genes that are closely related with the class label. Thus, there are no large coincidences between the ‘accurate’ genes multivariately selected by the wrapper approach and the class-related genes univariately proposed by the filter metrics.」
・Future work:「
As future work, we envision to use new filter metrics which, by the use of statistical hypothesis tests, automatically fix the number of genes to induce the classifier. We also plan to use population-based, randomized search algorithms, such as genetic algorithms or estimation of distribution algorithms for the selection of discriminative genes in DNA microarray tasks.」
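・I am interested in the discrete-data approach, which splits expression-level data into three values: {under-expressed, baseline, over-expressed}.
・As is typical of authors from non-English-speaking countries, the English is easy to read.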
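The three-level discretization mentioned in the notes can be illustrated with a small sketch. These notes do not record the cut points used in the paper, so the thresholds here (per-gene mean plus or minus `width` standard deviations) are purely a hypothetical choice, as are the function name and the sample values.

```python
from statistics import mean, stdev

def discretize_gene(values, width=1.0):
    """Map expression values to three labels.

    Hypothetical cut points: mean - width * sd and mean + width * sd,
    computed per gene across all samples.
    """
    mu, sd = mean(values), stdev(values)
    low, high = mu - width * sd, mu + width * sd
    labels = []
    for v in values:
        if v < low:
            labels.append("under-expressed")
        elif v > high:
            labels.append("over-expressed")
        else:
            labels.append("baseline")
    return labels

# Made-up expression values for one gene across six samples.
levels = discretize_gene([1.0, 2.0, 2.1, 1.9, 2.0, 3.0])
print(levels)
```

Once every gene is reduced to these three symbols, the discrete filter metrics in the list (Shannon-entropy, Kullback-Leibler, and so on) can be computed from simple frequency counts per class.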