Richard Simon, Michael D. Radmacher, Kevin Dobbin, Lisa M. McShane
Pitfalls in the Use of DNA Microarray Data for Diagnostic and Prognostic Classification
Journal of the National Cancer Institute, Vol.95, No.1, 14-18, January 1, 2003
[PDF]
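・A summary of pitfalls to watch for when using microarray data for disease diagnosis (classification).
・Experiment: using artificial simulated data, the authors examined how the cross-validation setup affects the estimated classification accuracy.
・Data: 20 samples (10:10 class split), 6000 genes, generated from random numbers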
・Problem: 「
Although cluster analysis is appropriate for class discovery, it is often not effective for class comparison or class prediction.」
・Problem: 「
Cluster analysis also does not provide statistically valid quantitative information about which genes are differentially expressed between classes.」
・「
One major limitation of supervised methods is overfitting the predictor. Overfitting means that the number of parameters of the model is too large relative to the number of cases or specimens available.」
・Experiment: 「
We performed a simulation to examine the bias in estimated error rates for a class prediction study with various levels of cross-validation」
・「
Simple methods such as diagonal linear discriminant analysis and nearest neighbor classification (18), the weighted voting method (1), and the compound covariate predictor (2,5) have been very effective in cancer studies with small numbers of cases.」
・「
We recommend that supervised methods rather than cluster analyses be used for class prediction and class comparison studies.」
・「
Finally, we urge investigators not to make strong claims about the value of new prediction algorithms without comparing them to more standard prediction methods.」
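
The simulation noted above (20 samples split 10:10, 6000 genes of pure random noise) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' code. The notes do not record the gene-selection rule, the number of selected genes, or the classifier, so the following are all assumptions made for illustration: the top 10 genes ranked by a t-like statistic, a diagonal linear discriminant classifier, leave-one-out cross-validation, standard normal data, and the function names themselves. The sketch compares three error estimates: resubstitution (no cross-validation), cross-validation applied only after gene selection, and cross-validation that repeats gene selection inside every fold.

```python
import numpy as np

def select_genes(X, y, k):
    """Indices of the k genes with the largest absolute t-like statistic."""
    a, b = X[y == 0], X[y == 1]
    diff = a.mean(axis=0) - b.mean(axis=0)
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.argsort(-np.abs(diff / se))[:k]

def dlda_predict(X_train, y_train, X_test):
    """Diagonal LDA: per-gene pooled variance, covariances ignored."""
    m0 = X_train[y_train == 0].mean(axis=0)
    m1 = X_train[y_train == 1].mean(axis=0)
    resid = np.where((y_train == 0)[:, None], X_train - m0, X_train - m1)
    var = (resid ** 2).sum(axis=0) / (len(y_train) - 2)
    d0 = ((X_test - m0) ** 2 / var).sum(axis=1)
    d1 = ((X_test - m1) ** 2 / var).sum(axis=1)
    return (d1 < d0).astype(int)

def loocv_errors(X, y, k, select_inside):
    """Leave-one-out error count; gene selection inside or outside the loop."""
    n = len(y)
    fixed = None if select_inside else select_genes(X, y, k)
    errors = 0
    for i in range(n):
        train = np.arange(n) != i
        genes = select_genes(X[train], y[train], k) if select_inside else fixed
        pred = dlda_predict(X[train][:, genes], y[train], X[i:i + 1, genes])
        errors += int(pred[0] != y[i])
    return errors

def simulate(n_rep=20, n_per_class=10, n_genes=6000, k=10, seed=0):
    """Mean error rates: (resubstitution, CV after selection, full CV)."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_per_class)
    totals = np.zeros(3)
    for _ in range(n_rep):
        # Pure noise: the class labels carry no information about X.
        X = rng.standard_normal((2 * n_per_class, n_genes))
        genes = select_genes(X, y, k)
        resub = int((dlda_predict(X[:, genes], y, X[:, genes]) != y).sum())
        totals += [resub,
                   loocv_errors(X, y, k, select_inside=False),
                   loocv_errors(X, y, k, select_inside=True)]
    return totals / (n_rep * 2 * n_per_class)

if __name__ == "__main__":
    resub, partial, full = simulate()
    print(f"resubstitution: {resub:.2f}")
    print(f"CV after gene selection: {partial:.2f}")
    print(f"CV including gene selection: {full:.2f}")
```

Because the data contain no real class signal, an honest estimate should sit near chance level (about 50% error). Under these assumptions, the first two estimates should come out far too optimistic, and only the variant that redoes gene selection within each fold should land near chance.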