Ka Yee Yeung and Roger E Bumgarner
Multiclass classification of microarray data with repeated measurements: application to cancer
Genome Biology 2003, 4:R83
[PDF][Web Site]
・Proposes sample classification methods based on USC and EWUSC.
・Data
1.National Cancer Institute NCI 60 data, 5244 genes, 61 samples [Ross]
2.Multiple tumor data, 7129 genes, 123 samples [Ramaswamy]
3.Breast cancer data, 25000 genes, 97 samples [van't Veer]
4.Synthetic data, 1000 genes, 40 samples
・Evaluation criteria for the classification algorithms
1.Prediction accuracy
2.Number of relevant genes
3.Feature stability
・Overview: 「We have developed the uncorrelated shrunken centroid (USC) and error-weighted, uncorrelated shrunken centroid (EWUSC) algorithms that are applicable to microarray data with any number of classes.」
・Significance: 「Selection of relevant genes for classification is known as feature selection. This has three main applications: first, the classification accuracy is often improved using a subset instead of the entire set of genes; second, a small set of relevant genes is convenient for developing diagnostic tests; and third, these genes may lead to biologically interesting insights that are characteristic of the classes of interest.」
・Problem: 「However, many of these methods are tailored towards binary classification in which there are only two classes [9,14]. Moreover, there has been very limited effort to develop classification and feature-selection algorithms for microarray data with repeated measurements or error estimates.」
・Synthetic data generation: 「Our approach is to start with 'patterned genes' which have a different expression pattern in samples. The next step is to introduce noise (variation in both the class and non-class values) to these patterned genes in order to reflect 'real-life' data. Finally, 'non-patterned genes', which are irrelevant in classifying samples, are added to these synthetic datasets.」
・「Even with this simple synthetic data-generation approach, generating sensible synthetic data turned out to be a nontrivial task.」
・「Surprisingly, removing highly correlated genes does not produce any considerable improvement in prediction accuracy and does not drastically reduce the number of relevant genes.」
・「We showed that the step of removing highly correlated genes in USC is effective in reducing the number of relevant genes without sacrificing prediction accuracy, and hence, USC is an improvement over SC.」
・「Our main contribution is that we use cross-validation to select a correlation threshold (ρ0) for the removal of highly correlated genes.」
・「The EWUSC algorithm is a modification of the SC algorithm with two key differences: noisy measurements are down-weighted and redundant genes (features) are removed.」
・So the idea is to remove genes that contribute little to classification even when they rank highly, and thereby make classification more efficient??
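
To make the two ideas in these notes concrete (the three-step synthetic data recipe and the removal of highly correlated genes above a threshold ρ0), here is a minimal Python sketch. It is not the authors' implementation. The function names, the noise levels, the number of patterned genes, the class-separation score used for ranking, and the fixed `rho0 = 0.6` are all assumptions chosen for illustration. In the paper ρ0 is selected by cross-validation, and ranking comes from the shrunken centroid procedure, neither of which is reproduced here.

```python
import numpy as np


def make_synthetic(n_genes=1000, n_samples=40, n_classes=4,
                   n_patterned=40, noise_sd=0.5, seed=0):
    """Hypothetical generator following the three steps quoted in the notes."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(n_classes), n_samples // n_classes)
    X = np.zeros((n_genes, n_samples))
    # Step 1: patterned genes, each elevated in exactly one class.
    for g in range(n_patterned):
        X[g, labels == g % n_classes] = 2.0
    # Step 2: add noise to both the class and non-class values.
    X[:n_patterned] += rng.normal(0.0, noise_sd, size=(n_patterned, n_samples))
    # Step 3: non-patterned genes carry no class information (pure noise).
    X[n_patterned:] = rng.normal(0.0, 1.0, size=(n_genes - n_patterned, n_samples))
    return X, labels


def rank_genes(X, labels):
    """Rank genes by a simple class-separation score (an assumed stand-in
    for the shrunken centroid ranking): variance of the class means divided
    by the mean within-class variance."""
    classes = np.unique(labels)
    means = np.stack([X[:, labels == c].mean(axis=1) for c in classes], axis=1)
    within = np.stack([X[:, labels == c].var(axis=1) for c in classes], axis=1)
    score = means.var(axis=1) / (within.mean(axis=1) + 1e-12)
    return np.argsort(score)[::-1]  # best gene first


def remove_correlated(X, ranked_genes, rho0):
    """Greedy filter: walk down the ranking and keep a gene only if its
    absolute Pearson correlation with every gene kept so far is <= rho0."""
    kept = []
    for g in ranked_genes:
        if all(abs(np.corrcoef(X[g], X[k])[0, 1]) <= rho0 for k in kept):
            kept.append(int(g))
    return kept


X, labels = make_synthetic()
top = rank_genes(X, labels)[:60]      # candidate relevant genes
rho0 = 0.6                            # assumed value, not cross-validated
kept = remove_correlated(X, top, rho0)
print(len(top), "candidate genes ->", len(kept), "after removing correlated genes")
```

Because patterned genes belonging to the same class share one expression pattern, they tend to be strongly correlated with each other, so the filter keeps a few representatives per class and drops the rest. That is the sense in which a high-ranking gene can still be redundant for classification.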