Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Walsh Sugnet, Terrence S. Furey, Manuel Ares, Jr., and David Haussler
Knowledge-based analysis of microarray gene expression data by using support vector machines
PNAS vol. 97 no. 1, 262-267, January 4, 2000
[PDF download][Web site][Data download]
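・Uses SVMs (support vector machines) to classify yeast genes by function.
★Experiment 1: Compare the performance of eight classification algorithms and demonstrate the effectiveness of SVMs.
・Data: yeast microarray (expression) data. 79 samples. 2467 annotated genes.
・The eight classification algorithms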
1.SVM(dot-product kernel, d=1)
2.SVM(dot-product kernel, d=2)
3.SVM(dot-product kernel, d=3)
4.SVM(radial basis kernel)
5.Parzen windows
6.Fisher's linear discriminant
7.Decision tree learner(C4.5)
8.Decision tree learner(MOC1)
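The four SVM variants in the list above differ only in their kernel (similarity) function. The following is a minimal sketch, assuming that "d" is the power applied to a dot-product kernel of the form (x·y + 1)^d and that the radial basis kernel is the usual Gaussian. The function names and the `sigma` parameter are hypothetical and not taken from these notes.

```python
import math


def dot(x, y):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def poly_kernel(x, y, d=1):
    """Dot-product kernel raised to the power d (assumed form: (x.y + 1)^d)."""
    return (dot(x, y) + 1.0) ** d


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian radial basis kernel; sigma is an assumed width parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

With d=1 the kernel is linear in the expression vector; larger d lets the classifier use products of expression values across experiments.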
・The six gene functional classes to be recognized (as defined by MYGD (Munich Information Center for Protein Sequences Yeast Genome Database), now CYGD)
1.Tricarboxylic acid (TCA) cycle
2.Respiration
3.Cytoplasmic ribosomes
4.Proteasome
5.Histones
6.Helix-turn-helix proteins (control group)
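・Evaluation method: three-way cross-validated experiment
★Experiment 2: Use SVMs to predict the functions of genes whose function is unknown.
・Data: yeast microarray (expression) data. 80 samples (65 shared with Experiment 1). 6221 genes (2467 annotated + 3754 unannotated (function unknown)).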
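A three-way cross-validated experiment can be sketched as follows: the genes are split into three groups, and each group in turn is held out for testing while the other two are used for training. This is only an illustration under assumptions; the random shuffling, the seed, and the function names are hypothetical, and the notes do not say how the split was actually made.

```python
import random


def three_way_folds(n_items, seed=0):
    """Shuffle item indices and deal them into three nearly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::3] for i in range(3)]


def train_test_splits(n_items, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = three_way_folds(n_items, seed)
    for k in range(3):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# Usage: iterate over the three splits for the 2467 annotated genes.
for train_idx, test_idx in train_test_splits(2467):
    pass  # train a classifier on train_idx, evaluate it on test_idx
```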
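Each gene in these data sets is represented by a vector with one value per sample, and the definition quoted further down in these notes describes each value as the logarithm of an expression ratio against a reference state. The sketch below assumes the ratio is the measured level divided by the reference level and uses the natural logarithm. The base, the function name, and the argument names are assumptions not stated in these notes.

```python
import math


def log_ratios(measured, reference):
    """Return log(measured / reference) for each sample of one gene.

    Both inputs are equal-length lists of positive expression levels.
    """
    if len(measured) != len(reference):
        raise ValueError("measured and reference must have the same length")
    return [math.log(e / r) for e, r in zip(measured, reference)]
```

Under this convention a value of 0 means no change from the reference state, and up- and down-regulation by the same factor give values of equal size and opposite sign.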
・「SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers.」
・「We define Xi to be the logarithm of the ratio of expression level Ri of gene X in the reference state,」
・「The gene functional classes examined here contain very few members relative to the total number of genes in the data set. This leads to an imbalance in the number of positive and negative training examples that, in combination with noise in the data, is likely to cause the SVM to make incorrect classifications.」 This is quite a tough problem.
・The tentative remedy is this → 「To judge overall performance, we define the cost of using the method M as C(M)=fp(M) + 2·fn(M),」 I doubt how effective this really is.
・「In general, these disagreements with MYGD reflect the different perspective provided by the expression data, which represents the genetic response of the cell, and the MYGD definitions, which have been arrived at through experiments or protein structure predictions.」
・「We have demonstrated that support vector machines can accurately classify genes into some functional categories based on expression data from DNA microarray hybridization experiments and have made predictions aimed at identifying the functions of unannotated yeast genes.」
・The English was very easy to read.
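The cost function quoted in the notes above, C(M) = fp(M) + 2·fn(M), counts each false negative twice as heavily as each false positive. A minimal sketch, assuming labels are encoded as 1 (gene belongs to the class) and 0 (it does not); the function name and the label encoding are hypothetical.

```python
def cost(y_true, y_pred):
    """C(M) = fp(M) + 2 * fn(M) for binary labels (1 = positive, 0 = negative)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if p and not t)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return fp + 2 * fn


# Usage: one false positive and one false negative give a cost of 1 + 2 = 3.
y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0]
total = cost(y_true, y_pred)
```

One consequence of this definition is that a classifier which labels every gene as negative has a cost of twice the number of class members, which gives a simple baseline to compare against when the classes are small.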