ぴかりんの頭の中味 (Inside Pikarin's Head)

Mostly a record of eating out. Based in Muroran, Hokkaido.

[Paper] Dabney, 2005, Classification of microarrays to ~

2008-01-11 08:02:26 | Paper notes
Alan R. Dabney
Classification of microarrays to nearest centroids
Bioinformatics 2005 21(22):4148-4154
[PDF][Web Site]

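・Proposes Classification to Nearest Centroids (ClaNC) as a method for classifying samples. Its strength is the simplicity of the LDA-based algorithm.
・Simulated data: 4 classes, 30 samples per class, 5000 genes; three datasets prepared with different expression-level distributions
・Real data
1. Small round blue cell tumors (SRBCT): 2307 genes, 83 samples, 4 classes [Khan]
2. Lymphoma: 4026 genes, 58 samples, 3 classes [Alizadeh]
3. NCI cancer cell lines: 6830 genes, 60 samples, 10 classes [Ross]
4. Leukemia: 3857 genes, 38 samples, 2 classes [Golub]
・Comparison method: Prediction Analysis of Microarrays (PAM) [Tibshirani]; besides the default settings, four variants with modified settings were used
・Evaluation of classification (error rates): 5-fold cross-validation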

・Problem: "I surprisingly show that the modified t-statistics and shrunken centroids employed by PAM tend to increase misclassification error when compared with their simpler counterparts."
・Problem: "For example, with unlimited resources, we may wish to use all relevant genes in the classifier; although, we may be able to find a subset of genes that classify just as well as (or even better than) the complete set. In other settings, it may be necessary to make tradeoffs between accuracy and practicality."
・Method: "I present here an alternative LDA-based classifier that I call ClaNC, for Classification to Nearest Centroids. ClaNC (1) does not shrink centroids, (2) uses unmodified t-statistics to select genes, (3) carries out class-specific feature selection, and (4) allows each gene to be active in at most one class."
・Result: "LDA-based classifiers that are even simpler than PAM can perform very well."
・Outlook: "I intend to perform a more thorough investigation of shrinkage for classification in future work."
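
To make the four ClaNC properties quoted above concrete, here is a minimal sketch in plain Python. It is not Dabney's implementation: the function names, the greedy way a gene is assigned to the first class that claims it, the scaling factor `m_k`, and the toy data are all assumptions made for illustration.

```python
import math
import random

def fit_clanc_like(X, y, genes_per_class):
    """X: list of samples (lists of gene values); y: class label per sample."""
    classes = sorted(set(y))
    n, p = len(X), len(X[0])
    overall = [sum(row[g] for row in X) / n for g in range(p)]
    members = {k: [row for row, lab in zip(X, y) if lab == k] for k in classes}
    centroid = {k: [sum(r[g] for r in rows) / len(rows) for g in range(p)]
                for k, rows in members.items()}
    # Pooled within-class standard deviation of each gene.
    sd = []
    for g in range(p):
        ss = sum((row[g] - centroid[lab][g]) ** 2 for row, lab in zip(X, y))
        sd.append(math.sqrt(ss / (n - len(classes))))
    used, own = set(), {}
    for k in classes:
        # Unmodified t-statistic: class mean vs. overall mean (m_k is an assumed scaling).
        m_k = math.sqrt(1.0 / len(members[k]) - 1.0 / n)
        t = [(centroid[k][g] - overall[g]) / (m_k * sd[g]) for g in range(p)]
        ranked = sorted(range(p), key=lambda g: -abs(t[g]))
        # Class-specific selection; a gene already taken by another class is skipped.
        own[k] = set([g for g in ranked if g not in used][:genes_per_class])
        used.update(own[k])
    active = sorted(used)
    # Unshrunken centroids: class mean on the class's own genes, overall mean elsewhere.
    cents = {k: [centroid[k][g] if g in own[k] else overall[g] for g in active]
             for k in classes}
    return {"genes": active, "sd": [sd[g] for g in active], "centroids": cents}

def predict_clanc_like(model, x):
    """Assign x to the class whose centroid is nearest in standardized distance."""
    def dist(c):
        return sum(((x[g] - cg) / s) ** 2
                   for g, cg, s in zip(model["genes"], c, model["sd"]))
    return min(model["centroids"], key=lambda k: dist(model["centroids"][k]))

# Toy data (made up): 3 classes x 10 samples x 30 genes,
# with 5 genes shifted upward in each class.
rng = random.Random(0)
X, y = [], []
for k in range(3):
    for _ in range(10):
        row = [rng.gauss(0.0, 1.0) for _ in range(30)]
        for g in range(5 * k, 5 * k + 5):
            row[g] += 6.0
        X.append(row)
        y.append(k)
model = fit_clanc_like(X, y, genes_per_class=3)
predictions = [predict_clanc_like(model, row) for row in X]
```

The predictions here are made on the training samples only to show the mechanics; the notes above report error rates from 5-fold cross-validation, which this sketch does not do.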

[Paper] Bloch, 2002, Nonlinear correlation for the analy~

2007-12-21 08:07:02 | Paper notes
Karen M. Bloch, Gonzalo R. Arce
Nonlinear Correlation for the Analysis of Gene Expression Data
Intelligent Systems for Molecular Biology, Edmonton, Aug. 2002.
[PDF][Web Site]

・For hierarchical clustering in differential gene expression analysis, shows that a linear measure (Pearson correlation coefficient) works well when the data follow a Gaussian distribution, whereas a nonlinear measure (median correlation) performs better when the data are non-Gaussian or contain noise.
・Data: Yeast, 4 classes, 65 genes, 79 time points [Eisen]

・Advantage: "The benefit of the median correlation approach is that the correlations can capture both the linear and nonlinear components of the gene expression patterns."

[Paper] Wang, 2005, HykGene: a hybrid approach for select~

2007-12-14 08:05:25 | Paper notes
Yuhang Wang, Fillia S.Makedon, James C.Ford and Justin Pearlman
HykGene: a hybrid approach for selecting marker genes for phenotype classification using microarray gene expression data
Bioinformatics 2005 21(8):1530-1537
[PDF][Web Site]

・Proposes HykGene (a hybrid approach for selecting marker genes), a gene selection method based on microarray data.
・Data
1. ALL/AML leukemia [Golub]
2. MLL leukemia [Armstrong]
3. Colon tumor [Alon]
・Data processing steps
1. Gene ranking
2. Hierarchical clustering
3. Reduce gene redundancy by collapsing clusters
4. Classification
・Ranking methods (criteria)
1. Relief-F
2. Information Gain
3. χ2-statistic
・Classification methods
1. k-nearest neighbor (k-NN)
2. Support vector machine (SVM)
3. C4.5 decision tree
4. Naive Bayes (NB)
・Evaluation of classification results: LOOCV
・Comparison methods
1. (Raw data used as is, with no processing)
2. SOM
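
The first three processing steps listed above (rank, cluster, collapse) can be sketched roughly as follows. This is an illustration, not the HykGene code. The paper chooses the number of clusters with a sweep-line algorithm, whereas this sketch cuts a single-linkage clustering at a fixed correlation threshold. The names `select_markers`, `top_n` and `min_corr`, the use of Pearson correlation as the similarity, and the toy profiles are assumptions.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def select_markers(genes, scores, top_n, min_corr):
    """genes: name -> expression profile; scores: name -> ranking score (higher is better)."""
    # Step 1: keep the top_n genes by ranking score.
    top = sorted(genes, key=lambda g: -scores[g])[:top_n]
    # Step 2: single-linkage clusters at a fixed cut, i.e. connected components
    # of the graph linking genes whose correlation is at least min_corr.
    parent = {g: g for g in top}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for i, a in enumerate(top):
        for b in top[i + 1:]:
            if pearson(genes[a], genes[b]) >= min_corr:
                parent[find(a)] = find(b)
    # Step 3: collapse each cluster to its best-ranked gene (top is sorted best-first).
    best = {}
    for g in top:
        best.setdefault(find(g), g)
    return sorted(best.values(), key=lambda g: -scores[g])

# Toy profiles (made up): g2 is nearly a scaled copy of g1.
genes = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.1, 3.9, 6.2, 8.0],
    "g3": [4.0, 1.0, 3.0, 2.0],
    "g4": [0.0, 0.0, 1.0, 0.0],
}
scores = {"g1": 0.9, "g2": 0.8, "g3": 0.7, "g4": 0.1}
markers = select_markers(genes, scores, top_n=3, min_corr=0.95)
```

In this toy run g4 is dropped by the ranking step and g2 is collapsed into g1, leaving g1 and g3 as markers to pass to the classifier in step 4.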

・Method: "In this approach, we first applied feature filtering algorithms to select a set of top-ranked genes, and then applied hierarchical clustering on these genes to generate a dendrogram. Finally, the dendrogram was analyzed by a sweep-line algorithm and marker genes are selected by collapsing dense clusters."
・Data characteristics: "Classification using gene expression data poses a major challenge because of the following characteristics:
・M >> N. For typical datasets, M is in the range of 2000-30 000, while N is in the range of 40-200.
・Most features (genes) are not related to the given phenotype classification problem."

・History of ranking methods: "These gene ranking methods have been based on t-statistic (Golub et al., 1999), information gain (Su et al., 2003; Liu et al., 2002; Li et al., 2004), χ2-statistic (Liu et al., 2002), the threshold number of misclassification (TNoM) score (Ben-Dor et al., 2000) and concatenation of several feature filtering algorithms (Xing et al., 2001)."
・Distinguishing features: "Our approach is different from the previous pre-filtering approaches in that:
・We apply gene ranking methods first.
・We determine the best number of clusters systematically."

・Future outlook: "We are currently investigating alternative approaches that use Gene Ontology to guide this selection process."

・The line of argument is clear and easy to follow.

[Paper] Chilingaryan, 2002, Multivariate approach for ~

2007-12-10 08:20:35 | Paper notes
A.Chilingaryan, N.Gevorgyan, A.Vardanyan, D.Jones and A.Szabo
Multivariate approach for selecting sets of differentially expressed genes
Mathematical Biosciences Volume 176, Issue 1, March 2002, Pages 59-69
[PDF][Web Site]

・Proposes a multivariate method for gene selection (multi-start random search method with early stopping (MRSES)). Genes are selected using the Mahalanobis distance as the criterion.
・Data
1. Simulated data
2. Two colon cancer cell lines (HT29, HCT116)

・Problem: "Currently used approaches ignore the multidimensional structure of the data. However it is well known that correlation among covariates can enhance the ability to detect less pronounced differences."
・Method: "We use the Mahalanobis distance between vectors of gene expressions as a criterion for simultaneously comparing a set of genes and develop an algorithm for maximizing it."
・"It is well known that genes do not work independently; activation of one gene usually triggers changes in the expression level of other genes, that is genes are involved in so-called pathways."
・Distinguishing feature: "In this paper we develop an approach that uses both the mean expression levels and the covariance structure of the data."
・Problem: "First, the number of possible gene combinations is enormously large; therefore it is impossible to compare all gene subsets and find the optimal one. On the other hand, if a global optimum could be found, it would be overly training sample specific, because of the phenomenon of overfitting."
・Problem: "Unfortunately, the problem of developing a simulation model for microarray data has been largely ignored, we are aware of only one attempt [19] in which a highly specific parametric model assuming independent genes is used."
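
As a small worked example of the criterion quoted above, the sketch below computes the squared Mahalanobis distance for a pair of genes and shows how correlation between the two genes can amplify a difference that appears in only one of them. It covers only the distance itself, not the MRSES search. The function name and the numbers are assumptions chosen for illustration.

```python
def mahalanobis_sq(diff, cov):
    """Squared Mahalanobis distance for a pair of genes.

    diff: (d1, d2), difference between the two groups' mean expression.
    cov:  2x2 covariance matrix shared by both groups.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    d1, d2 = diff
    # diff^T * inverse(cov) * diff, with the 2x2 inverse written out by hand.
    return (d * d1 * d1 - (b + c) * d1 * d2 + a * d2 * d2) / det

# Hypothetical case: only gene 1 differs between the groups (by one SD).
shift = (1.0, 0.0)
independent = mahalanobis_sq(shift, [[1.0, 0.0], [0.0, 1.0]])
correlated = mahalanobis_sq(shift, [[1.0, 0.9], [0.9, 1.0]])
```

With uncorrelated genes the squared distance is just the squared shift (1.0). With a correlation of 0.9 it becomes 1 / (1 - 0.9^2), a bit over 5, even though gene 2 has no mean difference at all. A gene-by-gene test would not see this.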

[Paper] Thomas, 2001, An efficient and robust statistica~

2007-11-30 08:06:20 | Paper notes
Jeffrey G.Thomas, James M.Olson, Stephen J.Tapscott, and Lue Ping Zhao
An Efficient and Robust Statistical Modeling Approach to Discover Differentially Expressed Genes Using Genomic Expression Profiles
Genome Research Vol.11, Issue 7, 1227-1236, July 2001
[PDF]

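・Proposes a gene selection method based on microarray expression data. A supervised learning method. Genes are ranked by Z-score.
・Data: Leukemia data (27 ALL/11 AML) [Golub]
・Results: at a 1% significance level, 141 genes judged to differ in expression between the groups were extracted (items 1 and 2).
1. Genes more highly expressed in AML than in ALL
2. Genes more highly expressed in ALL than in AML
3. Genes related to TPO (thrombopoietin) among the AML samples
4. Genes whose expression differs by prognosis (alive/dead) among the AML samples
・Comparison methods: t-tests, Wilcoxon rank sum statistics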

・Problem: "Cluster analysis is not a sensitive method for this type of study because it focuses on group similarities, not differences within each individual gene. Furthermore, clustering algorithms such as those listed above are also unable to take advantage of preexisting knowledge of the data, such as the sample groupings."
・Property: "This methodology makes no distributional assumptions about the data and accounts for high false-positive error rate resulting from multiple comparisons."
・Method: "In this work, we calculated the significance value (i.e., P-value) for each probe set using a modified Bonferroni's correction as proposed by Hochberg (Hochberg 1988) (see Methods for details)."
・Strength: "In contrast, the estimating equation technique we used to calculate Z-scores does not require any distributional assumptions or homogeneity of variances (see Methods for details)."
・Method: "We propose a regression model for the expression level of the jth gene in the kth sample:
 $Y_{jk} = \delta_k + \lambda_k (a_j + b_j x_k) + \varepsilon_{jk}$ (1)
in which $(a_j, b_j)$ are gene-specific regression coefficients, $(\delta_k, \lambda_k)$ are the sample-specific additive and multiplicative heterogeneity factors, respectively, and $\varepsilon_{jk}$ is a random variable reflecting variation due to sources other than the one identified by the known covariate and the systematic heterogeneity between samples."
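
The Hochberg (1988) correction mentioned in the method notes above is a step-up procedure, and a minimal sketch may help show how it differs from plain Bonferroni. The function name and the example p-values are made up for illustration, and nothing here reproduces the paper's Z-score computation.

```python
def hochberg_reject(pvalues, alpha):
    """Hochberg step-up procedure: returns one True/False per hypothesis."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # The rank-th smallest p-value is compared with alpha / (m - rank + 1).
        if pvalues[i] <= alpha / (m - rank + 1):
            k_max = rank
    # Reject every hypothesis up to the largest rank that passed.
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected

# Hypothetical p-values for four probe sets.
example = hochberg_reject([0.001, 0.004, 0.03, 0.2], alpha=0.05)

# Two p-values just under alpha: Bonferroni (threshold alpha / 2) rejects
# neither, while the step-up rule rejects both.
borderline = hochberg_reject([0.04, 0.045], alpha=0.05)
```

The second case shows why the procedure is called a modified Bonferroni correction: it is never more conservative than Bonferroni, and it can reject more when several p-values fall below alpha together.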

[Paper] Li, 2004, Gene mining: a novel and powerful ense~

2007-11-22 18:22:41 | Paper notes
Xia Li, Shaoqi Rao, Yadong Wang and Binsheng Gong
Gene mining: a novel and powerful ensemble decision approach to hunting for disease genes using microarray expression profiling
Nucleic Acids Research, 2004, Vol.32, No.9 2685-2694
[PDF]

・Proposes a gene selection method based on an ensemble decision approach. An improved version of the recursive partition tree.
・Data
1. Colon data [Alon]
2. Leukemia data [Golub]
・Experiments
1. Disease-relevant genes → extracted 23 genes for Colon and 20 genes for Leukemia.
2. Classification of biological types → performance compared using the five classification methods below.
・Classification methods
1. SVM with five different kernel functions
2. Fisher linear discriminant
3. Logistic regression
4. K-nearest neighbors
5. Mahalanobis distance

・Problem: "However, wrappers and embedded algorithms are often not clearly distinguished, with only slight differences in the feature searching strategies."
・Method: "Instead of simply maximizing prediction accuracy, we identify genes that are mostly relevant to a disease itself."
・Overview: "For this purpose, we introduce a disease-relevance concept and define a relevance intensity (precise mathematical descriptions will be given later) to distinguish between disease-relevant genes and noise features."
・Method: "The proposed ensemble selection is a supervised learning approach based on a recursive partition tree."
・"In other words, the genes from the prediction-driven extraction of relevant genes are important for prediction but may not be so for deciphering the complex underlying genetic architecture of the disease itself."
・Outlook: "After this work, the next step is to address a more involved biological question: how do these genes act or interact to lead to the manifestation of a disease phenotype, or so-called target-driven gene networking, which is currently under investigation."

[Paper] Bijlani, 2003, Prediction of biologically signif~

2007-11-15 19:49:48 | Paper notes
Rahul Bijlani, Yinhe Cheng, David A. Pearce, Andrew I. Brooks and Mitsunori Ogihara
Prediction of biologically significant components from microarray data: Independently Consistent Expression Discriminator (ICED)
Bioinformatics Vol.19 no.1 2003, Pages 62-70
[PDF][Web Site]

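・Proposes the classification method "ICED".
・Processing steps
1. Normalization: mean 0, variance 1
2. Gene weight generation: the authors' own weighting formula
3. Voting methodology: the formula proposed by Golub
4. Vote-based classifier: parameters are tuned to the data
・Data
1. Batten disease [Chattopadhyay]
2. Leukemia (AML/ALL) [Golub]
・Comparison methods
1. SVMs [Furey]
2. Neighborhood Analysis [Golub]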
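
For reference, the weighted-voting scheme of Golub et al. that step 3 refers to can be sketched as below. The notes do not give ICED's own gene-weight formula (step 2), so this sketch uses Golub's signal-to-noise weight as a stand-in. The function name, the class labels "A" and "B", and the toy data are assumptions.

```python
import statistics

def golub_vote(train_a, train_b, sample):
    """Weighted voting between classes "A" and "B".

    train_a, train_b: training samples of each class (lists of gene values).
    sample: the new sample to classify, same gene order.
    Returns (winning class, prediction strength in [0, 1]).
    """
    votes_a = votes_b = 0.0
    for g in range(len(sample)):
        xa = [s[g] for s in train_a]
        xb = [s[g] for s in train_b]
        mu_a, mu_b = statistics.mean(xa), statistics.mean(xb)
        # Signal-to-noise weight; positive when the gene is higher in class A.
        weight = (mu_a - mu_b) / (statistics.stdev(xa) + statistics.stdev(xb))
        # Vote = weight times distance of the sample from the class midpoint.
        vote = weight * (sample[g] - (mu_a + mu_b) / 2)
        if vote > 0:
            votes_a += vote
        else:
            votes_b -= vote
    winner = "A" if votes_a >= votes_b else "B"
    total = votes_a + votes_b
    strength = abs(votes_a - votes_b) / total if total else 0.0
    return winner, strength

# Toy data (made up): gene 0 is high in class A, gene 1 is high in class B.
train_a = [[5.0, 1.0], [6.0, 2.0], [7.0, 1.5]]
train_b = [[1.0, 5.0], [2.0, 6.0], [1.5, 7.0]]
winner, strength = golub_vote(train_a, train_b, [6.0, 1.0])
```

Here both genes vote for class A, so the prediction strength is 1.0. A sample near the midpoint of both genes would give a strength close to 0, which is the kind of case a vote-based classifier (step 4) would treat as uncertain.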

・Method: "The four components of ICED include (i) normalization of raw data; (ii) assignment of weights to genes from both classes; (iii) counting of votes to determine optimal number of predictor genes for class distinction; (iv) calculation of prediction strengths for classification results."
・Strength: "The search criteria employed by ICED is designed to identify not only genes that are consistently expressed at one level in one class and at a consistently different level in another class but identify genes that are variable in one class and consistent in another."

・The content is hard to grasp.

[Paper] Ben-Dor, 2002, Overabundance Analysis and Class~

2007-11-06 22:00:38 | Paper notes
Amir Ben-Dor, Nir Friedman, Zohar Yakhini
Overabundance Analysis and Class Discovery in Gene Expression Data
2002
[PDF][Web Site]

・Proposes "Overabundance analysis", a method for classifying samples.
・Data
1. Leukemia [Golub]
2. Colon [Alon]
3. Lymphoma [Alizadeh]
・Gene ranking methods
1. TNoM (Threshold Number of Misclassification) [Ben-Dor]
2. INFO (Conditional Entropy) [Ben-Dor]
・Evaluation of classification: LOOCV
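
The TNoM score listed above can be illustrated with a short sketch: for one gene and a two-class labelling of the samples, it is the smallest number of errors any single-threshold rule can make. The function name and the toy values are assumptions, and this is not the authors' code.

```python
def tnom(values, labels):
    """Threshold Number of Misclassification for one gene and two classes."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("TNoM is defined here for exactly two classes")
    pos = classes[1]
    pairs = sorted(zip(values, labels))
    n_pos = sum(1 for _, lab in pairs if lab == pos)
    n_neg = len(pairs) - n_pos
    # A threshold outside the data range labels every sample the same way.
    best = min(n_pos, n_neg)
    left_pos = left_neg = 0
    for i, (v, lab) in enumerate(pairs):
        if lab == pos:
            left_pos += 1
        else:
            left_neg += 1
        # A threshold can only be placed between two distinct values.
        if i + 1 < len(pairs) and pairs[i + 1][0] == v:
            continue
        errors_low_is_neg = left_pos + (n_neg - left_neg)
        errors_low_is_pos = left_neg + (n_pos - left_pos)
        best = min(best, errors_low_is_neg, errors_low_is_pos)
    return best

# Toy genes (made up): one separates the classes perfectly, one does not.
levels = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
tnom_clean = tnom(levels, ["A", "A", "A", "B", "B", "B"])
tnom_mixed = tnom(levels, ["A", "B", "A", "B", "A", "B"])
```

A score of 0 means the gene alone separates the two classes. Overabundance analysis, as the quote below describes it, asks whether a putative partition of the samples has unusually many such low-scoring genes.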

・Property: "Overabundance analysis allows us to compare the support of different partitions of the same data set."
・Method: "In the current work we take a direct unsupervised approach to class discovery. The process we develop consists of two components. We start by defining a figure of merit to putative partitions of the set of samples. We are guided by the fact that biologically meaningful partitions of the samples are typically manifested by a large overabundance of genes that are differentially expressed in the different sample classes."

・Way over my head. I could not understand the content.

[Paper] Pavlidis, 2006, Individualized markers optimize ~

2007-11-01 18:03:29 | Paper notes
Pavlos Pavlidis and Panayiota Poirazi
Individualized markers optimize class prediction of microarray data
BMC Bioinformatics 2006, 7:345
[PDF][Web Site]

・Proposes a method for selecting genes that serve as markers for classification.
・Data
(Data Set), (Selection/Classification)
1. AML/ALL, S2N (Signal to Noise)/NA (Neighborhood Analysis)
2. Breast Cancer, CC (Correlation Coefficient)/FA (Factor Analysis)
3. Lung Cancer, 2-tail (2-Tail Student Test)/ER (Expression Ratio)
4. AML/MLL/ALL, CC/K-NN (K-Nearest Neighbors)
5. CNS, S2N/K-NN
6. Lymph Node, CC/FA

・Problem: "Despite the evident dissimilarity in various characteristics of biological samples belonging to the same category, most of the marker-selection and classification methods do not consider this variability."
・"Among these, filter methods in which the selection is independent from the optimization criteria of the classifier are most frequently used. Such methods have the advantage of being cost-effective and easy to implement which make them very attractive for microarray data experiments where the set of features is in the order of thousands."
・"In fact, a recent publication [29] showed that a yeast gene expression dataset is better modeled by an alpha distribution (a = 1.3)."
・Strength: "Unlike existing filter feature selection techniques, this method applies no restrictions to the mean expression values of informative genes between the different classes."
・What CERs are: "CERs (Consistent Expression Regions) are defined as the intervals enclosing the expression (sorted in ascending order) of a given gene in a significant number of training samples which belong to the same category."

・The processing does not look particularly difficult, yet I could not understand the method at all.

[Paper] The Gene Ontology Consortium, 2001, Creating the~

2007-10-24 20:13:05 | Paper notes
The Gene Ontology Consortium
Creating the Gene Ontology Resource: Design and Implementation
GENOME RESEARCH Vol.11, Issue 8, 1425-1433, August 2001
[PDF][Web Site]

・An overview of GO (Gene Ontology).

・What GO is: "GO endeavors to develop cross-species biological vocabularies that are used by multiple databases to annotate genes and gene products in a consistent way. Three extensive ontologies are under development, for (1) molecular function, (2) biological process, and (3) cellular component."
・"Briefly, molecular function describes what a gene product does at the biochemical level. Biological process describes a broad biological objective. Cellular component describes the location of a gene product, within cellular structures and within macromolecular complexes."

・Integrating information from databases at various institutions into one large, systematic gene database makes sense as far as it goes, but the crucial part, the relationships between the terms (the ontology), I still do not really understand. Without actually using it I cannot form a concrete picture; for now it is all gibberish to me.