Generally, the cluster analysis is performed based on input parameters from the user.
In the preferred embodiment, the user may specify a database 116, a table within the database 116, and a list of attributes from the table that will be analyzed for clusters.
The user also may identify a type of cluster analysis (e.g., K-Means or Gaussian Mixture),
the number of clusters to be searched for within the data, a threshold difference in a Log Likelihood value below which the EM iterations will stop, and a maximum number of iterations independent of the change in the Log Likelihood value.
In this embodiment, the Log Likelihood defines the likelihood that a given clustering model could have generated the dataset, i.e., it describes the adequacy of a clustering model fit under the assumptions of a given probabilistic model.
The output comprises a table of values of cluster means, variances and prior probabilities (i.e., the relative number of rows assigned to clusters). A measure of success of cluster identification is provided as the average of all within-cluster variances and a Log Likelihood sum on row-cluster probabilities.
After the user has chosen the number of clusters desired (N), an initialization step associates each row of the table with one of N clusters. This may be accomplished randomly using a sampling function, non-randomly using a row modulus function, or using some other similar function.
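The parameters described above (number of clusters, Log Likelihood threshold, maximum iterations) and the row modulus initialization can be illustrated with a minimal sketch of EM for a one-dimensional Gaussian Mixture. The function name `fit_mixture`, its parameter names, and the variance floor `min_var` are assumptions made for this illustration and are not taken from the embodiment, which operates on database tables rather than in-memory lists.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fit_mixture(values, n_clusters, ll_threshold=1e-6, max_iter=100, min_var=1e-6):
    """EM for a 1-D Gaussian mixture (illustrative only; requires max_iter >= 1)."""
    n = len(values)
    # Initialization: row r is assigned to cluster (r mod N), stored as
    # one-hot responsibilities so the first M-step can use them directly.
    resp = [[1.0 if r % n_clusters == k else 0.0 for k in range(n_clusters)]
            for r in range(n)]
    prev_ll = None
    for iteration in range(1, max_iter + 1):
        # M-step: re-estimate prior, mean and variance of each cluster.
        priors, means, variances = [], [], []
        for k in range(n_clusters):
            weight = sum(row[k] for row in resp)
            mean = sum(row[k] * x for row, x in zip(resp, values)) / weight
            var = sum(row[k] * (x - mean) ** 2 for row, x in zip(resp, values)) / weight
            priors.append(weight / n)
            means.append(mean)
            variances.append(max(var, min_var))  # assumed floor to avoid zero variance
        # E-step: row-cluster probabilities and the Log Likelihood of the data.
        ll = 0.0
        for r, x in enumerate(values):
            dens = [priors[k] * gaussian_pdf(x, means[k], variances[k])
                    for k in range(n_clusters)]
            total = sum(dens)
            ll += math.log(total)
            resp[r] = [d / total for d in dens]
        # Stop when the change in Log Likelihood falls below the threshold;
        # otherwise the loop ends after max_iter iterations regardless.
        if prev_ll is not None and abs(ll - prev_ll) < ll_threshold:
            break
        prev_ll = ll
    return {"means": means, "variances": variances, "priors": priors,
            "log_likelihood": ll, "iterations": iteration}

# Rows alternate between two groups, so the modulus initialization
# happens to separate them on the first pass.
rows = [1.0, 5.0, 1.2, 5.2, 0.8, 4.8, 1.1, 5.1, 0.9, 4.9]
model = fit_mixture(rows, n_clusters=2)
```

The returned dictionary mirrors the output table described above: cluster means, variances and prior probabilities, together with the final Log Likelihood. The sketch does not guard against a cluster losing all of its weight or against numerical underflow on widely separated data.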
[00276] Identification of the dynamics of the emergence of separate subpopulations within the trial sample population can be performed with two complementary clustering statistics, the Cubic Clustering Criterion (CCC) and the Pseudo-F, which help the end-user establish the number of clusters emerging within the geometry, as demonstrated in Fig. 21.
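As an illustration of the second statistic, the Pseudo-F is commonly computed as the ratio of between-cluster to within-cluster sums of squares, each divided by its degrees of freedom: F = [B / (k - 1)] / [W / (n - k)]. The sketch below assumes points are given as tuples with one label per point; the function name `pseudo_f` is hypothetical, and the CCC is not shown.

```python
def pseudo_f(points, labels):
    """Pseudo-F statistic: (B / (k - 1)) / (W / (n - k))."""
    n = len(points)
    dims = len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 clusters and more points than clusters")

    def centroid(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dims)]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    between = 0.0  # B: size-weighted squared distance of centroids to the grand mean
    within = 0.0   # W: squared distance of each point to its own centroid
    for rows in clusters.values():
        c = centroid(rows)
        between += len(rows) * sqdist(c, overall)
        within += sum(sqdist(r, c) for r in rows)
    if within == 0.0:
        return float("inf")  # perfectly tight clusters
    return (between / (k - 1)) / (within / (n - k))

pts = [(0.0,), (2.0,), (10.0,), (12.0,)]
good = pseudo_f(pts, [0, 0, 1, 1])  # well-separated grouping
poor = pseudo_f(pts, [0, 1, 0, 1])  # grouping that mixes the two ends
```

A larger value indicates tighter, better-separated clusters, so comparing the statistic across candidate cluster counts is one way to choose among them.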
To detect anomalous user activity, the system may also separately cluster users for each domain to associate single-domain cluster indices with each user, and then cluster the users according to the single-domain cluster indices.
For example, the system may cluster users according to the average number of files accessed daily (or within any predetermined time period), and cluster users according to an average number of e-mails sent and received daily (or within any predetermined time period).
The system associates each user with a single-domain cluster number for the e-mail domain, and associates each user with a single-domain cluster number for the file domain.
The system then clusters the users according to the single-domain cluster numbers from the different domains, thereby generating a discrete distribution for each user.
The system can then compare a user's distribution of single-domain clusters with others that have roles similar to the user to detect anomalies.
Furthermore, the system can compute an anomaly score for each user for each domain, and then compute an aggregate anomaly score by weighting the separate anomaly scores for the domains.
In an implementation, the system may utilize a leave-1-out technique to identify anomalous user activity.
The system analyzes a specific user by fixing the domain values of all domains except for one.
The basic principle is that normal individuals should be predictable. The system attempts to predict the cluster number of the one domain that was not fixed, and may identify the user activity as anomalous if the prediction is incorrect.
For example, the system may set the domain values (e.g., cluster numbers) for a user such that logon=1, device=2, file=3, and e-mail=1. The system then attempts to predict a cluster number for the HTTP domain.
If the prediction is incorrect, the system may label the user activity as anomalous.
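The leave-1-out check can be sketched as follows. The text does not specify how the prediction is made, so this sketch assumes a simple majority vote among the other users whose cluster numbers match on every fixed domain; the function names and the sample users are hypothetical.

```python
from collections import Counter

def predict_held_out(users, target, held_out):
    """Predict the held-out domain's cluster number for `target`.

    Assumed predictor: majority vote among other users that agree with the
    target on every fixed (non-held-out) domain. Returns None if no user matches.
    """
    fixed = {d: v for d, v in users[target].items() if d != held_out}
    votes = Counter(
        profile[held_out]
        for name, profile in users.items()
        if name != target and all(profile[d] == v for d, v in fixed.items())
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]

def is_anomalous(users, target, held_out):
    """Flag the user when a prediction exists and it misses the actual value."""
    predicted = predict_held_out(users, target, held_out)
    return predicted is not None and predicted != users[target][held_out]

# All four users share logon=1, device=2, file=3, e-mail=1;
# only the HTTP cluster number differs for dave.
users = {
    "alice": {"logon": 1, "device": 2, "file": 3, "email": 1, "http": 2},
    "bob":   {"logon": 1, "device": 2, "file": 3, "email": 1, "http": 2},
    "carol": {"logon": 1, "device": 2, "file": 3, "email": 1, "http": 2},
    "dave":  {"logon": 1, "device": 2, "file": 3, "email": 1, "http": 4},
}
flag_dave = is_anomalous(users, "dave", "http")    # prediction 2, actual 4
flag_alice = is_anomalous(users, "alice", "http")  # prediction 2, actual 2
```

Any classifier trained on the fixed domains could stand in for the majority vote; the vote is used here only because it keeps the example self-contained.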
The system may compute anomaly scores for each domain and combine the anomaly scores by weighting the individual domains.
The anomaly score for a domain d and user i is
[mathematical formula]

where N is the total number of users and j indexes the users, with j = 1, ..., N.
The system may adjust the prediction miss value m(d,i) for each domain d to reflect the weighted value of the domain.
The system may then compute an aggregate anomaly score s(i) for user i as s(i) = Σ_d a(d, i), i.e., the sum of a(d, i) over all domains d.
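A minimal sketch of the aggregation step follows. The per-domain scores a(d, i) are taken as given inputs, and the optional per-domain weights are an assumption about how the weighting might be applied; with no weights the function reduces to the plain sum over domains. The name `aggregate_score` and the sample values are hypothetical.

```python
def aggregate_score(domain_scores, weights=None):
    """Aggregate anomaly score s(i) for one user.

    domain_scores maps each domain d to a(d, i). weights (assumed, optional)
    maps a domain to its multiplier; domains without an entry get weight 1.0.
    """
    weights = weights or {}
    return sum(weights.get(d, 1.0) * a for d, a in domain_scores.items())

scores = {"logon": 0.1, "http": 0.7, "email": 0.2}  # illustrative values only
plain = aggregate_score(scores)                     # unweighted sum over domains
weighted = aggregate_score(scores, {"http": 2.0})   # HTTP domain counted twice
```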

Others have attempted to analyze expression heterogeneity using different clustering methods and alternative multiplexing schemes (Gerdes et al. 2013, Qian, et al. 2010).
The hierarchical clustering approach requires significant assumptions to be made.
The distance between points that determines where to draw the boundary to form a new cluster is a key parameter for hierarchical clustering algorithms, and it must be known in advance.
Alternatively, some hierarchical algorithms (such as Ward's method (Ward 1963)) require entry of the number of clusters as a parameter.
However, cut-off thresholds (distance) and number of expected clusters are both parameters that are often unknown.
Additionally, some algorithms enforce assumptions about even cluster size (e.g. k-means), distance between points that are members of different clusters (hierarchical clustering) or assumptions about the expected number of clusters to be found (hierarchical clustering, k-means).
Though widely used, hierarchical methods are better suited to variables measured on a discontinuous scale (e.g. +, ++, +++, ++++).
For this reason, hierarchical clustering algorithms are not ideal for the requirements of expression heterogeneity analysis.
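The sensitivity to the distance cut-off can be seen in a small sketch. It assumes single-linkage clustering of one-dimensional values, where cutting the hierarchy at a given distance is equivalent to joining any two points that lie within that distance of each other; the function name and the sample values are hypothetical.

```python
def count_clusters(values, cutoff):
    """Number of single-linkage clusters when the hierarchy is cut at `cutoff`."""
    parent = list(range(len(values)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Join every pair of points whose distance does not exceed the cut-off.
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if abs(values[i] - values[j]) <= cutoff:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(values))})

levels = [1.0, 1.5, 2.5, 6.0, 6.4, 9.0]  # illustrative measurements
counts = {c: count_clusters(levels, c) for c in (0.6, 1.2, 3.0, 4.0)}
```

On this sample the same data yields four, three, two, or one cluster depending solely on the cut-off, which is the parameter the analyst typically does not know in advance.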
Alternative density-based tools such as FLOCK (Qian, et al. 2010) are limited in that the parameters for the size of the hyper-regions used to calculate density, and for the density cut-off thresholds, must be estimated and entered into the algorithm to enable cluster determination.
(assuming i = 1, ..., N, where N = number of claims, and d = 1, ..., D, where D = number of clusters)