On 2014 Oct 07, Igor Zwir commented:
Rationale of the method applied in Arnedo et al.
Clustering methods have been applied for a long time in pattern recognition problems across a variety of fields, and are now embedded in most of our devices, from a simple refrigerator to a Boeing engine. In the biomedical sciences there is a tendency to assume that only a few classes of clustering exist (e.g., hierarchical clustering or k-means). However, data mining, knowledge discovery, and machine learning are fields of computer science with a rich diversity of both theoretical and empirical methods. Much of the power of these methods comes from fusing them, taking advantage of the strengths of each.
The methodology and framework proposed in Arnedo et al. encodes a generalized factorization method (GFM) that was designed to identify structural patterns or clusters (substructures) that characterize complex objects (1-8) embedded in databases (6, 7, 9, 10). Problems of this complexity cannot be solved by a single clustering method; they require combining the advantages of many, as implemented in the GFM and summarized below.
First, we utilized the notion of model-based clusters (5, 11, 12), where the centroid/prototype of a cluster is not a point but a model (e.g., line segments or ellipsoids (5, 13, 14); see the C-varieties algorithm (12)) or a family of models (8), defined as a relation between cohesive subsets of points that meet certain constraints (e.g., relational clustering (12), undirected graphs (12)). When the underlying equations of the models are unknown, they can be estimated from the data by identifying relations between subsets of observations that show similar patterns under a specific subset of relevant descriptive features (attributes) (4, 5, 15). Here we proposed that, in a purely data-driven analysis, these unknown models (grey boxes) can be learned as local relationships between subsets of features and observations, termed biclusters. Biclusters are often represented as sub-matrices (4, 5, 15). The other models proposed in this work (black boxes) are those encoded by a particular method devoted to recognizing certain classes of patterns.
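The idea of a bicluster as a local model can be made concrete with a minimal sketch. The names (`Bicluster`, `submatrix`) and the toy data below are illustrative only, not the authors' code: a bicluster simply pairs a subset of observations (rows, e.g., subjects) with the subset of features (columns, e.g., voxels) under which they cohere.

```python
# A bicluster is a local model: a subset of rows together with the
# subset of columns under which those rows show a cohesive pattern.
# All names and data here are hypothetical, for illustration only.
from typing import List


class Bicluster:
    def __init__(self, rows: List[int], cols: List[int]):
        self.rows = rows  # indices of observations (e.g., subjects)
        self.cols = cols  # indices of features (e.g., voxels)


def submatrix(data: List[List[float]], bc: Bicluster) -> List[List[float]]:
    """Extract the sub-matrix that represents the bicluster."""
    return [[data[i][j] for j in bc.cols] for i in bc.rows]


data = [
    [5.0, 5.1, 0.2, 0.1],
    [4.9, 5.0, 0.3, 0.0],
    [0.1, 0.2, 7.0, 7.1],
]
bc = Bicluster(rows=[0, 1], cols=[0, 1])
print(submatrix(data, bc))  # rows 0-1 cohere on features 0-1 only
```

Note that rows 0 and 1 only resemble each other under features 0 and 1; a global model over all four features would dilute that relationship.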
Second, to identify a family of models (8), or biclusters, we used the NMF factorization algorithms (16) (tensor analysis), which can uncover cohesive data represented as sub-matrices (factors). When sorted and thresholded, each sub-matrix represents a bicluster. Notably, biclustering produces several local models of the data (17), each capturing intrinsic relationships between subsets of observations that share subsets of features rather than all of them. Diverse biclusters, rather than uniform data explanations, provide a better understanding of the described object, an understanding that is diluted when all features are used together in a single global model of the data, as clustering methods typically do (17).
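The sorting/thresholding step can be sketched as follows, assuming the factor matrices W (observations × factors) and H (factors × features) have already been produced by NMF. The matrices and the threshold `tau` below are made up for illustration; they are not from the paper.

```python
# Sketch: turning NMF factors into biclusters by thresholding.
# W (n x k) holds the loading of each observation on each factor;
# H (k x m) holds the loading of each feature on each factor.
# W, H, and tau below are illustrative values, not real NMF output.

def factor_to_bicluster(W, H, r, tau):
    """Rows/columns whose loading on factor r exceeds tau form a bicluster."""
    rows = [i for i, wi in enumerate(W) if wi[r] > tau]
    cols = [j for j, hj in enumerate(H[r]) if hj > tau]
    return rows, cols


W = [[0.9, 0.0],
     [0.8, 0.1],
     [0.0, 0.7]]          # 3 observations, 2 factors
H = [[0.9, 0.8, 0.0],     # factor 0 loads on features 0 and 1
     [0.1, 0.0, 0.9]]     # factor 1 loads on feature 2
for r in range(2):
    print(factor_to_bicluster(W, H, r, tau=0.5))
```

Each factor yields one local model: a pair (subset of observations, subset of features), i.e., one bicluster per factor.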
Third, we proposed in this work that biclusters can be fuzzy (overlapping; see fuzzy clustering (18)), where voxels and/or subjects can belong to more than one bicluster, as well as possibilistic (non-exhaustive partitions; see possibilistic clustering (12)), where certain voxels and/or subjects may not be assigned to any bicluster, so that outliers can be detected.
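The distinction between fuzzy and possibilistic assignment can be illustrated with a small sketch. The membership values and the two thresholds below are invented for the example; in the actual method the memberships would come from the factorization itself.

```python
# Fuzzy vs. possibilistic assignment, illustrated with made-up numbers.
# Fuzzy: an item joins EVERY bicluster whose membership is high enough,
# so assignments may overlap. Possibilistic: if no membership is high
# enough, the item is assigned nowhere and flagged as an outlier.

def assign(memberships, overlap_tau=0.3, outlier_tau=0.2):
    """Return bicluster indices for one item, or 'outlier' if none qualify."""
    if max(memberships) < outlier_tau:
        return "outlier"                      # possibilistic: unassigned
    return [r for r, u in enumerate(memberships) if u >= overlap_tau]


print(assign([0.6, 0.5, 0.1]))    # belongs to biclusters 0 AND 1 (fuzzy)
print(assign([0.1, 0.05, 0.1]))   # assigned nowhere: an outlier
```

The first item overlaps two biclusters; the second is left out of all of them, which is exactly how outliers surface under a non-exhaustive partition.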
Fourth, we demonstrated that biclusters can be identified without choosing a particular number of clusters a priori (11, 13, 14, 19). Setting this parameter to a small number may generate few but large data partitions, defined here as general biclusters, which may underfit the data. In contrast, a large number of clusters generates many partitions of small size (i.e., more granular partitions), defined here as specific biclusters, which may overfit the data (1, 5, 11, 13, 14, 19-21). Although many different validity indices (12, 18) suggest the best number of clusters for a given dataset, they often produce contradictory results (11, 13, 14, 19). This problem is exacerbated when centroid-based (e.g., k-means, NMF) or distribution-based clustering (e.g., EM) is used, because two successive numbers of clusters may generate completely different cluster arrangements (see (12, 18) for a review).
Instead of fighting the lost battle of uncovering a single optimal number of clusters, biclusters were calculated from partitions generated by every number of clusters between 2 and √n (12), where n is the number of observations (subjects). We postponed selection of the best clusters until all partitions had been examined, and then chose a set of clusters that together provide an optimal description of the sample. These clusters can come from different partitions generated by different numbers of clusters (see consensus clustering (1, 5, 11, 13, 14, 19-21)).
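The sweep over k can be sketched as follows. The 1-D k-means stand-in (`kmeans_1d`) and the toy data are illustrative only, not the clustering used in the paper; the point is the pooling of candidate clusters from every partition with k between 2 and √n, deferring selection to the later consensus step.

```python
# Sketch of the k-sweep: instead of picking one "best" k, generate a
# partition for every k in [2, sqrt(n)] and pool all resulting clusters
# as candidates. kmeans_1d is a minimal illustrative stand-in.
import math
import random


def kmeans_1d(xs, k, iters=20, seed=0):
    """Tiny 1-D k-means; returns the non-empty clusters of one partition."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda c: abs(x - centers[c]))].append(x)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return [g for g in groups if g]


xs = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
candidates = []
for k in range(2, math.isqrt(len(xs)) + 1):   # every k from 2 to sqrt(n)
    candidates.extend(kmeans_1d(xs, k))
print(len(candidates))  # pooled candidate clusters from all partitions
```

Coarse partitions (small k) contribute the general candidates and fine partitions (larger k) the specific ones; the consensus step then chooses among all of them at once.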
Fifth, formulating the bicluster identification problem over different partitions may produce redundant, excessively specific, and/or excessively general biclusters (4, 5, 15, 22, 23). To select optimal biclusters, we proposed a GFM that performs the following consensus clustering strategy: (i) it eliminates all redundant biclusters by comparing the similarity between their underlying sub-matrices using hypergeometric statistics (see Methods and Materials in Supplement 1), selecting only one representative bicluster from each set of equivalent sub-matrices; biclusters harboring <10% of the total number of subjects were discarded to avoid a tendency toward singleton biclusters. (ii) From the non-redundant biclusters, the GFM selects all biclusters that are optimal in terms of specificity, generality, and diversity. To do so, it keeps all non-dominated biclusters, i.e., those for which no other bicluster is at least as good in both objectives, specificity and generality (4, 5, 15, 22, 23). Moreover, to ensure diversity, the GFM applies the non-dominance criterion only among biclusters within a neighborhood (22, 23) (i.e., biclusters with relatively high overlap of subjects and voxels). This multiobjective and multimodal optimization process is analogous to Minimum Description Length methods (24) and was carried out as described in (1-8). (iii) The remaining optimal biclusters are hierarchically organized by inclusion of their subjects into connected or disjoint hierarchies (i.e., subgraphs or subnetworks).
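Two pieces of this consensus step can be sketched concretely: a hypergeometric tail probability for the overlap between two biclusters' subject sets (a tiny p-value signals redundancy), and a Pareto (non-dominance) filter over the two objectives. The set sizes and objective values below are illustrative; the paper's exact scoring is described in its Supplement 1.

```python
# (i) Hypergeometric tail: probability of observing an overlap of at
#     least k subjects between a bicluster of size n and one of size K,
#     drawn from N subjects total. Small values suggest redundancy.
# (ii) Pareto filter: keep biclusters not dominated in both objectives
#     (for distinct points, ">= in both" implies "> in at least one").
# All numeric inputs below are hypothetical, for illustration only.
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(overlap >= k) under random draws of n subjects out of N,
    K of which belong to the other bicluster."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)


def non_dominated(points):
    """Keep points (specificity, generality), both maximized, for which
    no distinct point is at least as good in both objectives."""
    return [p for p in points
            if not any(q != p and q[0] >= p[0] and q[1] >= p[1]
                       for q in points)]


# Overlap of 8 subjects between biclusters of sizes 9 and 10, N = 100:
print(hypergeom_tail(100, 10, 9, 8))   # tiny p-value: likely redundant
print(non_dominated([(0.9, 0.2), (0.2, 0.9), (0.3, 0.3), (0.2, 0.2)]))
```

In the example, only (0.2, 0.2) is dropped: every other point is best in some trade-off between specificity and generality, which is how both general and specific biclusters survive together.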
Sixth, after the first step of identifying interesting patterns describing objects embedded in the data, we went one step further by also finding characteristic descriptions for each group, where each group represents a concept or class, using conceptual clustering. In this work we accounted for factors such as the generality and simplicity of the derived bicluster descriptions by (1) validating the relational clusters against independent data-driven descriptions obtained from other domains of knowledge, and (2) incorporating the previously structured knowledge into predictive multiclassifiers that mix machine learning and multiobjective optimization techniques.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.