Minimum total expected workload to achieve the learning outcomes for this unit is hours per semester, typically comprising a mixture of scheduled learning activities and independent study. Independent study may include associated readings, assessment and preparation for scheduled activities. Scheduled activities may include a combination of teacher-directed learning, peer-directed learning and online engagement. The learning goals associated with this unit are to: demonstrate an understanding of the role that multivariate statistical techniques such as factor analysis, structural equation modelling, logistic regression, categorical data analysis, cluster analysis, multidimensional scaling and correspondence analysis play in uncovering relationships and patterns in survey data; appraise the strengths and limitations of these techniques; apply tools in R to generate solutions for the appropriate statistical techniques; demonstrate skills in using the appropriate statistical techniques from a user and provider perspective; and demonstrate skills in communicating the results of the analysis so that decision making can be implemented.
Nonnegative matrix factorization (NMF) attempts to decompose a nonnegative matrix into nonnegative loadings and nonnegative factors, thus describing the non-subtractive patterns in the data. After transforming the raw data into input data fulfilling the constraints of nonnegativity as Kim et al. Consequently, for a large number of genomic features, data reduction techniques such as principal component analysis 46 were required in the data preprocessing step, which might result in loss of information. Moreover, network information can be incorporated into NMF.
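The decomposition described above can be illustrated with a minimal sketch. It uses the classic multiplicative-update rule for the squared-error loss, which is one common way to fit NMF and is not necessarily the algorithm used by the methods reviewed here. The function name `nmf`, the rank `k`, the iteration count, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate a nonnegative matrix X (n x p) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.random((n, k)) + eps   # nonnegative loadings (hypothetical init)
    H = rng.random((k, p)) + eps   # nonnegative factors (hypothetical init)
    for _ in range(n_iter):
        # Multiplicative updates: ratios of nonnegative terms keep
        # every entry of W and H nonnegative at each step.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy data: an exactly rank-3 nonnegative matrix.
rng = np.random.default_rng(1)
X = rng.random((20, 3)) @ rng.random((3, 30))
W, H = nmf(X, k=3)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.4f}")
```

Because both factors stay nonnegative, each row of `H` can only add to the reconstruction, which is the "non-subtractive" property mentioned above.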
Network-based stratification (NBS) 47 minimized the following objective function in order to cluster tumors into subtypes according to somatic mutation profiles, with K being an adjacency matrix encoding network information: As pointed out by the authors, NBS can be further generalized to integrate multiple layers of information 47; thus I expect a loss function as a combination of Equations 9 and A major issue with all the factorization approaches mentioned above is that they require proper normalization across data types.
Generally, different data types have different distributions, different variability, and different numbers of genomic features. For instance, without proper scaling, as pointed out by Lock et al. JIVE attempted to handle that issue with normalization first across each row and then scaling across data types.
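The two-step normalization mentioned above (first across rows, then across data types) might be sketched as follows. This is an assumption-laden illustration, not the JIVE implementation: it centers each feature row and divides each block by its Frobenius norm so that no data type dominates simply because it has more features or a larger variance. The function name `center_and_scale` and the simulated blocks are hypothetical.

```python
import numpy as np

def center_and_scale(blocks):
    """Row-center each block, then scale it to unit Frobenius norm.

    blocks: list of arrays, each (features_d x patients), sharing patient columns.
    """
    out = []
    for X in blocks:
        Xc = X - X.mean(axis=1, keepdims=True)  # center each feature (row)
        norm = np.linalg.norm(Xc)               # Frobenius norm of the whole block
        out.append(Xc / norm if norm > 0 else Xc)
    return out

# Two simulated data types with very different sizes and variability.
rng = np.random.default_rng(0)
expr = rng.normal(0, 10, size=(500, 12))   # e.g. many high-variance features
mirna = rng.normal(0, 1, size=(40, 12))    # e.g. few low-variance features
scaled = center_and_scale([expr, mirna])
print([round(float(np.linalg.norm(b)), 6) for b in scaled])
```

After scaling, both blocks contribute the same total sum of squares to any joint factorization.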
On the other hand, as mentioned above, iCluster 33 tried to use different penalty functions to take care of different data features. However, it still failed to distinguish between binary, categorical, and continuous data types. The method proposed by Mo et al. Specifically, with i indexing patients and j indexing genomic features, for a binary outcome, it rephrased Equation 6 as. Likelihoods for count outcomes with a Poisson distribution and continuous variables with a normal distribution can be derived accordingly. A lasso penalty was also placed on L_d for regularization. The tuning parameters for regularization were chosen by the Bayesian information criterion (BIC), and the model was fitted by the modified Monte Carlo Newton-Raphson algorithm.
A potential future research problem would be how to adapt different distribution assumptions into a more flexible factorization framework such as the joint Bayesian factor model 38 and JIVE. Another problem worth investigating in real applications, as pointed out by the referee, is the choice of the number of components or clusters K; the authors of the above models have tried resampling-based criteria measuring cluster reproducibility, permutation-based testing approaches, the Indian buffet process, and BIC, whereas the Akaike information criterion (AIC) and the Bayes factor might also be of interest.
Bayesian hierarchical models are another set of popular tools for integrative analysis of heterogeneous data types. They offer the flexibility to model different data-type-specific distributions as well as various types of correlation among data types. In multiple dataset integration (MDI), 51 the authors considered the case where multiple genomic data types were measured under a single biological condition for a common set of genomic features. For instance, gene expression data, protein-DNA interaction data, and protein-protein interaction data were measured simultaneously for the same group of genes.
The model assumed that each data type followed a K-component mixture model. Let c_id indicate the class membership of feature i in dataset d. Then, MDI modeled the associations among datasets via the following conditional prior for data-type-specific class memberships: MDI was further extended by incorporating a feature selection step in modeling its data-type-specific distributions and applied to gene expression, copy number variation, methylation, and microRNA data of glioblastoma samples from TCGA.
A more flexible model might allow the association to vary from one cluster of features (genes) to another. It first discretized all the genomic features and concatenated them into one vector for each patient. Next, for each patient, it assumed that each genomic feature was generated from a multinomial distribution whose parameters were determined by a K-dimensional Dirichlet distribution. One drawback of this approach is that it requires the discretization of each data type, which may lose a substantial amount of information.
Bayesian consensus clustering was proposed to model the overall clustering consensus among different data types rather than pairwise associations among data types. Therefore, a single overall clustering can be achieved at the patient level, resulting in cancer subtype discoveries. So far, software has been developed with the data-specific distribution specified as a normal distribution. All the above models were embedded into the Bayesian framework.
Consequently, one main challenge lies in the computation of the MCMC algorithm for model fitting. Generally speaking, compared to matrix factorization methods, the Bayesian hierarchical model provides more flexibility to model data-type-specific distributions and various dependence structures. Nevertheless, it remains challenging to build models that comprehensively capture the association among different data types, among patients, and among different clusters of genomic features.
Another emerging approach for identifying cancer subtypes is to construct networks for patients and then conduct clustering according to the obtained network graph. Similarity network fusion (SNF) 55 first constructed a similarity network of patients for each data type, where each node represented a patient and the weight on each edge indicated the similarity between two patients.
Then, SNF normalized each network W_d into a matrix P_d that captured the global similarities among patients, with row sums being 1, and a matrix S_d that described only the local similarities among the K nearest neighbors of each patient. Instead of building a graph for each type of data, Katenka et al. A hypothesis testing approach was used to construct an association network according to canonical correlation between two groups of attributes.
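The construction of a patient similarity network and its global and local normalized versions can be sketched as below. This is a simplified illustration under stated assumptions: it uses a plain Gaussian kernel and simple row normalization, whereas the published SNF method uses more refined choices for both. The helper names (`similarity_network`, `global_matrix`, `local_matrix`), the bandwidth `sigma`, and the toy data are hypothetical.

```python
import numpy as np

def similarity_network(X, sigma=2.0):
    """Gaussian-kernel similarity W between patients (rows of X)."""
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def global_matrix(W):
    """Row-normalize W so each row sums to 1 (the role played by P_d)."""
    return W / W.sum(axis=1, keepdims=True)

def local_matrix(W, k):
    """Keep each patient's k nearest neighbors only, then row-normalize (S_d)."""
    n = W.shape[0]
    masked = W.copy()
    np.fill_diagonal(masked, -np.inf)            # a patient is not its own neighbor
    idx = np.argsort(-masked, axis=1)[:, :k]     # k most similar patients per row
    rows = np.arange(n)[:, None]
    S = np.zeros_like(W)
    S[rows, idx] = W[rows, idx]
    return S / S.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))   # 10 patients, 5 features of one data type
W = similarity_network(X)
P = global_matrix(W)
S = local_matrix(W, k=3)
```

`P` retains every pairwise similarity, while `S` is sparse with exactly `k` nonzero entries per row, so it carries only neighborhood information.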
Kolar et al. Then, a penalized log likelihood was optimized to estimate the partial canonical correlations for constructing a Markov graph. SNF lacks a rigorous probabilistic model to fuse multiple graphs; the methods of both Katenka et al. Given the burst of statistical literature on multiple-graph estimation, 60-66 though usually for a single data type across multiple conditions, I expect that estimation of multiple graphs constructed from multiple data types and construction of a single graph from heterogeneous data types with data-type-specific distributions will call for novel statistical models, methods, and theories for network research.
One of the major goals of cancer research is to identify the survival curves for cancer patients. Therefore, statistical methods for studying the relationship between survival data and high-dimensional genomic data are of vital clinical importance.
Here, I briefly review recent developments in integrating genomic data with survival data. Let T_i and C_i denote the true underlying failure time and censoring time, respectively.
A conditionally independent censoring mechanism given the covariates is usually assumed. Two main approaches to modeling survival data with high-dimensional genomic data are penalization-based variable selection methods and tree-based ensemble learning methods. The Cox proportional hazards model 67 is one of the most widely used models for survival data.
It assumes that the hazard at time t for x_i is. Then, for model fitting, the partial likelihood 68 can be derived as. For penalized variable selection methods as well as other dimension reduction methods developed for survival data before , see Witten and Tibshirani 72 for a detailed review. Along the same road map, when biological pathway information is available, penalty functions were also designed to conduct both group-level selection and within-group-level variable selection 73 as well as to enforce smoothness of regression coefficients for genes connected in a network. Parallel to the development of methods for linear models moving from high dimension to ultrahigh dimension, defined in Fan and Lv 75 as the dimensionality growing exponentially with the sample size, sure independence screening (SIS) methods have also been developed for survival data.
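In its standard textbook form, the Cox model takes the hazard for covariates x_i to be h_0(t) exp(x_i' beta), and the partial likelihood is the product, over observed failures, of exp(x_i' beta) divided by the sum of exp(x_j' beta) over subjects still at risk. The sketch below evaluates the negative log of that quantity. The symbols h_0 and beta, the function name, and the toy data are assumptions for illustration, and tied event times are assumed absent. Penalized methods such as the lasso would add a penalty on beta to this objective.

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, time, event):
    """Negative log Cox partial likelihood (assumes no tied event times).

    X: (n x p) covariates; time: observed time min(T_i, C_i);
    event: 1 if the failure was observed, 0 if censored.
    """
    order = np.argsort(-time)        # sort subjects by decreasing observed time
    eta = X[order] @ beta            # linear predictors x_i' beta
    ev = event[order]
    # After sorting, the risk set of subject i is subjects 0..i, so a running
    # log-sum-exp gives log(sum over the risk set of exp(eta_j)).
    log_risk = np.logaddexp.accumulate(eta)
    return -np.sum(ev * (eta - log_risk))

time = np.array([5.0, 3.0, 8.0, 1.0])
event = np.array([1, 0, 1, 1])
X = np.array([[0.5, 1.0], [1.5, -0.2], [-0.3, 0.8], [2.0, 0.1]])

# With beta = 0 every subject has equal risk, so each observed failure
# contributes log(size of its risk set): log(1) + log(2) + log(4) = log(8).
nll_at_zero = neg_log_partial_likelihood(np.zeros(2), X, time, event)
print(nll_at_zero, np.log(8.0))
```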
High-Dimensional Data Analysis in Cancer Research
For survival data, Fan et al. In contrast, rather than using the proportional hazards model, the feature aberration at survival times (FAST) statistic was developed in Gorst-Rasmussen and Scheike 80 as a measure of the aberration of each covariate relative to its at-risk average. Ensemble learning methods such as random forest 29 and boosting 81 are well known for offering outstanding prediction accuracy.
Several methods have attempted to handle the missingness caused by censoring and thus generalized ensemble learning methods to survival data. Hothorn et al. Given a new observation, its survival function is estimated by the Kaplan-Meier curve 84 for all data points in all the trees that belonged to the same leaf node as the new observation. In Hothorn et al. RIST iterated between imputing censored data using conditional survival distributions and refitting the conditional survival distributions by pooling all the trees with imputed data.
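The Kaplan-Meier estimate that such a forest applies to the observations sharing a leaf node can be sketched as follows. This shows only the estimator itself on a single pooled set of observations; the tree building and pooling steps are not shown. The function name `kaplan_meier` and the toy data are assumptions for illustration.

```python
import numpy as np

def kaplan_meier(time, event):
    """Return the distinct event times and the survival estimate at each."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    event_times = np.unique(time[event == 1])
    surv = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(time >= t)                   # still under observation at t
        deaths = np.sum((time == t) & (event == 1))   # failures observed at t
        s *= 1.0 - deaths / at_risk                   # product-limit update
        surv.append(s)
    return event_times, np.array(surv)

# Six observations; event = 0 marks a censored time.
times, surv = kaplan_meier([1, 2, 2, 3, 4, 5], [1, 1, 0, 1, 0, 1])
print(times, surv)
```

Censored observations never trigger a drop in the curve, but they still count in the at-risk denominator until their censoring time, which is how the estimator uses partial information.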
Despite the wide use of random forest, theoretical analyses of its consistency and asymptotics 89-91 are just emerging. Therefore, at present the theoretical properties of tree-based ensemble methods remain a significant challenge. An ultimate goal of genomics is to demystify the regulation program of different functional genomic profiles. How is DNA methylation related to gene expression?
How does transcription factor binding control gene expression? What is the relationship between chromatin status and methylation status? The core problem underlying all these questions is whether we can predict one type of genomic profile from another, where both the response and predictor variables are multivariate with at least tens of thousands of variables. Such scenarios go beyond simple or multiple linear regression, penalized approaches such as the lasso, and sure independence screening for ultrahigh dimensions, in that the response variable itself is also an ultrahigh-dimensional vector rather than a scalar.
The small sample size adds another dimension of challenge for inferring the relations between tens of thousands of responses and predictors.
To bring this to light, the regression model can be formulated as. Consequently, a sparse orthogonal decomposition algorithm preserving sparsity in U and V was developed to estimate U and V iteratively. It can be seen that the cross-data-type prediction will open another new field for statistical methodology and theoretical research, given that both the predictors and the responses can not only be ultrahigh dimensional but also consist of multiple data types.
More and more efforts have been devoted to the development of statistical models and methods for integrative cancer data.
Nevertheless, research on integrative analyses for cancer is still in its infancy with many open problems. How can systematic biases such as batch effects be detected and corrected in each new type of high-throughput technology so that meta-analysis across studies can be conducted? How can cancer subtypes be classified according to multiple genomic profiles jointly or determined by only a subset of genomic profiles? How can a single network be constructed with multidimensional genomic profiles?
How can networks constructed from different types of data be modeled jointly? How can survival time of cancer patients be predicted by multiple types of ultrahigh-dimensional genomic profiles? How can one ultrahigh-dimensional vector be predicted from another ultrahigh-dimensional vector, one maybe continuous while the other discrete?