66 Matching Annotations
  1. Jan 2020
    1. Make sure that the predictor variables are normally distributed. If not, you can use log, root, Box-Cox transformation.
    2. You can also fit generalized additive models (see the chapter on polynomial and spline regression) when linearity of the predictor cannot be assumed. This can be done using the mgcv package.
    3. For a given predictor (say x1), the associated beta coefficient (b1) in the logistic regression function corresponds to the log of the odds ratio for that predictor.
    4. If the odds ratio is 2, then the odds that the event occurs (event = 1) are two times higher when the predictor x is present (x = 1) versus x is absent (x = 0).
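
The relation between a logistic regression coefficient and the odds ratio can be checked numerically. The following is a minimal sketch: the coefficient values `b0` and `b1` are made up for illustration and do not come from any fitted model.

```python
import math

def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)

def logistic(z):
    """Inverse of the logit: map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients (assumed for illustration only).
b0 = -1.0            # intercept
b1 = math.log(2.0)   # chosen so that the odds ratio is 2

p_absent = logistic(b0)        # P(event = 1 | x = 0)
p_present = logistic(b0 + b1)  # P(event = 1 | x = 1)

# The ratio of the two odds equals exp(b1), whatever the intercept is.
odds_ratio = odds(p_present) / odds(p_absent)
```

Note that the probabilities themselves do not double; only the odds do.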
    1. An outlier is a data point whose response y does not follow the general trend of the rest of the data. A data point has high leverage if it has "extreme" predictor x values. With a single predictor, an extreme x value is simply one that is particularly high or low. With multiple predictors, extreme x values may be particularly high or low for one or more predictors, or may be "unusual" combinations of predictor values (e.g., with two predictors that are positively correlated, an unusual combination of predictor values might be a high value of one predictor paired with a low value of the other predictor).
    1. As shown in the Residuals vs Fitted plot, there is a megaphone shape, which indicates that non-constant variance is likely to be an issue.
  2. Dec 2019
    1. So what's our total time cost? $O(n\log_{2}{n})$. The $\log_{2}{n}$ comes from the number of times we have to cut $n$ in half to get down to sublists of just 1 element (our base case). The additional $n$ comes from the time cost of merging all $n$ items together each time we merge two sorted sublists.
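
The halving and merging described in this highlight can be sketched as a short merge sort. This is an illustrative version, not the code from the quoted article; the function names are assumptions.

```python
def merge(left, right):
    """Merge two already-sorted lists into one sorted list in linear time."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # At most one of these two tails is non-empty.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

def merge_sort(items):
    """Sort by halving down to single elements, then merging back up."""
    if len(items) <= 1:  # base case: 0 or 1 element is already sorted
        return list(items)
    mid = len(items) // 2
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))
```

Each level of recursion touches all n items once during merging, and there are about log2(n) levels, which is where the two factors in the cost come from.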
  3. Aug 2019
    1. so that instead of predicting the time of event, we are predicting the probability that an event happens at a particular time.
  4. Jul 2019
    1. In practice, we found that it is not appropriate to use Aalen’s additive hazards model for all datasets, because when we estimate cumulative regression functions B(t), they are restricted to the time interval where X (X has been defined in Chapter 3) is of full rank, that means X′X is invertible. Sometimes we found that X is not of full rank, which was not a problem with the Cox model.
    2. An overall conclusion is that the two models give different pieces of information and should not be viewed as alternatives to each other, but as complementary methods that may be used together to give a fuller and more comprehensive understanding of data.
    3. The effect of the covariates on survival is to act multiplicatively on some unknown baseline hazard rate, which makes it difficult to model covariate effects that change over time. Secondly, if covariates are deleted from a model or measured with a different level of precision, the proportional hazards assumption is no longer valid. These weaknesses in the Cox model have generated interest in alternative models. One such alternative model is Aalen’s (1989) additive model. This model assumes that covariates act in an additive manner on an unknown baseline hazard rate. The unknown risk coefficients are allowed to be functions of time, so that the effect of a covariate may vary over time.
    1. Note that three often-used transformations can be specified using the argument fun: “log”, a log transformation of the survivor function; “event”, which plots cumulative events (f(y) = 1-y), also known as the cumulative incidence; and “cumhaz”, which plots the cumulative hazard function (f(y) = -log(y)).
    2. Note that, the confidence limits are wide at the tail of the curves, making meaningful interpretations difficult. This can be explained by the fact that, in practice, there are usually patients who are lost to follow-up or alive at the end of follow-up. Thus, it may be sensible to shorten plots before the end of follow-up on the x-axis (Pocock et al, 2002).
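
The three transformations named in the first highlight are simple functions of the survival probability. The sketch below applies them to a made-up list of survival estimates; the values are hypothetical and this is plain Python, not the plotting function the highlight refers to.

```python
import math

# Hypothetical survival estimates S(t) at successive time points.
survival = [1.0, 0.9, 0.75, 0.5]

# "log": log transformation of the survivor function.
log_survival = [math.log(s) for s in survival]

# "event": cumulative events (cumulative incidence), f(y) = 1 - y.
cumulative_events = [1.0 - s for s in survival]

# "cumhaz": cumulative hazard, f(y) = -log(y).
cumulative_hazard = [-math.log(s) for s in survival]
```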
    1. in clustering analyses, standardization may be especially crucial in order to compare similarities between features based on certain distance measures. Another prominent example is the Principal Component Analysis, where we usually prefer standardization over Min-Max scaling, since we are interested in the components that maximize the variance

      Use standardization, not min-max scaling, for clustering and PCA.

    2. As a rule of thumb I’d say: When in doubt, just standardize the data, it shouldn’t hurt.
    1. Implication means co-occurrence, not causality!
    2. Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
    1. Given a dataset of transactions, the first step of FP-growth is to calculate item frequencies and identify frequent items. Different from Apriori-like algorithms designed for the same purpose, the second step of FP-growth uses a suffix tree (FP-tree) structure to encode transactions without generating candidate sets explicitly, which are usually expensive to generate. After the second step, the frequent itemsets can be extracted from the FP-tree.
    1. However, the gain ratio is the most important metric here, ranging from 0 to 1, with higher being better.
    2. en: entropy measured in bits mi: mutual information ig: information gain gr: gain ratio
    1. Feature predictive power will be calculated for all features contained in a dataset along with the outcome feature. Works for binary classification, multi-class classification and regression problems. Can also be used when exploring a feature of interest to determine correlations of independent features with the outcome feature. When the outcome feature is continuous in nature or is a regression problem, correlation calculations are performed. When the outcome feature is categorical in nature or is a classification problem, the Kolmogorov-Smirnov distance measure is used to determine predictive power. For multi-class classification outcomes, a one-vs-all approach is taken, which is then averaged to arrive at the mean KS distance measure. The predictive power is sensitive to the manner in which the data has been prepared and will differ if that preparation changes.
    1. Mutual information is one of many quantities that measure how much one random variable tells us about another. It is a dimensionless quantity with (generally) units of bits, and can be thought of as the reduction in uncertainty about one random variable given knowledge of another. High mutual information indicates a large reduction in uncertainty; low mutual information indicates a small reduction; and zero mutual information between two random variables means the variables are independent.
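
As a rough illustration of the quantity described above, the sketch below computes a plug-in estimate of mutual information in bits from paired observations of two discrete variables. The function name and the toy data are assumptions for illustration; the estimate uses empirical frequencies in place of true probabilities.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Compare the joint frequency with the product of the marginals.
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Toy data: in the first set X and Y are empirically independent,
# in the second Y is an exact copy of X.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
identical = [(0, 0), (1, 1), (0, 0), (1, 1)]
```

For the independent sample the estimate is zero; for the identical sample it equals the one bit of entropy of a fair binary variable.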
    1. Sidenote: Visually comparing estimated survival curves in order to assess whether there is a difference in survival between groups is usually not recommended, because it is highly subjective. Statistical tests such as the log-rank test are usually more appropriate.
    1. RF is now a standard to effectively analyze a large number of variables, of many different types, with no previous variable selection process. It is not parametric, and in particular for survival target it does not assume the proportional risks assumption.
    1. The survival function gives, for every time, the probability of surviving (or not experiencing the event) up to that time. The hazard function gives the potential that the event will occur, per time unit, given that an individual has survived up to the specified time.
    1. "in scaling, you're changing the range of your data while in normalization you're changing the shape of the distribution of your data."

    1. The Freedman-Diaconis rule is very robust and works well in practice. The bin-width is set to $h = 2\times\text{IQR}\times n^{-1/3}$. So the number of bins is $(\max-\min)/h$, where $n$ is the number of observations, max is the maximum value and min is the minimum value.

      How to determine the number of bins to use in a histogram.
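
A minimal sketch of the Freedman-Diaconis rule quoted above, using only the standard library. The linearly interpolated quantile and the choice to round the bin count up are assumptions of this sketch; the highlight does not specify either.

```python
import math

def quantile(sorted_values, q):
    """Linearly interpolated quantile of already-sorted data, 0 <= q <= 1."""
    pos = q * (len(sorted_values) - 1)
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac

def freedman_diaconis_bins(values):
    """Return (bin width h, number of bins) for a list of numbers."""
    data = sorted(values)
    n = len(data)
    iqr = quantile(data, 0.75) - quantile(data, 0.25)
    if iqr == 0:
        raise ValueError("IQR is zero; the rule gives no usable bin width")
    h = 2 * iqr * n ** (-1 / 3)
    # Round up so the bins cover the full range (a choice made here).
    return h, math.ceil((data[-1] - data[0]) / h)

sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up data
width, bins = freedman_diaconis_bins(sample)
```

Because the rule is based on the IQR, the single large value in `sample` widens the range but does not inflate the bin width.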

    1. Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values.


    2. many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
    1. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time.
  5. Jun 2019
    1. To interpret a model, we require the following insights: the features in the model which are most important; for any single prediction from a model, the effect of each feature in the data on that particular prediction; and the effect of each feature over a large number of possible predictions.

      Machine learning interpretability

    1. In addition, RandomOverSampler allows to sample heterogeneous data (e.g. containing some strings):


    2. The most naive strategy is to generate new samples by randomly sampling with replacement the current available samples.

      Naive random over-sampling
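
The naive strategy in the highlight above, sampling existing minority samples with replacement until the classes are balanced, can be sketched in plain Python. This is an illustration of the idea only and not the implementation of RandomOverSampler; the function name and toy data are assumptions.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Balance classes by re-drawing minority samples with replacement."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_samples = list(samples)
    out_labels = list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        # choices() samples with replacement; k is 0 for the majority class.
        extra = rng.choices(pool, k=target - count)
        out_samples.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_samples, out_labels

# Samples are strings here: nothing in the procedure needs numeric features.
X = ["a", "b", "c", "d", "e"]
y = [0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X, y)
```

No new values are created: every added sample is an exact duplicate of an existing one, which is why the approach also works on non-numeric data.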

  6. varsellcm.r-forge.r-project.org
    1. missing values are managed, without any pre-processing, by the model used to cluster with the assumption that values are missing completely at random.

      VarSelLCM package

    1. Success in a data science project comes not from access to any one exotic tool, but from having quantifiable goals, good methodology, cross-discipline interactions, and a repeatable workflow.



    1. Variables that are correlated with PC1 (i.e., Dim.1) and PC2 (i.e., Dim.2) are the most important in explaining the variability in the data set.
    2. The cos2 values are used to estimate the quality of the representation. The closer a variable is to the circle of correlations, the better its representation on the factor map (and the more important it is to interpret these components). Variables that are close to the center of the plot are less important for the first components.
    3. Taken together, the main purpose of principal component analysis is to: identify hidden patterns in a data set, reduce the dimensionality of the data by removing the noise and redundancy in the data, and identify correlated variables.
    4. the amount of variance retained by each principal component is measured by the so-called eigenvalue.
    5. These new variables correspond to a linear combination of the originals. The number of principal components is less than or equal to the number of original variables.
    6. Principal component analysis is used to extract the important information from a multivariate data table and to express this information as a set of few new variables called principal components.
    1. Thus, when we say that PCA can reduce dimensionality, we mean that PCA can compute principal components and the user can choose the smallest number K of them that explain 0.95 of the variance. A subjectively satisfactory result would be when K is small relative to the original number of features D.
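
Choosing the smallest K that explains a given share of the variance reduces to a cumulative sum over the eigenvalues. The sketch below assumes the eigenvalues of the covariance matrix are already available; the function name and the example values are hypothetical.

```python
def smallest_k(eigenvalues, threshold=0.95):
    """Smallest number of components whose eigenvalues reach the threshold
    share of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)

# Hypothetical eigenvalues for D = 5 original features.
eigenvalues = [5.0, 3.0, 1.6, 0.3, 0.1]
k = smallest_k(eigenvalues)
```

Here the first two components explain 80% of the variance and the first three explain 96%, so three components are kept out of five.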
    1. However, this doesn’t mean that Min-Max scaling is not useful at all! A popular application is image processing, where pixel intensities have to be normalized to fit within a certain range (i.e., 0 to 255 for the RGB color range). Also, typical neural network algorithms require data on a 0-1 scale.

      Use min-max scaling for image processing & neural networks.

    2. The result of standardization (or Z-score normalization) is that the features will be rescaled so that they’ll have the properties of a standard normal distribution with $\mu = 0$ and $\sigma = 1$, where $\mu$ is the mean (average) and $\sigma$ is the standard deviation from the mean.
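
The two rescalings discussed in these highlights can be compared side by side. This is a minimal standard-library sketch with made-up data; using the population standard deviation is an assumption of the sketch, and neither function handles constant input.

```python
import statistics

def standardize(values):
    """Z-score: subtract the mean, divide by the (population) standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up feature values
z = standardize(data)
scaled = min_max_scale(data)
```

Standardized values have mean 0 and standard deviation 1 but no fixed range, while min-max scaled values always lie in [0, 1]. Both are linear rescalings, so neither changes the shape of the distribution.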
  7. Jan 2019