66 Matching Annotations
1. Jan 2020
2. www.sthda.com
1. Logistic regression assumptions

#### URL

3. www.sthda.com
1. Make sure that the predictor variables are normally distributed. If not, you can use a log, root, or Box-Cox transformation.
2. You can also fit generalized additive models (Chapter @ref(polynomial-and-spline-regression)), when linearity of the predictor cannot be assumed. This can be done using the mgcv package:
3. For a given predictor (say x1), the associated beta coefficient (b1) in the logistic regression function corresponds to the log of the odds ratio for that predictor.
4. If the odds ratio is 2, then the odds that the event occurs (event = 1) are two times higher when the predictor x is present (x = 1) versus x is absent (x = 0).
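The back-transformation described in the last two quotes is a one-liner; a minimal sketch, with a made-up coefficient value:

```python
import math

# Hypothetical coefficient from a fitted logistic regression: the quote says
# b1 is the log of the odds ratio for x1, so exponentiating recovers it.
b1 = math.log(2)           # made-up value, matching the odds-ratio-of-2 example

odds_ratio = math.exp(b1)  # ~2: the odds of the event double when x1 = 1
```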

#### URL

4. online.stat.psu.edu
1. An outlier is a data point whose response y does not follow the general trend of the rest of the data. A data point has high leverage if it has "extreme" predictor x values. With a single predictor, an extreme x value is simply one that is particularly high or low. With multiple predictors, extreme x values may be particularly high or low for one or more predictors, or may be "unusual" combinations of predictor values (e.g., with two predictors that are positively correlated, an unusual combination of predictor values might be a high value of one predictor paired with a low value of the other predictor).

#### URL

5. jbhender.github.io
1. As shown in the Residuals vs Fitted plot, there is a megaphone shape, which indicates that non-constant variance is likely to be an issue.

#### URL

6. scotch.io
1. Add this code snippet to the bottom of the backend/settings.py file:

#### URL

7. Dec 2019
8. www.interviewcake.com
1. So what's our total time cost? O(n log₂ n). The log₂ n comes from the number of times we have to cut n in half to get down to sublists of just 1 element (our base case). The additional n comes from the time cost of merging all n items together each time we merge two sorted sublists.
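The cost argument in the quote can be made concrete with a textbook merge sort; a minimal sketch in Python:

```python
def merge_sort(items):
    """Sort by halving (log2 n levels) and merging (O(n) work per level)."""
    if len(items) <= 1:          # base case: a sublist of 1 element is sorted
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])
    # Merge the two sorted halves in linear time.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```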

#### URL

9. Aug 2019
10. www.pysurvival.io
1. so that instead of predicting the time of event, we are predicting the probability that an event happens at a particular time.

#### URL

11. Jul 2019
12. archimede.mat.ulaval.ca
1. In practice, we found that it is not appropriate to use Aalen's additive hazards model for all datasets, because when we estimate cumulative regression functions B(t), they are restricted to the time interval where X (X has been defined in Chapter 3) is of full rank, that means X′X is invertible. Sometimes we found that X is not of full rank, which was not a problem with the Cox model.
2. An overall conclusion is that the two models give different pieces of information and should not be viewed as alternatives to each other, but as complementary methods that may be used together to give a fuller and more comprehensive understanding of data.
3. The effect of the covariates on survival is to act multiplicatively on some unknown baseline hazard rate, which makes it difficult to model covariate effects that change over time. Secondly, if covariates are deleted from a model or measured with a different level of precision, the proportional hazards assumption is no longer valid. These weaknesses in the Cox model have generated interest in alternative models. One such alternative model is Aalen's (1989) additive model. This model assumes that covariates act in an additive manner on an unknown baseline hazard rate. The unknown risk coefficients are allowed to be functions of time, so that the effect of a covariate may vary over time.

#### URL

13. www.sthda.com
1. Note that three often-used transformations can be specified using the argument fun: “log”: log transformation of the survivor function; “event”: plots cumulative events (f(y) = 1 − y), also known as the cumulative incidence; “cumhaz”: plots the cumulative hazard function (f(y) = −log(y)).
2. Note that, the confidence limits are wide at the tail of the curves, making meaningful interpretations difficult. This can be explained by the fact that, in practice, there are usually patients who are lost to follow-up or alive at the end of follow-up. Thus, it may be sensible to shorten plots before the end of follow-up on the x-axis (Pocock et al, 2002).
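The three transformations in the first quote are simple functions of the survivor probabilities; a minimal Python sketch with made-up Kaplan-Meier values:

```python
import math

# Made-up Kaplan-Meier survival probabilities at four time points.
surv = [1.0, 0.9, 0.75, 0.6]

log_surv   = [math.log(s) for s in surv]    # "log": log of the survivor function
cum_events = [1 - s for s in surv]          # "event": cumulative incidence, 1 - y
cum_hazard = [-math.log(s) for s in surv]   # "cumhaz": cumulative hazard, -log(y)
```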

#### URL

14. fs.blog
1. 1. Direction Over Speed

#### URL

15. www.sthda.com
1. summarizing and visualizing a complex data table in which individuals are described by several sets of variables (quantitative and/or qualitative) structured into groups.

#### URL

16. www.sthda.com
1. it acts as PCA for quantitative variables and as MCA for qualitative variables.

#### URL

17. factominer.free.fr
1. MFA is a weighted PCA
2. Study the similarity between individuals with respect to the whole set of variables AND the relationships between variables

#### URL

18. sebastianraschka.com
1. in clustering analyses, standardization may be especially crucial in order to compare similarities between features based on certain distance measures. Another prominent example is the Principal Component Analysis, where we usually prefer standardization over Min-Max scaling, since we are interested in the components that maximize the variance

Use standardization, not min-max scaling, for clustering and PCA.

2. As a rule of thumb I’d say: When in doubt, just standardize the data, it shouldn’t hurt.
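A minimal pure-Python sketch of the two scalings being compared, on toy data:

```python
from statistics import mean, pstdev

x = [2.0, 4.0, 6.0, 8.0]

# Standardization (z-score): subtract the mean, divide by the standard
# deviation; the result has mean 0 and standard deviation 1.
mu, sigma = mean(x), pstdev(x)
standardized = [(v - mu) / sigma for v in x]

# Min-max scaling: squeeze values into the [0, 1] range.
lo, hi = min(x), max(x)
minmax = [(v - lo) / (hi - lo) for v in x]
```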

#### URL

19. www.csd.uwo.ca
1. Implication means co-occurrence, not causality!
2. Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
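The rule-mining setup in these quotes boils down to counting co-occurrences; a minimal sketch of the standard support and confidence measures (not named in the quote), with made-up transactions:

```python
# Toy transactions. support(X) = fraction of transactions containing X;
# confidence(X -> Y) = support(X union Y) / support(X).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

# A high-confidence rule such as {diapers} -> {beer} reflects co-occurrence
# in these transactions, not a causal claim.
```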

#### URL

20. www.cs.ubc.ca
1. Cluster features and only consider rules within clusters
2. Support Set Pruning

#### URL

21. spark.apache.org
1. Given a dataset of transactions, the first step of FP-growth is to calculate item frequencies and identify frequent items. Different from Apriori-like algorithms designed for the same purpose, the second step of FP-growth uses a suffix tree (FP-tree) structure to encode transactions without generating candidate sets explicitly, which are usually expensive to generate. After the second step, the frequent itemsets can be extracted from the FP-tree.
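A sketch of just the first step described in the quote (item frequency counting and frequent-item ordering), not of the FP-tree itself; the transactions and threshold are made up:

```python
from collections import Counter

# Made-up transactions and threshold.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["b", "c"]]
min_support = 2  # absolute count

# Step 1: count item frequencies and keep the frequent items.
freq = Counter(item for t in transactions for item in t)
frequent = {item: c for item, c in freq.items() if c >= min_support}

# FP-growth then inserts each transaction's frequent items into the FP-tree
# in order of descending global frequency.
order = sorted(frequent, key=lambda item: -frequent[item])
```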

#### URL

22. livebook.datascienceheroes.com
1. 2.1.8 Automatic data frame discretization

#### URL

1. Non-proportional hazards is a case of model misspecification.
2. The idea behind the model is that the log-hazard of an individual is a linear function of their static covariates and a population-level baseline hazard that changes over time.

#### URL

24. livebook.datascienceheroes.com
1. However, the gain ratio is the most important metric here, ranging from 0 to 1, with higher being better.
2. en: entropy measured in bits; mi: mutual information; ig: information gain; gr: gain ratio
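A minimal sketch of how those metrics relate, computed by hand on a toy split (the data and helper names are illustrative, not the package's API):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy (en) in bits of a list of labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

# A made-up outcome, split by the two values of a candidate feature.
outcome = ["yes", "yes", "no", "no"]
split = {"a": ["yes", "yes"], "b": ["no", "no"]}  # a perfect split

n = len(outcome)
h_before = entropy(outcome)                                     # 1 bit
h_after = sum(len(part) / n * entropy(part) for part in split.values())
info_gain = h_before - h_after                                  # ig
split_info = entropy([value for value, part in split.items() for _ in part])
gain_ratio = info_gain / split_info                             # gr, in [0, 1]
```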

#### URL

25. www.sthda.com
1. Balloon plot

Balloon plot

#### URL

26. rdrr.io
1. Feature predictive power will be calculated for all features contained in a dataset along with the outcome feature. Works for binary classification, multi-class classification and regression problems. Can also be used when exploring a feature of interest to determine correlations of independent features with the outcome feature. When the outcome feature is continuous in nature or is a regression problem, correlation calculations are performed. When the outcome feature is categorical in nature or is a classification problem, the Kolmogorov–Smirnov distance measure is used to determine predictive power. For multi-class classification outcomes, a one-vs-all approach is taken which is then averaged to arrive at the mean KS distance measure. The predictive power is sensitive to the manner in which the data has been prepared and will differ should that manner change.
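The KS distance mentioned in the quote is just the largest gap between two empirical CDFs; a minimal hand-rolled sketch with toy values:

```python
def ks_distance(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    points = sorted(set(sample_a) | set(sample_b))
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)

# Feature values split by a binary outcome; a larger distance suggests the
# feature separates the two classes better.
separated = ks_distance([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])  # no overlap: 1.0
identical = ks_distance([1.0, 2.0], [1.0, 2.0])            # same sample: 0.0
```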

#### URL

27. www.scholarpedia.org
1. Mutual information is one of many quantities that measures how much one random variable tells us about another. It is a dimensionless quantity with (generally) units of bits, and can be thought of as the reduction in uncertainty about one random variable given knowledge of another. High mutual information indicates a large reduction in uncertainty; low mutual information indicates a small reduction; and zero mutual information between two random variables means the variables are independent.
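The independence property in the last sentence can be checked directly from a joint probability table; a minimal sketch:

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        p * log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )

independent = [[0.25, 0.25], [0.25, 0.25]]  # knowing X says nothing about Y
dependent   = [[0.5, 0.0], [0.0, 0.5]]      # knowing X determines Y
```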

#### URL

28. nbviewer.jupyter.org
1. Sidenote: Visually comparing estimated survival curves in order to assess whether there is a difference in survival between groups is usually not recommended, because it is highly subjective. Statistical tests such as the log-rank test are usually more appropriate.

#### URL

29. pedroconcejero.wordpress.com
1. RF is now a standard to effectively analyze a large number of variables, of many different types, with no previous variable selection process. It is not parametric and, in particular, for survival targets it does not make the proportional hazards assumption.

#### URL

30. www.cscu.cornell.edu
1. The survival function gives, for every time, the probability of surviving (or not experiencing the event) up to that time. The hazard function gives the potential that the event will occur, per time unit, given that an individual has survived up to the specified time.

#### URL

31. www.statisticshowto.datasciencecentral.com
1. Sturges’ rule works best for continuous data that is normally distributed and symmetrical.
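For reference, Sturges' rule itself is k = ⌈log₂ n⌉ + 1 bins for n observations; a one-function sketch:

```python
from math import ceil, log2

def sturges_bins(n):
    """Sturges' rule: k = ceil(log2(n)) + 1 histogram bins for n observations."""
    return ceil(log2(n)) + 1
```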

#### URL

32. towardsdatascience.com
1. how the features are all on the same relative scale. The relative spaces between each feature’s values have been maintained.

#### URL

33. www.kaggle.com
1. "in scaling, you're changing the range of your data while in normalization you're changing the shape of the distribution of your data."

#### URL

34. stats.stackexchange.com
1. The Freedman–Diaconis rule is very robust and works well in practice. The bin width is set to h = 2 × IQR × n^(−1/3). So the number of bins is (max − min)/h, where n is the number of observations, max is the maximum value and min is the minimum value.

How to determine the number of bins to use in a histogram.
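A minimal sketch of the rule as quoted, using Python's `statistics.quantiles` for the IQR (toy uniform data):

```python
from math import ceil
from statistics import quantiles

def fd_bins(data):
    """Freedman-Diaconis: h = 2 * IQR * n^(-1/3); bins = (max - min) / h."""
    n = len(data)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")  # quartiles
    h = 2 * (q3 - q1) * n ** (-1 / 3)
    return ceil((max(data) - min(data)) / h)
```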

#### URL

35. scikit-learn.org
1. Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values.

Binning

2. many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

#### URL

36. www.willmcginnis.com
1. we want to code categorical variables into numbers, but we are concerned about this dimensionality problem

#### URL

37. machinelearningmastery.com
1. Ensemble Machine Learning Algorithms in Python with scikit-learn

#### URL

38. www.oreilly.com
1. Machine learning models are basically mathematical functions that represent the relationship between different aspects of data.

#### URL

39. jmlr.csail.mit.edu
1. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time.

#### URL

40. Jun 2019
41. towardsdatascience.com
1. To interpret a model, we require the following insights: the features in the model which are most important; for any single prediction from a model, the effect of each feature in the data on that particular prediction; and the effect of each feature over a large number of possible predictions.

Machine learning interpretability

#### URL

42. christophm.github.io
1. Instability means that it is difficult to trust the explanations, and you should be very critical.

#### URL

1. When dealing with a mix of continuous and categorical features, SMOTE-NC is the only method which can handle this case.

SMOTE-NC

#### URL

1. In addition, RandomOverSampler allows sampling of heterogeneous data (e.g. containing some strings):

RandomOverSampler

2. The most naive strategy is to generate new samples by randomly sampling with replacement from the currently available samples.

Naive random over-sampling
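The naive strategy in the second quote is a few lines of standard-library Python; a sketch with a made-up imbalanced dataset (an illustration, not imbalanced-learn's implementation):

```python
import random
from collections import Counter

random.seed(0)  # deterministic for the example

# A made-up imbalanced dataset: (sample, class label) pairs.
majority = [(f"x{i}", 0) for i in range(10)]
minority = [(f"y{i}", 1) for i in range(3)]

# Naive random over-sampling: draw minority samples with replacement
# until the two classes are balanced.
extra = random.choices(minority, k=len(majority) - len(minority))
balanced = majority + minority + extra

counts = Counter(label for _, label in balanced)  # equal counts per class
```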

#### URL

45. varsellcm.r-forge.r-project.org
1. missing values are managed, without any pre-processing, by the model used for clustering, under the assumption that values are missing completely at random.

VarSelLCM package

#### URL

46. Local file
1. Success in a data science project comes not from access to any one exotic tool, but from having quantifiable goals, good methodology, cross-discipline interactions, and a repeatable workflow.

#### Annotators

47. www.sthda.com
1. Variables that are correlated with PC1 (i.e., Dim.1) and PC2 (i.e., Dim.2) are the most important in explaining the variability in the data set.
2. The cos2 values are used to estimate the quality of the representation. The closer a variable is to the circle of correlations, the better its representation on the factor map (and the more important it is for interpreting these components). Variables that are close to the center of the plot are less important for the first components.
3. Taken together, the main purpose of principal component analysis is to: identify hidden patterns in a data set, reduce the dimensionality of the data by removing the noise and redundancy in the data, and identify correlated variables.
4. the amount of variance retained by each principal component is measured by the so-called eigenvalue.
5. These new variables correspond to a linear combination of the originals. The number of principal components is less than or equal to the number of original variables.
6. Principal component analysis is used to extract the important information from a multivariate data table and to express this information as a set of few new variables called principal components.

#### URL

48. www.octoberraindrops.com
1. Thus, when we say that PCA can reduce dimensionality, we mean that PCA can compute principal components and the user can choose the smallest number K of them that explain 0.95 of the variance. A subjectively satisfactory result would be when K is small relative to the original number of features D.
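The choose-the-smallest-K-for-95%-variance procedure can be sketched with a plain eigendecomposition (made-up data with 3 informative directions out of D = 10; an illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10
# Made-up data: independent features whose variances fall off quickly,
# so a few principal components carry almost all the variance.
stds = np.array([10.0, 6.0, 3.0] + [0.1] * (D - 3))
X = rng.normal(size=(500, D)) * stds

# PCA via an eigendecomposition of the covariance matrix.
eigvals = np.linalg.eigvalsh(np.cov(X.T))[::-1]   # descending eigenvalues
explained = np.cumsum(eigvals) / eigvals.sum()    # cumulative variance share

# The smallest K whose components explain 95% of the variance;
# here K is small relative to D = 10.
K = int(np.searchsorted(explained, 0.95) + 1)
```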

#### URL

49. sebastianraschka.com
1. However, this doesn’t mean that Min-Max scaling is not useful at all! A popular application is image processing, where pixel intensities have to be normalized to fit within a certain range (i.e., 0 to 255 for the RGB color range). Also, typical neural network algorithms require data on a 0–1 scale.

Use min-max scaling for image processing & neural networks.

2. The result of standardization (or Z-score normalization) is that the features will be rescaled so that they’ll have the properties of a standard normal distribution with μ = 0 and σ = 1, where μ is the mean (average) and σ is the standard deviation from the mean.

#### URL

1. Threshold values of 0.8-0.9 are recommended for well separated clusters; to allow for overlapping clusters, we chose a threshold of 0.6.

#### URL

51. Jan 2019
52. stackoverflow.com
1. Changing chunk background color in RMarkdown

Change the background colour of code chunks in Rmarkdown using CSS.