179 Matching Annotations
  1. May 2018
    1. cise: Repeat the cross-validation process above for different values of k, and plot both MAE and RMSE as a function of k (from 1 to 9). What is the best value of k according to each measure? Use set.seed(0). Notice that the smaller the MAE and RMSE values, the better (more accurate) the results. Exercise: Repeat the previous exercise but now for User-Based KNN instead of Item-Based KNN. ADDITIONAL ACTIVI

      Provide context/solution/comment/feedback for these exercises and activities.

    2. User-based Collaborative Filtering makes the assumption that users with similar preferences tend to rate items similarly. From this assumption, missing ratings for a user

      What's the most popular application of this technique?

    3. In

      Is it still part of the exercise from here, or is it the next part of the content? I'm confused.

      If it's the content continued, please break content into chunks and give them headings.

    4. Exercise

      Make clear what the context is, what the task is, and what is additional info/guidance/solution. Provide a solution/feedback if it's not there already.

    1. Additional Computer Practice (LAB)

      You need to write detailed descriptions of the LAB.

      • The problem statement
      • The task
      • Submission

      Solutions/feedback/comments also need to be created.

    2. cise 5: Repeat the experiment above, producing plots analogous to those in Figs. 14 to 16, but with GLOSH rather than LOF. Use GLOSH Scores higher than 0.9 and 0.95 as a counterpart for LOF Scores higher than 2 and 3, respectively. Exerc

      Solutions/feedback needed for both exercises.

    3. nearest neighbours to estimate the density around the observation. It also estimates the densities around the neighbours, as well as around the neighbours of the neighbours, in an analogous way. The intuition is that, for an observation inside a cluster (i.e., an inlier) the contrast between the density around an observation and the densities around its neighbours is very similar to the contrast between the densities around each neighbour

      Any figure/diagram to illustrate this concept?

    4. local method

      I can't remember: did you discuss local methods vs global methods before? If yes, please remind me where. We will link this back to that section so that students can refresh their knowledge of these concepts. If you haven't done so, you need to provide this discussion somewhere, probably where the concept is first mentioned.

    5. library(ggplot2)
       set.seed(0)
       x11 <- rnorm(n = 100, mean = 10, sd = 1)  # Cluster 1 (x1 coordinate)
       x21 <- rnorm(n = 100, mean = 10, sd = 1)  # Cluster 1 (x2 coordinate)
       x12 <- rnorm(n = 100, mean = 20, sd = 1)  # Cluster 2 (x1 coordinate)
       x22 <- rnorm(n = 100, mean = 10, sd = 1)  # Cluster 2 (x2 coordinate)
       x13 <- rnorm(n = 100, mean = 15, sd = 3)  # Cluster 3 (x1 coordinate)
       x23 <- rnorm(n = 100, mean = 25, sd = 3)  # Cluster 3 (x2 coordinate)
       x14 <- rnorm(n = 50, mean = 25, sd = 1)   # Cluster 4 (x1 coordinate)
       x24 <- rnorm(n = 50, mean = 25, sd = 1)   # Cluster 4 (x2 coordinate)
       dat <- data.frame(x1 = c(x11, x12, x13, x14), x2 = c(x21, x22, x23, x24))
       ( g0a <- ggplot() + geom_point(data = dat, mapping = aes(x = x1, y = x2), shape = 19) )

      Is this the dataset from R?

    6. upervised outlier detection fit parametric distributions by estimating the parameters of these distributions from the given data. For instance, assuming that the normal observations follow a univariate Gaussian distribution, the parameters (mean and standard deviation) of this distribution can be estimated from the collection of observations, and those observations lying on the tail of the distribution will be deemed outliers. A very common rule of thumb, known as the “3σ rule”, is that observations deviating more than three times the standard deviation from the mean of a normal distribution may be considered outliers. It is worth noticing that, in the unsupervised setting, when an observation is detected as an outlier because it deviates from the usual pattern, it doesn’t necessarily mean that the outlier has not been generated by the same mechanism as the other observations. It only means that it is unlikely that this outlier has been generated by the same mechanism, or in other words, it is likely that it has been generated by a different mechanism; “how likely” is a measure or score of outlyingness. Observations with high outlier scores are not necessarily anomalies (frauds, computer intrusions, genetic mutations, etc.); they are suspicious, though, and therefore they may deserve further investigation.

      Need a diagram/figure to visualise. Will make it a lot easier to understand.

      Maybe link back to this part on statistics for students to refresh their knowledge?

    7. targets outliers as potentially unknown anomalies yet to be discovered. For instance, they can be a new type of fraud or network intrusion not identified before, so there are no labeled observations to train a classifier.

      Please provide real world examples. What are the most popular/common applications of this algorithm/task?

    1. (1) Choose k prototypes randomly. (2) Assign each of the N observations in the data set to its closest prototype. (3) Recompute the prototypes using Eq. (2). (4) Repeat 2 and 3 until nothing changes from the previous iteration (or until a maximum number of iterations is reached).

      In the example below, could you clearly label each of these steps so that students can follow along easily?
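
      To help with the labelling, here is a rough sketch of the loop with each step commented. It uses the iris measurements as stand-in data (an assumption, not the example from the materials), and it does not handle the edge case of a cluster becoming empty:

```r
set.seed(0)
X <- as.matrix(iris[, 1:4])
k <- 3

# Step 1: choose k prototypes randomly (here, k rows of the data)
prototypes <- X[sample(nrow(X), k), ]

for (iter in 1:100) {
  # Step 2: assign each of the N observations to its closest prototype
  d <- as.matrix(dist(rbind(prototypes, X)))[-(1:k), 1:k]
  assignment <- apply(d, 1, which.min)

  # Step 3: recompute the prototypes as the mean of the assigned observations
  new_prototypes <- t(sapply(1:k, function(j)
    colMeans(X[assignment == j, , drop = FALSE])))

  # Step 4: repeat 2 and 3 until nothing changes from the previous iteration
  if (all(abs(new_prototypes - prototypes) < 1e-12)) break
  prototypes <- new_prototypes
}
```

      The step numbers in the comments correspond to the four steps listed above, so each code chunk can be labelled in the same way in the materials.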

    2. |·|

      Should we refer students back to Essential Maths for set theory? Maybe we should say at the beginning of the subject that they will encounter lots of this notation, and that it would be worth revisiting certain parts of EM?

    3. Notice that x_i is the ith row of an N × n data matrix where rows correspond to observations and columns correspond to variables.

      Please create this conceptual matrix. Also have real world examples to illustrate the concept.

      For example: 1000 customers (N = 1000). For each customer, the n variables are: age, income, gender (0 = male, 1 = female), etc.

      Then a certain observation is:

      x1 = [20 57,000 1]
      x2 = [49 112,000 0]
      x3 = .....

      Of course I'm making this up, but you get the idea.

    4. Additional

      Please compile all the information needed for this LAB in one document so that we can provide it to the student in one piece.

      Also please produce the solution.

    5. . Since objects may become noise below a certain value of eps, clusters may “shrink” as we move downwards in the tree (i.e., as eps decreases). That’s why the thickness of the horizontal lines may narrow as we move from top to bottom, possibly vanishing away when all objects of a cluster become noise (the leaves of the tree).

      This would be a cool place for an animation!!!

      If you could provide a series of HDBSCANs with their corresponding eps, we can easily make this happen! It would show how this shrinking happens.
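
      To help storyboard this numerically in the meantime, a base-R sketch like the one below (my own, not from the materials) shows the "shrinking" directly: as eps decreases, fewer observations satisfy the minPts density threshold, so more become noise. Note this only counts core points rather than running HDBSCAN itself, and it uses the iris measurements as placeholder data; the week's data set and eps values should be substituted:

```r
X <- as.matrix(iris[, 1:4])
minPts <- 5
D <- as.matrix(dist(X))          # pairwise distances, computed once
eps_values <- c(0.8, 0.6, 0.4, 0.2)
core_counts <- sapply(eps_values, function(eps) {
  n_neighbours <- rowSums(D <= eps)  # eps-neighbourhood size (incl. the point itself)
  sum(n_neighbours >= minPts)        # number of core (dense) points at this eps
})
data.frame(eps = eps_values, core_points = core_counts)
```

      The sequence of frames for the animation would be one such table row (or plot) per eps value.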

    6. So, a core point is an observation that is dense in the sense that it has minPts or more observations in its vicinity and, therefore, it satisfies the minimum density threshold. Any core point is part of a cluster. Core points that fall within each other’s eps-neighbourhood are said to be connected and, therefore, they are part of the same cluster. Clusters are maximal sets of connected core points. Observations that are not core (i.e. not dense) but are within the eps-neighbourhood of a core point are also incorporated into the corresponding cluster. These are called border points, because they are usually located around the border of clusters. The remaining observations (which are neither core nor border) are deemed noise; these are treated as outliers and, therefore, they are not included in any cluster

      Again, a diagram would help understand this a lot better. Anything from a textbook that we can reuse?

    7. that can detect arbitrarily shaped clusters while being robust to noise

      Can we have a real world example to illustrate what it means by "arbitrarily shaped clusters while being robust to noise"?

    8. Exercise: Compute and plot the AL dendrogram for the iris data set. Note: Use only the numerical variables iris[1:4] for the clustering. Use the 5th variable Species only to display the class labels at the bottom of the dendrogram.

      Need solution.

    9. Fig

      Where is the data for this figure coming from again? In an exercise later this week, students are asked to use "the data used in Figure 1". We need to provide clear instructions to students on how/where to get the data.

    10. In particular, Agglomerative Hierarchical Clustering (HAC) methods, which are the most well-known and widely used in practice

      Could you provide a few examples of how these methods are applied in the real world?

    11. The Silhouette Width Criterion (SWC) basically assesses the quality of the assignment of an observation to its cluster, by measuring the difference between the average distance of this observation to observations in another cluster (which should be large) and the average distance of this observation to observations in the same cluster (which should be low). If the observation is assigned to its natural cluster, this difference is expected to be high. The silhouette of an observation is just a normalisation of this difference within [−1, +1], and the silhouette of an entire clustering solution is just the mean of the individual silhouettes of all observations, which therefore is also within [−1, +1]. The closer to +1, the better the clustering solution.

      Can we use a real world example to illustrate this?
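
      A concrete numerical illustration might help alongside the example. This sketch (my own, using k-means on the iris measurements via the cluster package) computes the silhouette of a clustering solution; a customer-segmentation data set could be swapped in for the real-world framing:

```r
library(cluster)
set.seed(0)
X <- iris[, 1:4]
km <- kmeans(X, centers = 3, nstart = 10)        # a clustering solution
sil <- silhouette(km$cluster, dist(X))           # per-observation silhouettes
mean(sil[, "sil_width"])                         # overall silhouette, in [-1, +1]
```

      Students could compare the mean silhouette for different numbers of clusters to see the criterion in action.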

    12. The k-Means Algorithm

      Please break the content under this heading into smaller logical chunks so that it is easier for students to navigate. We will put these content chunks in accordions.

    1. and far from rigorous, provided with the sole intent to explain the basic intuition behind this concept, which plays a very important role in statistical learning and data mining. A more formal yet simple and elegant discussion on the bias-variance trade-off, containing practical examples and illustrations, is

      Can we break this section down with the following headings?

      • overfitted model
      • underfitted model
      • bias-variance trade-off (and provide the textbook reading here instead of at the end of the week)

    2. Recall that in supervised learning one wants to learn a model that describes a certain numerical or categorical variable as a function of other variable(s). Typically, the variable we want to describe and possibly predict is called the dependent or output variable, Y, whereas the variable(s) used for the prediction are called independent or input variable(s), X = {X_1, ..., X_n}, so-called predictor(s). It is presumed that the dependent variable can be described as a function f of the predictor(s), i.e., Y = f(X) + ε, except for a component ε that depends on unobserved variables. The goal is to learn the mapping f(X) from data, using past observations (X, Y) as a “teacher”. Since the model can only depend on the observed variable(s), X, our data-driven learning problem reduces to finding an f̂ that is as close to f as possible.

      Again, this would benefit from being visual. If the equation walkthroughs in week 1 work, I suggest putting it here too.

    1. In predictive data mining tasks, we want to predict the unknown value of a certain variable using the known values of other variables. Typically, the variable we want to predict is called the dependent or output variable, Y, whereas the variable(s) used for the prediction are called independent or input variable(s), X = {X_1, ..., X_n}, so-called predictor(s). Past observations from both (X, Y) serve as a “teacher” for a model to be trained from data, and the training of such a model is referred to as supervised learning. The most common predictive tasks in data mining are regression and classification.

      I suggest that we visualise this, as this is a key concept and would benefit from having a visual representation in addition to textual description here.

      We've come up with an interactive called "equation walkthrough" that can be used to explain this type of equations in a better way.

      In the "Potential interactives" folder on Dropbox, there's a PPT file called "equation_walkthrough".

      If you like the idea, please help me break this content down into components so that we can make the interactive.

      Also, this definition will greatly benefit from an example!

    2. We have seen that in predictive tasks we have a target variable, Y, and the goal is to learn a model that maps some suitable input variable(s) X into Y, in a supervised way. In many scenarios, however, we don’t have a dependent variable Y, but only the input variable(s), X. Yet, there may be valuable patterns that analysts may want to learn from X alone. Since these patterns are learnt in the absence of a “teacher” Y, the process is usually referred to as unsupervised learning.

      If we create an equation walkthrough to describe predictive tasks, should we have one for descriptive tasks? That gives students an easy way to compare the two as well.

      If yes, please provide the breakdown of components for the walk through.

      Also, please provide an example to make the definition easier to understand.

    3. isplayed i

      Another idea for a starting activity for this week: a case study. For example, a case study on how data mining has been used in the marketing field, and how various techniques are applied. I found the following site as an example. Of course you would be able to provide materials from trustworthy academic sources.

      https://www.egon.com/blog/666-techniques-data-mining-marketing

      Another interesting link to support this activity:

      http://bigdata-madesimple.com/14-useful-applications-of-data-mining/

  2. Mar 2018
    1. Continued

      So this is continued from week 2, right?

      Please refer to comments in week 1 regarding the following:

      • please write "introduction to week 2"
      • connect what was going on last week with what's going on this week, and how this week fits in the whole subject
      • tell students about assessments related to this week
      • start the week with a case study, an activity, some real world application that triggers students interest. Then introduce the technical concepts in context, as tools to help solve business/real world problems.
      • enrich the content throughout with case studies/videos/current affairs/other resources whenever possible.

      Also, please refer to my comments in weeks 1 and 2 regarding the provision of feedback/comments/solutions for all exercises/activities.

    2. uiz

      I strongly recommend making this ungraded as an exercise for students to recap the main concepts. Assessments should be authentic to align with JCU's policies.

    3. Exercise: In

      Same comment as above.

      So students should produce/provide a piece of code? Please make clear what is required in terms of output/submission (even when it's not graded you might still want them to submit this as part of a code workbook?)

      Also, provide comments/feedback/solution.

    4. Exercise: Instead of randomly splitting the data into a training and a test set only once, repeat this procedure 10 times (use a for loop like for(i in 1:10){# your code}) and compute the mean and the standard deviation of the classification accuracy over the 10 test sets. Exercise:

      Same comment as for other exercises.

      So students should produce/provide a piece of code? Please make clear what is required in terms of output/submission (even when it's not graded you might still want them to submit this as part of a code workbook?)

      Also, provide comments/feedback/solution.

    5. Exercise: Repeat the previous exercise varying parameter k of the KNN classifier from 1 to 20, and plot a curve of LOOCV accuracy (on the y axis) as a function of k (on the x axis). What is the best choice(s) of k according to this experiment?

      Same comment as for the above exercises.

      So students should produce/provide a piece of code? Please make clear what is required in terms of output/submission (even when it's not graded you might still want them to submit this as part of a code workbook?)

      Also, provide comments/feedback/solution.
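
      For the solution, something along these lines might work (a sketch only; it uses knn.cv from the class package, which performs leave-one-out cross-validation, and assumes the iris data as in the earlier exercises):

```r
library(class)
set.seed(0)
X <- iris[, 1:4]
y <- iris$Species

# LOOCV accuracy of the KNN classifier for k = 1..20
acc <- sapply(1:20, function(k) {
  pred <- knn.cv(train = X, cl = y, k = k)  # leave-one-out predictions
  mean(pred == y)
})

plot(1:20, acc, type = "b", xlab = "k", ylab = "LOOCV accuracy")
which.max(acc)  # one candidate for the best k (ties are possible)
```

      The expected output/submission could then be the plot plus a one-line justification of the chosen k.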

    6. Exercise: What are the values of RFP and RTP for a “classifier” that assigns every observation to the positive class, and where is this classifier positioned in the unit square of RFP × RTP? Exercise: Repeat the previous exercise but now considering a “classifier” that assigns every observation to the negative class. Justify your answers.

      Same comment as above.

      So students should produce/provide a piece of code? Please make clear what is required in terms of output/submission (even when it's not graded you might still want them to submit this as part of a code workbook?)

      Also, provide comments/feedback/solution.

    7. Exercise

      So students should produce/provide a piece of code? Please make clear what is required in terms of output/submission (even when it's not graded you might still want them to submit this as part of a code workbook?)

      Also, provide comments/feedback/solution.

    8. Neighbours

      Again, similar to the first topic, there is lots of content under this topic. Please break it into logical chunks and give them descriptive/indicative titles.

    9. Important

      I would break here and add a title "Required/optional reading"

      Please make clear which reading students should do at this point to finish off this topic, and whether it's optional or required.

    10. As

      Let's put a heading here for this whole example, and break the example up into steps. Please give this example a descriptive name.

      For the rest of this week, you will see my several suggestions for breakpoints. They might not be in the right place, as I don't understand the subject matter. The main idea is to logically break the content under Linear Regression into smaller chunks so that we can make the structure of the section explicit, and I can present it online in a cognitively intuitive way for students to navigate.

      Every time you move on to discuss a new/different perspective/aspect/nuance/term, it's worth to break with a title.

      So by all means move the breakpoints to where you think is logical!

    11. Apart

      So the previous example has ended, and this is the continued theoretical discussion? If so, please put a heading here to summarise what the next part of content will be.

    12. described

      Again, there are several code segments throughout the exercises and theoretical discussions. Should students copy this and run it in R? Or should they just read it? If it's the latter, it might be good to include a sentence along the lines of "These operations/tasks can be done in R as follows:", then show the code segment. Then there's another sentence: "And this is what the output from R looks like."

      This is just so it's clear to students what they are meant to do with the code each time. Sometimes they might just observe/read, as it's part of the explanation, but sometimes we might expect them to run the code, manipulate it, and compare with what the right output should be, for example.

    13. Regression

      This is a very long topic. Please break it into smaller sub-topics and give them names to make it easier for students to navigate, and to make the important concepts stand out.

      I also suggest to have a little quiz/recap questions at the end of this section for students to make sure they have understood and can recall the main terms/concepts.

    1. 31 January 2018

      Please refer to comments in week 1 regarding the following:

      • please write "introduction to week 2"
      • connect what was going on last week with what's going on this week, and how this week fits in the whole subject
      • tell students about assessments related to this week
      • start the week with a case study, an activity, some real world application that triggers students interest. Then introduce the technical concepts in context, as tools to help solve business/real world problems.

      Regarding the exercise, please refer to my comment in week 1 regarding the provision of feedback/comments/solutions for both students and tutors.

    2. As

      Add a break point here, and title it "Exercise"

      Please rewrite the instruction for the exercise. What are they given, what are they supposed to do, what should be the output? Would they have to submit the output? Then the rest of the content on this exercise we can give as the solution or feedback to students.

      For all exercises, especially those that students have to do themselves, please provide the solution so that we can give that to students if needed, and any other discussions/observation that you deem important/interesting/noteworthy.

    3. Practice your skills by training, testing and comparing the results of the Naive Bayes, LDA, and QDA classifiers in other classification data sets with numerical predictors (other than iris).

      Same comment as above regarding the exercises here and in other weeks.

      Please provide solution.

      What output do you want students to produce, for all exercises?

      Would you want students to submit the exercises? Would you expect tutors to give them feedback?

      How would these exercises tie to the assessments?

    4. Exercise: Inste

      Same comment as above.

      Please provide solution.

      What output do you want students to produce, for all exercises?

      Would you want students to submit the exercises? Would you expect tutors to give them feedback?

      How would these exercises tie to the assessments?

    5. Instead of randomly splitting the data into a training and a test set only once, repeat this procedure 10 times (use a for loop like for(i in 1:10){# your code}) and compute the mean and the standard deviation of the classification accuracy over the 10 test sets. Compare with the result from the Naive Bayes classifier.

      Please provide solution.

      What output do you want students to produce, for all exercises?

      Would you want students to submit the exercises? Would you expect tutors to give them feedback?

      How would these exercises tie to the assessments?

    6. mutually exclusive.

      As this subject uses quite a lot of statistical concepts, it would be beneficial if we can create an index/glossary so that students can look these terms up if they forgot what they mean. Could you come up with this glossary?

    7. h accuracy.

      After this very elaborate guided exercise, could we give students a few problems/exercises to work through on their own? We will also need to provide solutions for tutors to distribute to students later.

    8. Exercise: Instead of randomly splitting the data into a training and a test set only once, repeat this procedure 10 times (use a for loop like for(i in 1:10){# your code}) and compute the mean and the standard deviation of the classification accuracy over the 10 test sets.

      Could you provide the solution for this?

      Also, is this a good exercise to be turned into a guided interactive exercise? Meaning breaking the exercise into steps and asking students to provide answer at each step before continuing? It can be fill in the blanks with code or numerical/text answers.
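
      For reference, a possible solution sketch (assuming the iris data, a 70/30 split, and a KNN classifier with k = 3; the classifier and split should be adjusted to match the actual exercise):

```r
library(class)
set.seed(0)
acc <- numeric(10)
for (i in 1:10) {
  # random 70/30 train/test split (the split ratio is an assumption)
  idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
  train <- iris[idx, ]
  test  <- iris[-idx, ]

  # train and test the classifier, then record accuracy on this split
  pred <- knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 3)
  acc[i] <- mean(pred == test$Species)
}
mean(acc)  # mean classification accuracy over the 10 test sets
sd(acc)    # standard deviation over the 10 test sets
```

      If turned into a guided interactive, the natural steps would be: split, train, predict, compute accuracy, then aggregate across the 10 repetitions.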

    9. -Off

      I need to break this week into smaller chunks; each chunk is an html page. Could the following work?

      Topic 1: Supervised Learning and the Bias-Variance Trade-Off

      Topic 2: Bayes Classifiers and Bayes Theorem

      Topic 3: Naive Bayes Classifier

      The next two headings ("Naive Bayes with Numerical Predictors" and "Example and Exercise") will be at the beginning of Topic 4.

      Topic 4: Linear Discriminant Analysis (LDA)

      Topic 5: Quadratic Discriminant Analysis (QDA)

      Topic 6: Relations between Naive Bayes, LDA and QDA

    10. Three very common measures for classification assessment are Precision, Recall, and F1-Measure. Do some research on your own and discuss these measures with your peers. For instance, the Wikipedia Entry on F1 Score is a good starting point.

      Separate this out under another heading called "Measures for classification assessment"

      Could you give students more information as to what they're expected to do? What aspects of the measures would you want them to research and discuss? Should they come back with a concrete real world application/example of that measure?

    11. e

      There's lots of content between here and the end of this exercise. Please break it into steps with headings so that I can make it interactive: students see all the steps at once and can click on each step to reveal information for that step.

    12. In this illustrative data set, there are four categorical predictors, namely, “Outlook” (X_1), “Temp[erature]” (X_2), “Humidity” (X_3), and “Windy” (X_4), which describe the weather conditions of 14 different days (observations, i.e., rows of the table). The goal is to model whether (“Play” = Yes) or not (“Play” = No) a particular day is suitable to play. Variable “Play” is our dependent variable (Y), and since it takes two values only, this is a binary classification problem. In order to classify any new observation into classes Yes or No, we need a table of probabilities containing the priors, P(Y = Yes) and P(Y = No), as well as P(X_i|Y) for all possible combinations of values of X_i (i = 1, ..., 4) and Y. Since 9 out of the 14 observations belong to class Yes, it is trivial to compute the priors as P(Y = Yes) = 9/14 and P(Y = No) = 5/14. The conditional probabilities P(X_i|Y) are also trivial to compute by inspecting the data set. For example, P(X_1 = Sunny|Y = No) = 3/5, because 3 out of the 5 observations for which Y = No are such that X_1 = Sunny. As another example, P(X_4 = False|Y = Yes) = 6/9 = 2/3, because 6 out of the 9 observations for which Y = Yes are such that X_4 = False.

      I plan to turn this and the table above into an interactive or a visual showing the original table as the data set, then extracting the data corresponding to each scenario out, showing probability and other relevant explanation.

      What do you think?
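
      To help build the interactive, the table can be reconstructed in R and each probability computed from it. The sketch below uses the standard weather ("play") data set, which matches the probabilities quoted above; only the Outlook and Windy columns are needed for these two examples:

```r
# The 14-day weather table (reconstructed; matches the quoted probabilities)
outlook <- c("Sunny","Sunny","Overcast","Rainy","Rainy","Rainy","Overcast",
             "Sunny","Sunny","Rainy","Sunny","Overcast","Overcast","Rainy")
windy   <- c(FALSE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,
             FALSE,FALSE,FALSE,TRUE,TRUE,FALSE,TRUE)
play    <- c("No","No","Yes","Yes","Yes","No","Yes",
             "No","Yes","Yes","Yes","Yes","Yes","No")

# Frequentist estimates, one per scenario in the text
p_yes             <- mean(play == "Yes")                  # P(Y = Yes) = 9/14
p_sunny_given_no  <- mean(outlook[play == "No"] == "Sunny")  # P(X1 = Sunny | No) = 3/5
p_false_given_yes <- mean(!windy[play == "Yes"])          # P(X4 = False | Yes) = 6/9
```

      Each interactive step could highlight the rows being extracted (e.g., the 5 rows with Play = No) next to the corresponding probability.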

    13. The basic rationale behind the Naive Bayes classifier is to estimate the probabilities P(X_1 & ⋯ & X_n | Y), P(Y) and P(X_1 & ⋯ & X_n) from data, then apply the Bayes theorem in Eq. (2) to estimate P(Y | X_1 & ⋯ & X_n). In principle, Naive Bayes assumes that not only the dependent variable, Y, but also the predictor(s), X, are categorical (Numerical Predictors will be discussed later). The classifier then uses a frequentist approach to estimate the required probabilities from data. In this approach, probabilities are estimated by the frequency with which the event in question occurs in a set of experiments. For example, if we didn’t know that the probability of any particular outcome of a rolling die experiment is 1/6, we could roll the die several times, count how many times we have observed each particular outcome, and then divide these counts by the total number of times that the die has been rolled. If the die is fair and we roll it many times, we should get a probability of approximately 1/6 for each possible outcome. Likewise, in a classification setting, if we are given a large number of observations (X, Y), then we can estimate from these observations, in a frequentist way, the probability P(X_1 = x_1 & ⋯ & X_n = x_n | Y = y) that the predictors take on particular values, X_1 = x_1, ⋯, X_n = x_n, given that an observation belongs to a particular class, Y = y. We can also estimate the prior probability that an observation belongs to class y, P(Y = y), and the prior probability that the predictors take on those particular values, P(X_1 = x_1 & ⋯ & X_n = x_n).
Once those probabilities have been computed, we can then use the Bayes theorem to compute the probabilities P(Y = y | X_1 = x_1 & ⋯ & X_n = x_n) that a certain observation belongs to any particular class, Y = y, provided that its predictors take on values X_1 = x_1, ⋯, X_n = x_n. Recall from the Bayes Rule of Classification that this is all we need to classify a new observation, whose predictors have been measured but whose class label is unknown. Actually, in order to apply the Bayes rule of classification we don’t really need the exact probability values, we only need to know which probability is the largest. Since the denominator in Eq. (2) — i.e. the prior P(X_1 = x_1 & ⋯ & X_n = x_n) — is the same irrespective of the class label, we actually only need to compute the numerator, i.e., we only need to estimate P(X_1 = x_1 & ⋯ & X_n = x_n | Y = y) and P(Y = y) for each class label Y = y. The problem is that the number of observations required to get a rough frequentist estimate of the former term, in principle, grows exponentially with the number of predictors, n, thus making the frequentist approach unfeasible in most practical applications. The Naive Bayes classifier circumvents this problem by making the assumption that the predictors are statistically independent within each class, which means that no particular value taken by a certain predictor affects the probabilities associated with the other predictors within a given class.
Mathematically, if the class conditional independence assumption holds true (for any given class label, Y = y), then: P(X_1 = x_1 & ⋯ & X_n = x_n | Y = y) = P(X_1 = x_1 | Y = y) * P(X_2 = x_2 | Y = y) * ⋯ * P(X_n = x_n | Y = y). This property makes it much easier to estimate P(X_1 = x_1 & ⋯ & X_n = x_n | Y = y) by independently estimating and multiplying the probabilities associated with each individual predictor. Under the independence assumption, the Bayes theorem in Eq. (2) can be rewritten as: Equation (3): P(Y = y | X_1 = x_1 & ⋯ & X_n = x_n) = P(Y = y) * ∏_{i=1}^{n} P(X_i = x_i | Y = y) / P(X_1 = x_1 & ⋯ & X_n = x_n)

      Could we give this section a heading? Or 2 if it needs 2? Just to label the content for easier cognition/navigation.

    14. As an example, let’s consider a box with N = 100 marbles, each of which has two painted faces (possibly, but not necessarily, in the same colour). We know that there are, say, 40 marbles with at least one face painted in red, 20 marbles with at least one face painted in blue, and 5 marbles with one face in red and the other one in blue. So, the probability that we randomly pick a marble from the box that has a face painted in red is P(X) = 40/100 = 0.4, and the probability that such a randomly picked marble has a blue face is P(Y) = 20/100 = 0.2. Now let’s suppose that we blindly draw a marble from the box and look at one of its painted faces only, which turns out to be red. What is the probability that the other face is blue? To answer this question, we can use the Bayes theorem above. In fact, notice that we can easily determine the probability that one of the faces is red given that the other face is blue as P(X|Y) = 5/20 = 0.25, since we know that 5 out of the 20 marbles with a blue face have the other face red. Using the Bayes theorem, the probability we want to compute in order to answer our question above is given by P(Y|X) = (0.25 * 0.2)/0.4 = 0.125. Notice that, before looking at one of the faces (which turned out to be red), the prior probability that the picked marble would be red and blue was only P(X & Y) = 5/100 = 0.05. Given the evidence that one of the faces was certainly red, the probability of observing a red and blue marble increased from 0.05 to 0.125.

      I would like to turn this into an interactive activity where students are taken through a series of steps, at each step they would need to provide the answer. The content written here will be given as feedback to students after they have submitted an answer at each step.
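If it helps when scripting the interactive steps, the arithmetic in the excerpt can be checked directly from the given counts (a Python sketch for illustration only; the subject itself uses R):

```python
# Checking the marble arithmetic directly from the counts given in the notes.
N = 100
red = 40          # marbles with at least one red face
blue = 20         # marbles with at least one blue face
red_and_blue = 5  # marbles with one red face and one blue face

p_x = red / N                        # P(X): a red face  -> 0.4
p_y = blue / N                       # P(Y): a blue face -> 0.2
p_x_given_y = red_and_blue / blue    # P(X|Y) = 5/20     -> 0.25

# Bayes theorem: P(Y|X) = P(X|Y) * P(Y) / P(X)
p_y_given_x = p_x_given_y * p_y / p_x   # -> 0.125

# Sanity check against direct counting: 5 of the 40 red-faced marbles are blue on the other side.
assert abs(p_y_given_x - red_and_blue / red) < 1e-12
```

The final assertion mirrors the intuition students should reach on their own: Bayes theorem and direct counting (5 out of 40) give the same answer.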

    15. elow.).

      Can we have an exercise here for students to practice/implement what they've learned about underfitted and overfitted models and bias-variance trade-off before moving onto the next topic?

    16. In general, the more flexible the model and the smaller the training data set, the higher the risk of overfitting and the higher the variance tends to be. Since in most practical applications we cannot indefinitely increase the size of the training sample for a multitude of different reasons, the remedy to variance is to reduce the flexibility of the model, so it cannot adjust itself too much to the specifics of any particular sample. In our pedagogical problem above, if we force our model to be linear, we will always get the very same model when fitting it to any two points drawn from the real linear system. In contrast, notice that there are countless, much more flexible non-linear models that could perfectly fit the same observations while being very different from one another. Two such models would produce highly variable predictions for observations not contained in the training data set.

      An example would be great!
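One possible example to offer here (a Python simulation sketch with made-up numbers; the subject itself uses R): draw many small noisy samples from a real linear system, fit both a rigid (degree-1) and a flexible (degree-7) polynomial to each sample, and compare how much their predictions at one test point vary across samples.

```python
import numpy as np

# The real system is linear: Y = 2X + 1 + noise (all numbers made up for illustration).
rng = np.random.default_rng(0)

def true_f(x):
    return 2.0 * x + 1.0

x_train = np.linspace(0.0, 1.0, 10)
x_test = 0.5
preds = {1: [], 7: []}   # polynomial degree -> predictions at x_test across samples

for _ in range(200):
    y = true_f(x_train) + rng.normal(0.0, 0.3, size=x_train.size)  # one noisy sample
    for degree in preds:
        coef = np.polyfit(x_train, y, degree)          # least-squares polynomial fit
        preds[degree].append(np.polyval(coef, x_test))

var_rigid = np.var(preds[1])      # linear model: predictions barely move between samples
var_flexible = np.var(preds[7])   # flexible model: predictions swing with each sample
```

Across the resamples, var_flexible comes out well above var_rigid, which is the variance effect the excerpt describes: the flexible model adjusts itself to the specifics of each particular sample.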

17. Notice that, even if an ideal model is learnt, i.e., \hat{f} = f, any prediction made using this model will still have an error e = Y - \hat{Y} = f(X) + \epsilon - \hat{f}(X) = f(X) + \epsilon - f(X) = \epsilon, which depends on unobserved variables and, therefore, cannot be eliminated. So, the goal is to minimize the reducible component of the error, f(X) - \hat{f}(X). However, since f(X) and \epsilon are unknown, we cannot separate them from each other prior to learning the model from observations (X, Y). As a consequence, our model may try to reduce the irreducible component of the error during the learning process, or in other words, the model may be prone to “learning noise”. Although the component \epsilon cannot be truly learnt from X as it does not depend on X, if the model in hand is flexible enough it may be possible to learn a mapping \hat{f}(X) that perfectly maps each observation X to its corresponding Y in a finite set of observations, i.e., an exact model with null error for a training data set. As tempting as this idea may seem at first glance, such a model has a fundamental problem: by definition, e = Y - \hat{Y} = f(X) + \epsilon - \hat{f}(X) and, by assumption, e = 0 for the training data, which implies f(X) \neq \hat{f}(X), i.e., the model cannot be an accurate representation of f(X) unless the observation error \epsilon is negligible. Putting it differently, the model may fit a particular sample of data accurately, but not necessarily the true system that one wants to describe. In this case, we say that such a model is overfitted. Overfitted models tend not to accurately describe data other than the sample used for training.
It is worth noticing that, when we are learning a model from a finite data set (always the case in practical data-driven learning), overfitting can occur even if \epsilon = 0, because there may exist an \hat{f} \neq f for which \hat{Y} = Y for a particular set of values of X. As an example, assume a regression problem in which the true system we want to learn is given by Y = f(X) = X, and we have a data set with two observations (X, Y), namely, (0, 0) and (1, 1), which should suffice to uniquely determine the straight line Y = X. However, in practice we may not know that the ideal system is linear, so we could come up with a quadratic model \hat{Y} = X^2 instead, which would perfectly fit the two observations available for training, but is very different from the real system in general. Notice that by replacing observation (1, 1) with another observation drawn from the real (linear) system, namely (2, 2), we can still fit an exact quadratic model to the new training set ((0, 0) and (2, 2)) as \hat{Y} = \frac{1}{2}X^2. However, the new model differs from the previous one. This simple example illustrates a well-known aspect of overfitting: the more overfitted the model, the more it tends to change when trained from different samples of data, which is undesired as the system or phenomenon we want to learn is the same irrespective of any particular data sample. This variability is related to the variance aspect of the so-called bias-variance trade-off.

      All of this is very abstract. Could you use a real world example and keep coming back to this to illustrate all the points? If this goes hand in hand with a tangible example, all the equations and concepts will make sense.

      For example, after you say this:

      Although the component ϵ cannot be truly learnt from X as it does not depend on X, if the model in hand is flexible enough it may be possible to learn a mapping f^(X) that perfectly maps each observation X to its corresponding Y in a finite set of observations, i.e., an exact model with null error for a training data set.

      For example, we cannot truly learn from parents' educational levels about their children's financial strength in 20 years time, blah blah blah...
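Along the same lines, the quadratic-overfit illustration in the excerpt can be made tangible with a tiny script (a Python sketch for illustration only; the helper function is hypothetical):

```python
# The true system is Y = f(X) = X, but we force a model of the form
# Y_hat = a * X**2 through each two-point training set exactly.

def fit_quadratic_through(x, y):
    """Solve y = a * x**2 for the single nonzero training point (the other is (0, 0))."""
    return y / x**2

a1 = fit_quadratic_through(1.0, 1.0)  # training set {(0,0), (1,1)} -> a = 1
a2 = fit_quadratic_through(2.0, 2.0)  # training set {(0,0), (2,2)} -> a = 0.5

# Both models have zero training error on their own training sets...
assert a1 * 1.0**2 == 1.0 and a2 * 2.0**2 == 2.0

# ...but they disagree with each other and with the true system at a new point:
x_new = 3.0
pred1 = a1 * x_new**2   # 9.0
pred2 = a2 * x_new**2   # 4.5
truth = x_new           # 3.0
```

Students see both overfitted models fit their training data perfectly yet predict 9.0 and 4.5 where the true answer is 3.0 — the variability across samples that the bias-variance discussion is pointing at.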

    1. January

      Another general comment that I have for not just this week or this topic, but for all topics of all weeks is JCU's preference for problem-based learning. You might have seen this and even applied this in your teaching before. For your convenience, I'm providing here a summary of this approach. Whenever possible, please consider adopting this.


An activity, such as a situated problem or real-world application, should drive the design of the weekly materials.

• Consider what you need evidence of students being able ‘to do’. Think about how you’ll get to know what they already know, too.
• Think about how you’ll collect that evidence.
• Consider the practicality of the activity – when and how do you see them actually doing it? How long should it take them?
• How will students be provided feedback on their performance? Does it require peer feedback, facilitator feedback or can the feedback be automated?
• Finally, consider all of the materials you’ve already got and those you’ll need to develop in order to:
— support students furthering their knowledge
— complement what they already know
— model behaviours/skills they need to demonstrate.

Note: Activity should be designed as a ‘rehearsal’ space for students to prepare for assessment. Links between activity and assessment should be made explicit.


      Another note on activities of all kinds, whether they are discussions or exercises:

      • always provide feedback/comments/solution so that we can give that to students as they go through the course. This feedback can be built into the pages directly if the activity is a discussion forum, a popup question, or any other interactive exercise. If it's an "offline" exercise, this can be provided to students by the tutors when needed.
    2. Introduction to Data Mining - Week 1

      Before we get into the content of the week, we need to introduce students to this week. Generally in this "Welcome to Week X" section we aim to:

• provide students with an overview of the week - more of a narrative, not just a list of topics to be covered. This is to give them some context - how this week fits in the big picture of the subject, why it's important, and connect it with what comes before and after the week
      • what the learning objectives are (you already have this in the current subject outline)
      • describe major learning activities for the week or the flow of activities
      • align the week's objectives/content with the assessments, maybe telling them to get a head start with assessments even though they're not yet due in that week
      • all data/code files students have to download for the week. It would be easier if they do this at the beginning of each week, and have everything ready so that they don't have to download something on the spot, just in case they are doing their study on the move (on the train for example) and it's not convenient to do so.

      I suggest that you write this introduction for all weeks. It provides important narrative and context to motivate students and help them plan out their study better.

    3. do

      One thing we will need to tell students is how they are expected to use R in this subject. Whenever there's a code segment, what does it mean? Does it mean students are supposed to run that code in R? It's not clear to me.

      Also, is it safe to assume at this point students already know how to use R? Will there be any students that haven't been exposed to R at this point in the carousel?

    4. eek 1 Quiz

I would suggest that we make all quizzes ungraded. For assessments, JCU's framework requires that we use authentic assessments. We can get students to do bigger tasks that require not only the application of techniques learned but also reflection and judgment to solve a business/real world problem. Please consider.

    5. o

An idea to start the subject/week 1: Have students engage in a conversation about how they expect to apply the knowledge gained in this subject in their work. Students come to this course/subject from different backgrounds, and they might see data science/data mining as useful in different ways. Maybe someone is a marketer and wants better target groups for their campaign. Another might be a health administrator who wants to personalise treatments based on patient data. Or they might be a farmer who wants to use historical data in the region to cope with climate change. They can say how they think data mining can help them. Tutors can come in and comment: oh, this and this area of data mining will be very useful for you, and in particular the regression technique will help you understand this and this and be more efficient.

      Of course I'm making this up, just so I can explain the idea better.

This activity will help students engage with the course and get some high-level thinking going about the usefulness of data mining in various fields. It's a chance for tutors to build their social presence and make students confident that they will be taken care of.

    6. 31

      Focusing on the WHY

      In the current content there's lots of information on the What and the How, but there is not much on the WHY, which is very important for adult learners. We would like students to understand why a certain concept/technique/topic/week/subject is important, what they can do with it in their work/life so that they see its role/value/importance. Real world application examples would be perfect for providing students with this context, triggering their interest and hence getting their brain warmed up the right way to absorb what comes next.

      For example, this is the beginning of week 1, which is also the beginning of the subject. It would be great if students can see the big picture of data mining: why it's valuable, what it's been used for. Marketing and sales are obvious fields that apply data mining. But who knows it's useful for fire fighting as well (https://blogs.wsj.com/digits/2014/01/24/how-new-yorks-fire-department-uses-data-mining/)? Or for knowledge discovery.

      If you think along this line, I'm sure there are lots of cases you know of that would make the beginning of each week/topic interesting for students.

    7. Discussion Board

      As noted above, I suggest that we spread this content out so that students do the activities as they go through the content. That would help them understand the content in chunks and make it more digestible.

I also suggest that we have a variety of learning activities instead of all discussions. Having 1 or 2 discussions is good, but if it's just discussions it's unlikely that students will participate. Alternatively, you could give them some context/problems/data, and ask them to compare, evaluate, solve problems, or create solutions.

      You could also use videos/articles/case studies to prompt discussions.

    8. g variants, specialisations, or particular cases

      Going through the content again, it's not clear to me what each of these means. Please make sure the definitions of these are clear enough. It might be because I don't have any background in data science, but a big group of our students also don't have a background in data science, so we'd better be safe.

      Also, once clarified, if you still want to use this, I suggest combining this with the task/question right above.

    9. Discuss with your colleagues which of these tasks you judge the most important ones and why (in general or for you personally, in your workplace).

      Which tasks should they consider for this exercise? All that have been discussed this week? And the ones they research and find?

Could we consider moving this task to the end of the descriptions on Descriptive Tasks, and instead of a discussion, have various examples of problems that need solutions and ask them to choose the task/method that best suits each, and explain why?

    10. Try to identify tasks that have not been explicitly discussed in the lecture notes above for week 1, and the main application domain(s) of each of them. These can (but not necessarily need to) include the tasks just briefly listed in subsection other tasks.

      What do you mean by "have not been explicitly discussed" this week? Does this mean students should exclude all tasks described under Predictive and Descriptive, but can include those under Other tasks, plus whatever else they can find?

      How should students do this? Or is it completely up to them, Googling, looking in library databases, from the recommended references, etc.? How long should they spend on this research task?

    11. Do some research on your own about important tasks in data mining and their Main Challenges in the Era of Big Data.

      First, is it safe to assume students know what big data means? If not, we need to define/describe/discuss the concept before asking students to do this.

      Second, could students do this exercise at the beginning of this week? If yes this might work as a warm up exercise somewhere above.

Third, can we give the question a focus? Are you asking them to research ALL important tasks and describe ALL main challenges? What about picking one type of task and focusing on one kind of challenge (cost, time, etc.)?

      Finally, you will need to include information in the tutor guide regarding how to encourage students to participate in this exercise, what to watch out for in terms of common challenges.

    12. epartment.

      I suggest that we have an activity here for students as a recap for what they've learned.

      This is an idea: have a matching exercise where students have to match the type of task with the description of the independent variable, the dependent variable, and the main goals. If you're on board with this, please create for me the following table so that I can get the activity developed:

      • each row is for a task type
      • for each task type, include information for the following columns:

• independent variable
• dependent variable
• main goals
• an example

      Of course you might have other ideas of what to include. Just let me know and I'll get the interactive exercise developed.

    13. on mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information” [Wikipedia Entry on Sentiment Analysis] Community Detection in Social Networks: “In the study of complex networks, a network is said to have community structure if the nodes of the network can be easily grouped into (potentially overlapping) sets of nodes such that each set of nodes is densely connected internally” [Wikipedia Entry on Community structure] Link Analysis/Prediction: “In network theory, link analysis is a data-analysis technique used to evaluate relationships (connections) between nodes… Link analysis has been used for investigation of criminal activity (fraud detection, counterterrorism, and intelligence), computer security analysis, search engine optimization, market research, medical research, and art” [Wikipedia Entry on Link Analysis]

      Again, are there academic references for these?

    14. (5th column

What are the columns you're referring to? Are they in an Excel spreadsheet? We need to tell students what to look at. For example, you could tell them what file to open from the package and what to look at before describing the specifics like the columns.

    15. dataset diamonds from the ggplot2 package

      Do students have all these data ready to go? We will have to tell them at the beginning of each week what data files they need for the week and where to get them (we can upload these to Ultra so that they have everything in one place for each week).

16. For instance, let us consider dataset diamonds from the ggplot2 package in R/RStudio. It seems reasonable to assume that we can approximately model the price of a diamond as a function of its weight. In this case, the dependent variable Y is price and the single predictor X_{1} is carat (weight). After some suitable pre-processing of the raw data (in this case a log-transformation of both price and carat), if we assume that Y can be approximated as a linear function of X_{1}, i.e., Y \approx \beta_{0} + \beta_{1}X_{1}, where parameters \beta_{0} and \beta_{1} are the intercept and the slope, respectively, then we can determine the values of these parameters so that the resulting model best fits the data:

Do you think it's worth having these variables/descriptions on the graph itself? If you could just give me a sketch with variables/terms pointed out on the graph, our developer can turn this into an interactive graphic.
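For reference when sketching the graphic, the fit in the excerpt is an ordinary least-squares line on the log-transformed variables. The sketch below uses synthetic stand-in data (the real diamonds data set ships with the ggplot2 package in R; this Python version, with made-up power-law numbers, only illustrates the beta_0/beta_1 roles):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the diamonds data: price grows roughly as a power law
# of carat, which becomes linear after log-transforming both variables.
carat = rng.uniform(0.2, 3.0, size=500)
price = 3000.0 * carat**1.7 * rng.lognormal(0.0, 0.1, size=carat.size)

x = np.log(carat)   # single predictor X1 = log(carat)
y = np.log(price)   # dependent variable Y = log(price)

# Least-squares fit of Y ~ beta0 + beta1 * X1 (polyfit returns slope first).
beta1, beta0 = np.polyfit(x, y, 1)
```

On this synthetic data, beta1 recovers the power-law exponent (about 1.7) and beta0 the log of the base price (about log(3000)), which is what the labels on the graph would point at: beta0 is the intercept and beta1 the slope.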

    17. Recall

      I'm not sure what you mean by 'recall'. Recall from what? Have students seen this figure before? Recall from Foundations of Data Science? From previous subjects? At this point students will have taken Statistical Methods, Foundations of Data Science, and Data Visualisation. It's great if you are asking them to think back to these subjects when introducing Data Mining. If that is the case, please write more to elaborate the point. This is the perfect place to tell a story, paint a picture of where things fit in Data Science in general and in Data Mining in particular. This goes back to my comment above regarding the 4-step learning cycle.

      I would suggest that before you describe this KDD cycle, have students do some simple data mining exercise. Something they can do just using their common sense and critical thinking skills, even before learning all these specifics about data mining. Then point out to them that what they've done is actually data mining, and they've gone through some major tasks that are the core of Data Mining (for example, point out to them that they've done a simple regression). Or it could be an engaging short video. Or a case. Anything that would make students think and get them ready to absorb what's coming next.

      Just an idea.

There are numerous problems that data scientists may aim to solve in the realm of data mining, but the vast majority can be categorized as a variant, specialisation or particular case of one of the following core data mining tasks: Regression, Classification, Anomaly/Outlier Detection, Clustering, Frequent Pattern Mining, and Recommendation.

      Should we also say that these 6 tasks are grouped into predictive tasks, descriptive tasks, and specialised tasks? This is how the tasks are structured at the moment.

    19. What is Data Mining?

      If I'm not mistaken, this is actually topic 1 (out of 3 topics) of this week: "Data Mining: Contextualisation and Motivation"?

I think this is a great topic to start the subject/week. Before I get into the specific content, I'd like to describe a general pattern of learning cycle that has been proven to work very well with adult learners, especially in the online context. This cycle is best applied to each topic, particularly since online students tend to study in short bursts (30 minutes here, 45 minutes there), as it breaks the learning sequence into smaller digestible chunks.

Step 1: Trigger interest/Activate prior knowledge. Have students do an engaging activity that utilises their prior knowledge in the field or simply uses common sense. It can be solving a problem, reflecting on/observing/analysing a case/phenomenon, or commenting on a case study/a video. Anything.

      Step 2: Present the formal knowledge, like what you've written here, or readings from textbooks and other resources.

      Step 3: Implement/Apply the concepts they've gained through step 2.

      Step 4: Ending: close, expand & connect

      - close the loop, maybe come back/revisit the motivational activity at the beginning
      - connect to the next topic/week
- further discussion to address other perspectives/limitations/challenges to the theory/future trends/etc.
      

At the moment I see that you lump all the "activities" at the end. If we could rearrange the content you have created and add more types of content (see paragraph below), it would be great.

      In this process, try to diversify the sources of content that you use. If you could, please include videos, articles, case studies on real events (current affairs if possible). I'm sure data science/data mining has lots of interesting cases to use as the field has been growing so fast with fascinating outcomes.

    20. “The process of discovering interesting patterns and knowledge from large amounts of data” (Han, Kamber, and Pei 2012) “The process of automatically discovering useful information in large data repositories” (Tan, Steinbach, and Kumar 2006) “The process of discovering insightful, interesting, and novel patterns, as well as descriptive, understandable, and predictive models from large-scale data” (Zaki and Meira JR. 2014)

These three definitions are quite similar. Is there anything we can say at the end along the lines of: although data mining can be defined in different ways, all definitions have this and this in common, or it always utilises this and this, or the end goal is always this and this? I just think that it's a nice way to round off the topic.
