19 Matching Annotations
  1. Last 7 days
    1. Because clustering doesn't produce or include a ground "truth" against which you can verify the output, it's important to check the result against your expectations at both the cluster level and the example level. If the result looks odd or low-quality, experiment with the previous three steps. Continue iterating until the quality of the output meets your needs.

      It seems hard to interpret the exact results we want to see: there's no exact loss metric to look at, so it's more our own interpretation of the clustering quality based on what we know about the data.

    2. A clustering algorithm uses the similarity metric to cluster data. This course uses k-means.

      Then we run the clustering algorithm, I know, shocker. Here that means k-means (quick sketch below).
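
      My own quick sketch (not course code) of running k-means with scikit-learn, just to see the moving parts; assumes numpy and scikit-learn are installed:

      ```python
      # Group 150 made-up 2-D points into 3 clusters with k-means.
      import numpy as np
      from sklearn.cluster import KMeans

      rng = np.random.default_rng(0)
      # Three blobs of points around different centers.
      points = np.vstack([
          rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
          rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
          rng.normal(loc=[0, 5], scale=0.5, size=(50, 2)),
      ])

      kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
      print(kmeans.cluster_centers_)  # the 3 centroids (arithmetic means)
      print(kmeans.labels_[:10])      # cluster ID assigned to each example
      ```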

    3. Before a clustering algorithm can group data, it needs to know how similar pairs of examples are. You can quantify the similarity between examples by creating a similarity metric, which requires a careful understanding of your data.

      Then we create some sort of similarity metric, which requires a deeper understanding of our data. I assume this means choosing something like Euclidean, Manhattan, cosine, etc. (sketch below).
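
      A quick sketch of the metrics I mean (my own example using SciPy's distance helpers, not anything from the course):

      ```python
      # Compare two feature vectors with a few common distance/similarity metrics.
      import numpy as np
      from scipy.spatial import distance

      a = np.array([1.0, 2.0, 3.0])
      b = np.array([2.0, 0.0, 4.0])

      print(distance.euclidean(a, b))  # straight-line distance
      print(distance.cityblock(a, b))  # Manhattan distance (sum of absolute differences)
      print(distance.cosine(a, b))     # 1 - cosine similarity (angle between the vectors)
      ```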

    4. As with any ML problem, you must normalize, scale, and transform feature data before training or fine-tuning a model on that data. In addition, before clustering, check that the prepared data lets you accurately calculate similarity between examples.

      As usual we need to prepare our data first (normalize, scale, transform); if the prepared data lets us calculate similarity accurately, then awesome sauce. See the scaling example below.
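
      A small sketch of why scaling matters before computing similarity (my own toy example, assuming scikit-learn's StandardScaler): without it, the feature with the biggest raw range dominates the distance.

      ```python
      # Standardize two features that live on very different scales.
      import numpy as np
      from sklearn.preprocessing import StandardScaler

      # Columns: income (tens of thousands) and age (tens).
      X = np.array([[50_000.0, 25.0],
                    [52_000.0, 60.0],
                    [90_000.0, 30.0]])

      X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
      print(X_scaled)  # now both features contribute comparably to a distance
      ```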

    1. Centroid-based clustering

      It basically computes a set of arithmetic means (centroids) and groups each example with the mean it's closest to, so the resulting clusters end up organized around, and separated by, those means.

    2. The k-means algorithm has a complexity of O(n), meaning that the algorithm scales linearly with n. This algorithm will be the focus of this course.

      k-means is the main focus; it's O(n), so it's much more efficient. I have yet to learn exactly how it manages that.

    3. clustering algorithms compute the similarity between all pairs of examples, which means their runtime increases as the square of the number of examples n, denoted as O(n^2) in complexity notation

      Algorithms that run in O(n^2) are VERY inefficient at scale, specifically ones that calculate the similarity between every pair of examples when there are MILLIONS of examples. Rough numbers below.
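
      Rough numbers (my own back-of-the-envelope, not from the course) on how the pairwise comparisons blow up:

      ```python
      # Number of unordered pairs among n examples: n * (n - 1) / 2.
      def num_pairs(n: int) -> int:
          return n * (n - 1) // 2

      print(f"{num_pairs(1_000):,}")      # 499,500 (~half a million comparisons)
      print(f"{num_pairs(1_000_000):,}")  # 499,999,500,000 (~half a trillion comparisons)
      ```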

    1. Clustering YouTube videos replaces this set of features with a single cluster ID, thus compressing the data.

      Instead of keeping multiple different sets of features for YouTube videos, you can just use a cluster ID that represents that data.

    2. As discussed, the relevant cluster ID can replace other features for all examples in that cluster. This substitution reduces the number of features and therefore also reduces the resources needed to store, process, and train models on that data. For very large datasets, these savings become significant.

      Funnily enough, this is a great way to compress data and make the feature set less complex; toy sketch below.
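
      A toy sketch of the substitution (entirely my own example with pandas + scikit-learn; the column names are made up):

      ```python
      # Cluster on the full feature set, then keep only the cluster ID per example.
      import pandas as pd
      from sklearn.cluster import KMeans

      videos = pd.DataFrame({
          "length_min":   [3.2, 58.0, 4.1, 61.5],
          "views_log":    [6.1, 3.2, 5.9, 3.5],
          "comments_log": [3.0, 1.1, 2.8, 1.3],
      })

      videos["cluster_id"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(videos)
      compressed = videos[["cluster_id"]]  # 3 features -> 1 small integer per example
      print(compressed)
      ```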

    3. When some examples in a cluster have missing feature data, you can infer the missing data from other examples in the cluster.

      Clusters can help with imputation: just infer what the missing data should be from other examples in the same cluster (sketch below).
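
      A minimal sketch of that idea (my own toy example with pandas; the column names are hypothetical):

      ```python
      # Fill a missing feature value with the mean of that feature within the same cluster.
      import numpy as np
      import pandas as pd

      df = pd.DataFrame({
          "cluster_id": [0, 0, 0, 1, 1],
          "watch_time": [10.0, 12.0, np.nan, 55.0, np.nan],
      })

      df["watch_time"] = df.groupby("cluster_id")["watch_time"].transform(
          lambda s: s.fillna(s.mean())
      )
      print(df)  # NaNs become 11.0 in cluster 0 and 55.0 in cluster 1
      ```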

    4. After clustering, each group is assigned a unique label called a cluster ID. Clustering is powerful because it can simplify large, complex datasets with many features to a single cluster ID.

      Each group gets a unique cluster ID, which basically lets us collapse a large, complex set of features down to a single value per example.

    5. Different similarity measures may be more or less appropriate for different clustering scenarios, and this course will address choosing an appropriate similarity measure in later sections: Manual similarity measures and Similarity measure from embeddings.

      You would choose different similarity measures depending on the data you're handling

    6. But as the number of features increases, combining and comparing features becomes less intuitive and more complex.

      more features = more complex to compare similarity

    7. similarity measure, or the metric used to compare samples,

      We need a similarity measure; it's the metric for comparing how close data points actually are.

    8. Clustering is an unsupervised machine learning technique designed to group unlabeled examples based on their similarity to each other. (If the examples are labeled, this kind of grouping is called classification.)

      Clustering is unsupervised learning: no labels are used; it groups unlabeled examples based on how similar they are to each other.

  2. Oct 2025
    1. Tables are an intuitive input format for machine learning models. You can imagine each row of the table as an example and each column as a potential feature or label. That said, datasets may also be derived from other formats, including log files and protocol buffers.

      Tables are just a very intuitive way to hand data to machine learning models: each row is an example, each column is a feature or label (sketch below).
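
      A tiny sketch of that row/column framing with pandas (the file name and columns here are hypothetical, just for illustration):

      ```python
      # Each row of the table is one example; each column is a feature or the label.
      import pandas as pd

      df = pd.read_csv("videos.csv")         # e.g. columns: length_min, views, label
      features = df.drop(columns=["label"])  # feature columns
      labels = df["label"]                   # label column
      print(features.head())
      ```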

    2. Many datasets store data in tables (grids), for example, as comma-separated values (CSV) or directly from spreadsheets or database tables.

      Comma-separated values is what CSV stands for; I never knew that.

    1. Yes, ML practitioners spend the majority of their time constructing datasets and doing feature engineering.

      as everyone has told me, 90% of ML is data

    2. Data trumps all. The quality and size of the dataset matters much more than which shiny algorithm you use to build your model.

      Data is extraordinarily important; it's literally one of the reasons I'm planning to go through all this now, before I start going through my own data. You can have a good loss function, but ultimately it all depends on your data.