19 Matching Annotations
  1. Last 7 days
    1. Because clustering doesn't produce or include a ground "truth" against which you can verify the output, it's important to check the result against your expectations at both the cluster level and the example level. If the result looks odd or low-quality, experiment with the previous three steps. Continue iterating until the quality of the output meets your needs.

      It seems hard to interpret the exact results we want to see: there's no exact loss metric to look at, so it's more our own interpretation of the clustering quality based on what we know about the data.

    2. A clustering algorithm uses the similarity metric to cluster data. This course uses k-means.

      Then we run the clustering algorithm, I know, shocker. Here that means k-means (quick sketch below).
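
      My own quick sketch (not course code) of running k-means with scikit-learn, just to see the moving parts; assumes numpy and scikit-learn are installed:

      ```python
      # Group 150 made-up 2-D points into 3 clusters with k-means.
      import numpy as np
      from sklearn.cluster import KMeans

      rng = np.random.default_rng(0)
      # Three blobs of points around different centers.
      points = np.vstack([
          rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
          rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
          rng.normal(loc=[0, 5], scale=0.5, size=(50, 2)),
      ])

      kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
      print(kmeans.cluster_centers_)  # the 3 centroids (arithmetic means)
      print(kmeans.labels_[:10])      # cluster ID assigned to each example
      ```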

    3. Before a clustering algorithm can group data, it needs to know how similar pairs of examples are. You can quantify the similarity between examples by creating a similarity metric, which requires a careful understanding of your data.

      Then we create some sort of similarity metric, which requires a deeper understanding of our data. I assume this means choosing something like Euclidean, Manhattan, cosine, etc. (sketch below).
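
      A quick sketch of the metrics I mean (my own example using SciPy's distance helpers, not anything from the course):

      ```python
      # Compare two feature vectors with a few common distance/similarity metrics.
      import numpy as np
      from scipy.spatial import distance

      a = np.array([1.0, 2.0, 3.0])
      b = np.array([2.0, 0.0, 4.0])

      print(distance.euclidean(a, b))  # straight-line distance
      print(distance.cityblock(a, b))  # Manhattan distance (sum of absolute differences)
      print(distance.cosine(a, b))     # 1 - cosine similarity (angle between the vectors)
      ```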

    4. As with any ML problem, you must normalize, scale, and transform feature data before training or fine-tuning a model on that data. In addition, before clustering, check that the prepared data lets you accurately calculate similarity between examples.

      As usual we need to prepare our data first (normalize, scale, transform); if the prepared data lets us calculate similarity accurately, then awesome sauce. See the scaling example below.
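
      A small sketch of why scaling matters before computing similarity (my own toy example, assuming scikit-learn's StandardScaler): without it, the feature with the biggest raw range dominates the distance.

      ```python
      # Standardize two features that live on very different scales.
      import numpy as np
      from sklearn.preprocessing import StandardScaler

      # Columns: income (tens of thousands) and age (tens).
      X = np.array([[50_000.0, 25.0],
                    [52_000.0, 60.0],
                    [90_000.0, 30.0]])

      X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
      print(X_scaled)  # now both features contribute comparably to a distance
      ```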

    1. Centroid-based clustering

      It basically computes a set of arithmetic means (centroids) and groups each example with the mean it's closest to, so the resulting clusters end up organized around, and separated by, those means.

    2. The k-means algorithm has a complexity of O(n), meaning that the algorithm scales linearly with n. This algorithm will be the focus of this course.

      k-means is the main focus; it's O(n), so it's much more efficient. I have yet to learn exactly how it manages that.

    3. clustering algorithms compute the similarity between all pairs of examples, which means their runtime increases as the square of the number of examples n, denoted as O(n^2) in complexity notation

      Algorithms that run in O(n^2) are VERY inefficient at scale, specifically ones that calculate the similarity between every pair of examples when there are MILLIONS of examples. Rough numbers below.
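
      Rough numbers (my own back-of-the-envelope, not from the course) on how the pairwise comparisons blow up:

      ```python
      # Number of unordered pairs among n examples: n * (n - 1) / 2.
      def num_pairs(n: int) -> int:
          return n * (n - 1) // 2

      print(f"{num_pairs(1_000):,}")      # 499,500 (~half a million comparisons)
      print(f"{num_pairs(1_000_000):,}")  # 499,999,500,000 (~half a trillion comparisons)
      ```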

    1. Clustering YouTube videos replaces this set of features with a single cluster ID, thus compressing the data.

      Instead of keeping multiple different sets of features for YouTube videos, you can just use a cluster ID that represents that data.

    2. As discussed, the relevant cluster ID can replace other features for all examples in that cluster. This substitution reduces the number of features and therefore also reduces the resources needed to store, process, and train models on that data. For very large datasets, these savings become significant.

      Funnily enough, this is a great way to compress data and make the feature set less complex; toy sketch below.
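
      A toy sketch of the substitution (entirely my own example with pandas + scikit-learn; the column names are made up):

      ```python
      # Cluster on the full feature set, then keep only the cluster ID per example.
      import pandas as pd
      from sklearn.cluster import KMeans

      videos = pd.DataFrame({
          "length_min":   [3.2, 58.0, 4.1, 61.5],
          "views_log":    [6.1, 3.2, 5.9, 3.5],
          "comments_log": [3.0, 1.1, 2.8, 1.3],
      })

      videos["cluster_id"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(videos)
      compressed = videos[["cluster_id"]]  # 3 features -> 1 small integer per example
      print(compressed)
      ```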

    3. When some examples in a cluster have missing feature data, you can infer the missing data from other examples in the cluster.

      Clusters can help with imputation: just infer what the missing data should be from other examples in the same cluster (sketch below).
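
      A minimal sketch of that idea (my own toy example with pandas; the column names are hypothetical):

      ```python
      # Fill a missing feature value with the mean of that feature within the same cluster.
      import numpy as np
      import pandas as pd

      df = pd.DataFrame({
          "cluster_id": [0, 0, 0, 1, 1],
          "watch_time": [10.0, 12.0, np.nan, 55.0, np.nan],
      })

      df["watch_time"] = df.groupby("cluster_id")["watch_time"].transform(
          lambda s: s.fillna(s.mean())
      )
      print(df)  # NaNs become 11.0 in cluster 0 and 55.0 in cluster 1
      ```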

    4. After clustering, each group is assigned a unique label called a cluster ID. Clustering is powerful because it can simplify large, complex datasets with many features to a single cluster ID.

      Each group gets a unique cluster ID, which basically lets us collapse a large, complex set of features down to a single value per example.

    5. Different similarity measures may be more or less appropriate for different clustering scenarios, and this course will address choosing an appropriate similarity measure in later sections: Manual similarity measures and Similarity measure from embeddings.

      You would choose different similarity measures depending on the data you're handling

    6. But as the number of features increases, combining and comparing features becomes less intuitive and more complex.

      more features = more complex to compare similarity

    7. similarity measure, or the metric used to compare samples,

      We need a similarity measure; it's the metric for comparing how close data points actually are.

    8. Clustering is an unsupervised machine learning technique designed to group unlabeled examples based on their similarity to each other. (If the examples are labeled, this kind of grouping is called classification.)

      Clustering is unsupervised learning: no labels are used; it groups unlabeled examples based on how similar they are to each other.

  2. Oct 2025
    1. Tables are an intuitive input format for machine learning models. You can imagine each row of the table as an example and each column as a potential feature or label. That said, datasets may also be derived from other formats, including log files and protocol buffers.

      Tables are just a very intuitive way to hand data to machine learning models: each row is an example, each column is a feature or label (sketch below).
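
      A tiny sketch of that row/column framing with pandas (the file name and columns here are hypothetical, just for illustration):

      ```python
      # Each row of the table is one example; each column is a feature or the label.
      import pandas as pd

      df = pd.read_csv("videos.csv")         # e.g. columns: length_min, views, label
      features = df.drop(columns=["label"])  # feature columns
      labels = df["label"]                   # label column
      print(features.head())
      ```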

    2. Many datasets store data in tables (grids), for example, as comma-separated values (CSV) or directly from spreadsheets or database tables.

      Comma-separated values is what CSV stands for; I never knew that.

    1. Yes, ML practitioners spend the majority of their time constructing datasets and doing feature engineering.

      as everyone has told me, 90% of ML is data

    2. Data trumps all. The quality and size of the dataset matters much more than which shiny algorithm you use to build your model.

      Data is extraordinarily important; it's literally one of the reasons I'm planning to go through all this now, before I start going through my own data. You can have a good loss function, but ultimately it all depends on your data.