 May 2023

writings.stephenwolfram.com

“secondary pathway” that takes the sequence of (integer) positions for the tokens, and from these integers creates another embedding vector
What is this position? Why are we embedding this? What does this embedding mean?

The input is a vector of n tokens (represented as in the previous section by integers from 1 to about 50,000).
What is 'n' here? Is this the number of tokens identified in the given sentence? Once we've found the embedding can't we use a look up instead of a single layer NN?
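On the lookup question, a minimal sketch (toy sizes and made-up matrix values, 0-based token ids for convenience): multiplying a one-hot vector by a weight matrix with no bias selects one row of that matrix, so a single linear layer on one-hot inputs and a table lookup give the same embedding. Here n is simply the length of the token sequence.

```python
# Hypothetical toy sizes; the text mentions about 50,000 tokens.
VOCAB_SIZE = 5
EMBED_DIM = 3

# Embedding matrix: one row per token id (values are made up for illustration).
embedding = [[10 * t + d for d in range(EMBED_DIM)] for t in range(VOCAB_SIZE)]

def one_hot(token_id, size):
    """Vector of zeros with a single 1 at position token_id."""
    return [1 if i == token_id else 0 for i in range(size)]

def embed_by_matmul(token_id):
    """Single linear layer without bias: one-hot vector times the weight matrix."""
    x = one_hot(token_id, VOCAB_SIZE)
    return [sum(x[t] * embedding[t][d] for t in range(VOCAB_SIZE))
            for d in range(EMBED_DIM)]

def embed_by_lookup(token_id):
    """Direct row lookup in the same matrix."""
    return list(embedding[token_id])

tokens = [3, 0, 4]  # a sequence of n = 3 token ids
by_matmul = [embed_by_matmul(t) for t in tokens]
by_lookup = [embed_by_lookup(t) for t in tokens]
```

Both lists hold the same n vectors; the lookup just skips the multiplications by zero.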

 Nov 2021

www.analyticsvidhya.com

Cluster-Based Over Sampling
Not sure how this will help with the imbalance issue. How does equal representation of subclasses lead to better results?

 Jul 2021

christophm.github.io

$\mathbb{E}_{\mathcal{D}_x(z \mid A)}\left[1_{f(x) = f(z)}\right] \geq \tau, \quad A(x) = 1$
Not clear to me what this means.


christophm.github.io

The x-axis shows the feature effect: The weight times the actual feature value.
I do not understand this.


christophm.github.io

You can subtract the lower-order effects in a partial dependence plot to get the pure main or second-order effects
How do we do this?

Well, that sounds stupid. Derivation and integration usually cancel each other out, like first subtracting, then adding the same number. Why does it make sense here? The derivative (or interval difference) isolates the effect of the feature of interest and blocks the effect of correlated features.
Say what? How does it remove the correlation? It removes the offset, but the correlation?

ALE plots are a faster and unbiased alternative to partial dependence plots (PDPs).
Why are the PDPs biased?


christophm.github.io

For each of the categories, we get a PDP estimate by forcing all data instances to have the same category. For example, if we look at the bike rental dataset and are interested in the partial dependence plot for the season, we get 4 numbers, one for each season. To compute the value for "summer", we replace the season of all data instances with "summer" and average the predictions.
Why would we change the season for all instances? This does not make sense. We simply have to take the average of all instances corresponding to a particular season.
Update: I got it now. You do replace every instance by that value and simply run all modified instances through the ML model and average across its output.
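A small sketch of the difference between the two readings, using a made-up stand-in model and four invented rows (not the real bike rental data): the PDP value replaces the season for every row and averages the predictions, while the group mean only averages the rows that actually have that season.

```python
# Hypothetical toy data: (season, temperature) pairs; not the real bike rental dataset.
data = [("winter", 2.0), ("winter", 5.0), ("summer", 25.0), ("summer", 30.0)]

SEASON_EFFECT = {"winter": 100.0, "summer": 300.0}

def model(season, temperature):
    """Stand-in for a fitted ML model (made up for illustration)."""
    return SEASON_EFFECT[season] + 10.0 * temperature

def partial_dependence(category):
    """Force every row to the given season, predict, and average."""
    preds = [model(category, temp) for _, temp in data]
    return sum(preds) / len(preds)

def observed_group_mean(category):
    """Average predictions only over rows that really have this season."""
    preds = [model(s, t) for s, t in data if s == category]
    return sum(preds) / len(preds)

pdp_gap = partial_dependence("summer") - partial_dependence("winter")
group_gap = observed_group_mean("summer") - observed_group_mean("winter")
```

In this toy setup the PDP gap equals the model's season effect alone, while the group-mean gap also absorbs the temperature difference between seasons. The price is that the PDP evaluates rows such as "winter at 30 degrees", which is the unlikely-data-point issue raised for correlated features.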

An assumption of the PDP is that the features in C are not correlated with the features in S. If this assumption is violated, the averages calculated for the partial dependence plot will include data points that are very unlikely or even impossible (see disadvantages).
I do not follow this.


christophm.github.io

$f(x) = \hat{f}(x_S, x_C) = g(x_S) + h(x_C)$
How do we know we can express it like this?


nba.uth.tmc.edu

innervation ratio
How is this a ratio?


christophm.github.io

A tree with a depth of three requires a maximum of three features and split points to create the explanation for the prediction of an individual instance.
This means that predicting the value for any single instance requires at most three features, even though the overall tree can use up to 7 features (one per internal node).

$\sum_{j=1} \text{feat.contrib}(j, x)$
How do we get the feature contribution?

Feature importance
I do not follow this.


christophm.github.io

How will this help with comparison? Are we assuming the other model uses categorization?


christophm.github.io

Logistic regression can suffer from complete separation. If there is a feature that would perfectly separate the two classes, the logistic regression model can no longer be trained. This is because the weight for that feature would not converge, because the optimal weight would be infinite. This is really a bit unfortunate, because such a feature is really useful. But you do not need machine learning if you have a simple rule that separates both classes. The problem of complete separation can be solved by introducing penalization of the weights or defining a prior probability distribution of weights.
Cannot understand this.
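A rough sketch of what goes wrong, on invented one-dimensional data that a threshold at zero separates perfectly (the learning rate and penalty strength are arbitrary choices): without a penalty, every gradient step pushes the weight further up, because a larger weight always makes the predicted probabilities more extreme and the likelihood a bit higher. With an L2 penalty the weight settles at a finite value.

```python
import math

# Hypothetical, perfectly separable 1-D data: x < 0 -> class 0, x > 0 -> class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weight(steps, l2=0.0, lr=0.5):
    """Gradient ascent on the (optionally L2-penalized) log-likelihood, no intercept."""
    w = 0.0
    for _ in range(steps):
        # Gradient of the log-likelihood minus the gradient of (l2 / 2) * w**2.
        grad = sum((y - sigmoid(w * x)) * x for x, y in zip(xs, ys)) - l2 * w
        w += lr * grad
    return w

w_100 = fit_weight(100)        # unpenalized: still growing
w_10000 = fit_weight(10000)    # unpenalized: larger again
w_pen_a = fit_weight(1000, l2=0.1)
w_pen_b = fit_weight(10000, l2=0.1)  # penalized: same value as after 1000 steps
```

The unpenalized weight keeps increasing with the number of steps, which is the "would not converge" in the passage; the penalized runs agree with each other.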

But usually you do not deal with the odds and interpret the weights only as the odds ratios. Because for actually calculating the odds you would need to set a value for each feature, which only makes sense if you want to look at one specific instance of your dataset.
I do not follow this.
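One way to see the point, as a sketch with made-up coefficients: the odds depend on the values of all features, but the ratio of odds when one feature increases by one unit is exp(weight) regardless of what the other values are. That is why the weight can be interpreted on its own, while the odds themselves need a concrete instance.

```python
import math

# Hypothetical fitted logistic regression: intercept and two weights (made-up values).
INTERCEPT = -1.0
WEIGHTS = [0.7, -0.3]

def odds(features):
    """Odds P(y=1)/P(y=0) = exp(intercept + sum of weight * feature value)."""
    return math.exp(INTERCEPT + sum(w * x for w, x in zip(WEIGHTS, features)))

def odds_ratio_for_unit_increase(features, j):
    """Odds after increasing feature j by 1, divided by the odds before."""
    bumped = list(features)
    bumped[j] += 1.0
    return odds(bumped) / odds(features)

# Two very different instances give the same odds ratio for feature 0.
ratio_a = odds_ratio_for_unit_increase([0.0, 0.0], 0)
ratio_b = odds_ratio_for_unit_increase([5.0, -2.0], 0)
```

Both ratios equal exp(0.7), even though the odds of the two instances differ.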


christophm.github.io

days_since_2011 4.9 0.2 28.5
It looks like there was a steady increase in the number of bikes rented every single day.

$\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}$
I do not follow this. Is this always guaranteed to be within 0 and 1?
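A quick check of the range question, assuming the standard adjusted R-squared with n instances and p features: the value never exceeds 1, but it is not bounded below by 0. With a weak fit and many features relative to n, the correction factor (n - 1) / (n - p - 1) can push it negative.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n instances and p features (requires n > p + 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

good_fit = adjusted_r2(0.9, 100, 5)   # slightly below 0.9
weak_fit = adjusted_r2(0.1, 10, 5)    # 1 - 0.9 * 9 / 4, which is negative
```

So the adjustment only penalizes: it is at most the plain R-squared and drops below zero when the features explain less than would be expected by chance.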


christophm.github.io

2.4 Evaluation of Interpretability
I do not follow this section.

Application level evaluation (real task)
I do not understand this. Where is the interpretability here? The software could perform as well as the radiologist, but we might still have no idea how it is doing it.

 May 2021

www.gammon.com.au

This article discusses interrupts on the Arduino Uno (Atmega328) and similar processors, using the Arduino IDE. The concepts however are very general. The code examples provided should compile on the Arduino IDE (Integrated Development Environment).
This is such a great resource!

 Apr 2021

qz.com

It’s very hard to get consciousness out of nonconsciousness
If particles can come in and go out of existence from nothing, why can't consciousness?

 Mar 2021

www.scholarpedia.org

low firing rates,
What counts as a low firing rate?

fenvi=fmin+⎛⎝1−UthMUifmax−fmin⎞⎠∗U
I do not follow this. Does this mean the neuron is always firing at least at f_min? How does f_0.5 come into the picture?

model is an assembly of phenomenological models,
Not sure how this avoids the issues of overfitting.

All of the muscle fibers in all of the motor units of a given muscle tend to move together, experiencing the same sarcomere lengths and velocities
How do we know this? What about motor units that are not activated? What about motor units that are activated with different time delays and different rates?

 Jul 2020

en.wikipedia.org

effects of crossover and dropout
I understand dropout, but how does crossover come in?

Randomized clinical trials analyzed by the intention-to-treat (ITT) approach provide unbiased comparisons among the treatment groups
What is the proof for this? Is there a statistical proof?

 Jan 2020

rdgao.github.io

Additionally, one of my all time favorite papers (Fröhlich & McCormick, 2010) showed that an applied external field can entrain spiking in ferret cortical slice, parametrically to the oscillatory field frequency.
What are the sources of the LFP? If it is only the current induced by EPSPs and IPSPs, then it is not clear that entrainment by an external field makes a strong argument. If LFPs can be modified by other factors, as is pointed out (Ca++ currents and glial cell currents), then it is possible that they are not epiphenomenal. But we still need to show that they actually play a causal role in the observed system output.

But spikes do not compute! The cells “compute”, dendrites “compute”, the axon hillock “computes”. In that sense, spikes are epiphenomenal: they are the secondary consequences of dendritic computation, of which you can fully infer by knowing the incoming synaptic inputs and biophysical properties of the neuron.
This is correct. But in that case, LFPs and any other oscillations that we are recording are also epiphenomenal.

 Dec 2019

link.springer.com

estimators is the prior covariance $\Sigma_{\phi\phi}$
How do we know this covariance?

If the sensor noise has an independent identical distribution (IID) across channels, the covariance of the sensor noise in the referenced data will be $\Sigma_{\varepsilon_r \varepsilon_r} = \sigma^2 T_r T_r^T$
I do not understand this.
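One way to read it, as a sketch: assume the referenced noise is the sensor noise passed through a linear re-referencing operator, $\varepsilon_r = T_r \varepsilon$ (my assumption about what $T_r$ denotes), that the noise has zero mean, and that IID across channels means $\Sigma_{\varepsilon\varepsilon} = \sigma^2 I$. The covariance of the referenced noise then follows in one line:

$$
\Sigma_{\varepsilon_r \varepsilon_r}
= E\left[\varepsilon_r \varepsilon_r^T\right]
= T_r \, E\left[\varepsilon \varepsilon^T\right] T_r^T
= T_r \left(\sigma^2 I\right) T_r^T
= \sigma^2 \, T_r T_r^T
$$

So even if the raw noise is uncorrelated across channels, the referenced noise generally is not, because $T_r T_r^T$ is not diagonal for typical reference transforms.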

 Mar 2019

docs.microsoft.com

When the number of references drops to zero, the object deletes itself. The last part is worth repeating: The object deletes itself
How does that work? How does an object delete itself?
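A toy illustration of the idea in Python (the class and method names are hypothetical, not a real COM binding): the object keeps its own counter, and the method that decrements it is also the one that triggers cleanup when the counter reaches zero. In C++ COM objects, that branch is typically where `delete this;` runs, which is legal as long as the method touches no members afterwards.

```python
class RefCounted:
    """Toy stand-in for a COM-style object (hypothetical, not a real COM binding)."""

    def __init__(self):
        self._refs = 1          # the creator holds the first reference
        self.destroyed = False

    def add_ref(self):
        """Another client starts using the object."""
        self._refs += 1
        return self._refs

    def release(self):
        """A client is done; the object cleans itself up on the last release."""
        self._refs -= 1
        if self._refs == 0:
            # In C++ COM objects this is where `delete this;` typically runs.
            self._destroy()
        return self._refs

    def _destroy(self):
        self.destroyed = True   # resources would be freed here


obj = RefCounted()
obj.add_ref()                   # a second client takes a reference
obj.release()                   # first client is done: count drops to 1
still_alive = not obj.destroyed
obj.release()                   # last client is done: count drops to 0
```

No outside code ever decides to destroy the object; each client only says "I am done", and the last such call ends the object's life.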


docs.microsoft.com

CALLBACK is the calling convention for the functio
What is a calling convention?

 Feb 2019


Calculus on Computational Graphs: Backpropagation
Good article on computational graphs and their role in back propagation.


stats.stackexchange.com

One benefit of SGD is that it's computationally a whole lot faster. Large datasets often can't be held in RAM, which makes vectorization much less efficient. Rather, each sample or batch of samples must be loaded, worked with, the results stored, and so on. Minibatch SGD, on the other hand, is usually intentionally made small enough to be computationally tractable. Usually, this computational advantage is leveraged by performing many more iterations of SGD, making many more steps than conventional batch gradient descent. This usually results in a model that is very close to that which would be found via batch gradient descent, or better.
Good explanation of why SGD is computationally better. I was confused about the benefit of repeatedly performing minibatch GD, and why it might be better than batch GD. But I guess the advantage comes from being able to get better performance by vectorizing the computation.


neuralnetworksanddeeplearning.com

And so it makes most sense to regard epoch 280 as the point beyond which overfitting is dominating learning in our neural network.
I do not get this. Epoch 15 indicates that we are already overfitting to the training data set, no? This assumes both the training and test sets come from the same population that we are trying to learn from.

If we see that the accuracy on the test data is no longer improving, then we should stop training
This contradicts the earlier statement about epoch 280 being the point where there is overtraining.

It might be that accuracy on the test data and the training data both stop improving at the same time
Can this happen? Can the accuracy on the training data set ever increase with the training epoch?

test data
Shouldn't this be "training data"?

What is the limiting value for the output activations $a^L_j$
When c is large, small differences in z_j^L are magnified and the function jumps between 0 and 1, depending on the sign of the differences. On the other hand, when c is very small, all activation values will be close to 1/N, where N is the number of neurons in layer L.
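A small numerical sketch of both limits, assuming the output layer is a softmax applied to c times the weighted inputs (the input values below are made up):

```python
import math

def scaled_softmax(z, c):
    """Softmax of c * z, computed stably by subtracting the maximum first."""
    scaled = [c * v for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0]  # hypothetical weighted inputs for a 3-neuron output layer
near_uniform = scaled_softmax(z, 0.001)  # every entry close to 1/3
near_one_hot = scaled_softmax(z, 100.0)  # almost all mass on the largest input
```

With tiny c the three activations are nearly equal; with large c the neuron with the largest weighted input takes essentially all of the probability.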

$z^L_j = \ln a^L_j + C$
How can the constant C be independent of j? It will have an e^{z_j^L} term in it. This does not look correct.
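A numerical check that helps here, assuming the usual softmax definition a_j = e^{z_j} / sum_k e^{z_k}: taking logs gives ln a_j = z_j - ln(sum_k e^{z_k}), so C = ln(sum_k e^{z_k}). The sum does contain e^{z_j}, but it runs over every k, so it is the same number whichever j we look at. The input values below are made up.

```python
import math

z = [0.5, -1.2, 3.0]  # hypothetical weighted inputs z_j^L
total = sum(math.exp(v) for v in z)
a = [math.exp(v) / total for v in z]  # softmax activations a_j^L

# z_j - ln(a_j) should be the same number for every j.
constants = [zj - math.log(aj) for zj, aj in zip(z, a)]
log_sum_exp = math.log(total)
```

Every entry of `constants` matches `log_sum_exp` up to rounding, so C depends on the whole layer's inputs but not on the particular index j.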
