labels.csv contains the labels for the training images
Are there no labels for the testing images?
it likely leads to larger values in the Gram matrix
Negatively correlated?
the image printing function requires that each pixel has a floating point value from 0 to 1
why?
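A minimal sketch (mine, not the book's code), assuming a PyTorch CHW image tensor and matplotlib, showing why values must be clamped into [0, 1] before display:

```python
import torch
import matplotlib.pyplot as plt

# Hypothetical example: a synthesized image whose values drift outside [0, 1]
# after optimization. matplotlib interprets float images as lying in [0, 1],
# so out-of-range values would be rendered incorrectly.
img = torch.randn(3, 64, 64) * 0.5 + 0.5   # roughly centered, but with outliers

img = torch.clamp(img, 0.0, 1.0)           # force every pixel into [0, 1]
plt.imshow(img.permute(1, 2, 0).numpy())   # HWC float image in [0, 1]
plt.axis('off')
plt.show()
```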
In practice, we can remove predicted bounding boxes with lower confidence even before performing non-maximum suppression, thereby reducing computation in this algorithm. We may also post-process the output of non-maximum suppression, for example, by only keeping results with higher confidence in the final output.
pre and post process
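A hedged sketch of that pre-/post-filtering, using torchvision.ops.nms as the NMS step; the function name `filtered_nms` and the threshold values are illustrative assumptions, not the book's code:

```python
import torch
from torchvision.ops import nms

def filtered_nms(boxes, scores, iou_thresh=0.5, pre_conf=0.05, post_conf=0.5):
    """boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,) confidences."""
    # Pre-process: drop low-confidence boxes before NMS to reduce computation.
    keep = scores >= pre_conf
    boxes, scores = boxes[keep], scores[keep]
    # Non-maximum suppression on the remaining boxes.
    idx = nms(boxes, scores, iou_thresh)
    boxes, scores = boxes[idx], scores[idx]
    # Post-process: keep only high-confidence results in the final output.
    keep = scores >= post_conf
    return boxes[keep], scores[keep]
```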
add a dimension
add 'batch' dimension
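For concreteness, adding a leading 'batch' dimension in PyTorch can look like this (a trivial sketch, not the book's exact code):

```python
import torch

x = torch.rand(3, 224, 224)   # one image: (channels, height, width)
x = x.unsqueeze(0)            # add a 'batch' dimension -> (1, 3, 224, 224)
# equivalently: x[None] or x.reshape(1, 3, 224, 224)
```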
The best way to do this is by first using tesseract to get OCR text in whatever languages you might feel are in there, using langdetect to find what languages are included in the OCR text and then run OCR again with the languages found.
how about the accuracy?
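A rough sketch of that two-pass pipeline, assuming pytesseract and langdetect are installed along with the relevant tesseract language packs; `LANG_MAP`, `first_guess`, and `two_pass_ocr` are my own illustrative names, and error handling is omitted:

```python
from PIL import Image
import pytesseract
from langdetect import detect_langs

# Assumed mapping from langdetect ISO codes to tesseract language packs;
# extend it for the languages you expect (and install the matching packs).
LANG_MAP = {'en': 'eng', 'fr': 'fra', 'de': 'deu', 'zh-cn': 'chi_sim'}

def two_pass_ocr(path, first_guess='eng+fra+deu+chi_sim'):
    img = Image.open(path)
    # Pass 1: OCR with a broad guess at the languages present.
    rough_text = pytesseract.image_to_string(img, lang=first_guess)
    # Detect which languages actually appear in the rough OCR output.
    detected = [d.lang for d in detect_langs(rough_text)]
    langs = '+'.join(LANG_MAP.get(l, 'eng') for l in detected)
    # Pass 2: OCR again, restricted to the detected languages.
    return pytesseract.image_to_string(img, lang=langs)
```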
For comparison, we define an identical model, but initialize all of its model parameters to random values.
Is keeping all parameters at their initial values the same as assigning them randomly?
As is observed in the above results, after an nn.Sequential instance is scripted using the torch.jit.script function, computing performance is improved through the use of symbolic programming.
But it takes a longer time?
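A small sketch of scripting an nn.Sequential and timing it against eager execution (the layer sizes and iteration count are arbitrary; actual speedups vary by model and hardware):

```python
import time
import torch
from torch import nn

net = nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                    nn.Linear(256, 128), nn.ReLU(),
                    nn.Linear(128, 2))
x = torch.randn(1, 512)

def bench(model, n=1000):
    start = time.time()
    for _ in range(n):
        model(x)
    return time.time() - start

t_eager = bench(net)                  # imperative (eager) execution
net_scripted = torch.jit.script(net)  # convert to a symbolic (scripted) graph
t_script = bench(net_scripted)
print(f'eager: {t_eager:.4f} s, scripted: {t_script:.4f} s')
```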
In the context of computer vision this schedule can lead to improved results.
Image augmentation?
The photorealistic text-to-image examples in Fig. 11.9.5 suggest that the T5 encoder alone may effectively represent text even without fine-tuning.
There should be additional networks between T5 and the output?
advanced
I.e., pushed forward (used as a verb here).
Without need for manual labeling, large-scale text data from books and Wikipedia can be used for pretraining BERT.
No manual labeling is needed.
Since we use the fixed positional encoding whose values are always between −1 and 1,
Why are the values always between −1 and 1?
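For reference, a sketch of the sinusoidal positional encoding (an even `num_hiddens` is assumed); every entry is a sine or cosine, which is why the values stay in [−1, 1]:

```python
import torch

def positional_encoding(max_len, num_hiddens):
    # P[pos, 2j]   = sin(pos / 10000^(2j / num_hiddens))
    # P[pos, 2j+1] = cos(pos / 10000^(2j / num_hiddens))
    P = torch.zeros(max_len, num_hiddens)
    pos = torch.arange(max_len, dtype=torch.float32).reshape(-1, 1)
    div = torch.pow(10000, torch.arange(0, num_hiddens, 2,
                                        dtype=torch.float32) / num_hiddens)
    P[:, 0::2] = torch.sin(pos / div)
    P[:, 1::2] = torch.cos(pos / div)
    return P  # every entry is a sine or cosine, hence in [-1, 1]

P = positional_encoding(max_len=60, num_hiddens=32)
assert P.min() >= -1 and P.max() <= 1
```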
position
Does "position" correspond to the time step?
To compute multiple heads of multi-head attention in parallel, proper tensor manipulation is needed.
Must different heads all have the same size (num_hiddens / num_heads)?
Note that $h$ heads can be computed in parallel if we set the number of outputs of linear transformations for the query, key, and value to $p_q h = p_k h = p_v h = p_o$.
If they are not equal, can they not be computed in parallel?
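A sketch of the kind of tensor manipulation meant here (modeled loosely on the chapter's reshaping helpers, but simplified): because every head has the same size num_hiddens // num_heads, all heads can be folded into the batch dimension and computed in one batched operation:

```python
import torch

def split_heads(X, num_heads):
    """(batch, seq_len, num_hiddens) ->
       (batch * num_heads, seq_len, num_hiddens // num_heads).

    All heads share the same per-head size, which is what lets them be
    processed as a single batched matrix multiplication.
    """
    batch, seq_len, num_hiddens = X.shape
    X = X.reshape(batch, seq_len, num_heads, num_hiddens // num_heads)
    X = X.permute(0, 2, 1, 3)                   # (batch, heads, seq, head_dim)
    return X.reshape(batch * num_heads, seq_len, -1)

X = torch.randn(2, 10, 64)                      # num_hiddens=64, num_heads=8
print(split_heads(X, num_heads=8).shape)        # torch.Size([16, 10, 8])
```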
In the case of a (scalar) regression with observations (xi,yi) for features and labels respectively, vi=yi are scalars, ki=xi are vectors, and the query q denotes the new location where f should be evaluated.
Are x_i and q equal?
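For reference, the Nadaraya-Watson estimator evaluated at a query $q$ (a new location, which need not coincide with any training $x_i$); the second form assumes a Gaussian kernel:

$$f(q) \;=\; \sum_{i=1}^{n} \frac{K(q - k_i)}{\sum_{j=1}^{n} K(q - k_j)}\, v_i \;=\; \sum_{i=1}^{n} \operatorname{softmax}\!\left(-\tfrac{1}{2}(q - k_i)^2\right) y_i, \qquad k_i = x_i,\; v_i = y_i.$$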
the conditional probability of each token at time step 3 has also changed in Fig. 10.8.2
Why does it change like that?
Using word-level tokenization, the vocabulary size will be significantly larger than that using character-level tokenization, but the sequence lengths will be much shorter.
the sequence lengths?
flavor
problem
we can easily get a deep-gated RNN by replacing the hidden state computation in (10.3.1) with that from an LSTM or a GRU.
Is the direction of the replacement reversed?
Reset gates help capture short-term dependencies in sequences. Update gates help capture long-term dependencies in sequences.
why?
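One way to see this is from the GRU updates themselves (the standard form, matching the chapter's notation): when the reset gate $\mathbf{R}_t \approx 0$, the candidate state ignores $\mathbf{H}_{t-1}$, effectively restarting the sequence (short-term focus); when the update gate $\mathbf{Z}_t \approx 1$, the old state is copied forward almost unchanged across many steps (long-term memory):

$$\begin{aligned}
\mathbf{R}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xr} + \mathbf{H}_{t-1} \mathbf{W}_{hr} + \mathbf{b}_r),\\
\mathbf{Z}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xz} + \mathbf{H}_{t-1} \mathbf{W}_{hz} + \mathbf{b}_z),\\
\tilde{\mathbf{H}}_t &= \tanh(\mathbf{X}_t \mathbf{W}_{xh} + (\mathbf{R}_t \odot \mathbf{H}_{t-1}) \mathbf{W}_{hh} + \mathbf{b}_h),\\
\mathbf{H}_t &= \mathbf{Z}_t \odot \mathbf{H}_{t-1} + (1 - \mathbf{Z}_t) \odot \tilde{\mathbf{H}}_t.
\end{aligned}$$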
Note that only the hidden state is passed to the output layer.
Isn't the output of the previous time step the input of the current time step?
For instance, if the first token is of great importance we will learn not to update the hidden state after the first observation.
Important → do not update?
neuron
Is a neuron a cell?
multiplicative nodes
Multiplication nodes.
detaching the gradient
?
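A minimal sketch of what detaching does in truncated BPTT (my own toy example, not the book's RNN): the single weight `W` stands in for the recurrent parameters, and `state` is carried across steps:

```python
import torch

W = torch.randn(32, 32, requires_grad=True)   # stand-in for recurrent weights
state = torch.zeros(1, 32)
for step in range(5):
    state = state.detach()              # cut the graph: truncate backprop here
    x = torch.randn(1, 32)
    state = torch.tanh(x @ W + state)   # stand-in for one RNN step
    loss = state.sum()
    loss.backward()                     # gradients w.r.t. W flow back only one step
    W.grad = None                       # reset for the next illustrative step
```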
Using the chain rule yields
?
Whenever $\xi_t = 0$ the recurrent computation terminates at that time step $t$.
?
While we can use the chain rule to compute $\partial h_t/\partial w_h$ recursively, this chain can get very long whenever $t$ is large. Let's discuss a number of strategies for dealing with this problem.
I don't understand why this substitution can be made.
where computation of $h_{t-1}$ also depends on $w_h$
?
Input from the first step passes through over 1000 matrix products before arriving at the output, and another 1000 matrix products are required to compute the gradient.
Forward and backward passes.
side-effect of limiting the influence
the side-effect is limiting the influence.
At other times, training eventually converges but is unstable owing to massive spikes in the loss.
So a model like this cannot be used?
Having a small value for this upper bound might be viewed as good or bad. On the downside, we are limiting the speed at which we can reduce the value of the objective. On the bright side, this limits by just how much we can go wrong in any one gradient step.
exponentially
why
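For reference, a sketch of gradient clipping by norm in PyTorch (the model, data, and `max_norm=1.0` are illustrative); after clipping, one optimizer step can change the parameters by at most the learning rate times the threshold in $\ell_2$ norm:

```python
import torch
from torch import nn

net = nn.Linear(10, 1)
opt = torch.optim.SGD(net.parameters(), lr=0.1)

x, y = torch.randn(8, 10), torch.randn(8, 1)
loss = nn.functional.mse_loss(net(x), y)
loss.backward()
# Rescale the gradient so its overall l2 norm is at most max_norm.
torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)
opt.step()
```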
There will be many plausible three-word combinations that we likely will not see in our dataset.
?
formulae
What is the relationship between independence and the unigram, bigram, and trigram models?
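For reference, the factorizations behind unigram, bigram, and trigram models (each makes a progressively weaker independence assumption: the next token depends on at most 0, 1, or 2 preceding tokens):

$$\begin{aligned}
P(x_1, x_2, x_3, x_4) &= P(x_1)\,P(x_2)\,P(x_3)\,P(x_4), &&\text{(unigram)}\\
P(x_1, x_2, x_3, x_4) &= P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_2)\,P(x_4 \mid x_3), &&\text{(bigram)}\\
P(x_1, x_2, x_3, x_4) &= P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1, x_2)\,P(x_4 \mid x_2, x_3). &&\text{(trigram)}
\end{aligned}$$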
After all, we will significantly overestimate the frequency of the tail, also known as the infrequent words.
Why will it overestimate?
frequency
Why is it described in terms of frequency?
Even today’s massive RNN- and Transformer-based language models seldom incorporate more than thousands of words of context.
How much text do large models take as input at a time?
probabilistic classifier
Classifying among a set of different probability distributions?
compare
Not prediction, but comparison?
part of speech tagging
I.e., labeling each word with its part of speech.
motivating
The subject is "we".
the expressive power of the network
closer
stable
How should "stable" be understood?
thus batch normalization layers function differently in training mode (normalizing by minibatch statistics) than in prediction mode (normalizing by dataset statistics).
Why is it infeasible to compute the mean and standard deviation over the entire dataset, yet feasible for a single layer?
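A small sketch of the two modes in PyTorch (the layer size and batch are arbitrary):

```python
import torch
from torch import nn

bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4)

bn.train()        # training mode: normalize by the minibatch's own statistics
y_train = bn(x)   # (also updates the running mean/variance estimates)

bn.eval()         # prediction mode: normalize by the accumulated running
y_eval = bn(x)    # (dataset-level) mean/variance, so single examples work too

print(bn.running_mean, bn.running_var)
```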
priors
I.e., prior assumptions.
scaling issue
issue?
cannot
why?
recover this degree of freedom
Did we lose 2 degrees of freedom?
.
I don't understand this paragraph.
reduce
how to determine the "ratio"?
They are no longer necessary
Does "they" refer to what we designed?
features
not just low-level?
add fully connected layers earlier in the network to increase the degree of nonlinearity
why can it add nonlinearity?
and occasionally on the serendipitous discoveries by lucky graduate students
?
given the increase in computation and data
I.e., taking into account the growth in computation and data.
separately
different numbers?
Pixel utilization for convolutions of size 1×1, 2×2, and 3×3 respectively.
?
on
Why (−1000, 1000)? The original image is 1000×1000.
desiderata
I.e., the factors that need to be considered.
recognize a pig were
recognize sth be
Because these networks are invariant to the order of the features, we could get similar results regardless of whether we preserve an order corresponding to the spatial structure of the pixels or if we permute the columns of our design matrix before fitting the MLP’s parameters.
example?
irrespective of the spatial relation between pixels
How would one be "respective" of them, i.e., take the spatial relations into account?
Note that in this case, only the first layer requires lazy initialization, but the framework initializes sequentially. Once all parameter shapes are known, the framework can finally initialize the parameters.
?
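A minimal sketch of this behavior with nn.LazyLinear (layer sizes are arbitrary): neither layer knows its input width at construction time; shapes are inferred, and all parameters are initialized in order, on the first forward pass:

```python
import torch
from torch import nn

net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
print(net[0].weight)          # <UninitializedParameter> -- nothing allocated yet

X = torch.rand(2, 20)
net(X)                        # first pass: input width 20 is now known
print(net[0].weight.shape)    # torch.Size([256, 20])
print(net[2].weight.shape)    # torch.Size([10, 256])
```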
including parameter initialization and backpropagation
Where is forward propagation? A customized forward() and __init__()?
are thereafter constant. This weight is not a model parameter and thus it is never updated by backpropagation
why constant?
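A sketch of such a constant weight (modeled on the chapter's fixed-weight example, but simplified): because `rand_weight` is a plain tensor rather than an nn.Parameter, it is created once and never touched by the optimizer or backpropagation:

```python
import torch
from torch import nn

class FixedWeightMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Constant: a plain tensor, not registered as a model parameter.
        self.rand_weight = torch.rand((20, 20), requires_grad=False)
        self.linear = nn.Linear(20, 20)          # trainable as usual

    def forward(self, X):
        X = self.linear(X)
        X = torch.relu(X @ self.rand_weight + 1) # uses the constant weight
        return self.linear(X)

net = FixedWeightMLP()
# Only the linear layer's weight and bias show up as trainable parameters.
print([name for name, _ in net.named_parameters()])
```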
it will properly initialize each module’s parameters
automatically?
daisy-chain
串联
convolutional neural networks
I'm looking forward to this.
subclass
why not class?
logits
logic or math?
Finally a module must possess a backpropagation method, for purposes of calculating gradients. Fortunately, due to some behind-the-scenes magic supplied by the auto differentiation (introduced in Section 2.5) when defining our own module, we only need to worry about parameters and the forward propagation method.
We write the forward propagation; backward propagation for calculating gradients is handled by auto differentiation.
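A minimal sketch of that division of labor: we write `__init__` (declaring parameters) and `forward` (the computation); autograd supplies the backward pass:

```python
import torch
from torch import nn

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Parameters live in submodules; we only declare them here.
        self.hidden = nn.Linear(20, 256)
        self.out = nn.Linear(256, 10)

    def forward(self, X):
        # We only write the forward computation...
        return self.out(torch.relu(self.hidden(X)))

net = MLP()
X = torch.rand(2, 20)
loss = net(X).sum()
loss.backward()                        # ...backpropagation comes via autograd
print(net.hidden.weight.grad.shape)    # gradients were computed automatically
```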
Note that some modules do not require any parameters at all.
For example: an activation layer such as nn.ReLU has no parameters.
repeating patterns
Here: repeating, not varied.
recursively
for example?
recognition and detection
What are the differences between them?
tunable parameters
Are these not the parameters that can be optimized during training? How do they differ from hyperparameters?
softmax regression,
.
these techniques
Techniques of this chapter? Do they mean new models and datasets?
I/O constrained
not I/O constrained, but processing constrained.
the number of occurrences
better probability?
Describe the relationships between algorithms, data, and computation.
Algorithms determine how data are computed.
bring you up to
It's a metaphor for being current or operating at the same level of understanding as everyone else.