given that keypoints or regions are inadequately predicted
REALLY?
Note that differentiating the gradient of L w.r.t. x requires a second-order derivative of the considered parametrized function and L-BFGS needs to construct a third-order derivative approximation, which is challenging for neural networks with ReLU units for which higher-order derivatives are discontinuous
Very useful information and analysis.
These works make strong assumptions on the model architecture and model parameters that make reconstructions easier, but violate the threat model that we consider in this work and lead to less realistic scenarios
The point of this paper is that it offers a fairly general framework; earlier work relied on strong assumptions about the model architecture, which limits its range of application.
any input to a fully connected layer can be reconstructed analytically independent of the remaining architecture
How should "analytically" be understood here?
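One way to read "analytically": a minimal sketch, assuming the fully connected layer has a bias term and computes y = Wx + b. Then dL/dW = (dL/dy) x^T and dL/db = dL/dy, so each row of dL/dW is x scaled by the matching bias gradient, and x follows by a division in closed form with no optimization loop. The squared-error loss and the helper names `fc_gradients` and `reconstruct_input` are illustrative assumptions, not taken from the paper.

```python
def fc_gradients(W, b, x, target):
    """Gradients of L = 0.5 * ||W x + b - target||^2 w.r.t. W and b."""
    y = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    dL_dy = [y_i - t_i for y_i, t_i in zip(y, target)]
    # dL/dW is the outer product of dL/dy and x; dL/db equals dL/dy.
    dL_dW = [[g_i * x_j for x_j in x] for g_i in dL_dy]
    dL_db = list(dL_dy)
    return dL_dW, dL_db


def reconstruct_input(dL_dW, dL_db, eps=1e-12):
    """Recover x from the layer's gradients alone: x = dL/dW[i] / dL/db[i]."""
    # Use the row with the largest bias gradient for numerical stability.
    i = max(range(len(dL_db)), key=lambda k: abs(dL_db[k]))
    if abs(dL_db[i]) < eps:
        raise ValueError("all bias gradients are zero; input is not recoverable")
    return [g / dL_db[i] for g in dL_dW[i]]


# Hypothetical numbers, for illustration only.
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
b = [0.1, -0.2]
x = [0.3, -1.2, 2.5]
target = [1.0, 0.0]

grad_W, grad_b = fc_gradients(W, b, x, target)
x_rec = reconstruct_input(grad_W, grad_b)
```

The reconstruction never touches W, b, or the rest of the network, which is the sense in which it is independent of the remaining architecture.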
advanced talking head editing tasks, like pose-manipulation and background-replacement.
Could it really be useful in only these two places?
The final head pose in the output image is given by R_d ← R_u R_d and t_d ← t_u + t_d. In video conferencing, we can change a person’s head pose in the video stream freely despite the original view angle
This is why it counts as "free".
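A minimal sketch of that update, assuming rotations are 3x3 matrices and translations are 3-vectors. The helpers `rot_y`, `matmul3`, and `redirect_pose` and all numbers are hypothetical, chosen only to show how the user-supplied pose (R_u, t_u) composes with the driving pose (R_d, t_d).

```python
import math


def rot_y(deg):
    """Rotation matrix for a yaw of `deg` degrees about the y axis."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def matmul3(A, B):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def redirect_pose(R_d, t_d, R_u, t_u):
    """Apply R_d <- R_u R_d and t_d <- t_u + t_d."""
    return matmul3(R_u, R_d), [u + d for u, d in zip(t_u, t_d)]


# Driving frame has a 10 degree yaw; the user asks for 30 degrees more.
R_new, t_new = redirect_pose(rot_y(10), [0.0, 0.0, 1.0],
                             rot_y(30), [0.1, 0.0, 0.0])
```

Because both rotations here share an axis, the result equals a single 40 degree yaw, which makes the composition easy to check by hand.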
entropy coding scheme
Entropy coding.
apply this processing to each frame in the driving video
Processed frame by frame.
we reuse xc,k, which were extracted from the source image s. This is because the face in the output image must have the same identity as the one in the source image s.
Why keypoints are not extracted from the driving video.
3D keypoint decomposition
Key point: the decomposition of the 3D keypoints.
We note that the extracted keypoints are meant to be independent of the face’s pose and expression. They shall only encode a person’s geometry signature in a neutral pose and expression.
This is what "keypoint" means here.
The Jacobian represents how a local patch around the keypoint can be transformed into the corresponding patch in another image via an affine transformation
So this is really the central idea of FOMM.
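A minimal sketch of that local affine idea, assuming 2D keypoints and a 2x2 Jacobian: a point z near the driving keypoint is mapped into the source image by a first-order expansion, kp_src + J (z - kp_drv). The function name `warp_near_keypoint` and the numbers are hypothetical, not FOMM's actual code.

```python
def warp_near_keypoint(z, kp_src, kp_drv, J):
    """Map point z (driving image) to the source image near one keypoint.

    First-order approximation: kp_src + J @ (z - kp_drv), where J is a
    2x2 Jacobian describing the local affine change around the keypoint.
    """
    dx, dy = z[0] - kp_drv[0], z[1] - kp_drv[1]
    return (kp_src[0] + J[0][0] * dx + J[0][1] * dy,
            kp_src[1] + J[1][0] * dx + J[1][1] * dy)


# A Jacobian of 2 * identity means the local patch is scaled by 2.
J_scale = [[2.0, 0.0], [0.0, 2.0]]
mapped = warp_near_keypoint((12.0, 20.0), (5.0, 5.0), (10.0, 20.0), J_scale)
```

The keypoint itself always maps to its counterpart; the Jacobian only governs how the neighbourhood around it stretches, rotates, or shears.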
paramount importance
paramount importance
Of the very highest importance.
providing explicit control over the generated face from a pretrained StyleGAN
Q1: How can a pretrained StyleGAN be used to control the generated face?
inverse mapping from images to latent codes is nontrivial
The mapping from images to latent codes is not easy.
accessories
Accessories.
footage
Camera shots; footage.
subject-dependent and subject-agnostic models
Subject-dependent vs. subject-agnostic.
Humans are able to guess the whole scene given a partial observation of it. In a similar fashion, we aim to build a generator that trains with image patches, and inference images of unbounded arbitrary-large size.
How exactly is this achieved?
inbetweening
How is this achieved? Could borrow the idea for 小姨爱.
latent space interpolations
This presumably refers to manipulating the target image at the feature level.
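A minimal sketch of latent space interpolation, assuming latent codes are plain vectors and the interpolation is linear. `lerp` and `interpolation_path` are hypothetical helpers; the generator that would decode each intermediate code into an image is left out.

```python
def lerp(z_a, z_b, t):
    """Linear interpolation between two latent codes, with t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def interpolation_path(z_a, z_b, steps):
    """Evenly spaced codes from z_a to z_b, inclusive (needs steps >= 2)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(z_a, z_b, i / (steps - 1)) for i in range(steps)]


# Each code on the path would be passed to a generator to render one image.
path = interpolation_path([0.0, 2.0], [1.0, 0.0], steps=5)
```

The edit happens in the latent (feature) space rather than on pixels, so every intermediate point is decoded into a full image instead of a cross-fade.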
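The properties of the latent space
These starting points all look impressive, but I don't know what they actually mean in practice.
the latent factors of variation.
Does "latent factor of variation" refer to attributes?
linearity of the intermediate latent space
How should the linearity of the intermediate latent space be understood?
we further believe that our investigations to the separation of high-level attributes and stochastic effects,
What exactly does this disentanglement mean here?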
Multi-Head Attention
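Why multi-head attention? Because the attention is dot-product based. Linear layers learn h separate projections, a bit like having multiple conv channels: you get h chances to learn h linear projections from a high-dimensional space to a low-dimensional one.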
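A minimal pure-Python sketch of this reading, assuming self-attention over a short sequence with random, untrained weights. All helper names and sizes are hypothetical; masking, batching, and bias terms are omitted.

```python
import math
import random


def matmul(A, B):
    """Matrix product of nested lists A (n x m) and B (m x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def rand_matrix(rows, cols, rng):
    """Random matrix standing in for a learned linear projection."""
    return [[rng.gauss(0.0, 1.0 / math.sqrt(rows)) for _ in range(cols)]
            for _ in range(rows)]


def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v), each projecting d_model -> d_k."""
    outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
               for W_q, W_k, W_v in heads]
    # Concatenate the per-head outputs token by token, then mix with W_o.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(X))]
    return matmul(concat, W_o)


rng = random.Random(0)
n_tokens, d_model, h = 4, 8, 2
d_k = d_model // h
X = rand_matrix(n_tokens, d_model, rng)
heads = [tuple(rand_matrix(d_model, d_k, rng) for _ in range(3))
         for _ in range(h)]
W_o = rand_matrix(h * d_k, d_model, rng)
Y = multi_head_attention(X, heads, W_o)  # n_tokens x d_model
```

The dot-product step itself has nothing to learn; the learnable parts are the h sets of projections and the output mixing matrix, which is what gives each head its own low-dimensional view of the tokens.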
compatibility function
Similarity.
To facilitate these residual connections
For simplicity.
auto-regressive
Autoregressive: the current token is generated based on all previously generated outputs.
sequence of symbol representations (x1,...,xn) to a sequence of continuous representations z = (z1,...,zn). Given z, the decoder then generates an output sequence (y1,...,ym) of symbols
During encoding the whole sentence can be read at once, but during decoding the output has to be generated one token at a time.
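A toy sketch of that asymmetry, assuming greedy decoding. The "model" here is a stand-in that simply reverses its input; `encode`, `next_token`, and `greedy_decode` are hypothetical names, and no real Transformer is involved.

```python
BOS, EOS = "<s>", "</s>"


def encode(src_tokens):
    """The encoder sees the whole input at once (toy: a reversed copy)."""
    return list(reversed(src_tokens))


def next_token(memory, generated):
    """One decoder step: depends on the memory and on every token so far."""
    pos = len(generated) - 1  # tokens emitted so far, not counting BOS
    return memory[pos] if pos < len(memory) else EOS


def greedy_decode(src_tokens, max_len=50):
    """Generate the output one token at a time until EOS or max_len."""
    memory = encode(src_tokens)       # single pass over the full input
    generated = [BOS]
    for _ in range(max_len):          # sequential: step t needs steps < t
        tok = next_token(memory, generated)
        if tok == EOS:
            break
        generated.append(tok)
    return generated[1:]


result = greedy_decode(["a", "b", "c"])
```

The encoder call happens once, while the decoder loop cannot be parallelized at inference time because each step consumes the tokens produced before it.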
counteract
counteract
albeit
albeit
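This makes it more difficult to learn dependencies between distant positions
Convolutions struggle to model information across long sequences: if two pixel blocks are far apart, many convolution layers are needed before a connection between them is established.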
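A rough count behind this note, assuming stride-1, undilated 1D convolutions of kernel width k. After n layers the receptive field is 1 + n(k - 1), so linking two positions d apart takes about d / (k - 1) layers, whereas self-attention links any pair within a single layer. The function names are hypothetical.

```python
import math


def receptive_field(num_layers, kernel_size):
    """Receptive field of stacked stride-1, undilated 1D convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def conv_layers_needed(distance, kernel_size):
    """Fewest layers so that one output sees two inputs `distance` apart."""
    if kernel_size < 2:
        raise ValueError("kernel_size must be at least 2")
    # The receptive field must span distance + 1 positions.
    return math.ceil(distance / (kernel_size - 1))


layers_for_100 = conv_layers_needed(100, 3)
```

With kernel width 3, two positions 100 steps apart only meet after 50 stacked layers, which is the cost the note is pointing at.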
more parallelization
A higher degree of parallelism.
eschewing
Avoiding; shunning.
sequence transduction
Sequence transduction.
leverage
Leverage (literally, the action of a lever).
stray
To wander.
approximations
Approximations.
from a collection of single-view 2D photographs
What is the relationship among these 2D images? Do they satisfy multi-view geometry, and if so, directly or only indirectly?
only a single source image
So is this image driven by another image or by keypoints?
Deep generative models have had less of an impact, due to the difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to difficulty of leveraging the benefits of piecewise linear units in the generative context. We propose a new generative model estimation procedure that sidesteps these difficulties.
The story of this paper.
unrolled approximate inference
Unrolling the approximate inference process.
Avatars
static or moving image or other graphic representation that acts as a proxy for a person or is associated with a specific digital account or identity, as on the internet
an unordered set generation problem
What's set generation?
autoregressively
How to understand this "autoregressively"?
a set of rules
What are those rules?
incorporating
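To merge; to incorporate.
view-dependent emitted radiance
What on earth is this?
surface element representation
A deep understanding of what this actually is will be needed.
Point clouds are a simple representation that also supports arbitrary topology [21, 39, 77] and does not require data registration, but highly detailed geometry requires many points.
So after all that discussion the focus lands on point clouds; presumably the earlier representations have already been explored about as far as they go.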
producing high-quality point clouds
What does this work have to do with point clouds? Is it really a point cloud here and not a mesh?
To enable learning, the choice of representation is the key
This is the key.
entry
Means each individual element.
A semantic position encoding mechanism is designed to facilitate semantic-level position information and preserve the texture patterns in the exemplars
Worth noting down; quite novel.
memory cost
Don't quite understand this.
comes at a price.
Has a cost.
learning this transformation completely without built-in priors and can even learn to predict depth in an unsupervised fashion
How is this done? It is unsupervised and can still estimate depth?
To remedy these issues
Another way of saying "to tackle".
Extrapolation
To infer; to extrapolate.
desired change in viewpoint.
So the guidance signal in this paper is the change in camera viewpoint? If so, that is quite interesting.
a probabilistic formulation necessary to capture the ambiguity inherent in predicting novel views from a single image, thereby overcoming the limitations of previous approaches that are restricted to relatively small viewpoint changes
How should this be understood?
high fidelity.
High fidelity.
fully automatic system
What does "fully automatic system" mean here?
interpenetrate
interpenetrate
Despite this progress, a significant limitation of these environments is that they do not contain people.
The problem being addressed.
conditional variational autoencoder
Conditional variational autoencoder (CVAE).
placing 3D human bodies in 3D environments naturally
This is essentially the one thing this paper does.
affordance
A psychology term for an interaction relationship; rendered in Chinese as "示能".
the 3D scene structure and the proximity between the body and the scene are not explicitly modeled, especially for the regions that are occluded from the camera view, making it hard to effectively enforce constraints in 3D, such as no inter-penetration and proper contact.
The motivation of this paper.
primal reason
Another way of saying "main reason".
their results are still far behind the captured real human-scene interaction such as
Learn this way of phrasing.
generated proximal relationship
What does this relationship mean? Is it simply distance?
basis points
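Basis points.
Best results except the ground truth are shown in boldface
Phrasing the sentence this way reads very professionally.
purely geometry-
One of the more central novel points.
which however often leads to loss of fine spatial information
So how can an MLP avoid losing its spatial capability? If convolutions were added to the MLP, could it replace the Transformer and perhaps be more lightweight?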