the agent learns skills that increase in complexity
How do they know what skills the agent learns?
σ
What is sigma here?
−V*
Why care about this?
(1 − λ)v(s_{t+1}) + λV^λ_{t+1}
If lambda is between 0 and 1, then this part of the equation is just equal to v(s_{t+1}), right? In that case, why do we need lambda at all?
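The excerpt looks like one term of a λ-return recursion (Dreamer-style notation, assumed): V^λ_t = r_t + γ((1 − λ)v(s_{t+1}) + λV^λ_{t+1}), with a bootstrap v(s_H) at the horizon. A minimal sketch, with illustrative names (`rewards`, `values`, `lam`) not taken from the paper:

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Compute lambda-returns backwards over a trajectory.

    rewards[t] is the reward at step t; values has one extra entry,
    the bootstrap value v(s_H) at the final state.
    """
    H = len(rewards)
    returns = [0.0] * H
    next_return = values[H]  # bootstrap with v(s_H)
    for t in reversed(range(H)):
        returns[t] = rewards[t] + gamma * (
            (1 - lam) * values[t + 1] + lam * next_return
        )
        next_return = returns[t]
    return returns
```

Note that with lam=0 the target collapses to the one-step TD target r_t + γv(s_{t+1}), while with lam=1 it becomes the full Monte Carlo return with a final bootstrap; intermediate λ interpolates between the two rather than reducing to v(s_{t+1}).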
z^k_t, z^k_{t+1}
How do they make sure that the object embeddings are aligned e.g. that the k-th object in z_t is the same as the k-th object in z_t+1?
z^k_t + T^k(z_t, a_t), z^k_{t+1}
Won't this always be 0, since the first and second terms in the distance function are calculated in the same way? The definition of z_t+1 (as defined earlier) is "z_t+1 = z_t + T(z_t,a_t)"
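If this is the C-SWM-style contrastive loss (Kipf et al., 2020 — an assumption based on the notation), the second argument of the distance is the encoder's embedding of the *observed* next frame, while the first is the transition model's prediction, so the energy is zero only when the prediction is exact. A minimal sketch with illustrative names:

```python
import numpy as np

def transition_energy(z_t, delta, z_next_encoded):
    """Squared Euclidean distance between predicted and encoded next state."""
    pred = z_t + delta  # delta stands in for T(z_t, a_t)
    return float(np.sum((pred - z_next_encoded) ** 2))

# perfect prediction -> zero energy; imperfect -> positive
e_good = transition_energy(np.array([1.0, 2.0]), np.array([0.5, 0.0]),
                           np.array([1.5, 2.0]))
e_bad = transition_energy(np.array([1.0, 2.0]), np.array([0.5, 0.0]),
                          np.array([2.0, 2.0]))
```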
Each feature map m^k_t = [E_ext(s_t)]_k can be interpreted as an object mask corresponding to one particular object slot, where [...]_k denotes selection of the k-th feature map
How do they make sure that each feature map actually represents an object, and not something else from the image (e.g. just a fixed section of the image)?
For each target
What is the target?
On each iteration, the single predicate that improves J_surr the most is added to the set
What if none of the invented predicates makes the problem solvable? Then every predicate should have the same evaluation?
number of blocks (5 for Easy, 6 for Medium, and 7 for Hard)
7 blocks seems incredibly hard
spatial reasoning capabilities of VLMs
Do they exist?
as illustrated in Fig. 7
Isn't this the same as the original model?
This implies that generating I and G is more challenging than O
They also depend on a correct O, so this should be expected?
We found from the outputs that ViLaIn tends to omit some propositions in this domain, making the PDs invalid
Any specific ones?
recall
Why only recall?
For the Hanoi domain, L is identical through all problems.
This NL instruction is actually interesting, and perhaps quite complex for the model to handle. But if the instruction is the same across all tasks, it can essentially copy the prompt examples and use that PDDL.
The three pegs are named by the number from left to right (e.g., peg1, peg2, and peg3)
What if the pegs would not be named in order? Would that reduce performance? Now the model can rely on the object name index to determine the order.
All four PDDL problem descriptions represent the planning problem of stacking one block onto another
However, not all of the goals reflect the goal state shown in the image; only the first one does.
Objects can be picked up, dropped and moved around by the agent.
How does the agent know if it is holding an object?
we call a PDDL Planner
What is the domain and instance in this case?
In real life, the environment is often described with natural language texts.
No it is not. The environment is observed through sensory readings e.g. vision or touch.
Another limitation stems from the inherent randomness within LLMs, c
What randomness?
The decoder D reconstructs an image x̂_0 = D(z_0), from which the policy π predicts the action a_0.
Why is not the latent embedding z used as the input to the policy?
setting a new state of the art for methods without lookahead search
Isn't the world model used to do search?
Somewhat surprisingly, the lowest scores in Blocksworld are associated with BlockAmbiguity and KStacksColor; these two problems require the LLM to associate objects based on their color and we had a priori expected the LLM to be capable of such associations and perform well on this task.
This kind of makes sense, because the colors are not explicitly modelled but are part of the object's name, e.g. "red_block_1" rather than "red(block_1)". The latter would be a more natural way to express colors, since a color is a property of an object.
fan-in of the layer.
What is "fan-in"?
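"Fan-in" is the number of input connections feeding a unit of the layer (for a dense layer with weight matrix of shape (fan_in, fan_out), it is the first dimension). Variance-scaling initializers divide by it so the pre-activation variance stays roughly constant across layers. A minimal sketch of a He/Kaiming-style fan-in initializer (an assumed, common scheme, not necessarily the paper's exact one):

```python
import numpy as np

def he_init(fan_in, fan_out, seed=0):
    """Sample weights with std sqrt(2 / fan_in) (He-style normal init)."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = he_init(512, 256)  # fan-in = 512 inputs per output unit
```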
q_{s_t}, q_{a_t} are positive over S and A respectively
What does this mean?
z_e(x)
What is the dimensions of this?
3 × 3 blocks
What does "3 x 3 blocks" mean?
Our proposal distribution q(z = k|x) is deterministic, and by defining a simple uniform prior over z we obtain a KL divergence constant and equal to log K.
What?
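The claim follows directly from the KL definition: if q puts all its mass on a single code k* and the prior is uniform, KL(q‖p) = Σ_k q_k log(q_k/p_k) = 1·log(1/(1/K)) = log K, independent of x. A numerical check (K and the chosen code are arbitrary):

```python
import math

K = 512
q = [0.0] * K
q[7] = 1.0            # deterministic proposal: all mass on one codebook entry
p = [1.0 / K] * K     # uniform prior over the K codes

# KL(q || p), skipping zero-mass terms (0 * log 0 -> 0)
kl = sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)
assert math.isclose(kl, math.log(K))
```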
forces the autoencoder to focus on object positions
This is still unclear to me
This is not directly feasible with conventional policy gradient formulations
Why not?
Â_t is an estimator of the advantage function at timestep t
How is this calculated?
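PPO-style papers usually instantiate this estimator with Generalized Advantage Estimation (GAE): Â_t = Σ_l (γλ)^l δ_{t+l}, where δ_t = r_t + γv(s_{t+1}) − v(s_t). A sketch assuming that choice (the excerpt's exact estimator may differ):

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    values has one more entry than rewards (bootstrap value at the end).
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        # one-step TD error
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # discounted sum of future TD errors
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With lam=0 this reduces to the one-step TD error; with lam=1 it is the full discounted sum of TD errors.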
quality
How is the quality measured?
The difference is that we only record objects that are either action arguments or in contact with them.
How do you know, when the model is not learned yet?
A “world” frame serves as a default frame of reference for every object in the environment
Is this what the dataset consists of? Sequences of world frames?
However, these approaches assume high-level actions to be provided as input.
No they don't; at least not Asai et al. (2022).
A is an uncountably infinite set of primitive deterministic actions.
So actions are continuous and not discrete?
One can view the noise vector z in such a GAN as a feature vector, containing some representation of the transition to o′ from o.
How can it contain a representation of the transition if it is just noise?
We define action predicates P_A = {left(1), left(2), right(1), right(2), jump(1), idle(1), ...} and state predicates P_S = {type, closeby, ...}
How did they come up with these?
This dataset will contain a set of tuples, (s, a, s′), of states, actions, and next states
What is a state?
More recently, Chen et al. (2022) explored a variant of DreamerV2 where a Transformer replaces the recurrent network in the RSSM
Then what is the novelty in this paper?
straight-through estimator
?
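The straight-through estimator handles non-differentiable operations (rounding, argmax, quantization): the forward pass applies the operation, while the backward pass pretends its Jacobian is the identity. In autodiff frameworks this is commonly written as y = x + stop_gradient(round(x) − x), so the value equals round(x) but gradients flow as if y = x. A framework-free sketch (illustrative only):

```python
def straight_through_round(x):
    """Forward: quantize x. Backward: treat dy/dx as 1 instead of the
    true derivative of round(), which is 0 almost everywhere and would
    block all gradient flow."""
    value = round(x)   # non-differentiable forward pass
    grad = 1.0         # surrogate gradient used in the backward pass
    return value, grad

y, dy_dx = straight_through_round(2.7)
```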
object state vectors
Where do the object state vectors come from?
DeepMind Lab dataset
How is this dataset structured? There is no "fixed" dataset in the DeepMind Lab repo
showing good variability over the irrelevant factors
Not really. For the "white suitcase" scene it only differs in wall colors and floor colors, but the "black and white" representation of the scene is the same. Essentially there could be a way larger range of scenes where a white suitcase appears.
blue wall
Is "blue wall" a compositional concept or an atomic one?
small, round, red
Are these "features" hand-crafted?
few example images of an apple paired with the symbol “apple”
This is not unsupervised data
unlabeled set of image pairs
It's kind of labelled because they know that an action has taken place between the images, just not what action it is.
v_ψ(s_τ)
What is the difference between this and \(V_\lambda\)?
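The notation matches Dreamer-style actor-critic learning; under that (assumed) reading, v_ψ(s_τ) is the learned value network and V_λ is the multi-step λ-return target it is regressed toward:

```latex
% Assumed Dreamer-style reading (the excerpt's paper may differ):
% v_\psi is the value model; V_\lambda is the imagined multi-step target.
\min_{\psi} \; \mathbb{E}\!\left[ \tfrac{1}{2}
  \big( v_\psi(s_\tau) - V_\lambda(s_\tau) \big)^2 \right]
```

That is, v_ψ is the function approximator, while V_λ is a (stop-gradient) regression target built from imagined rewards plus v_ψ bootstraps.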
dataset of past experience
Where does this data come from? Random exploration?
finite imagination horizon
What's the alternative, infinite imagination horizon? Seems impossible
blocks1-5 (arm, 5 blocks)
Why only up to 5 blocks?
in many cases optimally
What does it mean to solve them optimally?
In one case, the input data corresponds to one or more state graphs G_i assumed to originate from hidden planning instances P_i = ⟨D, I_i⟩ that need to be uncovered
Isn't the domain needed in order to generate the state graph?
The latter approaches are less likely to generate crisp representations due to the dependence on images
Why?
a latent policy via behavior cloning
How is this done?
π(o_t)
How is this value known?
Moveover
Moreover?
is
Remove
before and after the action of interest is taken
Does every next observation depend on an action, or can the environment change "by itself"?
which predicts which action a_t was taken by the agent between consecutive observations o_t and o_{t+1}
How is this trained when the action is not known?
prior distribution over programs likely to solve tasks in the domain
What does this prior distribution mean? The probability of the program to solve any task in the domain? Is there even any programs that would solve multiple tasks?
positive probability
What does it mean that a program has a "positive probability" of solving a task?
Search for programs
Search for programs where? How are these programs created?
find best program
How is best defined?
We selected the tasks on which Tassa et al. (2018) report non-zero performance from image inputs
Why?
Finally, we call a PDDL Planner as the deterministic solver to obtain A, a plan to accomplish the goal CSL under the predefined scenario.
With what PDDL domain?
Task descriptions are constructed using PDDL and symbolic plans are generated using the FAST-DOWNWARD planner
To generate a symbolic plan, an initial state (problem file) needs to be given. What does this look like? Are there only three problem files (one for each problem) representing some "general" state? Shouldn't the initial plan depend on the initial state?
Our set-up automatically parses LLM-generated language into a program using our synthetic grammar
How?
Also, how do they handle cases where the parser generates incorrect PDDL? Wouldn't that give the LLM-as-planner a worse score than it actually should have?
The P+S model outputs executable PDDL actions
How do you make sure of this?
Even with high Exec, some task GCR are low, because some tasks have multiple appropriate goal states, but we only evaluate against a single “true” goal
This seems like an unfair way to evaluate the model
SR is the fraction of executions that achieved all task-relevant goal-conditions
How are the goal-conditions specified and where do they come from?
We provide the available objects in the environment as a list of strings
How are these objects retrieved? Automatically or manually?
“The bowl can also be a container to fill water”, will be added to the task planner.
Where does this come from? The LLM? Template?
perfectly match gold visual semantic plans using only the text directives as input
Where do they say how they provide the state representation to the model?
Generated strings from all models are post-processed for common errors in sequence-to-sequence models, including token doubling, completing missing bigrams (e.g. “pick <arg1>” → “pick up <arg1>”), and heuristics for adding missing argument tags
Probably won't generalize well to new domains
The ALFRED dataset contains 6,574 gold command sequences
Didn't the "Understanding Language in Context" paper mention that it was around 8k data samples?
Lastly, the goal predicates for each problem were generated from the "PDDL parameters" field of every data sample.
What is this field?
Thus, we have created a PDDL domain file using our knowledge of the objects and actions in the ALFRED world and a PDDL problem file for each sample
I assume that the domain file is created manually, but are the problem files also created by hand? If so, that seems like a lot of work: the dataset has 8,055 visual samples, so the same number of problem files would need to be hand-coded.
Since in our task we ignore the vision part of the data, we might encounter some duplicates between our datasets
How do they get the scene representation from the visual data? Is this included in the ALFRED dataset?
ALFWorld uses PDDL - Planning Domain Definition Language (McDermott et al., 1998) to describe each scene from ALFRED and to construct an equivalent text game using the TextWorld engine.
How is the PDDL created?