- Aug 2024
-
www.semanticscholar.org
-
RealSense 455 fixed cameras, namely the navigation and the manipulation camera, both with a vertical field of view of 59° and capable of 1280×720 RGB-D image capture. The navigation camera is placed looking in the agent's forward direction and points slightly down, with the horizon at a nominal 30°
sim2real transfer
-
We discretize the action space into 20 actions: Move Base (±20 cm), Rotate Base (±6°, ±30°), Move Arm (x, z) (±2 cm, ±10 cm), Rotate Grasper (±10°), pickup, dropoff, done with subtask, and terminate.
Action space for navigation: * move_ahead (+20 cm) * rotate left/right (±30°) * end
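A minimal Python sketch of how the full 20-action discretization could be enumerated; the identifiers, ordering, and units below are illustrative, not SPOC's actual action names.

```python
# Minimal sketch of the 20-action discretization described above.
# Action names and ordering are illustrative, not SPOC's actual identifiers.
DISCRETE_ACTIONS = (
    [("move_base", d) for d in (+0.20, -0.20)]                   # base translation, +/-20 cm
    + [("rotate_base", d) for d in (+6, -6, +30, -30)]           # base rotation, +/-6 and +/-30 deg
    + [("move_arm_x", d) for d in (+0.02, -0.02, +0.10, -0.10)]  # arm x, +/-2 and +/-10 cm
    + [("move_arm_z", d) for d in (+0.02, -0.02, +0.10, -0.10)]  # arm z, +/-2 and +/-10 cm
    + [("rotate_grasper", d) for d in (+10, -10)]                # grasper rotation, +/-10 deg
    + [("pickup", None), ("dropoff", None),
       ("sub_done", None), ("terminate", None)]                  # discrete task actions
)
assert len(DISCRETE_ACTIONS) == 20
```

The navigation-only subset in the note (move_ahead, rotate left/right, end) is the corresponding restriction of this set.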
-
Using GPT-3.5, we extract five short descriptions of common usages for each synset, given the synset's definition, and a score indicating the confidence in the correctness of the usage.
affordances - GPT-3.5 usage descriptions per synset
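A hedged sketch of how such an extraction call might look with the OpenAI chat API; the prompt wording, the "gpt-3.5-turbo" model string, and the output handling are assumptions, not the authors' actual pipeline.

```python
# Sketch of the affordance-extraction step: ask GPT-3.5 for five short
# common-usage descriptions of a synset plus a confidence score per usage.
# Prompt text and parsing are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def usages_for_synset(synset_name: str, definition: str) -> str:
    prompt = (
        f"Synset: {synset_name}\nDefinition: {definition}\n"
        "List five short descriptions of common usages of this object. "
        "For each usage, give a confidence score between 0 and 1 that it is correct."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```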
-
A total of 191,568 houses are sampled from this distribution, with a ratio of 10:1:1 across training, validation, and test.
~192k houses (191,568), 10:1:1 train/val/test
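Quick arithmetic on the split sizes implied by an exact 10:1:1 ratio (the paper's per-split counts aren't quoted here):

```python
# Split sizes implied by a 10:1:1 ratio over 191,568 sampled houses (check only).
total_houses = 191_568
unit = total_houses // 12                 # 10 + 1 + 1 equal parts
train, val, test = 10 * unit, unit, unit
print(train, val, test)                   # 159640 15964 15964
```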
-
For the model with detection (SPOC w/ GT Det or SPOC w/ DETIC), we encode the coordinates of the bounding boxes using sinusoidal positional encoding followed by a linear layer with LayerNorm and ReLU. We also add coordinate type embeddings to differentiate the 10 coordinates (5 per camera: x1, y1, x2, y2, and area). The coordinate encodings are then concatenated with the two image features and text features before feeding into the Transformer Encoder E_visual. When no object is detected in the image, we use 1000 as the dummy coordinate value and set the area to zero.
Bounding box encoding (sketched below)
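A rough PyTorch sketch of the described box encoding: sinusoidal encoding of the 10 coordinates, a Linear → LayerNorm → ReLU projection, and learned coordinate-type embeddings. The dimensions and the exact sinusoidal formulation are assumptions.

```python
# Sketch of the bounding-box coordinate encoding described above.
import math
import torch
import torch.nn as nn

def sinusoidal_encoding(values: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """values: (..., n_coords) scalars -> (..., n_coords, dim) sinusoidal features."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    angles = values.unsqueeze(-1) * freqs              # (..., n_coords, half)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

class BoxCoordEncoder(nn.Module):
    def __init__(self, enc_dim: int = 128, out_dim: int = 512, n_coords: int = 10):
        super().__init__()
        self.enc_dim = enc_dim
        self.proj = nn.Sequential(nn.Linear(enc_dim, out_dim),
                                  nn.LayerNorm(out_dim), nn.ReLU())
        # One learned embedding per coordinate slot (x1, y1, x2, y2, area x 2 cameras).
        self.type_emb = nn.Embedding(n_coords, out_dim)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        """coords: (batch, 10); missing detections use 1000 as the dummy value, area = 0."""
        feats = self.proj(sinusoidal_encoding(coords, self.enc_dim))  # (B, 10, out_dim)
        return feats + self.type_emb.weight                           # broadcast over batch
```

Per the quote, these per-box features would then be concatenated with the two image features and the text features before the visual transformer encoder.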
-
Using 16-bit mixed precision training, SPOC trains at approximately 1.2 hours per 1000 iterations.
16-bit - 1.2 hours for 1000 iter, -> 60 hours for 50k
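Quick check of the wall-clock estimates implied by that rate:

```python
# ~1.2 hours per 1000 iterations, as quoted above.
hours_per_1k = 1.2
print(hours_per_1k * 20_000 / 1000)   # ~24 hours for a 20k-iteration single-task run
print(hours_per_1k * 50_000 / 1000)   # ~60 hours for a 50k-iteration multi-task run
```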
-
SPOC uses SIGLIP image and text encoders that produce 84 × 768 (n_patch × d_image) features and a 768-dimensional feature per text token. We use a 3-layer transformer encoder and decoder with a 512-dimensional hidden state (d_visual) and 8 attention heads for the goal-conditioned visual encoder and action decoder, respectively. We use a context window of 100. All models are trained with a batch size of 224 on 8×A100 GPUs (80 GB memory / GPU) with the AdamW optimizer and a learning rate of 0.0002. Single-task models are trained for 20k iterations, while multi-task models are trained for 50k iterations.
- encoder - SigLIP (84 patches, 768 embedding dim)
- transformer - 3 layers, 512 hidden state, 8 heads
- context - 100
- batch size - 224
- hardware - 8×A100 GPUs (80 GB each)
- optim - AdamW
- learning rate - 0.0002
- iterations - 20k (single-task) / 50k (multi-task)
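The hyperparameters above collected into a config sketch; the field names are illustrative, not SPOC's actual configuration schema.

```python
# Condensed sketch of the training setup listed above (illustrative field names).
from dataclasses import dataclass

@dataclass
class SPOCTrainConfig:
    image_encoder: str = "SigLIP"      # 84 patches x 768-dim image features
    text_encoder: str = "SigLIP"       # 768-dim feature per text token
    num_layers: int = 3                # transformer encoder and decoder depth
    hidden_dim: int = 512              # d_visual
    num_heads: int = 8
    context_window: int = 100
    batch_size: int = 224              # on 8x A100 (80 GB)
    optimizer: str = "AdamW"
    learning_rate: float = 2e-4
    iterations_single_task: int = 20_000
    iterations_multi_task: int = 50_000
    mixed_precision: str = "fp16"
```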
-
Real world results.
sim ObjNav = 85 (DETIC), real ObjNav = 83.3 (DETIC)
-
100 houses with 1000 episodes each, and 10k houses with 10 episodes each.
10k houses with 10 episodes > 100 houses with 1k episodes
-
Effect of context window
-
Comparing different image encoders
SigLIP (ViT-B) > DINOv2 (ViT-S) > CLIP (ResNet-50)
-
Swapping transformer encoder and decoder with alternative architectures.
Table 4: * transformer > GRU
-
Training on single tasks, IL outperforms RL even with meticulous reward shaping.
Table 3: * SPOC > EmbSigLIP (RL) * SPOC single-task ≈ multi-task * SPOC w/ detector > without detector
-
our IL-trained SPOC dramatically outperforms the popular RL-trained EmbCLIP architecture
SPOC outperforms EmbCLIP
-
SPOC with ground truth detection (provided by the simulator) shows a 15% absolute average success rate gain across all tasks on CHORES-S and an even larger 24.5% absolute gain on CHORES-L, where the detection problem is harder due to the larger object vocabulary.
comparing SPOC with detector and without
-
SPOC uses SIGLIP image and text encoders. We use a 3-layer transformer encoder and decoder and a context window of 100. All models are trained with batch size = 224, AdamW, and LR = 0.0002. Single-task models and multi-task models are trained for 20k and 50k iterations, respectively. Using 16-bit mixed precision training, SPOC trains at an FPS of ≈3500, compared to an FPS of ≈175 for RL implemented using AllenAct.
training params: * 3-layer transformer encoder, decoder * context 100 * batch_size 224 * optim AdamW * learning rate 0.0002 * iterations 20k-50k * IL training ≈3500 FPS vs ≈175 FPS for RL (AllenAct)
-
To analyze models, we first present results on a subset of 15 object categories from the full 863 categories, called S. Evaluations on the full category set are named L. Our training data contains an average of 90k episodes per task, and on average each task contains 195 episodes in the evaluation benchmark.
Training - 90k episodes per task. Eval - 195 episodes per task
-
Practically, RL in complex visual worlds is sample inefficient, especially when using large action spaces and for long-horizon tasks.
RL downside: sample inefficiency
-
Phone2Proc [19], where a scanned layout of the real-world house is used to generate many simulated variations for agent fine-tuning
many environments from one layout
-
Imitation learning has recently gained popularity in robotics, significantly impacting areas like autonomous driving [14, 36, 43, 48, 52, 55].
imitation learning in autonomous driving
-
[1, 16, 39, 53, 61, 77]. Recent simulators model realistic robotic agents but often trade off physical fidelity to increase simulation speed [21, 22, 24, 41, 42, 54, 63, 68, 78].
simulators: Matterport3D, AI2-THOR, VirtualHome, Habitat, Gibson
-
When trained jointly on four tasks – Object-Goal Navigation (OBJNAV), Room Visitation (ROOMVISIT), PickUp Object (PICKUP), and Fetch Object (FETCH) – SPOC achieves an impressive average success rate of 49.9% in unseen simulation environments at test time.
success rate in simulation ≈ 50% (49.9% avg across the 4 tasks)
-
success rate of 85% for OBJNAV.
ObjNav success rate = 85%
-
200,000 procedurally generated houses containing 40,000 unique 3D assets.
How do the authors generate this data?
-