January 12, 2025
An Empirical Study of Autoregressive Pretraining from Videos
title: An Empirical Study of Autoregressive Pretraining from Videos
publish date:
2025-01-09
authors:
Jathushan Rajasegaran et al.
paper id
2501.05453v1
abstract:
We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at https://brjathu.github.io/toto/
QA:
coming soon
Edited and compiled by: wanghaisheng. Last updated: January 12, 2025