TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language UnderstandingShare on Twitter Facebook LinkedIn Previous Next