Computer Vision Group
TUM School of Computation, Information and Technology
Technical University of Munich


Seminar: Unsupervised Learning in Computer Vision (5 ECTS)

Winter Semester 2025/26, TU München

Organisers: Dominik Schnaus, Christoph Reich

Please direct questions to ulcv-ws25@vision.in.tum.de

News

2025-07-17: The preliminary meeting will take place online from 13:30 to 14:00 on 17.07.2025 via Zoom (https://tum-conf.zoom-x.de/j/69788622207?pwd=7bMo0rwFUutWalbbTzfs6yEEUJjYPJ.1). Please attend to learn more about the course or to ask questions; attendance at the preliminary meeting is not mandatory.

Course Description

Recent progress in visual understanding has predominantly been driven by supervised learning (e.g., SAM, Depth Anything). However, acquiring large amounts of annotated data is highly labour-intensive and, for certain applications, even infeasible. Unsupervised approaches, such as DINO or SMURF, eliminate the need for ground-truth annotations. This seminar will discuss some of the most significant and recent advances in unsupervised learning for visual understanding, ranging from self-supervised representation learning to unsupervised segmentation and reconstruction.
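
To make concrete how such methods learn without annotations, here is a minimal, illustrative sketch of a contrastive objective in the spirit of SimCLR: two augmented views of the same image serve as each other's training signal, so no labels are required. The encoder is deliberately omitted; the random tensors and hyperparameters below are stand-ins, not part of any specific paper's recipe.

```python
# Minimal sketch of a SimCLR-style contrastive (NT-Xent) loss.
# Two augmented views of the same image are positives for each other;
# all other images in the batch act as negatives. No labels are needed.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (N, D) embeddings of two augmented views of the same N images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit-norm
    sim = z @ z.t() / temperature                         # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                     # ignore self-similarity
    n = z1.shape[0]
    # For view i, the positive is the other view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Toy usage: random vectors stand in for encoder outputs of 8 images.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(nt_xent_loss(z1, z2))
```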

Prerequisites

Machine Learning (IN2064), Introduction to Deep Learning (IN2346), or similar

Topics

Please choose a sub-topic, such as "Clustering-Based Methods" or "Paired Pre-Training", not individual papers! Short illustrative code sketches for a few of these technique families follow the topic list.

  • Representation Learning in CV
    • Clustering-Based Methods
      1. Deep clustering for unsupervised learning of visual features [Caron et al., ECCV 2018] paper link
      2. Self-labelling via simultaneous clustering and representation learning [Asano et al., ICLR 2020] paper link
      3. Unsupervised learning of visual features by contrasting cluster assignments [Caron et al., NeurIPS 2020] paper link
      4. Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning [Venkataramanan et al., arXiv 2025] paper link
    • Contrastive Methods (Fundamental contrastive methods)
      1. Representation learning with contrastive predictive coding [van den Oord et al., arXiv 2018] paper link
      2. A simple framework for contrastive learning of visual representations [Chen et al., ICML 2020] paper link
    • Contrastive Methods (Momentum contrast)
      1. Momentum contrast for unsupervised visual representation learning [He et al., CVPR 2020] paper link
      2. Improved baselines with momentum contrastive learning [Chen et al., arXiv 2020] paper link
      3. An empirical study of training self-supervised vision transformers [Chen et al., ICCV 2021] paper link
    • Information Maximization
      1. Barlow Twins: Self-supervised learning via redundancy reduction [Zbontar et al., ICML 2021] paper link
      2. VICReg: Variance-Invariance-Covariance Regularization For Self-Supervised Learning [Bardes et al., ICLR 2022] paper link
    • Masked Autoencoders (Image)
      1. BEiT: BERT pre-training of image Transformers [Bao et al., ICLR 2022] paper link
      2. Masked autoencoders are scalable vision learners [He et al., CVPR 2022] paper link
      3. BEiT v2: Masked image modeling with vector-quantized visual tokenizers [Peng et al., arXiv 2022] paper link
    • Masked Autoencoders (Video)
      1. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training [Tong et al., NeurIPS 2022] paper link
      2. Masked autoencoders as spatiotemporal learners [Feichtenhofer et al., NeurIPS 2022] paper link
      3. VideoMAE v2: Scaling video masked autoencoders with dual masking [Wang et al., CVPR 2023] paper link
    • Joint-Embedding Predictive Architecture (Image)
      1. Self-supervised learning from images with a joint-embedding predictive architecture [Assran et al., ICCV 2023] paper link
      2. Masked siamese networks for label-efficient learning [Assran et al., ECCV 2022] paper link
      3. iBot: Image BERT Pre-training with Online Tokenizer [Zhou et al., ICLR 2022] paper link
    • Joint-Embedding Predictive Architecture (Video)
      1. V-jepa: Latent video prediction for visual representation learning [Bardes et al., OpenReview 2022] paper link
      2. V-jepa 2: Self-supervised video models enable understanding, prediction and planning [Assran et al., arXiv 2025] paper link
    • Self-Distillation Methods (Fundamental self-distillation methods)
      1. Bootstrap your own latent: A new approach to self-supervised learning [Grill et al., NeurIPS 2020] paper link
      2. Exploring simple siamese representation learning [Chen et al., CVPR 2021] paper link
      3. Data2vec: A general framework for self-supervised learning in speech, vision and language [Baevski et al., ICML 2022] paper link
    • Self-Distillation Methods (DINO)
      1. Emerging properties in self-supervised vision transformers [Caron et al., ICCV 2021] paper link
      2. DINOv2: Learning Robust Visual Features without Supervision [Oquab et al., TMLR 2024] paper link
      3. DINOv3 [Siméoni et al., arXiv 2025] paper link
    • Self-Distillation Methods (Video self-distillation)
      1. Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video [Venkataramanan et al., ICLR 2024] paper link
    • Diffusion Features
      1. Emergent correspondence from image diffusion [Tang et al., NeurIPS 2023] paper link
      2. Cleandift: Diffusion features without noise [Stracke et al., CVPR 2025] paper link
  • Depth, Optical Flow & 3D Reconstruction
    • Monodepth
      1. Unsupervised monocular depth estimation with left-right consistency [Godard et al., CVPR 2017] paper link
      2. Digging into self-supervised monocular depth estimation [Godard et al., ICCV 2019] paper link
    • Monocular Depth for Dynamic Scenes
      1. Dynamo-depth: Fixing unsupervised depth estimation for dynamical scenes [Sun et al., NeurIPS 2023] paper link
      2. ProDepth: Boosting self-supervised multi-frame monocular depth with probabilistic fusion [Woo et al., ECCV 2024] paper link
    • Optical Flow
      1. What matters in unsupervised optical flow [Jonschkowski et al., ECCV 2020] paper link
      2. SMURF: Self-teaching multi-frame unsupervised raft with full-image warping [Stone et al., CVPR 2021] paper link
    • Scene Flow
      1. Self-supervised monocular scene flow estimation [Hur et al., CVPR 2020] paper link
      2. EMR-MSF: Self-supervised recurrent monocular scene flow exploiting ego-motion rigidity [Jiang et al., ICCV 2023] paper link
    • Pose Estimation & 3D Reconstruction
      1. AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos [Wimbauer et al., CVPR 2025] paper link
      2. Structure-from-motion revisited [Schonberger et al., CVPR 2016] paper link
    • Single-Image 3D Reconstruction
      1. Behind the scenes: Density fields for single view reconstruction [Wimbauer et al., CVPR 2023] paper link
      2. Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion [Jevtic et al., ICCV 2025] paper link
  • Scene-Understanding
    • Object-Centric Learning
      1. Object-centric learning with slot attention [Locatello et al., NeurIPS 2020] paper link
    • Instance Segmentation
      1. Deep spectral methods: A surprisingly strong baseline for unsupervised semantic segmentation and localization [Melas-Kyriazi et al., CVPR 2022] paper link
      2. Cut and learn for unsupervised object detection and instance segmentation [Wang et al., CVPR 2023] paper link
    • Semantic Segmentation
      1. Unsupervised Semantic Segmentation by Distilling Feature Correspondences [Hamilton et al., ICLR 2022] paper link
      2. Unsupervised semantic segmentation through depth-guided feature correlation and sampling [Sick et al., CVPR 2024] paper link
    • Panoptic Segmentation
      1. Unsupervised universal image segmentation [Niu et al., CVPR 2024] paper link
      2. Scene-Centric Unsupervised Panoptic Segmentation [Hahn et al., CVPR 2025] paper link
  • Unsupervised Style Transfer
    1. Unpaired image-to-image translation using cycle-consistent adversarial networks [Zhu et al., ICCV 2017] paper link
  • Language Models
    • BERT
      1. Bert: Pre-training of deep bidirectional transformers for language understanding [Devlin et al., NAACL 2019] paper link
      2. Roberta: A robustly optimized bert pretraining approach [Liu et al., arXiv 2019] paper link
    • GPT
      1. Improving language understanding by generative pre-training [Radford et al., 2018] paper link
      2. Language models are unsupervised multitask learners [Radford et al., 2019] paper link
      3. Language models are few-shot learners [Brown et al., NeurIPS 2020] paper link
      4. GPT-4 technical report [Achiam et al., arXiv 2024] paper link
    • DeepSeek
      1. Deepseek-V3 technical report [Liu et al., arXiv 2024] paper link
      2. Deepseek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning [Guo et al., arXiv 2025] paper link
    • Unsupervised Translation
      1. Word translation without parallel data [Conneau et al., ICLR 2018] paper link
      2. Harnessing the universal geometry of embeddings [Jha et al., arXiv 2025] paper link
  • Vision-Language
    • Paired Pre-Training
      1. Learning transferable visual models from natural language supervision [Radford et al., ICML 2021] paper link
      2. Sigmoid loss for language image pre-training [Zhai et al., ICCV 2023] paper link
    • Using Less Pairs
      1. Lit: Zero-shot transfer with locked-image text tuning [Zhai et al., CVPR 2022] paper link
      2. Relative representations enable zero-shot latent space communication [Moschella et al., ICLR 2023] paper link
      3. ASIF: Coupled data turns unimodal models to multimodal without training [Norelli et al., NeurIPS 2023] paper link
    • Alignment & Translation
      1. The platonic representation hypothesis [Huh et al., ICML 2024] paper link
      2. It’s a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data [Schnaus et al., CVPR 2025] paper link
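
For the self-distillation family (BYOL, DINO), the following minimal sketch shows the central mechanism: a student network matches the output of an exponential-moving-average (EMA) "teacher" copy of itself on a different augmentation, so no labels enter the loop. It deliberately omits details such as DINO's centering, multi-crop augmentation, and projection heads; the tiny MLP and all hyperparameters are illustrative stand-ins.

```python
# Minimal sketch of BYOL/DINO-style self-distillation with an EMA teacher.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)                  # the teacher is never trained by gradients

view1, view2 = torch.randn(8, 128), torch.randn(8, 128)    # two "augmentations" of a batch

with torch.no_grad():
    target = F.softmax(teacher(view2) / 0.04, dim=1)        # sharpened teacher prediction
log_pred = F.log_softmax(student(view1) / 0.1, dim=1)
loss = -(target * log_pred).sum(dim=1).mean()               # cross-entropy to the teacher

# After each optimizer step, the teacher follows the student as a moving average.
momentum = 0.996
with torch.no_grad():
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_((1 - momentum) * ps)
print(loss)
```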
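
For the masked-autoencoder family, this sketch illustrates the core recipe of MAE-style pre-training: hide a large fraction of image patches, reconstruct them, and compute the loss only on the masked positions. The linear layers stand in for the real ViT encoder/decoder, and mask tokens and positional embeddings are omitted for brevity.

```python
# Minimal sketch of masked-autoencoder pre-training (in the spirit of MAE).
import torch
import torch.nn as nn

def random_mask(num_patches: int, mask_ratio: float = 0.75) -> torch.Tensor:
    """Boolean mask of shape (num_patches,); True means the patch is hidden."""
    num_masked = int(num_patches * mask_ratio)
    perm = torch.randperm(num_patches)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[perm[:num_masked]] = True
    return mask

patch_dim, num_patches = 16 * 16 * 3, 196            # 224x224 image, 16x16 patches
encoder = nn.Linear(patch_dim, 256)                   # stand-in for a ViT encoder
decoder = nn.Linear(256, patch_dim)                   # stand-in for the lightweight decoder

patches = torch.randn(num_patches, patch_dim)         # flattened patches of one image
mask = random_mask(num_patches)

visible_latent = encoder(patches[~mask])              # encode only the visible patches
full_latent = torch.zeros(num_patches, 256)           # zeros stand in for learned mask tokens
full_latent[~mask] = visible_latent
recon = decoder(full_latent)

loss = ((recon[mask] - patches[mask]) ** 2).mean()     # reconstruction loss on masked patches only
print(loss)
```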
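
For the unsupervised depth topics, the sketch below shows the photometric self-supervision idea behind methods such as Godard et al.'s left-right consistency: a predicted disparity is used to warp one view into the other, and the photometric error is the training signal, so no depth ground truth is needed. The stereo setting, the random "disparity", and all shapes are illustrative simplifications.

```python
# Minimal sketch of photometric self-supervision for unsupervised depth.
import torch
import torch.nn.functional as F

def warp_right_to_left(right: torch.Tensor, disp: torch.Tensor) -> torch.Tensor:
    """right: (B, C, H, W) image; disp: (B, 1, H, W) disparity as a fraction of image width."""
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    grid = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2).clone()   # (B, H, W, 2)
    # Shift the horizontal sampling coordinate by the predicted disparity
    # (factor 2 because grid_sample coordinates span [-1, 1] over the width).
    grid[..., 0] = grid[..., 0] - 2.0 * disp.squeeze(1)
    return F.grid_sample(right, grid, align_corners=True)

left, right = torch.rand(2, 3, 64, 128), torch.rand(2, 3, 64, 128)
disp = torch.rand(2, 1, 64, 128) * 0.1               # stand-in for a depth network's output
left_hat = warp_right_to_left(right, disp)
photometric_loss = (left_hat - left).abs().mean()    # L1 photometric error, no labels
print(photometric_loss)
```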
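
For the paired pre-training topic, this last sketch shows the CLIP-style symmetric contrastive objective over image-text pairs: matching pairs sit on the diagonal of the similarity matrix and are pulled together with a cross-entropy in both directions (SigLIP replaces this with a pairwise sigmoid loss). The random tensors stand in for the image and text encoder outputs.

```python
# Minimal sketch of a CLIP-style symmetric image-text contrastive loss.
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = img @ txt.t() / temperature               # (N, N); diagonal entries are matching pairs
    targets = torch.arange(logits.shape[0])
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage: random vectors stand in for encoder outputs of 16 image-caption pairs.
img_emb, txt_emb = torch.randn(16, 512), torch.randn(16, 512)
print(clip_loss(img_emb, txt_emb))
```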
