Shutong (Tony) Zhang

Hello! I am a first-year MSCS student at Stanford University, specializing in the Visual Computing track. Previously, I received my Bachelor of Applied Science in Computer Engineering from the University of Toronto.

I'm interested in Computer Vision, Robotics, and Software Engineering. On the Vision and Robotics side, I want to integrate visual perception into physically plausible actions, bridging the gap between perception and planning and enabling robots to perform complex tasks with strong generalization in the 3D world. On the Software Engineering side, I want to build tools and practices that support collaboration beyond the individual.

Email  /  LinkedIn  /  CV  /  Google Scholar  /  Misc

profile photo
News

[2024.9] My first-author paper on SE&UX collaboration was accepted to CSCW 2025!

[2024.4] I will join Stanford University as a Master's student in Computer Science!

[2024.1] My first-author paper HandyPriors was accepted to ICRA 2024!

Education
Stanford University, CA, United States
Master of Science in Computer Science
2024 - 2026
University of Toronto, ON, Canada
Bachelor of Applied Science
2019 - 2024
Major in Computer Engineering, minor in Artificial Intelligence
CGPA: 3.94/4.0
Publications and Papers
Sun Off, Lights On: Photorealistic Monocular Nighttime Simulation for Robust Semantic Perception
Konstantinos Tzevelekakis, Shutong Zhang, Luc Van Gool, Christos Sakaridis
Accepted to the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025
abstract / paper

Nighttime scenes are hard for learned models to perceive semantically and hard for humans to annotate. Realistic synthetic nighttime data thus become all the more important for learning robust semantic perception at night, thanks to their accurate and cheap semantic annotations. However, existing data-driven or hand-crafted techniques for generating nighttime images from daytime counterparts suffer from poor realism. The reason is that nighttime illumination is highly spatially varying and differs drastically from its daytime counterpart, and its interaction with the spatially varying materials of objects in the scene happens in 3D, which is very hard to capture with such 2D approaches. This 3D interaction and illumination shift have also proven hard to model in the literature, as opposed to other conditions such as fog or rain. Our method, named Sun Off, Lights On (SOLO), is the first to perform nighttime simulation on single images in a photorealistic fashion by operating in 3D. It first explicitly estimates the 3D geometry, the materials, and the locations of light sources of the scene from the input daytime image, then relights the scene by probabilistically instantiating light sources in a way that accounts for their semantics, and finally runs standard ray tracing. Not only are the visual quality and photorealism of our nighttime images superior to those of competing approaches, including diffusion models, but our images also prove more beneficial for semantic nighttime segmentation in day-to-night adaptation.
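
For a rough sense of how such a pipeline fits together, here is a minimal Python sketch of the relighting step: estimate_geometry, detect_light_sources, and ray_trace are hypothetical stand-ins for the paper's actual components, and the per-class probabilities are made up.

    import random

    # Hypothetical stand-ins for SOLO's actual components; each would be a
    # substantial model or renderer in the real pipeline.
    def estimate_geometry(image):
        return {"mesh": None, "materials": None}

    def detect_light_sources(image, semantics):
        return [{"class": "street_lamp", "pos": (0.0, 5.0, 10.0)},
                {"class": "window", "pos": (3.0, 4.0, 8.0)}]

    def ray_trace(scene, lights):
        return "rendered nighttime image"

    # Made-up per-class probabilities: a street lamp is far more likely
    # to be lit at night than a window.
    P_ON = {"street_lamp": 0.9, "window": 0.4}

    def simulate_night(image, semantics):
        scene = estimate_geometry(image)
        # Switch each detected source on with a class-dependent probability.
        lights = [light for light in detect_light_sources(image, semantics)
                  if random.random() < P_ON[light["class"]]]
        return ray_trace(scene, lights)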

Who to Blame: A Comprehensive Review of Challenges and Opportunities in Designer-Developer Collaboration
Shutong Zhang, Tianyu Zhang, Jinghui Cheng, Shurui Zhou
Accepted to the ACM Conference on Computer-Supported Cooperative Work & Social Computing (CSCW), 2025
abstract / paper (coming soon...)

Software development relies on effective collaboration between Software Development Engineers (SDEs) and User eXperience Designers (UXDs) to create software products of high quality and usability. While this collaboration issue has been explored over the past decades, anecdotal evidence continues to indicate the existence of challenges in their collaborative efforts. To understand this gap, we first conducted a systematic literature review of 44 papers published since 2005, uncovering three key collaboration challenges and two main best practices. We then analyzed designer and developer forums and discussions on open-source software repositories to assess how the challenges and practices manifest in the status quo. Our findings have broad applicability for collaboration in software development, extending beyond the partnership between SDEs and UXDs. The suggested best practices and interventions also act as a reference for future research, assisting in the development of dedicated collaboration tools for SDEs and UXDs.

NPSim: Nighttime Photorealistic Simulation From Daytime Images With Monocular Inverse Rendering and Ray Tracing
Shutong Zhang
Thesis at ETH Zurich Computer Vision Lab
abstract / paper / slides

Semantic segmentation is an important task for autonomous driving. A powerful autonomous driving system should be capable of handling images under all conditions, including nighttime. Generating accurate and diverse nighttime semantic segmentation datasets is crucial for enhancing the performance of computer vision algorithms in low-light conditions. In this thesis, we introduce a novel approach named NPSim, which enables the simulation of realistic nighttime images from real daytime counterparts with monocular inverse rendering and ray tracing. NPSim comprises two key components: mesh reconstruction and relighting. The mesh reconstruction component generates an accurate representation of the scene's structure by combining geometric information extracted from the input RGB image with semantic information from its corresponding semantic labels. The relighting component integrates real-world nighttime light sources and material characteristics to simulate the complex interplay of light and object surfaces under low-light conditions. The scope of this thesis mainly focuses on the implementation and evaluation of the mesh reconstruction component. Through experiments, we demonstrate the effectiveness of the mesh reconstruction component in producing high-quality scene meshes and its generality across different autonomous driving datasets. We also propose a detailed experiment plan for evaluating the entire pipeline, comprising both quantitative metrics, obtained by training state-of-the-art supervised and unsupervised semantic segmentation approaches, and human perceptual studies, to demonstrate that our approach generates realistic nighttime images and that our dataset can steer future progress in the field. NPSim not only addresses the scarcity of nighttime datasets for semantic segmentation, but also has the potential to improve the robustness and performance of vision algorithms under low-light conditions.
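
As a toy illustration of the geometric half of that pipeline, the NumPy sketch below backprojects a depth map into a labeled 3D point cloud, the standard first step before meshing; the depth values, labels, and camera intrinsics here are placeholders, and the thesis's actual mesh reconstruction is considerably more involved.

    import numpy as np

    def backproject(depth, labels, fx, fy, cx, cy):
        """Lift each pixel to a 3D point using depth and pinhole intrinsics,
        carrying its semantic label along; meshing would build on this."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        x = (u - cx) * depth / fx
        y = (v - cy) * depth / fy
        points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
        return points, labels.reshape(-1)

    # Toy inputs: a flat 4x4 depth map and a single semantic class.
    depth = np.full((4, 4), 2.0)
    labels = np.zeros((4, 4), dtype=int)
    points, semantics = backproject(depth, labels, fx=1.0, fy=1.0, cx=2.0, cy=2.0)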

HandyPriors: Physically Consistent Perception of Hand-Object Interactions with Differentiable Priors
Shutong Zhang*, Yiling Qiao*, Guanglei Zhu*, Eric Heiden, Dylan Turpin, Jingzhou Liu, Ming Lin, Miles Macklin, Animesh Garg
Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023
Accepted to the IEEE International Conference on Robotics and Automation (ICRA), 2024
abstract / paper / project page

Various heuristic objectives for modeling hand-object interaction have been proposed in past work. However, due to the lack of a cohesive framework, these objectives often possess a narrow scope of applicability and are limited in their efficiency or accuracy. In this paper, we propose HandyPriors, a unified and general pipeline for human-object interaction scenes that leverages recent advances in differentiable physics and rendering. Our approach employs rendering priors to align with input images and segmentation masks, along with physics priors to mitigate penetration and relative sliding across frames. Furthermore, we present two alternatives for hand and object pose estimation. The optimization-based pose estimation achieves higher accuracy, while the filtering-based tracking, which utilizes the differentiable priors as dynamics and observation models, executes faster. We demonstrate that HandyPriors attains comparable or superior results in the pose estimation task, and that the differentiable physics module can predict contact information for pose refinement. We also show that our approach generalizes to perception tasks, including robotic hand manipulation and human-object pose estimation in the wild.
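
The optimization-based variant can be pictured as plain gradient descent over a pose vector with a rendering term plus physics penalties. In the toy PyTorch sketch below, render_loss, penetration, and sliding are stand-ins for the paper's differentiable rendering and physics priors, and the loss weights are arbitrary.

    import torch

    pose = torch.zeros(6, requires_grad=True)   # toy 6-DoF pose to optimize
    evidence = torch.randn(6)                   # pretend image/mask evidence

    def render_loss(p):                         # match rendered vs. observed
        return ((p - evidence) ** 2).sum()

    def penetration(p):                         # penalize sinking below z = 0
        return torch.relu(-p[2])

    def sliding(p):                             # discourage tangential drift
        return p[:2].abs().sum()

    optimizer = torch.optim.Adam([pose], lr=0.1)
    for _ in range(100):
        optimizer.zero_grad()
        loss = render_loss(pose) + 10.0 * penetration(pose) + 0.1 * sliding(pose)
        loss.backward()
        optimizer.step()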

Fast-Grasp’D: Dexterous Multi-finger Grasp Generation Through Differentiable Simulations
Dylan Turpin, Tao Zhong, Shutong Zhang, Guanglei Zhu, Eric Heiden, Miles Macklin, Stavros Tsogkas, Sven Dickinson, Animesh Garg
Accepted to the IEEE International Conference on Robotics and Automation (ICRA), 2023
abstract / paper / project page

Multi-finger grasping relies on high-quality training data, which is hard to obtain: human data is hard to transfer, and synthetic data relies on simplifying assumptions that reduce grasp quality. By making grasp simulation differentiable, and contact dynamics amenable to gradient-based optimization, we accelerate the search for high-quality grasps with fewer limiting assumptions. We present Grasp'D-1M: a large-scale dataset for multi-finger robotic grasping, synthesized with Fast-Grasp'D, a novel differentiable grasping simulator. Grasp'D-1M contains one million training examples for three robotic hands (three-, four-, and five-fingered), each with multimodal visual inputs (RGB+depth+segmentation, available in mono and stereo). Grasp synthesis with Fast-Grasp'D is 10x faster than GraspIt! and 20x faster than the prior Grasp'D differentiable simulator. Generated grasps are more stable and contact-rich than GraspIt! grasps, regardless of the distance threshold used for contact generation. We validate the usefulness of our dataset by retraining an existing vision-based grasping pipeline on Grasp'D-1M and showing a dramatic increase in model performance: predicted grasps have 30% more contact, a 33% higher epsilon metric, and 35% lower simulated displacement.
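
To give a feel for what gradient-based grasp search means here, the toy PyTorch loop below pulls fingertip points onto an object's surface while penalizing penetration; the unit-sphere signed-distance function, the fingertip parameterization, and the weights are all placeholder assumptions, with Fast-Grasp'D's simulator supplying real contact dynamics and hand kinematics instead.

    import torch

    fingertips = torch.randn(5, 3, requires_grad=True)  # five toy contact points

    def signed_distance(p):
        # Placeholder object: a unit sphere at the origin (negative = inside).
        return p.norm(dim=-1) - 1.0

    optimizer = torch.optim.SGD([fingertips], lr=0.05)
    for _ in range(200):
        optimizer.zero_grad()
        d = signed_distance(fingertips)
        contact = d.abs().sum()              # reward touching the surface
        penetration = torch.relu(-d).sum()   # penalize going inside the object
        (contact + 5.0 * penetration).backward()
        optimizer.step()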

Research Experience
ETH Zurich Computer Vision Lab, Switzerland
2023.4 - present

Research Intern
Supervisors: Prof. Luc Van Gool and Dr. Christos Sakaridis
PAIR Lab and Vector Institute, Canada
2022.5 (project start date: 2022.8) - present

Research Intern
Supervisor: Prof. Animesh Garg, with Prof. Ming C. Lin
Forcolab, Canada
2022.4 (project start date: 2022.5) - 2023.9

Research Intern
Supervisor: Prof. Shurui Zhou, with Prof. Jinghui Cheng
Work Experience
Intel Corporation, Canada
2022.5 - 2023.4

Engineering Intern
Quality and Execution Team: Project Manager and Software Engineer
Customer Happiness and User Experience Team: Front-End Developer
Core Datapath Team: Compiler Engineer
University of Toronto, Canada
2021.9 - 2023.4

Teaching Assistant
ECE253 Digital and Computer Systems - Fall 2021, Fall 2022
ECE243 Computer Organization - Winter 2022, Winter 2023
Supervisors: Prof. Natalie Enright Jerger and Prof. Jonathan Rose
Teaching and Outreach

University of Toronto ECE243 Computer Organization, Teaching Assistant. 2022 Winter, 2023 Winter

University of Toronto ECE253 Digital and Computer Systems, Teaching Assistant. 2021 Fall, 2022 Fall

University of Toronto Engineering, Mentor. 2020 - present

Rural Teaching Volunteer Program, Physics Teacher. 2020

EngFastlane, Calculus and Mechanics Instructor. 2019 - 2020

Awards and Honors

International Experience Award ($3000). May 2023

University of Toronto Summer Research Exchange Fellowship ($3000). Dec 2022

Edith Grace Buchan Undergraduate Research Fellowship ($5400). Apr 2022

Department of Electrical and Computer Engineering Top Student Award. Oct 2021

University of Toronto In-Course Scholarship ($1500). Aug 2021

University of Toronto Scholar. Aug 2021

University of Toronto Summer Research Fellowship ($5000). May 2021

Dean's Honor List. 2019 - 2022

Faculty of Applied Science & Engineering Admission Scholarship ($5000). Sep 2019


Template borrowed from here.