Postdoctoral Researcher (Computer Vision)
Publications
LinkedIn
Google Scholar
E-Mail
CV
Maitreya Suin and Rama Chellappa.
Generating fine-grained facial details faithful to inputs remains a challenging problem. Most existing methods produce either overly smooth outputs or alter the identity as they attempt to balance between generation and reconstruction. This may be attributed to the typical trade-off between quality and resolution in the latent space. We introduce a diffusion-based prior inside a VQGAN architecture that focuses on learning the distribution over uncorrupted latent embeddings. We iteratively recover the clean embedding conditioned on the degraded counterpart. To ensure the reverse diffusion trajectory does not deviate from the underlying identity, we train a separate Identity Recovery Network and use its output to constrain the diffusion process.
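A minimal sketch of the idea behind this training objective: a latent diffusion prior conditioned on the degraded VQGAN embedding, with an extra identity-consistency term. All module names (denoiser, id_net) and the weighting are hypothetical placeholders, not the released implementation.

```python
# Sketch: one training step of a diffusion prior over clean VQGAN latents,
# conditioned on the degraded latent and constrained by an identity-recovery net.
import torch
import torch.nn.functional as F

def diffusion_prior_step(denoiser, id_net, z_clean, z_degraded, alphas_cumprod, id_weight=0.1):
    """z_clean, z_degraded: (B, C, H, W) VQGAN latents of the HQ/LQ face pair."""
    b = z_clean.size(0)
    t = torch.randint(0, alphas_cumprod.numel(), (b,), device=z_clean.device)
    a_t = alphas_cumprod[t].view(b, 1, 1, 1)

    # Forward diffusion: corrupt the clean latent with Gaussian noise.
    noise = torch.randn_like(z_clean)
    z_t = a_t.sqrt() * z_clean + (1.0 - a_t).sqrt() * noise

    # The denoiser sees the noisy latent concatenated with the degraded latent.
    pred_noise = denoiser(torch.cat([z_t, z_degraded], dim=1), t)
    loss_diff = F.mse_loss(pred_noise, noise)

    # Identity constraint: the implied clean latent should stay close to the
    # output of a separately trained identity-recovery network.
    z0_hat = (z_t - (1.0 - a_t).sqrt() * pred_noise) / a_t.sqrt()
    with torch.no_grad():
        z_id = id_net(z_degraded)
    loss_id = F.l1_loss(z0_hat, z_id)

    return loss_diff + id_weight * loss_id
```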
Maitreya Suin, Kuldeep Purohit and A. N. Rajagopalan.
We propose a pixel-adaptive and feature-attentive design that handles large blur variations across different spatial locations and processes each test image adaptively. We design a content-aware global-local filtering module that significantly improves performance by considering not only global dependencies but also by dynamically exploiting neighboring pixel information. We further introduce a pixel-adaptive non-uniform sampling strategy that implicitly discovers the difficult-to-restore regions present in the image and, in turn, performs fine-grained refinement in a progressive manner.
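To make the content-aware global-local filtering idea concrete, here is a minimal sketch of one such block, assuming a per-pixel dynamic filter over the local neighbourhood combined with a pooled global descriptor; layer sizes and names are illustrative, not the paper's code.

```python
# Sketch: spatially varying (per-pixel) local filtering plus a global context term.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalFilter(nn.Module):
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.kernel_pred = nn.Conv2d(channels, k * k, 3, padding=1)  # per-pixel kernels
        self.global_fc = nn.Linear(channels, channels)               # image-level context

    def forward(self, x):
        b, c, h, w = x.shape
        # Local branch: apply a predicted, spatially varying k x k filter.
        kernels = torch.softmax(self.kernel_pred(x), dim=1)          # (B, k*k, H, W)
        patches = F.unfold(x, self.k, padding=self.k // 2)           # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h, w)
        local = (patches * kernels.unsqueeze(1)).sum(dim=2)          # (B, C, H, W)

        # Global branch: modulate with a pooled, image-level descriptor.
        ctx = self.global_fc(x.mean(dim=(2, 3)))                     # (B, C)
        return local + ctx.view(b, c, 1, 1)
```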
Maitreya Suin, Nithin Gopalakrishnan Nair, Chun Pong Lau, Vishal M. Patel and Rama Chellappa.
Blind face restoration (BFR) from severely degraded face images in the wild is a highly ill-posed problem. Existing generative works typically struggle to restore realistic details when the input is of poor quality. For BFR, maintaining a balance between the fidelity of the restored image and the reconstructed identity information is important. We present a conditional diffusion-based framework for BFR. We alleviate the drawbacks of existing diffusion-based approaches and design a region-adaptive strategy. This leads to a significant improvement in perceptual quality as well as face-recognition scores.
Aniket Roy, Maitreya Suin, Anshul Shah, Ketul Shah, Jiang Liu and Rama Chellappa.
We propose a generic “naturalness” preserving loss function, viz., kurtosis concentration (KC) loss, which can be readily applied to any standard diffusion model pipeline to elevate the image quality. Our motivation stems from the projected kurtosis concentration property of natural images, which states that natural images have nearly constant kurtosis values across different band-pass versions of the image. We validate the proposed approach for three diverse tasks, viz., (1) personalized few-shot finetuning using text guidance, (2) unconditional image generation, and (3) image super-resolution.
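A minimal sketch of a kurtosis-concentration style penalty, assuming a few simple fixed band-pass filters in place of the decomposition used in the paper: since natural images have nearly constant kurtosis across bands, the loss penalises the variance of the per-band kurtosis values.

```python
# Sketch: penalise how much kurtosis varies across band-pass versions of an image.
import torch
import torch.nn.functional as F

def kurtosis(x, eps=1e-6):
    x = x.flatten(1)
    mu = x.mean(dim=1, keepdim=True)
    var = x.var(dim=1, keepdim=True) + eps
    return (((x - mu) ** 4).mean(dim=1, keepdim=True) / var ** 2).squeeze(1)

def kc_loss(img):
    """img: (B, 1, H, W) grayscale batch; returns a scalar penalty."""
    filters = torch.stack([
        torch.tensor([[0., 0, 0], [-1, 2, -1], [0, 0, 0]]),   # horizontal band-pass
        torch.tensor([[0., -1, 0], [0, 2, 0], [0, -1, 0]]),   # vertical band-pass
        torch.tensor([[-1., 0, 0], [0, 2, 0], [0, 0, -1]]),   # diagonal band-pass
    ]).unsqueeze(1).to(img)                                    # (3, 1, 3, 3)
    bands = F.conv2d(img, filters, padding=1)                  # (B, 3, H, W)
    k = torch.stack([kurtosis(bands[:, i]) for i in range(bands.size(1))], dim=1)
    return k.var(dim=1).mean()                                 # variance across bands
```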
Chun Pong Lau, Maitreya Suin and Rama Chellappa.
While face detection works well in ideal situations, the performance deteriorates significantly when the image is degraded due to factors such as blur, deformation, low resolution, and extreme head pose. This motivates us to develop a face detection and alignment algorithm that can perform effectively on videos captured from a long range and high altitude without ground-truth annotations. We propose a single-stage face localization model, ATDetect, which detects face bounding boxes, keypoints, and meta information simultaneously on realistic video captured at range and altitude.
Praveen Kandula, Maitreya Suin and A. N. Rajagopalan.
We propose an unsupervised low-light enhancement network using context-guided illumination-adaptive norm (CIN). We address this task in two stages. In stage I, a pixel amplifier module (PAM) is used to generate a coarse estimate with an overall improvement in visibility and aesthetic quality. Stage II further enhances the saturated dark pixels and scene properties of the image using CIN. We propose a region-adaptive single input multiple output (SIMO) model that can generate multiple enhanced images from a single low-light image. The objective of SIMO is to let users choose the image of their liking from a pool of enhanced images.
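As a rough illustration of an illumination-adaptive normalization layer, here is a hypothetical sketch (not the released CIN module) where the per-channel scale and shift of an instance norm are predicted from a context/illumination feature map.

```python
# Sketch: instance norm whose affine parameters adapt to local illumination context.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IlluminationAdaptiveNorm(nn.Module):
    def __init__(self, channels, ctx_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.gamma = nn.Conv2d(ctx_channels, channels, 1)   # predicted scale
        self.beta = nn.Conv2d(ctx_channels, channels, 1)    # predicted shift

    def forward(self, x, ctx):
        # ctx: illumination/context features, resized to match x spatially.
        ctx = F.interpolate(ctx, size=x.shape[-2:], mode='bilinear', align_corners=False)
        return self.norm(x) * (1 + self.gamma(ctx)) + self.beta(ctx)
```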
Snehal Singh Tomar, Maitreya Suin and A. N. Rajagopalan.
Most existing style-editing methods work on the principle of inverting real images onto their latent space, followed by determining controllable directions. Both inversion of real images and determination of controllable latent directions are computationally expensive operations. This work aims to explore the efficacy of mask-guided feature modulation in the latent space of a deep generative model as a solution to these bottlenecks. To this end, we present the SemanticStyle Autoencoder (SSAE), a deep generative autoencoder model that leverages semantic mask-guided latent space manipulation for highly localized photorealistic style editing of real images.
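A minimal sketch of what mask-guided feature modulation can look like, under the assumption that style statistics are injected only inside the semantic-mask region; module names and the style interface are illustrative, not SSAE's actual layers.

```python
# Sketch: modulate generator features with a style code only where the mask is active.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskGuidedModulation(nn.Module):
    def __init__(self, channels, style_dim):
        super().__init__()
        self.to_scale = nn.Linear(style_dim, channels)
        self.to_shift = nn.Linear(style_dim, channels)

    def forward(self, feat, style, mask):
        # feat: (B, C, H, W) latent features; style: (B, style_dim); mask: (B, 1, h, w).
        mask = F.interpolate(mask, size=feat.shape[-2:], mode='nearest')
        scale = self.to_scale(style).unsqueeze(-1).unsqueeze(-1)
        shift = self.to_shift(style).unsqueeze(-1).unsqueeze(-1)
        edited = feat * (1 + scale) + shift
        return mask * edited + (1 - mask) * feat   # untouched outside the mask
```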
Snehal Singh Tomar, Maitreya Suin and A. N. Rajagopalan.
Most state-of-the-art (SOTA) works in the self-supervised and unsupervised domain employ a ResNet-based encoder architecture to predict disparity maps from a given input image, which are eventually used alongside a camera pose estimator to predict depth without direct supervision. The fully convolutional nature of ResNets makes them susceptible to capturing only per-pixel local information, which is suboptimal for depth prediction. Our key insight for doing away with this bottleneck is to use Vision Transformers, which employ self-attention to capture the global contextual information present in an input image. Our model fuses per-pixel local information learned by two fully convolutional depth encoders with global contextual information learned by a transformer encoder at different scales.
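The fusion of local convolutional features with global transformer context could look roughly like the following single-scale sketch; the projection layers and the token-to-grid reshaping are assumptions for illustration only, and the full model repeats such fusion across scales.

```python
# Sketch: fuse a conv encoder's feature map with reshaped ViT patch tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalGlobalFusion(nn.Module):
    def __init__(self, conv_channels, vit_dim, out_channels):
        super().__init__()
        self.proj_conv = nn.Conv2d(conv_channels, out_channels, 1)
        self.proj_vit = nn.Conv2d(vit_dim, out_channels, 1)
        self.merge = nn.Conv2d(2 * out_channels, out_channels, 3, padding=1)

    def forward(self, conv_feat, vit_tokens, grid_hw):
        # vit_tokens: (B, N, D) patch tokens; grid_hw: (h, w) with h * w == N.
        b, n, d = vit_tokens.shape
        vit_feat = vit_tokens.transpose(1, 2).reshape(b, d, *grid_hw)
        vit_feat = F.interpolate(vit_feat, size=conv_feat.shape[-2:], mode='bilinear',
                                 align_corners=False)
        fused = torch.cat([self.proj_conv(conv_feat), self.proj_vit(vit_feat)], dim=1)
        return self.merge(fused)
```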
Maitreya Suin, Kuldeep Purohit and A. N. Rajagopalan.
Image inpainting is a highly ill-posed problem, and existing works often create distorted structures or blurry, inconsistent textures. We argue that the problem is rooted in the encoder layers’ ineffectiveness in building a complete and faithful embedding of the missing regions from scratch. We propose a distillation-based approach for inpainting, where we provide direct feature-level supervision during training. We deploy cross- and self-distillation techniques and design a dedicated completion block in the encoder. Next, we demonstrate how an inpainting network’s attention module can be improved by leveraging a distillation-based attention transfer technique. We conduct evaluations on multiple datasets to validate our method.
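The feature-level supervision can be pictured with the following sketch, assuming a teacher encoder that sees the complete image and a student encoder that sees only the masked input; the exact loss terms and weighting here are illustrative, not the paper's formulation.

```python
# Sketch: feature distillation plus attention-transfer supervision for inpainting.
import torch
import torch.nn.functional as F

def attention_map(feat):
    # Collapse channels into a normalised spatial attention map.
    amap = feat.pow(2).mean(dim=1, keepdim=True)
    return amap / (amap.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-6)

def distillation_losses(student_feat, teacher_feat, att_weight=0.5):
    """Both features: (B, C, H, W); the teacher is computed from the uncorrupted image."""
    loss_feat = F.l1_loss(student_feat, teacher_feat.detach())
    loss_att = F.l1_loss(attention_map(student_feat), attention_map(teacher_feat).detach())
    return loss_feat + att_weight * loss_att
```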
Kuldeep Purohit, Maitreya Suin, A. N. Rajagopalan and Vishnu Naresh Boddeti.
We propose SPAIR, a network design that harnesses distortion-localization information and dynamically adjusts computation to difficult regions in the image. SPAIR comprises two components: (1) a localization network that identifies degraded pixels, and (2) a restoration network that exploits knowledge from the localization network in the filter and feature domains to selectively and adaptively restore degraded pixels. Our architecture is agnostic to the physical formation model and generalizes across several types of spatially-varying degradations. We demonstrate the efficacy of SPAIR individually on four restoration tasks.
Maitreya Suin and A. N. Rajagopalan.
Most of the existing video deblurring works depend on implicit or explicit alignment for temporal fusion, which either increases the computational cost or results in suboptimal performance due to misalignment. We investigate two key factors: how to fuse spatio-temporal information and from where to collect it. We propose a factorized gated spatio-temporal attention module to perform non-local operations across space and time to fully utilize the available information without depending on alignment. It shows superior performance compared to existing non-local fusion techniques while being considerably more efficient. To complement the attention module, we propose a reinforcement learning-based framework for selecting keyframes from the neighborhood with the most complementary and useful information.
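A minimal sketch of what a factorized gated spatio-temporal attention block can look like, assuming attention is applied along space and along time in two cheaper passes instead of one joint pass, with a learned gate on the fused output; the layer choices are assumptions, not the paper's exact module.

```python
# Sketch: factorised spatial-then-temporal attention with a gated residual fusion.
import torch
import torch.nn as nn

class FactorizedSTAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, T, N, C) -- T frames, N spatial tokens per frame.
        b, t, n, c = x.shape
        xs = x.reshape(b * t, n, c)
        xs, _ = self.spatial(xs, xs, xs)                 # attend across space per frame
        xt = xs.reshape(b, t, n, c).permute(0, 2, 1, 3).reshape(b * n, t, c)
        xt, _ = self.temporal(xt, xt, xt)                # attend across time per location
        out = xt.reshape(b, n, t, c).permute(0, 2, 1, 3)
        return x + torch.sigmoid(self.gate(out)) * out   # gated residual fusion
```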
Maitreya Suin, Kuldeep Purohit and A. N. Rajagopalan.
We propose an efficient pixel-adaptive and feature-attentive design that handles large blur variations across different spatial locations and processes each test image adaptively. We design a content-aware global-local filtering module that significantly improves performance by considering not only global dependencies but also by dynamically exploiting neighboring pixel information. We use a patch-hierarchical attentive architecture composed of the above module, which implicitly discovers the spatial variations in the blur present in the input image and, in turn, performs local and global modulation of intermediate features.
Maitreya Suin, Kuldeep Purohit and A. N. Rajagopalan.
We present a new approach suitable for handling the image-specific and spatially-varying nature of degradation in images affected by practically occurring artifacts such as rain streaks, haze, raindrops and motion blur. We decompose the restoration task into two stages of degradation localization and degraded region-guided restoration, unlike existing methods which directly learn a mapping between the degraded and clean images. We demonstrate that the model trained for this auxiliary task contains vital region knowledge, which can be exploited to guide the restoration network’s training using a knowledge distillation technique. Further, we propose mask-guided modules to focus on restoring the degraded regions. We conduct an extensive evaluation on multiple datasets corresponding to four different restoration tasks to validate our method.
Maitreya Suin and A. N. Rajagopalan.
We focus on the task of generating a dense description of temporally untrimmed videos and aim to significantly reduce the computational cost by processing fewer frames while maintaining accuracy. Existing video captioning methods sample frames with a predefined frequency over the entire video or use all the frames. Instead, we propose a deep reinforcement learning-based approach that enables an agent to describe multiple events in a video by watching only a portion of the frames. The agent needs to watch more frames when it is processing an informative part of the video, and skip frames when there is redundancy. Such efficient frame selection simplifies the event proposal task considerably. This has the added effect of reducing the occurrence of unwanted proposals. We also leverage the idea of knowledge distillation to improve accuracy.
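As a rough illustration of the frame-selection idea, here is a hypothetical REINFORCE-style sketch in which a policy decides how many frames to skip after each watched frame and is rewarded by a captioning score; the policy architecture and reward interface are assumptions, not the paper's agent.

```python
# Sketch: policy-gradient frame selection that skips a variable number of frames.
import torch
import torch.nn as nn

class SkipPolicy(nn.Module):
    def __init__(self, feat_dim, max_skip=8):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                  nn.Linear(128, max_skip))

    def forward(self, frame_feat):
        return torch.distributions.Categorical(logits=self.head(frame_feat))

def rollout_loss(policy, frame_feats, reward_fn):
    """frame_feats: (T, D) per-frame features; reward_fn scores the watched frame set."""
    t, log_probs, watched = 0, [], []
    while t < frame_feats.size(0):
        watched.append(t)
        dist = policy(frame_feats[t])
        skip = dist.sample()                             # how many frames to skip next
        log_probs.append(dist.log_prob(skip))
        t += int(skip.item()) + 1
    reward = reward_fn(watched)                          # e.g. a captioning metric
    return -(reward * torch.stack(log_probs).sum())      # REINFORCE objective
```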
Kuldeep Purohit, Maitreya Suin, Praveen Kandula and A. N. Rajagopalan.
We present a learning-based method for rendering a synthetic depth-of-field (bokeh) effect on input bokeh-free images acquired using ordinary monocular cameras. The proposed network is composed of an efficient densely connected encoder-decoder backbone with a pyramid pooling module. Our network leverages the task-specific efficacy of joint intensity estimation and dynamic filter synthesis for the spatially-aware blurring process. Since the rendering task requires distinguishing between large foreground and background regions and their relative depth, our network is further guided by pre-trained salient-region segmentation and depth-estimation modules. Along with extensive ablation analysis and visualizations to validate its components, the effectiveness of the proposed network is also demonstrated by achieving the second-highest score in the fidelity track of the AIM 2019 Bokeh Effect challenge.