Nilesh Kulkarni


I am a CSE Ph.D. student at the University of Michigan. I'm fortunate to be advised by David Fouhey and Justin Johnson. I previously graduated with a master's degree from Carnegie Mellon University, where I was advised by Abhinav Gupta. Before that, I was an undergraduate in the Computer Science and Engineering department at IIT Bombay.

During my Ph.D. I have been fortunate to intern at Google Research with Prof. Leonidas Guibas and at Waymo Research with Xinchen Yan and Charles Qi. I have closely collaborated with Shubham Tulsiani and Ishan Misra. I also spent two years as a Research Engineer at Samsung AI Research in Seoul, South Korea.



Feel free to reach out at nileshk [at] umich.edu if you are interested in collaborating.


email | github | google scholar | CV



News

[Mar 2024] NIFTY, FAR, and 3DFIRES accepted to CVPR 2024.
[Jul 2023] NIFTY website and paper are available now.
[Jun 2023] D2-DRDF website and paper are available now.
[May 2023] Interning at Waymo Research, hosted by Xinchen Yan and Charles Qi.
[Feb 2023] D2-DRDF is accepted at CVPR 2023, Vancouver. Paper coming soon!
[Jul 2022] DRDF is accepted at ECCV 2022, Tel Aviv.
[May 2022] I'm interning at Google Research this summer, hosted by Prof. Leonidas Guibas.

Publications


[New] 3DFIRES: Few Image 3D REconstruction for Scenes with Hidden Surface
Linyi Jin, Nilesh Kulkarni, David Fouhey
CVPR 2024
abstract   project page   paper

This paper introduces 3DFIRES, a novel system for scene-level 3D reconstruction from posed images. Designed to work with as few as one view, 3DFIRES reconstructs the complete geometry of unseen scenes, including hidden surfaces. With multiple view inputs, our method produces full reconstruction within all camera frustums. A key feature of our approach is the fusion of multi-view information at the feature level, enabling the production of coherent and comprehensive 3D reconstruction. We train our system on non-watertight scans from a large-scale real scene dataset. We show it matches the efficacy of single-view reconstruction methods with only one input and surpasses existing techniques in both quantitative and qualitative measures for sparse-view 3D reconstruction.

[New] FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation
Chris Rockwell, Nilesh Kulkarni, Linyi Jin, Jeong Joon Park, Justin Johnson, David Fouhey
CVPR 2024
abstract   project page   paper

Estimating relative camera poses between images has been a central problem in computer vision. Methods that find correspondences and solve for the fundamental matrix offer high precision in most cases. Conversely, methods predicting pose directly using neural networks are more robust to limited overlap and can infer absolute translation scale, but at the expense of reduced precision. We show how to combine the best of both methods; our approach yields results that are both precise and robust, while also accurately inferring translation scales. At the heart of our model lies a Transformer that (1) learns to balance between solved and learned pose estimations, and (2) provides a prior to guide a solver. A comprehensive analysis supports our design choices and demonstrates that our method adapts flexibly to various feature extractors and correspondence estimators, showing state-of-the-art performance in 6DoF pose estimation on Matterport3D, InteriorNet, StreetLearn, and Map-free Relocalization.

[New] NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis
Nilesh Kulkarni, Davis Rempe, Kyle Genova, Abhijit Kundu, Justin Johnson, David Fouhey, Leonidas Guibas
CVPR, 2024
pdf   abstract   bibtex   project page

We address the problem of generating realistic 3D motions of humans interacting with objects in a scene. Our key idea is to create a neural interaction field attached to a specific object, which outputs the distance to the valid interaction manifold given a human pose as input. This interaction field guides the sampling of an object-conditioned human motion diffusion model, so as to encourage plausible contacts and affordance semantics. To support interactions with scarcely available data, we propose an automated synthetic data pipeline. For this, we seed a pre-trained motion model, which has priors for the basics of human movement, with interaction-specific anchor poses extracted from limited motion capture data. Using our guided diffusion model trained on generated synthetic data, we synthesize realistic motions for sitting and lifting with several objects, outperforming alternative approaches in terms of motion quality and successful action completion. We call our framework NIFTY: Neural Interaction Fields for Trajectory sYnthesis.
@article{kulkarni2023nifty,
  title={Nifty: Neural object interaction fields for guided human motion synthesis},
  author={Kulkarni, Nilesh and Rempe, Davis and Genova, Kyle and Kundu, Abhijit and Johnson, Justin and Fouhey, David and Guibas, Leonidas},
  journal={arXiv preprint arXiv:2307.07511},
  year={2023}
}

Learning to Predict Scene-Level Implicit 3D from Posed RGBD Data
Nilesh Kulkarni, Linyi Jin, Justin Johnson, David F. Fouhey
CVPR, 2023
pdf   abstract   bibtex   video   project page

We introduce a method that can learn to predict scene-level implicit functions for 3D reconstruction from posed RGBD data. At test time, our system maps a previously unseen RGB image to a 3D reconstruction of a scene via implicit functions. While implicit functions for 3D reconstruction have often been tied to meshes, we show that we can train one using only a set of posed RGBD images. This setting may help 3D reconstruction unlock the sea of accelerometer+RGBD data that is coming with new phones. Our system, D2-DRDF, can match and sometimes outperform current methods that use mesh supervision and shows better robustness to sparse data.
@inproceedings{kulkarni2023d2drdf,
  title={Learning to Predict Scene-Level Implicit 3D from Posed RGBD Data},
  author={Kulkarni, Nilesh and Jin, Linyi and Johnson, Justin and Fouhey, David F},
  booktitle={Computer Vision and Pattern Recognition},
  year={2023}
}

What's behind the couch? Directed Ray Distance Functions for 3D Scene Reconstruction
Nilesh Kulkarni, Justin Johnson, David F. Fouhey
ECCV, 2022
pdf   abstract   bibtex   video   project page

We present an approach for scene-level 3D reconstruction, including occluded regions, from an unseen RGB image. Our approach is trained on real 3D scans and images. This problem has proved difficult for multiple reasons: real scans are not watertight, precluding many methods; distances in scenes require reasoning across objects (making it even harder); and, as we show, uncertainty about surface locations motivates networks to produce outputs that lack basic distance function properties. We propose a new distance-like function that can be computed on unstructured scans and has good behavior under uncertainty about surface location. Computing this function over rays reduces the complexity further. We train a deep network to predict this function and show it outperforms other methods on Matterport3D, 3D Front, and ScanNet.
@inproceedings{kulkarni2022directed,
  title={Directed Ray Distance Functions for 3D Scene Reconstruction},
  author={Kulkarni, Nilesh and Johnson, Justin and Fouhey, David F},
  booktitle={European Conference on Computer Vision},
  pages={201--219},
  year={2022},
  organization={Springer}
}

Collision Replay: What does bumping into scenes tell you about scene geometry?
Alexander Raistrick, Nilesh Kulkarni, David Fouhey
BMVC, 2021 (Oral)
pdf   abstract   bibtex   video   overview talk

What does bumping into things in past scenes tell you about scene geometry in a new scene? In this paper, we investigate the idea of learning from collisions. At the heart of our approach is the idea of collision replay, where after a collision an agent associates the pre-collision observations (such as images or sound collected by the agent) with the time until the next collision. These samples enable training a deep network that can map the pre-collision observations to information about scene geometry. Specifically, we use collision replay to train a model to predict a distribution over collision time from new observations by using supervision from bumps. We learn this distribution conditioned on visual data or echolocation responses. This distribution conveys information about the navigational affordances (e.g., corridors vs open spaces) and, as we show, can be converted into the distance function for the scene geometry. We analyze our approach with a noisily actuated agent in a photorealistic simulator.

@inproceedings{raistrick21,
  title={Collision Replay: What Does Bumping Into Things Tell You About Scene Geometry?},
  author={Alexander Raistrick and Nilesh Kulkarni and David F. Fouhey},
  booktitle={BMVC},
  year={2021}
}

Implicit Mesh Reconstruction from Unannotated Image Collections
Shubham Tulsiani, Nilesh Kulkarni, Abhinav Gupta
arXiv, 2020
pdf   abstract   bibtex   project page

We present an approach to infer the 3D shape, texture, and camera pose for an object from a single RGB image, using only category-level image collections with foreground masks as supervision. We represent the shape as an image-conditioned implicit function that transforms the surface of a sphere to that of the predicted mesh, while additionally predicting the corresponding texture. To derive supervisory signal for learning, we enforce that: a) our predictions when rendered should explain the available image evidence, and b) the inferred 3D structure should be geometrically consistent with learned pixel to surface mappings. We empirically show that our approach improves over prior work that leverages similar supervision, and in fact performs competitively to methods that use stronger supervision. Finally, as our method enables learning with limited supervision, we qualitatively demonstrate its applicability over a set of about 30 object categories.
@article{tulsiani2020implicit,
  title={Implicit mesh reconstruction from unannotated image collections},
  author={Tulsiani, Shubham and Kulkarni, Nilesh and Gupta, Abhinav},
  journal={arXiv preprint arXiv:2007.08504},
  year={2020}
}

Articulation-aware Canonical Surface Mapping
Nilesh Kulkarni, Abhinav Gupta, David Fouhey, Shubham Tulsiani
CVPR, 2020
pdf   abstract   bibtex   code   project page   video   ppt

We tackle the tasks of: 1) predicting a Canonical Surface Mapping (CSM) that indicates the mapping from 2D pixels to corresponding points on a canonical template shape, and 2) inferring the articulation and pose of the template corresponding to the input image. While previous approaches rely on leveraging keypoint supervision for learning, we present an approach that can learn without such annotations. Our key insight is that these tasks are geometrically related, and we can obtain supervisory signal via enforcing consistency among the predictions. We present results across a diverse set of animate object categories, showing that our method can learn articulation and CSM prediction from image collections using only foreground mask labels for training. We empirically show that allowing articulation helps learn more accurate CSM prediction, and that enforcing the consistency with predicted CSM is similarly critical for learning meaningful articulation.

@inProceedings{kulkarni2020acsm,
  title={Articulation-aware Canonical Surface Mapping},
  author={Kulkarni, Nilesh and Gupta, Abhinav and Fouhey, David and Tulsiani, Shubham},
  year={2020},
  booktitle={Computer Vision and Pattern Recognition (CVPR)}
}

Canonical Surface Mapping via Geometric Cycle Consistency
Nilesh Kulkarni, Abhinav Gupta*, Shubham Tulsiani*
ICCV, 2019
pdf   project page   abstract   bibtex   video   code

We explore the task of Canonical Surface Mapping (CSM). Specifically, given an image, we learn to map pixels on the object to their corresponding locations on an abstract 3D model of the category. But how do we learn such a mapping? A supervised approach would require extensive manual labeling which is not scalable beyond a few hand-picked categories. Our key insight is that the CSM task (pixel to 3D), when combined with 3D projection (3D to pixel), completes a cycle. Hence, we can exploit a geometric cycle consistency loss, thereby allowing us to forgo the dense manual supervision. Our approach allows us to train a CSM model for a diverse set of classes, without sparse or dense keypoint annotation, by leveraging only foreground mask labels for training. We show that our predictions also allow us to infer dense correspondence between two images, and compare the performance of our approach against several methods that predict correspondence by leveraging varying amounts of supervision.
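
The geometric cycle mentioned in the abstract (pixel to 3D via the CSM, then 3D back to pixel via camera projection) can be made concrete with a short sketch. This is only an illustrative reprojection-style consistency loss, assuming PyTorch and a simple pinhole camera with predicted rotation R, translation t, and intrinsics K; the function and argument names are hypothetical and not taken from the paper's code.

import torch

def geometric_cycle_consistency_loss(points_3d, camera_K, camera_R, camera_t, pixel_coords, fg_mask):
    # points_3d: (N, 3) template-surface points predicted by the CSM for N sampled pixels (assumption)
    # camera_K / camera_R / camera_t: predicted pinhole camera; pixel_coords: (N, 2) original pixel locations
    # fg_mask: (N,) foreground indicator, since only mask labels are available as supervision
    cam_points = points_3d @ camera_R.t() + camera_t           # template frame -> camera frame
    proj = cam_points @ camera_K.t()                           # apply intrinsics
    reprojected = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)   # perspective divide -> pixel coordinates
    err = ((reprojected - pixel_coords) ** 2).sum(dim=-1)      # squared reprojection error per pixel
    return (err * fg_mask).sum() / fg_mask.sum().clamp(min=1)  # average over foreground pixels

Minimizing this error closes the pixel -> 3D -> pixel loop, which is the sense in which the cycle consistency replaces dense keypoint supervision.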

@inProceedings{kulkarni2019csm,
  title={Canonical Surface Mapping via Geometric Cycle Consistency},
  author={Kulkarni, Nilesh and Gupta, Abhinav and Tulsiani, Shubham},
  year={2019},
  booktitle={International Conference on Computer Vision (ICCV)}
}

3D-RelNet: Joint Object and Relational Network for 3D Prediction
Nilesh Kulkarni, Ishan Misra, Shubham Tulsiani, Abhinav Gupta
ICCV, 2019
pdf   project page   abstract   bibtex   code

We propose an approach to predict the 3D shape and pose for the objects present in a scene. Existing learning based methods that pursue this goal make independent predictions per object, and do not leverage the relationships amongst them. We argue that reasoning about these relationships is crucial, and present an approach to incorporate these in a 3D prediction framework. In addition to independent per-object predictions, we predict pairwise relations in the form of relative 3D pose, and demonstrate that these can be easily incorporated to improve object level estimates. We report performance across different datasets (SUNCG, NYUv2), and show that our approach significantly improves over independent prediction approaches while also outperforming alternate implicit reasoning methods.

@inProceedings{kulkarni2019relnet,
  title={3D-RelNet: Joint Object and Relational Network for 3D Prediction},
  author={Kulkarni, Nilesh and Misra, Ishan and Tulsiani, Shubham and Gupta, Abhinav},
  booktitle={ICCV},
  year={2019}
}

On-Device Neural Language Model based Word Prediction
Seunghak Yu*, Nilesh Kulkarni*, Haejun Lee, Jihie Kim
COLING: System Demonstrations, 2018
pdf   abstract   bibtex

Recent developments in deep learning applied to language modeling have led to success in text processing, summarization, and machine translation tasks. However, the limited computation capacity of mobile devices makes deploying large language models for on-device keyboards a bottleneck. In this work, we propose an on-device neural-language-model-based word prediction method that optimizes run-time memory and provides a real-time prediction environment. Our model is 7.40 MB in size and has an average prediction time of 6.47 ms. The proposed model outperforms existing word prediction methods in terms of keystroke savings and word prediction rate, and has been successfully commercialized.

@inproceedings{yu2018device,
  title={On-device neural language model based word prediction},
  author={Yu, Seunghak and Kulkarni, Nilesh and Lee, Haejun and Kim, Jihie},
  booktitle={Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations},
  pages={128--131},
  year={2018}
}

Syllable-level Neural Language Model for Agglutinative Language
Seunghak Yu*, Nilesh Kulkarni*, Haejun Lee, Jihie Kim
EMNLP Workshop on Subword and Character Level Models in NLP (SCLeM), 2017
pdf   abstract   bibtex

Language models for agglutinative languages have long been hindered by the myriad agglutinations that various affixes can produce for any given word. We propose a method to diminish the out-of-vocabulary problem by introducing an embedding derived from syllables and morphemes that leverages this agglutinative property. Our model outperforms character-level embedding in perplexity by 16.87 with 9.50M parameters. The proposed method achieves state-of-the-art performance over existing input prediction methods in terms of keystroke savings and has been commercialized.

@article{yu2017syllable,
  title={Syllable-level neural language model for agglutinative language},
  author={Yu, Seunghak and Kulkarni, Nilesh and Lee, Haejun and Kim, Jihie},
  journal={arXiv preprint arXiv:1708.05515},
  year={2017}
}

Robust kernel principal nested spheres
Suyash Awate*, Manik Dhar*, Nilesh Kulkarni*
ICPR, 2016
pdf   abstract   bibtex

Kernel principal component analysis (kPCA) learns nonlinear modes of variation in the data by nonlinearly mapping the data to kernel feature space and performing (linear) PCA in the associated reproducing kernel Hilbert space (RKHS). However, several widely-used Mercer kernels map data to a Hilbert sphere in RKHS. For such directional data in RKHS, linear analyses can be unnatural or suboptimal. Hence, we propose an alternative to kPCA by extending principal nested spheres (PNS) to RKHS without needing the explicit lifting map underlying the kernel, but solely relying on the kernel trick. It generalizes the model for the residual errors by penalizing the Lp norm / quasi-norm to enable robust learning from corrupted training data. Our method, termed robust kernel PNS (rkPNS), relies on the Riemannian geometry of the Hilbert sphere in RKHS. Relying on rkPNS, we propose novel algorithms for dimensionality reduction and classification (with and without outliers in the training data). Evaluation on real-world datasets shows that rkPNS compares favorably to the state of the art.

@article{Awate2016RobustKP,
  title={Robust kernel principal nested spheres},
  author={Suyash P. Awate and Manik Dhar and Nilesh Kulkarni},
  journal={2016 23rd International Conference on Pattern Recognition (ICPR)},
  year={2016},
  pages={402-407}
}

Patents


Electronic apparatus for compressing language model, electronic apparatus for providing recommendation word and operation methods thereof
Seunghak Yu, Nilesh Kulkarni, Haejun Lee
US Patent App. 15/888,442
patent   abstract

An electronic apparatus for compressing a language model is provided. The apparatus includes a storage configured to store a language model comprising an embedding matrix and a softmax matrix generated by recurrent neural network (RNN) training on basic data consisting of a plurality of sentences, and a processor configured to convert the embedding matrix into a product of a first projection matrix and a shared matrix (the product having the same size as the embedding matrix), to convert the transposed softmax matrix into a product of a second projection matrix and the same shared matrix (the product having the same size as the transposed softmax matrix), and to update the elements of the first projection matrix, the second projection matrix, and the shared matrix by performing the RNN training with respect to these three matrices on the basic data.
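
The factorization in the claim is easier to parse as a small sketch. The following assumes PyTorch; the class name, dimensions, and LSTM choice are illustrative rather than taken from the patent, but the structure mirrors the claim: the embedding matrix is formed as P1 @ S, the transposed softmax matrix as P2 @ S, and P1, P2, and S are updated by the RNN training.

import torch
import torch.nn as nn

class SharedFactorizedLM(nn.Module):
    # Minimal sketch (hypothetical names): the V x d embedding matrix is represented as
    # P1 @ S and the transposed V x d softmax matrix as P2 @ S, so both large tables
    # share one r x d matrix S (choosing r << d yields the compression).
    def __init__(self, vocab_size, hidden_dim, rank):
        super().__init__()
        self.p1 = nn.Parameter(0.02 * torch.randn(vocab_size, rank))      # first projection matrix
        self.p2 = nn.Parameter(0.02 * torch.randn(vocab_size, rank))      # second projection matrix
        self.shared = nn.Parameter(0.02 * torch.randn(rank, hidden_dim))  # shared matrix S
        self.rnn = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)      # the RNN being trained

    def forward(self, token_ids):
        embedding = self.p1 @ self.shared          # reconstructed embedding matrix, V x d
        softmax_w = (self.p2 @ self.shared).t()    # reconstructed softmax matrix, d x V
        x = embedding[token_ids]                   # look up input word embeddings
        h, _ = self.rnn(x)                         # recurrent encoding of the sequence
        return h @ softmax_w                       # next-word logits over the vocabulary

With illustrative sizes such as vocab_size = 100,000, hidden_dim = 600, and rank = 100, the two V x d tables (about 120M values) shrink to roughly 2*V*r + r*d, or about 20M values, while standard RNN training updates P1, P2, and S end to end.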
Website inspired by here