We are hiring!

Full-Time Research Assistant

Research Assistant – Multimodal AI for Computational Psychiatry

Weill Cornell Medicine, Department of Psychiatry (Grosenick & Solomonov Labs)

We are seeking a Research Assistant with strong programming and machine learning (ML) expertise spanning computer vision, large language models (LLMs), and state-space/sequence modeling for a joint position in the Grosenick Lab and the Solomonov Lab, supervised by Dr. Logan Grosenick and Dr. Nili Solomonov. The role focuses on building multimodal models of psychotherapy interactions that fuse facial dynamics, vocal acoustics, and speech-to-text transcripts to study mood disorders and the therapeutic process. You will work with modern face-tracking and facial-representation methods (3DMMs, FACS/Action Units, and neural rendering methods such as 3DGS and surfels), audio and prosody features, LLM-based analysis of clinical language, and state-space models (Mamba, S4) for temporal fusion across modalities.

This position combines computational research with optional clinical research exposure and is ideal for candidates pursuing PhD/MD training.

Full-time (2-year commitment) | Primarily in-person (1300 York Ave, NYC)

For more information, see the lab websites: https://www.grosenicklab.org/ and https://www.solomonovlab.com/

Core Responsibilities

Multimodal Modeling (Primary)

  • Face/video: Build face and head tracking pipelines using modern methods, including 3D Morphable Models (FLAME) and FLAME-based reconstruction methods (DECA, EMOCA, MICA), Facial Action Unit detection (OpenFace, Py-Feat), and point-based/neural rendering approaches (3DGS, surfels), bridging and comparing geometric and FACS-based representations

  • Voice/audio: Extract acoustic and prosodic features (librosa, openSMILE, Praat) and apply modern speech models (Whisper, WavLM, HuBERT) for speech representation, diarization, and speech-to-text

  • Language/LLMs: Analyze clinical transcripts and ecological momentary assessment (EMA) text using transformer embeddings, LLM APIs, and topic/discourse modeling (Hugging Face, spaCy)

  • Temporal fusion: Develop state-space and sequence models (Mamba, S4, linear RNNs, transformers) that integrate face/voice/text streams over the course of therapy sessions

  • Maintain reproducible Git-based workflows with rigorous evaluation and ablation studies

Secondary

  • Statistical and longitudinal modeling of derived multimodal features

  • Contribute to digital phenotyping and neuroimaging pipelines where projects intersect

Clinical / Research (optional or partial effort)

  • Participant recruitment, screening, and clinical/cognitive assessments

  • MRI session support and neuroimaging data handling

  • Data QC, database management, IRB support

Scholarly Work

  • Contribute to manuscripts, abstracts, and grants

Required Qualifications

  • Bachelor’s degree in Computer Science, Data Science, Engineering, Applied Math, or a related field

  • Strong proficiency in Python (required) and PyTorch, as well as pandas, NumPy, and scikit-learn

  • Hands-on experience with modern deep learning workflows (training, fine-tuning, evaluation)

  • Demonstrated experience in at least two of the following three areas: computer vision, NLP/LLMs, and audio or sequence modeling, with willingness to ramp up on the third

  • Experience handling real-world datasets (cleaning, missingness, pipelines)

Preferred Technical Skills

Computer vision/face

  • 3DMMs (FLAME, BFM), neural rendering (3DGS, surfels), dense face methods (DECA, EMOCA, MICA)

  • FACS/Action Unit tooling: OpenFace, Py-Feat, GraphAU

  • Landmark/mesh tracking: MediaPipe, FAN, dlib

LLMs/NLP

  • Hugging Face, spaCy, LLM APIs (OpenAI, Anthropic), retrieval / embeddings, fine-tuning

  • Experience with clinical or conversational text is a strong plus

Audio/speech

  • librosa, openSMILE, Praat for acoustic / prosodic features

  • Whisper, WavLM, HuBERT or similar for ASR / speech representation

  • Speaker diarization

Sequence/state-space modeling

  • Mamba, S4, linear RNNs, switching state-space models, and transformer variants for long-context sequence modeling

  • Multimodal fusion architectures

Other

  • Neuroimaging: Nilearn, FSL, SPM, AFNI

  • Open-source contributions

  • Experience with multimodal or longitudinal data

Details

  • Start: Immediate

  • Salary: $22.03–$30.00/hour

  • Eligibility: U.S. work authorization required

Apply

Send CV and cover letter to Jordan Serrano-Guedea (jas4060@med.cornell.edu)

Applications are reviewed on a rolling basis until the position is filled.

1300 York Ave., New York, NY 10065