A student and professor sit next to each other looking at a laptop

Episode of 'Friends' Inspires New Tool that Provides Human-like Perception to MLLMs

For Jitesh Jain, conducting a simple experiment while watching one of his favorite TV series became the genesis of a paper accepted into a prestigious computer vision conference.

Jain is the creator of VCoder, a new tool that enhances the visual perception capabilities of multimodal large language models (MLLMs). Jain said MLLMs like GPT-4 with vision (GPT-4V) are prone to missing obscure objects that blend in with other objects in an image.

Jain paused his TV as he watched "The One with the Halloween Party," an episode of the popular TV series Friends.

Chandler stood out the most in a pink bunny costume while holding hands with Ross in a potato costume. As the two prepared for an arm-wrestling match with Joey and Phoebe, multiple groups socialized behind them.

Jain wondered how accurate GPT-4V would be if he prompted it to describe everything happening in this image.

Read more at cc.gatech.edu