When people look at a scene, they see objects and the relationships between them. On top of your desk, there may be a laptop sitting to the left of a phone that is in front of a computer screen.
Many deep learning models struggle to see the world in this way because they do not understand the intricate relationships between individual objects. Without knowledge of these relationships, a robot designed to help someone in a kitchen would have difficulty following a command such as “pick up the spatula to the left of the stove and place it on top of the cutting board.”
In an attempt to solve this problem, MIT researchers have developed a model that understands the underlying relationships between objects in a scene. Their model represents individual relationships one at a time and then combines these representations to describe the overall scene. This allows the model to generate more accurate images based on text descriptions, even when the scene contains multiple objects arranged in different relationships to each other.
This work could be applied in situations where industrial robots must perform complex, multistep manipulation tasks, such as stacking objects in a warehouse or assembling appliances. It also moves the field one step closer to enabling machines that can learn from and interact with their environments more as humans do.
“When I look at a table, I can’t say that there is an object at the XYZ location. Our minds don’t work that way. In our minds, when we understand a scene, we really understand it based on the relationships between the objects. We think that by building a system that can understand the relationships between objects, we could use that system to more effectively manipulate and change our environments,” says Yilun Du, a PhD student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and co-lead author of the paper.
Du wrote the paper with co-lead authors Shuang Li, a CSAIL PhD student, and Nan Liu, a graduate student at the University of Illinois at Urbana-Champaign; as well as Joshua B. Tenenbaum, the Paul E. Newton Career Development Professor of Cognitive Science and Computation in the Department of Brain and Cognitive Sciences and a member of CSAIL; and senior author Antonio Torralba, the Delta Electronics Professor of Electrical Engineering and Computer Science and a member of CSAIL. The research will be presented at the Conference on Neural Information Processing Systems in December.
One relationship at a time
The framework the researchers developed can generate an image of a scene based on a textual description of objects and their relationships, such as “A wooden table to the left of a blue stool. A red sofa to the right of a blue stool.”
Their system breaks these sentences into two smaller pieces that describe each individual relationship (“a wooden table to the left of a blue stool” and “a red sofa to the right of a blue stool”), and then models each part separately. These pieces are then combined through an optimization process that generates an image of the scene.
The researchers used a machine learning technique called energy-based models to represent the individual object relationships in a scene description. This technique allows them to use one energy-based model to encode each relationship description and then compose them together in a way that infers all objects and relationships.
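The composition step can be pictured with a short sketch. The snippet below is only a minimal illustration of the general recipe for composing energy-based models, not the authors’ implementation: it assumes a hypothetical `RelationEBM` network that assigns a low energy when an image satisfies one relation description, sums those energies across relations, and nudges a random image downhill on the combined energy.

```python
import torch


class RelationEBM(torch.nn.Module):
    """Hypothetical energy network: low energy means the image satisfies the relation."""

    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.encoder = torch.nn.Sequential(
            torch.nn.Conv2d(3, 32, 3, stride=2, padding=1), torch.nn.SiLU(),
            torch.nn.Conv2d(32, 64, 3, stride=2, padding=1), torch.nn.SiLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
        )
        self.head = torch.nn.Linear(64 + embed_dim, 1)

    def forward(self, image, relation_embedding):
        features = self.encoder(image)
        return self.head(torch.cat([features, relation_embedding], dim=-1))


def generate(model, relation_embeddings, steps=200, step_size=0.01, noise=0.005):
    """Refine a random image so the *sum* of per-relation energies becomes low."""
    image = torch.rand(1, 3, 64, 64, requires_grad=True)
    for _ in range(steps):
        # One shared model scores each relation; composing them = adding the energies.
        total_energy = sum(model(image, e.unsqueeze(0)) for e in relation_embeddings).sum()
        (grad,) = torch.autograd.grad(total_energy, image)
        with torch.no_grad():
            image -= step_size * grad                  # step toward lower combined energy
            image += noise * torch.randn_like(image)   # small noise, Langevin-style
            image.clamp_(0.0, 1.0)
    return image.detach()


# Stand-in embeddings for "a wooden table to the left of a blue stool" and
# "a red sofa to the right of a blue stool" (a real system would use a text encoder).
relations = [torch.randn(64), torch.randn(64)]
scene = generate(RelationEBM(), relations)
```

In the actual system the per-relation energy functions are learned from data; the sketch only shows why summing independently modeled energies lets separate relations be recombined at generation time.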
By breaking the sentences into shorter pieces for each relationship, the system can recombine them in a variety of ways, making it better able to adapt to scene descriptions it has not seen before, Li explains.
“Other systems would take all the relationships holistically and generate the image one-shot from the description. However, such approaches fail when we have out-of-distribution descriptions, such as descriptions with more relationships, since these models can’t really adapt one shot to generate images containing more relationships. But since we compose these separate, smaller models together, we can model a larger number of relationships and adapt to novel combinations,” Li says.
The system also works in reverse: given an image, it can find text descriptions that match the relationships between objects in the scene. In addition, their model can be used to edit an image by rearranging the objects in the scene so they match a new description.
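Under the assumptions of the sketch above, this reverse direction amounts to scoring each candidate description against the image and returning the one whose relations give the lowest total energy. The `embed_relations` helper here is hypothetical, standing in for the step that splits a description into per-relation sentences and embeds each one.

```python
def best_description(model, image, candidates, embed_relations):
    """Return the candidate description whose relations the image satisfies best.

    `embed_relations` is a hypothetical helper that splits a description into
    per-relation sentences and returns one embedding per relation.
    """
    scores = []
    with torch.no_grad():
        for description in candidates:
            embeddings = embed_relations(description)
            # Lower combined energy = better match between image and description.
            total = sum(model(image, e.unsqueeze(0)) for e in embeddings).item()
            scores.append(total)
    return candidates[scores.index(min(scores))]
```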
Understanding complex scenes
The researchers compared their model to other deep learning methods that were given text descriptions and tasked with generating images displaying the corresponding objects and their relationships. In each case, their model outperformed the baselines.
They also asked human participants to evaluate whether the generated images matched the original scene description. In the most complex examples, where descriptions contained three relationships, 91 percent of participants concluded that the new model performed better.
“One interesting thing we found is that for our model, we can increase our sentence from having one relation description to having two, or three, or even four descriptions, and our approach continues to be able to generate images that are correctly described by those descriptions, while other methods fail,” Du says.
The researchers also showed the model images of scenes it had not seen before, as well as several different text descriptions of each image, and it was able to identify the description that best matched the object relations in the image.
And when the researchers gave the system two relational scene descriptions that described the same image, but in different ways, the model was able to understand that the descriptions were equivalent.
The researchers were impressed with the robustness of their model, especially when working with descriptions it had not encountered before.
“This is very promising because it is closer to how humans work. Humans may only see a few examples, but we can extract useful information from just those few examples and combine them to create infinite combinations. And our model has such a property that allows it to learn from less data but generalize to more complex scenes or image generations,” says Li.
While these early findings are encouraging, the researchers would like to see how their model performs on real-world images that are more complex, with noisy backgrounds and objects that are blocking one another.
They are also interested in eventually incorporating their model into robotic systems so that a robot can derive object relations from videos and then apply this knowledge to manipulate objects in the world.
“Developing visual representations that can handle the compositional nature of the world around us is one of the key open problems in computer vision. This paper makes significant progress on this problem by proposing an energy-based model that explicitly models multiple relations between the objects depicted. The results are really impressive,” says Josef Sivic, a distinguished researcher at the Czech Institute of Informatics, Robotics and Cybernetics at the Czech Technical University, who was not involved in this research.
This research is supported in part by Raytheon BBN Technologies Corp., Mitsubishi Electric Research Laboratory, National Science Foundation, Office of Naval Research, and IBM Thomas J. Watson Research Center.
###
Written by Adam Zewe, MIT News Office
Paper: “Learning to Compose Visual Relations”
https://ift.tt/3oGPij9