A review of the ACM MM paper: “How ‘How’ Reflects What’s What: Content-based Exploitation of How Users Frame Social Images” (Michael Riegler et al., MM’14, pp. 397–406)
Since the beginning of the Internet, images, along with text, have been among the most important parts of any search engine. The most naive and appealing approach was to tag each image with its content. This approach was simple enough to be effective, but only for very small image databases. The explosion of social networks like Facebook, Twitter, Instagram, and Flickr has given us enormous datasets of daily-life photos. This growth in dataset size calls for smart classification algorithms that classify photos efficiently based on their semantic content, in a way that resembles the human visual processing mechanism. Users then benefit greatly when searching for the right photo, since each photo is classified by the computer in a way close to how we humans perceive it. Recently, the newly released Google Photos app has reflected the state of the art in semantic image classification: it is quite simple to search one’s own photos with only general keywords, for example “baby”, “two people”, or “a dog”. In the paper “How ‘How’ Reflects What’s What: Content-based Exploitation of How Users Frame Social Images”, Michael Riegler et al. propose a new perspective, called **Intentional Framing**, for classifying photos based on the way they were shot, or framed.
Much of the research on image semantics has relied on “what” the photo depicts. This “what” can be categorized into three major research areas: *concepts*, *scenes*, and *events*.
**Concepts** are objects and other entities that are literally depicted in the photo. The notion derives from psychology and cognitive science, and relates closely to how humans store, organize, and process information about the world.
**Scenes** are rooted in perceptual psychology and are described as “the visual perception of an environment as seen by an observer at a given time”. The common ground between scenes and intentional framing is that both aim at capturing global image characteristics. However, while scenes focus on “what” is depicted, intentional framing focuses on “how” the subject matter is presented.
**Events** are “specific incidents taking place at or over a given time span, involving one or more actors or objects and a specific place”. Event research focuses on classifying the depiction of social events like weddings, parties, and sports, and does not overlap directly with intentional framing.
What is the problem?
As mentioned at the beginning, semantic image classification is one of the important branches of computer vision research. Besides approaches that focus on the subject matter depicted in a photo, intentional framing offers another perspective on the semantic classification problem: it tries to exploit information about how the photo was taken, or in other words, how the photographer framed the view.
How does intentional framing address image semantic classification?
By introducing intentional framing, defined as “the sum of the choices made by photographers on exactly how to portray the subject matter that they have decided to photograph”, the authors exploit the psychological aspects of how humans take and view photos: the correspondence between the photographer, who sends a message through the framing of the photo, and the viewer, who perceives that message while observing it.
**Photographer Choices** The choice of how to frame a photo depends heavily on lighting, color, the position of objects, camera angle, depth of field, focus, and also timing. In general, these are not purely personal choices but reflect expectations shared between photographer and viewer about how to portray the world. Through these framing conventions, the photographer actively encodes how the photo should be interpreted.
**Viewer Interpretation** Viewer interpretations of an image are usually tightly synchronized with the intention of the photographer (especially for social network photos). This follows from shared, stereotyped conventions about how an object would be portrayed.
By exploiting these common conventions between photographer and viewer, intentional framing strives to produce interpretations that are closer to the way humans interpret the messages in a photo.
Through an experiment on a dataset from Flickr, SimSea, the implementation of intentional-framing image search, succeeded in surpassing the best results from the ACM MM Grand Challenge 2013 in mean interpolated averaged precision (MiAP) as well as in processing time:
The processing time of SimSea was reduced to 300 ms for classifying a single photo on a Windows 7 machine with an Intel Core i7 and 16 GB of memory. (The other methods, SMaL and SVM, take 10 minutes and 2.5 seconds, respectively, on a 24-core Intel Xeon Q6600 2.0 GHz with 128 GB of RAM.)
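For readers unfamiliar with the MiAP figure used in the comparison, the metric can be sketched in a few lines. The following is an illustrative Python implementation of interpolated average precision, averaged over queries; the function names and data layout are my own, not the paper’s evaluation code.

```python
def interpolated_ap(ranked_relevance, n_relevant):
    """Interpolated average precision for one query.

    ranked_relevance: 0/1 relevance flags for the ranked result list.
    n_relevant: total number of relevant items for the query.
    """
    # Precision at each rank where a relevant item is retrieved.
    precisions, hits = [], 0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        return 0.0
    # Interpolation: precision at each recall point becomes the maximum
    # precision at that or any deeper recall level.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    return sum(precisions) / n_relevant

def miap(per_query):
    """Mean interpolated AP over queries: [(ranked_relevance, n_relevant), ...]."""
    return sum(interpolated_ap(r, n) for r, n in per_query) / len(per_query)
```

A perfect ranking (all relevant items first) yields an AP of 1.0, so MiAP rewards systems that place semantically matching photos at the top of the result list, which is exactly what the Grand Challenge comparison measures.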
Why do I like this paper?
The reason I like this paper comes mostly from its motivation. The paper addresses the psychological aspect of photography. By focusing on social network photos, the authors point out the synchronization between photographer and viewer: both share conventions for framing and interpreting the world. Given this characteristic, there is good reason to believe that intentional framing will help computers look at a photo in a way that is closer to human perception. Moreover, the solution used in the implementation of intentional-framing search, SimSea, is classic yet powerful for semantic image classification, which leads to its good performance on photo classification.