Comparison Of Tag Prediction Techniques For Stack Exchange Posts

Abstract

With the advent of Web 2.0, which allows users to create dynamic web content, and the growing use of social sites such as Stack Overflow and Flickr, the concept of tagging has evolved. Tagging is the process by which users annotate their items, such as questions or photos, with relevant keywords called tags. Assigned tags should describe the subject or context of the posted item. Tags have proved helpful to users because they make it possible to suggest items the users are interested in. In addition, as the data available on the internet grows rapidly, organizing it is becoming difficult, and tagging enables users to organize their items of interest more effectively. The problem addressed in this paper is how the tagging process can be automated by recommending the most appropriate tags to users when they post a new item. Tag recommendation is categorized into Personalized Tagging and Social Tagging, and recommended approaches for both types are presented in the paper.

This research compares different techniques proposed for tag recommendation. In addition, the topic modelling algorithm Latent Dirichlet Allocation (LDA) is implemented for recommending tags, but experimental evaluation on Stack Exchange posts revealed that it is not suitable for this dataset. In contrast, a previously proposed technique named Fuzzy Nearest Neighbor Search gives significantly better results on medium- to large-scale datasets, and an SVM-based method outperforms all other techniques on a small-scale dataset of Stack Exchange posts.

Introduction

As more sites such as Stack Overflow, Flickr and Delicious are launched, the amount of information available on the internet keeps growing, and tagging is becoming popular as a result. Tagging is the process of assigning a keyword that describes the topic of a post or question. The assigned word is called a tag. Tags help to categorize posts with similar meaning. Apart from this, tagging also connects professionals with the posts or questions that they can answer well. [1]

When we have a large amount of data, such as a large number of posts or webpages, we can keep it organized by making use of keywords. These keywords or tags serve as descriptors for a given resource [2]. Existing Question/Answering sites such as Stack Exchange require users to provide appropriate tags for the posts they submit. The provided tags can be a single word or a set of words. These assigned words represent the semantic meaning of the user’s post.

As a result of tagging, we have a powerful technique for categorizing content in a top-down manner, based on the assigned tags and their frequencies. Tagging gives users flexibility by allowing them to choose tags according to their own point of view. Appropriate tags are also useful for retrieving the most suitable documents for a user.

Although tags can be assigned manually, and currently are on some sites such as Stack Exchange, this is a tedious and time-consuming task for a large number of posts. It would be more convenient to automate the task, or at least to provide suggestions to the user at the time of assigning tags.

In a tagging system, there is no restriction on the tags users may assign. Users can freely annotate their items with whatever tags they want. Although this freedom makes the tagging system easy to use and user-friendly, the system also has to bear the cost of providing so much flexibility. This cost is the extensive vocabulary that results from the tags assigned by users. Furthermore, problems such as tag ambiguity and tag redundancy can occur. Tag ambiguity refers to a single tag that carries several different meanings, and tag redundancy refers to several different tags that carry the same meaning [3]. Redundant tags are an obstacle for algorithms that assign tags on the basis of similarities among items, because they complicate the process of identifying those similarities. Ambiguous tags, in turn, misrepresent items by attaching the same tag to dissimilar items [4]. These problems can be overcome by automating the task of tag prediction.

1.1 Research Problem

Question Answering sites are gaining popularity, and thousands of users post on them daily. Different users may post questions with similar semantic meaning but assign different tags to them. Even though the questions address the same problem, when a new user searches for that question, posts are retrieved on the basis of the tags assigned to the already posted questions. The more relevant the assigned tags are, the more appropriate the retrieved posts will be for the user’s needs.

The problem, then, is to organize similar posts or questions in such a way that users do not need to re-post them if they already exist. A new user will not need to wait for an answer, because the similar question posted earlier will already have one. Solving this problem requires automating the tagging task on Question/Answering sites so that users can easily access the posts they need.

1.2 Research Question

To automatically find the most suitable tags for posts on Question/Answering sites, we will look into the different tagging techniques that have been proposed. In short, the objective of the thesis is to address the following research question:

“How can the tagging task for posts on social sites be automated?”

1.3 Structure of Thesis

This thesis is organized as follows:

Chapter 2 provides the required background on the tag prediction field for the intended audience.

Chapter 3 presents the literature review.

Chapter 4 discusses the experimental setup.

Chapter 5 presents the results, including the evaluation of the explored techniques.

Chapter 6 discusses the future work and conclusion.

Extended Background

In this chapter, the basic concepts related to tagging are presented.

2.1 Tagging and Folksonomy

When we talk about tagging, we come across the concept of folksonomy [5]. A folksonomy is a system that enables users to publicly assign tags to content that is available online. The content can be a webpage, video clip, photo, URL, etc. A folksonomy deals with three units: users, tags, and the resources that need to be tagged. Tags are created by users and are then assigned to resources (content). The assigned tags are helpful for classifying, managing and summarizing the content [6]. It is reported that folksonomies can be categorized into two forms: [7]

  • Narrow Folksonomy: Tagging is restricted to a specific, limited number of users.
  • Broad Folksonomy: Tagging is done collectively, with tags shared among a group of users who form a community.
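
As a rough illustration of the three units described above, a folksonomy can be modelled as a collection of (user, tag, resource) assignments. The sketch below is only one convenient representation and is not taken from the cited works; the names `TagAssignment`, `tags_for_resource` and `users_of_tag`, as well as the sample data, are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class TagAssignment(NamedTuple):
    """One annotation: a user attaches a tag to a resource (hypothetical model)."""
    user: str
    tag: str
    resource: str


# Made-up sample data, for illustration only.
assignments = [
    TagAssignment("alice", "python", "post-1"),
    TagAssignment("bob", "python", "post-1"),
    TagAssignment("bob", "pandas", "post-1"),
    TagAssignment("alice", "regex", "post-2"),
]


def tags_for_resource(data, resource):
    """Count how often each tag was assigned to the given resource."""
    return Counter(a.tag for a in data if a.resource == resource)


def users_of_tag(data, tag):
    """Return the set of users who have used the given tag."""
    return {a.user for a in data if a.tag == tag}


print(tags_for_resource(assignments, "post-1"))
print(users_of_tag(assignments, "python"))
```

Keeping the user in every assignment makes it possible to look at the same data either per resource, as a broad folksonomy would, or per user.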

2.2 Subtypes of Tagging

Tagging has two different types: Social Tagging and Personalized Tagging.

2.2.1.  Social Tagging:

Social tagging is also known as Collaborative Tagging. As the name indicates, this type of tagging assigns tags to resources by taking into account the tags assigned by other users to the same item [8]. Social tagging is considered to be the most popular type of tagging. It allows users to assign tags in free form to resources that are available online. With this kind of tagging, a loosely bound classifier for online content can be built from the feedback of a large group of users. In social tagging, tags are shared among all users [11].

Social tagging provides several advantages. It allows users to connect socially with each other, and to manage and access their relevant information easily. Users do not need any expertise to participate in this type of tagging. In addition, because tags are refined continuously, this type of tagging is highly responsive to changes in terminology and to new kinds of resources.

It is reported that systems following this kind of tag recommendation fail for two main reasons. The first is that every user assigns tags according to his or her own interest in the item’s information. The second is that each item to be tagged can have multiple aspects. For example, suppose two users want to assign tags to pictures available online. One user is enthusiastic about Mac systems and the other about fruit, and each retrieves images according to these interests: the first retrieves an image of a Mac system and the second an image of an apple. Most probably both of them will tag their images as “apple”. If both users later want to retrieve their images again, both images, i.e. the Mac system and the fruit, will be shown to each of them because they carry the same tag. Social tagging therefore leads to this kind of ambiguity [10].

Sites such as Flickr and Delicious make use of social tagging.

2.2.2.  Personalized Tagging:

As the name indicates, personalized tagging does not take into account the tags used by all users, as social tagging does. Instead, it considers the tags that have already been used by a specific user, or by users who have the same profile, e.g. age group and vocabulary. At an abstract level, we can say that the tagging history of the user is considered in personalized tagging. This tagging history maintains the vocabulary of each user.

Social tagging allows too much freedom in the choice of tags and does not take into account the interests of a specific user. Every user has his or her own needs, depending on the areas of interest. Let us say we have three types of users: a zoologist, a rich man, and a youngster. All of them are interested in a resource tagged as Jaguar. For the zoologist, Jaguar corresponds to an animal. For the rich man, Jaguar is a car, and the youngster might take it to be the movie “The Jaguar”. Personalized tag recommendation came into being to fulfil such user-specific needs [10]. The tag ambiguity problem faced in social tagging can be avoided in personalized tagging. Since it deals with the interests of only a specific user instead of the interests of several users, this type of tagging improves the retrieval system for that user. Whenever a user assigns a tag to a resource, the recommender system suggests tags by analyzing the tags previously used by that user or by users with the same interests.
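
The idea of using a tagging history can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not a method from the cited works: the function `recommend_personalized`, the user names and the histories are all hypothetical, and candidate tags are simply ranked by how often the user has used them before.

```python
from collections import Counter

# Hypothetical tagging histories: user -> tags the user assigned earlier.
history = {
    "zoologist": ["animal", "wildlife", "animal", "big-cat"],
    "car_fan": ["car", "engine", "car", "luxury"],
}


def recommend_personalized(user, candidate_tags, histories, k=2):
    """Rank candidate tags by how often this user has used them before.

    Candidates the user never used have a count of zero and therefore
    sort last; ties keep the original candidate order (sorted is stable).
    """
    counts = Counter(histories.get(user, []))
    ranked = sorted(candidate_tags, key=lambda t: -counts[t])
    return ranked[:k]


# Candidate tags for a resource named "Jaguar" (made-up values).
candidates = ["car", "animal", "movie", "big-cat"]

print(recommend_personalized("zoologist", candidates, history))
print(recommend_personalized("car_fan", candidates, history))
```

For the same resource, the two users receive different suggestions because only their own histories are consulted. A user without any history gets the candidates in their original order, which hints at why new users are hard to serve.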

2.3 Types of Tags

Different types of tags have also been reported in [11]. These are as follows:

TABLE 1: TYPES OF TAGS

The above-mentioned tag types are defined at an abstract level, without taking any linguistics into account. Tags have also been defined on the basis of linguistics. These include functional tags, functional collocation tags, origin collocation tags, function and origin tags, taxonomic tags, adjective tags, verb tags and proper name tags. Gaming-specific tags have also been found.

2.4 Characteristics of Tags

There should be some pre-defined criteria for good tags. Otherwise, tagging will be done blindly, which may lead to poor tag assignment. Criteria for good tags are reported in [12]. According to these criteria, good tags should have the following characteristics:

  • Tags should be generic so that they cover multiple concepts.
  • Tags should be popular, i.e. frequently used, so that they are less likely to be spam.
  • Each tag should identify only a limited number of resources.
  • Tags that are used for personal organization should not be publicly available, so these tags should be excluded.
  • Tags should be normalized to avoid syntactic variance and synonymy problems.
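
Two of these criteria, normalization and popularity, lend themselves to a short sketch. The code below is an assumption about how they might be applied and is not taken from [12]: the helper names `normalize_tag` and `popular_tags`, the hyphen convention and the threshold `min_count` are all hypothetical. It handles syntactic variance only; resolving synonyms would need additional knowledge such as a thesaurus.

```python
import re
from collections import Counter


def normalize_tag(tag):
    """Lowercase a tag and collapse spaces/underscores into single hyphens."""
    tag = tag.strip().lower()
    return re.sub(r"[\s_]+", "-", tag)


def popular_tags(raw_tags, min_count=2):
    """Normalize tags, then keep only those used at least min_count times."""
    counts = Counter(normalize_tag(t) for t in raw_tags)
    return {tag: n for tag, n in counts.items() if n >= min_count}


# Made-up raw tags showing syntactic variants of the same tag.
raw = [
    "Machine Learning",
    "machine_learning",
    "machine-learning",
    "toread",
    "Python",
    "python ",
]

print(popular_tags(raw))
```

After normalization the three spellings of the first tag are counted together, while the rarely used personal tag "toread" falls below the threshold and is dropped.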

2.5 Motivation of Tagging

It is necessary to present the motivation for tagging, which includes several reasons. Tags serve as an index for a user’s content. Tags act as descriptors for an item [22]: they describe the semantic meaning of a resource, that is, they state precisely what the resource is about [6]. Tags also work as a classification tool by classifying the user’s content according to some characteristics [23]. One basic reason is that they help users improve their tagging skills. At the system level, tags are used to enlarge the set of tags available for tagging untagged resources [24]. Apart from all of these reasons, tags make the retrieval system efficient: only the items relevant to the user’s requirement are retrieved if annotation is done properly.

Redundant items on Question/Answering sites can also be reduced by using tagging. Before posting, a user checks whether the required post is already there. If the user finds the content, he or she does not post it again. The content can only be found if it is annotated properly, so that it is shown to the user whenever he or she searches for it.

Other motivations for tagging include attracting attention on sites like Flickr, where tags are assigned to photos, expressing opinions on sites like Yahoo!, and organizing tasks by using tags like “toread” [11].

Tagging is easy to learn and use. Users do not need any experience or knowledge before they start tagging. It gives users flexibility by allowing them to add or remove tags. The tagging process also helps to create communities of users who have shared preferences [25].

2.6 Tag Recommendation Methods

Tags can be recommended to the user by different techniques. These techniques can be separated into those for personalized tagging and those for social tagging. The tag recommendation process might take into account the text of the post that is being tagged, or it might consider several posts similar to the post that is about to be tagged. Tags can also be recommended on the basis of similar interests among users who form a certain group. Building on these ideas, different methods have been proposed for giving recommendations to users [13]. A hierarchical representation of these methods is shown in the accompanying PDF.

2.6.1.  Personalized Tagging Approaches:

Under Personalized Tagging, the following two approaches are proposed: [14]

  • Content-Based Approach:

As the name indicates, the content-based approach deals with the textual information of the resource that is being tagged. The tags recommended to the user are those of resources that the user has already interacted with and whose textual information or metadata is similar to that of the current resource. Similar resources are found through similarity-based retrieval. One of the simplest methods is to calculate the cosine similarity between resources [13]. Other techniques for suggesting tags on the basis of content include probabilistic approaches, neighborhood-based approaches and the use of classifiers. [14] [15]

  • Graph-Based Approach:

The graph-based approach is potentially stronger than the content-based approach. It considers every similar user and every tagged item. Normally, a threshold is put on the number of times a tag occurs across posts. When proposing candidate tags, a graph-based ranking algorithm considers the relevance between the document and the user’s preferences [14].
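
The cosine similarity mentioned for the content-based approach can be sketched as follows. This is a minimal illustration under stated assumptions rather than the method of [13]: texts are reduced to raw word counts with a naive whitespace tokenizer, and the function names, the `posts` structure and the sample texts are hypothetical. For two term-count vectors a and b, the similarity is the dot product of a and b divided by the product of their Euclidean norms.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_tags(new_text, tagged_posts, k=1):
    """Return the tags of the k tagged posts most similar to new_text."""
    query = bag_of_words(new_text)
    ranked = sorted(
        tagged_posts,
        key=lambda post: cosine_similarity(query, bag_of_words(post["text"])),
        reverse=True,
    )
    tags = []
    for post in ranked[:k]:
        for tag in post["tags"]:
            if tag not in tags:
                tags.append(tag)
    return tags


# Made-up posts the user has already interacted with.
posts = [
    {"text": "how to merge two data frames in pandas", "tags": ["python", "pandas"]},
    {"text": "segmentation fault when freeing a pointer in c", "tags": ["c", "memory"]},
]

print(recommend_tags("merge data frames by column in pandas", posts))
```

In practice, raw counts are often replaced by a weighting such as TF-IDF so that very common words contribute less to the similarity, but the retrieval step stays the same.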

2.6.2.  Social Tagging Approaches:

For social tagging, collaborative filtering is the suggested technique. Collaborative filtering has proved advantageous in the sense that it can filter any type of item without considering its content. Still, it has two major drawbacks: the cold-start problem and sparsity [16].

The cold-start problem is encountered when we need to tag a new item, i.e. a cold-start item, or when we encounter a new user, i.e. a cold-start user. In both cases, we are unable to recommend high-quality tags. The problem arises when no item or user similar to the newly arrived item or user has been encountered before.

The second problem is sparsity, which arises when there is insufficient data for identifying similar users or similar items. This problem is caused by a lack of overlap among users or items.
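
Both drawbacks can be made concrete with a small sketch of user-based collaborative filtering. The code is an illustration under stated assumptions, not the algorithm of [16]: the data, the function names and the choice of Jaccard overlap as the user similarity are all hypothetical. Note that the content of the items is never inspected; only who tagged what is used.

```python
from collections import defaultdict

# Hypothetical data: user -> {item: set of tags that user assigned}.
annotations = {
    "alice": {"q1": {"python", "pandas"}, "q2": {"regex"}},
    "bob": {"q1": {"python"}, "q3": {"sql"}},
    "carol": {"q3": {"sql", "join"}},
}


def user_similarity(u, v, data):
    """Jaccard overlap of the item sets two users have tagged."""
    items_u, items_v = set(data.get(u, {})), set(data.get(v, {}))
    union = items_u | items_v
    return len(items_u & items_v) / len(union) if union else 0.0


def recommend(user, item, data, k=3):
    """Score tags other users gave to `item`, weighted by user similarity."""
    scores = defaultdict(float)
    for other, tagged in data.items():
        if other == user or item not in tagged:
            continue
        weight = user_similarity(user, other, data)
        for tag in tagged[item]:
            scores[tag] += weight
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [tag for tag, score in ranked[:k] if score > 0]


def sparsity(data):
    """Fraction of (user, item) pairs that carry no annotation at all."""
    items = {item for tagged in data.values() for item in tagged}
    filled = sum(len(tagged) for tagged in data.values())
    total = len(data) * len(items)
    return 1 - filled / total if total else 0.0


print(recommend("bob", "q2", annotations))   # item already tagged by a similar user
print(recommend("bob", "q9", annotations))   # cold-start item: nobody has tagged it
print(recommend("dave", "q1", annotations))  # cold-start user: no overlap with anyone
print(round(sparsity(annotations), 2))
```

The second and third calls return empty lists, which is the cold-start problem in miniature. The sparsity value shows how few of the possible user and item pairs are filled even in this tiny data set.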

Collaborative filtering is further categorized as follows: [17] [18].
