News Story Ranking Using Blogosphere
- Since its advent, the Internet has become one of the most important channels for communicating information among users, including individuals and news organizations. Many news organizations now distribute news stories on the Internet, and various news channels publish a large number of stories every day. This makes it difficult to keep track of important news stories. As a result, users' need to identify top news stories has increased, and news story search plays an increasingly important role in users' Internet activity.
The objective of this dissertation is to identify important news stories for a given date using the blogosphere. Blogs consist of posts that are user-generated content and reflect users' diverse opinions about news stories. Therefore, a news story that attracts much attention in the blogosphere is likely to be important. In this dissertation, we define the popularity of a news story as the amount of attention it receives from users within the blogosphere. We first evaluate the popularity of a news story in terms of the content similarity between the story and the blog posts published on a given date. For this purpose, we propose several approaches to estimating language models for the story and for the blog posts. We also generate a temporal profile of a news story by analyzing how the number of blog posts relevant to the story is distributed over several days, and we evaluate the popularity of the story based on that profile. Experimental results on the TREC 2009 and 2010 Blog Track show that our approach is effective in identifying important news stories. In particular, the proposed approach achieved state-of-the-art performance.
Furthermore, we propose a simple but effective approach to dealing with noisy information in blog posts. Blog posts generally include several types of noisy information, such as blog templates, advertisements, and navigation panels. This noisy information is not user-generated content, and it degrades the performance of our system for identifying important news stories. The motivation for our approach is that most noisy content does not change across consecutive posts within the same blog. To eliminate the noisy information, we compare two consecutive posts belonging to the same blog, treat the parts common to both posts as noisy content, and remove them. Experimental results on the TREC Blog Track show that a retrieval system using the proposed method improves MAP (Mean Average Precision) by about 10% over the baseline system.
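The noise-removal idea described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the abstract does not say at what granularity the two posts are compared, so the sketch assumes each "part" is a line of text, and the function name `strip_shared_lines` is hypothetical.

```python
def strip_shared_lines(previous_post: str, current_post: str) -> str:
    """Remove from current_post every line that also occurs in previous_post.

    Assumption (not from the source): a "common part" is an identical
    line of text, compared after trimming surrounding whitespace.
    """
    # Non-empty lines of the neighbouring post, used as the noise candidates.
    shared = {line.strip() for line in previous_post.splitlines() if line.strip()}

    # Keep only the non-empty lines that do not appear in the neighbouring post.
    kept = [
        line.strip()
        for line in current_post.splitlines()
        if line.strip() and line.strip() not in shared
    ]
    return "\n".join(kept)


# Two consecutive posts from the same (made-up) blog: the title bar,
# navigation panel, and advertisement repeat, the post body does not.
post_a = (
    "My Blog\n"
    "Home | About\n"
    "The election results surprised everyone.\n"
    "Buy cheap widgets"
)
post_b = (
    "My Blog\n"
    "Home | About\n"
    "A new phone was announced today.\n"
    "Buy cheap widgets"
)

print(strip_shared_lines(post_a, post_b))
```

Under the line-level assumption, only the sentence unique to the second post survives. A real system would likely need a more tolerant comparison, since templates can differ slightly between posts (dates, comment counts).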