| Abstract: | Performance evaluation is an important issue in Web search engine research. Traditional evaluation methods rely on considerable human effort and are therefore quite time-consuming. Using clickthrough data analysis, we propose an automatic search engine performance evaluation method. This method generates navigational-type query topics and answers automatically based on search users' querying and clicking behavior. Experimental results based on a commercial Chinese search engine's user logs show that the automatic method produces evaluation results similar to those of traditional assessor-based methods. |
|---|---|
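| Introduction: | This research was conducted jointly by the State Key Laboratory of Intelligent Technology and Systems at Tsinghua University and the Sohu R&D Center, using query log data provided by SogouLab. It will be published at the 2007 International World Wide Web Conference (WWW2007). |
| Citation: | Yiqun Liu, Yupeng Fu, Min Zhang, Shaoping Ma, Liyun Ru, Automatic Search Engine Performance Evaluation with Click-through Data Analysis. Poster proceedings of the 16th International World Wide Web Conference (WWW2007), 2007, Banff, Alberta, Canada. |
| Download link: | Click here to download |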
| Abstract: | We report on a study that was undertaken to better identify users’ goals behind web search queries by using click through data. Based on user logs which contain over 80 million queries and corresponding click through data, we found that query type identification benefits from click through data analysis, while anchor text information may not be so useful because it is only accessible for a small part (about 16%) of practical user queries. We also propose two novel features extracted from click through data and a decision tree based classification algorithm for identifying user queries. Our experimental evaluation shows that this algorithm can correctly identify the goals of about 80% of web search queries. |
|---|---|
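| Introduction: | This research was conducted jointly by the State Key Laboratory of Intelligent Technology and Systems at Tsinghua University and the Sohu R&D Center, using query log data provided by SogouLab. It was published at the 2006 Asia Information Retrieval Symposium. |
| Citation: | Yiqun Liu, Min Zhang, Liyun Ru, Shaoping Ma, Automatic Query Type Identification Based on Click Through Information, Asia Information Retrieval Symposium, AIRS 2006, in Lecture Notes in Computer Science Vol. 4182: pp. 593-600, 2006. |
| Download link: | Click here to download |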
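| Abstract: | User behavior analysis is an important foundation for advancing web information retrieval, and a basic starting point for many of the algorithms that play an important role in commercial search engines. To better understand the search behavior of Chinese search users, this paper analyzes nearly 50 million query log entries collected by the Sogou search engine over one month. We analyze user behavior in terms of the distribution of unique queries, users' query habits within a single session, and whether users use advanced search features. The conclusions offer useful guidance for improving the retrieval algorithms of Chinese search engines and for evaluating retrieval effectiveness more accurately. |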
|---|---|
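| Introduction: | This research was conducted jointly by the State Key Laboratory of Intelligent Technology and Systems at Tsinghua University and the Sohu R&D Center, using query log data provided by SogouLab. It was published at the 3rd Student Workshop on Computational Linguistics in 2006, where it was named an outstanding paper of the conference. |
| Citation: | Huijia Yu (余慧佳), Yiqun Liu, Min Zhang, Liyun Ru, Shaoping Ma, User Behavior Research on Web Search Engines Based on Large-Scale Log Analysis (title translated from the Chinese: 基于大规模日志分析的网络搜索引擎用户行为研究). The 3rd Student Workshop on Computational Linguistics (SWCL2006). |
| Download link: | Click here to download |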
| Abstract: | We report on a study that was undertaken to better understand what kinds of Web pages are the most useful for web search engine users by exploiting query-independent features of retrieval target pages. To our knowledge, there has been little research towards query-independent web page cleansing for web information retrieval. Based on more than 30 million web pages obtained both from TREC and from a widely-used Chinese search engine, SOGOU (www.sogou.com), we provide analysis on the differences between retrieval target pages and ordinary ones. We also propose a learning-based data cleansing algorithm for removing Web pages which are not likely to be useful for user requests. The results obtained show that retrieval target pages can be separated from low quality pages using query-independent features and cleansing algorithms. Our algorithm succeeds in removing 95% of web pages with less than 8% loss in retrieval target pages. This makes it possible for web IR tools to meet over 92% of users' needs with only 5% of the pages on the Web. |
|---|---|
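| Introduction: | This research was conducted jointly by the State Key Laboratory of Intelligent Technology and Systems at Tsinghua University and the Sohu R&D Center, using web corpus data provided by SogouLab. The full version was published in the Journal of the American Society for Information Science and Technology; the short version below was published at the first Tsinghua University-Kyoto University student forum. |
| Citation: | Yiqun Liu, Min Zhang, Rongwei Cen, Liyun Ru, Shaoping Ma, Data Cleansing for Web Information Retrieval using Query Independent Features. Journal of the American Society for Information Science and Technology. DOI: 10.1002/asi.20633. |
| Download link: | Download full version / Download short version |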
| Abstract: | With the growth of web data, how to estimate web page quality effectively and rapidly becomes more and more important for web information retrieval and knowledge discovery. This paper analyzes the differences between retrieval target pages and ordinary pages using query-independent features. Using these features, an algorithm called Linear Page Estimation (LPE) is proposed for web page quality estimation. Based on experiments on the .GOV corpus and the SOGOU corpus involving 26 million pages, about 95% of pages can be removed with more than 90% of retrieval target pages retained using our algorithm. Experimental results based on TREC datasets also show that retrieval performance on collections selected by our algorithm can be close to or even better than that on the whole collection. |
|---|---|
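| Introduction: | This research was conducted jointly by the State Key Laboratory of Intelligent Technology and Systems at Tsinghua University and the Sohu R&D Center, using web corpus data provided by SogouLab. It will be published at the 5th National Symposium on Search Engine and Web Mining in 2007. |
| Citation: | Rongwei Cen, Yiqun Liu, Min Zhang, Liyun Ru, Shaoping Ma, Web Page Quality Estimation Based on Linear Discriminant Function, to appear in the Journal of Computational Information Systems. |
| Download link: | Click here to download |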
| Abstract: | Spam web pages attempt to achieve higher-than-deserved rankings through various techniques. While human experts can easily identify spam web pages, manually evaluating a large number of pages is still time-consuming and costly. To assist manual evaluation, we propose an algorithm to assign spam values to web pages and semi-automatically select potential spam web pages. We first manually select a small set of spam pages as seeds. Then, based on the link structure of the web, the initial R-SpamRank values assigned to the seed pages propagate through links and are distributed among the whole web page set. After sorting the pages according to their R-SpamRank values, the pages with high values are selected. Our experiments and analyses show that the algorithm is highly successful in identifying spam pages, achieving a precision of 99.1% in the top 10,000 web pages with the highest R-SpamRank values. |
|---|---|
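| Introduction: | This research was conducted jointly by the State Key Laboratory of Intelligent Technology and Systems at Tsinghua University and the Sohu R&D Center, using the web corpus and the corresponding link relationship data provided by SogouLab. It will be published at the 5th National Symposium on Search Engine and Web Mining in 2007. |
| Citation: | Chenmin Liang, Liyun Ru, Xiaoyan Zhu, R-SpamRank: A Spam Detection Algorithm Based on Link Analysis, to appear in the Journal of Computational Information Systems. |
| Download link: | Click here to download |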
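
The seed-and-propagate procedure described in the R-SpamRank abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the update rule, so the propagation direction (a page inherits suspicion from the pages it links to), the `damping` factor, the iteration count, and the function names `propagate_spam_scores` and `top_suspects` are all assumptions made for illustration.

```python
from collections import defaultdict


def propagate_spam_scores(out_links, seeds, damping=0.85, iterations=30):
    """Spread spam scores from seed pages to the pages that link to them.

    out_links maps each page to the list of pages it links to.
    ASSUMPTION: the update rule below is a generic reverse-link
    propagation, not the formula from the R-SpamRank paper.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    # in_degree[t] = number of pages linking to t (used to split t's score).
    in_degree = defaultdict(int)
    for targets in out_links.values():
        for target in targets:
            in_degree[target] += 1

    # Seed pages start with score 1.0, all other pages with 0.0.
    base = {page: (1.0 if page in seeds else 0.0) for page in pages}
    score = dict(base)

    for _ in range(iterations):
        updated = {}
        for page in pages:
            # A page inherits a damped share of the score of every page
            # it links to; each target splits its score among its in-links.
            inherited = sum(
                score[target] / in_degree[target]
                for target in out_links.get(page, [])
            )
            updated[page] = base[page] + damping * inherited
        score = updated
    return score


def top_suspects(score, k):
    """Return the k pages with the highest scores, highest first."""
    return sorted(score, key=score.get, reverse=True)[:k]


if __name__ == "__main__":
    # Hypothetical toy graph: "a" links to a known spam page, "b" links to "a".
    graph = {"a": ["spam"], "b": ["a"], "c": ["good"]}
    scores = propagate_spam_scores(graph, seeds={"spam"})
    print(top_suspects(scores, 2))
```

In this sketch the highest-scoring pages would then be handed to human assessors, matching the semi-automatic selection step the abstract describes.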