| Corpus | Web Page Collection (SogouT) |
|---|---|
| Version | 2008 |
| Introduction | SogouT is a large-scale Web page collection gathered in mid-2008. It contains 135.4 million Web pages from 5.3 million Chinese Web sites, with a total uncompressed size of 5.0 TB. |
| Format | `<doc> <docno>Page ID</docno> <url>Page URL</url> The original page content </doc>` |
| Adopted By | NTCIR 2011 intent task; CLEF LogCLEF 2011 task |
| Related Resources | Search performance evaluation benchmark for SogouT; Hyperlink structure data for SogouT; PageRank scores for SogouT; SogouQ query log dataset |
| Related Publications | 1. Data Cleansing for Web Information Retrieval using Query Independent Features. Yiqun Liu, Min Zhang, Rongwei Cen, Liyun Ru, Shaoping Ma. Journal of the American Society for Information Science and Technology. DOI: 10.1002/asi.20633. 2. R-SpamRank: A Spam Detection Algorithm Based on Link Analysis. Chenmin Liang, Liyun Ru, Xiaoyan Zhu. To appear in the Journal of Computational Information Systems. 3. Incorporating Web Browsing Information into Anchor Texts for Web Search. Bo Zhou, Yiqun Liu, Min Zhang, Yijiang Jin, Shaoping Ma. Information Retrieval, Volume 14, Issue 3: 290-314, 2011. |
| Download | Please read the "License for Use of Sogou Lab Data" carefully before downloading. Mini sample (61 KB): tar.gz compressed, zip compressed. Full version (1 TB): request a hardware copy. |
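
The record layout in the Format row can be read with a small parser. The following is a minimal sketch, not an official tool: the names `Page` and `iter_pages` are hypothetical, the input is assumed to be already decoded into a Python string, and the sample records and URLs are made up for illustration. It also assumes that the page content itself never contains a literal `</doc>`, which may not hold for every page in the collection.

```python
import re
from typing import Iterator, NamedTuple


class Page(NamedTuple):
    """One record in the <doc> layout (hypothetical helper type)."""
    docno: str
    url: str
    content: str


# Matches <doc> <docno>...</docno> <url>...</url> ...content... </doc>.
# DOTALL lets the page content span multiple lines.
_DOC_RE = re.compile(
    r"<doc>\s*<docno>(.*?)</docno>\s*<url>(.*?)</url>(.*?)</doc>",
    re.DOTALL,
)


def iter_pages(text: str) -> Iterator[Page]:
    """Yield one Page per <doc>...</doc> record found in text."""
    for match in _DOC_RE.finditer(text):
        docno, url, content = match.groups()
        yield Page(docno.strip(), url.strip(), content.strip())


# Made-up sample data in the documented layout (not taken from SogouT).
sample = (
    "<doc> <docno>0001</docno> <url>http://example.com/a</url> "
    "<html><body>Hello</body></html> </doc>\n"
    "<doc> <docno>0002</docno> <url>http://example.com/b</url> "
    "<html><body>World</body></html> </doc>\n"
)

pages = list(iter_pages(sample))
for page in pages:
    print(page.docno, page.url, len(page.content))
```

For the full collection, reading a whole file into memory is unlikely to be practical, so a streaming variant that scans for the `<doc>` and `</doc>` markers chunk by chunk would be a more realistic starting point.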