Web Spam Challenge 2007 Track II - Secure Computing Corporation Research

author: Sven Krasser, Secure Computing Corporation
published: Jan. 28, 2008,   recorded: September 2007,   views: 5654

Related content

Report a problem or upload files

If you have found a problem with this lecture or would like to send us extra material, articles, exercises, etc., please use our ticket system to describe your request and upload the data.
Enter your e-mail into the 'Cc' field, and we will keep you updated with your request's status.
Lecture popularity: You need to login to cast your vote.
  Delicious Bibliography

Description

To discriminate spam Web hosts/pages from normal ones, text-based and link-based data are provided forWeb Spam Challenge Track II. Given a small part of labeled nodes (about 10%) in aWeb linkage graph, the challenge is to predict other nodes’ class to be spam or normal.We extract features from link-based data, and then combine them with text-based features. After feature scaling, Support Vector Machines (SVM) and Random Forests (RF) are modeled in the extremely high dimensional space with about 5 million features. Stratified 3-fold cross validation for SVM and out-of-bag estimation for RF are used to tune the modeling parameters and estimate the generalization capability. On the small corpus for Web host classification, the best F-Measure value is 75.46% and the best AUC value is 95.11%. On the large corpus for Web page classification, the best F-Measure value is 90.20% and the best AUC value is 98.92%.

Link this page

Would you like to put a link to this lecture on your homepage?
Go ahead! Copy the HTML snippet !

Write your own review or comment:

make sure you have javascript enabled or clear this field: