Extracting Informative Images from Web News Pages via Imbalanced Classification

© 2008-2009 Wei Gong, Hangzai Luo and Jianping Fan

1. Introduction

The extraction of informative multimedia elements is listed as a multimedia grand challenge for ACM Multimedia 2009. Before this grand challenge was posted, we had already started evaluating approaches for extracting informative images from news web pages for our news analysis projects. To train the classification model, we manually annotated 2000 web pages, which gave us an evaluation dataset. To help others interested in this task, we have put this dataset online. You can download the data as well as other related resources here.


All files are compressed with the open source archiver 7-zip. Recent versions of most archivers can open them. If your archiver cannot, you can download an installer from the 7-zip homepage. If you are using Linux or MacOS, you can download p7zip instead.

3. Data format

There are two datasets, one for Chinese web news pages and the other for English web news pages. Each dataset is split into training data and testing data.

1) Files with '_cn_' in their names contain Chinese web news page data; files with '_eg_' contain English web news page data.

2) Files with the suffix '_test' are test data; files with the suffix '_train' are training data.

3) Annotation data format


--> id_***.txt:

The file listing the URLs of the web news pages. Each line is a record for one HTML page. In each record, the first column is the ID number and the second column is the URL of the HTML page.
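A minimal sketch of reading the page URL file (id_***.txt). It assumes the two columns are separated by whitespace and that the ID is an integer; neither detail is stated in the description, and the function name `parse_id_lines` and the sample URLs are made up for illustration.

```python
def parse_id_lines(lines):
    """Map page ID -> URL from the lines of an id_***.txt file.

    Assumes whitespace-separated columns; adjust the split if the real
    files use a different delimiter.
    """
    pages = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first run of whitespace only, so the URL stays intact.
        page_id, url = line.split(None, 1)
        pages[int(page_id)] = url.strip()
    return pages


# Hypothetical sample records, not taken from the dataset.
sample = ["1 http://example.com/news/a.html", "2\thttp://example.com/news/b.html"]
print(parse_id_lines(sample))
```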

--> annotation_pos_***.txt and annotation_url_***.txt:

The files for informative image annotation. Each page in id_***.txt has one associated record in each of these files. Each record starts with a line formatted as 'pageID=X', where X corresponds to the ID in id_***.txt. For each informative image of page X, its HREF value is written in annotation_url_***.txt and the sequence ID of its IMG tag is written in annotation_pos_***.txt, following the 'pageID=X' line.
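The record layout above could be read with something like the following sketch. It assumes one entry per line after each 'pageID=X' line, which the description does not state explicitly; the function name `parse_annotation_lines` is hypothetical. Entries are kept as strings so the same function can serve both the URL file and the position file.

```python
def parse_annotation_lines(lines):
    """Group annotation entries by page ID.

    Assumes one entry per line after each 'pageID=X' line (an assumption;
    the exact layout of the entries is not specified).
    """
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("pageID="):
            current = int(line[len("pageID="):])
            records.setdefault(current, [])  # a page may have no informative image
        elif current is not None:
            records[current].append(line)
    return records


# Hypothetical content in the style of annotation_pos_***.txt.
sample = ["pageID=1", "3", "7", "pageID=2", "pageID=3", "12"]
print(parse_annotation_lines(sample))
# {1: ['3', '7'], 2: [], 3: ['12']}
```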

4) Feature data format

Extracted features are stored in feature_***.txt. The format is as follows.

The first line is the global information of this file. It is in the following format:
n=XXX dim=YYY class=2
where XXX is the number of samples and YYY is the dimension of the features.
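As an illustration, the header line can be split into key/value pairs as sketched below. The function name `parse_feature_header` and the sample numbers are invented for the example.

```python
def parse_feature_header(line):
    """Parse a header such as 'n=XXX dim=YYY class=2' into a dict of integers."""
    header = {}
    for token in line.split():
        key, _, value = token.partition("=")
        header[key] = int(value)
    return header


# Made-up values for illustration only.
print(parse_feature_header("n=1500 dim=83 class=2"))
# {'n': 1500, 'dim': 83, 'class': 2}
```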

Each line after the first is the record for one IMG tag. In each record, the last column is the classification class (0 means irrelevant image, 1 means informative image). The other columns are the features extracted from the HTML page:

dim 1~35 : if the image URL has special characters (+, -, etc.).
dim 36 : if the image file type is '.jpg'.
dim 37 : if the image file type is '.png'.
dim 38 : if the image file type is '.gif'.
dim 39 : if the image file type contains '.'.
dim 40 : if the image file type is empty.
dim 41 : the image URL length.
dim 42 : if the image URL is NULL.
dim 43~47: the type of the image URL: starts with '.', starts with '\\', starts with 'http', has no 'http', or is short.
dim 48 : if the image URL and the page URL have the same Internet domain.
dim 49 : if the image URL has a tail such as "size=s".
dim 50 : if the link URL of the image has the same Internet domain as the page URL.
dim 51 : if the image has a link.
dim 52 : the width of the image.
dim 53 : the height of the image.
dim 54 : width/height*100.
dim 55~59: if the alignment attribute of the <img> tag is NULL, 'center', 'right', 'left', or nothing.
dim 60~64: if the last tag of the <img> tag is NULL, 'center', 'right', 'left', or nothing.
dim 65 : the style attribute of the <img> tag.
dim 66 : if the image URL and the page URL share a word.
dim 67 : if the <img> tag is surrounded by an <LI> tag.
dim 68 : how many <img> tags of this type appear in the page.
dim 69 : if the image URL contains a date.
dim 70~74: if the image URL contains a positive word: 'photo', 'image', 'img', 'media', or nothing.
dim 75~82: if the image URL contains a negative word: 'button', 'logo', 'dot', 'menu', 'Noscript', 'icon', 'arrow', 'title'.
dim 83 : if the image URL contains '%20'.
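
A rough sketch of loading a feature file into a feature matrix and a label vector, then counting the two classes. It assumes the columns are whitespace-separated numbers, which is not stated in the description; the function name `parse_feature_lines` is hypothetical, and the sample uses dim=3 rather than the real 83 dimensions to stay short.

```python
def parse_feature_lines(lines):
    """Return (features, labels) from the lines of a feature_***.txt file.

    Assumes whitespace-separated numeric columns (the delimiter is an
    assumption) with the class label in the last column.
    """
    header = dict(tok.split("=", 1) for tok in lines[0].split())
    n, dim = int(header["n"]), int(header["dim"])
    features, labels = [], []
    for line in lines[1:]:
        cols = line.split()
        if not cols:
            continue  # skip blank lines
        if len(cols) != dim + 1:
            raise ValueError(f"expected {dim + 1} columns, got {len(cols)}")
        features.append([float(c) for c in cols[:-1]])
        labels.append(int(float(cols[-1])))  # 0 = irrelevant, 1 = informative
    if len(labels) != n:
        raise ValueError(f"header says n={n}, found {len(labels)} records")
    return features, labels


# Tiny made-up example, not real data.
sample = ["n=3 dim=3 class=2", "0 1 120 0", "1 0 45 1", "0 0 80 0"]
X, y = parse_feature_lines(sample)
informative = sum(y)
print(f"{informative} informative vs {len(y) - informative} irrelevant images")
```

Checking the class counts first is a reasonable habit here, since the task is framed as imbalanced classification.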
