Event and Movie Datasets Released

We are happy to provide two labeled datasets which were obtained from Web markup.

The datasets were extracted from the Web Data Commons October 2016 (WDC 2016) N-Quads dataset. The first dataset contains events with labeled subtypes. The second dataset contains movies with labeled genres. Each dataset covers the seven classes with the greatest number of instances.

In the event dataset, the labels are event subtypes, which are defined by schema.org and annotated via the rdf:type property.
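As a rough illustration of how such a label can be read from markup, the sketch below pulls the schema.org type out of a single N-Quads line. It is not the extraction pipeline used for these datasets: the function name `schema_org_type`, the regular expression (which only handles quads whose object is an IRI), the example subtype set and the sample quad are all assumptions made for this example.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Simplified pattern for a quad whose object is an IRI:
#   subject <predicate> <object> <graph> .
# This is an assumption for illustration, not a full N-Quads parser.
QUAD_IRI_OBJECT = re.compile(
    r'^(\S+)\s+<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$'
)

# Illustrative subset of schema.org Event subtypes (hypothetical choice).
EXAMPLE_SUBTYPES = {"MusicEvent", "SportsEvent", "TheaterEvent"}


def schema_org_type(line, allowed=None):
    """Return (node_id, type_name, page_url) for an rdf:type quad that
    names a schema.org type, or None if the line does not qualify."""
    match = QUAD_IRI_OBJECT.match(line)
    if match is None:
        return None
    subject, predicate, obj, graph = match.groups()
    if predicate != RDF_TYPE:
        return None
    # Accept both the http and https forms of the schema.org namespace.
    type_match = re.match(r'^https?://schema\.org/(\w+)$', obj)
    if type_match is None:
        return None
    type_name = type_match.group(1)
    if allowed is not None and type_name not in allowed:
        return None
    return subject, type_name, graph


# Hypothetical sample quad; the URL is a placeholder.
sample = (
    '_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/MusicEvent> <http://example.org/concert> .'
)
print(schema_org_type(sample, allowed=EXAMPLE_SUBTYPES))
```

The graph component of the quad is returned as the page URL, since in N-Quads the fourth element names the graph the statement belongs to.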

In the movie dataset, the labels are movie genres, which are the objects of the schema.org/genre property. The labels were unified by string matching the genre literals against the movie genres defined by imdb.com. Since movie genre classification is a multi-label classification problem, we provide one dataset per genre for training binary classifiers.
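The unification and the per-genre binary labels can be sketched as follows. The exact matching rules used for the datasets are not described here, so the case-insensitive whole-word match is an assumption, as are the function names `match_genres` and `binary_label`. The genre list is limited to the seven genres released in these datasets.

```python
import re

# The seven genres covered by the released movie datasets.
GENRES = ["Drama", "Comedy", "Action", "Thriller",
          "Romance", "Documentary", "Adventure"]


def match_genres(literal, genres=GENRES):
    """Return the set of known genres that occur as whole words in a
    free-text genre literal (case-insensitive; an assumed matching rule)."""
    text = literal.lower()
    found = set()
    for genre in genres:
        if re.search(r'\b' + re.escape(genre.lower()) + r'\b', text):
            found.add(genre)
    return found


def binary_label(literal, target, genres=GENRES):
    """Return 1 if the literal mentions the target genre, else 0.
    One such labeling per genre yields one binary dataset per genre."""
    return 1 if target in match_genres(literal, genres) else 0


print(match_genres("Action, Adventure & comedy"))
print(binary_label("Action, Adventure & comedy", "Drama"))
```

A single literal can match several genres, which is why each genre gets its own binary labeling rather than one multi-class label.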

The following datasets contain the sets of instances that were sampled from the WDC 2016 dataset. Each instance is represented by its node ID, the URL of its provenance, and its label.

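| Dataset | Size | Distinct plds | Avg. Instances/pld | Link |
|---|---|---|---|---|
| Events (stratified) | 67444 | 1482 | 45.71 | Link |
| Events (pld-aware) | 67444 | 2064 | 32.82 | Link |
| Drama (stratified) | 239030 | 360 | 663.97 | Link |
| Drama (pld-aware) | 239030 | 476 | 502.16 | Link |
| Comedy (stratified) | 239030 | 342 | 698.92 | Link |
| Comedy (pld-aware) | 239030 | 476 | 502.16 | Link |
| Action (stratified) | 239030 | 361 | 662.13 | Link |
| Action (pld-aware) | 239030 | 476 | 502.16 | Link |
| Thriller (stratified) | 239030 | 342 | 698.92 | Link |
| Thriller (pld-aware) | 239030 | 476 | 502.16 | Link |
| Romance (stratified) | 239030 | 347 | 688.85 | Link |
| Romance (pld-aware) | 239030 | 476 | 502.16 | Link |
| Documentary (stratified) | 239030 | 337 | 709.29 | Link |
| Documentary (pld-aware) | 239030 | 476 | 502.16 | Link |
| Adventure (stratified) | 239030 | 340 | 703.03 | Link |
| Adventure (pld-aware) | 239030 | 476 | 502.16 | Link |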
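The average in the table is the number of instances divided by the number of distinct domains they come from. For the stratified Drama sample, for example, 239030 instances over 360 plds gives about 663.97. The sketch below computes the same ratio from a list of provenance URLs. It uses the URL hostname as a crude stand-in for the pay-level domain, which is an assumption: a faithful pld computation would need the Public Suffix List. The function name and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def avg_instances_per_domain(urls):
    """Average number of instances per distinct hostname.
    Hostname is used as an approximation of the pay-level domain."""
    hosts = Counter(urlparse(url).hostname for url in urls)
    return len(urls) / len(hosts)


# Hypothetical provenance URLs: three instances from two hosts.
sample_urls = [
    "http://a.example/movie/1",
    "http://a.example/movie/2",
    "http://b.example/film/7",
]
print(avg_instances_per_domain(sample_urls))

# Arithmetic check against the stratified Drama figures in the table.
print(round(239030 / 360, 2))
```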

Cite as:

Tempelmeier, N., Demidova, E., Dietze, S., Inferring Missing Categorical Information in Noisy and Sparse Web Markup, The Web Conference 2018 (WWW2018), 27th edition of the former WWW conference, research track, ACM, Lyon, France, 23-27 April 2018.