Datasets


Wherever possible, we release data from our research to promote open access, support reproducibility, and enable extensions to our work. This page collects datasets that you may use freely for non-commercial, research or educational purposes. We ask only that you cite the paper associated with the dataset, and that you consider emailing us about your extension so that we may link back to your work. To link to this page, please use the permanent link: http://people.eng.unimelb.edu.au/brubinstein/data

Attribute-Value-Level Matching

From: Zhe Lim and Benjamin I. P. Rubinstein, Sub-Merge: Diving Down to the Attribute-Value Level in Statistical Schema Matching, in Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI'2015), January 2015, to appear

Description: We explore normalisation of attribute values across multiple data sources, where attributes could be categorical, numerical or otherwise, and could be multi-valued, for example the genre(s) of a movie or the cuisine(s) of a restaurant. To benchmark our statistical approach (based on Canonical Correlation Analysis) against baselines, we crawled and prepared two datasets of four sources each: movie genres (7852 records across IMDB, Rotten Tomatoes, The Movie DB, Yahoo! Movies) and restaurant cuisines (3120 records across Factual, Foursquare, Google Places, Yelp). After performing a simple entity resolution (record linkage) to align matched records across sources, we extracted the attribute to be matched (genres or cuisines respectively). The data here consists of keys to the original records, plus attribute values, all in entity-aligned order. Finally, we also used Amazon Mechanical Turk to produce human judgments to evaluate the attribute-value matchings. More details can be found in the download READMEs and the paper.

Use: You are free to use this data for non-commercial, research or educational purposes. We ask that you cite the following paper if you publish on the data. Please also consider emailing Ben with a link to your work, so that we can link back to papers published on the data.

@inproceedings{LimRubinstein2015,
author = {Zhe Lim and Benjamin I. P. Rubinstein},
year = {2015},
title = {Sub-Merge: Diving Down to the Attribute-Value Level in Statistical Schema Matching},
booktitle = {Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI'2015)}
}

Downloads:
movies zip (5 tab-separated files, one per source plus one MTurk; readme file) 377KB
restaurants zip (5 tab-separated files, one per source plus one MTurk; readme file) 306KB
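
Loading sketch: the per-source files are tab-separated and entity-aligned, so one way to work with them is to read each source into a list and pair records by position. The Python sketch below is illustrative only. It assumes each line holds a record key in the first field followed by one or more attribute values, which may not match the actual column layout (the READMEs in each download are authoritative). The function names read_source and align, and the sample keys and values, are hypothetical; the sample uses in-memory text rather than the real files.

```python
import csv
import io
from typing import Iterable, List, Tuple

# One record: (key to the original record, list of attribute values).
Record = Tuple[str, List[str]]


def read_source(handle: Iterable[str]) -> List[Record]:
    """Parse one tab-separated source.

    Assumed layout (check the README): first field is the record key,
    remaining non-empty fields are the attribute values.
    """
    records: List[Record] = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        key = row[0]
        values = [v for v in row[1:] if v]
        records.append((key, values))
    return records


def align(*sources: List[Record]) -> List[Tuple[Record, ...]]:
    """Pair up records by position, relying on entity-aligned order."""
    lengths = {len(s) for s in sources}
    if len(lengths) > 1:
        raise ValueError(f"sources differ in length: {sorted(lengths)}")
    return list(zip(*sources))


# Made-up sample data standing in for two of the source files.
source_a_text = "a001\tDrama\tRomance\na002\tComedy\n"
source_b_text = "b_x\tDrama\nb_y\tComedy\tKids & Family\n"

source_a = read_source(io.StringIO(source_a_text))
source_b = read_source(io.StringIO(source_b_text))
aligned = align(source_a, source_b)

for rec_a, rec_b in aligned:
    print(rec_a[0], rec_a[1], "<->", rec_b[0], rec_b[1])
```

To use this with the real downloads, replace the io.StringIO objects with files opened via open(path, newline="", encoding="utf-8"), after confirming the column layout and encoding against the README.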