Task Details

The task consists of identifying which instances, described by properties (i.e., attributes), represent the same real-world entity.

Participants are asked to solve the task on several datasets of different types (e.g., products, people) that will be released progressively. Each dataset consists of a list of instances (rows) and a list of properties describing them (columns); we will refer to each of these datasets as Di.

For each dataset Di, participants will be provided with the following resources:

  • Xi : a subset of the instances in Di
  • Yi : matching/non-matching labels for pairs in Xi x Xi
  • Di metadata (e.g., how many instances it contains and what its main characteristics are)

Note that the Y datasets are transitively closed (i.e., if A matches B and B matches C, then A also matches C); see the sketch after the Yi example below.

Solutions will be evaluated over Zi = Di \ Xi. Note that the instances in Zi will not be provided to participants. More details are available in the Evaluation Process section.

Both Xi and Yi are in CSV format.

Example of dataset Xi

instance_id  attr_name_1  attr_name_2  ...  attr_name_k
00001        value_1      null         ...  value_k
00002        null         value_2      ...  value_k
...          ...          ...          ...  ...

Example of dataset Yi

left_instance_id  right_instance_id  label
00001             00002              1
00001             00003              0
...               ...                ...
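Because the Y labels are transitively closed, the positive pairs in Yi implicitly partition the labelled instances into clusters, one per real-world entity. Below is a minimal sketch of how these clusters could be recovered with a union-find structure, assuming pandas is available and using the illustrative file name "Y2.csv":

    import pandas as pd
    from collections import defaultdict

    labels = pd.read_csv("Y2.csv")  # columns: left_instance_id, right_instance_id, label

    # Union-find over instance ids: matching instances end up in the same set.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for row in labels.itertuples(index=False):
        if row.label == 1:
            union(row.left_instance_id, row.right_instance_id)

    # Group the instance ids by their cluster representative.
    clusters = defaultdict(list)
    for x in parent:
        clusters[find(x)].append(x)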

More details about the datasets can be found in the dedicated Datasets section.

Your goal is to find, for each Xi dataset, all pairs of instances that match (i.e., refer to the same real-world entity). The output must be stored in a CSV file containing only the matching instance pairs found by your system. The file must be named "output.csv", must have the two columns "left_instance_id" and "right_instance_id", and must use the comma as separator.

Example of output.csv

left_instance_id  right_instance_id
00001             00002
00001             00004
...               ...
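For reference, here is a minimal sketch of writing the output in the required format with pandas; the "pairs" list here is illustrative and stands for the matching (left, right) id tuples produced by your system:

    import pandas as pd

    # pairs: list of (left_instance_id, right_instance_id) tuples found by your matcher.
    pairs = [("00001", "00002"), ("00001", "00004")]
    out = pd.DataFrame(pairs, columns=["left_instance_id", "right_instance_id"])
    out.to_csv("output.csv", index=False)  # comma-separated, with header row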

More details about the submission process can be found in the dedicated Submitting section.

#  Name           Description                                                               Metadata                                     Download
1  NotebookToy    Sample notebook specifications (will not be used for final leaderboard)   128 instances, 16 attributes, 40 entities    Dataset X1, Dataset Y1
2  Notebook       Notebook specifications                                                   538 instances, 14 attributes, 100 entities   Dataset X2, Dataset Y2
3  NotebookLarge  Notebook specifications                                                   605 instances, 14 attributes, 158 entities   Dataset X3, Dataset Y3
4  Altosight      Product specifications (kindly provided by Altosight)                     1356 instances, 5 attributes, 193 entities   Dataset X4, Dataset Y4

You can also download these datasets together with Snowman.

Snowman helps you compare and evaluate your data matching solutions. You can upload experiment results from your data matching solution and then easily compare them with a gold standard, compare two experiment runs with each other, or calculate binary metrics such as precision and recall. Snowman is developed as part of a bachelor's project at the Hasso Plattner Institute, Potsdam, in collaboration with SAP SE.

You can download the latest release, which already includes the datasets provided for the contest.

Participants are asked to use ReproZip to pack the solution they want to submit.

ReproZip is a tool for packing input files, libraries, and environment variables into a single bundle (in .rpz format) that can be reproduced on any machine.

A brief guide on how to use ReproZip to package your solution follows.

First of all, you have to install ReproZip on your machine. ReproZip can be installed via pip (pip install reprozip). More details about the installation can be found on the dedicated Documentation page.

Let’s suppose that your code is made up of a Python module called "greedy_matcher.py" and that you launch your program with the following command: python greedy_matcher.py.

First of all, ReproZip needs to track the code execution. For this to happen, it will be sufficient to run the following command: reprozip trace python greedy_matcher.py.

The code will be executed and a hidden folder (called ".reprozip-trace") will be created at the end of the process. This folder contains a "config.yml" file, a configuration file with information about the input/output files, libraries, environment variables, etc. traced during the execution of your code. If you want to omit something that you think is not useful to pack, you can edit this file manually. Please be sure not to remove any libraries or files needed to reproduce the code, otherwise your bundle may not be reproducible.

Finally, to create the bundle, you have to run the following command: reprozip pack submission.rpz.

Please note that, if your solution is launched with more than one command, even commands involving different programming languages, you can trace all of the executions into the same bundle by using the "--continue" option, as in the example below. You can find more details on the dedicated Documentation page.
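For example, a solution split into a preprocessing step and a matching step (the file names here are purely illustrative) could be traced and packed as follows:

    reprozip trace python preprocess.py
    reprozip trace --continue python greedy_matcher.py
    reprozip pack submission.rpz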

At this point, a file called "submission.rpz" will be created and you can submit it using our dashboard.

Inside your code, it is important to refer to each input dataset Xi using its original name "Xi.csv".
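Here is a minimal sketch of an entry point that follows this convention, reading each released dataset by its original name and producing one cumulative output; the matching function is a placeholder, and the file list follows the datasets table above:

    import pandas as pd

    def find_matches(df):
        # Placeholder for your actual matching logic.
        return []

    all_pairs = []
    for name in ["X2.csv", "X3.csv", "X4.csv"]:
        df = pd.read_csv(name)
        # At evaluation time each Xi.csv may hold a different hidden Zj,
        # so do not rely on dataset-specific assumptions tied to the name.
        all_pairs.extend(find_matches(df))

    pd.DataFrame(all_pairs, columns=["left_instance_id", "right_instance_id"]) \
        .to_csv("output.csv", index=False)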

Submitted solutions will be unpacked and reproduced using ReproUnzip on an evaluation server with the following characteristics:

Processor         32 CPU x 2.1 GHz
Main Memory       64 GB
Storage           2 TB
Operating System  Linux

In particular, before running ReproUnzip, the Xi dataset (i.e., the original input you worked on) will be replaced with the Zi dataset, which contains the hidden instances.

Here is the detailed sequence of operations used for the evaluation process:

  • reprounzip docker setup <bundle> <solution>, to unpack the uploaded bundle
  • reprounzip docker upload <solution> Zj.csv:Xi.csv to replace the input datasets (X2.csv, X3.csv, ...) with the hidden ones (Z2.csv, Z3.csv, ...), potentially in shuffled order (e.g., X2 can be replaced by Z3)
  • reprounzip docker run <solution>
  • reprounzip docker download <solution> output.csv (i.e., you must produce just one "output.csv" file, cumulative over all the datasets)
  • evaluation of "output.csv"

Note that, in order to be evaluated, your submission must reproduce correctly (i.e., the process must end with the creation of the "output.csv" file, without errors) and must run on all the datasets within a given timeout (the whole cycle above must complete within the defined timeout). The timeout value is shown below and will be updated every time a new dataset is released.

TIMEOUT: 25 min (last updated: 6 April 2021)

For each dataset Di we will compute the resulting F-measure with respect to Zi x Zi. Submitted solutions will be ranked by their average F-measure over all datasets.
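For clarity, the pairwise F-measure is the harmonic mean of precision and recall over matching pairs. A minimal sketch of the computation follows (the official evaluation code is not published, so treat this purely as an illustration):

    def f_measure(predicted_pairs, gold_pairs):
        # Normalize so that (a, b) and (b, a) count as the same pair.
        pred = {tuple(sorted(p)) for p in predicted_pairs}
        gold = {tuple(sorted(p)) for p in gold_pairs}
        true_positives = len(pred & gold)
        precision = true_positives / len(pred) if pred else 0.0
        recall = true_positives / len(gold) if gold else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)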

Unfortunately, ReproUnzip sometimes prints useful information about errors on stdout, which is excluded from the submission log. If you are stuck on a technical error that prevents the successful reproduction of your submission, no useful information appears in the log, and even the provided ReproUnzip commands cannot help you find the cause, you can send us an email and we will check the content of stdout for any useful information about the error.

Rules

  • The ACM SIGMOD 2021 Programming Contest is open to undergraduate and graduate students from degree-granting institutions all over the world. However, students associated with the organizers' institutions are not eligible to participate.
  • Teams must consist of individuals currently registered as graduate or undergraduate students at an accredited academic institution. A team may be formed by one or more students, who need not be enrolled at the same institution. Several teams from the same institution can compete independently, but one person can be a member of only one team. There is no limit on team size. Teams can register on the contest site after 25 February 2021.
  • All submissions must consist only of code written by the team or open source licensed software (i.e., using an OSI-approved license). For source code from books or public articles, clear reference and attribution must be made. Final submissions must be made by 30 April 2021 (anywhere on Earth).
  • All teams must agree to license their code under an OSI-approved open source license. By participating in this contest, each team agrees to publish its source code. The finalists' implementations will be made public on the contest website.