MyMediaLite: KDD Cup 2011

KDD Cup 2011

Since version 1.0, MyMediaLite has supported reading the KDD Cup 2011 data files, so you can run existing recommenders on that data set or implement new ones using the MyMediaLite infrastructure.

To find out more about the challenge, please go to the official KDD Cup 2011 website.

Lots of Data

The KDD Cup 2011 datasets are huge: the training set for Track 1 contains about 300 million ratings, and the training set for Track 2 contains about 60 million. We took care to implement the MyMediaLite algorithms and data structures so that they do not waste memory, but we still recommend running MyMediaLite for the KDD Cup 2011 on a computer with at least 8 GB of memory for Track 1 or 4 GB for Track 2.
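
To get a feeling for these numbers, here is a rough back-of-the-envelope estimate of the space needed for the raw ratings alone. The per-rating layout (a 4-byte user ID, a 4-byte item ID, and a 1-byte rating) and the function name are assumptions made for illustration; they are not taken from the MyMediaLite source code.

```python
# Back-of-the-envelope memory estimate for the raw rating triples.
# ASSUMPTION: 4-byte user ID + 4-byte item ID + 1-byte rating per entry.
# This layout is hypothetical, not MyMediaLite's documented data structure.
BYTES_PER_RATING = 4 + 4 + 1


def ratings_memory_gib(num_ratings, bytes_per_rating=BYTES_PER_RATING):
    """Return the approximate size of num_ratings ratings in GiB."""
    return num_ratings * bytes_per_rating / 2**30


# Approximate training set sizes stated in the text above.
track1_gib = ratings_memory_gib(300_000_000)
track2_gib = ratings_memory_gib(60_000_000)

print(f"Track 1 ratings: about {track1_gib:.1f} GiB")
print(f"Track 2 ratings: about {track2_gib:.1f} GiB")
```

Under these assumptions the raw ratings take up only part of the recommended memory. The rest is needed for model parameters, indices, and runtime overhead, all of which depend on the chosen method.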

Patches that improve memory use or runtime, as well as other improvements, are always welcome.

If you run into trouble with Mono, please upgrade to Mono 2.10.2 or later and make sure it is compiled with the option --with-large-heap=yes.

The Command-Line Tools

We provide one tool for each track; both are described below.

Track 1

Usage:

MyMediaLite KDD Cup 2011 Track 1 tool

 usage:  KDDCup.exe METHOD [ARGUMENTS] [OPTIONS]

  use '-' for either TRAINING_FILE or TEST_FILE to read the data from STDIN

  methods (plus arguments and their defaults):
   - BiasedMatrixFactorization num_factors=10 bias_reg=0.0001 reg_u=0.015 reg_i=0.015 learn_rate=0.01 num_iter=30 bold_driver=False init_mean=0 init_stdev=0.1
   - MatrixFactorization num_factors=10 regularization=0.015 learn_rate=0.01 num_iter=30 init_mean=0 init_stdev=0.1
   - GlobalAverage
   - SlopeOne
   - BipolarSlopeOne
   - UserAverage
   - ItemAverage
   - UserItemBaseline reg_u=0 reg_i=0
   - ItemKNNCosine k=inf reg_u=10 reg_i=5
   - UserKNNCosine k=inf reg_u=10 reg_i=5
   - ItemKNNPearson k=inf shrinkage=10 reg_u=10 reg_i=5
   - UserKNNPearson k=inf shrinkage=10 reg_u=10 reg_i=5
   - ItemAttributeKNN k=inf reg_u=10 reg_i=5 (needs --item-attributes=FILE)
   - UserAttributeKNN k=inf reg_u=10 reg_i=5 (needs --user-attributes=FILE)
method ARGUMENTS have the form name=value

  general OPTIONS have the form name=value
   - option_file=FILE           read options from FILE (line format KEY: VALUE)
   - random_seed=N              set random seed to N
   - data_dir=DIR               load all files from DIR
   - save_model=FILE            save computed model to FILE
   - load_model=FILE            load model from FILE
   - no_eval=BOOL               do not evaluate
   - prediction_file=FILE       write the predictions to  FILE ('-' for STDOUT)
   - cross_validation=K         perform k-fold crossvalidation on the training data
                                 (ignores the test data)
   - sample_data=BOOL           assume the sample data set instead of the real one
   - track2=BOOL                perform rating prediction on track 2 data
   - good_rating_prob=BOOL      try to predict the probability of a good rating (>= 80)

  options for finding the right number of iterations (MF methods)
   - find_iter=N                give out statistics every N iterations
   - max_iter=N                 perform at most N iterations
   - epsilon=NUM                abort iterations if RMSE is more than best result plus NUM
   - rmse_cutoff=NUM            abort if RMSE is above NUM
   - mae_cutoff=NUM             abort if MAE is above NUM
   - compute_fit=BOOL           display fit on training data every find_iter iterations
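
The stopping options listed above (epsilon, rmse_cutoff, mae_cutoff) can be read as a simple rule that is checked each time statistics are reported. The following is a minimal sketch of that reading, not the tool's actual implementation; the function name and the error values are made up for illustration.

```python
from typing import Optional, Sequence


def should_abort(rmse_history: Sequence[float],
                 mae_history: Sequence[float],
                 epsilon: Optional[float] = None,
                 rmse_cutoff: Optional[float] = None,
                 mae_cutoff: Optional[float] = None) -> bool:
    """Decide whether to stop, given all error values measured so far.

    ASSUMPTION: this mirrors the wording of the usage text only.
    """
    rmse, mae = rmse_history[-1], mae_history[-1]
    # epsilon: current RMSE is worse than the best RMSE seen plus epsilon
    if epsilon is not None and rmse > min(rmse_history) + epsilon:
        return True
    # rmse_cutoff / mae_cutoff: the current error is above a fixed bound
    if rmse_cutoff is not None and rmse > rmse_cutoff:
        return True
    if mae_cutoff is not None and mae > mae_cutoff:
        return True
    return False


# Hypothetical error values, one pair per evaluation (every find_iter iterations).
rmse_values = [25.0, 23.5, 23.1, 23.4]
mae_values = [18.0, 17.2, 16.9, 17.0]

stop_tight = should_abort(rmse_values, mae_values, epsilon=0.2)
stop_loose = should_abort(rmse_values, mae_values, epsilon=0.5)
print(stop_tight, stop_loose)
```

With the tighter epsilon the last RMSE value is too far above the best one, so the run would stop. With the looser epsilon it would continue until max_iter is reached.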

Track 2

Usage:

MyMediaLite KDD Cup 2011 Track 2 tool

 usage:  KDDTrack2.exe METHOD [ARGUMENTS] [OPTIONS]

  use '-' for either TRAINING_FILE or TEST_FILE to read the data from STDIN

  methods (plus arguments and their defaults):
   - ItemAttributeSVM C=1 Gamma=0.002 (needs --item-attributes=FILE)
   - BPRLinear reg=0.015 num_iter=10 learn_rate=0.05 fast_sampling_memory_limit=1024 init_mean=0 init_stdev=0.1 (needs --item-attributes=FILE)
   - BPRMF num_factors=10 bias_reg=0 reg_u=0.0025 reg_i=0.0025 reg_j=0.00025 num_iter=30 learn_rate=0.05 fast_sampling_memory_limit=1024 init_mean=0 init_stdev=0.1
   - ItemAttributeKNN k=80 (needs --item-attributes=FILE)
   - ItemKNN k=80
   - MostPopular
   - Random
   - UserAttributeKNN k=80 (needs --user-attributes=FILE)
   - UserKNN k=80
   - WeightedItemKNN k=80
   - WeightedUserKNN k=80
   - WRMF num_factors=10 regularization=0.015 c_pos=1 num_iter=30 init_mean=0 init_stdev=0.1
   - Zero
method ARGUMENTS have the form name=value

  general OPTIONS have the form name=value
   - option_file=FILE           read options from FILE (line format KEY: VALUE)
   - random_seed=N              set random seed to N
   - data_dir=DIR               load all files from DIR
   - save_model=FILE            save computed model to FILE
   - load_model=FILE            load model from FILE
   - prediction_file=FILE       write the predictions to  FILE ('-' for STDOUT)
   - sample_data=BOOL           assume the sample data set instead of the real one
   - predict_score=BOOL         predict scores (double precision) instead of 0/1 decisions
   - predict_rated=BOOL         instead of predicting what received a good rating, try to predict what received a rating at all
                                (implies predict_score)

  options for finding the right number of iterations (MF methods)
   - find_iter=N                give out statistics every N iterations
   - max_iter=N                 perform at most N iterations
   - epsilon=NUM                abort iterations if error is more than best result plus NUM
   - err_cutoff=NUM             abort if error is above NUM
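
As an illustration of the option_file line format (KEY: VALUE) and of the overall call pattern METHOD [ARGUMENTS] [OPTIONS], the following sketch converts option file lines into name=value options and assembles a command line. The helper names, the file contents, and the data directory are hypothetical, and running the tool under Mono is an assumption. The code only builds the argument list; it does not start the tool.

```python
def parse_option_file(text):
    """Turn 'KEY: VALUE' lines into 'KEY=VALUE' command-line options."""
    options = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ASSUMPTION: blank lines are ignored
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'KEY: VALUE', got {line!r}")
        options.append(f"{key.strip()}={value.strip()}")
    return options


def build_command(method, arguments, options, exe="KDDTrack2.exe"):
    """Assemble: mono EXE METHOD [ARGUMENTS] [OPTIONS]."""
    return ["mono", exe, method, *arguments, *options]


# Hypothetical option file contents; the path is a placeholder.
option_text = """\
data_dir: /data/kddcup2011/track2
prediction_file: predictions.txt
random_seed: 1
"""

command = build_command(
    "BPRMF",
    ["num_factors=20", "num_iter=30"],
    parse_option_file(option_text),
)
print(" ".join(command))
```

The resulting list could be passed to a process launcher such as subprocess.run. Keeping the options in a file makes it easier to repeat a run with the same settings.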

Item Data

If you want to use the item data (relations between tracks, artists, albums, and genres), you just need to implement the interface IKDDCupRecommender. Its property ItemInfo gives you access to a KDDCupItems object, which contains all item data.

Dates and Times

There is currently no support for dates/times in Track 1, though it should not be difficult to add.