I wrote an application for the Netflix challenge. Here’s the report (sort of); it’s split into 3 parts.

This is my first blog entry, so sorry for the style.

0. The problem.
The problem was announced by Netflix back in 2006. It is a challenge to build an accurate prediction system for Netflix users’ ratings. In a few words: given a user’s previous ratings, the new system should predict the rating for a movie this user hasn’t rated yet. The winner will be the one with the most precise system, and it has to beat Netflix’s own system by at least 10%, i.e. have an RMSE (the square root of the mean squared difference between predicted and actual ratings) of 0.85-something at most. The full reference can be found here.
All the participants are supplied with a dataset consisting of:
* movie database of 17763 entries
* ratings database of 100471292 entries
all in .txt files.
In outline, the milestones are:
1. Parsing data into the DB.
2. Classification of the existing data.
3. Developing a system that uses the classification created in the previous milestone.
Each of these parts has its own sub-tasks.

1. Parsing data into the DB.
But before heading to the main development part, some extra preparation of the data has to be done. The system, however it is designed, should use a database for quick data lookups; using flat .txt files would be just insane.
The database I used for my experiment was MySQL (for its low RAM footprint and ease of setup, but mainly because I hadn’t used any other DBMS by then). The provided files are named and stored in a consistent structure, so parsing them isn’t a big deal. The most important part of this step is the right index setup for fast search, and this is the first somewhat interesting issue of this challenge.

1.1 Creating optimized indices for the DB.
A quick look at the data shows that it can be sorted in two dimensions:
* by user
* by movie
whilst the rest of the data (date and rating) are attributes of the relationship between the entities [user] and [movie]. That’s what I assumed back then, hence the two indices in my model: one on the user and another one on the movie. It’s not enough to just create both indices in whatever way is available (b-tree or hash table) on one table. Because of the way the database stores rows on disk, a table cannot have two clustered indices, therefore the search would be fast for only one of them (for the other, the DBMS would have to scan the index, which is also huge in our case). Not to mention the joins, widely used later on. The solution is not elegant at all, pretty ugly actually, but at that point I didn’t consider it an issue (neither do I now): copy the table with over 100 million entries and create another index on it using a different criterion (see the sketch after the table list below). The final database consisted of the following tables:

+----------------------+
|   Tables in netflix  |
+----------------------+
| 1. movies            |
| 2. ratings           |
| 3. ratings2          |
+----------------------+

where [movies] holds all the data about the movies provided by Netflix, and [ratings] and [ratings2] (the latter being a copy of [ratings] with a different index) hold the given ratings for each movie per user.
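Here is a minimal MySQL sketch of that duplication trick, with the column types taken from the table descriptions below; the engine choice is my assumption, and the indices themselves come later (see 1.1.2):

-- Same rating data twice, so that each copy can later carry
-- its own lookup index. ENGINE=MyISAM is an assumption.
CREATE TABLE ratings (
  movie_id    SMALLINT(5)  NOT NULL,
  customer_id MEDIUMINT(7) NOT NULL,
  rating_date DATE,
  rating      TINYINT(1)
) ENGINE=MyISAM;

-- Identical structure, then a full copy of the ~100M rows.
CREATE TABLE ratings2 LIKE ratings;
INSERT INTO ratings2 SELECT * FROM ratings;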
1.1.1 [Movies] description.

+-------+--------------+------+-----+---------+
| Field | Type         | Null | Key | Default |
+-------+--------------+------+-----+---------+
| id    | smallint(5)  | NO   | PRI |         |
| year  | smallint(4)  | YES  |     | NULL    |
| title | varchar(255) | YES  |     | NULL    |
+-------+--------------+------+-----+---------+

The structure of this table is pretty straightforward; I just parsed movies.txt in without any changes. The total number of rows in this table is 17763.
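For reference, the DDL behind that description, plus a one-liner to pull the file in. This is a sketch: the path is made up, and it assumes the comma-separated id,year,title layout of movies.txt; titles containing commas would need extra handling.

CREATE TABLE movies (
  id    SMALLINT(5) NOT NULL PRIMARY KEY,
  year  SMALLINT(4),
  title VARCHAR(255)
) ENGINE=MyISAM;

-- assuming lines like: 1,2003,Dinosaur Planet
LOAD DATA INFILE '/data/netflix/movies.txt'
  INTO TABLE movies
  FIELDS TERMINATED BY ','
  (id, year, title);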
1.1.2 [Ratings] + [Ratings2] description.

+-------------+--------------+------+-----+---------+
| Field       | Type         | Null | Key | Default |
+-------------+--------------+------+-----+---------+
| movie_id    | smallint(5)  | NO   | MUL |         |
| customer_id | mediumint(7) | NO   |     |         |
| rating_date | date         | YES  |     | NULL    |
| rating      | tinyint(1)   | YES  | MUL | NULL    |
+-------------+--------------+------+-----+---------+

To build up this table I parsed the .txt files (one per movie) containing customer IDs, ratings and the date each rating was given. The data comes completely unorganized, and it stayed that way in my application. Neither [movie_id] nor [customer_id] can serve as a primary key on its own, but their combination can. Still, I did not entrust the DBMS with creating the primary key and, consequently, the index; during the file parsing the data went in in arbitrary order, and I created an index on each table (after copying [ratings] to [ratings2]) manually. Both indices used by this system should be able to answer equality queries quickly, like “get me the ratings of the user with ID 1000”, so a hash index was my choice in both cases.
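A sketch of the loading and the manual index setup. The paths are made up, the file layout is an assumption (as far as I remember, each per-movie file starts with a header line like “1:” followed by customer_id,rating,date rows), and only the two main lookup indices are shown. One caveat: MySQL only really builds hash indices on the MEMORY and NDB engines and quietly falls back to b-tree elsewhere.

-- One LOAD per file; the header line is skipped and the movie id
-- is set explicitly (here: the file for movie 1).
LOAD DATA INFILE '/data/netflix/training_set/mv_0000001.txt'
  INTO TABLE ratings
  FIELDS TERMINATED BY ','
  IGNORE 1 LINES
  (customer_id, rating, rating_date)
  SET movie_id = 1;

-- After all files are in and [ratings2] is copied,
-- build one lookup index per table:
ALTER TABLE ratings  ADD INDEX idx_movie    (movie_id)    USING HASH;
ALTER TABLE ratings2 ADD INDEX idx_customer (customer_id) USING HASH;

-- The equality queries both indices have to answer quickly:
SELECT customer_id, rating FROM ratings  WHERE movie_id    = 123;
SELECT movie_id,    rating FROM ratings2 WHERE customer_id = 1000;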
1.1.3 [Cliques] structure.
The last thing to mention is that the output of my application would be another table containing so-called cliques, hence the name of the table: [cliques].

+------------+--------+------+-----+---------+
| Field      | Type   | Null | Key | Default |
+------------+--------+------+-----+---------+
| user_id    | int(7) | NO   | PRI |         |
| clique_id  | int(6) | YES  |     | NULL    |
| clique2_id | int(6) | YES  |     | NULL    |
+------------+--------+------+-----+---------+

The purpose of this table’s fields follows from the application logic. [User_id] refers to a user, whilst [clique_id] and [clique2_id] classify this user into two cliques (two of them, to allow a user to belong to more than one clique). Each user appears exactly once in this table, so its length equals the total number of users.
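The DDL for it is tiny; again a sketch matching the description above:

CREATE TABLE cliques (
  user_id    INT(7) NOT NULL PRIMARY KEY,
  clique_id  INT(6),
  clique2_id INT(6)
) ENGINE=MyISAM;

-- e.g. clique sizes, once the classifier has filled the table:
SELECT clique_id, COUNT(*) AS members
FROM cliques
GROUP BY clique_id;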
That’s pretty much it for the database structure. Now I can move on to the business logic.

I’ll add a few screenshots with the response times. It’s fun.

In the next parts I’ll describe the other two components the system consists of.