Data Source
IMDB data was taken from the IMDB Plain Text Data Files, which can be found here.
The data contains a single plaintext file for each original IMDB table. In our implementation we imported the Actors, Actresses and Movies tables. A complete list of all available tables is found in this file.
Information courtesy of The Internet Movie Database (http://www.imdb.com). Used with permission.
Data Parsing and Processing
To parse IMDB's Plain Text Data Files into MySQL tables we used a Java-based application called JMDB, which has an import/convert feature for IMDB data called DataBase Convertor.
The JMDB DataBase Convertor created and populated the Actors, Movies and Movies2Actors DB tables.
Pre-Processing Rationale:
To create the relations (Actor2Actor) table we wrote a Perl script which finds, for each actor, the top K most related actors, where K was set to 15.
The script logic is as follows:
- For each actor A, create a list of the movies in which A participated (using the Movies2Actors table).
- For each movie Mov in that list, find all other actors B1, B2, …, Bn who participated in Mov.
- Hold a counter for each actor Bi and increase it by 1 for every movie Bi shares with A.
- Sort the Bi counters, take the top K actors and store them in the DB.
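The steps above can be sketched as follows. This is a minimal illustration in Python rather than the original Perl script, and it assumes the Movies2Actors rows are available in memory as (movie_id, actor_id) pairs; the function name top_related and the input shape are hypothetical, not taken from build_relations.pl.

```python
from collections import Counter, defaultdict

K = 15  # number of related actors kept per actor, as stated in the text


def top_related(movies2actors, k=K):
    """Return {actor: [(other_actor, shared_movie_count), ...]} with at most k entries.

    movies2actors is an iterable of (movie_id, actor_id) pairs
    (an assumed, simplified stand-in for the Movies2Actors table).
    """
    movies_of = defaultdict(set)  # actor -> movies the actor participated in
    cast_of = defaultdict(set)    # movie -> actors who participated in it
    for movie_id, actor_id in movies2actors:
        movies_of[actor_id].add(movie_id)
        cast_of[movie_id].add(actor_id)

    related = {}
    for actor, movies in movies_of.items():
        counts = Counter()
        for movie in movies:
            for other in cast_of[movie]:
                if other != actor:
                    counts[other] += 1  # one more movie shared with `actor`
        # Sort the counters in descending order and keep the top k
        related[actor] = counts.most_common(k)
    return related


# Tiny example: A and B share two movies, A and C share one
pairs = [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (2, "B"), (3, "C")]
result = top_related(pairs, k=2)
```

With this toy input, actor A's most related actor is B (two shared movies), followed by C (one shared movie). The order of actors with equal counts is not specified in the text, so the sketch leaves ties in whatever order the counter yields.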
Pre-Processing Implementation:
The Perl script uses Perl's MySQL driver, DBD::mysql, to access the DB.
To reduce running time, MySQL transactions were used, which allowed changes to be saved to the DB in bulk (using the "DB Commit" action) instead of one by one. The script was set to commit (= save) to the DB after handling 500 actors, giving a bulk size of approximately 6000 new rows, since each actor has up to 15 related actors (so at most 7500 rows per commit).
Since the script created over 21 million new rows in the DB, using transactions significantly improved running time.
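The batching idea can be illustrated with a short sketch. It is written in Python and uses an in-memory SQLite database as a stand-in for MySQL, so that it runs without a server; the table name actor2actor, its columns, and the function store_relations are assumptions made for the example and do not reflect the real schema or the Perl code.

```python
import sqlite3

BATCH_ACTORS = 500  # commit after this many actors, as stated in the text


def store_relations(conn, related, batch_actors=BATCH_ACTORS):
    """Insert relation rows and commit once per batch of actors.

    related maps actor_id -> list of (related_actor_id, shared_movie_count).
    Returns the number of commits issued.
    """
    cur = conn.cursor()
    commits = 0
    for i, (actor, top) in enumerate(related.items(), start=1):
        # Hypothetical table and column names, for illustration only
        cur.executemany(
            "INSERT INTO actor2actor (actor_id, related_id, shared) VALUES (?, ?, ?)",
            [(actor, other, n) for other, n in top],
        )
        if i % batch_actors == 0:
            conn.commit()  # save the whole batch in one transaction
            commits += 1
    conn.commit()  # save the final, possibly partial, batch
    return commits + 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actor2actor (actor_id INTEGER, related_id INTEGER, shared INTEGER)"
)
# 1200 demo actors with 2 related actors each
demo = {a: [(a + 1, 2), (a + 2, 1)] for a in range(1200)}
n_commits = store_relations(conn, demo)
row_count = conn.execute("SELECT COUNT(*) FROM actor2actor").fetchone()[0]
```

In this demo, 1200 actors are written with three commits (after actor 500, after actor 1000, and once at the end) instead of one commit per row. The same pattern applies with Perl's DBI by disabling AutoCommit and calling commit at the batch boundary.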
build_relations.pl - Script source code.