Netflix API Forums

I Need Help!

Has anyone been able to process the entire catalog

    • Steve
    • Topic created 3 years ago

    I'm trying to access the Netflix catalog to do some advanced searches (genre, watch instantly, etc). I'm able to save the catalog to a file but the file size is so large (over 220MB IIRC) that it cripples my computer.

    I've tried loading the file into a strongly typed DataSet, an XPathNavigator document, and an XmlDocument, and the result is always the same: my computer grinds away forever and eventually pops up a "virtual memory low" message.

    Are there any plans for the API to allow paged results? In the meantime, has anyone had success handling the entire catalog?

    Message edited by Steve 2 years ago

  1. irons, 3 years ago

    I'm not sure what environment and platform you're using, but you should probably look into streaming rather than tree-based parsers for this task. There's nothing inherently un-parseable about a well-formed XML file in the low hundreds of megabytes.
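
A minimal sketch of the streaming idea from this reply, written in Python purely for illustration. It uses the standard library's `xml.etree.ElementTree.iterparse`, which hands back elements as they are completed instead of loading the whole document. The element names `catalog`, `item`, and `title` are made up for the demo and are not taken from the real catalog schema.

```python
import io
import xml.etree.ElementTree as ET


def iter_elements(source, tag):
    """Yield each <tag> element from source without keeping the whole tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Drop children that were already handled so memory use stays flat.
            root.clear()


# Tiny stand-in for the catalog file; the element names are hypothetical.
sample = io.BytesIO(
    b"<catalog>"
    b"<item><title>First</title></item>"
    b"<item><title>Second</title></item>"
    b"</catalog>"
)
titles = [item.findtext("title") for item in iter_elements(sample, "item")]
print(titles)
```

For a real run, `source` would be the path of the saved catalog file rather than an in-memory buffer. Clearing the root after each element is what keeps a file of a few hundred megabytes from piling up in memory.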

  2. Anurag, 3 years ago

    Steve, I wrote a custom script that reads the huge XML file line by line, figures out where each XML tag ends, and adds the movie object to the DB. It takes several hours to parse the XML completely, but it does run on my 256MB machine without excessive OS swapping. I'm on Linux, btw.

  3. John Haren, 3 years ago

    If you have access to a Java environment, you could take a look at the open-source XML content repository eXist (http://exist.sourceforge.net/). It can handle arbitrarily large documents and provides a nice XQuery interface. It's overkill for one-use-only tasks, but if you're handling lots of XML (in a mashup, say) you could do a lot worse.

  4. royce, 3 years ago

    It took about 9 hours on my 3GB Vista box to parse the 285MB file. I used simple PHP statements to read it and parse it into SQL INSERT statements:

    $doc = new DOMDocument();
    $doc->load( 'netflix_catalog.xml' );

  5. JR Conlin, 3 years ago

    Hi Royce,

    I see you've discovered a couple of important facts. The first being "The DOM is bulky and incredibly slow".

    There's a great article about why that is: http://ejohn.org/blog/the-dom-is-a-mess/ but basically, it has to do with the fact that the DOM was never really meant to be a database. It's nice that you can do a lot of processing with effectively two lines of code, but much like heading to a 7-11 when you're low on groceries, you pay a price for convenience.

    In order to process anything of real scale (285MB is tiny compared to a lot of other data out on the web), you need to look at a few tools that are built for XML. The best is expat using a SAX calling routine (see http://www.php.net/manual/en/ref.xml.php). Yes, this means doing more work and writing more code, but I can process the same 285MB file, doing MySQL inserts, in about 20 seconds. Obviously, not something you'd want to do on every request, but far faster than 9 hours.

    If you don't want to roll a solution from scratch, I'd recommend taking a look at the PEAR libraries at http://pear.php.net or doing a general search for "php xml code".

    By the way, PHP is a fine HTML templating language and is easy to learn, but I'd note that there are other languages that are far better suited for doing what you want, particularly if you are hosting them on your own machine. (This, I'll note, is from a die-hard fan of PHP.) You may want to spend the time to learn Python and take advantage of the rich library of packages and functions that other folks have built, all of which run just peachy under Vista. Plus, the language syntax is not that different from PHP's. There's no real need to have this sort of thing run in PHP, and it may be that you are using the wrong tool for the job.
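
As a rough illustration of the expat/SAX style this reply describes, here is a sketch in Python, whose standard `xml.parsers.expat` module wraps the same expat library. It is not the poster's code. SQLite stands in for MySQL so the example is self-contained, and the `titles` table and the `<title>` element are assumptions, not the real catalog schema.

```python
import io
import sqlite3
import xml.parsers.expat


class TitleLoader:
    """Collect the text of each <title> element and insert rows in batches."""

    def __init__(self, conn, batch_size=1000):
        self.conn = conn
        self.batch_size = batch_size
        self.batch = []
        self.in_title = False
        self.text = []

    def start(self, name, attrs):
        if name == "title":  # hypothetical element name
            self.in_title = True
            self.text = []

    def chars(self, data):
        if self.in_title:
            self.text.append(data)  # expat may deliver text in several pieces

    def end(self, name):
        if name == "title":
            self.in_title = False
            self.batch.append(("".join(self.text),))
            if len(self.batch) >= self.batch_size:
                self.flush()

    def flush(self):
        # One executemany per batch is far cheaper than one INSERT per row.
        self.conn.executemany("INSERT INTO titles (name) VALUES (?)", self.batch)
        self.conn.commit()
        self.batch = []


def load(xml_file, conn):
    """Stream a binary file object through expat and store what it finds."""
    loader = TitleLoader(conn)
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = loader.start
    parser.EndElementHandler = loader.end
    parser.CharacterDataHandler = loader.chars
    parser.ParseFile(xml_file)  # reads in chunks, never the whole file at once
    loader.flush()  # write whatever is left in the final partial batch


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (name TEXT)")
demo = io.BytesIO(
    b"<catalog><item><title>First</title></item>"
    b"<item><title>Second</title></item></catalog>"
)
load(demo, conn)
rows = conn.execute("SELECT name FROM titles ORDER BY rowid").fetchall()
print(rows)
```

The extra work compared with a two-line DOM load is the bookkeeping in the three handlers. In exchange, only one small batch of rows is ever held in memory.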

  6. Steve, 3 years ago

    Thanks for the replies. I'm doing something similar to what Anurag suggested. It was pretty trivial to put together using a .NET StreamReader and looking for <catalog_title_index> opening and closing tags. Performance is acceptable and it doesn't bring my system to its knees.

    My end goal is to get movie information on all watch instantly titles and stuff the info into a SQL database. This is all for personal use and I don't need the data to be ultra fresh, so a once-a-week update should be more than acceptable.

    Since the basic info provided by the catalog index doesn't include a synopsis, I need to get title details for each watch instantly title. As you might guess this is very slow. Hopefully, parameterized searches will be made available in the very near future.
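
The tag-scanning approach mentioned in this reply might look roughly like the following in Python. This is a sketch and not the poster's .NET code: it reads the file in fixed-size chunks and cuts out everything between an opening and a closing tag. The `catalog_title_index` tag name comes from the post; the helper name `iter_records`, the chunk size, and the sample content are invented. It assumes the tag is never nested inside itself and that no other tag name begins with the same characters.

```python
import io


def iter_records(stream, tag, chunk_size=64 * 1024):
    """Yield the raw text of each <tag ...>...</tag> block from a text stream."""
    open_tag = "<" + tag
    close_tag = "</" + tag + ">"
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        while True:
            start = buffer.find(open_tag)
            if start == -1:
                # Keep a short tail in case an opening tag straddles two chunks.
                buffer = buffer[-len(open_tag):]
                break
            end = buffer.find(close_tag, start)
            if end == -1:
                # Record is incomplete: keep it and wait for more data.
                buffer = buffer[start:]
                break
            end += len(close_tag)
            yield buffer[start:end]
            buffer = buffer[end:]


# Made-up sample; a tiny chunk size forces tags to be split across reads.
sample = io.StringIO(
    "<root>"
    "<catalog_title_index><title>First</title></catalog_title_index>"
    "<catalog_title_index><title>Second</title></catalog_title_index>"
    "</root>"
)
records = list(iter_records(sample, "catalog_title_index", chunk_size=7))
print(len(records))
```

Each yielded string is small enough to hand to an ordinary tree parser, which keeps the per-record code simple while the file as a whole is never loaded.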

  7. royce, 3 years ago

    JRC - Thanks for the tips! I rewrote the function to use XMLParser and got the processing time down to 10 minutes (still using PHP!). I feel so encouraged, I may get my SO a Python book for Valentine's Day!

  8. Jason, 2 years ago

    Is there any chance somebody could share their code? Royce, your 10 minute script would be great, if you're willing to share.

  9. dhchoi, 2 years ago

    Since no one seemed to respond to you, Jason, I can offer some help.

    Instantwatcher uses a streaming XML parser to find all the instant titles in /catalog/titles/index and generates a CSV file from that.

    It's written in Ruby, but you don't need to know Ruby to run the script. If you're still looking for help and want a copy of the script, just let me know. I'll post it on GitHub as a gist.

  10. royce, 2 years ago

    Sorry Jason,

    This forum doesn't notify when someone responds to the threads or posts! It's a bit much to expect folks to remember what they posted, and on what forum, and periodically come back and check to see if there is a response. I only came back here because I've managed to misplace the code for my script. If I find it, and remember to come back and check this thread, I'll post it!

  11. Kirsten Jones, 2 years ago

    Hi all,

    We do try to stay on top of these conversations, but sometimes some slip through, for which I apologize. Please remember that the forums can also be monitored via RSS, at the top level or for any specific forum or topic. Check for the bug in the URL bar.

    Thanks,
    Kirsten
