CS6240 Large-scale Parallel Data Processing

How to Succeed in This Course

This is an advanced graduate course about an evolving topic. It is therefore essential that you go through the online material carefully and methodically, attend the lectures and participate in online discussions. Homework is designed to help you understand the material and prepare for the exam. The following often works well:

When going through the online material, make notes about questions you have or about material you find difficult to understand. Then share these questions through the online forum or in class.
When you get a question in a check-your-knowledge quiz wrong or are not sure about the answer, go back to the corresponding online material and try to find the answer.
Start working on homework assignments as soon as they come out. This way you have time to ask questions and get help.

Is This the Right Course for You?

Programming Language

This really is an algorithms course at heart. You will write plenty of code, but the main emphasis is on learning how to approach big-data analysis problems. You will need solid Java or Python programming skills to succeed, but we are not teaching any Java/Python basics in this course. You do not need advanced Scala skills and should be able to pick up what you need on the fly with reasonable effort.

If you believe that programming in Java or Scala presents an insurmountable barrier for you, contact the instructor during the first week of classes to find a solution. It is possible to program in other languages, but we generally cannot promise any support for them, so you may be on your own if you get stuck. Students in the past have completed their homework successfully using Python for both MapReduce and Spark. Python is well supported in Spark, and the programs often look similar to those written in Scala.

Challenges You May Face

We are learning about novel techniques that are only partially understood and explored by the research community. Hence, in many cases there are no "certain truths". At times we might find better solutions that could be publishable in a research paper.
We are working with complex cutting-edge software from the open-source community. This means that there will be bugs, missing documentation, and at times simply inexplicable behavior. Hadoop and Spark also keep changing their APIs, so some code you find in books or on the Web might be outdated or use deprecated features.
When dealing with big data in a complex environment such as MapReduce/Spark and the cloud, developing and debugging code differs from traditional settings. Sometimes a task might appear easy but turn out to be much harder and more time-consuming (or the other way round).

You should only take this course if you are prepared to deal with such issues and are willing to put in extra time when necessary. Do not take this course if you want a well-polished and well-tested course without any uncertainty. If you are genuinely interested in the topic and are ready to work around the inevitable frustrations, then this will be a rewarding experience.