Class

Time Mon to Thu 9:50am-11:30am
Location Snell Library 033

Course Team & Getting-in-Touch

Please use Piazza to ask questions that your peers may be able to answer (e.g., setting up the programming environment, Amazon account creation, slide content). To send a private message to the instructor, you can use either Piazza or email. If you wish, you can also leave anonymous feedback via this Google form, which only the instructor can see.

Nikolaos (Nikos) Tziavelis [Instructor]

E-mail tziavelis.n (at) northeastern (dot) edu
Web https://ntzia.github.io/
Office Hours Tuesdays 2-5pm (Zoom link on Canvas)

Santosh Saranyan [Teaching Assistant]

E-mail saranyan.s (at) northeastern (dot) edu
Web https://ssantosh1999.wixsite.com/eportfolio/
Office Hours Thursdays 2-5pm (Zoom link on Canvas)

Ronhit Neema [Teaching Assistant]

E-mail neema.r (at) northeastern (dot) edu
Web https://ronhitneema.github.io/myportfolio.github.io/
Office Hours Fridays 2-5pm (Zoom link on Canvas)

Sumit Hawal [Teaching Assistant]

E-mail hawal.s (at) northeastern (dot) edu
Office Hours Mondays 5-8pm (Zoom link on Canvas)

Course Topics and Outcomes

  1. Get an overview of the big-data-processing landscape.
    • We will discuss some trends and challenges and briefly survey alternative approaches.
  2. Learn how to design distributed algorithms for processing big data, and how to implement them in Hadoop MapReduce and in Spark. While MapReduce or Spark might be replaced at some point by other systems, the algorithm design patterns taught in this course will remain relevant, because they are concerned with partitioning of a problem and assigning data to many machines.
    • We will cover a variety of fundamental problems and design patterns, including join computation, graph algorithms, information retrieval, and data mining techniques, and analyze how they can be implemented in a scalable manner.
  3. Get hands-on practice writing code and running it on many processors.
    • We will work with Hadoop MapReduce and Spark.
    • We will use the Amazon Cloud to run the code but may work with a different provider if necessary. (Our goal is to provide a real-world commercial-cloud experience at minimal cost—ideally zero—for each student.) Details will be announced with the first homework assignment.
  4. Understand the system architecture and functionality underlying MapReduce and Spark.
    • We will discuss features and limitations of MapReduce and Spark.
Since we cannot cover all possible parallel-computation approaches, you are encouraged to explore other courses on related topics. Also note that new approaches for big-data processing keep appearing, often trying to address some weakness of existing ones. While we might not be able to cover them, a solid understanding of parallel-data-processing principles will help you evaluate their tradeoffs.

Is this the right course for you?

Programming languages knowledge

This really is an algorithms course at heart. You will write plenty of code, but the main emphasis is on learning how to approach big-data analysis problems. You will need solid Java or Python programming skills to succeed, and we do not teach any Java/Python basics in this course. You do not need advanced Scala skills and should be able to pick up what you need on the fly with reasonable effort.

If you believe that programming in Java or Scala presents an insurmountable barrier for you, contact the instructor as soon as possible to find a solution. It is possible to program in other languages, but we generally cannot promise any support for them, so you may be on your own if you get stuck. In the past, students have completed their homework successfully using Python for both MapReduce and Spark. Python is well supported in Spark, and the programs often look similar to those written in Scala.

Challenges you may face

  • We are learning about novel techniques that are only partially understood and explored by the research community. Hence in many cases there are no "certain truths". At times we might find better solutions that could be publishable in a research paper.
  • We are working with complex cutting-edge software from the open-source community. This means that there will be bugs, missing documentation, and at times simply inexplicable behavior. Hadoop and Spark also keep changing and updating their APIs, so some code you find in books or on the Web might be outdated or use deprecated features.
  • When dealing with big data in a complex environment such as MapReduce/Spark and the cloud, developing and debugging code differs from traditional settings. Sometimes a task might appear easy but turn out to be much harder and more time-consuming (or the other way round).

You should only take this course if you are prepared to deal with such issues and are willing to put in extra time when necessary. Do not take this course if you want a well-polished and well-tested course without any uncertainty. If you are genuinely interested in the topic and are ready to work around the inevitable frustrations, then this will be a rewarding experience.

Acknowledgements

This course was designed and has been taught by Mirek Riedewald. We will for the most part follow the same structure and use the same material. Credit for the course web page design goes to Nate Derbinsky, and to Wolfgang Gatterbauer for some modifications.