Class
Time: Tuesday and Thursday 1:30-3:10pm from May 7 to Aug 15, Full Summer 2024
Location: Churchill Hall 103 (in-person only)
People
Instructor: Zixuan Chen
Office hours: Wednesday 1:00-3:00pm (Zoom link on Canvas)
TA: Hari Vilas Panjwani
Office hours: Friday 9:00-11:00am (Zoom link on Canvas)
Topics
- An overview of the big-data processing landscape
  - We will discuss some trends and challenges and briefly survey alternative approaches.
- Distributed algorithms for processing big data
  - We will cover a variety of fundamental problems and design patterns, including join computation, graph algorithms, information retrieval, and data mining techniques, and analyze how they can be implemented in a scalable manner.
  - In addition to the implementations, we will discuss how to evaluate the performance and scalability of programs using parallelization measures.
- Parallel data processing tools
  - We will work with and discuss features/limitations of Hadoop MapReduce and Spark.
  - We will cover HBase, Hive and Spark libraries including Spark SQL, Spark Streaming, MLlib and GraphX.
  - We will use the Amazon Cloud to run the code but may work with a different provider if necessary. (Our goal is to provide a real-world commercial-cloud experience at minimal cost, ideally zero, for each student.)
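To give a flavor of the join-computation pattern listed above, here is a minimal sketch of a reduce-side equi-join, simulated on a single machine in plain Python. It is not taken from the course materials and is not Hadoop or Spark code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `reduce_side_join`) and the toy `orders`/`customers` tables are hypothetical, chosen only to mirror the map, shuffle, and reduce steps of a MapReduce job.

```python
from collections import defaultdict
from itertools import product


def map_phase(orders, customers):
    """Emit (join_key, tagged_record) pairs; the tag records the source table."""
    for order_id, customer_id, amount in orders:
        yield customer_id, ("order", (order_id, amount))
    for customer_id, name in customers:
        yield customer_id, ("customer", name)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Pair every customer record with every order record sharing this key."""
    names = [v for tag, v in values if tag == "customer"]
    order_rows = [v for tag, v in values if tag == "order"]
    for name, (order_id, amount) in product(names, order_rows):
        yield key, name, order_id, amount


def reduce_side_join(orders, customers):
    """Run map, shuffle, and reduce in sequence (hypothetical driver)."""
    groups = shuffle(map_phase(orders, customers))
    result = []
    for key in sorted(groups):  # sorted only to make the output deterministic
        result.extend(reduce_phase(key, groups[key]))
    return result


# Toy input tables (assumed data, for illustration only).
orders = [(1, "c1", 30.0), (2, "c2", 12.5), (3, "c1", 7.0), (4, "c9", 99.0)]
customers = [("c1", "Ada"), ("c2", "Grace"), ("c3", "Edsger")]

joined = reduce_side_join(orders, customers)
for row in joined:
    print(row)
```

Keys that appear in only one table (`c3` and `c9` here) produce no output, which is the inner-join behavior. In a real cluster each key group would be handled by some reducer in parallel, so a single very frequent key can become a bottleneck.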
Structure
The course consists of lectures, weekly readings and self-check quizzes, four homework assignments, one group project and an exam.
Lecture
Lectures focus on the most difficult, interesting, and relevant material. Expect interaction during lectures, e.g., group activities and guided problem-solving.
Participation
You are expected to attend every lecture, but attendance is not checked. Asking and answering questions and posting relevant information on the discussion boards are encouraged, and a small participation bonus will be added to the overall grade of active students.
Weekly Readings
Weekly readings provide the background knowledge, terminology, and examples you need to understand and apply fundamental course concepts. You must complete/view all assigned readings, presentations, and demonstrations included in the lessons. All materials should be completed by the due dates specified.
Self-checks
When available, complete the self-checks on the online lecture material. They are designed to enhance your understanding of, and ability to correctly apply, the concepts covered in the weekly readings and presentations. Getting a few questions wrong does not result in any deduction from your final grade, unless it looks like you are guessing. Note that you must complete the self-check for a module by midnight on the Sunday before the module is discussed.
Homework/Project
You will complete multiple homework assignments that give you the opportunity to write code and practice the concepts you learn, as well as a project in which you solve a more complicated problem. More information about these assignments and the course project is available on Canvas.
Exam
You will complete an exam designed to test your understanding of the course concepts. The exam is closed-book, i.e., you cannot bring any material, but you need to bring a pencil or pen to write with and a device that can take photos of your solutions and upload them online. Students must be present in the lecture room for the exam. Exceptions are possible for students with disabilities who can provide an official letter from the corresponding Northeastern office at the beginning of the semester.
Acknowledgement
The course was designed and taught by Prof. Mirek Riedewald. We reuse most of the reading materials and lecture slides. Many thanks to the previous instructors of the course, Prof. Mirek Riedewald and Nikolaos Tziavelis, for their help!