SE256: Scalable Systems for Data Science

Department of Computational and Data Sciences

Scalable Systems for Data Science

  • Instructors: Yogesh Simmhan (email) and Partha Talukdar (email)
  • Course number: SE256
  • Credits: 2:1
  • Semester: Jan, 2016
  • Lecture: Wed 2-4PM, Lab: Fri 10-11AM
  • Room: SERC 202
  • Pre-requisites: Data Structures, Programming and Algorithm concepts. Programming experience required, preferably in Java.

Overview

This course will introduce the fundamental systems aspects of big data platforms, and how these platforms can be used to build large-scale data intensive applications. It will cover topics on: Why Big Data platforms are necessary? How they are designed? What are the programming abstractions (e.g. MapReduce) that are used to compose data science applications? How the programming models are translated to scalable runtime execution on clusters and Clouds (e.g. Hadoop)? How do you design algorithms for analyzing large datasets? and How do you map them to Big Data platforms?

We will discuss platforms used for developing applications over large columnar/tuple based data, such as Map Reduce/Apache Hadoop, as well as those for streaming data like Apache Storm and Spark Streaming, and graph/linked data like Apache Giraph. Besides these imperative programming models, we also discuss easy to use declarative tools such as NoSQL databases and TensorFlow.

In addition, we shall explore how these platforms can be used to scale up data analysis techniques such as clustering, collaborative filtering, frequent itemset mining, classification, graph analytics, etc. and apply them over large datasets.

The emphasis will be on designing applications that show good “weak scaling” as the size, speed or complexity of data increases, and using distributed systems such as commodity clusters and Clouds. A project will be a key part of this course and here you will develop meaningful analytics over large, real-world datasets using the platforms, tools and techniques you learn in the course.

This course complements other breadth courses on data science like the E0 229: Foundations of Data Science and E0 259: Data Analytics. These together cover the foundations, systems and applications of data science.

Intended Learning Objectives

At the end of the course, students will have learned about the following concepts.

  1. Types of Big Data, Design goals of Big Data platforms, and where in the systems landscape these platforms fall.
  2. Distributed programming models for Big Data, including Map Reduce, Stream processing and Graph processing.
  3. Runtime Systems for Big Data platforms and their optimizations on commodity clusters and Clouds.
  4. Developing Data Science algorithms and analytics using Big Data platforms.

Pre-requisites

This is an introductory course on platforms and tools required to develop analytics over Big Data. However, it builds upon prior knowledge that students have on computing and software systems, programming, data structures and algorithms. Students must be familiar with Data Structures (e.g. Arrays, Queues, Trees, Hashmaps, Graphs) and Algorithms (e.g. Sorting, Searching, Graph traversal, String algorithms, etc.).

It is recommended that students be comfortable programming (preferably in Java) to help with the projects. Basic knowledge of probability and statistics is also required. Familiarity with one or more of the following courses will also be helpful (although not mandatory): SE 292 (HPC), SE 295 (Parallel Programming), E0 253 (Operating Systems), E0 264 (Distributed Computing Systems), SE252 (Introduction to Cloud Computing), E0 225 (Design and Analysis of Algorithms), E0 232 (Probability and Statistics), E0 259 (Data Analytics).

Assessment

The total assessment score for the course is based on a 1000 point scale. Of this, the weightage to different activities will be as follows:

30% Homework Two programming assignments (150 points each).
30% Project One final project, to be done individually or in teams (300 points)
35% Exams One Mid-term (150 points) and one Final (200 points) exam.
5% Participation Participation (i.e. not just “attendance”) in classroom discussions and online forum for the course (50 points).

Academic Integrity

Students must uphold IISc’s Academic Integrity guidelines. We have a zero-tolerance policy for cheating and unethical behavior in this course and failure to follow these guidelines will lead to sanctions and penalties. This includes a reduced or failing grade in the course, and recurrent academic violations will be reported to the Institute and may lead to an expulsion.

Learning takes place both within and outside the class. Hence, discussions between students and reference to online material is encouraged as part of the course to achieve the intended learning objectives. However, while you may learn from any valid source, you must form your own ideas and complete problems and assignments by yourself. All works submitted by the student as part of their academic assessment must be their own.

Plagiarism
Verbatim reproduction of material from external sources (web pages, books, papers, etc.) is not acceptable. If you are paraphrasing external content (or even your own prior work) or were otherwise influenced by them while completing your assignments, projects or exams, you must clearly acknowledge them. When in doubt, add a citation!
Cheating
While you may discuss lecture topics and broad outlines of homework problems and projects with others, you cannot collaborate in completing the assignments, copy someone else’s solution or falsify results. You cannot use notes or unauthorized resources during exams, or copy from others. The narrow exception to collaboration is between team-mates when competing the project, and even there, the contribution of each team member for each project assignment should be clearly documented.
Classroom Behavior
Ensure that the course atmosphere, both in the class, outside and on the online forum, is conducive for learning. Participate in discussions but do not dominate or be abusive. There are no “stupid” questions. Be considerate of your fellow students and avoid disruptive behavior.

Resources

  • Textbook:
  • Online Forum: se256.jan16@mailman.serc.iisc.in | Mailman Info Webpage
  • Cluster Access: Students will validate their assignments and projects on the CDS teaching cluster. Details for accessing the cluster and running programs on it are given in the lab session.

Teaching & Office Hours

  • Lecture: Wed 2-4PM, SERC 202 (Yogesh & Partha)
  • Tutorial: Fri 10AM-11AM, SERC202 (Ravikant Dindokar (email))
  • Office Hours:  By appointment

Tentative Schedule

SE256-Jan16 Lecture Schedule

No.DateTopicsSlides
12016-01-06Big Data & Platform Design Goals.
Big Data & other computing platforms
L1 slides
22016-01-13Programming for Large Datasets: MapReduce.L2-4 slides
32016-01-20Programming for Large Datasets: MapReduce.
L2-4 slides
42016-01-27Programming for Large Datasets: MapReduce.L2-4 slides
52016-02-03Runtime Systems: Hadoop, HDFS.L5-6 slides
62016-02-10Runtime Systems: Hadoop, HDFS.L5-6 slides
72016-02-17Prediction over graphsSlides
82016-02-24Streaming Naive BayesSlides
92016-03-02Scalable Logistic Regression (and SGD)Slides, [MR for ML on Multicore, NIPS 06], [Hogwild!], [Bottou, 2010]
102016-03-09Large-scale Matrix FactorizationSlides, [Gemulla et al., KDD 2011]
*2016-03-11MID-TERM EXAM (FRI 10-1130am)
112016-03-16MR Advanced Topics: Inverted Index, PageRankL11
122016-03-23Distributed graph procesing: Apache Giraph, GoFFishL12
132016-03-30Distributed stream processing: Apache StormL13 and Marz's Slides
142016-04-06Parameter ServerPaper, Slides
*2016-04-27FINAL EXAM (WED AM)

SE256-Jan16 Tutorial Schedule

No.DateTopicsSlides
12016-01-11Hadoop/Yarn setupLab 1 slides
22016-01-18IDE setup for HadoopLab 2 slides, Wordcount test file
3
4
5
6
7
8
9
10
11
12

Assignments

  • Assignment 0
    • Install Hadoop v2.6.x+ in a pseudo distributed setup in your local deaktop or laptop. Try the wordcount example. Report in class.
  • Assignment A (100 points)(150 points), posted on Jan 25, 2016
    • Assignment-A [pdf]
    • Starter Source code: se256-alpha.zip
    • Due Wed, 3 Feb, 2016 Sun, 7 Feb, 2016 by midnight IST. See PDF file for submission details.
    • Errata
      • In task 2, change TwitterTopoCounterMap.java:42 to
        Long sink = Long.parseLong(vertexPair[1]);
      • For the Twitter data in task 2, the input is of the format:
         [User] \t [Follower] \n
      • For task 1(d), you do not have to “Show how you’re able to run the default MR PageRank algorithm sample on this graph you generate.” since PR is not shipped with Hadoop. Instead, take the output of 1(d) and verify if the source and sink edge degrees follow a powerlaw distribution, similar to task 2(b). Note that the former is an adjacency list and the latter an edge list, so you will need to modify your code suitably. You do not have to turn in the code to calculate the distribution for 1(d). Instead, give a plot of the in and out edge degree frequency distributions (X Axis edge degree, Y axis frequency of vertices) in your report. This will help compare if social networks and the WWW have similar network structures.
    • Clarification [refer to mailing list for more details]

      • For 3(a), you can assume that the HDFS path to the probability distribution file is given to you as an input of the Job, and that the file is small enough to be loaded into memory for all of the Mapper and Reducer tasks in their initialization.
      • URIs are defined in RFC1738. We will define a unique URL to be of the form below for the assignment:
        scheme://host[:port]][/]path

        Relative URLs that have just a path (without a scheme and host) should be normalized to the above form to determine uniqueness. e.g. a link to “/foo/bar.html” from a page that has a URL “http://tempuri.org/fu/barbar.html” is normalized to “http://tempuri.org/foo/bar.html“, while a link to “fubar.html” (i.e.
        without a leading “/”) from that URL normalizes to “http://tempuri.org/fu/fubar.html

        You should use this normalization to construct a webgraph or to identify unique articles. Anything within a href=”???” can be treated as an outgoing link for the webgraph, where you normalize “???” to the URL scheme above.

  • Assignment B (150 points), posted on Tue, 15 Mar, 2016

Exam Topics

  • Mid-term syllabus (FRI 11 Mar, 2016, 10-1130am)
    • All slides until and including Lecture 9 (till Mar 2 class)
    • Papers related to lectures linked from the course homepage
    • Leskovec, Rajaraman and Ullman (2014): Chapter 1, 2, 12
    • Lin & Dryer (2010): Chapters 2, 3
    • Tom White (Definitive Guide, 4th Ed): Chapters 2, 3, 4, 7
  • Final Exam Syllabus (27 Apr, 2016)
    • TBD

Project

Project topics for the SE256 are listed below. The project carries 30% weightage. You can do the project in teams of 2 (preferred) or individually. The outcomes should reflect the team size. Form a team and prepare a 1-page project proposal LATEST by Apr 4 with the following details:

  1. What is the problem and the dataset?
  2. Why is it important/interesting?
  3. What methods and platforms will you use?
  4. How does it related to other methods and platforms?
  5. How will you evaluate your outcomes: datasets, baselines, metrics (all need to be nailed down), systems/platforms used
  6. What are the deliverables, i.e. what will you submit?

Send the proposal to both YS and PPT. The earlier you send in the proposal, the quicker the approval and you can get started. Project submissions (code, results, report (6-8 pages, IEEE style)) are due by Apr 28. Project demo is on Apr 30. Your project will be evaluated on the following:

  • Problem quality & scale (size, rate, complexity): 20%
  • Technical correctness: 30%
  • Results & Demo: 20%
  • Report quality: 20%
  • Presentation: 10%

TOPICS

  1. The goal of this project is to annotate the entire Clueweb12 corpus (http://lemurproject.org/clueweb12/) with semantic and syntactic tags, and build a search engine on top of it. Existing annotation and search engine tools may be used. Output from this project will be extremely  useful to researchers working with this dataset. The entire ClueWeb12 dataset (5.5TB compressed, 27TB uncompressed) is available in MALL lab’s Hadoop cluster, the project may be carried out on that cluster. Maximum 1 group can take up this project.
  2. Embedding multi-relational graphs (e.g., knowledge graphs) in a vector space has emerged as a very active area of research [1]. In this project, you will develop parallel implementation of a few such algorithms (e.g., TransR, TransH, TransE, etc) and evaluate their performance. Appropriate datasets for this project are available, please check with Partha.
    [1] http://nlp.csai.tsinghua.edu.cn/~lzy/publications/aaai2015_transr.pdf
  3. Using the FAA airline on-time performance dataset for 12 months or more [2], define and develop meaningful analytics that combine both the network (cities are vertices, possible routes are edges) and temporal (travel between cities, distributed across days/months) nature of the data. Verify if graph metrics like centrality, shortest distance, etc. match against the actual airport hubs, route duration, etc.
    [2] http://www.transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time
  4. Perform analytics over streaming data such as filtering, counting, moments, etc. such as defined in Chapter 4 of Leskovek text book using Apache Storm. Streaming datasets are available from from email logs [3] or Internet of Things [4].
    [3] https://www.cs.cmu.edu/~./enron/
    [4] http://map.datacanvas.org/#!/data
  5. Compare the performance of common graph social network algorithms when using Hadoop and Giraph. (Chapter 10 of Leskovek). Sample graph datasets available from SNAP and LAW [10].
    [9] http://snap.stanford.edu/
    [10] http://law.di.unimi.it/datasets.php
  6. Use datasets from data.gov.in and other public sources to define meaningful analytics over large, fast or complex data.
    [5] https://data.gov.in/
    [6] https://github.com/caesar0301/awesome-public-datasets
  7. Select challenge problems from KDD Cup [7] or Kaggle.
    [7] http://www.kdd.org/kdd-cup
    [8] https://www.kaggle.com/competitions