CSCE 4930 Big Data Science (Spring 2016)

Overview

Time:
M/W 2:30-3:50 PM
Room:
NTDP B190
Instructor:
Rodney Nielsen
Office:
NTDP F246
Office Hours:
M/W 3:50-4:45 PM, or by appointment
Final:
Monday, May 9th, 1:30-3:30 PM

Course Description

Have you ever wondered ...

  • how Amazon knew out of the 32.8 million (32,800,000) books it sells that you would want that new book on Hadoop?
  • how the politicians could tell you were an Independent despite the fact that your Facebook posts didn’t say anything about politics, and how they were able to process the posts of over 1.49 billion (1,490,000,000) active users to find you?
  • how scientists discover the genetic cause of a particular problem, given that the total length of the human genome is over 3 billion (3,000,000,000) base pairs?
  • how biomedical researchers can determine that, for 250 out of the 1,500,000 Americans who had heart attacks or strokes last year, the event was the result of an interaction between two specific drugs out of the thousands of medications marketed?
  • how the odds-makers decide by how much the Cowboys are going to win or lose Monday night’s game?
  • how data scientists can determine not only that there is a Right Whale in the 5.3 trillion (5,316,940,922,880) bits of information representing a photograph, but that it is Mimi, when there are no other photos of Mimi at this specific orientation, lighting, distance/size, water cover/context, etc.?
If so, then this class is for you.

    As companies amass more and more data, it becomes increasingly important to be able to move beyond typical database CRUD functions (Create, Read, Update, Delete). This data is generally not very valuable unless we are able to recognize the patterns in it that can lead to actionable intelligence; this is especially true in the case of Big Data (massive datasets). In Big Data Science, we will focus on the practical issues associated with extracting such actionable intelligence.

    This includes investigating a variety of machine learning (ML) algorithms (e.g., for classification, function approximation, clustering, attribute association, etc.), learning how to design scientifically sound experiments to test hypotheses, understanding practical issues involved in selecting and tuning ML algorithms and how they satisfy the needs of stakeholders, and learning about issues in data preprocessing, feature engineering, feature selection and visualization.

    We will also briefly discuss advanced techniques such as semi-supervised learning and active learning. The course will cover algorithm frameworks, problem settings, learning objectives, practical considerations, applications, and enough theory to understand the implications of utilizing various algorithms.

    Learning Objectives. By the end of the course, students should be able to:

    Field of Work:

    • Describe the landscape of data science projects.
    • Recognize the uses and benefits of data science, its components, and related tools.
    • Describe common patterns, challenges, and approaches associated with data science projects, and what makes them different from projects in related fields.
    • Identify resources available to support DM.

    Data Manipulation:

    • Identify common problems and issues in source data.
    • Design and apply methods to solve or alleviate such problems and issues.
    • Develop and apply techniques associated with big data manipulation.

    Experimental Design:

    • Design effective experiments for most common data analytics tasks and evaluate the results.
    • Distinguish and utilize appropriate statistical methods to make clear and compelling arguments.
    • Explain and evaluate the role of open data and reproducibility in data science.

    Machine Learning:

    • Explain the concepts underlying commonly used supervised classification/prediction methods (e.g., classification rules, decision trees, Bayesian methods, linear models, ensemble and randomization methods, association rules) and unsupervised learning methods (e.g., k-means).
    • Explain associated optimization methods (e.g., gradient descent) and transformation methods (e.g., kernels).
    • Effectively apply ML tools and techniques.
    • Compare, contrast and apply appropriate learning methods given a problem description, including implications of their application to massive data sets.
    • Identify and design appropriate evaluation methods and metrics, and analyze their limitations.
    • Explain and use MapReduce and Hadoop effectively (time permitting).

    Visualization:

    • Design and critique visualizations and information presentation techniques.

    Privacy and Ethics:

    • Identify and, where possible, mitigate potential privacy and ethics concerns related to big data, open data, and data science.

    Prerequisites

    • CSCE 3110 Data Structures and Algorithms (or Instructor permission)
    • An understanding of databases, probability, and statistics would be very useful.