STATISTICS


Course Credits: 3 Units

Prerequisites: CMSC 11, Stat 106 or COI (for non-majors)

CMSC 197 - Special Topics (Introduction to Data Science)

Course Description

This 3-unit course provides an overview of the foundations of Data Science, covering a broad selection of key challenges in, and methodologies for, working with big data. Topics include data collection, cleaning, integration, management, modeling, analysis, visualization, prediction, and informed decision making. The course also covers the basics of collecting, cleaning, and sharing data, as well as the essential exploratory techniques for summarizing data, including some of the common multivariate statistical techniques used to visualize high-dimensional data.

Course Learning Outcomes

After completion of the course, the student should be able to:

  1. Write programs in R;
  2. Understand problems solvable with data science and attack those problems from a statistical perspective;
  3. Collect, manipulate, and blend data from different data sources; and
  4. Visualize data and perform exploratory data analysis.

Course Outline

UNIT 1. Data Science Overview

  1. Overview of R
  2. Getting and Cleaning Data Overview
  3. Practical Machine Learning Overview
  4. Regression Models Overview
  5. Reproducible Research
  6. Statistical Inference Overview
  7. Big Data
  8. Experimental Design
  9. Types of Questions

UNIT 2. R Programming

  1. Introduction and History of R
  2. R Data Types and Objects
  3. Reading and Writing Data
  4. Control Structures
  5. Functions
  6. Scoping Rules
  7. Dates and Times
  8. Loop Functions
  9. Debugging Tools
  10. Simulation
  11. Code Profiling

UNIT 3. Getting and Cleaning Data

  1. Data Collection
  2. Data Formats
  3. Making Data Tidy
  4. Distributing Data
  5. Scripting for Data Cleaning

UNIT 4. Exploratory Data Analysis

  1. Making Exploratory Graphs
  2. Principles of Analytic Graphs
  3. Plotting Systems and Graphics Devices in R
  4. The Base, Lattice, and ggplot2 Systems in R
  5. Clustering Methods
  6. Dimension Reduction Techniques