In the era of big data, we increasingly face the challenge of processing vast volumes of data. Given the limits of individual machines (compute power, memory, bandwidth), the solution is often to process the data in parallel on many machines. This course focuses on the fundamentals of scaling computation to handle common data analytics tasks. You will learn about basic tasks in collecting, wrangling, and structuring data; programming models for performing certain kinds of computation in a scalable way across many compute nodes; common approaches to converting algorithms to such programming models; standard toolkits for data analysis consisting of a wide variety of primitives; and popular distributed frameworks for analytics tasks such as filtering, graph analysis, clustering, and classification.
CIT 5910 Introduction to Software Development or equivalent programming experience; broad familiarity with probability and statistics, as well as programming in Python. Additional background in statistics, data analysis (e.g., in MATLAB or R), and machine learning is helpful (for example, ESE 5420 Statistics for Data Science: An Applied Machine Learning Course).