Data Mining, Discovery, and Exploration

Harvard Summer School

CSCI S-108

Section 1

CRN 35899

Extracting actionable insights and relationships from massive, complex datasets is the domain of data mining. Data mining has wide-ranging applications in science and technology. This course addresses several key aspects of data mining, including the use of key-value pairs and hashing methods to manage and compute approximate analytics for massive-scale datasets; highly scalable approximate similarity search and embedding algorithms for information retrieval, as used in retrieval-augmented generation (RAG) algorithms, web search, image search, and recommendation systems; algorithms for ranking search and recommendation results; highly memory-efficient sketch algorithms for data of unbounded size, such as streaming data and online processing of massive datasets; unsupervised learning, including clustering models and dimensionality reduction algorithms for finding and exploring relationships in massive, complex datasets; and graph representations and algorithms for search and social network analysis. The course comprises readings and lectures on theory along with hands-on exercises and projects in which students apply theory through Python coding and interpretation of results. The hands-on component of the course uses a variety of Python libraries, including scikit-learn, NetworkX, FAISS, and deep-learning platforms and packages. Students enrolled for graduate credit are required to perform, present, and report on an independent project. This project must demonstrate mastery of the methods covered in the course as applied to a suitable real-world dataset.

Instructor Info

Stephen Elston, PhD

Principal Data Scientist


Meeting Info

TTh 6:30pm - 9:30pm (6/22 - 8/7)

Participation Option: Online Asynchronous or Online Synchronous

In online asynchronous courses, you are not required to attend class at a particular time. Instead, you can complete the coursework on your own schedule each week.

Deadlines

Last day to register:

Additional Time Commitments

Optional sections to be arranged.

Prerequisites

Students enrolling in this course are expected to have some exposure to basic machine learning and data science methods, equivalent to CSCI S-101, and experience programming in Python, equivalent to CSCI S-7 or CSCI S-50. For those with limited Python programming experience, some experience programming in another language, such as R, MATLAB, or C++, is essential. Knowledge of linear algebra, including eigenvalue-eigenvector decomposition, and some differential and integral calculus equivalent to MATH S-21a is essential.

Notes

This course meets via web conference. Students may attend at the scheduled meeting time or watch recorded sessions asynchronously. Recorded sessions are typically available within a few hours of the end of class and no later than the following business day. See minimum technology requirements. Not open to Secondary School Program students.

Syllabus

All Sections of this Course

CRN | Section # | Participation Option(s) | Instructor | Section Status | Meets | Term Dates
17304 | 1 | Online Asynchronous, Online Synchronous | Stephen Elston | Open | W 6:00pm - 8:00pm | Sep 2 to Dec 20
35899 | 1 | Online Asynchronous, Online Synchronous | Stephen Elston | Open | TTh 6:30pm - 9:30pm | Jun 22 to Aug 7