Data Mining, Discovery, and Exploration
Harvard Summer School
CSCI S-108
Section 1
CRN 35899
Extracting actionable insights and relationships from massive, complex datasets is the domain of data mining. Data mining has wide-ranging applications in science and technology. This course addresses several key aspects of data mining, including the use of key-value pairs and hashing methods to manage and compute approximate analytics for massive-scale datasets; highly scalable approximate similarity search and embedding algorithms for information retrieval, as used in retrieval-augmented generation (RAG) algorithms, web search, image search, and recommendation systems; algorithms for ranking search and recommendation results; highly memory-efficient sketch algorithms for unbounded data, such as streaming data and online processing of massive datasets; unsupervised learning, including clustering models and dimensionality reduction algorithms for finding and exploring relationships in massive, complex datasets; and graph representations and algorithms for search and social network analysis. The course comprises readings and lectures on theory along with hands-on exercises and projects in which students apply theory through Python coding and interpretation of results. The hands-on component of the course uses a variety of Python libraries, including Scikit-Learn, NetworkX, and FAISS, along with deep-learning platforms and packages. Students enrolled for graduate credit are required to carry out, present, and report on an independent project. This project must demonstrate mastery of methods covered in the course as applied to a suitable real-world dataset.
Credits: 4
Term
Summer Term 2026
Part of Term
Full Term
Format
Flexible Attendance Web Conference
Credit Status
Graduate, Noncredit, Undergraduate
Section Status
Open