DS‑4XX: Harnessing Language Models for Data Science

Boston University

Course Overview

An advanced seminar exploring the use of large language models (LLMs) in data science workflows. Topics include prompt engineering, retrieval-augmented generation (RAG), automation of analysis and reporting, and the ethical implications of AI-assisted data science.

Proposed Course Design

Notes: I proposed and designed this course as an advanced elective for upper-level undergraduates and master’s students interested in the intersection of generative AI and data science. It has not yet been offered.

Learning Objectives

Students will develop expertise in:

Prompt Engineering: Designing effective prompts for data science tasks
RAG Systems: Building retrieval-augmented generation systems for domain-specific analysis
Automated Analysis: Using LLMs to streamline data exploration and reporting
Code Generation: Leveraging AI for data science programming assistance
Quality Assurance: Validating and verifying AI-generated insights
Ethical AI Use: Responsible practices for AI-assisted data science

Proposed Course Structure

Hands-on Workshops: Practical experience with state-of-the-art LLMs
Project-based Learning: Students build AI-assisted data science workflows
Ethics Integration: Ongoing discussion of responsible AI practices
Industry Case Studies: Real-world examples of LLM use in data science
Technical Deep Dives: Understanding LLM capabilities and limitations

Key Technical Skills

Prompt Design: Crafting effective prompts for different data science tasks
API Integration: Working with OpenAI, Anthropic, and other LLM providers
RAG Implementation: Building systems that combine LLMs with external knowledge
Workflow Automation: Using LLMs to streamline repetitive analysis tasks
Model Evaluation: Assessing LLM performance on data science problems
Fine-tuning: Adapting models for specific data science domains
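
To make the RAG Implementation skill concrete, here is a minimal sketch of the retrieval step, assuming a tiny in-memory document list and bag-of-words cosine similarity in place of a real embedding model and vector store. The names `tokenize`, `cosine`, `retrieve`, and `build_prompt` are hypothetical and chosen only for illustration; the resulting prompt string would then be sent to whichever LLM provider the workflow uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=2):
    """Return up to k documents most similar to the query, best first.

    Documents with zero similarity are dropped. This stands in for an
    embedding-based search (an assumption made to keep the sketch small).
    """
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:k] if score > 0]


def build_prompt(query, documents, k=2):
    """Combine the retrieved context and the question into one prompt string."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


docs = [
    "The sales table has columns region, month, and revenue.",
    "Customer churn is defined as no purchase in 90 days.",
    "Revenue is reported in US dollars.",
]
print(build_prompt("How is churn defined?", docs, k=1))
```

Keeping retrieval separate from prompt construction means the similarity function can later be swapped for an embedding-based one without touching the rest of the workflow.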

Application Areas

Exploratory Data Analysis: AI-assisted data exploration and hypothesis generation
Report Generation: Automated creation of data science reports and summaries
Code Documentation: Using LLMs to improve code readability and documentation
Data Cleaning: AI-assisted identification and correction of data quality issues
Insight Discovery: Leveraging LLMs to identify patterns and generate hypotheses

Ethical Considerations

Bias and Fairness: Understanding how LLM biases affect data analysis
Transparency: Maintaining interpretability in AI-assisted workflows
Validation: Ensuring human oversight of AI-generated insights
Attribution: Properly crediting AI assistance in data science work
Privacy: Protecting sensitive data when using external LLM services
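
As one illustration of the Privacy item, the sketch below masks obvious identifiers before text leaves the local environment for an external LLM service. It assumes only two patterns, email addresses and US Social Security numbers in the `NNN-NN-NNNN` form; the function name `redact` and the placeholder tokens are hypothetical, and real data would likely need broader rules and human review.

```python
import re

# Deliberately simple patterns; they are assumptions for illustration
# and will miss many real-world formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text):
    """Replace email addresses and SSN-like strings with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)


print(redact("Contact jane.doe@example.com, SSN 123-45-6789."))
```

Running redaction as a separate step before any API call makes it straightforward to log and audit exactly what was sent to the provider.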

Course Innovation

This course is designed as a first-of-its-kind offering focused specifically on the practical integration of large language models into data science practice, addressing both technical implementation and ethical considerations.