DS-4XX: Harnessing Language Models for Data Science
Boston University
Course Overview
An advanced seminar exploring the use of large language models (LLMs) in data science workflows. Topics include prompt engineering, retrieval-augmented generation (RAG), automation of analysis and reporting, and the ethical implications of AI-assisted data science.
Proposed Course Design
Notes: Proposed and designed by me. Intended as an advanced elective for upper-level undergraduates and master's students interested in the intersection of GenAI and data science. Not yet offered.
Learning Objectives
Students will develop expertise in:
- Prompt Engineering: Designing effective prompts for data science tasks
- RAG Systems: Building retrieval-augmented generation for domain-specific analysis
- Automated Analysis: Using LLMs to streamline data exploration and reporting
- Code Generation: Leveraging AI for data science programming assistance
- Quality Assurance: Validating and verifying AI-generated insights
- Ethical AI Use: Responsible practices for AI-assisted data science
Proposed Course Structure
- Hands-on Workshops: Practical experience with state-of-the-art LLMs
- Project-based Learning: Students build AI-assisted data science workflows
- Ethics Integration: Ongoing discussion of responsible AI practices
- Industry Case Studies: Real-world examples of LLM use in data science
- Technical Deep Dives: Understanding LLM capabilities and limitations
Key Technical Skills
- Prompt Design: Crafting effective prompts for different data science tasks
- API Integration: Working with OpenAI, Anthropic, and other LLM providers
- RAG Implementation: Building systems that combine LLMs with external knowledge
- Workflow Automation: Using LLMs to streamline repetitive analysis tasks
- Model Evaluation: Assessing LLM performance on data science problems
- Fine-tuning: Adapting models for specific data science domains
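To make the RAG item above concrete, here is a minimal sketch of the retrieve-then-prompt pattern. It is illustrative only and not part of the proposed course materials. The names `retrieve`, `build_prompt`, and `DOCS` are hypothetical, and a toy bag-of-words cosine similarity stands in for the embedding model and vector store a real system would likely use. No LLM provider is called; the sketch stops at assembling the prompt.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query (toy scoring)."""
    q = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda d: cosine(q, Counter(tokenize(d))),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, documents, k=2):
    """Assemble a prompt that grounds the question in retrieved context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Hypothetical data-dictionary notes used as the external knowledge source.
DOCS = [
    "The churn column is 1 when a customer cancelled within 30 days.",
    "Revenue figures are reported in thousands of US dollars.",
    "Missing ages were imputed with the median age per region.",
]

if __name__ == "__main__":
    print(build_prompt("How were missing ages handled?", DOCS))
```

The resulting string would then be sent to whichever LLM provider is in use. Keeping retrieval and prompt assembly in separate functions makes it easier to evaluate each step on its own.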
Application Areas
- Exploratory Data Analysis: AI-assisted data exploration and hypothesis generation
- Report Generation: Automated creation of data science reports and summaries
- Code Documentation: Using LLMs to improve code readability and documentation
- Data Cleaning: AI-assisted identification and correction of data quality issues
- Insight Discovery: Leveraging LLMs to identify patterns and generate hypotheses
Ethical Considerations
- Bias and Fairness: Understanding how LLM biases affect data analysis
- Transparency: Maintaining interpretability in AI-assisted workflows
- Validation: Ensuring human oversight of AI-generated insights
- Attribution: Properly crediting AI assistance in data science work
- Privacy: Protecting sensitive data when using external LLM services
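As one illustration of the privacy item above, the sketch below masks a few obvious identifiers before text is sent to an external LLM service. It is an assumption-laden example rather than a complete safeguard. The `redact` function and the `PATTERNS` table are hypothetical, the regular expressions cover only simple email, US-style SSN, and US-style phone formats, and a real workflow would likely need a dedicated PII detection tool plus human review.

```python
import re

# Hypothetical patterns; they catch only simple, well-formed cases.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text):
    """Replace each matched identifier with a bracketed placeholder label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

if __name__ == "__main__":
    note = "Contact jane.doe@example.com or 617-555-0123."
    print(redact(note))
```

Redacting locally, before any API call, means the sensitive values never leave the analyst's machine, at the cost of missing identifiers the patterns do not anticipate.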
Course Innovation
First-of-its-kind course specifically focused on the practical integration of large language models into data science practice, addressing both technical implementation and ethical considerations.