DS-4XX: Harnessing Language Models for Data Science

🎓 • Boston University

Course Overview

An advanced seminar exploring the use of large language models (LLMs) in data science workflows. Topics include prompt engineering, retrieval-augmented generation (RAG), automation of analysis and reporting, and the ethical implications of AI-assisted data science.

Proposed Course Design

Notes: Proposed and designed by me. Intended as an advanced elective for upper-level undergraduates and master’s students interested in the intersection of GenAI and data science. Not yet offered.

Learning Objectives

Students will develop expertise in:

  1. Prompt Engineering: Designing effective prompts for data science tasks
  2. RAG Systems: Building retrieval-augmented generation for domain-specific analysis
  3. Automated Analysis: Using LLMs to streamline data exploration and reporting
  4. Code Generation: Leveraging AI for data science programming assistance
  5. Quality Assurance: Validating and verifying AI-generated insights
  6. Ethical AI Use: Responsible practices for AI-assisted data science

Proposed Course Structure

  • Hands-on Workshops: Practical experience with state-of-the-art LLMs
  • Project-based Learning: Students build AI-assisted data science workflows
  • Ethics Integration: Ongoing discussion of responsible AI practices
  • Industry Case Studies: Real-world examples of LLM use in data science
  • Technical Deep Dives: Understanding LLM capabilities and limitations

Key Technical Skills

  • Prompt Design: Crafting effective prompts for different data science tasks
  • API Integration: Working with OpenAI, Anthropic, and other LLM providers
  • RAG Implementation: Building systems that combine LLMs with external knowledge
  • Workflow Automation: Using LLMs to streamline repetitive analysis tasks
  • Model Evaluation: Assessing LLM performance on data science problems
  • Fine-tuning: Adapting models for specific data science domains
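
To make the RAG item above concrete, here is a minimal sketch of the retrieval half of such a system, assuming a toy in-memory knowledge base and simple bag-of-words cosine similarity in place of learned embeddings. The documents, function names, and prompt wording are all hypothetical illustrations, not course materials, and the final call to an LLM provider is deliberately left out.

```python
import math
import re
from collections import Counter

# Hypothetical mini knowledge base; a real system would index domain documents.
DOCUMENTS = [
    "The sales table records one row per order with region and revenue columns.",
    "Missing revenue values are encoded as -1 and should be treated as null.",
    "Customer churn is defined as no purchase within the last 90 days.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vec, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Assemble a prompt that grounds the question in retrieved context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    print(build_prompt("How are missing revenue values encoded?", DOCUMENTS))
```

The string returned by build_prompt would then be sent to whichever LLM provider is in use. Word-count similarity is swapped in here only to keep the sketch dependency-free; an embedding model and a vector index would normally take its place.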

Application Areas

  • Exploratory Data Analysis: AI-assisted data exploration and hypothesis generation
  • Report Generation: Automated creation of data science reports and summaries
  • Code Documentation: Using LLMs to improve code readability and documentation
  • Data Cleaning: AI-assisted identification and correction of data quality issues
  • Insight Discovery: Leveraging LLMs to identify patterns and generate hypotheses

Ethical Considerations

  • Bias and Fairness: Understanding how LLM biases affect data analysis
  • Transparency: Maintaining interpretability in AI-assisted workflows
  • Validation: Ensuring human oversight of AI-generated insights
  • Attribution: Properly crediting AI assistance in data science work
  • Privacy: Protecting sensitive data when using external LLM services
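
As one illustration of the privacy item above, the sketch below masks email addresses and US-style phone numbers before text is sent to an external LLM service. The patterns and the redact helper are assumptions made for this example; they catch only simple cases and are not a substitute for a proper PII detection process.

```python
import re

# Simplified, hypothetical patterns; real PII detection needs broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text):
    """Replace email addresses and phone numbers with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    raw = "Contact jane.doe@example.com or 617-555-0100 about order 42."
    print(redact(raw))
```

Running redaction locally, before any API call, keeps the raw identifiers from leaving the analyst's machine, while the placeholders preserve enough structure for the model to reason about the text.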

Course Innovation

A first-of-its-kind course focused specifically on integrating large language models into data science practice, addressing both technical implementation and ethical considerations.