DATA309-24S2 Data Science Capstone Project

DATA309 Project Description - 1
2024

Project Title: Training a Custom Large Language Model

Objective:

The objective of this project is to enable students to build their own language models using their own text datasets. You will start with NanoGPT, a lightweight version of the GPT model, and fine-tune it using your chosen data.

Alternatively, you could choose to train a Llama based model such as allamo, a simple, hackable and fast implementation for training/finetuning medium-sized LLaMA-based models.

By completing this project, students will gain hands-on experience in:

  • data sourcing/scraping
  • data cleaning/wrangling
  • data engineering
  • data preprocessing
  • building pipelines
  • configuring and using Azure Virtual Labs (or equivalent)
  • model training, and evaluation
  • fundamental data science skills of working in a group, report writing and presenting
It is envisioned that students will work in groups of 2-4; if student numbers allow, multiple groups of three would be the ideal scenario.
Prerequisites:
These prerequisites are not hard and fast; over the course of the project you will have a chance to pick up some of these skills.


  • Basic understanding of Python programming language
  • Familiarity with deep learning concepts and frameworks (e.g., PyTorch)
  • Knowledge of text pre-processing techniques
  • Access to a GPU-enabled machine or cloud-based GPU resources (recommended for faster training; AWS does provide free accounts for students, which come with free credit).

Instructions:

1. Understanding NanoGPT:
  • Familiarize yourself with nanoGPT, a lightweight variant of the GPT model. Review the available resources, including research papers, documentation, and code examples, to understand the architecture and training process.
  • A good place to start is the GitHub repository for nanoGPT, https://github.com/karpathy/nanoGPT. nanoGPT was created by Andrej Karpathy, a legendary AI researcher, engineer, and educator. He’s the former director of AI at Tesla and a founding member of OpenAI, the creator of ChatGPT.
  • Another good resource is this YouTube video by Andrej Karpathy, https://www.youtube.com/watch?v=kCc8FmEb1nY. There are also numerous blogs about nanoGPT, such as on Medium.
  • To get started, install the software stack as recommended in the nanoGPT GitHub repository. It would be very advantageous to use Ubuntu (I will talk to the IT department about getting access to a Linux environment with a GPU). Once the software stack is installed, run the ‘Quick Start’ to verify your installation.
  • allamo may be found at https://github.com/chrisociepa/allamo.
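
Before running the ‘Quick Start’, it can help to confirm that the Python packages are importable. The following is a minimal sketch, not part of nanoGPT: the package list is an assumption based on the install line in the nanoGPT README at the time of writing, so re-check the README before relying on it, and `missing_packages` is a hypothetical helper name.

```python
import importlib.util

# Assumed package list (taken from the nanoGPT README install line; verify it
# against the current README, since it may change).
REQUIRED_PACKAGES = ["torch", "numpy", "transformers", "datasets", "tiktoken", "wandb", "tqdm"]


def missing_packages(names):
    """Return the names from `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```

Once PyTorch is installed, `torch.cuda.is_available()` reports whether a CUDA GPU is visible to it.
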
2. Data Acquisition:
  • Select a domain or topic of interest for your language model. It could be news articles, scientific papers, movie scripts, social media posts, or any other text data.
  • Collect a substantial amount of text data for training. Aim for tens of thousands to hundreds of thousands, if not millions, of text documents to ensure the model’s effectiveness.
  • Check out “Data Science datasets for Natural Language Processing” at https://www.knowledgehut.com/blog/data-science/data-science-datasets for text data sources.
  • When gathering and/or scraping data, be sure to follow appropriate ethical guidelines, such as respecting copyright laws, privacy rights, and terms of service for any data sources utilized.
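
One small, automatable part of scraping responsibly is checking a site's robots.txt before fetching pages. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the URLs, and the user-agent string are hypothetical, and `allowed_urls` is an illustrative helper name. Passing this check does not by itself settle copyright, privacy, or terms-of-service questions.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, urls, user_agent="data309-student-bot"):
    """Return the URLs that the given robots.txt text permits for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt content and candidate URLs, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

candidates = [
    "https://example.com/articles/1",
    "https://example.com/private/notes",
]
print(allowed_urls(EXAMPLE_ROBOTS, candidates))
```

In a real scraper you would load the live file with `parser.set_url(...)` followed by `parser.read()` instead of parsing an inline string.
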
3. Data Preprocessing:
  • Clean and preprocess the acquired data. Perform the necessary text preprocessing tasks, such as removing special characters, lower-casing, tokenization, and removing stop words. You can use libraries like NLTK or spaCy to assist in these tasks.
  • Familiarize yourself with the nanoGPT documentation so that you know the correct format that nanoGPT is expecting for your training data.
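
As a rough illustration of the target format, the sketch below cleans a string, splits it 80:20, and writes `train.bin`, `val.bin`, and `meta.pkl` in the style of the character-level example in the nanoGPT repository (16-bit unsigned token ids plus a pickled vocabulary). This is an assumption about the expected layout, so check it against the repository's own `prepare.py` scripts, which use NumPy for this step; the standard-library `array` module is swapped in here only to keep the sketch dependency-free. `clean_text` and `prepare` are hypothetical helper names.

```python
import pickle
import re
import tempfile
from array import array
from pathlib import Path


def clean_text(raw):
    """Normalise line endings and collapse runs of whitespace."""
    text = raw.replace("\r\n", "\n")
    text = re.sub(r"[^\S\n]+", " ", text)   # collapse spaces/tabs, keep newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line in a row
    return text.strip()


def prepare(text, out_dir, train_fraction=0.8):
    """Write train.bin, val.bin and meta.pkl for a character-level model."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    ids = [stoi[ch] for ch in text]

    split = int(len(ids) * train_fraction)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, part in (("train.bin", ids[:split]), ("val.bin", ids[split:])):
        with open(out_dir / name, "wb") as f:
            array("H", part).tofile(f)  # "H" = unsigned 16-bit integers
    with open(out_dir / "meta.pkl", "wb") as f:
        pickle.dump({"vocab_size": len(chars), "itos": itos, "stoi": stoi}, f)
    return len(chars), split, len(ids) - split


sample = "Hello   world!\r\n\r\n\r\n\r\nSecond\tparagraph."
cleaned = clean_text(sample)
with tempfile.TemporaryDirectory() as tmp:
    vocab_size, n_train, n_val = prepare(cleaned, tmp)
    print(vocab_size, n_train, n_val)
```

The split here is a simple head/tail cut of one long string. For a corpus of many documents you may prefer to split by document so that no document straddles the training and validation sets.
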
4. Train NanoGPT:
  • Initialize your language model using NanoGPT as a starting point. You can find the NanoGPT codebase and pre-trained weights in the open-source repositories available online (see above).
  • Fine-tune the NanoGPT model using your preprocessed data. Split your dataset into training and validation sets (e.g., 80:20 ratio).
  • Utilize a deep learning framework like PyTorch to train the model. Implement the necessary training pipeline, including loading the pre-trained weights, defining the loss function, and setting up the optimizer. nanoGPT should run without any changes required (assuming the appropriate software environment has been installed).
  • Stretch Goal: Experiment with different hyperparameters, such as learning rate, batch size, and number of training epochs, to find the optimal configuration. Monitor the model’s performance on the validation set during training. Time and compute resources may restrict how much hyperparameter tuning can be achieved, so don’t be disappointed if you cannot achieve all that you set out to do. (This would be something to mention in the “Future Work” section of your report.)
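
A small grid search can be organised by generating one training command per hyperparameter combination. The sketch below only builds the command strings; it assumes that nanoGPT's `train.py` accepts a config file followed by `--key=value` overrides, as described in its README, and that the option names `learning_rate`, `batch_size`, `max_iters`, and `out_dir` match the current script (nanoGPT counts training in iterations rather than epochs). The config path, the search values, and `sweep_commands` are hypothetical.

```python
from itertools import product

# Hypothetical search space; the values are placeholders, not recommendations.
SEARCH_SPACE = {
    "learning_rate": [1e-3, 3e-4],
    "batch_size": [32, 64],
    "max_iters": [2000],
}


def sweep_commands(config_path, space, out_root="out-sweep"):
    """Build one train.py command line per combination in `space`."""
    keys = list(space)
    commands = []
    for run_id, values in enumerate(product(*(space[k] for k in keys))):
        overrides = [f"--{k}={v}" for k, v in zip(keys, values)]
        # Give every run its own output directory so checkpoints do not collide.
        overrides.append(f"--out_dir={out_root}/run{run_id}")
        commands.append(" ".join(["python", "train.py", config_path, *overrides]))
    return commands


for command in sweep_commands("config/my_dataset.py", SEARCH_SPACE):
    print(command)
```

Keeping the grid this small matters when compute is limited: the number of runs is the product of the list lengths, so each extra value multiplies the total training time.
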
5. Model Evaluation:
  • You will need to evaluate the performance of your trained language model. One way to do this is to come up with a list of 10-12 questions (or more) relevant to your chosen domain. Put them to the base model and record the answers; then put them to your fine-tuned model and record those answers too. You can then compare the two sets of answers to assess the model’s effectiveness in generating coherent and contextually relevant text (what metric would you use?). You could also prepare model answers to your questions and compare them with both sets of nanoGPT-generated answers.
  • Generate sample text using your trained language model to get a qualitative understanding of its capabilities. Analyze the generated samples for grammar, coherence, and overall quality.
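
As one possible answer to the metric question, the sketch below scores a generated answer against a prepared model answer using token-overlap F1, and converts a mean validation loss into perplexity. Both are common choices rather than the required ones for this project; the example sentences are made up, and `tokens`, `token_f1`, and `perplexity` are hypothetical helper names. The perplexity formula assumes the loss is a mean per-token cross-entropy in nats, which is what PyTorch's cross-entropy loss reports by default.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def token_f1(candidate, reference):
    """Harmonic mean of token precision and recall against a reference answer."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def perplexity(mean_cross_entropy):
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(mean_cross_entropy)


# Made-up answers, for illustration only.
reference = "the cat sat on the mat"
base_answer = "a dog ran"
tuned_answer = "the cat sat on a mat"

print("base model F1: ", token_f1(base_answer, reference))
print("tuned model F1:", round(token_f1(tuned_answer, reference), 3))
```

Token overlap rewards shared wording, not correctness, so it is best read alongside a manual check of grammar and coherence.
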
6. Model Refinement:
  • Analyze the shortcomings and limitations of your trained language model. Identify areas for improvement based on evaluation results and sample outputs.
  • Experiment with different training techniques or model architectures to enhance the performance of your language model. You can explore concepts like transfer learning, model ensembling, or larger model architectures to improve the quality of text generation.
7. Useful Resources:

The following list of resources may be useful. It is by no means exhaustive, so please do your own research.

  • https://github.com/kaushikb11/awesome-llm-agents?tab=readme-ov-file
  • What can you build with LangChain? Question answering with RAG, Extracting structured output, Chatbots. For further information see https://github.com/langchain-ai/langchain.
  • https://promptengineering.org/what-are-large-language-model-llm-agents/
8. Documentation and Presentation:
  • Prepare a detailed report documenting your project, including the project objectives, data collection and preprocessing techniques, model architecture, training methodology, evaluation results, and any improvements made to the base NanoGPT model.
  • Create a presentation summarizing your project and key findings. Clearly explain your approach, challenges faced, and lessons learned during the project.
  • Guidelines for writing the report and presentation may be found on the DATA309 LEARN page.
  • Be sure to document all the effort you have made.
