INT 405: Natural Language Processing, 2024
Final Report
Formal Semantic Parsing with Language Models
1 Overview
In this project, students will explore the task of generating SQL queries from natural language inputs using language models (LMs). The project will progressively build on basic prompting, fine-tuning, and grammar-constrained generation of SQL queries. The SQL data used in this project will come from the GeoQuery dataset, available at https://github.com/jkkummerfeld/text2sql-data, and we will utilize Python packages such as torch, sqlite3, transformers, and the transformers-cfg library for context-free grammar (CFG) integration. The project will be conducted in groups of up to 3 members, with a joint presentation; each student submits an individual report (maximum 6 pages, using the ACL format) highlighting their contributions.
Alternatively, a group can propose an NLP project of their own interest with the approval of the module leader. On the one hand, a self-proposed project should be sufficiently novel that it has the potential to be published with further work after this semester. On the other hand, a self-proposed project should be sufficiently simple that its core results can be delivered at the end of this semester.
2 Project Components
Throughout the project, we will use the "HuggingFaceTB/SmolLM2-360M-Instruct" model as our base model. You may experiment with other models (larger models are fine if you can find additional computing resources), but this 360M-Instruct base model has to be used as a baseline in all of the first three phases.
The project will be divided into three main phases, plus additional experiments:
2.1 Phase 1: Basic Prompting
- Load the data and set up evaluation of SQL results by running the gold SQL with sqlite3 against the database.
- Provide natural language prompts and generate corresponding SQL queries.
- Evaluate the model’s performance in terms of SQL query correctness and well-formedness.
- Perform other additional analysis.
Note that with small language models it is very likely that you cannot get good accuracy. This phase mainly helps you set up the evaluation pipeline. Additional analysis can be flexible, for example, providing case-by-case error analysis. A minimal sketch of such a pipeline is given below.
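As a concrete starting point, the sketch below shows one way to prompt the base model for a query and to check predictions by execution against the GeoQuery database with sqlite3. The database path, prompt wording, and helper names are placeholders rather than part of the assignment; adapt them to the files you download from the text2sql-data repository.

# Minimal Phase 1 sketch: zero-shot prompting plus execution-based evaluation.
# DB_PATH is a placeholder; point it at the GeoQuery sqlite database you downloaded.
import sqlite3
from transformers import AutoModelForCausalLM, AutoTokenizer

DB_PATH = "data/geography.sqlite"
MODEL_ID = "HuggingFaceTB/SmolLM2-360M-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def generate_sql(question: str) -> str:
    """Ask the instruct model for a SQL query (zero-shot; prompt wording is illustrative)."""
    messages = [{"role": "user",
                 "content": f"Write a SQL query for the GeoQuery database that answers: {question}"}]
    input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                              return_tensors="pt")
    output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
    return tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)

def run_query(sql: str):
    """Execute a query and return its rows, or None if the SQL is not well-formed."""
    try:
        with sqlite3.connect(DB_PATH) as conn:
            return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None

def execution_match(gold_sql: str, pred_sql: str) -> bool:
    """Count a prediction as correct if it returns the same rows as the gold query."""
    gold_rows, pred_rows = run_query(gold_sql), run_query(pred_sql)
    return pred_rows is not None and sorted(map(str, pred_rows)) == sorted(map(str, gold_rows))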
2.2 Phase 2: Fine-tuning the Language Model
Tasks:
- Fine-tune a pre-trained language model on the GeoQuery dataset using the transformers library.
- Train the model on pairs of natural language inputs and SQL queries.
- Evaluate the model’s performance in terms of SQL query correctness and well-formedness.
- Perform other additional analysis (a minimal fine-tuning sketch follows below).
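The sketch below illustrates one possible fine-tuning setup with the transformers Trainer, assuming the GeoQuery pairs have already been loaded as dictionaries with "question" and "sql" fields; the field names, prompt format, and hyperparameters are illustrative, not prescribed.

# Minimal Phase 2 sketch: causal-LM fine-tuning of the base model on NL/SQL pairs.
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL_ID = "HuggingFaceTB/SmolLM2-360M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Replace this placeholder with the full GeoQuery training split loaded in Phase 1.
train_pairs = [{"question": "what is the capital of texas ?",
                "sql": "SELECT state.capital FROM state WHERE state.state_name = 'texas';"}]

class GeoQueryDataset(Dataset):
    """Wraps (question, SQL) pairs as single sequences for causal-LM fine-tuning."""
    def __init__(self, pairs, max_length=256):
        self.examples = []
        for ex in pairs:
            text = f"Question: {ex['question']}\nSQL: {ex['sql']}{tokenizer.eos_token}"
            self.examples.append(tokenizer(text, truncation=True, max_length=max_length)["input_ids"])

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return {"input_ids": self.examples[idx]}

args = TrainingArguments(
    output_dir="smollm2-geoquery",   # illustrative; also used as the saved-model path below
    per_device_train_batch_size=8,
    num_train_epochs=5,
    learning_rate=2e-5,
    logging_steps=50,
)
# Note: with pad_token == eos_token, this collator also masks the final EOS label;
# write a custom collator if you want the model to learn to stop explicitly.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=GeoQueryDataset(train_pairs),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("smollm2-geoquery")

Whatever prompt format you fine-tune with, reuse the same format at inference time, so that Phase 3 decoding starts from a prompt the model has seen during training.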
2.3 Phase 3: Constraining Output with CFG
Tasks:
- Define a CFG for SQL queries using the Extended Backus-Naur Form (EBNF).
- Integrate this CFG into your fine-tuned model using the transformers-cfg library.
- Evaluate the model’s performance in terms of SQL query correctness and well-formedness.
- Perform other additional analysis.
Note that SQL in general might not be context-free, but even a loose grammar can still be helpful; a minimal constrained-decoding sketch is given below.
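The sketch below follows the usage pattern shown in the transformers-cfg documentation: an IncrementalGrammarConstraint wrapped in a GrammarConstrainedLogitsProcessor and passed to generate(). The grammar here is a deliberately loose toy fragment of SQL, not the sql.ebnf you are expected to write, and the checkpoint path is a placeholder for your Phase 2 model; check the README of the transformers-cfg version you install for the exact API.

# Minimal Phase 3 sketch: grammar-constrained decoding with transformers-cfg.
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor

MODEL_ID = "smollm2-geoquery"  # placeholder: the fine-tuned checkpoint from Phase 2
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Toy EBNF fragment; your sql.ebnf should cover the SQL constructs that occur in GeoQuery.
grammar_str = r"""
root      ::= "SELECT " column " FROM " table (" WHERE " condition)? ";"
column    ::= ident ("." ident)?
table     ::= ident
condition ::= column " = " value
value     ::= "'" [a-z_ ]+ "'" | [0-9]+
ident     ::= [a-z_]+
"""

grammar = IncrementalGrammarConstraint(grammar_str, "root", tokenizer)
grammar_processor = GrammarConstrainedLogitsProcessor(grammar)

# Use the same prompt format as in fine-tuning; only the generated continuation is constrained.
prompt = "Question: what is the capital of texas ?\nSQL: "
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
    logits_processor=[grammar_processor],
)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))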
2.4 Phase 4: Additional Experiments
3 Tools and Libraries
- torch/jax for training the language model.
- sqlite3 for running and evaluating SQL queries.
- transformers for working with pre-trained language models.
- transformers-cfg for integrating context-free grammars into the language model.
4 Dataset
5 Submission Guidelines
There is a group submission deadline before the presentation and an individual submission deadline after the presentation.
Each group needs to submit a single zip (or tar.gz) file, named p03 <group id>.zip, where <group id> is your student id number. The zip file should include the following items:
- prompts.pdf It must include all AI usage from coding to text generation.
- llm for sql.ipynb The main IPython notebook, which should be runnable and include all the running results. In case you need to run several notebooks, name them llm for sql i.ipynb (shared)
- sql.ebnf The SQL grammar file for transformers-cfg (shared)
- data a data folder containing downloaded data (shared)
- model a hyperlink to the saved model (shared)
- slides.pdf the slides of group presentation (shared)
- README.md very briefly highlight the agreed individual contributions and other things if necessary. (shared)
- The deadline for the group submission is [2024 Dec 13 at 11:55 am].
One group member should upload the zip file to the Learning Mall before the group deadline.
For the individual report, you can finish the writing after the presentation.
- report <student id>.pdf Each student must submit an individual report, which should be no longer than 6 pages (excluding references and appendixes), using the ACL 2023 template (https://www.overleaf.com/latex/templates/acl-2023-proceedings-template/qjdgcrdwcnwp). It must include at least two appendixes: an individual contribution section and a prompt section (recording all AI usage, from coding to text generation).
- The deadline for submission of individual report is [2024 Dec 15 at 11:55 pm].
Note that neither the report nor the presentation has to be organized into the three phases. Presenting the work coherently, with separate method and experiment sections, is preferred.
6 Academic Honesty
Generative AI is allowed for this assignment, but you need to record all the prompts used and the AI-generated texts in the appendix.
For the report writing, there must be no copy-pasting from each other. You can share the data, but you must make your own tables, charts, etc.
7 Marking
- 20% Group Presentation: Each group should give a 9-minute presentation. Marking will be based on individual performance in terms of the visual quality of the slides, content delivery, and audience engagement. The presentation will take place on Friday, Dec 13.
- 50% Individual Report: Each individual submits a report of at most 6 pages. It is marked on both the results of the project (25% performance and analysis) and the writing of the project (25% visualization, logic flow, grammar, etc.). The report should clearly explain how the fine-tuning is done and illustrate how the CFG works in this context. Please read ACL papers for good writing samples. An appendix section should explain what the individual contributions are.
- 30% Individual QA: Each student will have a one-to-one 3-4 minute oral assessment after the presentation. Questions regarding the whole project will be asked, and students are expected to know all details of the project. Academic dishonesty may be found if a student fails to explain his/her own contributions.