INT 405: Natural Language Processing, 2024
Final Report
Formal Semantic Parsing with Language Models

1 Overview

In this project, students will explore the task of generating SQL queries from natural language inputs using language models (LMs). The project progressively builds up from basic prompting to fine-tuning and grammar-constrained generation of SQL queries. The SQL data used in this project will come from the GeoQuery dataset, available at https://github.com/jkkummerfeld/text2sql-data, and we will utilize Python packages such as torch, sqlite3, transformers, and the transformers-cfg library for context-free grammar (CFG) integration. The project will be conducted in groups of up to 3 members, with a joint presentation and each student submitting an individual report (maximum 6 pages, using the ACL format) highlighting their contributions.

Alternatively, a group can propose an NLP project of its own interest with the approval of the module leader. On the one hand, a self-proposed project should be sufficiently novel that it has the potential to be published with further work after this semester. On the other hand, it should be sufficiently simple that its core results can be delivered by the end of this semester.

2 Project Components

Throughout the project, we will use the "HuggingFaceTB/SmolLM2-360M-Instruct" model as our base model. You may attempt other models (a larger model is fine if you can find additional computing resources), but this 360M-Instruct base model has to be used as a baseline in all of the first three phases.

The project will be divided into three main phases:

2.1 Phase 1: Basic Prompting

In this phase, you will prompt a pre-trained language model (SmolLM2-360M-Instruct) to generate SQL queries directly from natural language inputs. This requires loading the pre-trained model from the transformers library and prompting it to generate SQL queries for natural language inputs from the GeoQuery dataset.
Tasks:
  • Finish data loading and set up evaluation of SQL results by running the gold SQL with sqlite3 against the database.
  • Provide natural language prompts and generate corresponding SQL queries.
  • Evaluate the model’s performance in terms of SQL query correctness and well-formedness.
  • Perform additional analysis.

Note that with small language models it is very likely that you cannot get good accuracy. This phase mainly helps you to set up the evaluation pipeline. Additional analysis can be flexible, for example, providing case-by-case error analysis.
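To make the pipeline concrete, here is a minimal sketch of what this phase could look like, assuming a local SQLite copy of the GeoQuery database; the database path, the prompt wording, and the example question and gold query are illustrative placeholders rather than parts of the assignment.

```python
# Rough sketch of the Phase 1 pipeline: prompt the base model, then run both
# gold and predicted SQL with sqlite3. Paths, prompt wording, and the example
# question/gold query below are placeholders, not part of the assignment.
import sqlite3
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "HuggingFaceTB/SmolLM2-360M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def generate_sql(question: str) -> str:
    """Prompt the instruct model to translate a question into one SQL query."""
    messages = [{
        "role": "user",
        "content": "Translate the question into a single SQL query for the "
                   f"GeoQuery database. Question: {question}",
    }]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
    return tokenizer.decode(output[0][input_ids.shape[1]:],
                            skip_special_tokens=True).strip()

def execute_sql(db_path: str, query: str):
    """Run a query with sqlite3; return (rows, error).
    An error means the query is not even well-formed/executable."""
    try:
        with sqlite3.connect(db_path) as conn:
            return conn.execute(query).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)

# Execution accuracy: the prediction counts as correct if it runs and returns
# the same rows as the gold query (db path and gold SQL are illustrative).
pred_sql = generate_sql("what is the capital of texas ?")
gold_rows, _ = execute_sql("geoquery.sqlite",
                           "SELECT capital FROM state WHERE state_name = 'texas';")
pred_rows, err = execute_sql("geoquery.sqlite", pred_sql)
is_correct = err is None and sorted(pred_rows) == sorted(gold_rows)
```

A natural evaluation protocol is then to report both well-formedness (the predicted query executes without an error) and execution accuracy (the returned rows match the gold rows).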

2.2 Phase 2: Fine-tuning the Language Model

In this phase, you will fine-tune a language model using pairs of natural language inputs and corresponding SQL queries from the GeoQuery dataset. This involves training the model to better understand SQL syntax in the context of natural language inputs; a minimal training sketch is given after the task list below.

Tasks:

  • Fine-tune a pre-trained language model on the GeoQuery dataset using the transformers library.
  • Train the model on pairs of natural language inputs and SQL queries.
  • Evaluate the model’s performance in terms of SQL query correctness and well-formedness.
  • Perform additional analysis.
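The following is a minimal fine-tuning sketch, assuming the (question, SQL) pairs have already been loaded into train_pairs; the prompt format, hyperparameters, and class names are illustrative choices rather than requirements of the handout.

```python
# Minimal fine-tuning sketch (Phase 2), assuming (question, sql) pairs are
# already loaded into train_pairs. Prompt format and hyperparameters are
# illustrative only.
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

MODEL_NAME = "HuggingFaceTB/SmolLM2-360M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:          # causal LMs often lack a pad token
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

class Text2SqlDataset(Dataset):
    """Join each question with its gold SQL into one causal-LM training text."""
    def __init__(self, pairs, max_len=256):
        self.examples = [
            tokenizer(f"Question: {q}\nSQL: {sql}{tokenizer.eos_token}",
                      truncation=True, max_length=max_len)
            for q, sql in pairs
        ]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]

train_pairs = [("what is the capital of texas ?",
                "SELECT capital FROM state WHERE state_name = 'texas';")]  # placeholder

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="smollm2-geoquery", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=5e-5),
    train_dataset=Text2SqlDataset(train_pairs),
    # mlm=False gives standard next-token (causal) labels; note that the loss
    # also covers the question tokens -- masking them out is a possible refinement.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```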

2.3 Phase 3: Constraining Output with CFG

In this phase, you will integrate a context-free grammar (CFG) to constrain the language model’s output, ensuring that only syntactically valid SQL queries are generated. For this, you will use the transformers-cfg library (https://github.com/epfl-dlab/transformers-CFG).

Tasks:

  • Define a CFG for SQL queries using the Extended Backus-Naur Form (EBNF).
  • Integrate this CFG into your fine-tuned model using the transformers-cfg library.
  • Evaluate the model’s performance in terms of SQL query correctness and well-formedness.
  • Perform additional analysis.

Note that SQL in general might not be context-free, but having a loose grammar can still be helpful.
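As a rough illustration of how the constraint can be wired in, the sketch below follows the usage described in the transformers-CFG README; the grammar fragment is a deliberately tiny stand-in for the sql.ebnf file you will write, and the column and table names are made up for the example.

```python
# Sketch of grammar-constrained decoding (Phase 3), following the usage shown
# in the transformers-CFG README. The grammar below is a deliberately tiny,
# illustrative stand-in for your sql.ebnf; a much looser SQL grammar with the
# real GeoQuery column/table names would be needed in practice.
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor

MODEL_NAME = "HuggingFaceTB/SmolLM2-360M-Instruct"   # or your fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

grammar_str = r"""
root   ::= "SELECT " column " FROM " table ";" | "SELECT " column " FROM " table " WHERE " column " = " value ";"
column ::= "capital" | "state_name" | "population"
table  ::= "state" | "city" | "river"
value  ::= "'" [a-z]+ "'"
"""

grammar = IncrementalGrammarConstraint(grammar_str, "root", tokenizer)
grammar_processor = GrammarConstrainedLogitsProcessor(grammar)

prompt = "Question: what is the capital of texas ?\nSQL: "
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=64,
    # Every decoding step masks out tokens that would leave the grammar,
    # so the output is guaranteed to be derivable from the CFG.
    logits_processor=[grammar_processor],
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```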

2.4 Phase 4: Additional Experiments

You can attempt other solutions or variations, e.g., what happens with CFG-constrained decoding plus prompting but without fine-tuning?

3 Tools and Libraries

You are required to use the following Python packages for the project:
  • torch/jax for training the language model.
  • sqlite3 for running and evaluating SQL queries.
  • transformers for working with pre-trained language models.
  • transformers-cfg for integrating context-free grammars into the language model.

4 Dataset

The GeoQuery dataset from the text2sql-data repository (https://github.com/jkkummerfeld/text2sql-data) will serve as the primary dataset for this project. This dataset contains natural language sentences paired with SQL queries and is suitable for text-to-SQL tasks.
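A possible loading routine is sketched below. The file name (data/geography.json) and the field names ("sentences", "sql", "variables", "question-split") are assumptions based on the repository's format and should be verified against the downloaded data.

```python
# Sketch of loading GeoQuery from the text2sql-data JSON release. The file
# name and field names are assumptions about the repository's format; check
# them against the actual download before relying on this.
import json

def load_geoquery(path: str, split: str = "train"):
    """Return (question, sql) pairs with variable placeholders filled in."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    pairs = []
    for entry in data:
        gold_sql = entry["sql"][0]                 # first gold query per entry
        for sent in entry["sentences"]:
            if sent.get("question-split") != split:
                continue
            question, sql = sent["text"], gold_sql
            # Placeholders such as state_name0 are shared between the question
            # and the SQL; substitute the concrete values in both.
            for name, value in sent["variables"].items():
                question = question.replace(name, value)
                sql = sql.replace(name, value)
            pairs.append((question, sql))
    return pairs

train_pairs = load_geoquery("data/geography.json", split="train")
print(len(train_pairs), train_pairs[0])
```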

5 Submission Guidelines

We have a group submission deadline before the presentation and an individual submission deadline after the presentation.

Each group needs to submit a single zip (or tar.gz) file, named p03_<group id>.zip, where <group id> is your student id number. The zip file should include the following items:


  • prompts.pdf It must include all AI usage from coding to text generation.
  • llm_for_sql.ipynb The main IPython notebook, which should be runnable and include all the running results. In case you need to run several notebooks, name them llm_for_sql_i.ipynb (shared)
  • sql.ebnf The SQL grammar file for transformers-cfg (shared)
  • data A data folder containing the downloaded data (shared)
  • model A hyperlink to the saved model (shared)
  • slides.pdf The slides of the group presentation (shared)
  • README.md Very briefly highlights the agreed individual contributions and other things if necessary. (shared)
  • The deadline for the group submission is [2024 Dec 13 at 11:55 am].


One group member should upload the zip file to the Learning Mall before the group deadline.

For the individual report, you can finish the writing after the presentation.

  • report_<student id>.pdf Each student must include an individual report, which should be no longer than 6 pages (excluding references and appendices), using the ACL 2023 template (https://www.overleaf.com/latex/templates/acl-2023-proceedings-template/qjdgcrdwcnwp). It must include at least two appendices: an individual contribution section and a prompt section (recording all AI usage from coding to text generation).
  • The deadline for submission of individual report is [2024 Dec 15 at 11:55 pm].

Note that neither the report nor the presentation has to be organized as three phases. Presenting them coherently through a separation of method and experiments is preferred.

6 Academic Honesty

Plagiarism or any form of academic dishonesty will not be tolerated. Be sure to properly cite any external sources of data, code, or ideas. Discussion with other groups is allowed, but you cannot directly copy code from another group.

Generative AI is allowed for this assignment, but you need to record all the prompts used and all AI-generated text in the appendix.

For the report writing, there must be no copy-pasting from each other. You can share the data, but make your own tables, charts, etc.

7 Marking

Marking will be decomposed into several components:
  • 20% Group Presentation: Each group should give a 9-minute presentation; marking will be based on individual performance in terms of the visual quality of the slides, content delivery, and audience engagement. This will take place on Friday, Dec 13.
  • 50% Individual Report: Each individual writes a report of at most 6 pages. This is marked based on both the results of the project (25%: performance and analysis) and the writing of the project (25%: visualization, logical flow, grammar, etc.). The report should clearly explain how fine-tuning is done and illustrate how the CFG works in this context. Please read ACL papers for good writing samples. An appendix section should explain what the individual contributions are.
  • 30% Individual QA: Each student will have a one-to-one 3-4 minute oral assessment after the presentation. Questions regarding the whole project will be asked. Students are expected to know all details of the project. A case of academic dishonesty may be found if a student fails to explain his/her own contributions.
Students who contribute less are expected to get lower scores, as they will know less about the project. If the contribution from an individual student is extremely limited, a very low mark (under 15% of the 50%) will be given for the individual report section.

8 Contact Information

If you have any questions or need further clarification regarding the project, please contact the module leader at [[email protected]].
