首页 » 计算机科学(Computing Science) » COMP10002 Foundations of Algorithms

COMP10002 Foundations of Algorithms

2025-05-22 Admin 写评论

Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due

Semester 1, 2025

Assignment 2

Due: 4pm Tuesday 27 May 2025

Version 1.0

1 Learning Outcomes

In this assignment you will demonstrate your understanding of structures, linked data structures, and algorithm time complexity analysis. You will further extend your skills in program design and implementation.

2 The Story...

This assignment will demystify the magic of ChatGPT (a.k.a. large language models, or LLM).

Figure 1: An interaction with ChatGPT

In the past couple of years, LLMs like ChatGPT have taken the world by storm, showcasing an incredible ability to generate human-like text, translate languages, answer complex questions, and even to draft this introductory paragraph for the assignment. At the heart of their intelligence lies a blend of powerful models (so-called language models) and clever algorithms that help them decide what to say next given a context. Figure 1 shows an example interaction with ChatGPT. While it appears that ChatGPT was able to “an swer” the question ‘The answer to life, universe, and everything is’ given by a human user, what really happened was that the language model behind ChatGPT predicted the most likely next word given the user input sentence to be ‘42’.

Researchers have spent decades of efforts on developing more and more powerful language models, to reach the large language models that we are now using today. In this assignment, we won’t drill into language models, and you don’t need to be an expert of AI to complete the assignment, as these are beyond the scope of this subject. Instead, we will study how a sentence is formed based on the probabilities of next words predicted by language models.

3 Input Data

Below is a sample input to help illustrate the tasks of the assignment. Read on to see what each value means in this input data.

<end> 0.0370

<start> 0.0370

a 0.1083

and 0.1082

answer 0.0053

away 0.0163

barked 0.0145

dog 0.0216

everything 0.0062

forty 0.0029

in 0.0938

is 0.1047

life 0.0071

loudly 0.0092

on 0.0722

ran 0.0198

the 0.1805

to 0.1444

two 0.0044

universe 0.0066

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

0.0 0.0 0.2 0.0 0.1 0.0 0.0 0.0 0.2 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.4 0.0 0.0 0.0

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.0 0.0 0.0 0.0 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7 0.0 0.0 0.0 0.0 0.0 0.0 0.3 0.0 0.0 0.0 0.0

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

0.0 0.0 0.0 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.8 0.0 0.0 0.0 0.0 0.0 0.0

0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.0

0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.2 0.0 0.0 0.0 0.3 0.0 0.0 0.0 0.4 0.0 0.0 0.0

0.0 0.0 0.1 0.0 0.0 0.1 0.0 0.0 0.0 0.5 0.1 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.1 0.0

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.8

1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

0.0 0.0 0.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.2 0.0 0.0

0.0 0.0 0.0 0.0 0.0 0.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.4 0.0 0.0 0.0 0.0 0.0

0.0 0.0 0.0 0.0 0.6 0.0 0.0 0.2 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.1 0.0

0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.4 0.0 0.0 0.0 0.5 0.0 0.0 0.0

1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

0.0 0.0 0.0 0.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

There are three sections in the input data:

1. An integer n (3 ≤ n ≤ 50), n = 20 in the sample above. It represents the number of different words that a language model can generate, that is, the vocabulary of the language model.

2. A total of n lines of word records representing the vocabulary of the language model. Each line repre sents a word, which contains an unique English word and a real number representing the probability of the word (e.g., from counting the word frequency of a collection of web documents – you don’t need to worry about where these numbers came from), separated by a white space.

There are two special “words” <start> (can think of it as “to capitalise” the first word in a sentence) and <end> (can think of it as period ‘.’), representing the start and the end of a sentence, respectively. These two special words are included in all input data, although their exact probabilities may vary across different input. Except for these two special words, you may assume that each English word contains at least 1 and up to 20 lower case letters.

The words in the list have been sorted in increasing order alphabetically (that is, in dictionary order), where <end> and <start> are always word 0, and word 1 in the list, respectively.

3. An n x n matrix of transition probabilities which are non-negative real numbers. The number at row i, column j (0 ≤ i, j < n) represents the transition probability from word i (of the input list of words above, counting from 0; same below) to word j, that is, the probability of word j appearing right after word i in a sentence. Such transition probabilities again can be calculated (in practice, “learned” by an advanced machine learning model) from a collection of web documents.

In the sample input, the first 0.2 in row 1 (column 2) represents the transition probability from <start> (word 1) to “a” (word 2) – this means there is a 20% chance that a sentence starts with word “a’. Similarly, the 0.6 at row 2, column 7 means that, when “a” (word 2) appears in a sentence, there is a 60% chance that the next word in the sentence is “dog” (word 7).

Note: The transition probabilities in each row sum up to 1 (think about why), except for those in row 0 (representing transition probabilities from <end> to each word) which are all 0’s – we include this row in the input regardless, for ease of discussion. Similarly, all transition probabilities in column 1 are 0’s.

You may assume that the test data always follows the format above. No input validity checking is needed.

4 Your Task

You will be given a skeleton code file named program.c with incomplete functions. You need to add code to all the functions including the main function for the following tasks.

4.1 Stage 1: Read the Words and Their Probabilities (5 Marks)

Your first task is to add code to the stage_one function to read the list of words and probabilities. Create a struct named word_rec_t to represent a word record, and an array of this struct type to store all words.

Given the word probabilities, a naive approach to generate texts is just to generate a sentence formed by the words of the highest probabilities. Your next step is to identify the top-10 words (excluding the specialwords <start> and <end>) with the highest probabilities. Then, output a “sentence” starting with <start>, followed by the top-10 words identified (by descending order of their probabilities), and ending with <end>. If the input has fewer than 10 words, your program output should include all words from the input.

For simplicity, you may assume that the word probabilities are all unique, except for those of <start> and <end> which do not need to be examined. You may create a copy of the array of words for sorting, since the subsequent tasks rely on the list of words in their original dictionary order.

The output for this stage given the above sample input is as follows (where “mac:” is the command prompt).

mac: ./program < test0.txt

Stage 1

==========

Like in Assignment 1, we will again use input redirection to feed test data into your program. Thus, you should still use the standard input functions such as scanf or getchar to read the data. You do not need to (and should not) use any file operation functions such as fopen or fread. Your program should not print anything except for the data requested to be output (as shown in the output example).

Hint: Note the newline character ‘\n’ at the end of a line if you use getchar to read the input. To sort the words, your may adapt the insertion sort code given on Ed (https://edstem.org/au/courses/19957/ lessons/72882/slides/487112), or use the qsort function given in stdlib.h (see https://people.eng. unimelb.edu.au/ammoffat/ppsaa/c/callqsort.c for an example).

You can also modify the stage_one function to read all input in one go. You can (and should) create further functions to complete the tasks when opportunities arise.

4.2 Stage 2: Read the Transition Probabilities (6 Marks)

Add code to the stage_two function to read the matrix of transition probabilities. For each word i, except for <end>, output its most likely next word, that is, the word corresponding to the largest transition probabilities in row i. When there is a tie, the word corresponding to a smaller column number should be outputted.

The output for this stage given the above sample input should be:

Stage 2

==========

<start> -> the

a -> dog

and -> everything

answer -> to

away -> <end>

barked -> loudly

dog -> barked

everything -> is

forty -> two

in -> the

is -> forty

life -> universe

loudly -> <end>

on -> the

ran -> away

the -> answer

to -> the

two -> <end>

universe -> and

Here, for word 1, that is, <start>, we examine row 1 of the transition probability matrix. The largest transition probability in this row is 0.4 (at column 16, counting from column 0), which corresponds to word 16 (“the”) in the input list of words. Thus, the most likely next word of <start> is “the”, outputted in the form of <start> -> the.

Hint: You should use double type to store the transition probabilities for best accuracy. Consider modifyingthe word_record_t structure to store the most likely next word for each word. This will simplify Stage 3.

4.3 Stage 3: Generate a Sentence with the Transition Probabilities (6 Marks)

Using the transition probabilities, we will be able to generate a sentence that reads more natural. Add code to the stage_three function to create a linked list to store the sentence to be generated, where each node represents a word in the sentence. You should adapt the linked list code given in the skeleton code file program.c for the assignment.

The linked list should start with a node representing the special word <start>. Then, your function should iteratively add more words (that is, nodes) to the end of the linked list. In each iteration, the most likely next word (according to the transition probabilities read in Stage 2) of the current last word in the linked list should be added to the linked list. Following the sample input above:

In the first iteration, “the” is added, as it is the most likely next word of <start> (the current last word in the linked list). We now have a partial sentence: <start> the.
In the next iteration, “answer” is added, as it is the most likely next word of “the” (the current last word in the linked list). The partial sentence becomes: <start> the answer.
This process continues until <end> is added to the sentence, or if 10 words (excluding <start> and <end>) have been added without reaching <end>. In the latter case, <end> should also be added to complete the sentence.

Your program should output the sentence after the linked list is constructed. See a sample output next.

Stage 3

==========

<start> the answer to the answer to the answer to the <end>

Yes this stage can be done without using a linked list. However, we’d like you to practise using it. If youcomplete the stage without a linked list, a 2-mark penalty applies.

4.4 Stage 4 (For a Challenge): Generate a Sentence with the Transition Probabilities Using an Advanced Algorithm (3 Marks)

In this stage, we expand our sentence generation process to consider more sentences than just the one formed by the most likely next words. Add code to the stage_four function to implement the algorithm as follows.

1. You will need to define a struct type named sent_t to store a partial sentence generated. The struct type should have at least the following components (you may choose to add more):

sentence: an array to store a partial sentence;
last: an int to store the index (in the input list of words) of the last word of the partial sentence;
probability: a double to store the sentence probability, which is calculated by multiplying the transition probability between every two adjacent words in the sentence.

For example, suppose sentence = “<start> the answer to”. Then, last = 17, as the last word in sentence is “to”, which is word 17 in the input list of words. This variable will help us access the transition probabilities of the next words to be added to the sentence, that is, row 17 of the transition matrix. Variable probability = 0.4 × 0.6 × 1.0 = 0.24. Here, 0.4, 0.6, and 1,0 are the transition probabilities from <start> to “the”, “the” to “answer”, and “answer” to “to”, respectively.

2. Create two arrays of the sent_t type. Let us name these two arrays sents and new_sents. Array sents has a size of 2 – we only keep the top-2 best partial sentences generated at each iteration (detailed next), to reduce the computational costs. Array new_sents has a size of 2n – recall that n is the number of words in the input list. This array stores intermediate results in the generation process.

3. At start, sents has only one partial sentence (a sent_t record), where sentence = “<start>”, last = 1, and probability = 1.0 (so we can calculate products later), while new_sents is empty. We iteratively expand each partial sentence record in sents as follows.

4. At each iteration, for each partial sentence in sents:

(a) If last is 0 (that is, <end>), we just add the partial sentence record to new_sents.

(b) Otherwise, for each next word of word last with a non-zero transition probability, we create a new partial sentence record by adding the next word to the end of the partial sentence, and weadd the record to new_sents.

For example, given sentence = “<start> the answer to”, word last is “to”. There are three next words of “to”: “answer”, “life”, and “the” with transition probabilities of 0.1, 0.4, and 0.5, respectively. We create three partial sentence records for “<start> the answer to answer”, “<start> the answer to life”, and “<start> the answer to the”, with probabilities 0.24× 0.1 = 0.024, 0.24 × 0.4 = 0.096, and 0.24 × 0.5 = 0.12, respectively.

5. After all partial sentences in sents are processed, we take the two records in new_sents with the largest sentence probabilities and copy them back to sents, that is, these two are the partial sentences to be used for the next iteration. If there is a tie, the record added to new_sents earlier should be used. If there is only one record in new_sents, just copy the record to sent[0]. The algorithm then empties new_sents and goes back to Step 4. See Figue 2 next page for a running example of the algorithm. The process stops when both sentences in sents end with <end>, or 10 iterations have been completed.

The output of this stage is the sentence in sents[0] when the algorithm stops. If the sentence does not contain <end>, your program should output an <end> at the end.

Given the sample input above, the output of this stage is as follows.

Stage 4

==========

<start> the answer to life universe and everything is forty two <end>

At the end of your submission file, you need to add a comment that states the worst-case time complexity to run your Stage 4 algorithm, assuming N words in the vocabulary, maximum L words in the final output sentence, and top-K partial sentences to be kept after each iteration (these are considered variables and not constants; they cannot be omitted in big-O terms). For simplicity, you may assume that assigning the value of a sent_t variable takes O(1) time.

Figure 2: An example run of the Stage 4 algorithm: numbers on arrow are transition probabilities; numbers in parenthesis are sentence probabilities; words in red form the top-2 candidate sentences after each iteration.

5 Submission and Assessment

This assignment is worth 20% of the final mark. A detailed marking scheme has been included in the assignment package.

Submitting your code. You should put all your code for the assignment into a single file named program.c. To submit your code, you will need to: (1) Log in to LMS subject site, (2) Navigate to “A2” in the “Assignments” page, (3) Click on “Load A2 in a new window”, and (4) follow the instructions on the Gradescope “A2” page and click on the “Submit” link to make a submission. You can submit as many times as you want to. Only the last submission made before the deadline will be marked. Submissions made after the deadline will be marked with late penalties as detailed at the end of this document. Do not submit after the deadline unless a late submission is intended. Two hidden tests will be run for marking purposes. Results of these tests will be released after the marking is done.

You can (and should) submit both early and often – to check that your program compiles correctly on our test system, which may have some different characteristics to your own machines.

Testing on your own computer. You will be given a sample test file test0.txt and the sample output test0-output.txt. You can test your code on your own machine with the following command and compare the output with test0-output.txt:

mac: ./program < test0.txt /* Here ‘<’ feeds the data from test0.txt into program */

Note that we are using the following command to compile your code on the submission testing system (we name the source code file program.c).

gcc -Wall -Wextra -Werror -Wno-newline-eof -Wno-unused-parameter -std=c17 -o program program.c -lm

The flag “-std=c17” enables the compiler to use a modern standard of the C language – C17. To ensure that your submission works properly on the submission system, you should use this command to compile your code on your local machine as well.

You may discuss your work with others, but what gets typed into your program must be individual work,not from anyone else.

Do not give (hard or soft) copies of your work to anyone else.
Do not “lend” your memory stick to others.
Do not leave your laptop unlocked in the lab or library.
Do not upload your assignment solutions onto Github or any other public code repositories.
Do not ask others to give you their programs “just so that I can take a look and get some ideas, I won’t copy, honest”.

The best way to help your friends in this regard is to say a very firm “no” when they ask for a copy of, or to see, your program, pointing out that your “no”, and their acceptance of that decision, is the only thing that will preserve your friendship. A sophisticated program that undertakes deep structural analysis of C code identifying regions of similarity will be run over all submissions in “compare every pair” mode. See https://academichonesty.unimelb.edu.au for more information.

Deadline: Programs not submitted by 4pm Tuesday 27 May 2025 will lose penalty marks at the rate of 3 marks per day or part day late. Late submissions after 4pm Friday 30 May 2025 will not be accepted.

6 Extension and Special Consideration

1. 1-day extension: complete this form (https://forms.office.com/r/NPXTGiDK3V) by 4:00 pm Wednes day 28 May 2025 with your name and student ID, and write “Algorithms are fun” in the form. An extension of 24 hours will be approved automatically. Note: This is a special offer of our subject to simplify the extension process for submissions that just miss the deadline slightly.

2. Extension between 2 and 3 working days: You must follow the steps below and submit your appli cation prior to the assessment deadline.

Complete the online declaration form on the FEIT Extensions and Special Consideration page (see link below).
Receive outcome or response from Jianzhong.

3. Extension between 4 and 10 working days: You must have appropriate supporting documentation and submit via the Special consideration portal1 :

Check your eligibility.
Prepare your supporting documentation.
Submit an application.
Receive an outcome (via email)

4. Special consideration and extensions of more than 10 working days: Applications for all other adjustment requests should be submitted via the Special consideration portal.

Applications must be submitted within four working days of the assessment due date or exam date. You can apply before the assessment due date if you know it will be impacted.

Check your eligibility.
Prepare your supporting documentation.
Submit an application.
Receive an outcome (via email).

You can still apply if you miss the deadline, but you will be prompted to include a late reason and need to provide documentation explaining what impacted your ability to apply on time.

Full instructions can be found in the Extensions and Special Consideration page of the Faculty of Engineering and Information Technology:

https://eng.unimelb.edu.au/students/coursework/study-resources/extensions-and-special-consideration.

And remember, Algorithms are fun!

发表评论

电子邮件地址不会被公开。必填项已用*标注

姓名 *

电子邮件 *

验证码 *