(the assignment is based on an assignment by Jason Eisner at JHU)
This assignment will help you understand how CFGs work and how they can be used -- sometimes comfortably, sometimes not -- to describe natural language. It will also make you think about some linguistic phenomena that are interesting in their own right.
You will submit a file in PDF format, called ANSWERS.pdf
. It should contain your name and ID number. Whenever the assignment says "discuss..." or "describe..." it means that you should write a discussion or description in that file (indicate clearly which question you are answering).
We provide you with a python program generate.py that generates sentences from a grammar.
Download the basic grammar file grammar.
Look at it and its format.
Now, generate sentences from it using the generate.py
program:
python generate.py grammar
(generate takes as argument a grammar file, and generates a sentence from it.)
The format of the grammar file is as follows:
# A fragment of the grammar to illustrate the format. 1 ROOT S . 1 S NP VP 1 NP Det Noun # There are multiple rules for NP. 1 NP NP PP 1 Noun president 1 Noun chief of staff
corresponding to the rules
ROOT -> S . S -> NP VP NP -> Det Noun NP -> NP PP Noun -> president Noun -> chief of staff
Notice that a line consists of three parts:
#
) indicates a comment, and everything after it is ignored.We take ROOT
to be the root of the grammar -- every generation begins with ROOT
.
Look at the generate.py
program code and understand it. Extend the program so that
it accepts and optional argument (indicated by -n
) that specifies the number of sentences to generate. For example, the invocation:
python generate.py grammar -n 5
will generate 5 sentences, python generate.py grammar -n 2
will generate two sentences, and python generate.py grammar
will generate one sentence, as it does now.
Each sentence should be printed on its own line.
(the list sys.argv
includes the commandline arguments of a python program. You can either take care of the commandline processing logic yourself, or use a module such as docopt.)
a. Why does the program generate so many long sentences? Specifically, what grammar rule is responsible and why? What is special about this rule? discuss.
b. The grammar allows multiple adjectives, as in the fine perplexed pickle
. Why do the
generated sentences do this so rarely? discuss.
c. The grammar format allows specifying different weights to different rules. For example, in the grammar:
3 NP A B 1 NP C D E 1 NP F 2 X NP NP
The tree NP rules have relative odds of 3:1:1, so when generating an NP, the generator will pick the first rules 3/5 of the times, the second one 1/5 of the times, and the third one 1/5 of the times.
Which numbers must you modify to fix the problems in (a) and (b), making the sentences shorter and the adjectives more frequent? Verify your answer by generating from the grammar. Discuss your solution (which rules and why).
d. What other numeric adjustments can you make to the grammar in order to favor more natural
sets of sentences? Experiment. Hand in your grammar file (including the solution to (c) above,
in a file named grammar1
, with comments that motivate your changes, together with 10 sentences generated by the grammar, in a file called grammar1.gen
.
Modify the grammar so it can also generate the types of phenomena illustrated in the following sentences. You want to end up with a single grammar that can generate all of the following sentences as well as grammatically similar sentences.
While your new grammar may generate some very silly sentences, it should not generate any that are obviously ungrammatical. For example, your grammar must be able to generate (d) but not:
*the president thought that a sandwich sighed a pickle .
since that is not okay English. The symbol *
is traditionally used to mark "not okay" language.
Again, while the sentences should be okay structurally, they don't need to really make sense. You don't need to distinguish between classes of nouns that can eat, want, or think and those that can't.
An important part of the problem is to generalize from the sentences above. For example, (b) is an invitation to think through the ways that conjunctions ("and", "or") can be used in English. (g) is an invitation to think about prepositional phrases ("on the desk", "over the rainbow", "of the United States") and how they can be used.
Briefly discuss your modifications to the grammar.
Hand in the new grammar (commented) as a
file named grammar2
and about 20 random sentences that illustrate your modifications in a file named grammar2.gen
.
Note that handling (b) and (h)/(i) can interact in a bad way, to create an ungrammatical sentence. You do not need to solve this case in this part, but you do need to discuss it and explain what the problem is with an example and a short explanation of what's happening there.
Note: The grammar file allows comments and whitespace because the grammar is really a kind of specialized programming language for describing sentences. Throughout this assignment, you should strive for the same level of elegance, generality, and documentation when writing grammars as when writing programs.
How your grammars will be tested: We want to see that your grammar can produce sentences with structures as we specified, generalize to similar structures, but not produce bad sentences.
We will run a parser that uses your grammar to try and parse a set of sentences. Your parser should be able to parse the correct ones, and not be able to parse the incorrect ones. (Recall, that if you can parse something with a grammar it means that it can be generated, and vice-versa).
Hint: If you want to see the effect of various rules while developing your grammar, you can give them a very high weight so that they will trigger often.
Hint: Imlementing Part 3 can also help you debug your grammar.
Modify the generate.py
program so that it gets an optional commandline swith -t
. When -t
is present, the program should not generate only sentences, but also the tree structures that generated the sentences. For example, when invoked as:
python generate.py grammar -t
instead of just printing
the floor kissed the delicious chief of staff .
it should print the more elaborate version
(ROOT (S (NP (Det the) (Noun floor)) (VP (Verb kissed) (NP (Det the) (Noun (Adj delicious) (Noun chief of staff))))) .)
which includes extra information showing how the sentence was generated. For example, the above
derivation used the rules Noun -> floor
and Noun -> Adj Noun
, among others.
Note: It's not too hard to print the pretty indented format above. But it's not necessary. It's sufficient if you will the entire expression on a single line:
(ROOT (S (NP (Det the) (Noun floor)) (VP (Verb kissed) (NP (Det the) (Noun (Adj delicious) (Noun chief of staff))))) .)
The bracketed expression can be visualized in tree form using web-based visulizers like this and this and this, or java-based visualizers like this. This sort of output can be useful when debugging your grammar -- understanding which rules are responsible for what structures.
Hint: The changes to the original code are not big. You don't have to represent a tree in memory, so long as the string you print has the parentheses and nonterminals in the right places.
Your code should support both the -n
switch from part 0, and the -t
switch, either alone or together.
Generate 5 more random sentences, in tree format. Submit them as in a file called part3.gen
as well as the code for the modified generate.py
program.
Think about all of the following phenomena, and extend your grammar from part 2 to handle ANY TWO of them -- your choice. Briefly discuss your solutions and provide sample output.
Be sure you can handle the particular examples suggested, which means among other things your grammar must include the words in those examples.
You should also generalize appropriately beyond these examples. As always, try to be elegant in your grammar design, but you will find that these phenomena are somewhat harder to handle elegantly with CFG notation.
Important: Your final grammar should handle everything from part 2, plus both of the phenomena you chose to add. This means you have to worry about how your rules might interact with one another. Good interactions will elegantly use the same rule to help describe two phenomena. Bad interactions will allow your program to generate ungrammatical sentences, which will hurt your grade.
(a) "a" vs. "an". Add some vocabulary words that start with vowels, and fix your grammar so that
it uses "a" or "an" as appropriate (e.g., an apple
vs. a president
). This is harder than you
might think: how about a very ambivalent apple
?
(b) Yes-no questions. Examples:
Of course, don't limit yourself to these simple sentences. Also consider how to make yes-no questions out of the statements in part 2.
(c) Relative clauses. Examples:
Of course, your grammar should also be able to handle relative-clause versions of more complicated sentences, like those in part 2.
Hint: These sentences have something in common with (d).
(d) WH-word questions. If you also did (b), handle questions like
If you didn't also do (b), you are allowed to make your life easier by instead handling "I wonder" sentences with so-called "embedded questions":
Of course, your grammar should be able to generate wh-word questions or embedded questions that correspond to other sentences.
Hint: All these sentences have something in common with (c).
(e) Singular vs. plural agreement. For this, you will need to use a present-tense verb since past tense verbs in English do not show agreement.
Examples:
(You may not choose both this question and question (a), as the solutions are somewhat similar.)
(f) Tenses. For example,
Here you should try to find a reasonably elegant way of generating all the following 12 tenses:
| present | past | future | ---------------- | --------------- | ---------------- | --------------------- | simple | eats | ate | will eat | perfect | has eaten | had eaten | will have eaten | progressive | is eating | was eating | will be eating | perfect + progr. | has been eating | had been eating | will have been eating |
(g) Appositives. Examples:
The tricky part of this one is to get the punctuation marks right. For the appositives themselves, you can rely on some canned rules like Appos -> 59 years old
.
although if you also did (c), try to extend your rules from that problem to automatically generate a range of appositives such as who ate a sandwich
and which the president ate
.
grammar4
.
Be sure to indicate clearly in a comment at the top of the file which TWO of the above phenomena it handles. Provide a brief description of your solution.
Your submission should include a single zip
file with the files ANSWERS.pdf
, generate.py
, grammar1
, grammar1.gen
, grammar2
, grammar2.gen
, part3.gen
, grammar4
.
Make sure your ANSWERS.pdf
file includes your names and IDs.