Assignment 2 -- Grammar Writing

Due date: 16 May, 2018, Midnight.

(the assignment is based on an assignment by Jason Eisner at JHU)

This assignment will help you understand how CFGs work and how they can be used -- sometimes comfortably, sometimes not -- to describe natural language. It will also make you think about some linguistic phenomena that are interesting in their own right.

You will submit a file in PDF format, called ANSWERS.pdf. It should contain your name and ID number. Whenever the assignment says "discuss..." or "describe..." it means that you should write a discussion or description in that file (indicate clearly which question you are answering).

Part 0 -- Generating Sentences from a CFG (1 point)

We provide you with a python program generate.py that generates sentences from a grammar.

Download the basic grammar file grammar. Look at it and its format. Now, generate sentences from it using the generate.py program:

python generate.py grammar

(generate takes as argument a grammar file, and generates a sentence from it.)

The format of the grammar file is as follows:

    # A fragment of the grammar to illustrate the format.
    1 ROOT S .
    1 S NP VP
    1 NP Det Noun # There are multiple rules for NP.
    1 NP NP PP
    1 Noun president
    1 Noun chief of staff

corresponding to the rules

    ROOT -> S .
    S -> NP VP
    NP -> Det Noun
    NP -> NP PP
    Noun -> president
    Noun -> chief of staff

Notice that a line consists of three parts:

a number (ignore this for now)
a nonterminal symbol, called the "left-hand side" (LHS) of the rule
a sequence of zero or more terminal and nonterminal symbols, which is called the "right-hand side" (RHS) of the rule
A hash sign (#) indicates a comment, and everything after it is ignored.

We take ROOT to be the root of the grammar -- every generation begins with ROOT.

Look at the generate.py program code and understand it. Extend the program so that it accepts and optional argument (indicated by -n) that specifies the number of sentences to generate. For example, the invocation: python generate.py grammar -n 5 will generate 5 sentences, python generate.py grammar -n 2 will generate two sentences, and python generate.py grammar will generate one sentence, as it does now.

Each sentence should be printed on its own line.

(the list sys.argv includes the commandline arguments of a python program. You can either take care of the commandline processing logic yourself, or use a module such as docopt.)

Part 1 -- Weights (9 points)

a. Why does the program generate so many long sentences? Specifically, what grammar rule is responsible and why? What is special about this rule? discuss.

b. The grammar allows multiple adjectives, as in the fine perplexed pickle. Why do the generated sentences do this so rarely? discuss.

c. The grammar format allows specifying different weights to different rules. For example, in the grammar:

    3 NP A B
    1 NP C D E
    1 NP F
    2 X NP NP

The tree NP rules have relative odds of 3:1:1, so when generating an NP, the generator will pick the first rules 3/5 of the times, the second one 1/5 of the times, and the third one 1/5 of the times.

Which numbers must you modify to fix the problems in (a) and (b), making the sentences shorter and the adjectives more frequent? Verify your answer by generating from the grammar. Discuss your solution (which rules and why).

d. What other numeric adjustments can you make to the grammar in order to favor more natural sets of sentences? Experiment. Hand in your grammar file (including the solution to (c) above, in a file named grammar1, with comments that motivate your changes, together with 10 sentences generated by the grammar, in a file called grammar1.gen.

Part 2 -- Extending the Grammar (30)

Modify the grammar so it can also generate the types of phenomena illustrated in the following sentences. You want to end up with a single grammar that can generate all of the following sentences as well as grammatically similar sentences.

(a) Sally ate a sandwich .
(b) Sally and the president wanted and ate a sandwich .
(c) the president sighed .
(d) the president thought that a sandwich sighed .
(e) it perplexed the president that a sandwich ate Sally .
(f) the very very very perplexed president ate a sandwich .
(g) the president worked on every proposal on the desk .
(h) Sally is lazy .
(i) Sally is eating a sandwich.
(j) the president thought that sally is a sandwich .

While your new grammar may generate some very silly sentences, it should not generate any that are obviously ungrammatical. For example, your grammar must be able to generate (d) but not:

*the president thought that a sandwich sighed a pickle .

since that is not okay English. The symbol * is traditionally used to mark "not okay" language.

Again, while the sentences should be okay structurally, they don't need to really make sense. You don't need to distinguish between classes of nouns that can eat, want, or think and those that can't.

An important part of the problem is to generalize from the sentences above. For example, (b) is an invitation to think through the ways that conjunctions ("and", "or") can be used in English. (g) is an invitation to think about prepositional phrases ("on the desk", "over the rainbow", "of the United States") and how they can be used.

Briefly discuss your modifications to the grammar. Hand in the new grammar (commented) as a file named grammar2 and about 20 random sentences that illustrate your modifications in a file named grammar2.gen.

Note that handling (b) and (h)/(i) can interact in a bad way, to create an ungrammatical sentence. You do not need to solve this case in this part, but you do need to discuss it and explain what the problem is with an example and a short explanation of what's happening there.

Note: The grammar file allows comments and whitespace because the grammar is really a kind of specialized programming language for describing sentences. Throughout this assignment, you should strive for the same level of elegance, generality, and documentation when writing grammars as when writing programs.

How your grammars will be tested: We want to see that your grammar can produce sentences with structures as we specified, generalize to similar structures, but not produce bad sentences.

We will run a parser that uses your grammar to try and parse a set of sentences. Your parser should be able to parse the correct ones, and not be able to parse the incorrect ones. (Recall, that if you can parse something with a grammar it means that it can be generated, and vice-versa).

Hint: If you want to see the effect of various rules while developing your grammar, you can give them a very high weight so that they will trigger often.

Hint: Imlementing Part 3 can also help you debug your grammar.

Part 3 -- Tree Structures (10)

Modify the generate.py program so that it gets an optional commandline swith -t. When -t is present, the program should not generate only sentences, but also the tree structures that generated the sentences. For example, when invoked as: python generate.py grammar -t instead of just printing the floor kissed the delicious chief of staff . it should print the more elaborate version

    (ROOT (S (NP (Det the)
                 (Noun floor))
             (VP (Verb kissed)
                 (NP (Det the)
                     (Noun (Adj delicious)
                           (Noun chief
                                 of
                                 staff)))))
           .)

which includes extra information showing how the sentence was generated. For example, the above derivation used the rules Noun -> floor and Noun -> Adj Noun, among others.

Note: It's not too hard to print the pretty indented format above. But it's not necessary. It's sufficient if you will the entire expression on a single line:

(ROOT (S (NP (Det the) (Noun floor)) (VP (Verb kissed) (NP (Det the) (Noun (Adj delicious) (Noun chief of staff))))) .)

The bracketed expression can be visualized in tree form using web-based visulizers like this and this and this, or java-based visualizers like this. This sort of output can be useful when debugging your grammar -- understanding which rules are responsible for what structures.

Hint: The changes to the original code are not big. You don't have to represent a tree in memory, so long as the string you print has the parentheses and nonterminals in the right places.

Your code should support both the -n switch from part 0, and the -t switch, either alone or together.

Generate 5 more random sentences, in tree format. Submit them as in a file called part3.gen as well as the code for the modified generate.py program.

Part 4 -- Additional Lingusitic Structures (40 points, 20 each)

Think about all of the following phenomena, and extend your grammar from part 2 to handle ANY TWO of them -- your choice. Briefly discuss your solutions and provide sample output.

Be sure you can handle the particular examples suggested, which means among other things your grammar must include the words in those examples.

You should also generalize appropriately beyond these examples. As always, try to be elegant in your grammar design, but you will find that these phenomena are somewhat harder to handle elegantly with CFG notation.

Important: Your final grammar should handle everything from part 2, plus both of the phenomena you chose to add. This means you have to worry about how your rules might interact with one another. Good interactions will elegantly use the same rule to help describe two phenomena. Bad interactions will allow your program to generate ungrammatical sentences, which will hurt your grade.

(a) "a" vs. "an". Add some vocabulary words that start with vowels, and fix your grammar so that it uses "a" or "an" as appropriate (e.g., an apple vs. a president). This is harder than you might think: how about a very ambivalent apple?
(b) Yes-no questions. Examples:
- did Sally eat a sandwich ?
- will Sally eat a sandwich ?
Of course, don't limit yourself to these simple sentences. Also consider how to make yes-no questions out of the statements in part 2.
(c) Relative clauses. Examples:
- the pickle kissed the president that ate the sandwich .
- the pickle kissed the sandwich that the president ate .
- the pickle kissed the sandwich that the president thought that Sally ate .
Of course, your grammar should also be able to handle relative-clause versions of more complicated sentences, like those in part 2.

Hint: These sentences have something in common with (d).
(d) WH-word questions. If you also did (b), handle questions like
- what did the president think ?
- what did the president think that Sally ate ?
- what did Sally eat the sandwich with ?
- who ate the sandwich ?
- where did Sally eat the sandwich ?
If you didn't also do (b), you are allowed to make your life easier by instead handling "I wonder" sentences with so-called "embedded questions":
- I wonder what the president thought .
- I wonder what the president thought that Sally ate .
- I wonder what Sally ate the sandwich with .
- I wonder who ate the sandwich .
- I wonder where Sally ate the sandwich .
Of course, your grammar should be able to generate wh-word questions or embedded questions that correspond to other sentences.

Hint: All these sentences have something in common with (c).
(e) Singular vs. plural agreement. For this, you will need to use a present-tense verb since past tense verbs in English do not show agreement.

Examples:
- the citizens choose the president .
- the president chooses the chief of staff .
- the president and the chief of staff choose the sandwich .
(You may not choose both this question and question (a), as the solutions are somewhat similar.)

(f) Tenses. For example,

the president has been eating a sandwich .

Here you should try to find a reasonably elegant way of generating all the following 12 tenses:

                  | present         | past             | future                |
 ---------------- | --------------- | ---------------- | --------------------- |
 simple           | eats            | ate              | will eat              |
 perfect          | has eaten       | had eaten        | will have eaten       |
 progressive      | is eating       | was eating       | will be eating        |
 perfect + progr. | has been eating | had been eating  | will have been eating |

(g) Appositives. Examples:
- the president perplexed Sally , the fine chief of staff .
- Sally , the chief of staff , 59 years old , who ate a sandwich , kissed the floor .
The tricky part of this one is to get the punctuation marks right. For the appositives themselves, you can rely on some canned rules like Appos -> 59 years old. although if you also did (c), try to extend your rules from that problem to automatically generate a range of appositives such as who ate a sandwich and which the president ate.

Hand in your grammar (commented) as a file named grammar4. Be sure to indicate clearly in a comment at the top of the file which TWO of the above phenomena it handles. Provide a brief description of your solution.

How to submit:

Your submission should include a single zip file with the files ANSWERS.pdf, generate.py, grammar1, grammar1.gen, grammar2, grammar2.gen, part3.gen, grammar4.

Make sure your ANSWERS.pdf file includes your names and IDs.