nlp 2018 course

Assignment 2 -- Grammar Writing

Due date: 16 May, 2018, Midnight.

(the assignment is based on an assignment by Jason Eisner at JHU)

This assignment will help you understand how CFGs work and how they can be used -- sometimes comfortably, sometimes not -- to describe natural language. It will also make you think about some linguistic phenomena that are interesting in their own right.

You will submit a file in PDF format, called ANSWERS.pdf. It should contain your name and ID number. Whenever the assignment says "discuss..." or "describe..." it means that you should write a discussion or description in that file (indicate clearly which question you are answering).

Part 0 -- Generating Sentences from a CFG (1 point)

We provide you with a python program generate.py that generates sentences from a grammar.

Download the basic grammar file grammar. Look at it and its format. Now, generate sentences from it using the generate.py program:

python generate.py grammar

(generate takes as argument a grammar file, and generates a sentence from it.)

The format of the grammar file is as follows:

    # A fragment of the grammar to illustrate the format.
    1 ROOT S .
    1 S NP VP
    1 NP Det Noun # There are multiple rules for NP.
    1 NP NP PP
    1 Noun president
    1 Noun chief of staff
    

corresponding to the rules

    ROOT -> S .
    S -> NP VP
    NP -> Det Noun
    NP -> NP PP
    Noun -> president
    Noun -> chief of staff

Notice that a line consists of three parts:

We take ROOT to be the root of the grammar -- every generation begins with ROOT.

Look at the generate.py program code and understand it. Extend the program so that it accepts and optional argument (indicated by -n) that specifies the number of sentences to generate. For example, the invocation: python generate.py grammar -n 5 will generate 5 sentences, python generate.py grammar -n 2 will generate two sentences, and python generate.py grammar will generate one sentence, as it does now.

Each sentence should be printed on its own line.

(the list sys.argv includes the commandline arguments of a python program. You can either take care of the commandline processing logic yourself, or use a module such as docopt.)

Part 1 -- Weights (9 points)

a. Why does the program generate so many long sentences? Specifically, what grammar rule is responsible and why? What is special about this rule? discuss.

b. The grammar allows multiple adjectives, as in the fine perplexed pickle. Why do the generated sentences do this so rarely? discuss.

c. The grammar format allows specifying different weights to different rules. For example, in the grammar:

    3 NP A B
    1 NP C D E
    1 NP F
    2 X NP NP

The tree NP rules have relative odds of 3:1:1, so when generating an NP, the generator will pick the first rules 3/5 of the times, the second one 1/5 of the times, and the third one 1/5 of the times.

Which numbers must you modify to fix the problems in (a) and (b), making the sentences shorter and the adjectives more frequent? Verify your answer by generating from the grammar. Discuss your solution (which rules and why).

d. What other numeric adjustments can you make to the grammar in order to favor more natural sets of sentences? Experiment. Hand in your grammar file (including the solution to (c) above, in a file named grammar1, with comments that motivate your changes, together with 10 sentences generated by the grammar, in a file called grammar1.gen.

Part 2 -- Extending the Grammar (30)

Modify the grammar so it can also generate the types of phenomena illustrated in the following sentences. You want to end up with a single grammar that can generate all of the following sentences as well as grammatically similar sentences.

While your new grammar may generate some very silly sentences, it should not generate any that are obviously ungrammatical. For example, your grammar must be able to generate (d) but not:

*the president thought that a sandwich sighed a pickle .

since that is not okay English. The symbol * is traditionally used to mark "not okay" language.

Again, while the sentences should be okay structurally, they don't need to really make sense. You don't need to distinguish between classes of nouns that can eat, want, or think and those that can't.

An important part of the problem is to generalize from the sentences above. For example, (b) is an invitation to think through the ways that conjunctions ("and", "or") can be used in English. (g) is an invitation to think about prepositional phrases ("on the desk", "over the rainbow", "of the United States") and how they can be used.

Briefly discuss your modifications to the grammar. Hand in the new grammar (commented) as a file named grammar2 and about 20 random sentences that illustrate your modifications in a file named grammar2.gen.

Note that handling (b) and (h)/(i) can interact in a bad way, to create an ungrammatical sentence. You do not need to solve this case in this part, but you do need to discuss it and explain what the problem is with an example and a short explanation of what's happening there.

Note: The grammar file allows comments and whitespace because the grammar is really a kind of specialized programming language for describing sentences. Throughout this assignment, you should strive for the same level of elegance, generality, and documentation when writing grammars as when writing programs.

How your grammars will be tested: We want to see that your grammar can produce sentences with structures as we specified, generalize to similar structures, but not produce bad sentences.

We will run a parser that uses your grammar to try and parse a set of sentences. Your parser should be able to parse the correct ones, and not be able to parse the incorrect ones. (Recall, that if you can parse something with a grammar it means that it can be generated, and vice-versa).

Hint: If you want to see the effect of various rules while developing your grammar, you can give them a very high weight so that they will trigger often.

Hint: Imlementing Part 3 can also help you debug your grammar.

Part 3 -- Tree Structures (10)

Modify the generate.py program so that it gets an optional commandline swith -t. When -t is present, the program should not generate only sentences, but also the tree structures that generated the sentences. For example, when invoked as: python generate.py grammar -t instead of just printing the floor kissed the delicious chief of staff . it should print the more elaborate version

    (ROOT (S (NP (Det the)
                 (Noun floor))
             (VP (Verb kissed)
                 (NP (Det the)
                     (Noun (Adj delicious)
                           (Noun chief
                                 of
                                 staff)))))
           .)

which includes extra information showing how the sentence was generated. For example, the above derivation used the rules Noun -> floor and Noun -> Adj Noun, among others.

Note: It's not too hard to print the pretty indented format above. But it's not necessary. It's sufficient if you will the entire expression on a single line:

(ROOT (S (NP (Det the) (Noun floor)) (VP (Verb kissed) (NP (Det the) (Noun (Adj delicious) (Noun chief of staff))))) .)

The bracketed expression can be visualized in tree form using web-based visulizers like this and this and this, or java-based visualizers like this. This sort of output can be useful when debugging your grammar -- understanding which rules are responsible for what structures.

Hint: The changes to the original code are not big. You don't have to represent a tree in memory, so long as the string you print has the parentheses and nonterminals in the right places.

Your code should support both the -n switch from part 0, and the -t switch, either alone or together.

Generate 5 more random sentences, in tree format. Submit them as in a file called part3.gen as well as the code for the modified generate.py program.

Part 4 -- Additional Lingusitic Structures (40 points, 20 each)

Think about all of the following phenomena, and extend your grammar from part 2 to handle ANY TWO of them -- your choice. Briefly discuss your solutions and provide sample output.

Be sure you can handle the particular examples suggested, which means among other things your grammar must include the words in those examples.

You should also generalize appropriately beyond these examples. As always, try to be elegant in your grammar design, but you will find that these phenomena are somewhat harder to handle elegantly with CFG notation.

Important: Your final grammar should handle everything from part 2, plus both of the phenomena you chose to add. This means you have to worry about how your rules might interact with one another. Good interactions will elegantly use the same rule to help describe two phenomena. Bad interactions will allow your program to generate ungrammatical sentences, which will hurt your grade.

Hand in your grammar (commented) as a file named grammar4. Be sure to indicate clearly in a comment at the top of the file which TWO of the above phenomena it handles. Provide a brief description of your solution.

How to submit:

Your submission should include a single zip file with the files ANSWERS.pdf, generate.py, grammar1, grammar1.gen, grammar2, grammar2.gen, part3.gen, grammar4.

Make sure your ANSWERS.pdf file includes your names and IDs.