Language syntax - Definite clause grammars

Introduction

Definite clause grammars (DCGs) are used to describe parsers for formal or natural languages. The following example, adapted from Wikipedia, parses a small subset of English sentences:

sentence --> noun_phrase, verb_phrase.
noun_phrase --> det, noun.
verb_phrase --> verb, noun_phrase.
det --> "the".
det --> "a".
noun --> "cat".
noun --> "bat".
verb --> "eats".

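The first rule can be read as "a sentence is a noun phrase followed by a verb phrase". DCG rules are translated into ordinary predicates by the Plang parser. Each predicate takes two arguments that together form a "difference list": the front is the input still to be parsed when the rule starts, and the end is the input left over once the rule has matched: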

sentence(L1, L3) { noun_phrase(L1, L2); verb_phrase(L2, L3); }
noun_phrase(L1, L3) { det(L1, L2); noun(L2, L3); }
verb_phrase(L1, L3) { verb(L1, L2); noun_phrase(L2, L3); }
det(["the"|L], L).
det(["a"|L], L).
noun(["cat"|L], L).
noun(["bat"|L], L).
verb(["eats"|L], L).

Once the rules have been compiled, the program can be queried for valid sentences:

sentence(["the", "cat", "eats", "a", "bat"], [])            succeeds
sentence(["the", "cat", "killed", "a", "bat"], [])          fails

The second sentence fails because "killed" is not listed as a valid verb.
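
Because the second argument is simply whatever input a rule leaves unconsumed, the translated predicates can also be applied to part of a sentence. The following sketch assumes the translated predicates listed above and reuses this page's query notation. The second argument is left unbound so that it receives the remaining words:

```plang
noun_phrase(["the", "cat", "eats", "a", "bat"], Rest)

Rest = ["eats", "a", "bat"]
```

Here det consumes "the", noun consumes "cat", and everything after that is handed back in Rest.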

DCGs and WordNet

The words module provides access to the WordNet lexical database from Princeton University. The database records extensive information about English words, including their parts of speech (noun, verb, adverb, or adjective). This information is useful when building natural language processing systems in Plang. The example above can be re-expressed as follows:

:- import(words).
sentence --> noun_phrase, verb_phrase.
noun_phrase --> det, words::noun.
verb_phrase --> words::verb, noun_phrase.
det --> "the".
det --> "a".

Now it is no longer necessary to explicitly list every noun and verb in the English language to parse useful sentences:

sentence(["the", "cat", "eats", "a", "bat"], [])            succeeds
sentence(["the", "cat", "killed", "a", "bat"], [])          succeeds
sentence(["the", "car", "killed", "a", "pedestrian"], [])   succeeds

See the documentation for the words module for more information on using WordNet with Plang.

Building parse trees

Usually an application wants to do more with a sentence than just check whether it is valid. It also wants to extract a parse tree that describes the major features of the sentence for further processing. DCG rules can be augmented with parameters to build such a parse tree:

:- import(words).
sentence(s(NP, VP)) --> noun_phrase(NP), verb_phrase(VP).
noun_phrase(np(D, N)) --> det(D), words::noun(N).
verb_phrase(vp(V, NP)) --> words::verb(V), noun_phrase(NP).
det(d("the")) --> "the".
det(d("a")) --> "a".

We can now generate parse trees with the augmented predicates:

sentence(Tree, ["the", "cat", "eats", "a", "bat"], [])

Tree = s(np(d("the"), n("cat")), vp("eats", np(d("a"), n("bat"))))
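
The query above suggests that rule parameters come first and that the two difference list arguments are appended after them. Assuming that ordering, and following the translation scheme shown in the introduction, the augmented rules would expand to something like the predicates below. This is a hand-written sketch, not actual parser output, and the three-argument forms of words::noun and words::verb are an assumption:

```plang
// Hand-written sketch of the expected expansion, not actual parser output.
sentence(s(NP, VP), L1, L3) { noun_phrase(NP, L1, L2); verb_phrase(VP, L2, L3); }
noun_phrase(np(D, N), L1, L3) { det(D, L1, L2); words::noun(N, L2, L3); }
verb_phrase(vp(V, NP), L1, L3) { words::verb(V, L1, L2); noun_phrase(NP, L2, L3); }
det(d("the"), ["the"|L], L).
det(d("a"), ["a"|L], L).
```

Each predicate builds its own node of the tree from the nodes returned by the productions in its body, while the difference list is threaded through exactly as before.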

DCG rule types

The most common DCG rules express how to convert a head production into a body made up of the concatenation of other productions:

verb_phrase --> words::verb, noun_phrase.

Multiple rules for the same head can be specified to provide alternatives:

verb_phrase --> words::verb, noun_phrase.
verb_phrase --> words::verb, pronoun(object).

Alternatives can also be specified using the (||)/2 operator:

verb_phrase --> words::verb, noun_phrase || words::verb, pronoun(object).

Alternatives can also be limited to part of a rule body by surrounding them with parentheses:

verb_phrase --> words::verb, (noun_phrase || pronoun(object)).
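
The pronoun rule used in these examples is not defined on this page. A hypothetical definition, with the grammatical case passed as a rule parameter, might look like the sketch below. The rule name matches the examples above, but the word lists are illustrative assumptions only:

```plang
// Hypothetical helper rules; the words chosen here are illustrative only.
pronoun(subject) --> "he".
pronoun(subject) --> "she".
pronoun(object) --> "him".
pronoun(object) --> "her".
```

With these rules, pronoun(object) matches "him" and "her" but not "he" or "she", so the parameter selects which alternatives apply.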

Sometimes a rule is only valid if the following text does not have a certain form. This can be expressed using the (!)/1 operator. The following rule can be read as "a foo is a bar, but only if the bar is not followed by a baz":

foo --> bar, !baz.
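
As a concrete illustration, the sketch below applies the same operator to string literals. The rule is hypothetical, and the expected results in the comments assume that the negated term only inspects the following input and does not consume it:

```plang
// Hypothetical rule: "a", but only when the next word is not "b".
foo --> "a", !"b".

// foo(["a", "c"], Rest)   expected to succeed with Rest = ["c"]
// foo(["a", "b"], Rest)   expected to fail
```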

Rule commitment can be expressed using commit/0 or its shorthand !/0. In the following example, if b is followed by c, then the rule set commits to the first rule and tries to match d. If the text after c does not match d, then the parse fails rather than backtracking into the second rule to try e after b.

a --> b, c, commit, d.
a --> b, e.

The above rules can also be written as:

a --> b, c, !, d.
a --> b, e.
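
The effect of commitment is easiest to see when two rules share a common prefix. The grammar below is a hypothetical variation on the example above, written with string literals, and the expected results in the comments follow from the commitment behaviour described in this section:

```plang
// Hypothetical grammar in which both rules start with the same two words.
a --> "b", "c", !, "d".
a --> "b", "c", "e".

// a(["b", "c", "d"], [])   expected to succeed via the first rule
// a(["b", "c", "e"], [])   expected to fail: the first rule commits after "c",
//                          so the second rule is never tried
```

Without the commitment, the second query would succeed by falling back to the second rule.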

Literal strings or atoms to be matched are expressed as lists:

noun --> ["cat"].
noun --> ["bat"].
noun --> ["bobby", "fischer"].
noun --> [dog].

As a shorthand, the list brackets can be dropped if the literal is a string (but not an atom):

noun --> "cat".
noun --> "bat".
noun --> "bobby", "fischer".
noun --> dog.   // expands to the dog rule, not the word "dog".

The empty list can be used to match nothing, which is useful for optional rules:

optional_foo --> foo.
optional_foo --> [].
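
In the translation scheme shown in the introduction, a literal list is prepended to the remaining input. The empty list prepends nothing, so the front and end of the difference list are the same. Under that assumption, the two rules above would expand to something like the following hand-written sketch (not actual parser output):

```plang
// The first clause delegates to foo; the second consumes no input at all.
optional_foo(L1, L2) { foo(L1, L2); }
optional_foo(L, L).
```

Because the foo clause is listed first, the optional part is tried before the empty alternative.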

Sometimes a parse rule cannot be expressed directly in DCG syntax. For example, recognizing an integer value by enumerating every integer as a list literal is impractical. This case can be handled by embedding a statement in the rule to perform additional checks:

expression --> integer_value.
expression --> expression, "+", integer_value.

integer_value --> [X], { integer(X); }.

Another method is to use a custom predicate of arity 2 instead of the rule:

expression --> integer_value.
expression --> expression, "+", integer_value.

integer_value([X|Tail], Tail)
{
    integer(X);
}

This is essentially what the words module does for recognizing nouns, verbs, adjectives, and adverbs from WordNet. The module provides custom predicates that look and act like DCG rules.
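
Either definition of integer_value can be queried directly in the same way as any other translated rule. The queries below are a sketch in this page's query notation, assuming that integer/1 succeeds only when its argument is an integer:

```plang
integer_value([42, "+", 1], Rest)       succeeds, Rest = ["+", 1]
integer_value(["cat", "+", 1], Rest)    fails
```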

Fuzzy DCG rules

DCG rules can be annotated with a fuzzy logic confidence value to express the degree of confidence that the rule has in a particular sentence parse:

verb_phrase --> verb <<0.7>>.
verb_phrase --> verb, noun_phrase <<0.3>>.

When a grammar is ambiguous, this can help determine which of several parses for a sentence is the most likely.

Formal syntax of DCG rules

dcg_rule --> dcg_head, "-->", dcg_body, ".".
dcg_rule --> dcg_head, "-->", dcg_body, dcg_confidence, ".".

dcg_head --> atom.
dcg_head --> atom, "(", arguments, ")".
dcg_head --> atom, "(", ")".

dcg_body --> dcg_term.
dcg_body --> dcg_body, "||", dcg_term.

dcg_term --> dcg_neg.
dcg_term --> dcg_term, ",", dcg_neg.

dcg_neg  --> dcg_prim.
dcg_neg  --> "!", dcg_prim.

dcg_prim --> dcg_head.
dcg_prim --> compound_statement.
dcg_prim --> "[", "]".
dcg_prim --> "[", arguments, "]".
dcg_prim --> string.
dcg_prim --> "(", dcg_body, ")".
dcg_prim --> "!".
dcg_prim --> "commit".

dcg_confidence --> "<<", integer, ">>".
dcg_confidence --> "<<", float, ">>".