A discussion on conlang started me thinking, and here are
the fruits of my thinking, written up a bit more carefully,
than my posts here, usually written on the spur of the moment.
Sent to conlang and LINGUIT, too.
———————————————————-
"If we rank the sequences of a given length in order of statistical
approximation to English, we will find both grammatical and
ungrammatical sequences scattered throughout the list; there appears to
be no particular relation between order of approximation and
grammaticalness."
Syntactic Structures, 1957.
It has occurred to me that Chomsky’s statement cannot have been based on
actual observations showing that, indeed, one could see little
correlation between the grammaticalness of statistically approximated
texts and the order of their approximation.
This for two reasons:
1. Known methods of statistical approximations of texts become either so
computationally expensive for higher orders or so memory-hungry that
no computer existing at the time of publication of Syntactic
Structures could have coped with approximations beyond the third
order. Here is a third-order approximation of Hamlet (Bennett
1976:122) "HAMLET OF TWE AS TO BE MURGAINS FART ASSE GIVE ONEGS LOVE
GODY". Indeed, ungrammatical is an understatement!
2. It is easy to generate texts to any order of approximation without a
computer, and fairly quickly too, by adapting the parlour game known
in early surrealist circles as "cadavre exquis" from the first text
ever generated in this manner: "Le cadavre exquis…" (it was a
second-order approximation). But the very manner in which such texts
are generated provides a simple proof that
a) the higher the order of approximation, the more sequences of any
given length are both grammatically and semantically well-formed
b) the higher the order of approximation, the longer the sequences
which are both grammatically and semantically well-formed.
Order of Statistical Approximation Explained.
———————————————
"Order of statistical approximation" is a notion discovered and first
explained by Shannon and Weaver in their 1949 "Mathematical Theory of
Information", as the context of the above quote (not given here) makes
clear.
Imagine that there exists a method to generate a text randomly so that
the relative frequencies of its sequences of n letters are the same as
those of a real corpus. We say of such a text that it has been
approximated to the n-th order. That is the meaning of "order of
(statistical) approximation" used there by Chomsky.
Imagine that we have approximated an English corpus to the first order.
If we now count its individual letters we will find that they occur with
approximately the same relative frequencies as in the corpus, e.g. 5.8%
A’s, 1.2% B’s, 1.7% C’s, etc. (figures computed on Act III of Hamlet,
spaces counted. See Bennett 1976:131). With a second-order approximation
we will find two-letter sequences in the same frequencies as in the
corpus, with third-order approximations three-letter sequences, etc.
First Reason: Computational Cost
——————————–
Two methods have been known from the time of "Mathematical Theory of
Information" for generating such texts.
First method. Say you want a second-order approximation. First, build a
matrix of the frequencies with which letters are found to follow each
other. Thus, having counted that T is followed 316 times by H, you enter
316 in row T, column C. This done, calculate the progressive sums of the
columns for each row. For instance,
A B C D ….
A 0 12 16 13
becomes:
A B C D ….
A 0 12 28 41
The last column of each row contains thus the absolute frequency of its
letter. E.g. (figures for Act III of Hamlet, from Bennett 1976:111):
….. <space>
A 2043
B 410
C 584
…. ….
<space> 6934
If there are any letters which do not occur (the column of their row is
zero), remove their rows and columns from the matrix.
Finally, calculate the progressive sums of the cells of that last column:
A 2043
B 2453
C 3027
…. ….
<space> 35224
This last figure is the number of letters (counting spaces) in the
corpus.
You are now ready to generate a random text that will approximate Hamlet
to the second order.
Step 1. Draw a random number in the range 1 to 35223. Say it is 1066.
Step 2. Scan the last column of the matrix from the top until you find a
an equal of greater figure equal (2043, row A). Output its letter: A.
Step 3. Consider the row of the letter just output (A). Draw a random
number in the range from 1 to its absolute frequency (last column:
2043). Say it is 33.
Step 4. Scan the row from the left until you find an equal or greater
figure (here 41, column D). Output the letter of that column (D).
Step 5. Continue from Step 3 until you have had enough.
The method is the same for higher-order approximations.
For an n-th order approximation you build a matrix the rows
of which correspond to the sequences of n-1 letters found in the
corpus, and you apply the algorithm above. Thus, for a fourth-order
approximation you need to this matrix (AAA, AAB, etc, presumably not
occurring in the corpus having been deleted):
A B C ….. <space>
ABE … … … ….. …
ABI … … … ….. …
ABO … … … ….. …
… … … … ….. …
ZYT … … … ….. …
As you can imagine, this method is so time-consuming that you need a
computer. But the necessary matrix quickly becomes larger and larger for
higher-order approximations. So large that the computers available at
the time of the publication of Syntactic Structures had very little
memory, could hold not a matrix needed for even only fourth-order
approximations.
Second Method. This method, given by Shannon, is extremely simple.
Say you want to approximate Hamlet to the fifth order.
Step 1. Pick a spot in Hamlet at random. Note the first *four* letters
there. Output them.
Step 2. Keep picking a spot at random until you find a sequence of
letters identical to the last four letters output.
Step 3. Output the letter that follows that sequence.
Step 4. Continue from Step 2 until you have had enough.
This method is easily implemented: you only need enough memory to store
the corpus you want to mimick, plus the very simple algorithm above.
However, again, early computers had too little memory to hold any
but the most trivial corpora (Hickory, Dickory, Dock for instance).
And resorting to tape (the only storage medium that might have been
available then, other than punch cards) makes this algorithm
impossibly slow.
Even on later mainframes step 2 makes just sixth-order approximations
extremely slow (I tested it on a DEC-KL10 many years ago: it often took
a minute to output one letter).
Second Reason: The Cadavre-Exquis Proof
—————————————
This is how you play at cadavre exquis proper.
Step 1. The first player takes a sheet of paper and write one word on
it, and passes it to the next player, e.g. "Le"
Step 2. The current player, who holds the sheet, reads the word,
and writes under it a compatible word, e.g. "cadavre", folds the
paper to hide the top word, and passes it to the next player.
Step 3. The game continues from Step 2 until each has played,
or the sheet is full.
Step 4. Unfold the sheet and read it to the audience.
These rules are an algorithm for generating random text to the second
order of approximation, functionally equivalent to the two methods
detailed above. The only differences are
1. the unit of output is the word rather than the letter
2. that players may pick two-word sequences which they have never
encountered, whereas the computer algorithms can only select two-letter
sequences which do occur in the corpus to mimick.
The first difference is immaterial: letter-by-letter approximations were
historically used because they require far less storage space than
word-by-word approximations. My program, Monkey, does indifferently word
or letter approximation at the user’s request, using Shannon’s very
algorithm, only made very fast by an index structure of my invention.
The second difference means that computer algorithms, such as Shannon’s,
can only produce text as grammatical as cadavre exquis, or more
grammatical, never less.
The proof follows immediately, trivially even, that the greater the
order of approximation, the greater the grammaticalness of the text
generated.
Consider a first-order game of cadavre exquis: each player writes one
word, and folds the paper to hide it. Only very few sequences of the
resulting text will be grammatical, let alone meaningful.
Consider a tenth-order cadavre exquis: each player sees the last
*nine* words, write down a word that "makes sense" in this context,
and folds the paper to hide the top word. Only very few sequences
of the text will be ungrammatical, let alone meaningless. So that
the lengths of the grammatical sequences.
Grammaticalness is simply proved to correlate necessarily and strongly
with order of statistical approximation.
William Ralph Bennett Jr. Scientific and engineering problem-solving with
the computer. Prentice-Hall, Englewood Cliffs, NJ. 1976
ISBN 0-13-795807-2
folds it to hide the top word,
and passes the sheet to the next player.
Step 3. The player
Suppose that we want to approximate an English corpus to
the second order. We first build a matrix of co-occurrences of the
letters and punctuation marks as counted in the corpus. For instance,
if we count T followed 12000 times by H, we enter 12000 in row T,
column H of the matrix. We now draw a letter at random. This is
the first letter of our approximated text
We say that
a text has been approximated the n-th order when it has been
randomly generated so
That is out of Syntactic Structure. Chomsky does not explain what this
"in order of statistical approximation to English" means, nor does he
give a reference. You have to guess the covert reference to Shannon and
Weaver’s 1949 "Mathematical Theory of Information", and you can guess
only because it is in the bibliography.
And here we catch Chomsky lying, in flagrante delicto. At the time of
the publication of "Syntactic Structures" no-one had approximated
English to an order beyond 3, be it letter by letter or word by word. In
fact, until I dreamt up the algorithm at the basis of Monkey three years
ago, it was computationally too expensive, even on a modern PC, even on
a mainframe, to approximate English by word to just, say, the fifth
order. Chomsky should have written:
"If we could rank the sequences of a given length in order of statistical
approximation to English, we would find both grammatical and
ungrammatical sequences scattered throughout the list; there would
appear to be no particular relation between order of approximation and
grammaticalness. Alas, we cannot verify this experimentally because
we do not know how to produce such sequences to the desired orders
of approximation"
Some difference with the ex cathedra "if we rank… we will find"!
The very wording suggests that Chomsky *has* found an absence of
correlation between grammaticalness and order of approximation.
To close, I will point out that "sequences of a given length in order
of statistical approximation" is not an adequate wording for what is
meant there. Chomsky simply did not understand what he wanted to
pontificate about, and regurgitated undigested jargon in a jumbled mess.
That, in plain English, is called a charlatan.