Natural languages, communication, etc





Archive for February, 2010

nadsat dictionary

Does anyone know where on the net I can get a dictionary of "nadsat", the
Russified English slang in Burgess’ _A Clockwork Orange_?  All I need is a
list of the nadsat words used in the book w/their English equivalents, so
I don’t have to go through and list them myself.


posted by admin in Uncategorized and have Comments (3)

2 years old

As a developmental speech and language therapist, I’m looking for a language program for 2-year-olds,
for kindergarten.

posted by admin in Uncategorized and have No Comments

Repeated Announcement TWLT9

[Follow-ups are redirected to nlnet.misc]

                     Twente Workshop on Language Technology
                 Corpus-Based Approaches to Dialogue Modelling
                                  June 9, 1995
                              Repeated Announcement

On June 9, the ninth international Twente Workshop on Language Technology
(TWLT 9) will take place at the University of Twente, Enschede, the Netherlands.
This time, the workshop will be devoted to approaches which emphasize the use
of empirical data as a basis for dialogue modelling.
Special attention will be paid to the exploitation of man-man and (simulated)
man-machine dialogues for the design of (spoken) dialogue models and systems.

TWLT 9 will be organized around the following subjects:

   * corpus-based methods as applied to dialogue modelling
   * methods for evaluating implemented dialogue models
   * dialogue formalisms, speech acts, dialogue grammars
   * intention-based approaches to dialogue modelling
   * tools and methods to obtain and process dialogues

TWLT 9 is sponsored by the University of Twente, KPN Research and NWO, the
Dutch Organization for Scientific Research.

The latest version of the programme, as well as abstracts of the
talks are available at URL:

http://hydra.cs.utwente.nl/~stan/twlt9/.

The TWLT Series
—————

TWLT 9 is the ninth in a series of similar workshops, which started in March
1991. Like the previous workshops, it has an international character.
In the TWLT series the following workshops have been organized so far:

TWLT 1, March 1991. Tomita’s Algorithm: Extensions and Applications.
TWLT 2, November 1991. Linguistic Engineering: Tools and Products.
TWLT 3, May 1992. Connectionism and Natural Language Processing.
TWLT 4, September 1992. Pragmatics in Language Technology.
TWLT 5, June 1993. Natural Language Interfaces: From Laboratory to Commercial
        and User Environment.
TWLT 6, December 1993. Natural Language Parsing: Methods and Formalisms.
TWLT 7, June 1994. Computer-Assisted Language Learning.
TWLT 8, December 1994. Speech and Language Engineering.

Proceedings of TWLT 2-TWLT 8 can be ordered via the organizing secretariat.
TWLT 10 will take place December 6-8, 1995. Its topic is algebraic methods in
language processing. Organizers are M. Nivat, T. Rus and A. Nijholt. It will be
a joint event with the first AMAST workshop on language processing.
For information contact anijh…@cs.utwente.nl.

Proceedings of TWLT 9 will be available at the workshop.

Program Time Schedule
———————

Friday, June 9, 1995

9.00    Registration and Coffee

9.45    Opening

10.00   N. Fraser (Vocalis Ltd, Cambridge)
        Messy data, what can we learn from it?

10.30   N. Dahlback (NLP Laboratory, Linkoping)
        Kinds of agents and types of dialogues

11.00   Break

11.15   J.H. Connolly (Loughborough University of Technology)
        Clause-internal structure in spoken dialogue

11.45   J. Kowtko (HCRC, Edinburgh)
        The analysis of conversational games (preliminary title)

12.15   Lunch

13.45   J. Alexandersson (DFKI, Saarbrucken)
        Designing the dialogue component in a speech translation system
        – a corpus-based approach

14.15   H. Aust (Philips, Aachen)
        Dialogue control in automatic inquiry systems

14.45   Coffee

15.00   T. Andernach (University of Twente, Enschede)
        Predicting and interpreting speech acts in a theatre information
        and booking system

15.30   M. Rats (ITK, Tilburg)
        Referring to topics – a corpus-based study

16.00   Break

16.15   H. Dybkjaer, L. Dybkjaer and N. O. Bernsen (Centre for Cognitive
        Science, Roskilde)
        Design, formalization and evaluation of spoken language dialogue

16.45   D.G. Novick and B. Hansen (Oregon Graduate Institute of Science and
        Technology, Portland)
        Mutuality strategies for reference in task-oriented dialogue

17.15   Discussion

18.00   Closing

Organization
————

TWLT9 is organized by the PARLEVINK-project, a language theory and technology
project of the University of Twente in co-operation with KPN Research.
The local organizers are:

Toine Andernach, University of Twente
email: ander…@cs.utwente.nl; fax: +31 53 315283

Stan P. van de Burgt, KPN Research
email: S.P.vandeBu…@research.ptt.nl; fax: +31 70 3326477

Gerrit van der Hoeven, University of Twente
email: vdhoe…@cs.utwente.nl; fax: +31 53 315283

Registration
————

To register, fill in the form below and send it to the organizing secretariat
(preferably via e-mail) or fill in the form at URL:
http://hydra.cs.utwente.nl/~stan/twlt9/.

Regular registration fee: DFL 75.-
Student registration fee: DFL 40.-
This includes a lunch, refreshments during breaks and workshop proceedings.
Proceedings will be available on site and registration fee is to be paid
on site.

Please register well in advance and no later than June 1!
The number of participants is limited to 50.

Workshop site
————-

The workshop will take place in the "Demozaal" in building "L" of the
Informatica-complex at the campus of the University of Twente, Enschede,
the Netherlands. Enschede is located in the eastern part of the Netherlands.

Every 30 minutes a train in the direction of Enschede leaves from Schiphol
Airport.

The campus can be reached by car and by bus. By car, follow the signs
"Universiteit". From Hengelo station (the faster) take bus 15 or 51, and
from Enschede station take bus 1 or 51.
A ten-minute walk from the campus entrance will bring you to the workshop.
Follow the red signs "TWLT".

During the conference hours participants can be reached via the secretariat.
For hotel accommodation (information and booking) contact the organizing
secretariat.

Workshop secretariat
——————–

For more information on the workshop, please contact the organizers. For other
information and accommodation, please contact:

TWLT secretariat
University of Twente
Department of Computer Science
P.O. Box 217
7500 AE Enschede
The Netherlands
tel: +31 53 893680
fax: +31 53 315283
email: t…@cs.utwente.nl

Registration form
—————–

Name:
Department:
Organization:
Address:
City, zip:
Country:
Tel:
Fax:
Email:                  (please specify)
Type of participant (student/regular):
Hotel (y/n):            if yes:
                        nights (June 8, June 9 or both):
                        kind of room (single/double):
Comments, questions:

———————————————————————–
Toine Andernach       Department of Computer Science       P.O. Box 217
phone: +31 53 893789       University of Twente       7500 AE  Enschede
fax:   +31 53 315283   email: ander…@cs.utwente.nl    The Netherlands
                  URL: http://www.cs.utwente.nl/~andernac
———————————————————————–

[nlnet.announce is moderated; send articles to nlnet-annou...@NL.net]

posted by admin in Uncategorized and have No Comments

Grammaticalness vs order of approximation

A discussion on conlang started me thinking, and here are
the fruits of my thinking, written up a bit more carefully
than my posts here, which are usually written on the spur of the moment.
Sent to conlang and LINGUIT, too.

———————————————————-

  "If we rank the sequences of a given length in order of statistical
  approximation to English, we will find both grammatical and
  ungrammatical sequences scattered throughout the list; there appears to
  be no particular relation between order of approximation and
  grammaticalness."

                     Syntactic Structures, 1957.

It has occurred to me that Chomsky’s statement cannot have been based on
actual observations showing that, indeed, one could see little
correlation between the grammaticalness of statistically approximated
texts and the order of their approximation.

This for two reasons:

1. Known methods of statistical approximations of texts become either so
   computationally expensive for higher orders or so memory-hungry that
   no computer existing at the time of publication of Syntactic
   Structures could have coped with approximations beyond the third
   order. Here is a third-order approximation of Hamlet (Bennett
   1976:122) "HAMLET OF TWE AS TO BE MURGAINS FART ASSE GIVE ONEGS LOVE
   GODY". Indeed, ungrammatical is an understatement!

2. It is easy to generate texts to any order of approximation without a
   computer, and fairly quickly too, by adapting the parlour game known
   in early surrealist circles as "cadavre exquis" from the first text
   ever generated in this manner: "Le cadavre exquis…" (it was a
   second-order approximation). But the very manner in which such texts
   are generated provides a simple proof that

   a) the higher the order of approximation, the more sequences of any
   given length are both grammatically and semantically well-formed

   b) the higher the order of approximation, the longer the sequences
   which are both grammatically and semantically well-formed.

Order of Statistical Approximation Explained.
———————————————

"Order of statistical approximation" is a notion discovered and first
explained by Shannon and Weaver in their 1949 "The Mathematical Theory of
Communication", as the context of the above quote (not given here) makes
clear.

Imagine that there exists a method to generate a text randomly so that
the relative frequencies of its sequences of n letters are the same as
those of a real corpus. We say of such a text that it has been
approximated to the n-th order. That is the meaning of "order of
(statistical) approximation" used there by Chomsky.

Imagine that we have approximated an English corpus to the first order.
If we now count its individual letters we will find that they occur with
approximately the same relative frequencies as in the corpus, e.g. 5.8%
A’s, 1.2% B’s, 1.7% C’s, etc. (figures computed on Act III of Hamlet,
spaces counted. See Bennett 1976:131). With a second-order approximation
we will find two-letter sequences in the same frequencies as in the
corpus, with third-order approximations three-letter sequences, etc.
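
The counting behind such figures can be sketched in a few lines of Python. This is only an illustration, not Bennett's program: the function name `ngram_frequencies` and the toy corpus are made up for the example.

```python
from collections import Counter

def ngram_frequencies(text, n):
    """Relative frequency of every n-letter sequence found in text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

sample = "TO BE OR NOT TO BE"                 # hypothetical toy corpus, spaces counted
first_order = ngram_frequencies(sample, 1)    # single letters
second_order = ngram_frequencies(sample, 2)   # two-letter sequences
```

A text approximated to the n-th order should show roughly the same values of `ngram_frequencies(text, n)` as the corpus it mimics.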

First Reason: Computational Cost
——————————–

Two methods have been known from the time of "The Mathematical Theory of
Communication" for generating such texts.

First method. Say you want a second-order approximation. First, build a
matrix of the frequencies with which letters are found to follow each
other. Thus, having counted that T is followed 316 times by H, you enter
316 in row T, column H. This done, calculate the progressive sums of the
columns for each row. For instance,

     A   B    C    D ….
A    0   12   16   13

becomes:

     A   B    C    D ….
A    0   12   28   41
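
In Python the progressive sums of a row are what `itertools.accumulate` computes. A minimal check on the four figures shown above (the name `row_a` is just for the example, and the row is truncated to its first four columns):

```python
from itertools import accumulate

row_a = [0, 12, 16, 13]                  # times A is followed by A, B, C, D
progressive = list(accumulate(row_a))    # running totals, left to right
print(progressive)                       # [0, 12, 28, 41]
```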

The last column of each row thus contains the absolute frequency of its
letter. E.g. (figures for Act III of Hamlet, from Bennett 1976:111):

  ….. <space>
A        2043
B         410
C         584
….     ….
<space>  6934

If there are any letters which do not occur (the last column of their row
is zero), remove their rows and columns from the matrix.

Finally, calculate the progressive sums of the cells of that last column:

A        2043
B        2453
C        3037
….     ….
<space> 35224

This last figure is the number of letters (counting spaces) in the
corpus.

You are now ready to generate a random text that will approximate Hamlet
to the second order.

Step 1. Draw a random number in the range 1 to 35224. Say it is 1066.

Step 2. Scan the last column of the matrix from the top until you find
an equal or greater figure (2043, row A). Output its letter: A.

Step 3. Consider the row of the letter just output (A). Draw a random
number in the range from 1 to its absolute frequency (last column:
2043). Say it is 33.

Step 4. Scan the row from the left until you find an equal or greater
figure (here 41, column D). Output the letter of that column (D).

Step 5. Continue from Step 3 until you have had enough.
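
Steps 1 to 5 can be sketched in Python as follows. This is a sketch under stated assumptions, not Shannon's or Bennett's code: the function names are made up, and the corpus is treated as circular (the last letter is followed by the first) so that every letter has a successor and each row total equals the absolute frequency of its letter.

```python
import random
from bisect import bisect_left
from collections import Counter
from itertools import accumulate

def build_tables(corpus):
    """Return the alphabet, the progressive sums of each row, and of the last column."""
    alphabet = sorted(set(corpus))
    # Assumption: circular corpus, so the final letter also has a successor.
    pairs = Counter(zip(corpus, corpus[1:] + corpus[0]))
    rows = {a: list(accumulate(pairs[(a, b)] for b in alphabet)) for a in alphabet}
    totals = list(accumulate(rows[a][-1] for a in alphabet))
    return alphabet, rows, totals

def pick(cumulative, rng):
    """Draw a number from 1 to the last figure; find the first equal or greater figure."""
    draw = rng.randint(1, cumulative[-1])
    return bisect_left(cumulative, draw)

def approximate_second_order(corpus, length, seed=None):
    rng = random.Random(seed)
    alphabet, rows, totals = build_tables(corpus)
    letter = alphabet[pick(totals, rng)]              # Steps 1 and 2
    output = [letter]
    while len(output) < length:                       # Step 5
        letter = alphabet[pick(rows[letter], rng)]    # Steps 3 and 4
        output.append(letter)
    return "".join(output)

print(approximate_second_order("TO BE OR NOT TO BE", 40, seed=1))
```

`bisect_left` does the scanning of Steps 2 and 4: it returns the position of the first figure that is equal to or greater than the number drawn, so cells with a zero count are never selected.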

The method is the same for higher-order approximations.
For an n-th order approximation you build a matrix the rows
of which correspond to the sequences of n-1 letters found in the
corpus, and you apply the algorithm above. Thus, for a fourth-order
approximation you need this matrix (rows AAA, AAB, etc., which presumably
do not occur in the corpus, having been deleted):

       A    B    C   ….. <space>
ABE   …  …  …  …..   …
ABI   …  …  …  …..   …
ABO   …  …  …  …..   …
…   …  …  …  …..   …
ZYT   …  …  …  …..   …

As you can imagine, this method is so time-consuming that you need a
computer. But the necessary matrix quickly becomes larger and larger for
higher-order approximations: so large that the computers available at
the time of the publication of Syntactic Structures, which had very little
memory, could not hold the matrix needed for even a fourth-order
approximation.
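
The growth is easy to put into figures. The sketch below assumes an alphabet of 27 symbols (26 letters plus the space) and counts the full matrix before any empty rows are deleted, so it gives an upper bound only; the name `matrix_size` is made up for the example.

```python
SYMBOLS = 27   # assumption: 26 letters plus the space

def matrix_size(order, symbols=SYMBOLS):
    """Upper bound on (rows, cells) of the matrix for an n-th order approximation."""
    rows = symbols ** (order - 1)   # one row per possible sequence of n-1 letters
    cells = rows * symbols          # one column per possible next letter
    return rows, cells

for order in range(2, 7):
    rows, cells = matrix_size(order)
    print(f"order {order}: at most {rows:,} rows and {cells:,} cells")
```

For the fourth order this gives at most 19,683 rows and 531,441 cells, and each further order multiplies both figures by 27.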

Second Method.  This method, given by Shannon, is extremely simple.

Say you want to approximate Hamlet to the fifth order.

Step 1. Pick a spot in Hamlet at random. Note the first *four* letters
     there. Output them.

Step 2. Keep picking a spot at random until you find a sequence of
     letters identical to the last four letters output.

Step 3. Output the letter that follows that sequence.

Step 4. Continue from Step 2 until you have had enough.
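
A direct, deliberately naive rendering of these four steps in Python might look like this. The function name is made up, and the corpus is again assumed to be circular so that every spot has a following letter and Step 2 always terminates.

```python
import random

def shannon_approximation(corpus, order, length, seed=None):
    """Generate `length` letters approximating `corpus` to the given order (>= 2)."""
    k = order - 1                           # letters of context to match
    n = len(corpus)
    if k < 1 or n < k:
        raise ValueError("order must be at least 2 and the corpus at least order-1 long")
    rng = random.Random(seed)
    text = corpus + corpus[:k]              # assumption: the corpus wraps around
    start = rng.randrange(n)                # Step 1: pick a spot at random
    output = list(text[start:start + k])
    while len(output) < length:             # Step 4
        context = "".join(output[-k:])
        while True:                         # Step 2: keep picking spots until one matches
            spot = rng.randrange(n)
            if text[spot:spot + k] == context:
                break
        output.append(text[spot + k])       # Step 3: output the letter that follows
    return "".join(output[:length])

print(shannon_approximation("TO BE OR NOT TO BE ", 3, 40, seed=0))
```

The inner loop is the expensive part: the longer the context and the larger the corpus, the more random spots have to be tried before one matches.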

This method is easily implemented: you only need enough memory to store
the corpus you want to mimic, plus the very simple algorithm above.
However, again, early computers had too little memory to hold any
but the most trivial corpora (Hickory, Dickory, Dock for instance).
And resorting to tape (the only storage medium that might have been
available then, other than punch cards) makes this algorithm
impossibly slow.

On later mainframes too, Step 2 makes even sixth-order approximations
extremely slow (I tested it on a DEC-KL10 many years ago: it often took
a minute to output one letter).

Second Reason: The Cadavre-Exquis Proof
—————————————

This is how you play cadavre exquis proper.

Step 1. The first player takes a sheet of paper, writes one word on
it, e.g. "Le", and passes it to the next player.

Step 2. The current player, who holds the sheet, reads the word,
and writes under it a compatible word, e.g. "cadavre", folds the
paper to hide the top word, and passes it to the next player.

Step 3. The game continues from Step 2 until each has played,
or the sheet is full.

Step 4. Unfold the sheet and read it to the audience.

These rules are an algorithm for generating random text to the second
order of approximation, functionally equivalent to the two methods
detailed above. The only differences are

1. the unit of output is the word rather than the letter

2. players may pick two-word sequences which they have never
encountered, whereas the computer algorithms can only select two-letter
sequences which do occur in the corpus to be mimicked.
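
A mechanical player of the second-order game, working on words rather than letters, can be sketched as below. It uses a precomputed table of followers instead of Shannon's random picking, and it is not the index structure used in Monkey; the names `word_followers` and `cadavre_exquis` and the toy corpus are made up, and the corpus is assumed to wrap around so that the last word has a follower.

```python
import random
from collections import defaultdict

def word_followers(words):
    """Map each word to the list of words that follow it in the corpus."""
    followers = defaultdict(list)
    for current, following in zip(words, words[1:] + words[:1]):  # circular corpus
        followers[current].append(following)
    return followers

def cadavre_exquis(corpus, length, seed=None):
    rng = random.Random(seed)
    words = corpus.split()
    followers = word_followers(words)
    word = rng.choice(words)                  # the first player writes any word
    sheet = [word]
    while len(sheet) < length:
        word = rng.choice(followers[word])    # each player sees only the previous word
        sheet.append(word)
    return " ".join(sheet)

print(cadavre_exquis("the cat saw the dog and the dog saw the cat", 12, seed=3))
```

Unlike a human player, this one can only write a word that has already been seen after the previous word, which is the second difference noted above.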

The first difference is immaterial: letter-by-letter approximations were
historically used because they require far less storage space than
word-by-word approximations. My program, Monkey, does word or letter
approximation equally well at the user’s request, using Shannon’s very
algorithm, only made very fast by an index structure of my invention.

The second difference means that computer algorithms, such as Shannon’s,
can only produce text as grammatical as cadavre exquis, or more
grammatical, never less.

The proof follows immediately, trivially even, that the greater the
order of approximation, the greater the grammaticalness of the text
generated.

Consider a first-order game of cadavre exquis: each player writes one
word, and folds the paper to hide it. Only very few sequences of the
resulting text will be grammatical, let alone meaningful.

Consider a tenth-order cadavre exquis: each player sees the last
*nine* words, writes down a word that "makes sense" in this context,
and folds the paper to hide the top word. Only very few sequences
of the text will be ungrammatical, let alone meaningless. The lengths
of the grammatical sequences grow accordingly.

Grammaticalness is simply proved to correlate necessarily and strongly
with order of statistical approximation.

William Ralph Bennett Jr. Scientific and engineering problem-solving with
the computer. Prentice-Hall, Englewood Cliffs, NJ. 1976
ISBN 0-13-795807-2


That is out of Syntactic Structures. Chomsky does not explain what this
"in order of statistical approximation to English" means, nor does he
give a reference. You have to guess the covert reference to Shannon and
Weaver’s 1949 "The Mathematical Theory of Communication", and you can guess
only because it is in the bibliography.

And here we catch Chomsky lying, in flagrante delicto. At the time of
the publication of "Syntactic Structures" no-one had approximated
English to an order beyond 3, be it letter by letter or word by word. In
fact, until I dreamt up the algorithm at the basis of Monkey three years
ago, it was computationally too expensive, even on a modern PC, even on
a mainframe, to approximate English by word to just, say, the fifth
order. Chomsky should have written:

  "If we could rank the sequences of a given length in order of statistical
  approximation to English, we would find both grammatical and
  ungrammatical sequences scattered throughout the list; there would
  appear to be no particular relation between order of approximation and
  grammaticalness. Alas, we cannot verify this experimentally because
  we do not know how to produce such sequences to the desired orders
  of approximation"

Some difference with the ex cathedra "if we rank… we will find"!
The very wording suggests that Chomsky *has* found an absence of
correlation between grammaticalness and order of approximation.
To close, I will point out that "sequences of a given length in order
of statistical approximation" is not an adequate wording for what is
meant there. Chomsky simply did not understand what he wanted to
pontificate about, and regurgitated undigested jargon in a jumbled mess.

That, in plain English, is called a charlatan.

posted by admin in Uncategorized and have No Comments

Re: Vulcan ears

Ken Mellon (kmel…@pinetree.pinetree.org) wrote:
> And of course there’s always the Klingon Dictionary, available in any
> bookstore.  K’apla!

  As I recall, the publishers had to pull that book from the market after
  losing a copyright infringement suit.  Apparently a lot of their material
  was "borrowed" without permission from another publisher’s Welsh-Tagalog
  dictionary.

   - snopes

+—————————————————————————–+
|                                                                             |
| "A few stems of asparagus eaten shall give our urine a disagreeable odor."  |
|                                                                             |
|                                                       – Benjamin Franklin   |
+—————————————————————————–+

posted by admin in Uncategorized and have No Comments

List of 'normal' English words needed

Hi everyone,

I am aware that my posting is not of a linguistic nature, sorry for that, but I
suspect that I might find people around here that could help me out.

I am operating and still developing the interactive WWW-game ‘Lingo for WWW’,
which is an adaptation of a Dutch television word-game show. The objective of the
game is to guess a five-letter word. The wordlist I use now originates from the
crossword-archives and is dramatically overcomplete, featuring words like
‘chevy’, ‘yecch’ and ‘blocs’. The appearance of these words does not add to
the fun of playing, so I am looking for a less extensive lexicon. I am only
interested in the 5-letter words, but I could sieve out the other ones from a
complete list.

If you have any wordlist that seems suitable for my purpose, please mail me.

I am:  duyns…@cpedu.rug.nl

Curious? :  http://indy6.cpedu.rug.nl:8084/lingo.html

Teun Duynstee

posted by admin in Uncategorized and have No Comments

Gender of personified Death

A recent thread in sci.lang discussed father tongue and
motherland.
I wonder about the gender usually attributed to Death in fairy tales,
popular drama, and also in pictures on cards used for fortune telling.
My first language is Polish, which like most Slavonic languages
treats Death as a fem. word/person; when I was confronted with
German "Gevatter Tod", I was much surprised to see that other
languages/cultures treat Death as masc.
Can fellow netters please tell me how other languages treat this
subject?
I’m *not* interested in purely *grammatical gender* alone, but also
in its implications for how people imagine Death – as a man or as a woman.

Malgorzata Roos, Zurich
gr…@amath.unizh.ch

posted by admin in Uncategorized and have Comments (8)

linguistic societies/organizations

Next year I will begin graduate work in the field of linguistics.  I would
like to obtain information regarding linguistic organizations in ANY
country.  Can anyone give me some names?

posted by admin in Uncategorized and have No Comments

linguistic societies

Looking for a list of linguistic societies/organizations in ANY country.
Can anyone help?

posted by admin in Uncategorized and have No Comments

Maa

After reading today’s posts about mother language and father’s land
I had one ridiculous thought I would like to share.

Just as we speak about "father’s land", we also speak about
"mother earth".
The word for "mother" is quite similar across Indo-European languages, and
Estonian "ema" probably has the same background too. In Finnish we use quite a
different word, "äiti". The origin of that word is unknown to me
(perhaps from a baby’s crying "äää", just to make a simple guess).
In Finnish we also have an Estonian-like "emo" for animal mothers.

What sounds funny to me is that the name of earth in Finnish, "maa"
(same in Estonian), sounds very motherlike. Could there be some
connection?

Jorma Kyppo
Laukaa, Finland
jo…@jytko.jyu.fi

P.S. I just heard that the word "mammoth" is of Yakut origin
and comes from their word for earth or ground. When they saw
the first mammoths, which had been dug out of the ice, they thought
the mammoth was a kind of gigantic mole living underground!

posted by admin in Uncategorized and have No Comments