This is the utf/ure library, version 2.8. It is based (a very long
time ago) on Henry Spencer's regular expression package, but has been
very heavily modified by me. I am fairly sure that Henry would have
nothing to do with the changes I've made. This library also includes
some utf routines, since I've made the regular expressions UTF-aware.

The changes that I've made are:

1. egrep-style {min,max} patterns;
2. \<\>, \b, \s\S, \w\W, \p\P and \d\D matching;
3. @ wildcard matching (and an appropriate change to the . wildcard);
4. case insensitivity checked at run time rather than at compile time;
5. character ranges calculated at run time, not at compile time;
6. internal use of Runes (although the interface to both the compile
and execute routines is by utf string);
7. utf-aware expressions and searching;
8. Unicode \uxxxx character comprehension;
9. elimination of all static variables (threads, here we come...);
10. a POSIX-like interface, with some extensions and some omissions.
The omissions are things like [[:digit:]], which are better supported
by the utf routines and collation sequences;
11. an autoconf configure script, and supporting *.in files.

In addition, I've added some basic UTF-8 routines, as coded from "The
Java Programming Language" by Ken Arnold and James Gosling, Addison
Wesley 1996.  Because Unicode specifies only a character encoding
set, with no inherent ordering within that set, I've put language
collation sequences in a UTF file called langcoll.utf, which is
installed in ${prefix}/lib by the installation process.  This has
some first guesses at language collation sequences for German (from
my 3 years living there) and for French (from school).  I apologise
in advance for all the errors in that file.  Please send any
enhancements or corrections to the address shown at the bottom of
this file.

The routines that mimic ASCII character manipulation in Unicode are
provided here in the urelang file, and are prefixed with a UNICODE_
designation, e.g. UNICODE_isdigit, UNICODE_isalpha, etc.  There are
two routines for deciding whether a Unicode character (a Rune) is
numeric: UNICODE_isdigit() tells whether the Rune is a Unicode digit,
whilst UNICODE_isnumber() tells whether the Rune is a number in your
language collating sequence.  For example, the Tamil digit one
(\u0be7) will always return non-zero when passed as an argument to
UNICODE_isdigit, but it will return non-zero when passed to
UNICODE_isnumber only when the current language collation sequence is
set to Tamil.  There is a similar relationship between
UNICODE_isletter(), which returns non-zero when passed any Unicode
letter, and UNICODE_isalpha(), which returns non-zero only when
passed a Rune that appears as an upper- or lower-case letter in your
language collation sequence.

If a language makes no distinction between upper and lower case, the
alphabet should be presented as all lower case in langcoll.utf.
If there is no ordering inherent in the language, then there will
presumably be no range searching done using the language, and so the
coder of the language collation sequence is free to choose an
arbitrary sequence.

The utf-aware grep that is included is not intended to be a
replacement for your normal grep program - it is simply a "proof of
concept" utf-aware grep.  I'm well aware that it's not fast, but
functionality was the goal, not blinding speed.

Finally, if you are looking for a UTF-aware editor to experiment
with UTF, I highly recommend Gary Capell's wily editor

	ftp://ftp.cs.su.oz.au/gary/wily/src/wily.tgz

or the version of sam that works under X11

	ftp://netlib.bell-labs.com/netlib/research/sam.shar.Z

Alistair G. Crooks
(agc@amdahl.com or agc@westley.demon.co.uk)
Mon Feb 24 12:17:23 GMT 1997
