Internationalised Computer Science Exercises

Philippe Verdy via Unicode unicode at
Thu Feb 1 02:19:48 CST 2018

2018-02-01 8:03 GMT+01:00 Philippe Verdy <verdy_p at>:

> 2018-02-01 2:38 GMT+01:00 Richard Wordingham via Unicode <
> unicode at>:
>> For example, in  /a{2}/ has 4 nodes (still marked by leading apostrophes
> here), and so k=2:
>   '<SOT>    <----
>             \   /        \
>              'a          |
>               |           ^
>           '{2,2}        |
>              /    \       /
  '<EOT>      --->

I forgot to describe how I represent the graph (when compiling regexps
only). And I forgot the second (looping) link from the quantifier (added

The graph is just a vector of nodes stored as a linear array indexed by
integers. Nodes (with leaing apostrophes in the notation above) are objects
with one of 4 types:

   SOT, character class, quantifier, or EOT.

- The SOT and EOT node types are trivial and has no other properties, they
exists once and only once in all graphs, they can be the same actual type.

- the quantifier node type has two integer properties: min and max taking a
positive or null value, and INFINITE for the unbounded quantifer (e.g. the
"+", "*", or "{2,}" quantifiers) with the constraint that min<=max.
INFINITE can be represented in an integer as -1 or MAXINT. It also has
another computed property, the counter index, i.e. an index in the array of
counter values you'll allocate for the "state" variable you'll use in the
matcher. It has two other properties: the next node number (in the
represented graph) to take whever the counter has reached the [min, max]
interval or not, and possibly a "greedy" flag to specify which condition
(false or true) you'll evaluate first (instead of this flag you can create
two separate subtypes for greedy quantifiers and non-greedy quantifiers to
avoid this test at runtime in the matcher).

- the character class node type can have subtypes : single character, basic
character class (like [abc]), or a more complex character class,
implemented by a character class method "is(c)" where c is the input code
point from the text being scanned. The functional method can evaluate for
example Unicode character properties such as "isupper(c)" evaluating the
[:gc=Lu:] character class, or "isdigit(c)" evaluating the [:gc=Nd:]
character class, or isin(c, "string") to detect if c is present in the
given string containing a list of characters. or isnotin(c, "string") for
the inverse. In Javascript, it is trivial to build functions on the fly, in
C/C++ you need a representation to call the appropriate method with some
parameters. It

You may also have two additional node types for capturing groups :

  SOC (start of capturing group) and   EOC (end of capturing group)

both with a single property: the capturing group index. You'll allocate a
new index for the array of captured groups to allocate in the "state"
variable used by the matcher. This type of node is unconditionally
traversed, but traversing it just consists into storing the current input
position in the array of captured groups (and making sure that, while
runniong the macher, you'll keep the past scanned input text in a buffer as
soon as you've started caputuring any group)

All node types have also an array of output node index (this array is
unordered: all of them will be taken simultaneously by a step in the
matcher, so you can segregate node types instead of using polymorphic
nodes, and then store in each node an array of indexes for each output node
type),and then instead of using a single array of nodes for storing the
graph, use a separate array for quantifiers nodes, and an separate array
for character class nodes. As SOT will never be part of the output nodes to
link to, you'll just need a boolean flag to say if the EOT node is part of
the output nodes from a node in your graph,
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <>

More information about the Unicode mailing list