Regular Expression Engine

The regular expression engine uses Perl-style syntax with Unicode extensions. Its non-backtracking virtual machine guarantees that regular expression searches run in time linear in the size of the input.

Example program

The following example shows how to use the API of the regular expression engine.

with Ada.Command_Line;
with Ada.Strings.Wide_Wide_Fixed;
with Ada.Wide_Wide_Text_IO;

with League.Regexps;
with League.Strings;

procedure Demo is

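   function Read (File_Name : String) return League.Strings.Universal_String;
   --  Returns the first line (at most 1024 characters) of the named file.
   --  The "wcem=8" form string is GNAT-specific and selects UTF-8 encoding.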

   ----------
   -- Read --
   ----------

   function Read (File_Name : String) return League.Strings.Universal_String is
      File   : Ada.Wide_Wide_Text_IO.File_Type;
      Buffer : Wide_Wide_String (1 .. 1024);
      Last   : Natural;

   begin
      Ada.Wide_Wide_Text_IO.Open
        (File, Ada.Wide_Wide_Text_IO.In_File, File_Name, "wcem=8");
      Ada.Wide_Wide_Text_IO.Get_Line (File, Buffer, Last);
      Ada.Wide_Wide_Text_IO.Close (File);

      return League.Strings.To_Universal_String (Buffer (1 .. Last));
   end Read;

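   --  The first argument names a file whose first line is the expression;
   --  the second names a file whose first line is the text to search.

   Expression : constant League.Strings.Universal_String :=
     Read (Ada.Command_Line.Argument (1));
   Subject    : constant League.Strings.Universal_String :=
     Read (Ada.Command_Line.Argument (2));
   Pattern    : constant League.Regexps.Regexp_Pattern :=
     League.Regexps.Compile (Expression);
   Match      : constant League.Regexps.Regexp_Match :=
     Pattern.Find_Match (Subject);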

begin
   if Match.Is_Matched then
      Ada.Wide_Wide_Text_IO.Put_Line
        ("Match found:"
           & Integer'Wide_Wide_Image (Match.First_Index)
           & " .."
           & Integer'Wide_Wide_Image (Match.Last_Index)
           & " => '"
           & League.Strings.To_Wide_Wide_String (Match.Capture)
           & "'");

      for J in 1 .. Match.Capture_Count loop
         Ada.Wide_Wide_Text_IO.Put_Line
           ("         \"
              & Ada.Strings.Wide_Wide_Fixed.Trim
                  (Integer'Wide_Wide_Image (J), Ada.Strings.Both)
              & ":"
              & Integer'Wide_Wide_Image (Match.First_Index (J))
              & " .."
              & Integer'Wide_Wide_Image (Match.Last_Index (J))
              & " => '"
              & League.Strings.To_Wide_Wide_String (Match.Capture (J))
              & "'");
      end loop;

   else
      Ada.Wide_Wide_Text_IO.Put_Line ("Not matched");
   end if;
end Demo;

Syntax

Characters

Any character that has neither the Pattern_Syntax nor the Pattern_White_Space property matches a single instance of itself. The other forms are:

.            Matches any single character.
\X           Matches the character X itself, where X has the Pattern_White_Space or Pattern_Syntax property (can be used in character classes).
\a           Matches the bell character (can be used in character classes).
\e           Matches the escape character (can be used in character classes).
\f           Matches the form feed character (can be used in character classes).
\n           Matches LF (can be used in character classes).
\r           Matches CR (can be used in character classes).
\t           Matches the horizontal tab character (can be used in character classes).
\v           Matches the vertical tab character (can be used in character classes).
\cX          Matches the ASCII control character Control+A through Control+Z, where X is in the range A-Z (can be used in character classes).
\uFFFF       Matches the character with the Unicode code point given by the 4 hexadecimal digits FFFF (can be used in character classes).
\UFFFFFFFF   Matches the character with the Unicode code point given by the 8 hexadecimal digits FFFFFFFF (can be used in character classes).
\Q ... \E    Matches the characters between \Q and \E literally, suppressing the meaning of special characters.
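
As a brief illustration of the escapes listed above, the following sketch compiles three patterns and prints what each one finds. It relies only on the League.Regexps and League.Strings operations already used by the example program; the procedure name Escapes_Demo, the helper Show, and the sample strings are made up for illustration. No output is listed because it has not been checked against the engine.

```ada
with Ada.Wide_Wide_Text_IO;

with League.Regexps;
with League.Strings;

procedure Escapes_Demo is

   --  Hypothetical helper: compiles Expression, searches Text and prints
   --  the matched slice, if any.

   procedure Show (Expression : Wide_Wide_String; Text : Wide_Wide_String) is
      Pattern : constant League.Regexps.Regexp_Pattern :=
        League.Regexps.Compile
          (League.Strings.To_Universal_String (Expression));
      Match   : constant League.Regexps.Regexp_Match :=
        Pattern.Find_Match (League.Strings.To_Universal_String (Text));

   begin
      if Match.Is_Matched then
         Ada.Wide_Wide_Text_IO.Put_Line
           (Expression
              & " => '"
              & League.Strings.To_Wide_Wide_String (Match.Capture)
              & "'");

      else
         Ada.Wide_Wide_Text_IO.Put_Line (Expression & " => not matched");
      end if;
   end Show;

begin
   --  '.' has the Pattern_Syntax property, so a literal dot is written "\.".
   Show ("3\.14", "pi is 3.14");

   --  \u0041 names LATIN CAPITAL LETTER A by its code point.
   Show ("\u0041nt", "Anthill");

   --  Between \Q and \E the '+' characters lose their special meaning.
   Show ("\Qc++\E", "written in c++");
end Escapes_Demo;
```

Ada string literals do not treat the backslash as an escape character, so the patterns can be written exactly as they appear in the table.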

Named character classes

In each form below, name is the name of a binary property or of a general category value. All four forms can be used in character classes.

\p{name}    Matches a character that has the specified binary property or general category value.
\P{name}    Matches a character that does not have the specified binary property or general category value.
[:name:]    Matches a character that has the specified binary property or general category value.
[:^name:]   Matches a character that does not have the specified binary property or general category value.
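
The property escapes above could be exercised with a sketch like the following. It assumes that the short names from the tables in this section (Nd, Lu, Ll, Alpha) are accepted as written; the procedure name, the helper Report, and the sample text are hypothetical.

```ada
with Ada.Wide_Wide_Text_IO;

with League.Regexps;
with League.Strings;

procedure Properties_Demo is

   --  Sample text, chosen only for illustration.

   Text : constant League.Strings.Universal_String :=
     League.Strings.To_Universal_String ("room 42b, Building C");

   --  Hypothetical helper: searches Text for Expression and prints the
   --  index range of the match together with the matched slice.

   procedure Report (Expression : Wide_Wide_String) is
      Pattern : constant League.Regexps.Regexp_Pattern :=
        League.Regexps.Compile
          (League.Strings.To_Universal_String (Expression));
      Match   : constant League.Regexps.Regexp_Match :=
        Pattern.Find_Match (Text);

   begin
      if Match.Is_Matched then
         Ada.Wide_Wide_Text_IO.Put_Line
           (Expression
              & ":"
              & Integer'Wide_Wide_Image (Match.First_Index)
              & " .."
              & Integer'Wide_Wide_Image (Match.Last_Index)
              & " => '"
              & League.Strings.To_Wide_Wide_String (Match.Capture)
              & "'");

      else
         Ada.Wide_Wide_Text_IO.Put_Line (Expression & ": not matched");
      end if;
   end Report;

begin
   Report ("\p{Nd}+");        --  a run of decimal digits
   Report ("\p{Lu}\p{Ll}+");  --  an uppercase letter, then lowercase ones
   Report ("\P{Alpha}+");     --  a run of non-alphabetic characters
end Properties_Demo;
```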

Supported binary properties

Short name  Full name                           Alternative name
AHex        ASCII_Hex_Digit
Alpha       Alphabetic
Bidi_C      Bidi_Control
Bidi_M      Bidi_Mirrored
CE          Composition_Exclusion
Comp_Ex     Full_Composition_Exclusion
Dash        Dash
Dep         Deprecated
DI          Default_Ignorable_Code_Point
Dia         Diacritic
Ext         Extender
Gr_Base     Grapheme_Base
Gr_Ext      Grapheme_Extend
Gr_Link     Grapheme_Link
Hex         Hex_Digit
Hyphen      Hyphen
IDC         ID_Continue
Ideo        Ideographic
IDS         ID_Start
IDSB        IDS_Binary_Operator
IDST        IDS_Trinary_Operator
Join_C      Join_Control
LOE         Logical_Order_Exception
Lower       Lowercase
Math        Math
NChar       Noncharacter_Code_Point
OAlpha      Other_Alphabetic
ODI         Other_Default_Ignorable_Code_Point
OGr_Ext     Other_Grapheme_Extend
OIDC        Other_ID_Continue
OIDS        Other_ID_Start
OLower      Other_Lowercase
OMath       Other_Math
OUpper      Other_Uppercase
Pat_Syn     Pattern_Syntax
Pat_WS      Pattern_White_Space
QMark       Quotation_Mark
Radical     Radical
SD          Soft_Dotted
STerm       STerm
Term        Terminal_Punctuation
UIdeo       Unified_Ideograph
Upper       Uppercase
VS          Variation_Selector
WSpace      White_Space                         space
XIDC        XID_Continue
XIDS        XID_Start
XO_NFC      Expands_On_NFC
XO_NFD      Expands_On_NFD
XO_NFKC     Expands_On_NFKC
XO_NFKD     Expands_On_NFKD

Supported values of general category property

Short name  Full name               Alternative name
C           Other
Cc          Control                 cntrl
Cf          Format
Cn          Unassigned
Co          Private_Use
Cs          Surrogate
L           Letter
LC          Cased_Letter
Ll          Lowercase_Letter
Lm          Modifier_Letter
Lo          Other_Letter
Lt          Titlecase_Letter
Lu          Uppercase_Letter
M           Mark
Mc          Spacing_Mark
Me          Enclosing_Mark
Mn          Nonspacing_Mark
N           Number
Nd          Decimal_Number          digit
Nl          Letter_Number
No          Other_Number
P           Punctuation             punct
Pc          Connector_Punctuation
Pd          Dash_Punctuation
Pe          Close_Punctuation
Pf          Final_Punctuation
Pi          Initial_Punctuation
Po          Other_Punctuation
Ps          Open_Punctuation
S           Symbol
Sc          Currency_Symbol
Sk          Modifier_Symbol
Sm          Math_Symbol
So          Other_Symbol
Z           Separator
Zl          Line_Separator
Zp          Paragraph_Separator
Zs          Space_Separator

Character classes

[members]    Matches any character specified by members.
[^members]   Matches any character except those specified by members.

Character class members

Any character that has neither the Pattern_Syntax nor the Pattern_White_Space property adds itself to the class.

x-y          Specifies the range of characters from x to y.
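
A minimal sketch of character classes in use is given below. It assumes that ranges, negation, and the escapes marked "can be used in character classes" combine as the tables suggest; the names Classes_Demo and Expression_List and the sample text are hypothetical, and the unary "+" is only a local renaming of League.Strings.To_Universal_String.

```ada
with Ada.Wide_Wide_Text_IO;

with League.Regexps;
with League.Strings;

procedure Classes_Demo is

   --  Local shorthand for converting literals to Universal_String.

   function "+"
     (Item : Wide_Wide_String) return League.Strings.Universal_String
       renames League.Strings.To_Universal_String;

   type Expression_List is
     array (Positive range <>) of League.Strings.Universal_String;

   Expressions : constant Expression_List :=
     (+"[0-9a-f]+",      --  digits and the lowercase letters a to f
      +"[^0-9a-f]+",     --  anything outside that set
      +"[\p{Lu}\-]+");   --  uppercase letters or an escaped hyphen

   Text : constant League.Strings.Universal_String := +"ID-cafe42-XY";

begin
   for J in Expressions'Range loop
      declare
         Pattern : constant League.Regexps.Regexp_Pattern :=
           League.Regexps.Compile (Expressions (J));
         Match   : constant League.Regexps.Regexp_Match :=
           Pattern.Find_Match (Text);

      begin
         if Match.Is_Matched then
            Ada.Wide_Wide_Text_IO.Put_Line
              (League.Strings.To_Wide_Wide_String (Expressions (J))
                 & " => '"
                 & League.Strings.To_Wide_Wide_String (Match.Capture)
                 & "'");

         else
            Ada.Wide_Wide_Text_IO.Put_Line
              (League.Strings.To_Wide_Wide_String (Expressions (J))
                 & " => not matched");
         end if;
      end;
   end loop;
end Classes_Demo;
```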

Quantifiers

?        Makes the preceding item optional. Greedy, so the optional item is included in the match if possible.
??       Makes the preceding item optional. Lazy, so the optional item is excluded from the match if possible.
*        Repeats the preceding item zero or more times. Greedy, so as many items as possible are matched before matches with fewer repetitions of the preceding item are tried, down to the point where the preceding item is not matched at all.
*?       Repeats the preceding item zero or more times. Lazy, so the engine first attempts to skip the preceding item, then tries matches with ever more repetitions of it.
+        Repeats the preceding item once or more. Greedy, so as many items as possible are matched before matches with fewer repetitions of the preceding item are tried, down to the point where the preceding item is matched only once.
+?       Repeats the preceding item once or more. Lazy, so the engine first matches the preceding item only once, then tries matches with ever more repetitions of it.
{n}      Repeats the preceding item exactly n times, where n >= 1.
{n,m}    Repeats the preceding item between n and m times, where n >= 0 and m >= n. Greedy, so repeating m times is tried before reducing the repetition to n times.
{n,m}?   Repeats the preceding item between n and m times, where n >= 0 and m >= n. Lazy, so repeating n times is tried before increasing the repetition to m times.
{n,}     Repeats the preceding item at least n times, where n >= 0. Greedy, so as many items as possible are matched before matches with fewer repetitions of the preceding item are tried, down to the point where the preceding item is matched only n times.
{n,}?    Repeats the preceding item n or more times, where n >= 0. Lazy, so the engine first matches the preceding item n times, then tries matches with ever more repetitions of it.
{,n}     Repeats the preceding item between zero and n times, where n >= 0. Greedy, so as many items as possible (up to n) are matched before matches with fewer repetitions of the preceding item are tried, down to the point where the preceding item is not matched at all.
{,n}?    Repeats the preceding item between zero and n times, where n >= 0. Lazy, so the engine first attempts to skip the preceding item, then tries matches with ever more repetitions of it, up to n.
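
To make the difference between greedy and lazy quantifiers concrete, the sketch below runs each pair against the same text. If the engine follows the descriptions above, the greedy form of each pair should report the longer slice. The procedure name, the helper Found, and the sample text are hypothetical; only calls that appear in the example program are used.

```ada
with Ada.Wide_Wide_Text_IO;

with League.Regexps;
with League.Strings;

procedure Quantifiers_Demo is

   --  Sample text, chosen only for illustration.

   Text : constant League.Strings.Universal_String :=
     League.Strings.To_Universal_String ("xaaaay");

   --  Hypothetical helper: returns the slice of Text matched by
   --  Expression, or a placeholder when there is no match.

   function Found (Expression : Wide_Wide_String) return Wide_Wide_String is
      Pattern : constant League.Regexps.Regexp_Pattern :=
        League.Regexps.Compile
          (League.Strings.To_Universal_String (Expression));
      Match   : constant League.Regexps.Regexp_Match :=
        Pattern.Find_Match (Text);

   begin
      if Match.Is_Matched then
         return League.Strings.To_Wide_Wide_String (Match.Capture);

      else
         return "<none>";
      end if;
   end Found;

   procedure Compare (Greedy : Wide_Wide_String; Lazy : Wide_Wide_String) is
   begin
      Ada.Wide_Wide_Text_IO.Put_Line
        (Greedy & " => '" & Found (Greedy) & "'   "
           & Lazy & " => '" & Found (Lazy) & "'");
   end Compare;

begin
   Compare ("xa?", "xa??");
   Compare ("xa*", "xa*?");
   Compare ("xa+", "xa+?");
   Compare ("xa{2,3}", "xa{2,3}?");
   Compare ("xa{2,}", "xa{2,}?");
   Compare ("xa{,2}", "xa{,2}?");
end Quantifiers_Demo;
```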

Composites

x y      Matches x followed by y.
x | y    Matches either the part on the left side or the part on the right side. Alternatives can be strung together into a series of options. The pipe has the lowest precedence of all operators; use grouping to alternate only part of the regular expression.

Grouping

(regex)      Round brackets group the regex between them. They capture the text matched by the regex inside them so that it can be reused in a backreference, and they allow regex operators to be applied to the entire grouped regex.
(?:regex)    Non-capturing parentheses group the regex so that regex operators can be applied to it, but they do not capture anything and do not create backreferences.
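
The sketch below combines alternation with both kinds of group and then lists the captures in the same way as the example program. The pattern and text are made up for illustration; under the rules above only the two (\p{Nd}+) groups would be expected to produce captures, since (?:cat|dog) is non-capturing.

```ada
with Ada.Wide_Wide_Text_IO;

with League.Regexps;
with League.Strings;

procedure Groups_Demo is

   --  One non-capturing group holding an alternation, followed by two
   --  capturing groups separated by an escaped hyphen.

   Pattern : constant League.Regexps.Regexp_Pattern :=
     League.Regexps.Compile
       (League.Strings.To_Universal_String
          ("(?:cat|dog)(\p{Nd}+)\-(\p{Nd}+)"));
   Match   : constant League.Regexps.Regexp_Match :=
     Pattern.Find_Match
       (League.Strings.To_Universal_String ("dog10-25"));

begin
   if Match.Is_Matched then
      Ada.Wide_Wide_Text_IO.Put_Line
        ("whole match: '"
           & League.Strings.To_Wide_Wide_String (Match.Capture)
           & "'");

      --  Capturing groups are numbered from 1, as in the example program.

      for J in 1 .. Match.Capture_Count loop
         Ada.Wide_Wide_Text_IO.Put_Line
           ("group"
              & Integer'Wide_Wide_Image (J)
              & ": '"
              & League.Strings.To_Wide_Wide_String (Match.Capture (J))
              & "'");
      end loop;

   else
      Ada.Wide_Wide_Text_IO.Put_Line ("Not matched");
   end if;
end Groups_Demo;
```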

Comments

(?#comment)  Everything between (?# and ) is ignored by the regex engine.

Anchors

^ (caret)    Matches at the start of the string the regex pattern is applied to. Matches a position rather than a character.
$ (dollar)   Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character.
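
The effect of the anchors can be checked with a small sketch such as the one below, which only reports whether each pattern matches. The helpers Found and Check and the sample strings are hypothetical; the comments state what the anchor descriptions above imply, not verified output.

```ada
with Ada.Wide_Wide_Text_IO;

with League.Regexps;
with League.Strings;

procedure Anchors_Demo is

   --  Hypothetical helper: True when Expression matches somewhere in Text.

   function Found (Expression, Text : Wide_Wide_String) return Boolean is
      Pattern : constant League.Regexps.Regexp_Pattern :=
        League.Regexps.Compile
          (League.Strings.To_Universal_String (Expression));
      Match   : constant League.Regexps.Regexp_Match :=
        Pattern.Find_Match (League.Strings.To_Universal_String (Text));

   begin
      return Match.Is_Matched;
   end Found;

   procedure Check (Expression, Text : Wide_Wide_String) is
   begin
      Ada.Wide_Wide_Text_IO.Put_Line
        (Expression
           & " in '"
           & Text
           & "': "
           & Boolean'Wide_Wide_Image (Found (Expression, Text)));
   end Check;

begin
   Check ("^ab", "abcd");     --  "ab" stands at the start of the string
   Check ("^cd", "abcd");     --  "cd" is present, but not at the start
   Check ("cd$", "abcd");     --  "cd" stands at the end of the string
   Check ("ab$", "abcd");     --  "ab" is present, but not at the end
   Check ("^abcd$", "abcd");  --  the pattern has to span the whole string
end Anchors_Demo;
```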
Last modified on Jan 8, 2011, 12:25:02 PM