Regular Expression Engine
The regular expression engine uses Perl-style syntax with Unicode extensions. Its non-backtracking virtual machine guarantees that regular expression searches run in time linear in the size of the input.
Example program
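The following example shows how to use the API of the regular expression engine. It reads the pattern from the first line of the file named by the first command-line argument and the subject string from the first line of the file named by the second argument, then prints the overall match and each captured group.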
with Ada.Command_Line;
with Ada.Strings.Wide_Wide_Fixed;
with Ada.Wide_Wide_Text_IO;

with League.Regexps;
with League.Strings;

procedure Demo is

   function Read (File_Name : String) return League.Strings.Universal_String;
   --  Returns the first line (at most 1024 characters) of the given file.

   ----------
   -- Read --
   ----------

   function Read (File_Name : String) return League.Strings.Universal_String is
      File   : Ada.Wide_Wide_Text_IO.File_Type;
      Buffer : Wide_Wide_String (1 .. 1024);
      Last   : Natural;

   begin
      --  The "wcem=8" form string is GNAT-specific and selects UTF-8
      --  encoding for the file.

      Ada.Wide_Wide_Text_IO.Open
        (File, Ada.Wide_Wide_Text_IO.In_File, File_Name, "wcem=8");
      Ada.Wide_Wide_Text_IO.Get_Line (File, Buffer, Last);
      Ada.Wide_Wide_Text_IO.Close (File);

      return League.Strings.To_Universal_String (Buffer (1 .. Last));
   end Read;

   --  First argument: file with the pattern; second: file with the text.

   Expression : League.Strings.Universal_String :=
     Read (Ada.Command_Line.Argument (1));
   String     : League.Strings.Universal_String :=
     Read (Ada.Command_Line.Argument (2));
   Pattern    : League.Regexps.Regexp_Pattern :=
     League.Regexps.Compile (Expression);
   Match      : League.Regexps.Regexp_Match :=
     Pattern.Find_Match (String);

begin
   if Match.Is_Matched then
      --  Report the position and text of the overall match.

      Ada.Wide_Wide_Text_IO.Put_Line
        ("Match found:"
           & Integer'Wide_Wide_Image (Match.First_Index)
           & " .."
           & Integer'Wide_Wide_Image (Match.Last_Index)
           & " => '"
           & League.Strings.To_Wide_Wide_String (Match.Capture)
           & "'");

      --  Report the position and text of each capturing group.

      for J in 1 .. Match.Capture_Count loop
         Ada.Wide_Wide_Text_IO.Put_Line
           (" \"
              & Ada.Strings.Wide_Wide_Fixed.Trim
                  (Integer'Wide_Wide_Image (J), Ada.Strings.Both)
              & ":"
              & Integer'Wide_Wide_Image (Match.First_Index (J))
              & " .."
              & Integer'Wide_Wide_Image (Match.Last_Index (J))
              & " => '"
              & League.Strings.To_Wide_Wide_String (Match.Capture (J))
              & "'");
      end loop;

   else
      Ada.Wide_Wide_Text_IO.Put_Line ("Not matched");
   end if;
end Demo;
Syntax
Characters
Any character except Pattern_Syntax and Pattern_White_Space | All characters except the special characters Pattern_Syntax and Pattern_White_Space match a single instance of themselves.
|
. (dot) | Matches any single character.
|
\X where X is Pattern_White_Space or Pattern_Syntax | Matches a specified character (can be used inside character classes).
|
\a | Matches the bell character (can be used in character classes).
|
\e | Matches the escape character (can be used in character classes).
|
\f | Matches the form feed character (can be used in character classes).
|
\n | Matches LF (can be used in character classes).
|
\r | Matches CR (can be used in character classes).
|
\t | Matches the horizontal tab character (can be used in character classes).
|
\v | Matches the vertical tab character (can be used in character classes).
|
\cX where X is in the range A-Z | Matches the ASCII control character Control+A through Control+Z (can be used in character classes).
|
\uFFFF where FFFF are 4 hexadecimal digits | Matches the character with the specified Unicode code point (can be used in character classes).
|
\UFFFFFFFF where FFFFFFFF are 8 hexadecimal digits | Matches the character with the specified Unicode code point (can be used in character classes).
|
\Q ... \E | Matches the characters between \Q and \E literally, suppressing the meaning of special characters.
|
Named character classes
\p{name} where name is the name of a binary property or a general category value | Matches any character that has the specified binary property or general category value (can be used in character classes).
|
\P{name} where name is the name of a binary property or a general category value | Matches any character that does not have the specified binary property or general category value (can be used in character classes).
|
[:name:] where name is the name of a binary property or a general category value | Matches any character that has the specified binary property or general category value (can be used in character classes).
|
[:^name:] where name is the name of a binary property or a general category value | Matches any character that does not have the specified binary property or general category value (can be used in character classes).
|
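As a minimal sketch of the \p{name} syntax, the program below looks for an uppercase letter followed by one or more lowercase letters, using the short general category names Lu and Ll from the tables that follow. It uses only the calls that appear in the example program above; the procedure name Property_Demo and the sample text are illustrative assumptions.

```ada
with Ada.Wide_Wide_Text_IO;
with League.Regexps;
with League.Strings;

procedure Property_Demo is
   --  \p{Lu}: one uppercase letter; \p{Ll}+: one or more lowercase letters.
   --  Property_Demo and the sample text are illustrative, not part of the
   --  library.
   Pattern : constant League.Regexps.Regexp_Pattern :=
     League.Regexps.Compile
       (League.Strings.To_Universal_String ("\p{Lu}\p{Ll}+"));
   Match   : constant League.Regexps.Regexp_Match :=
     Pattern.Find_Match (League.Strings.To_Universal_String ("Hello"));
begin
   if Match.Is_Matched then
      Ada.Wide_Wide_Text_IO.Put_Line
        ("Matched '"
         & League.Strings.To_Wide_Wide_String (Match.Capture)
         & "'");
   else
      Ada.Wide_Wide_Text_IO.Put_Line ("Not matched");
   end if;
end Property_Demo;
```

Replacing \p{Lu} with \P{Lu} in the pattern should make the first position match any character that is not an uppercase letter.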
Supported binary properties
Short name | Full name | Alternative name
|
AHex | ASCII_Hex_Digit |
|
Alpha | Alphabetic |
|
Bidi_C | Bidi_Control |
|
Bidi_M | Bidi_Mirrored |
|
CE | Composition_Exclusion |
|
Comp_Ex | Full_Composition_Exclusion |
|
Dash | Dash |
|
Dep | Deprecated |
|
DI | Default_Ignorable_Code_Point |
|
Dia | Diacritic |
|
Ext | Extender |
|
Gr_Base | Grapheme_Base |
|
Gr_Ext | Grapheme_Extend |
|
Gr_Link | Grapheme_Link |
|
Hex | Hex_Digit |
|
Hyphen | Hyphen |
|
IDC | ID_Continue |
|
Ideo | Ideographic |
|
IDS | ID_Start |
|
IDSB | IDS_Binary_Operator |
|
IDST | IDS_Trinary_Operator |
|
Join_C | Join_Control |
|
LOE | Logical_Order_Exception |
|
Lower | Lowercase |
|
Math | Math |
|
NChar | Noncharacter_Code_Point |
|
OAlpha | Other_Alphabetic |
|
ODI | Other_Default_Ignorable_Code_Point |
|
OGr_Ext | Other_Grapheme_Extend |
|
OIDC | Other_ID_Continue |
|
OIDS | Other_ID_Start |
|
OLower | Other_Lowercase |
|
OMath | Other_Math |
|
OUpper | Other_Uppercase |
|
Pat_Syn | Pattern_Syntax |
|
Pat_WS | Pattern_White_Space |
|
QMark | Quotation_Mark |
|
Radical | Radical |
|
SD | Soft_Dotted |
|
STerm | STerm |
|
Term | Terminal_Punctuation |
|
UIdeo | Unified_Ideograph |
|
Upper | Uppercase |
|
VS | Variation_Selector |
|
WSpace | White_Space | space
|
XIDC | XID_Continue |
|
XIDS | XID_Start |
|
XO_NFC | Expands_On_NFC |
|
XO_NFD | Expands_On_NFD |
|
XO_NFKC | Expands_On_NFKC |
|
XO_NFKD | Expands_On_NFKD |
|
Supported values of general category property
Short name | Full name | Alternative name
|
C | Other |
|
Cc | Control | cntrl
|
Cf | Format |
|
Cn | Unassigned |
|
Co | Private_Use |
|
Cs | Surrogate |
|
L | Letter |
|
LC | Cased_Letter |
|
Ll | Lowercase_Letter |
|
Lm | Modifier_Letter |
|
Lo | Other_Letter |
|
Lt | Titlecase_Letter |
|
Lu | Uppercase_Letter |
|
M | Mark |
|
Mc | Spacing_Mark |
|
Me | Enclosing_Mark |
|
Mn | Nonspacing_Mark |
|
N | Number |
|
Nd | Decimal_Number | digit
|
Nl | Letter_Number |
|
No | Other_Number |
|
P | Punctuation | punct
|
Pc | Connector_Punctuation |
|
Pd | Dash_Punctuation |
|
Pe | Close_Punctuation |
|
Pf | Final_Punctuation |
|
Pi | Initial_Punctuation |
|
Po | Other_Punctuation |
|
Ps | Open_Punctuation |
|
S | Symbol |
|
Sc | Currency_Symbol |
|
Sk | Modifier_Symbol |
|
Sm | Math_Symbol |
|
So | Other_Symbol |
|
Z | Separator |
|
Zl | Line_Separator |
|
Zp | Paragraph_Separator |
|
Zs | Space_Separator |
|
Character classes
[members] | Matches any character specified by members.
|
[^members] | Matches any character except those specified by members.
|
Character class members
Any character except Pattern_Syntax and Pattern_White_Space | Every character except the special characters Pattern_Syntax and Pattern_White_Space adds itself to the class.
|
x-y | Specifies a range of characters.
|
Quantifiers
? | Makes the preceding item optional. Greedy, so the optional item is included in the match if possible.
|
?? | Makes the preceding item optional. Lazy, so the optional item is excluded from the match if possible.
|
* | Repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, down to the point where the preceding item is not matched at all.
|
*? | Repeats the previous item zero or more times. Lazy, so the engine first attempts to skip the previous item, before trying permutations with ever increasing matches of the preceding item.
|
+ | Repeats the previous item once or more. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, down to the point where the preceding item is matched only once.
|
+? | Repeats the previous item once or more. Lazy, so the engine first matches the previous item only once, before trying permutations with ever increasing matches of the preceding item.
|
{n} where n is an integer >= 1 | Repeats the previous item exactly n times.
|
{n,m} where n >= 0 and m >= n | Repeats the previous item between n and m times. Greedy, so repeating m times is tried before reducing the repetition to n times.
|
{n,m}? where n >= 0 and m >= n | Repeats the previous item between n and m times. Lazy, so repeating n times is tried before increasing the repetition to m times.
|
{n,} where n >= 0 | Repeats the previous item at least n times. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, down to the point where the preceding item is matched only n times.
|
{n,}? where n >= 0 | Repeats the previous item n or more times. Lazy, so the engine first matches the previous item n times, before trying permutations with ever increasing matches of the preceding item.
|
{,n} where n >= 0 | Repeats the previous item between zero and n times. Greedy, so as many items as possible (up to n) will be matched before trying permutations with fewer matches of the preceding item, down to the point where the preceding item is not matched at all.
|
{,n}? where n >= 0 | Repeats the previous item between zero and n times. Lazy, so the engine first attempts to skip the previous item, before trying permutations with ever increasing matches of the preceding item, up to n times.
|
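The difference between greedy and lazy quantifiers is easiest to see by running several patterns against the same text. The sketch below does this with only the calls used in the example program above; the procedure Quantifier_Demo, the helper Show and the sample text are illustrative assumptions.

```ada
with Ada.Wide_Wide_Text_IO;
with League.Regexps;
with League.Strings;

procedure Quantifier_Demo is

   --  Hypothetical helper (not part of the library): compiles Expression,
   --  applies it to Text and prints the text of the overall match, if any.

   procedure Show (Expression : Wide_Wide_String; Text : Wide_Wide_String) is
      Pattern : constant League.Regexps.Regexp_Pattern :=
        League.Regexps.Compile
          (League.Strings.To_Universal_String (Expression));
      Match   : constant League.Regexps.Regexp_Match :=
        Pattern.Find_Match (League.Strings.To_Universal_String (Text));
   begin
      if Match.Is_Matched then
         Ada.Wide_Wide_Text_IO.Put_Line
           (Expression
            & " => '"
            & League.Strings.To_Wide_Wide_String (Match.Capture)
            & "'");
      else
         Ada.Wide_Wide_Text_IO.Put_Line (Expression & " => not matched");
      end if;
   end Show;

begin
   Show ("a+", "aaab");      --  greedy: prefers the longest run of 'a'
   Show ("a+?", "aaab");     --  lazy: prefers a single 'a'
   Show ("a{2,}", "aaab");   --  greedy, at least two 'a'
   Show ("a{,2}?", "aaab");  --  lazy, prefers zero repetitions
end Quantifier_Demo;
```

According to the descriptions in the table, the greedy forms should report the longest possible run of a, and the lazy forms the shortest one that still allows the whole pattern to match.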
Composites
x y | x followed by y
|
x | y | Causes the regex engine to match either the part on the left side, or the part on the right side. Can be strung together into a series of options. The pipe has the lowest precedence of all operators. Use grouping to alternate only part of the regular expression.
|
Grouping
(regex) | Round brackets group the regex between them. They capture the text matched by the regex inside them that can be reused in a backreference, and they allow you to apply regex operators to the entire grouped regex.
|
(?:regex) | Non-capturing parentheses group the regex so you can apply regex operators, but do not capture anything and do not create backreferences.
|
(?#comment) | Everything between (?# and ) is ignored by the regex engine.
|
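To illustrate the difference between capturing and non-capturing groups, the following sketch places a non-capturing group between two capturing ones and prints every capture, reusing the Capture_Count and Capture calls from the example program above. The procedure name Grouping_Demo, the pattern and the sample text are illustrative assumptions.

```ada
with Ada.Wide_Wide_Text_IO;
with League.Regexps;
with League.Strings;

procedure Grouping_Demo is
   --  (a+) and (c+) are capturing groups; (?:b+) only groups and is
   --  expected not to be counted among the captures.
   Pattern : constant League.Regexps.Regexp_Pattern :=
     League.Regexps.Compile
       (League.Strings.To_Universal_String ("(a+)(?:b+)(c+)"));
   Match   : constant League.Regexps.Regexp_Match :=
     Pattern.Find_Match (League.Strings.To_Universal_String ("aabbcc"));
begin
   if Match.Is_Matched then
      --  Print the number and text of each capturing group.
      for J in 1 .. Match.Capture_Count loop
         Ada.Wide_Wide_Text_IO.Put_Line
           (Integer'Wide_Wide_Image (J)
            & " => '"
            & League.Strings.To_Wide_Wide_String (Match.Capture (J))
            & "'");
      end loop;
   else
      Ada.Wide_Wide_Text_IO.Put_Line ("Not matched");
   end if;
end Grouping_Demo;
```

If the engine follows the descriptions in the table, only the two capturing groups should be listed, holding the runs of a and c respectively.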
Anchors
^ (caret) | Matches at the start of the string the regex pattern is applied to. Matches a position rather than a character.
|
$ (dollar) | Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character.
|