From hetchego@hasar.com Thu Jul 20 16:28:02 2000
Date: Wed, 31 May 2000 16:20:23 -0300
From: Hugo Etchegoyen <hetchego@hasar.com>
To: Philip Hazel <ph10@cam.ac.uk>
Subject: C++ wrapper for pcre

Philip:

Some days ago I was looking for a regular expression library that I
could use in C++ and I came across pcre-3.2. I liked it because it was
so powerful and versatile, so I decided to write a C++ wrapper around
it. It was an easy job, and the results quite rewarding.

Please feel free to publish and/or use the attached files for any useful
purpose.

Thank you for a nice package.

Regards,

Hugo Etchegoyen.


Re: A C++ wrapper around pcre-3.2
---------------------------------

Re is a C++ wrapper around version 3.2 of pcre (Philip Hazel, <ph10@cam.ac.uk>). It defines a regular expression class (Regexp), which is basically a holder for a compiled regular expression. Instances of this class can be uninitialized or initialized with a regular expression string (or char *) or with another Regexp. Initialization from string or char * involves compiling the regular expression and storing the compiled result and the (optional) compilation options within the class instance. Initialization from another Regexp involves copying the compilation options and cloning the compiled regular expression in the source.

A Regexp can also be assigned. Assignment from another Regexp involves freeing the memory occupied by any previous compiled expression, copying the options and cloning the compiled expression in the source. Assignment from a string or char * involves freeing the memory and compiling the source string using the current value of the compilation options.

These are the constructors, assignment operators and destructor:

	Regexp(unsigned opts = 0);
	Regexp(const string & s, unsigned opts = 0);
	const Regexp & operator = (const string & s);
	Regexp(const char * s, unsigned opts = 0);
	const Regexp & operator = (const char * s);
	Regexp(const Regexp & r);
	const Regexp & operator = (const Regexp & r);
	~Regexp();

Compiling a regular expression throws an exception if pcre detects an error. The exception type is Regexp::exception, which, apart from its name, is identical to the standard library's runtime_error:

	class exception : public runtime_error
	{
	public:
		exception(const string & msg) : runtime_error(msg) { }
	};

This exception carries an error message supplied by pcre and returned by the what() method:

	try
	{
		Regexp re = "ab.*|c";
	}
	catch(Regexp::exception &e)
	{
		cout << "Regexp error: " << e.what() << endl;
		exit(1);
	}

Compilation is influenced by flags represented as option bits. The Regexp constructors which compile a source string take an optional argument for the compilation options, whose default value is 0. Compiling assignments (from a string or char *) use the current option settings. Pcre's compilation flags are given alternative C++-style names, but of course you can use the original ones if you prefer them:

	enum 
	{
		anchored =	PCRE_ANCHORED,
		caseless =	PCRE_CASELESS,
		dollarend =	PCRE_DOLLAR_ENDONLY,
		dotall =	PCRE_DOTALL,
		extended =	PCRE_EXTENDED,
		multiline =	PCRE_MULTILINE,
		ungreedy =	PCRE_UNGREEDY
	};

For example,

	Regexp re("ab.*|c", Regexp::anchored|Regexp::dotall);

Two methods allow reading and setting the current compilation options:

	unsigned options();			// reading
	void options(unsigned opts);		// setting

A compiled regular expression can be used for matching a string against it:

	typedef pair<int, int> markers;
	vector<markers> match(const string & s, unsigned offset = 0) const;

match() takes a string and an optional starting offset, whose default value is 0, and matches the string against the compiled regular expression. An exception is thrown if the Regexp is uninitialized. It returns a vector of Regexp::markers, each marker being a pair of integers. If the returned vector is empty, the string did not match. If the returned vector (let's call it v) is not empty, then the v[0] pair contains the offsets of the first and the last-plus-one characters in the match. The rest of the elements, v[i], contain the same information for the captured substrings. If a certain subpattern in the expression did not participate in the match, the corresponding vector element will contain the pair (-1, -1).
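The layout of the returned vector can be illustrated without the wrapper itself. The following is only a sketch: it uses the C++11 std::regex library (ECMAScript syntax) as a stand-in for pcre, and match_markers() is a hypothetical name, not part of this package. It assumes that offsets are always counted from the beginning of the whole string, even when a starting offset is given.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<int, int> markers;

// Hypothetical stand-in for Regexp::match(), built on std::regex, not pcre.
std::vector<markers> match_markers(const std::regex & re, const std::string & s,
                                   unsigned offset = 0)
{
    std::vector<markers> v;
    if (offset > s.size())
        return v;                           // nothing to search
    std::smatch m;
    if (!std::regex_search(s.begin() + offset, s.end(), m, re))
        return v;                           // empty vector: no match
    for (std::size_t i = 0; i < m.size(); ++i)
    {
        if (!m[i].matched)
            v.push_back(markers(-1, -1));   // group did not participate
        else
            v.push_back(markers(int(m[i].first - s.begin()),
                                int(m[i].second - s.begin())));
    }
    return v;
}
```

With the pattern "(a+)|(b+)" and the string "xxbbb", v[0] is (2, 5), v[1] is (-1, -1) because the first alternative did not participate, and v[2] is (2, 5).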

Getting the actual strings for the main match and its substrings is possible using two auxiliary functions:

	static string substr(const string & s, const vector<markers> & marks,
						unsigned index);
	static vector<string> substr(const string & s,
						const vector<markers> & marks);

The first function takes the matched string, the vector of markers returned by match() and an index, and returns the corresponding substring. An exception is thrown if the index is outside bounds. The second function returns a vector of strings containing all the substrings in the match. In both cases, (-1, -1) markers are translated into empty strings.
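The translation from markers to strings described above could look roughly like this. It is a sketch, not the code in re.cpp: the names marked_substr() and marked_substrs() are hypothetical, and std::out_of_range is used here only as an assumption in place of whatever exception the wrapper really throws.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<int, int> markers;

// Hypothetical equivalent of the single-substring substr().
std::string marked_substr(const std::string & s,
                          const std::vector<markers> & marks, unsigned index)
{
    if (index >= marks.size())
        throw std::out_of_range("marker index out of bounds");  // assumed type
    const markers & m = marks[index];
    if (m.first < 0 || m.second < 0)
        return std::string();               // (-1, -1) becomes an empty string
    return s.substr(std::size_t(m.first), std::size_t(m.second - m.first));
}

// Hypothetical equivalent of the all-substrings substr().
std::vector<std::string> marked_substrs(const std::string & s,
                                        const std::vector<markers> & marks)
{
    std::vector<std::string> out;
    for (unsigned i = 0; i < marks.size(); ++i)
        out.push_back(marked_substr(s, marks, i));
    return out;
}
```

Note that the second element of a marker is an end offset, not a length, so the length passed to string::substr() is the difference between the two offsets.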

There is another matching method:

	vector<markers> gmatch(const string & s) const;

gmatch() (global match) finds all matches of the regular expression in the string and returns a vector of markers with one element for each match. Substrings for each match are not reported. The strings corresponding to the individual matches can be retrieved using the substr() functions as above.
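To make the difference from match() concrete, here is a sketch of a global match that yields one marker per match and no markers for captured substrings. As an assumption for illustration, it is written against C++11 std::regex rather than pcre, and gmatch_markers() is a hypothetical name.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<int, int> markers;

// Hypothetical stand-in for Regexp::gmatch(), built on std::regex, not pcre.
std::vector<markers> gmatch_markers(const std::regex & re, const std::string & s)
{
    std::vector<markers> v;
    std::sregex_iterator it(s.begin(), s.end(), re), end;
    for (; it != end; ++it)
    {
        // Only the whole match (group 0) is recorded; captures are ignored.
        v.push_back(markers(int((*it)[0].first - s.begin()),
                            int((*it)[0].second - s.begin())));
    }
    return v;
}
```

For the pattern "[0-9]+" and the string "a12b345c", the result holds the two markers (1, 3) and (4, 7).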

There is also a method that allows splitting a string into fields separated by successive matches of the regular expression:

	vector<string> split(const string & s, bool emptyfields = true,
					unsigned maxfields = 0) const;
	vector<string> split(const string & s, unsigned maxfields,
					bool emptyfields = true) const;

This is a single function with two prototypes, allowing you to pass the second and third (both optional) arguments in any order you like. The functionality resembles that of the corresponding AWK function. If emptyfields is false (the default is true), empty fields will not be reported: separators at the beginning or the end of the string will be ignored and two adjacent separators will be considered as a single separator. If maxfields is greater than zero (the default is zero), at most maxfields fields will be reported, from left to right; if the original string contains more than maxfields fields (let's say N fields), the last reported field runs from the beginning of field[maxfields-1] to the end of field[N-1] including intervening separators, i.e., extra fields and intervening separators are concatenated to the last reported field. If maxfields is 1 and emptyfields is false, split() will trim separators at the beginning and end of the string and return the trimmed string as a single field. This function returns a vector of strings containing the fields.
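The interplay of emptyfields and maxfields is easier to follow in code. The sketch below is one possible reading of the rules above, not the implementation in this package: it uses C++11 std::regex instead of pcre, split_fields() is a hypothetical name, and only one argument order is provided.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for Regexp::split(), built on std::regex, not pcre.
std::vector<std::string> split_fields(const std::regex & sep, const std::string & s,
                                      bool emptyfields = true, unsigned maxfields = 0)
{
    // Collect the [begin, end) offsets of every field between separators.
    std::vector<std::pair<std::size_t, std::size_t> > spans;
    std::size_t start = 0;
    std::sregex_iterator it(s.begin(), s.end(), sep), end;
    for (; it != end; ++it)
    {
        std::size_t mb = std::size_t((*it)[0].first - s.begin());
        std::size_t me = std::size_t((*it)[0].second - s.begin());
        if (emptyfields || mb > start)      // drop empty fields if requested
            spans.push_back(std::make_pair(start, mb));
        start = me;
    }
    if (emptyfields || s.size() > start)
        spans.push_back(std::make_pair(start, s.size()));

    // Fold surplus fields, and the separators between them, into the last one.
    if (maxfields > 0 && spans.size() > maxfields)
    {
        spans[maxfields - 1].second = spans.back().second;
        spans.resize(maxfields);
    }

    std::vector<std::string> fields;
    for (std::size_t i = 0; i < spans.size(); ++i)
        fields.push_back(s.substr(spans[i].first, spans[i].second - spans[i].first));
    return fields;
}
```

Under this reading, splitting ",a,,b," on "," gives five fields (three of them empty) by default and the two fields "a" and "b" when emptyfields is false. Splitting "a,b,c,d" with maxfields set to 2 gives "a" and "b,c,d", and splitting ",,a,b,," with maxfields 1 and emptyfields false gives the single field "a,b".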

Using this module
-----------------

To use this module, #include re.h in your source, and link with re.o (or re.obj) and the pcre library (libpcre.a or libpcre.lib).

I compiled and tested the pcre library and this package on RedHat Linux and on Windows (using the Borland C++ Builder 4.0 compiler).

Limitations
-----------

This wrapper implements only part of the functionality of pcre. For example:

1. Pcre allows options to be used both when compiling and when matching. This wrapper only makes use of the compilation options; matching options are used at their default value (0).

2. Locales and character table generation are not supported.

If anyone implements the missing features, please let me know.

Author
------

Hugo Etchegoyen
hetchego@hasar.com
Buenos Aires, Argentina.
