An Introduction to Regular Expressions in Javascript
Filed: Fri, Dec 22 2006 under Programming || Tags: regexp regular expressions javascript voodoo
A fantastic regular
expression cheat sheet can be downloaded and printed out
at ilovejackdaniels.com.
Regular Expressions are a shorthand scripting language that allows you to
match strings according to user defined rules. Perhaps no element of a programming
language inspires more fear and dread in the novice programmer, yet information about
this beast is regularly sought by those seeking to validate phone numbers and
credit cards or transform data into a certain format. Fortunately, or unfortunately,
depending on your view, Javascript supports regular expressions. Now nobody can make them
easy, but hopefully we can go a ways toward making them understandable.
The origins of regular expressions lie in automata theory and formal language theory, both of which are part of theoretical computer science. These fields study models of computation (automata) and ways to describe and classify formal languages. The mathematician Stephen Kleene in the 1950s described these models using his mathematical notation called regular sets. Ken Thompson built this notation into the editor QED, and then into the Unix editor ed, which eventually led to grep's use of regular expressions. Ever since that time, regular expressions have been widely used in Unix and Unix-like utilities such as: expr, awk, Emacs, vi, lex, and Perl. -- Wikipedia

If you just read the history of regular expressions the way I read the history of regular expressions, you came to the same conclusion I did: regular expressions came from an era when we were still using vacuum tubes and punch cards, and not much has changed since. On the bright side, if you actually take the time to sit down and master regular expressions you should be able to use that knowledge, like, forever, which is about twenty years longer than the life expectancy of your current favorite programming language.

Regular expressions in Javascript have existed since Javascript version 1.2 (IE version 4.0), which means they're not compatible with all browsers in use today. However, the number of people using version 3.0 (or earlier) browsers is so statistically insignificant that it is considered safe to use regular expressions as part of your javascript programs. To give you a frame of reference: Javascript 1.6 began with Firefox 1.5, and the current and most modern version is 1.7, which was introduced in Firefox 2.0.
Since we're using features not found in the original version of Javascript, it doesn't hurt to specify the minimum javascript version you need when you define your script declaration (the version parameter is a Mozilla extension, so other browsers may not honor it):

<script type="application/javascript;version=1.2"> </script>

There are two ways to define a regular expression in javascript. The first is a regular expression literal assigned in a variable declaration. The second is the RegExp object constructor. A literal is the natural choice for patterns that never change, while the constructor converts a string into a regular expression, which is what you need when a pattern has to be built dynamically at run time.

<script type="application/javascript;version=1.2">
var regexp_as_a_variable = /abcd/i;
var regexp_as_an_object = new RegExp("abcd","i");
</script>
The first thing you'll notice is that when defining a regular expression as a variable declaration we use slashes to surround the data -- in this case /abcd/. When you use the RegExp constructor you pass a string which is then converted to a regular expression. Whichever method you choose to declare your regular expressions, javascript ends up with the same kind of RegExp object.

<script type="application/javascript;version=1.2">
var regexp_as_a_variable = /abcd/;
alert(typeof(regexp_as_a_variable));
</script>

Will return "object" as the result in current browsers (some older browsers reported "function" instead).

One minor thing to get out of the way real quick. You'll notice that after /abcd/ we have an i; we also pass an "i" as a second parameter when doing the constructor call. The "i" in both cases is what is known as a modifier, and specifically the "i" means: ignore case. Which means /abcd/i will match ABCD, AbcD, abcd and so forth. Perl and Unix have several different modifiers, but javascript recognizes g (global, for when you're doing multiple search and replaces), i (ignore case), and m (multi-line, for when your string has new lines and carriage returns). The s modifier (single line string) is only recognized by newer javascript engines.

Javascript supports regular expressions in several different ways. The regexp object itself has two methods: test and exec. If you set up a regular expression as such... var re=/abc/i; then re.test('does this string have abc in it somewhere?') will return true because the test string does have abc in it. If you do re.exec('does this string have abc in it somewhere?') you will get an array back whose first element, [0], is 'abc' -- basically exec found abc then clipped it and passed it back to you.

Strings also support regular expressions. If you have var str='This is a test that has abc in it somewhere'; then str.match(/abc/i) will pass back an array where [0] would be abc. str.search(/abc/i) would return 24, which is the position in the string where abc was found. One of the more useful methods is replace, where str.replace(/abc/ig,'xyz') would replace all occurrences of "abc" in str with xyz.
And finally, the str.split command can accept a regular expression and break out a string into separate array elements based on your regular expression.

Here is an example that will pop an alert box if our string has "abc" in it somewhere and then pop another alert that replaces abc with xyz.

var re = /abc/;
var str = 'This is a test string that has abc in it somewhere';
if (re.test(str)) {
    alert('abc is part of the string');
    alert(str.replace(/abc/ig,'xyz'));
}
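
As a rough companion to the example above, the following sketch runs each of the methods mentioned so far (test, exec, match, search, replace and split) against one string. The variable names are made up for illustration, and console.log stands in for alert so the snippet also runs outside a browser.

```javascript
// One pattern and one string, exercised by every method discussed above.
var re = /abc/i;
var str = 'This is a test that has abc in it somewhere';

// RegExp methods
var found = re.test(str);              // true when the pattern occurs anywhere
var execResult = re.exec(str);         // array: [0] is the matched text, .index its position

// String methods
var matchResult = str.match(/abc/i);   // same shape as the exec result
var position = str.search(/abc/i);     // index of the first match, or -1
var replaced = str.replace(/abc/ig, 'xyz');
var words = str.split(/\s+/);          // split on runs of whitespace

console.log(found, execResult[0], matchResult[0], position);
console.log(replaced);
console.log(words.length + ' words');
```

Note that exec and match hand back an array rather than a bare string, which is why the matched text is read from element [0].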
So here, perhaps, is a better way to understand regular expressions. They are a way to find a specific substring within a string: to get true or false depending on whether the substring exists (test), to get the index where the substring begins (str.search), or to replace the substring with something new (str.replace). Which means that even though regular expressions may be difficult to master, they are indeed quite powerful and able to do many tasks we, as programmers, need to be able to do.

From here on out, this tutorial will be using the following text from Jules Verne's "Voyage to the center of the earth" as its example text.

Looking back to all that has occurred to me since that eventful day, I
am scarcely able to believe in the reality of my adventures. They were
truly so wonderful that even now I am bewildered when I think of them.
My uncle was a German, having married my mother's sister, an
Englishwoman. Being very much attached to his fatherless nephew, he
invited me to study under him in his home in the fatherland. This home
was in a large town, and my uncle a professor of philosophy, chemistry,
geology, mineralogy, and many other ologies.
One day, after passing some hours in the laboratory--my uncle being
absent at the time--I suddenly felt the necessity of renovating the
tissues--i.e., I was hungry, and was about to rouse up our old French
cook, when my uncle, Professor Von Hardwigg, suddenly opened the street
door, and came rushing upstairs.
This paragraph is not part of Jules Verne's "Voyage to the Center of the Earth"
but is here because we need a place to test for numbers 1234591823 and 75% percentages and dates
12/24/2005 and currency $123.95. It probably wouldn't hurt to put in a phone number --
how about a fancy phone number (555)-723-3938 and then a simple phone number 555-867-5309
and maybe a zip+4 73823-9321. And here's a repeating pattern
test: it was a very very big elephant!
Now Professor Hardwigg, my worthy uncle, is by no means a bad sort of
man; he is, however, choleric and original. To bear with him means to
obey; and scarcely had his heavy feet resounded within our joint
domicile than he shouted for me to attend upon him.
As a new concept is introduced you will be presented with a hyperlink which will allow you to enter and test a regular expression. Simply click on the hyperlink and enter the expression. Go ahead and try it now. Enter different words and substrings and see what happens. For your convenience the test script uses an "i" modifier so all of your tests will be case insensitive.

Let's start at the very beginning! A very good place to start! You can use the caret (^) to specify the start of a string. In our example, various lengths of ^looking will all return true, from ^l to ^look to /^looking back to all/. And back will be true because back is a substring of our text, but ^back will be false because our text does not begin with the word back! Go ahead and try it out for yourself.

Endings are good too! We use the dollar sign ($) to specify what the string must end with. Here various lengths of "attend upon him." will return true. Note that a period is a metachar, so it needs to be escaped. Or not. A period is basically a wildchar: as long as there's a character where the period is (anything except a line break), it doesn't matter what that character is. So /attend upon him.$/ will be true and so will /attend upon him\.$/; only the escaped version insists on a literal period. Also note that the $ is at the END of the expression -- because we're matching the end of the string and not the beginning of the end (heh). Go ahead and try it out for yourself.

As we learned above, the period (.) is a wildchar that matches a single character. If you type uncle into the test dialog it will come back true because uncle is a word in our test string. We can also type in u...e and match any five character substring which begins with a u and ends with an e. Hmm, I wonder what word that would be! Now typing 3 periods is a lot of work. How about if we only type one period and say it represents 3 characters? We'd do that by typing in u.{3}e. Well, that's actually a bit more work now, isn't it.
But still, with u.{3}e you're saying find a substring that begins with a u, is followed by any 3 characters, and then an e. Go ahead and try it! You'll see that u...e is exactly the same as u.{3}e!

OK, so the period means we don't care what character is here, but what if we do? h.m returns "hem" at position 208. But what if we were looking for occurrences of him? We can do that by building a character class. h[aeiou]m will match (true) any three letter substring which begins with an h, ends with an m and has a vowel in between. Our example dialog will match "hem" because that's the first occurrence which returns true. If you take the "e" out of the list, as in h[aiou]m, "hem" isn't matched but "him" is.

The caret (^) pulls double duty in character class definitions. When used as the first character between the brackets it means *NOT*, so h[^e]m means match any 3 character substring that starts with h, ends with m and doesn't have an e in the middle: "him". Likewise h[^i]m says that the middle character can't be an i, so it returns hem.

Now why is this useful? Because of search and replace. If you do... str.replace(/h[ei]m/g,'her') then all occurrences of hem or him will be replaced with "her". So this isn't algebra; you actually will use it when you grow up and get out of school.

To simplify things, a character class can use a hyphen to define a range. Which means 7[0123456789]% is the same as 7[0-9]%. Did you figure out what that one does on your own? Did you?! If you guessed match any substring that starts with a 7, has any digit from zero to 9, and is then followed by a %, you're right! In this case 75%, which we so crassly inserted into Jules Verne's masterpiece. You can also, of course, use letter ranges like [a-z]. Also note that if you're not using the case-insensitive modifier then [A-Z] becomes much different than [a-z]!

As mentioned above, the caret (^) takes on a different meaning inside the square brackets which define a character class.
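
The anchors, the wildcard and the character classes covered so far can be tried outside the test dialog with a sketch like the one below. It is only an illustration: the text is a condensed stand-in for the example passage (so positions will differ from the full text), and firstMatch is a hypothetical helper, not part of javascript.

```javascript
// Condensed stand-in for the example text (an assumption made for brevity).
var text = "Looking back to all that has occurred to me since that eventful day, " +
           "I am bewildered when I think of them. My uncle was a German. " +
           "We need a place to test for 75% percentages. " +
           "He shouted for me to attend upon him.";

// Hypothetical helper: first case-insensitive match of a pattern, or null.
function firstMatch(pattern) {
    var result = new RegExp(pattern, "i").exec(text);
    return result ? result[0] : null;
}

// Anchors: ^ pins the match to the start, $ to the end.
var startsWithLooking = /^looking/i.test(text);
var startsWithBack = /^back/i.test(text);        // "back" occurs, but not at the start
var endsWithHim = /attend upon him\.$/i.test(text);

// Wildcard and quantifier: u...e and u.{3}e are the same pattern.
var fiveChars = firstMatch("u...e");
var quantified = firstMatch("u.{3}e");

// Character classes, a negated class, and a range.
var anyVowel = firstMatch("h[aeiou]m");
var notE = firstMatch("h[^e]m");
var percent = firstMatch("7[0-9]%");

// Search and replace using a class.
var swapped = text.replace(/h[ei]m/g, "her");

console.log(startsWithLooking, startsWithBack, endsWithHim);
console.log(fiveChars, quantified, anyVowel, notE, percent);
console.log(swapped);
```

The helper builds its pattern with the RegExp constructor because the pattern arrives as a string, which is the dynamic case the constructor exists for.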
Inside a character class the $ also stops meaning the end of a string and goes back to meaning, well, a dollar sign. Likewise the . stops being a wild card and starts being a period.

One of the more powerful characters is the vertical pipe (|) character, which allows you to specify alternatives in an expression. For instance, in our example text there is no word aunt. But there is a word uncle. We can write an expression which will look for either aunt or uncle using the pipe character: aunt|uncle. Used in a replace we could do 'uncle|sister' and replace both words with 'relative' as such: str.replace(/(uncle|sister)/ig,'relative');

Of course we can use character classes with pipes! For instance, if we were not doing case insensitive searches you could see if the first character of our test string was upper case or a number: ^([0-9]|[A-Z]). The caret means start of string, then we look for either a 0-9 or A-Z as the first character; if the string begins with a lower case letter then this will return false. Again, you can't really test this with the dialog box because it's forced to be case-insensitive, so here you'll just have to take my word on it.

Now we're getting to a spot where we can start doing some very useful things with regular expressions. Remember up above when it was shown that .{3} was the same as ...? Well, {} is a quantifier, which means it looks for repeating sequences. Specifically {n} means exactly n times; {n,m} means at least n times but no more than m times; {n,} means at least n times. There is also * which is the same as {0,}; + which is the same as {1,}; and ? which is the same as {0,1}.

Too much? OK! Example time! A phone number is 3 digits (area code) followed by 3 digits and then 4. So we should be able to find a phone number like this: [0-9]{3}-[0-9]{3}-[0-9]{4}. This finds Jenny's phone number, which is 555-867-5309, but it doesn't find the first one with the parentheses around the area code. For that we use the ?, which means either 0 or 1 (note that we'll have to escape the parentheses with a backslash so the regular expression knows we're looking for that character instead of creating a group): \(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}. Ah, now we found the first phone number. The regular expression can now cope with parentheses around the area code and it can work just as well if the parentheses are not there.

And in the useful tidbit area, we've used a regular expression to discover if a telephone number exists inside the string and to extract it. Likewise, if we were doing a form we could do something like this: str.replace(/[^0-9]/g,'') to strip any non-numerical character out of the field.

To make things a little bit easier, regexps recognize \d as being the same as [0-9], so you can use \d if you're looking for a number. This means our telephone number checker can also be expressed as such: \(?\d{3}\)?-\d{3}-\d{4}

Parentheses can be used to create sub-patterns. For instance a (very )+big will match "a very very big" in our elephant sentence because we sub-grouped very and said there could be 1 or more occurrences between a and big. Likewise, when you're using a pipe to set up alternates it's usually wise to encase the alternates in parentheses. So if you're looking for (elephant|mouse) it helps to keep all the different elements separate.

And this concludes our introduction to regular expressions. Hopefully you'll now be able to look at one of these beasts and have a general idea of what it does, and even figure out how to write your own! What is going to throw you off the most as you get started is the multitude of escaping that will tend to complicate the expression. For instance \d{2}[\/\.\-]\d{2}[\/\.\-]\d{4}. This looks hideously complicated until you realize that a lot of the expression is just escaping the special characters. Here's what it would look like if you didn't have to escape the meta characters: \d{2}[/.-]\d{2}[/.-]\d{4} Now that's much less complicated, isn't it?
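
Alternation, quantifiers, \d and grouping can be seen together in a sketch like this one. The text is again a condensed stand-in for the example passage, and the variable names are invented for illustration. The date pattern escapes only the slash, since inside a character class the period and a trailing hyphen are already literal.

```javascript
// Condensed stand-in for the example text (an assumption made for brevity).
var text = "My uncle married my mother's sister. A date 12/24/2005, " +
           "a fancy phone number (555)-723-3938 and a simple phone number 555-867-5309. " +
           "It was a very very big elephant!";

// Alternation: either word is replaced.
var relatives = text.replace(/(uncle|sister)/ig, "relative");

// Quantifiers: exactly 3, 3 and 4 digits separated by hyphens.
var simplePhone = text.match(/[0-9]{3}-[0-9]{3}-[0-9]{4}/)[0];

// Optional, escaped parentheses plus the \d shorthand.
var fancyPhone = text.match(/\(?\d{3}\)?-\d{3}-\d{4}/)[0];

// Strip everything that is not a digit, as you might for a form field.
var digitsOnly = fancyPhone.replace(/[^0-9]/g, "");

// Grouping: one or more repetitions of "very ".
var repeated = text.match(/a (very )+big/)[0];

// Date with /, . or - as the separator.
var date = text.match(/\d{2}[\/.-]\d{2}[\/.-]\d{4}/)[0];

console.log(relatives);
console.log(simplePhone, fancyPhone, digitsOnly);
console.log(repeated, date);
```

Each match call here is known to succeed on this text; with arbitrary input, match returns null when nothing is found, so check the result before reading [0].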
To read that date pattern: \d stands for a digit, so we look for two digits, then one of the separator characters /, . or - (02-, 02/, 02.), then 2 more digits, then the separator characters again, then four digits.

There's a TON more things regular expressions are capable of, and don't be too surprised if you find something that looks like it was written by ET in a drunken stupor! But at least you've taken the first big step toward understanding the Sanskrit. Figuring out how to divine your mother's astrological sign from two letters and a malformed pixel will just have to wait until I get around to writing Intermediate Regular Expressions.

Before I leave you to your own devices I'll offer you the following three functions, which extend javascript's string object with features which should have been part of the language but, at the time of writing, are not.

String.prototype.trim = function() {
    // ^  = start of string
    // \s = whitespace
    // +  = one or more occurrences
    // |  = alternate
    // \s = whitespace
    // +  = one or more occurrences
    // $  = at end of string
    // g  = global -- replace more than one occurrence
    return this.replace(/^\s+|\s+$/g,'');
}
String.prototype.rtrim = function() {
    // \s = whitespace
    // +  = one or more occurrences
    // $  = at end of string
    // g  = global -- replace more than one occurrence
    return this.replace(/\s+$/g,'');
}
String.prototype.ltrim = function() {
    // ^  = at beginning of string
    // \s = whitespace
    // +  = one or more occurrences
    // g  = global -- replace more than one occurrence
    return this.replace(/^\s+/g,'');
}
str=" The quick brown fox jumped over the lazy dog! ";
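
To see what the three functions do to a padded string, here is a small sketch. So that it stands alone and leaves String.prototype untouched, it restates the same regular expressions as plain functions; the names stripBoth, stripRight and stripLeft are hypothetical, as is the exact amount of padding in the sample string.

```javascript
// Plain-function versions of trim, rtrim and ltrim (names are hypothetical).
function stripBoth(s) { return s.replace(/^\s+|\s+$/g, ''); }
function stripRight(s) { return s.replace(/\s+$/, ''); }
function stripLeft(s) { return s.replace(/^\s+/, ''); }

// Sample string with three spaces of padding on each side (an assumption).
var padded = "   The quick brown fox jumped over the lazy dog!   ";

// Brackets make any remaining whitespace visible.
console.log('[' + stripBoth(padded) + ']');
console.log('[' + stripRight(padded) + ']');
console.log('[' + stripLeft(padded) + ']');
```

Only the two-sided version needs the g modifier, because it is the only pattern that can match twice in one string; the one-sided patterns are anchored and match at most once.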