Encyclopedia of Regular Expressions for SEO
Last Updated: May 10, 2012
Regular expressions (regex) are a topic that gives me the creeps every time (even though I have dared to write about them). I don't think regex can be learned in a day or two, or mastered even in a month. I am not a web developer, and I have neither the capacity nor the inclination to write a 544-page manual on regex like this guy did. Instead, I am going to explain the building blocks of a regular expression so that you can read and understand them. I will also cover the basics of .htaccess, as it goes hand in hand with regex. If you wish to dive deeper, I recommend Mastering Regular Expressions by Jeffrey E. F. Friedl.
What is a Regular Expression?
It is an expression which is used to check for a pattern in a string. For example, ^Colou?r$ is a regular expression which matches both of the strings ‘Color’ and ‘Colour’ (matching is case-sensitive by default). A regex is made up of characters and metacharacters.
What are Metacharacters?
These are the characters which have a special meaning in regex. They are the building blocks of a regex. For example: [], ^, (), {}, $, +, * etc.
Building Blocks of a Regular Expression
- []
- ^
- ()
- $
- +
- ?
- .
- *
- |
- \
- !
1. ‘[]’ – Square brackets are used to check for any single character in the character set specified in []. For example:
- [a] => Check for a single character which is the lowercase letter ‘a’.
- [ab] => Check for a single character which is either the lowercase letter ‘a’ or ‘b’.
- [aB] => Check for a single character which is either the lowercase letter ‘a’ or the uppercase letter ‘B’.
- [1B] => Check for a single character which is either the number ‘1’ or the uppercase letter ‘B’.
- [Dog] => Check for a single character which can be any one of the following: uppercase letter ‘D’, lowercase letter ‘o’ or lowercase letter ‘g’.
- [123b] => Check for a single character which can be any one of the following: number ‘1’, number ‘2’, number ‘3’ or lowercase letter ‘b’.
- [1-3] => Check for a single character which can be any one of the numbers 1, 2 and 3.
- [0-9] => Check for a single character which is a digit (0 to 9).
- [a-d] => Check for a single character which can be any one of the following lowercase letters: ‘a’, ‘b’, ‘c’ or ‘d’.
- [a-z] => Check for a single character which is a lowercase letter.
- [A-Z] => Check for a single character which is an uppercase letter.
- [A-T] => Check for a single character which can be any uppercase letter from ‘A’ to ‘T’.
- [home.php] => Check for a single character which can be any one of the following: ‘h’, ‘o’, ‘m’, ‘e’, ‘.’ or ‘p’. Repeated letters add nothing, and inside square brackets ‘.’ is a literal dot. This does not check for the string ‘home.php’.
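A quick way to convince yourself of how character sets behave is to try a few in Python's re module. This is only a sketch with made-up sample strings; Apache and Google Analytics use their own regex engines, but these basics behave the same way there.

```python
import re

# A character set matches exactly ONE character from the set.
one_char = re.fullmatch(r"[ab]", "a") is not None      # single 'a' matches
two_chars = re.fullmatch(r"[ab]", "ab") is not None    # two characters do not
wrong_case = re.fullmatch(r"[a-z]", "B") is not None   # sets are case-sensitive

# [home.php] is a set of single characters, not the word 'home.php'.
# Inside the brackets the dot is a literal dot.
set_hits = re.findall(r"[home.php]", "hello.php")

print(one_char, two_chars, wrong_case)  # True False False
print(set_hits)  # ['h', 'e', 'o', '.', 'p', 'h', 'p']
```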
2. ‘^’ – This is known as the ‘caret’ and is used to anchor a pattern to the beginning of a string (or line). For example:
- ^Colou?r => Check for a string which starts with ‘Color’ or ‘Colour’.
- ^Nov(ember)? => Check for a string which starts with ‘Nov’ or ‘November’.
- ^elearning.*\.html => Check for a string which starts with ‘elearning’, followed by any characters and then ‘.html’ (for example ‘elearning.html’).
- ^.*\.php => Check for a string which starts with any characters followed by ‘.php’.
- ^product-price\.php => Check for a string which starts with ‘product-price.php’.
The caret also means NOT when used immediately after the opening square bracket. For example:
- [^a] => Check for any single character other than the lowercase letter ‘a’.
- [^B] => Check for any single character other than the uppercase letter ‘B’.
- [^1] => Check for any single character other than the number ‘1’.
- [^ab] => Check for any single character other than the lowercase letters ‘a’ and ‘b’.
- [^aB] => Check for any single character other than the lowercase letter ‘a’ and the uppercase letter ‘B’.
- [^1B] => Check for any single character other than the number ‘1’ and the uppercase letter ‘B’.
- [^Dog] => Check for any single character other than the following: uppercase letter ‘D’, lowercase letter ‘o’ and lowercase letter ‘g’.
- [^123b] => Check for any single character other than the following: number ‘1’, number ‘2’, number ‘3’ and lowercase letter ‘b’.
- [^1-3] => Check for any single character other than the following: number ‘1’, number ‘2’ and number ‘3’.
- [^0-9] => Check for any single character which is not a digit.
- [^a-z] => Check for any single character which is not a lowercase letter.
- [^A-Z] => Check for any single character which is not an uppercase letter.
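The two meanings of the caret are easy to mix up, so here is a small Python sketch (sample strings are invented for illustration) showing the caret as a start-of-string anchor and as NOT inside square brackets.

```python
import re

# Outside brackets: ^ anchors the pattern to the start of the string.
starts_with_nov = re.compile(r"^Nov(ember)?")
anchored_hit = starts_with_nov.search("November sale") is not None
anchored_miss = starts_with_nov.search("Late November") is not None

# Inside brackets, as the first character: ^ negates the character set.
non_digits = re.findall(r"[^0-9]", "a1b2")

print(anchored_hit, anchored_miss)  # True False
print(non_digits)  # ['a', 'b']
```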
3. ‘()’ – These are known as parentheses and are used to group characters into a string. For example:
(a) => Check for the string ‘a’
(ab) => Check for the string ‘ab’
(dog) => Check for the string ‘dog’
(dog123) => Check for the string ‘dog123’
(0-9) => Check for the string ‘0-9’ (a hyphen only means a range inside square brackets)
(A-Z) => Check for the string ‘A-Z’
(a-z) => Check for the string ‘a-z’
(123dog588) => Check for the string ‘123dog588’
Note: () is also used to capture the matched text and store it in a variable. For example: ^(.*)$
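To see what "store in a variable" means in practice, here is a minimal Python sketch. The file name is a made-up example; the text matched by the parentheses is read back with group(1).

```python
import re

# The parentheses capture whatever matched inside them.
match = re.match(r"^(.*)\.html$", "elearning.html")
captured = match.group(1)   # text matched by the first pair of parentheses
whole = match.group(0)      # the entire matched string

print(captured)  # elearning
print(whole)     # elearning.html
```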
4. ‘$’ – It is used to anchor a pattern to the end of a string (or line). For example:
- Colou?r$ => Check for a string which ends with ‘Color’ or ‘Colour’.
- Nov(ember)?$ => Check for a string which ends with ‘Nov’ or ‘November’.
- elearning.*\.html$ => Check for a string which contains ‘elearning’ and ends with ‘.html’ (for example ‘elearning.html’).
- .*\.php$ => Check for a string which ends with ‘.php’, i.e. any PHP file.
- product-price\.php$ => Check for a string which ends with ‘product-price.php’.
5. ‘+’ – It is used to check for one or more occurrences of the preceding character, character set or group. For example:
[a]+ => Check for one or more occurrences of the lowercase letter ‘a’.
[dog]+ => Check for one or more of the lowercase letters ‘d’, ‘o’ or ‘g’.
[548]+ => Check for one or more of the numbers ‘5’, ‘4’ or ‘8’.
[0-9]+ => Check for one or more digits.
[a-z]+ => Check for one or more lowercase letters.
[^a-z]+ => Check for one or more characters which are not lowercase letters.
[a-zA-Z]+ => Check for any combination of uppercase and lowercase letters.
[a-z0-9]+ => Check for any combination of lowercase letters and numbers.
[^9]+ => Check for one or more characters which are not the number 9.
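Here is a short Python sketch of ‘+’ applied to character sets. The input strings are invented; the point is that ‘+’ needs at least one character and then grabs the whole run.

```python
import re

# [a-zA-Z]+ picks out runs of one or more letters.
letter_runs = re.findall(r"[a-zA-Z]+", "seo2012 Regex!")

# [0-9]+ picks out runs of one or more digits.
digit_runs = re.findall(r"[0-9]+", "page 12 of 340")

# '+' requires at least one occurrence, so an empty string never matches.
empty_matches = re.fullmatch(r"[0-9]+", "") is not None

print(letter_runs)    # ['seo', 'Regex']
print(digit_runs)     # ['12', '340']
print(empty_matches)  # False
```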
6. ‘?’ – It is used to check for zero or one occurrence of the preceding character, character set or group. For example:
[a]? => Check for zero or one occurrence of the lowercase letter ‘a’.
[dog]? => Check for zero or one occurrence of the lowercase letter ‘d’, ‘o’ or ‘g’.
[^dog]? => Check for zero or one occurrence of a character which is not the lowercase letter ‘d’, ‘o’ or ‘g’.
[0-9]? => Check for zero or one occurrence of a digit.
[^a-z]? => Check for zero or one occurrence of a character which is not a lowercase letter.
Note: ? makes the preceding character or group optional. For example, the regular expression ^colou?r$ matches both ‘color’ and ‘colour’. Similarly, the regular expression ^Nov(ember)? 28(th)?$ matches ‘Nov 28’, ‘November 28’, ‘Nov 28th’ and ‘November 28th’.
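The optional-group idea can be checked with a small Python sketch. The pattern below assumes a ‘?’ after the (ember) group and a space before 28, which is what a date like ‘Nov 28th’ needs; the list of test dates is invented.

```python
import re

# Both groups are optional thanks to '?'.
date_pattern = re.compile(r"^Nov(ember)? 28(th)?$")

dates = ["Nov 28", "November 28", "Nov 28th", "November 28th", "Nov 2", "nov 28"]
matched_dates = [d for d in dates if date_pattern.match(d)]

# 'nov 28' is rejected because matching is case-sensitive by default.
print(matched_dates)  # ['Nov 28', 'November 28', 'Nov 28th', 'November 28th']
```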
7. ‘.’ – It is used to check for any single character except a line break. For example:
the regular expression Action ., Scene2 would match ‘Action 1, Scene2’, ‘Action A, Scene2’ and ‘Action 9, Scene2’, but not ‘Action 10, Scene2’ or ‘Action AB, Scene2’.
8. ‘*’ – It is used to check for any number of occurrences (including zero occurrences) of the preceding character, character set or group. For example, 31* would match 3, 31, 311, 3111, 31111 etc.
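The dot and the star are the two metacharacters most often combined (as ‘.*’), so it helps to see each on its own first. A minimal Python sketch with invented test strings:

```python
import re

# '.' stands for exactly one character.
dot_pattern = re.compile(r"^Action ., Scene2$")
dot_inputs = ["Action 1, Scene2", "Action A, Scene2", "Action 10, Scene2"]
dot_hits = [s for s in dot_inputs if dot_pattern.match(s)]

# '*' repeats only the item directly before it (here the '1'), zero or more times.
star_pattern = re.compile(r"^31*$")
star_inputs = ["3", "31", "3111", "313", "1"]
star_hits = [s for s in star_inputs if star_pattern.match(s)]

print(dot_hits)   # ['Action 1, Scene2', 'Action A, Scene2']
print(star_hits)  # ['3', '31', '3111']
```

Put together, ‘.*’ means "any character, any number of times", which is why it is the usual way to say "anything" in a pattern.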
9. ‘|’ – It is the logical OR. For example:
(His|Her) => Check for the string ‘His’ or ‘Her’. Do not put spaces around the |, because a space inside a pattern is matched literally.
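Spaces around the pipe are a common trap, because a space in a pattern is a character to be matched like any other. This Python sketch (with invented inputs) compares the two spellings:

```python
import re

tight = re.compile(r"^(His|Her)$")
spaced = re.compile(r"^(His | Her)$")   # the spaces become part of each alternative

words = ["His", "Her", "his", "His ", " Her"]
tight_hits = [w for w in words if tight.match(w)]
spaced_hits = [w for w in words if spaced.match(w)]

print(tight_hits)   # ['His', 'Her']
print(spaced_hits)  # ['His ', ' Her']
```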
10. ‘\’ – It is the escaping character which is used to escape from the normal way a subsequent character is interpreted. For e.g.
the regular expression: ^www\.abc\.com$ matches www.abc.com
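Forgetting the backslash rarely causes an error; the pattern just matches more than you intended. A small Python sketch (the second host name is a deliberately silly invented string) shows the difference:

```python
import re

escaped = re.compile(r"^www\.abc\.com$")    # dots are literal dots
unescaped = re.compile(r"^www.abc.com$")    # dots mean 'any character'

hosts = ["www.abc.com", "wwwXabcYcom"]
escaped_hits = [h for h in hosts if escaped.match(h)]
unescaped_hits = [h for h in hosts if unescaped.match(h)]

print(escaped_hits)    # ['www.abc.com']
print(unescaped_hits)  # ['www.abc.com', 'wwwXabcYcom']
```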
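11. ‘!’ – It is the logical NOT in mod_rewrite. It is not a regex metacharacter: unlike ^ (caret), it is used only as a prefix placed before the whole pattern of a rule or a condition, and it makes the rule or condition apply when the pattern does not match. For example:
!^abc$ => As a rule or condition pattern, it matches any string which is not the string ‘abc’.
!^[0-9]+$ => As a rule or condition pattern, it matches any string which is not made up of digits only.
Note: ‘!’ has no special meaning inside square brackets. [!0-9] checks for a single character which is either ‘!’ or a digit. To check for a single character which is not a digit, use [^0-9].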
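12. White Space – To match a white space in a regular expression, just use the white space. For example:
(Himanshu Sharma) => Check for the string ‘Himanshu Sharma’
Note: in a .htaccess file a space separates the parts of a directive, so a pattern that contains a space must be wrapped in double quotes or have the space escaped with a backslash.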
Some Regex Examples
^(.*)\.html$ => Check for any number of characters before .html and store them in a variable.
^dog$ => Check for the string ‘dog’
^a+$ => Check for one or more occurrences of the lowercase letter ‘a’
^(abc)+$ => Check for one or more occurrences of the string ‘abc’.
^[a-z]+$ => Check for one or more occurrences of a lowercase letter.
^(abc)*$ => Check for any number of occurrences of the string ‘abc’.
^a*$ => Check for any number of occurrences of the lowercase letter ‘a’
Q. Find all the files which start with ‘elearning’ and which have the ‘.html’ file extension
^elearning.*\.html$
Q. Find all the PHP files
^.*\.php$
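The two questions above can be tried against a list of file names. The sketch below uses Python's re module and invented file names, and it writes "anything" as ‘.*’ (a dot followed by a star), because a star needs something in front of it to repeat. In Python a star placed directly after the caret is rejected outright; other regex engines may react differently.

```python
import re

files = ["elearning.html", "elearning-seo.html", "elearning.php",
         "index.php", "my-elearning.html"]

# Files which start with 'elearning' and end with '.html'.
elearning_html = [f for f in files if re.search(r"^elearning.*\.html$", f)]

# All PHP files.
php_files = [f for f in files if re.search(r"^.*\.php$", f)]

# A star with nothing to repeat is an error in Python's engine.
try:
    re.compile(r"^*\.php$")
    star_after_caret_ok = True
except re.error:
    star_after_caret_ok = False

print(elearning_html)       # ['elearning.html', 'elearning-seo.html']
print(php_files)            # ['elearning.php', 'index.php']
print(star_after_caret_ok)  # False
```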
mod_rewrite
It is an Apache module written in the C programming language (‘mod_rewrite.c’). This module works only with Apache server 1.2 or later, and its directives can be placed in the .htaccess file (an ASCII file which contains configuration directives and rules for files and folders). Through this module you can:
- Re-Write URLs
- Redirect URLs
- Solve Canonical URL issues
- Solve Hot linking issues
- Block visitors from accessing a particular folder, file or the whole website.
- Create custom 403 and 404 pages.
- Deliver content on the basis of the IP address. The benefits are endless.
Types of Configuration Directives
There are 9 types of configuration directives:
- RewriteEngine
- RewriteOptions
- RewriteLog
- RewriteLogLevel
- RewriteLock
- RewriteMap
- RewriteBase
- RewriteRule
- RewriteCond
But here we will talk about only three directives: RewriteEngine, RewriteRule and RewriteCond. I have not found any use for the other directives so far. But if you think other directives can be helpful for an SEO, then please let me know in the comments below.
RewriteEngine
This configuration directive is used to enable or disable the mod_rewrite module.
Syntax: RewriteEngine on|off
Default Value: RewriteEngine off
That’s why in the .htaccess file we first enable the mod_rewrite module by adding the following code:
Options +FollowSymLinks
RewriteEngine on
RewriteRule
This configuration directive tells the server to interpret the given statement as a rule.
Syntax: RewriteRule <pattern> <substitution> [FLAGS]
Here pattern is a regular expression and substitution is a URL.
FLAGS can be [R], [F], [NC], [QSA], [L], [OR] etc.
[R] => Redirect. Its default status code is 302. It can be assigned any redirect status code from 300 to 399. For example:
RewriteRule ^index\.html$ /index.php [R=301]
[F] => Forbidden. It is generally used with a hyphen (-) as the substitution. The hyphen tells the server not to perform any substitution. This flag tells the server not to fulfill the request and to return the ‘403’ response code. For example:
RewriteRule ^product-price\.php$ - [F]
[NC] => No case. It tells the server to ignore uppercase or lowercase when checking for patterns. For example:
RewriteRule ^him.*\.php$ - [NC]
[QSA] => Query string append. It tells the server to pass the query string from the original URL to the new URL.
[L] => Last rule. This flag tells the server not to process any more rules.
[OR] => Logical OR. This flag is used as a logical OR between RewriteCond statements.
RewriteCond
This configuration directive tells the server to interpret the given statement as a condition for the rule which immediately follows it.
Syntax: RewriteCond <test string> <condition pattern> [FLAGS]

Here mod_rewrite first matches the requested URL against the pattern of the rule. If the URL does not match the pattern, then mod_rewrite processes the next rule. If the URL matches the pattern, then mod_rewrite looks for the corresponding RewriteCond statements. If no corresponding RewriteCond exists, then the matched URL is replaced by the substitution.
If corresponding RewriteCond statements exist, then they are processed in the order in which they appear, from top to bottom. Each RewriteCond is processed by matching its test string against its condition pattern. If the test string doesn’t match the condition pattern, then mod_rewrite processes the next rule; otherwise it processes the next RewriteCond. When all the RewriteCond statements have matched, the matched URL is replaced by the substitution. A test string can be:
1. A simple text
2. RewriteRule back reference
3. RewriteCond back reference
4. Server Variable
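The processing order described above (rule pattern first, then each condition from top to bottom, then the substitution) can be sketched as a toy model in Python. This is not how Apache is implemented: apply_rule and the ‘/sorry/’ path are hypothetical names made up for illustration, flags such as [OR] and [NC] are ignored, and only server variables are supported as test strings.

```python
import re

def apply_rule(path, rule_pattern, substitution, conditions, server_vars):
    """Toy model of one RewriteRule with its RewriteCond lines.

    Returns the substituted path, or None when the rule does not apply.
    """
    rule_match = re.search(rule_pattern, path)
    if rule_match is None:
        return None  # rule pattern failed: move on to the next rule

    # Conditions are checked top to bottom; all of them must match.
    for variable_name, cond_pattern in conditions:
        test_string = server_vars.get(variable_name, "")
        if re.search(cond_pattern, test_string) is None:
            return None  # one condition failed: move on to the next rule

    def expand(ref):
        # $0 is the whole match, $1..$9 are the captured groups.
        index = int(ref.group(1))
        if index > rule_match.re.groups:
            return ""
        return rule_match.group(index) or ""

    return re.sub(r"\$([0-9])", expand, substitution)


conditions = [("REMOTE_ADDR", r"^12\.34\.56\.78$")]
rewritten = apply_rule("product-prices.html", r"^(.*)\.html$", "/sorry/$1",
                       conditions, {"REMOTE_ADDR": "12.34.56.78"})
untouched = apply_rule("product-prices.html", r"^(.*)\.html$", "/sorry/$1",
                       conditions, {"REMOTE_ADDR": "98.76.54.32"})

print(rewritten)  # /sorry/product-prices
print(untouched)  # None
```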
RewriteRule Back Reference
It is of the form $N, where N can be any number from 0 to 9. $1 to $9 denote the variables which were created by the parentheses in the RewriteRule pattern, and $0 denotes the whole matched text. For example:
RewriteRule ^(.*)$ /index.php/$1 [L]
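The same capture-and-reuse idea exists in most regex tools. As a rough Python equivalent of the rule above (a sketch only; the rewrite function and sample path are made up), re.sub writes the back reference as \1 where mod_rewrite writes $1:

```python
import re

def rewrite(path):
    # (.*) captures the whole path; \1 puts it back after /index.php/
    return re.sub(r"^(.*)$", r"/index.php/\1", path, count=1)

rewritten_path = rewrite("about/team")
print(rewritten_path)  # /index.php/about/team
```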
RewriteCond Back Reference
It is of the form %N, where N can be any number from 1 to 9. It is used to denote that variable which was created in the ‘condpattern’ from the last matched ‘RewriteCond’. For e.g.
RewriteCond %{HTTP_HOST} ^(123\.42\.162\.7)$
RewriteCond %1 ^123\.42\.162\.7$
RewriteRule ……………..
Server Variable
Syntax: %{Variable_Name}
E.g.
1. %{HTTP_HOST} – This variable returns the host name (domain) which the visitor requested.
2. %{HTTP_USER_AGENT} – This variable returns the visitor’s user agent string, which identifies the browser and operating system.
3. %{QUERY_STRING} – This variable returns the query string.
4. %{HTTP_REFERER} – This variable returns the URL of the referrer.
5. %{REMOTE_ADDR} – This variable returns the IP address of the visitor.
Examples
Example-1: Redirect all requests for pages in the media folder to a new page ‘media.html’.
RewriteRule ^media/ /media.html [R=301,L]
Example-2: Redirect oldaddress.html page to newaddress.html page
RewriteRule ^oldaddress\.html$ /newaddress.html [R=301,L]
Example-3: Redirect one website to another
Redirect 301 / http://www.anotherwebsite.com/
Example-4: Redirect abc.com/index.html to www.abc.com
RewriteCond %{HTTP_HOST} ^abc\.com$ [NC]
RewriteRule ^index\.html$ http://www.abc.com/ [R=301,L]
Example-5: Block a visitor with the IP address 12.34.56.78 from viewing your file product-prices.html
RewriteCond %{REMOTE_ADDR} ^12\.34\.56\.78$
RewriteRule ^product-prices\.html$ - [F]
Example-6: Block a visitor with the IP address 12.34.56.78 from viewing your folder ‘sales-demo’
RewriteCond %{REMOTE_ADDR} ^12\.34\.56\.78$
RewriteRule ^sales-demo/ - [F]
Example-7: Block a visitor with the IP address 12.34.56.78 from viewing your website www.abc.com
RewriteCond %{REMOTE_ADDR} ^12\.34\.56\.78$
RewriteRule ^.*$ - [F]
Regular Expressions and Google Analytics
There are many cases where regular expressions are very useful in Google Analytics. Some of these cases are:
1. Setting up a goal which should match multiple goal pages instead of one.
2. Setting up a funnel in which a step should match multiple pages instead of one. In fact, when you set up a funnel, all URLs are treated as regular expressions.
3. Excluding traffic from an IP address range via filters. In fact, there are many filters which require regular expressions. Big organizations generally own a range of IP addresses. Therefore, to exclude an organization’s internal traffic you need to specify an IP range using regex.
4. Setting up advanced segments. For example, the following regex can segment all the traffic coming from social media sites:
twitter\.com|facebook\.com|linkedin\.com|plus\.google\.com|t\.co|bit\.ly|reddit\.com
Note: You can use regex-equipped advanced segments to unleash the power of long tail keywords and determine whether these keywords are worth chasing. You can also use regex to segment important data through advanced segments.
5. Rewriting URLs in Google Analytics reports.
You can rewrite URLs in Google Analytics reports with the ‘Search & Replace’ advanced filter. This comes in handy when your website has very long, ugly dynamic URLs and you can’t figure out what a page is about just by looking at its URL. So, for example, with the ‘Search & Replace’ advanced filter you can ask GA to report the following URL:
http://www.abc.com/fder/?catg=2341&pid=428
as
http://www.abc.com/outdoor/fleeces
6. Filtering data within the GA report interface.
You can use the following regular expressions to filter keywords on the Google Analytics reporting interface:
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){0}$ => Filter 1 word keyword phrases
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){1}$ => Filter 2 word keyword phrases
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){2}$ => Filter 3 word keyword phrases
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){3}$ => Filter 4 word keyword phrases
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){4}$ => Filter 5 word keyword phrases
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){5}$ => Filter 6 word keyword phrases
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){6}$ => Filter 7 word keyword phrases
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){7}$ => Filter 8 word keyword phrases
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){8}$ => Filter 9 word keyword phrases
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){9}$ => Filter 10 word keyword phrases
^([^ ]+ ){4,10}[^ ]+$ – Filter keywords that have between 4 and 10 spaces in them (i.e. 5 to 11 words). This regex can help you in determining long tail keywords on your website.
^/([^/]+/){3}[^/]*$ – Filter landing pages that have 4 slashes in their URL. This regex can help you in identifying low quality pages on your website.
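Before pasting these filters into a report, you can check them against a handful of sample keywords and page paths. The Python sketch below is only a rough test bench: word_count_filter is a hypothetical helper, the sample data is invented, and Google Analytics uses its own regex engine, so treat the results as indicative.

```python
import re

def word_count_filter(words):
    # Builds the N word keyword filter; the number in braces is always N - 1.
    return re.compile(r"^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){%d}$" % (words - 1))

keywords = ["seo", "seo tools", "google-analytics",
            "regular expressions for seo",
            "how to write a seo contract"]

one_word = [k for k in keywords if word_count_filter(1).match(k)]
# Dots, spaces and hyphens all count as word separators.
two_words = [k for k in keywords if word_count_filter(2).match(k)]

# Long tail filter: 4 to 10 spaces in the keyword.
long_tail = [k for k in keywords if re.match(r"^([^ ]+ ){4,10}[^ ]+$", k)]

# Landing pages with exactly 4 slashes in the path.
pages = ["/outdoor/fleeces", "/a/b/c/page.html", "/a/b/c/d/page.html"]
four_slashes = [p for p in pages if re.match(r"^/([^/]+/){3}[^/]*$", p)]

print(one_word)      # ['seo']
print(two_words)     # ['seo tools', 'google-analytics']
print(long_tail)     # ['how to write a seo contract']
print(four_slashes)  # ['/a/b/c/page.html']
```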
Related Tools:
- To learn more about regular expressions: http://www.regular-expressions.info/
- The Regex Coach is a graphical application for Windows which can be used to test regular expressions
- Regular Expression Checker - Chrome add-on to test regex
.htaccess
It is an ASCII file which contains configuration directives and rules for files, folders and the whole website. You can have more than one .htaccess file on a server. In fact, you can have one .htaccess file per folder/directory. When you put the file in a directory, the rules in it apply only to the files and sub-directories in that directory. When you put the file in the root directory, the rules in it apply to all the files and directories of the website. To use mod_rewrite rules, a .htaccess file should contain the following two lines:
Options +FollowSymLinks
RewriteEngine on
Last Updated: May 10, 2012
Regular Expression (or REGEX) is a topic which gives me creep all the time (even though i have dared to write about it). I think it is something which can’t be learned in day or two, not even in a month (at least not mastered). I am not a web developer and i have neither the capacity nor the inclination to write a 544 pages manual on REGEX like this guy did. I am going to explain the building blocks of a regular expression, so that you can read and understand them. I will also cover the basics of .htaccess as its goes hand to hand with REGEX. If you wish to dive deep, then i recommend Mastering Regular Expressions by Jeffrey E. F. Friedl.
What is Regular Expression?
It is an expression which is used to check for a pattern in a string. For e.g. ^Colou?r$ is a regular expression which matches both the string: ‘color’ and ‘colour’. A regex is made up of characters and metacharacters.
What are Metacharacters?
These are the characters which have special meaning in regex. They are the building blocks of a regex. For e.g. [], ^, (), {}, $, +, * etc.
Building Blocks of a Regular Expression
- []
^
- ()
- $
- +
- ?
- .
- *
- |
- \
- !
1. ‘[]‘ – This square bracket is used to check for any single character in the character set specified in []. For e.g:
- [a] => Check for a single character which is a lowercase letter ‘a’.
- [ab] => Check for a single character which is either a lower case letter ‘a’ or ‘b’.
- [aB] => Check for a single character which is either a lower case letter ‘a’ or uppercase letter ‘B’
- [1B] => Check for a single character which is either a number ’1′ or an uppercase letter ‘B’.
- [Dog] => Check for a single character which can be anyone of the following: uppercase letter ‘D’, lower case letter ‘o’ or lowercase letter ‘g’.
- [123b] => Check for a single character which can be anyone of the following: number ’1′, number ’2′, number ’3′ or lowercase letter ‘b’.
- [1-3] => Check for a single character which can be any one number from 1, 2 and 3.
- [0-9] => Check for a single character which is a number.
- [a-d] => Check for a single character which can be any one of the following lower case letter: ‘a’, ‘b’, ‘c’ or ‘d’.
- [a-z] => Check for a single character which is a lower case letter.
- [A-Z] => Check for a single character which is a upper case letter.
- [A-T] => Check for a single character which can be any uppercase letter from ‘A’ to ‘T’.
- [home.php] => Check for a single character which can be anyone of the following characters: lowercase letter ‘h’, lowercase letter ‘o’, lowercase letter ‘m’, lowercase letter ‘e’, special character ‘.’, lower case letter ‘p’, lowercase letter ‘h’ or lowercase letter ‘p’
2. ‘^’ – This is known as ‘Caret’ and is used to mark the beginning of a regular expression. For e.g.
- ^Colou?r => Check for a pattern which starts with ‘Color’ or ‘Colour’
- ^Nov(ember)? => Check for a pattern which starts with ‘Nov’ or ‘November’
- ^elearning*\.html => Check for a pattern which starts with ‘elearning.html’
- ^*\.php => Check for a pattern which starts with any php file
- ^product-price\.php => Check for a pattern which starts with ‘product-price.php’
Caret also means NOT when used after the opening square bracket. For e.g.
- [^a] => Check for any single character other than the lowercase letter ‘a’.
- [^B] = > Check for any single character other than the uppercase letter ‘B’.
- [^1] => Check for any single character other than the number ’1′
- [^ab] => Check for any single character other than the lower case letters ‘a’ and ‘b’
- [^aB] => Check for any single other than the lower case letter ‘a’ and uppercase letter ‘B’
- [^1B] => Check for any single character other than the number ’1′ and uppercase letter ‘B’
- [^Dog] => Check for any single character other than the following: uppercase letter ‘D’, lowercase letter ‘o’ and lowercase letter ‘g’.
- [^123b] => Check for any single character other than the following characters: number ’1′, number ’2′, number ’3′ and lowercase letter ‘b’.
- [^1-3] => Check for any single character other than the following: number ’1′, number ’2′ and number ’3′
- [^0-9] => Check for any single character other than the number.
- [^a-b] => Check for any single character which is not a lower case letter.
- [^A-Z] => Check for any single character which is not a upper case letter.
3. ‘()’ – This is known as parenthesis and is used to check for a string. For e.g.
(a) => Check for string ‘a’
(ab) => Check for string ‘ab’
(dog) => Check for string ‘dog’
(dog123) => Check for string ‘dog123′
(0-9) => Check for string ’0-9′
(A-Z) => Check for string ‘A-Z’
(a-z) => Check for string ‘a-z’
(123dog588) => Check for string ’123dog588′
Note: () is also used to create and store variables. For e.g. ^ (.*) $
4. ‘$’ - It is used to denote the end of a regular expression or ending of a line. For e.g.
- Colou?r$ => Check for a pattern which ends with ‘Color’ or ‘Colour’
- Nov(ember)?$ => Check for a pattern which ends with ‘Nov’ or ‘November’
- elearning*\.html$ => Check for a pattern which ends with ‘elearning.html’
- *\.php$ => Check for a pattern which ends with any php file
- product-price\.php$ => Check for a pattern which ends with ‘product-price.php’
5. ‘+’ – It is used to check for one or more occurrences of the preceding character. For e.g.
[a]+ => Check for one or more occurrences of lowercase letter ‘a’.
[dog]+ => Check for one or more lowercase letters ‘d’, ‘o’ or ‘g’.
[548]+ => Check for one or more numbers ’5′, ’4′ or ’8′.
[o-9]+ => Check for one or more numbers
[a-z]+ => Check for one or more lower case letters
[^a-z]+ => Check for one or more characters which are not lowercase letters.
[a-zA-z]+ => Check for any combination of uppercase and lowercase letters.
[a-z0-9]+ => Check for any combination of lowercase letters and numbers.
[^9]+ => Check for one or more character which is not the number 9.
6. ‘?’ – It is used to check for zero or one occurrence of the preceding character. For e.g.
[a]? => Check for zero or one occurrence of lowercase letter ‘a’.
[dog]? => Check for zero or one occurrence of lowercase letter ‘d’, ‘o’ or ‘g’.
[^dog]? => Check for zero or one occurrence of a character which is not the lowercase letter ‘d’, ‘o’ or ‘g’.
[0-9]? => Check for zero or one occurrence of a number
[^a-z]? => Check for zero or one occurrence of a character which is not a lower case letter.
Note: ? when used in a regular expression makes the preceding letter or group of letters optional. For e.g. the regular expression: ^colou?r$ matches both ‘color’ and ‘colour’. Similarly, the regular expression: ^Nov(ember)28(th)?$ matches: ‘nov 28′, ‘november 28, Nov 28th and November 28th
7. ‘.’ – It is used to check for a single character which is not the ending of a line. For e.g.
the regular expression: Action ., Scene2 would match Action 1, Scene2; Action A, Scene2; Action 9,Scene2 but not Action 10,Scene2 or Action AB,Scene2
8. ‘*‘ – It is used to check for any number of occurences (including zero occurences) of the preceding character. For e.g 31* would match 3, 31, 311, 3111, 31111 etc.
9. | – It is the logical OR . For e.g.
(His | Her) => Check for the string ‘his’ or ‘her’.
10. ‘\’ – It is the escaping character which is used to escape from the normal way a subsequent character is interpreted. For e.g.
the regular expression: ^www\.abc\.com$ matches www.abc.com
11. ‘!’ – It is logical NOT. But unlike ^ (caret), it is used only at the beginning of a rule or a condition. For e.g.
(!abc) => Check for a string which is not the string ‘abc’.
[!0-9] => Check for a single character which is not a number.
[!a-z] => Check for a single character which is not a lower case letter.
12. White Space- To create a white space in a regular expression, just use the white space. For e.g.
(Himanshu Sharma) => Check for the string ‘Himanshu Sharma’
Some Regex Examples
^(*\.html)$ => Check for any number of characters before .html and store them in a variable.
^dog$ => Check for the string ‘dog’
^a+$ => Check for one or more occurrences of a lower case letter ‘a’
^(abc)+$ => Check for one or more occurrences of the string ‘abc’.
^[a-z]+$ => Check for one or more occurrences of a lower case letter.
^(abc)*$ => Check for any number of occurrences of the string ‘abc’.
^a*$ => Check for any number of occurrences of the the lower case letter ‘a’
Q. Find all the files which start from ‘elearning’ and which have the ‘.html’ file extension
^elearning* \.html$
Q. Find all the PHP files
^*\.php$
mod_rewrite
It is a module (function) written in ‘C’ programming language: ‘mod_rewrite.c’. This module works only with Apache server 1.2 or later and is called from the .htaccess file (ASCII file which contains configuration directives and rules for files and folders). Though this module you can:
- Re-Write URLs
- Redirect URLs
- Solve Canonical URL issues
- Solve Hot linking issues
- Block visitors from accessing a particular folder, file or the whole website.
- Create custom 403 and 404 pages.
- Deliver contents on the basis of the IP address and benefits are end less.
Types of Configuration Directives
There are 9 types of configuration directives:
- RewriteEngine
- RewriteOptions
- RewriteLog
- RewriteLogLevel
- RewriteLock
- RewriteMap
- RewriteBase
- RewriteRule
- RewriteCond
But here we will talk about only three directives: RewriteEngine, RewriteRule and RewriteCond. I have not found any use of other directives so far. But if you think, other directives can be helpful for a seo, then please let me know in the comments below.
RewriteEngine
This configuration directive is used to enable or disable the mod-rewrite module.
Syntax: RewriteEngine on/off
Default Value: RewriteEngine off
That’s why in .htacess file we first enable the mod-rewrite module by adding the following code:
Options +FollowSymLinks
RewriteEngine on
RewriteRule
This configuration directive tells the server to interpret the given statement as a rule.
Syntax: RewriteRule <pattern> <substitution> [FLAGS]
Here pattern is a regular expression and substitution is a URL.
FLAGS can be [R], [F], [NC], [QSA], [L], [OR] etc.
[R] => Redirect. Its default value is 302. It can be assigned any number from 300 to 400. For e.g.
RewriteRule ^index\.html$ /index.php [r=301]
[F] => Forbidden. It is generally used with hyphen (-). The hyphen tells the server not to perform any substitution. This flag tells the server not to fulfill the request and return ’403′ response code. For e.g.
RewriteRule ^product-price\.php$ -[F]
[NC] => It tells the server to ignore uppercase or lowercase when checking for patterns. For e.g.
RewriteRule ^him*\.php$ [nc]
[QSA] => Query String append. It tells the server to pass query string from the original URL to the new URL.
[L] => Last rule. This tag tells the server not to process any more rules.
[OR] => Logical OR. This flag is used as logical OR for RewriteCond statements.
RewriteCond
This configuration directive tells the server to interpret the given statement as a condition for the rule which immediately follows it.
Syntax:

Here first mod-rewrite matches each URL with the given pattern. If no URL matches the pattern, then mod_rewrite process the next rule. If a URL matches the pattern, then mod_rewrite looks for the corresponding RewriteCond. If no corresponding RewriteCond exist, then the matched URL is replaced by the substitution.
If corresponding RewriteCond exist, then each RewriteCond is processed in the order they appear from top to bottom. Each RewriteCond is processed by matching its test string to against its corresponding condition pattern. If test string doesn’t matches with its condition pattern, then mod_rewrite process the next rule, otherwise it process the next RewriteCond. When all RewriteConds are successfully processed, then the matched URL is replaced by the substitution. A test string can be:
1. A simple text
2. RewriteRule back reference
3. RewriteCond back reference
4. Server Variable
RewriteRule Back Reference
It is of the form $N, where N can be any number from o to 9. It is used to denote that variable which was created in the RewriteRule pattern. For e.g.
RewriteRule ^(.*)$ /index.php/$1 [L]
RewriteCond Back Reference
It is of the form %N, where N can be any number from 1 to 9. It is used to denote that variable which was created in the ‘condpattern’ from the last matched ‘RewriteCond’. For e.g.
RewriteCond %{HTTP_HOST} ^(123\.42\.162\.7)$
RewriteCond %1 ^123\.42\.162\.7$
RewriteRule ……………..
Server Variable
Syntax: % {Variable_Name}
E.g.
1. %{HTTP_HOST} – This variable gives information about server name and its IP address.
2. %{HTTP_USER_AGENT} – This variable gives information about user’s operating system and browser.
3. %{QUERY_STRING} – This variable returns query string.
4. %{HTTP_REFERER} – This variable returns the URL of the referer.
5.%{REMOTE_ADDR} -This variable returns the IP address of the referer.
Examples
Example-1: Redirect all request for pages in the media folder to a new page ‘media.html’.
RewriteRule ^media/$ /media.html [r=301,l]
Example-2: Redirect oldaddress.html page to newaddress.html page
RewriteRule ^oldaddress\.html$ /newaddress.html [r=301,l]
Example-3: Redirect one website to another
Redirect 301 http://www.anotherwebsite.com
Example-4: Redirect abc.com/index.html to www.abc.com
RewriteCond %{REQUEST_URL} ^index\.html$
RewriteRule ^(.*)$ http://www.abc.com/$1 [r=301, l]
Example-5: Block a visitor with the IP address 12.34.56.78 from viewing your file product-prices.html
RewriteCond %{REMOTE_ADDR} ^12\.34\.56\.78$
RewriteRule ^product-prices\.html$ - [F]
Example-6: Block a visitor with the IP address 12.34.56.78 from viewing your folder ‘sales-demo’
RewriteCond %{REMOTE_ADDR} ^12\.34\.56\.78$
RewriteRule ^sales-demo/ - [F]
Example-7: Block a visitor with the IP address 12.34.56.78 from viewing your website www.abc.com
RewriteCond %{REMOTE_ADDR} ^12\.34\.56\.78$
RewriteRule ^.*$ - [F]
Regular Expressions and Google Analytics
There are many cases where regular expressions are very useful in Google Analytics. Some of such cases are:
1. Setting up a goal which should match multiple goal pages instead of one.
2. Setting up a funnel in which a step should match multiple pages instead of one. In fact, when you set up a funnel, all URLs are treated as regular expressions.
3. Excluding traffic from an IP address range via filters. In fact, there are many filters which require regular expressions. Big organizations generally own a range of IP addresses. Therefore, to exclude an organization’s internal traffic you need to specify an IP range using regex.
4. Setting up advanced segments. For example, the following regex can segment all the traffic coming from social media sites:
twitter\.com|facebook\.com|linkedin\.com|plus\.google\.com|t\.co|bit\.ly|reddit\.com
Note: You can use Regex equipped advanced segments to unleash the power of the long tail keywords and determine whether these keywords are worth chasing. You can also use regex to segment important data through advanced segments.
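Before saving a segment like the social media one above, it can help to try the pattern on a few sample sources. Here is a rough Python sketch of that check. It assumes that the segment applies the regex as a partial match (hence search and not fullmatch), and the helper name is_social and the sample sources are made up for illustration.

```python
import re

# The social media pattern from the advanced segment above.
SOCIAL = re.compile(
    r"twitter\.com|facebook\.com|linkedin\.com|plus\.google\.com|t\.co|bit\.ly|reddit\.com"
)

def is_social(source: str) -> bool:
    # Partial match: the pattern may match anywhere inside the source.
    return SOCIAL.search(source) is not None

# Made-up sample sources, for illustration only.
for source in ["facebook.com", "m.facebook.com", "google", "microsoft.com"]:
    print(source, is_social(source))
```

One thing such a test brings out: because nothing is anchored, t\.co also matches any source which merely contains ‘t.co’, like microsoft.com. If that matters for your data, one option is to anchor that alternative as ^t\.co$.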
5. Rewriting URLs in Google Analytics reports.
You can rewrite URLs in Google Analytics reports with the ‘Search and Replace’ advanced filter. This comes in handy when your website has very long, ugly dynamic URLs and you can’t figure out what a page is about just by looking at its URL. So, for example, with the ‘Search and Replace’ advanced filter you can ask GA to report the following URL:
http://www.abc.com/fder/?catg=2341&pid=428
as
http://www.abc.com/outdoor/fleeces
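The idea behind such a search and replace step can be sketched in a few lines of Python. This is only an illustration and not how GA implements the filter: it assumes the filter works on the request URI (the part of the URL after the domain name), and the REWRITES table and the readable function are hypothetical names.

```python
import re

# Hypothetical table: search pattern for the ugly URI, readable replacement.
REWRITES = [
    (r"^/fder/\?catg=2341&pid=428$", "/outdoor/fleeces"),
]

def readable(uri: str) -> str:
    for pattern, replacement in REWRITES:
        new_uri, count = re.subn(pattern, replacement, uri)
        if count:
            return new_uri  # first matching pattern wins
    return uri  # URIs which match no pattern are reported unchanged

print(readable("/fder/?catg=2341&pid=428"))
```

Note that the ? in the search pattern is escaped, since an unescaped ? is a metacharacter and would make the preceding slash optional instead of matching a literal question mark.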
6. Filtering data within the GA report interface.
You can use the following regular expressions to filter keywords on the Google Analytics reporting interface:
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){0}$ => Filter 1-word keyword phrases
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){1}$ => Filter 2-word keyword phrases
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){2}$ => Filter 3-word keyword phrases
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){3}$ => Filter 4-word keyword phrases
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){4}$ => Filter 5-word keyword phrases
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){5}$ => Filter 6-word keyword phrases
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){6}$ => Filter 7-word keyword phrases
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){7}$ => Filter 8-word keyword phrases
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){8}$ => Filter 9-word keyword phrases
^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){9}$ => Filter 10-word keyword phrases
^([^ ]+ ){4,10}[^ ]+$ – Filter keywords that have between 4 and 10 spaces in them (i.e. 5 to 11 words). This regex can help you in determining long tail keywords on your website.
^/([^/]+/){3}[^/]*$ – Filter landing pages that have 4 slashes in their URL path. This regex can help you in identifying low quality pages on your website.
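The word count patterns above differ only in the number inside {}, which is always one less than the number of words. A small Python sketch makes this easy to try out on sample data. The function name n_word_pattern and the sample keywords and paths are made up for illustration.

```python
import re

def n_word_pattern(n: int) -> re.Pattern:
    # One word, followed by (n - 1) repetitions of "separator(s) + word".
    # Dots, whitespace and hyphens all count as separators.
    return re.compile(r"^[^\.\s\-]+([\.\s\-]+[^\.\s\-]+){%d}$" % (n - 1))

# The long tail keyword and the 4-slash landing page patterns, as given above.
LONG_TAIL = re.compile(r"^([^ ]+ ){4,10}[^ ]+$")
DEEP_PAGE = re.compile(r"^/([^/]+/){3}[^/]*$")

# Made-up sample data, for illustration only.
print(bool(n_word_pattern(3).search("red running shoes")))
print(bool(n_word_pattern(2).search("t-shirt")))
print(bool(LONG_TAIL.search("how to write a regex")))
print(bool(DEEP_PAGE.search("/a/b/c/page.html")))
```

Since the hyphen is treated as a separator, a keyword like ‘t-shirt’ is counted as 2 words by these patterns.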
Related Tools:
- To learn more about regular expressions: http://www.regular-expressions.info/
- The Regex Coach is a graphical application for Windows which can be used to test regular expressions
- Regular Expression Checker - chrome add-on to test regex
.htaccess
It is a plain text (ASCII) file which contains configuration directives and rules for files, folders and the whole website. You can have more than one .htaccess file on a server. In fact you can have one .htaccess file per folder/directory. When you put the file in a directory, the rules mentioned in it apply only to the files and sub-directories in that directory. When you put the file in the root directory, the rules mentioned in it apply to all the files and directories of the website. To use RewriteCond and RewriteRule in a .htaccess file, the file must contain the following two lines before them:
Options +FollowSymLinks
RewriteEngine on

About the Author: Himanshu Sharma is the founder of seotakeaways.com which provides SEO Consultation, PPC Management and Analytics Consultation services to businesses of all size. He holds a bachelors degree in ‘Computer Science’, is a proud member of 'Digital Analytics Association' and is also a Google Analytics Certified Individual with GAIQ Score of 95%. He is also the founder of EventEducation.com and EventPlanningForum.net.






once again excellent post himanshu
Nice helpful article but are you sure you can redirect one website to another using redirect 301 command.
great post himanshu. i am not a web developer but now i can finally understand how regex is used in configuration directives. Thanks for taking out time in writing such a comprehensive post.
It’s my pleasure buddy. It took me almost a month to write this post.
thats a very thorough and useful guide. thanks for sharing.
That’s a great reference. I am going to bookmark it.
Your guide is really detailed and helpful. Here is another great reference on regex: http://regexlib.com/
Thanks Alison.
Excellent. I have spent the last one year trying to understand regular expressions and their use in mod rewrite. Your post explains it in a very simple manner. thanks
I though i was the only one who loves regex.
very nicely done tutorial on regex and mod rewrite. thanks a lot for this great information himanshu. any other resource on regex?
Mastering Regular Expressions by Jeffrey E. F. Friedl is the ultimate resource on regex.
love it…. learn a lot from it.
excellent tips. this gives me a lot to learn about. bookmarked
thanks a lot for this insightful article. regular expressions is indeed a very tough topic. Hope to get more such articles from you.
i love posts like this. thanks for wrapping up this information for us.
After visiting your blog post only, i am aware about that regular expression. Anyway, it will be very helpful for me. Thanks for the nice share!
Here is a real world example,
Used for finding links in html
href="([^"]+)"
The parentheses above denote a grouping which allows you to pull that section of the regex out and use it for other purposes. Typically this grouping would be available in the $1 variable.
thanks SBR.
Great article on one of my most loathed subjects! I have struggled with Regex since first taking an interest in web design some years ago. This is a great resource which I shall bookmark.
Thank you for posting this it is very helpful, for both SEO and Analytics folks, as you can use these in google analytics as well.
superb post. I haved liked almost all of your posts so far. This is one is truly remarkable.
thanks for the post
Excellent reference guide. I have bookmarked and will no doubt be back soon
Thanks Blaine.
Just wanted to drop in and thank you!
Here’s my favorite testing tool:
http://www.gskinner.com/RegExr/
I came here after reading your post on SEOmoz and these two articles are pretty similar. I’m sure that your article was used as a reference but you shouldn’t have called him out so boldly. Nonetheless, your post is awesome as well and thanks!
Hi Andrew!
Thanks for stopping by my blog. I stand for what is right. Nobody can get away easily using my content without proper attribution.
Thank for posting brother, really helped me to boost up my knowledge in REGEX and Re-Write URLs
I have 1 question…
For
Q. Find all the PHP files
^*\.php$ works same as *\.php$ for finding php files, isnt it? or m i wrong?
It should work same if there is no regex before *\.php$. But just to be on the safe side i would prefer to use ^ at the very beginning.
A very useful post i have read after a long time and possibly the best post on regex so far. Thanks.
Nice post. Will it be possible for you to convert into a cheat sheet for regex. Thanks.
I didn’t see this post until now. Great work.Bookmarked.
Hi Himanshu! Is there any firefox add on to test regular expressions? Nice post.
Yes. There is one. Google ‘regular expression tester firefox’ and click on the first result.
Wow! Excellent resource for regular expressions. I would like to share my favorite regex tool http://www.regexbuddy.com/
Thanks Brian. I will check it out.
Can you use htaccess to block all bad bots? The reason I ask, is I had a bot hitting my article directory with “empty user agent string” that was using up tons of my bandwidth. My hosting company and I tried several things. Banned ranges of IPs, etc. They even tried to do something in their firewall, but this bot continued to hit regardless of anything we’ve done.
Any suggestions? I’d pay someone to get this bot banned from crawling my website.
Yes htaccess can be used to block bad bots as generally such bots don’t obey the robots exclusion protocol. I don’t know the name of your bot so can’t give you the actual code. But here is an example:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]
Here BlackWidow and Zeus are the names of two bad bots. If you know the names of your bots then you can use those names instead. Let me know if it helps.