Talking GeekSpeek to a Mailbox

(A different approach to Validating an Email Address in PHP)

(This tutorial assumes a PHP-capable audience. I would class it as Early Intermediate level)
(Some of the code in this tutorial will not work on Windows® platforms without modification.)
(See Footnote <3> for details.)


The only real problem with online e-mail forms is that you have no control over what your visitor types into the "Your e-mail address" box. The best you can do from the programmer's end is to make sure you don't accept anything which is not valid. Now this automatically raises the question "What is a valid e-mail address?". A lot of different answers come to mind, but I am a pragmatic type of guy. To me the ONLY useful definition of a valid e-mail address is ...

"A valid e-mail address is one to which I can send an e-mail, and have it delivered."

Therefore, the ONLY way of validating an e-mail address is to send a mail to it, and that is almost, but not quite, what we are going to do in this tutorial. I said that right at the beginning so that nobody need live in fear of a long exposition on the elegance of regular expressions (regexp). Forget that stuff, we're here for fun.

Phase 0 .. Nothing to do with Validation

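Let's keep it simple. This function expects only an e-mail address as its calling parameter, and it needs one other piece of information: the host name of your own site, which it will use later to introduce itself to the mail server. Older PHP installations handed you this as the global variable $HTTP_HOST, but that relied on register_globals, which was removed in PHP 5.4, so here we copy it from $_SERVER['HTTP_HOST'] into a local variable of the same name. So let's create a shell for the function, and set that variable up right now, just to get that bit out of the way ..

function validateEmail($Email) {
  // the name we announce ourselves by in Phase 3; the 'localhost' fallback
  // (for scripts run outside a web request) is just a placeholder
  $HTTP_HOST = isset($_SERVER['HTTP_HOST']) ? $_SERVER['HTTP_HOST'] : 'localhost';
  // you can also define any local variables like your return string here too
  //
  // all the rest of the logic in this tutorial replaces this comment
}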

Phase 1 .. Does it Look OK?

Why bother testing the validity of an address if it's malformed anyway?

Ahhh .. OK ... there has to be just a little bit of a regular expression in here, but I'm lazy, so I didn't write it, I just searched for address validation code in all the usual places, and discovered that there is quite a lot of it out there. All of it is based on an attempt to test the address string using different regular expressions of arbitrary levels of complexity.

Before we go any further, let me tell you that I am not going to teach anyone how to construct regular expressions. You can find lots of regexp tutorials on Google. I am also not going to waste time describing the structure of a valid e-mail address. The masochists amongst you can read and enjoy the full and unexpurgated definition in section 3.4 of RFC 2822, which replaced the original specification outlined in Section 6 of RFC 822 and has itself since been superseded by RFC 5322 (RFCs are better than mandrax for putting you to sleep).

I don't recommend writing your own parsing expression. It's all been done before, by experts, and if you don't understand regexp you can easily make a cobblers of it. For a taste of it, just get hold of a regular expression syntax guide, and work out what this expression ...

^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$

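... evaluates to. An expert can fool it, and it certainly isn't the best you can find: it turns away some perfectly legal addresses, such as those with a + in the user part or an ending longer than three letters (.info, for example). But it gives a reasonable first pass, and only lets through addresses that we can work on later without causing the code to throw up exceptions. That's all we need at this point in the program.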

You can find an explanation of the above expression in footnote <2>. If, on the other hand, you really want everything to be as rigorous as possible, then refer to the code written by Jay Greenspan and Brad Bulger (here is a copy) which is an attempt to test an address for full RFC-822 compliance (and in my opinion is complete overkill for our purposes).

Here is my short, sweet validation expression in full, and this is the first bit of code to put in the function template we made in Phase 0.

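  // eregi() was removed in PHP 7, so preg_match() does the job here:
  // the "i" flag ignores case, and "D" stops $ matching before a trailing newline
  if (!preg_match('/^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$/iD', $Email)) {
    // return with an invalid address error
  }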

  // otherwise it passes the test so we can go off and do the 'real' validation
  // the rest of the code replaces this comment
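
If you want to see what the first pass lets through before wiring it into the function, here is a minimal sketch. It assumes preg_match() in place of eregi() (which PHP 7 removed), and looksLikeEmail() is a name I made up for illustration; it is not part of the finished function.

```php
<?php
// First-pass syntax check only; it says nothing about whether the mailbox exists.
// looksLikeEmail() is a hypothetical helper name used for this sketch.
function looksLikeEmail($Email) {
    // "i" = ignore case, "D" = $ matches only at the very end of the string
    $pattern = '/^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$/iD';
    return preg_match($pattern, $Email) === 1;
}

// A few made-up addresses to try it on
$samples = array(
    'trib@trib-design.com',
    'john.smith@mail.example.co.uk',
    'no-at-sign.example.com',
    'bob@example',
    'bob@example.info',
);

foreach ($samples as $address) {
    echo $address, ' => ', looksLikeEmail($address) ? 'looks OK' : 'rejected', "\n";
}
```

The last sample is rejected only because its ending has four letters, which is the price of keeping the expression short.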

Phase 2 .. Is $Email's domain valid for receiving mail?

Well this part works on the domain, so first let's chop up the address into user and domain portions by splitting it at the @ symbol

  // explode() replaces split(), which was removed in PHP 7
  list ( $Username, $Domain ) = explode ("@", $Email);

PHP has some great little in-built networking functions, and if you have a rudimentary understanding of how DNS and the SMTP interface work, you can take advantage of them to do some very fast tests. Some of the explanation below is probably way below the level of most of you. If it is, bear with me while I use it to get my own thoughts in order.

When you try to send an e-mail to me at trib-design.com, it actually goes to a server called mail.trib-design.com. How does it know how to do that? - because my DNS servers (we all have them) have what is known as an MX (Mail eXchanger) record for my domain, which points to my mail server. Any attempt to send mail causes the sending program (sendmail, qmail, etc) to request the details of the MX server from the DNS service "out there". (55 yrs old and I still marvel how clever the Internet's design really is).

So that must be the first of the internet tests which we want to try ... is there an MX record for that domain? i.e. is there a mail server WHICH SERVICES that domain? Be aware that not all domains have an internal mail server. Some pay for a service, so the MX record will point to the service provider's mail server. Just to complicate things further, in some cases there may not even be an MX record. (The mail server may simply answer to the domain name itself, with only a regular - Type A - record to its name. The SMTP specification allows for this: a sender which finds no MX record is expected to try the domain name itself.) Anyway, if there is an MX record, we need to know the name of the server to which it refers, in order to proceed with the testing. If there is not, then we can only assume that the domain name is the name of the mail server and proceed on that assumption.

PHP provides for retrieving the MX record details with two functions, whilst taking care of the DNS queries behind the scenes for you.

checkdnsrr($Domain, $record_type)
- checks to see if a DNS record of type $record_type exists for domain name $Domain
getmxrr($Domain, $MXrec)
- gets the MX record for $Domain and returns any MX records found in an array which we are calling $MXrec

I guess, if you wanted to be quick and dirty, you could just use getmxrr() and test its return value (it returns false when it finds no MX records), but since we are going to be rigorous, let's do it the way a properly written sendmail server would do it.

  if (checkdnsrr ( $Domain, "MX" ))  {
  
    if (getmxrr ($Domain, $MXrec))  {  

      // save the MX hostname ready for Phase 3 testing

      $Mailserver = $MXrec[0];

    } else {

      // hmm .. there's an MX record, but we failed to retrieve it
      // return with a system error message NOT an invalid address
    }

  } else {

    // in this case there isn't an MX record, so we have to assume that the domain
    // name ($Domain) is also the name of the mail server itself (it can happen)
    // so save that as the mailserver address ready for Phase 3 testing

    $Mailserver = $Domain;
  }
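
One thing the code above takes on trust is that $MXrec[0] is the server to talk to. A domain can list several mail servers, each with a preference number (lowest is tried first), and getmxrr() will hand those numbers back if you give it a third argument. Here is a rough sketch of picking the preferred one. The names lowestWeightHost(), chooseMailserver() and $MXweight are mine, invented for illustration, and the sketch skips the separate checkdnsrr() step for brevity.

```php
<?php
// lowestWeightHost() is a hypothetical helper. $hosts and $weights are the two
// parallel arrays which getmxrr() fills in (same index = same MX record).
function lowestWeightHost(array $hosts, array $weights) {
    if (count($hosts) === 0 || count($hosts) !== count($weights)) {
        return null;
    }
    $best = null;
    foreach ($weights as $i => $weight) {
        // strict "<" means that on a tie the first record listed wins
        if ($best === null || $weight < $weights[$best]) {
            $best = $i;
        }
    }
    return isset($hosts[$best]) ? $hosts[$best] : null;
}

// Hypothetical wrapper showing where the helper would sit in Phase 2.
// It does a real DNS lookup, so it is defined here but not called.
function chooseMailserver($Domain) {
    if (getmxrr($Domain, $MXrec, $MXweight)) {
        $host = lowestWeightHost($MXrec, $MXweight);
        if ($host !== null) {
            return $host;
        }
    }
    // no MX record: assume the domain name is the mail server itself
    return $Domain;
}

// Made-up data standing in for a getmxrr() result
echo lowestWeightHost(array('backup.example.com', 'mail.example.com'), array(20, 10)), "\n";
```

Whether this matters depends on the domain. With a single MX record, $MXrec[0] and the preferred host are the same thing.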

Phase 3 .. HELO .. is there anybody there?

So far we have what appears to be a correctly formed e-mail address, and either a known mail server name, or a good guess at one. Now is the time to really get down to testing our data and seeing if it all hangs together, but first, we have to attempt to make a connection to (i.e. open a socket on) whatever name we have in $Mailserver. We do this by using the fsockopen($host, $port) function. The default SMTP port is port 25.

  if ($Connection = fsockopen($Mailserver, 25)) {

    // start the SMTP validation - the rest of the code goes in here

  } else {
    // the connection failed - return with an invalid address error
  }

Have you ever telnetted to an SMTP server? It talks in human-readable responses, and it says a lot of interesting things. If you're interested you can see them explained in all their glory in yet another pair of RFCs - RFC 2821 and the earlier RFC 821 (yawn). What we're going to do is actually ask it some questions, and see what it says in response, and that's the real meat of this validation. Basically the conversation goes like this:

Are you an SMTP server ?
Will you talk to me ?
Will you accept $Email as the sender of a message ?
Will you accept mail on behalf of $Email ?

I think you'll agree that if we get a yes to every one of those questions, then we're pretty sure that we have a valid e-mail address ... yes?

OK, let's talk geekspeek to a machine ... ready ?

The first thing that happens when you connect to an SMTP server is that it announces itself to you (try it yourself - telnet to port 25 of your own mail server and see what it says).

(aside ... to read from a socket, you have to use raw input, i.e. the fgets(connection, buffersize) function. fgets() returns a value when either its buffer is full, or it receives a newline or EOF terminator. In the tests below I use 1024 as the size of the fgets buffer. This is overkill, but at least it guarantees that we don't leave behind the end of one line and then retrieve it next time, thinking it's the next message. This is important because it's the first 3 characters of each message which we test. We can't be more precise about the buffer size, because some of the SMTP messages are customizable at the server, so you don't know in advance how long they might be. However, 1024 is well over the maximum allowed by any of the mail servers I know about. One caveat: a server is allowed to spread a single reply over several lines, each starting with the same code followed by a hyphen, and the one-fgets-per-reply code below does not allow for that.)

As I said above, the first part of any SMTP message is its three-digit reply code, and that's the important bit for our tests, e.g. a valid opening announcement begins with 220.
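
The code in this tutorial reads exactly one line per reply. SMTP does allow a reply to run over several lines: every line but the last has a hyphen straight after the code ("220-"), and the last has a space ("220 "). If you want to be safe against chatty servers, something along these lines would read a whole reply in one go. It is only a sketch, readSmtpReply() is a name I made up, and the demonstration uses an in-memory stream with invented server text rather than a real connection.

```php
<?php
// readSmtpReply() is a hypothetical helper. It keeps reading lines until it
// meets the final line of a reply, i.e. one whose 4th character is not "-".
function readSmtpReply($Connection) {
    $reply = '';
    while (($line = fgets($Connection, 1024)) !== false) {
        $reply .= $line;
        // "250-..." means more lines follow; "250 ..." (or a bare "250") ends it
        if (strlen($line) < 4 || $line[3] !== '-') {
            break;
        }
    }
    return $reply;
}

// Stand-in for a socket: an in-memory stream preloaded with made-up replies
$fake = fopen('php://memory', 'w+');
fwrite($fake, "220-mail.example.com ESMTP\r\n220 ready when you are\r\n250 hello\r\n");
rewind($fake);

$greeting = readSmtpReply($fake);   // both "220" lines together
$next     = readSmtpReply($fake);   // the single "250" line
echo $greeting, "---\n", $next;
fclose($fake);
```

The first 3 characters of whatever comes back are still the code to test, so the rest of the logic does not need to change.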

So ...

Are you an SMTP server ? - answer - 220 = yes, anything else = failed

    if (preg_match('/^220/', $Rubbish = fgets($Connection, 1024))) {

      // it's an SMTP server at least so you can start talking to it
      // let's have the conversation and then test the results

Will you talk to me ? - any answer will do (it will always be yes if there's a server there)

      // Tell it who you are and get the response (not needed later)
 
      fputs ( $Connection, "HELO $HTTP_HOST\r\n" );  
      $Rubbish = fgets ( $Connection, 1024 );

Will you accept $Email as the sender of a message ? .. 250 = yes, anything else = failure

      // Name $Email as the sender (MAIL FROM) of the message we are pretending to send
      // store the response (needed later)

      fputs ( $Connection, "MAIL FROM: <{$Email}>\r\n" );  
      $Fromstring = fgets ( $Connection, 1024 ); 

Will you accept mail on behalf of $Email ? .. 250 = yes, anything else = failure

      // Ask it to accept mail for delivery to $Email user
      //store the response (needed later)

      fputs ( $Connection, "RCPT TO: <{$Email}>\r\n" );  
      $Tostring = fgets ( $Connection, 1024 );

Thanks .. you can go back to sleep now ...

      // Now tell it you're done with chatting
      // and close the connection

      fputs ( $Connection, "QUIT\r\n");  
      fclose($Connection);  

Were we successful ? ...

      // did we get "type 250" responses?

      if (preg_match('/^250/', $Fromstring) && preg_match('/^250/', $Tostring)) {

        // YAHOOO .. we got a good one
        // return with a successful validation

      } else {

        // the server refused the user so return with invalid address error

      }

    } else {

      // we connected to something, but it failed to identify itself as an SMTP server
      // so close the connection and return with an invalid address error

      fclose($Connection);

    }

And that's it ... we've got an address which works, for a user the server says it will take mail for, on a server which can talk SMTP. We were only one step short of actually sending an e-mail to that address (the next command would have been DATA, followed by the message itself), so that's what I would call a reasonable test. Bear in mind that some servers say yes to every address and only bounce the mail later, so a pass here is strong evidence rather than proof.

For the record, even though it goes out for a chat in the process, the average execution time of this code is between 0.25 and 1.0 seconds from the point where the visitor hits [submit] to the result being returned.

I'll leave it as an exercise for you to hack it all together, debug it, and include the return value handling code which you want to use. The core processing is all there, and in my humble opinion, where there's code involved, you shouldn't really be using it if you don't fully understand it.
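
To get you started on the return value handling, here is one possible shape for it, and no more than that. It departs from the code above in two ways which are my own assumptions: it accepts any 2xx code rather than only 250, and it separates a flat refusal (5xx) from a "try again later" (4xx), since the latter tells you nothing about the address. The names replyClass() and addressVerdict(), and the three return strings, are invented for this sketch.

```php
<?php
// replyClass() and addressVerdict() are hypothetical names for this sketch.

// Returns the first digit of the three-digit reply code (2, 3, 4 or 5),
// or 0 when the text does not start with a sensible code at all.
function replyClass($reply) {
    return preg_match('/^([2-5])[0-9]{2}/', $reply, $m) ? (int) $m[1] : 0;
}

// $Fromstring and $Tostring are the replies saved in Phase 3.
function addressVerdict($Fromstring, $Tostring) {
    $from = replyClass($Fromstring);
    $to   = replyClass($Tostring);

    if ($from === 2 && $to === 2) {
        return 'valid';      // both commands were accepted
    }
    if ($from === 5 || $to === 5) {
        return 'invalid';    // the server refused outright
    }
    return 'unknown';        // 4xx "try later", or no usable reply
}

// Made-up replies, just to show the three outcomes
echo addressVerdict("250 OK\r\n", "250 Accepted\r\n"), "\n";
echo addressVerdict("250 OK\r\n", "550 No such user\r\n"), "\n";
echo addressVerdict("250 OK\r\n", "450 Try again later\r\n"), "\n";
```

What you do with 'unknown' is up to you. Letting the visitor through and relying on a confirmation mail is one option; asking them to try again is another.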

... Caveat Emptor ...

Don't forget that even though this has been a very thorough test of an EMAIL ADDRESS, it still does not mean that this is the e-mail address of the ACTUAL PERSON who typed it in. You will find that mickeymouse@disneyland.com IS a valid address, and is probably used daily by millions of kids all around the world to write fan mail to Mickey, but would the little fella really be applying to join your forum? So in the end, for important validation, you will always have to send a mail, and wait for some sort of uniquely coded response or ...

In part 3 of the tutorial, which is still under construction, I'll be showing you how to interface PHP with a microsensitive keyboard, and then talk you through developing the code to communicate with the fingerprints database at FBI Headquarters in the Pentagon .. now THAT is rigorous validation ..

'till next time ... Enjoy .... Trib


Footnotes

<1> - The Goodies Section

For those of you who can't be bothered to work it all out, a finished, cleaned up, documented and working version of the function described in this tutorial is available here, the mailform file I call it from is here, and the very useful set of include files, containing utility sub-routines written by Jay Greenspan (which I use for a lot of things) are here. Incidentally, the sendemail.php file is set up to be run inside a popup window. You will need to modify it if you don't want to do that.

<2> - That Regular Expression

Normally I would leave you to work out the regular expression yourselves, on the premise that, where coding is concerned, if you don't understand something, you shouldn't be using it. So if you would prefer to do that, stop reading now. On the other hand ... if you still want to know ... here it is again ...

^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$

Now let's split it up and see how it works. Here it is broken down into its component parts ...

Part 1 ... ^[_a-z0-9-]+

^ ... "the entire string MUST start with" ...
[_a-z0-9-] ... "a letter, a number, an underscore or a minus" ...
+ ... "repeated any number of times, but at least once" ... followed by ...

Part 2 ... (\.[_a-z0-9-]+)*

( ... "treat the following as a sub-expression until the closing bracket" ...
\. ... "this sub-section MUST start with a full stop" ... followed by ...
[_a-z0-9-]+ ... "that part 1 text pattern again" ...
) ... "and that's the end of the sub-expression" ...
* ... "which can be repeated NONE OR ANY NUMBER OF times (i.e. non-compulsory)" ...

Part 3 ... @

There MUST be an @ symbol next ...

Part 4 ... [a-z0-9-]+(\.[a-z0-9-]+)*

Almost Part 1 and part 2 all over again with the exception of the ^ symbol (because we aren't specifying that the first character must be at the beginning of the whole string)

Part 5 ... (\.[a-z]{2,3})$

( ... "start another sub-section" ...
\. ... "there's that compulsory full stop again" ... followed by ...
[a-z]{2,3} ... "either two or three letters (the match ignores case, so upper or lower will do)" ...
) ... "end of sub-section definition" ...
$ ... "and this sub-section MUST be the very last thing in the string."

Did that all make sense ?? Here it is even more roughly translated,

Part 1 - a compulsory word containing only certain permissible characters
Part 2 - a non-compulsory sequence of any number of "dot - another word like part 1" combinations
Part 3 - a compulsory @ before the next expression
Part 4 - Part 1 and Part 2 repeated
Part 5 - a compulsory end sequence comprising a dot followed by either 2 or 3 letters.

Now doesn't that sound a lot like an e-mail address ??
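
If you would rather watch the parts at work than take my word for it, this little sketch regroups the same expression so that the user part (Parts 1 and 2), the body of the domain (Part 4) and the ending (Part 5) each come back separately. The regrouping, the function name dissectAddress() and the array keys are my own inventions for illustration; the characters allowed are exactly those of the original expression.

```php
<?php
// dissectAddress() is a hypothetical helper. The pattern is the tutorial's
// expression with the brackets rearranged: (?: ) groups without capturing,
// so only the three interesting pieces end up in $m.
function dissectAddress($Email) {
    $pattern = '/^([_a-z0-9-]+(?:\.[_a-z0-9-]+)*)'   // Parts 1 and 2: the user
             . '@'                                     // Part 3
             . '([a-z0-9-]+(?:\.[a-z0-9-]+)*)'         // Part 4: body of the domain
             . '(\.[a-z]{2,3})$/iD';                   // Part 5: the ending
    if (!preg_match($pattern, $Email, $m)) {
        return null;
    }
    return array('user' => $m[1], 'domain' => $m[2], 'ending' => $m[3]);
}

// A made-up address to pull apart
print_r(dissectAddress('john.smith@mail.example.com'));
```

Notice that Part 4 would happily swallow the whole of mail.example.com on its own. It is Part 5 insisting on being last that forces it to hand the final .com back.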

<3> - Windows does not support checkdnsrr() or getmxrr() functions.
(This is an experts-only footnote)

Because of the way the Windows® operating system handles DNS lookups, the PHP native DNS functions used in this tutorial did not work on a Windows® platform before PHP 5.3.0 (from that version on, both are available on Windows® too). This is documented on the function reference pages for checkdnsrr() and getmxrr(). If you are stuck with an older version, on the checkdnsrr() page there are a couple of sparsely-documented workarounds, which should serve the same purpose as the built-in functions. From what I can see, the functions below should be direct replacements for the PHP builtins. I have copied the code here for the sake of completeness, but since I don't use Windows® as a development platform, I cannot vouch for the quality ... always tread with care on untested ground ...

N.B. - It is unfortunate, but even this workaround will NOT work on Windows 95® systems (W95 doesn't have nslookup).

/******************************************************

These functions can be used on WindowsNT to replace
their built-in counterparts that do not work as
expected.

checkdnsrr_winNT() works just the same, returning true
or false

getmxrr_winNT() returns true or false and provides a
list of MX hosts in order of preference.

*******************************************************/

function checkdnsrr_winNT( $host, $type = '' ) {
  if( empty( $host ) ) return false;
  # Set Default Type:
  if( $type == '' ) $type = "MX";
  # escapeshellarg() keeps odd characters in $host away from the shell
  $output = array();
  @exec( "nslookup -type=$type " . escapeshellarg( $host ), $output );
  foreach( $output as $line ) {
    # Valid records begin with host name:
    if( preg_match( '/^' . preg_quote( $host, '/' ) . '/i', $line ) ) {
      # record found:
      return true;
    }
  }
  return false;
}

function getmxrr_winNT( $hostname, &$mxhosts ) {
  $mxhosts = array();
  if( empty( $hostname ) ) return false;
  # escapeshellarg() keeps odd characters in $hostname away from the shell
  $output = array();
  @exec( "nslookup -type=MX " . escapeshellarg( $hostname ), $output );
  $bypref = array();
  foreach( $output as $line ) {
    # Valid records begin with hostname:
    if( preg_match( '/^' . preg_quote( $hostname, '/' ) . '\tMX preference = ([0-9]+), mail exchanger = (.*)$/', $line, $parts ) ) {
      # key each host by its preference number
      $bypref[ (int) $parts[1] ] = trim( $parts[2] );
    }
  }
  if( count( $bypref ) == 0 ) return false;
  # lowest preference number first, then renumber from 0
  ksort( $bypref );
  $mxhosts = array_values( $bypref );
  return true;
}