PHP captcha script

> Download (ZIP, 180.2KB)
> Example page

Recently I found myself in need of a captcha for a couple of my websites. Given that I hate ReCaptcha and I felt like writing some code, I decided to make one myself. The result is a simple PHP script that uses GD to generate the captcha image, and (optionally) JavaScript on the client side to let the user load a new captcha without refreshing the page.

After I'd gone to the effort of writing it, I decided that I might as well publish it as open source. I doubt it will be much use to anyone because there are lots of perfectly good captcha scripts out there already, but who knows? Right now it has a lot of rough edges and could be broken with some fairly simple image analysis, so I'm only offering it as-is. There are a lot of improvements that could be made, but it's good enough for what I needed and I don't want to spend any more time on it. You are welcome to make your own changes and improvements as you wish. I'd be grateful if you submit any improvements you make back to me so I can publish them here, but you don't have to. Of course I will credit all contributions you make if that's what you want.

Using the script

Adding a captcha to your site involves five simple steps. Take a look at the source code of the example page in the ZIP file to see how it works in practice.

Step 1: Copy files

You need to copy the following four files into the same directory as your script:

These files must be in the same directory as your script. At the moment there is no support for working across multiple directories.

Step 2: Add an include statement

The API functions are all in the captcha.php file. You need to include this file into your script using a PHP include() statement:

include("captcha.php");

The include statement must be located before any calls to the API functions. I suggest you put it right at the beginning of your script. Alternatively, you can copy and paste the contents of captcha.php directly into your own script.

Step 3: Display the captcha image

This part is easy, as all the work is done for you by the show_captcha() function. Calling this function will embed the necessary HTML and JavaScript code into your form. show_captcha() takes a single boolean argument: pass true if you want a refresh button on your form, or false if you don't. The refresh button will appear to the right of the captcha, and can be clicked by the user to get a new captcha without having to reload the page.

The call to show_captcha() should be located wherever you want the captcha to appear. For example:

<td>
<!-- Captcha appears here -->
<?php show_captcha(true); ?>
</td>

Step 4: Add a text field for the user to enter the code

This has been left up to you rather than being handled by show_captcha() or a similar function, because every implementation will be slightly different. You can customize the text field however you want. The field should be part of a form that is submitted back to the server when the user clicks a submit button, preferably by HTTP POST. For example:

<form action="myform.php" method="post">
<table>

<!-- Other fields go here -->

<tr>
<td>Security code:</td>
<td><input name="captcha" type="text"></td>
</tr>

<tr><td colspan="2"><input type="submit"></td></tr>

</table>
</form>

Step 5: Validate the user input

Simply call verify_code() with the user-supplied code as a parameter. This would usually be obtained from the POST values. This function returns true if the supplied code is correct, or false if incorrect. Because it uses PHP sessions, it must be called before any output has been sent to the user's browser. The script that calls this function should be located at the very beginning of the document, before any HTML code, and before calling show_captcha().

How it works

It's actually very simple. The captcha code is generated using some random numbers converted to base-64. It is stored as a session variable so it can be used to validate the user's input, and is used to generate the captcha image. The following sections describe the script in detail. They are designed for you to read along with the code files, so download the ZIP and take a look.

Captcha and image generation

This section describes the code_image.php file. This file outputs a PNG format image, and is designed to be embedded into your form using an img tag (the functions in captcha.php will take care of this for you). While the code looks complicated at first glance, it can be broken down into a number of simple steps:

  1. Create a random 5-character alphanumeric code and store it as a session variable.
  2. Create a blank image of pre-defined size and fill it with a white background.
  3. Draw coloured horizontal and vertical curves across the image.
  4. Draw each character of the code in a semi-random position.
  5. Output the image in PNG format.

The code generation is a nasty long line which should probably be split up for readability, but there's nothing complicated about it. It follows this simple formula:

  1. Generate 15 random bytes using openssl_random_pseudo_bytes().
  2. Convert the bytes to base-64.
  3. Convert the string to all upper case for readability.
  4. Remove any non-alphanumeric characters.
  5. Trim the string to five characters long.
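The five steps above can be sketched roughly as follows. This is not the code from code_image.php, just a minimal illustration of the same formula. The function name generate_code() is made up for this example, the random_bytes() fallback is my own assumption, and the retry loop simply regenerates the string if too few alphanumeric characters survive.

```php
<?php
// TEXT_LENGTH is the constant name used in the article; its value here is assumed.
const TEXT_LENGTH = 5;

// Hypothetical helper: builds a random upper-case alphanumeric code.
function generate_code(int $length = TEXT_LENGTH): string
{
    // 15 bytes become 20 base-64 characters, so longer codes can never succeed.
    if ($length < 1 || $length > 20) {
        throw new InvalidArgumentException('Length must be between 1 and 20.');
    }

    do {
        // Step 1: 15 random bytes (falling back to random_bytes() is an assumption).
        $bytes = function_exists('openssl_random_pseudo_bytes')
            ? openssl_random_pseudo_bytes(15)
            : random_bytes(15);

        // Steps 2 and 3: base-64 encode, then convert to upper case.
        $text = strtoupper(base64_encode($bytes));

        // Step 4: remove anything that is not a letter or a digit ('+' and '/').
        $text = preg_replace('/[^A-Z0-9]/', '', $text);

        // Step 5: trim to the required length.
        $code = substr($text, 0, $length);
    } while (strlen($code) < $length); // too much punctuation: try again

    return $code;
}
```

In a real script the result would then be stored in the session before the image is drawn.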

Using OpenSSL to generate our random data is probably overkill. OpenSSL provides a cryptographically secure random number generator, which means its output is unpredictable for all practical purposes, but you could get away with something like mt_rand() here. Still, openssl_random_pseudo_bytes() is easier to use if you need a string of bytes, because mt_rand() returns an integer. OpenSSL might be a bit slower, but there's no real performance hit as far as I can tell. Whatever you do, don't change this to use rand(): in PHP versions before 7.1 it uses the C rand() function, which varies a lot from one operating system to the next and often isn't very random at all.

Why 15 bytes? There are several reasons, the first of which is that we ideally need something that is divisible by 3, to avoid any padding when converted to base-64. Each character in a base-64 string represents 6 bits (2^6 = 64), meaning that every 3 bytes of binary is converted into 4 base-64 characters. This gives us a 20 character string, but some of those characters could be punctuation that will be removed. If we want to be left with at least 5 alphanumeric characters when all the punctuation is gone, we should generate a few more characters than we actually need. 20 characters is still a bit more than necessary, but it leaves enough for you to make longer captchas by simply changing TEXT_LENGTH.

There is a remote possibility that we could end up with a lot of punctuation and few or no alphanumeric characters. This is very unlikely, but just in case, the code checks the length of the final trimmed string. If it is less than 5 characters long, another string will be generated. Five upper case letters or numbers gives us a little over 60 million possible codes ((26+10)^5 = 60,466,176). Cryptography geeks might be wincing now, but 60 million codes is plenty for a captcha because you only get to try each code once. The code changes every time you make a mistake. Compiling a database of images to check against isn't possible because the position of each character changes randomly.

The randomization makes drawing the code a little more complicated than it would otherwise be. Each character has to be drawn individually, and we have to limit the output of the random number generator so that the characters appear in order, within the bounds of the image. The characters are all assigned their own bounding box, the same height as the image and one-fifth the width. The size of this box is the same for every character, so it only has to be calculated once. The X position of the box is the (zero-based) index of the character multiplied by the box width, and the Y position is always zero.
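As a rough illustration of the box arithmetic, here is a small sketch. The function name character_box() and the array keys are invented for this example and are not taken from the script.

```php
<?php
// Hypothetical helper: returns the bounding box for the character at $index.
function character_box(int $index, int $imageWidth, int $imageHeight, int $codeLength = 5): array
{
    // Every box has the same size: full image height, one share of the width.
    $boxWidth = intdiv($imageWidth, $codeLength);

    return [
        'x'      => $index * $boxWidth, // zero-based index times box width
        'y'      => 0,                  // boxes always start at the top
        'width'  => $boxWidth,
        'height' => $imageHeight,
    ];
}
```

With the default 100x38 image, each box is 20 pixels wide, so the fourth character (index 3) gets the box starting at X = 60.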

However, we can't just choose random X and Y coordinates from within the box. The character obviously takes up some space of its own, and we need to make sure it doesn't overlap the edges of the box. We can find the size of the character with GD's imagettfbbox() function, which returns an array of corner coordinates that we can use to calculate the character's width and height. This has to be repeated for each character, because each glyph in a TrueType font can be a different size. Although I chose to use a monospace font, there's nothing to say you have to.

Notice how the range for the Y coordinate is reversed. Instead of being from 0 to $box_height - $font_height, it's from $font_height to $box_height. This is because the coordinates of a character in GD specify the bottom left corner, not the top left corner. Finally, once we've determined the allowable range for both the X and Y coordinates, we can generate some random coordinates using mt_rand(), and draw the character using imagettftext().
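The coordinate ranges can be sketched like this. The helper names and array keys are made up for illustration. glyph_size() expects the eight-element array that imagettfbbox() returns (lower left, lower right, upper right, upper left corners), and the sketch assumes the box starts at Y = 0 as described above.

```php
<?php
// Hypothetical helper: width and height of a glyph from an imagettfbbox() result.
function glyph_size(array $bbox): array
{
    return [
        'width'  => abs($bbox[2] - $bbox[0]), // lower right X minus lower left X
        'height' => abs($bbox[1] - $bbox[7]), // lower left Y minus upper left Y
    ];
}

// Hypothetical helper: picks a random drawing position that keeps the glyph inside its box.
function random_glyph_origin(array $box, array $glyph): array
{
    $minX = $box['x'];
    $maxX = $box['x'] + $box['width'] - $glyph['width'];

    // The Y range is "reversed" because the position refers to the bottom of the glyph.
    $minY = $glyph['height'];
    $maxY = $box['height'];

    if ($maxX < $minX || $maxY < $minY) {
        throw new RangeException('Glyph does not fit inside its box.');
    }

    return ['x' => mt_rand($minX, $maxX), 'y' => mt_rand($minY, $maxY)];
}
```

The returned coordinates would then be passed to imagettftext() as the X and Y arguments.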

Include file

The include file (captcha.php) contains two functions which you can call from your own script. I've already explained how to use these functions; this section contains more detail on how the functions work.

verify_code() is used to validate the code the user enters. It takes the user input as a parameter, and returns a boolean value that tells you whether the code is correct. The function looks simple enough, but there are two important points. First, you might be wondering why we check to see if the user input is an empty string. Why not just compare it to the stored code straight away? Consider what would happen if an attacker didn't load the captcha image. The script in code_image.php would never be run, so the session variable would never be set and would remain empty. Now the attacker can bypass the captcha just by entering an empty string. We have to make sure the string isn't empty to prevent this.

Second, it's very important that the session variable is cleared after verifying the code. Without this, an attacker could load the captcha image only once, then never load it again. The code stored in the session variable would remain the same no matter how many times the form is reloaded, allowing the attacker an infinite number of attempts to guess the code by brute force.
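To make the two points concrete, here is a minimal sketch of the same logic. It is not the actual verify_code() from captcha.php: the function name, the session key 'captcha_code' and the exact, case-sensitive comparison are all assumptions, and the session is passed in as an array so the example can run on its own.

```php
<?php
// Hypothetical stand-in for verify_code(); $session plays the role of $_SESSION.
function verify_code_sketch(string $input, array &$session): bool
{
    // Read the stored code, treating a missing value as an empty string.
    $stored = (string) ($session['captcha_code'] ?? '');

    // Clear the stored code straight away, so each captcha allows only one guess.
    unset($session['captcha_code']);

    // Reject empty values: otherwise an empty guess would match an unset code.
    if ($input === '' || $stored === '') {
        return false;
    }

    return $input === $stored;
}
```

In a real script you would call session_start() first and pass $_SESSION as the second argument.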

show_captcha() looks more complicated, but there's not a lot that needs explaining. The only thing that might not be obvious at first is the line of JavaScript that reloads the image when the user clicks the refresh button: why tag the current Unix time onto the end of the filename? This is to force the browser to load a new image from the server. Without it, some browsers will show a locally cached version of the image and the code won't change. The time string comes after a # character in the filename, so it doesn't make any difference to how the image is shown (you can't have anchors in an image).
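A stripped-down sketch of the idea is shown below. It is not the real show_captcha(): the function name captcha_html(), the element id and the button markup are invented for this example, and it returns the HTML as a string instead of printing it. It uses a millisecond timestamp from Date.now(), which changes on every click just as the Unix time does.

```php
<?php
// Hypothetical simplified version of show_captcha() that returns the markup.
function captcha_html(bool $withRefresh): string
{
    $html = '<img id="captcha_image" src="code_image.php" alt="Captcha">';

    if ($withRefresh) {
        // Appending a timestamp after '#' gives the image a new URL on every click.
        $js = "document.getElementById('captcha_image').src = 'code_image.php#' + Date.now();";

        // Escape the script so it is safe inside the onclick attribute.
        $html .= ' <button type="button" onclick="'
            . htmlspecialchars($js, ENT_QUOTES)
            . '">Refresh</button>';
    }

    return $html;
}
```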

Customization

There are a lot of changes you can make to customize the captcha to your own tastes and site design. Here are a few things I specifically made easy to change when writing the code.

Constants

The constants at the beginning of the script control most aspects of how the captcha appears:

Colours

The four colours used to draw the captcha (white for the background, black for the text, red for the horizontal curves and blue for the vertical curves) are defined on lines 18 to 21. You can change the RGB codes to select any colour you want, but you should make sure there is enough contrast for the text to be easily readable. In particular, the horizontal and vertical stripes should be in light pastel colours or the text will be hard to read.

Security

Breaking this captcha would be easy using basic image analysis. This method should work:

  1. Break the image up horizontally into 5 equal-sized sections. Each section will have one character in it.
  2. Filter each section to remove everything that isn't black. Only the text will be left over.
  3. Determine the boundaries of a box that surrounds only the black area of a section, while taking up the minimum area possible.
  4. Compare the image within each bounding box to the glyphs from the font that was used to generate the captcha.
  5. Reassemble the resulting characters into a string in the order that they were taken from the image. This is the captcha code.
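Steps 1 to 3 can be sketched in a few lines. This example works on a plain two-dimensional array rather than a real image, where $pixels[$y][$x] is true for a black pixel (so step 2, the colour filter, is assumed to have been done already). The function name and array keys are invented for illustration, and the glyph matching in steps 4 and 5 is left out.

```php
<?php
// Hypothetical helper: finds the minimal box around the black pixels in each section.
function black_bounding_boxes(array $pixels, int $sections = 5): array
{
    $height = count($pixels);
    $width = $height > 0 ? count($pixels[0]) : 0;
    $sectionWidth = intdiv($width, $sections);
    $boxes = [];

    for ($s = 0; $s < $sections; $s++) {
        $left = $s * $sectionWidth; // step 1: equal-sized horizontal sections
        $box = null;                // stays null if the section has no black pixels

        for ($y = 0; $y < $height; $y++) {
            for ($x = $left; $x < $left + $sectionWidth; $x++) {
                if (!$pixels[$y][$x]) {
                    continue; // step 2: ignore everything that isn't black
                }
                if ($box === null) {
                    $box = ['min_x' => $x, 'max_x' => $x, 'min_y' => $y, 'max_y' => $y];
                    continue;
                }
                // Step 3: grow the box just enough to include this pixel.
                $box['min_x'] = min($box['min_x'], $x);
                $box['max_x'] = max($box['max_x'], $x);
                $box['min_y'] = min($box['min_y'], $y);
                $box['max_y'] = max($box['max_y'], $y);
            }
        }
        $boxes[] = $box;
    }

    return $boxes;
}
```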

Improvements

Depending on how much time you're willing to put in, there are a lot of improvements that could be made to this script. Some of these improve security and make the captcha harder to break; others improve usability or fix minor issues.

Security

Perhaps the best thing you can do to make the captcha harder to break is to distort the text somehow. This is the approach taken by most captcha software such as ReCaptcha. Introducing a random skew across each character will make it difficult for image processing algorithms to recognize them, while still remaining readable to a human. You could use other types of distortion like removing part of each glyph or adding noise over the top of the character, but averaging algorithms might still be able to interpret the text.

You might be tempted to use a different background in place of the curves used currently. Don't bother: backgrounds are next to pointless because they don't make a captcha any harder to break. And yes, that includes the current background. For a captcha to be readable to a human, there needs to be enough "contrast" between the text and the background for the user to be able to tell them apart, either by colour or by shape. If the background and text are different enough, a computer will always be able to separate them.

Remember that the point isn't to make an unbreakable captcha, because there's no such thing. Captchas, especially text-based ones, will always be breakable by computers, so long as someone is willing to write the code to do it. The point is to make a captcha that is impractical to break on a large scale using sensible amounts of computing power. If your computer can only break one captcha every minute, it can still be done by hand more efficiently. Exactly how complicated is good enough will change over time, as computers get more powerful and image processing algorithms get more efficient. This is why captchas have got significantly more difficult in recent years.

Other

There are a couple of improvements that could be made to how the script handles different image sizes. At the moment, the captcha size is hardcoded to 100x38 pixels, but can be changed by modifying two constants (see "Customization" above). You could easily modify the script to accept different sizes as HTTP GET parameters, making it easier to integrate a captcha into your web page without having to change the code. Other parameters such as font size or colouring could be set like this too. Some bounds checking would be needed to stop users from making the server generate excessively large images, and the code should fall back to a default size if no parameters are specified.
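One way this could look is sketched below. Nothing here exists in the script yet: the parameter names w and h, the constant names and the minimum and maximum limits are all assumptions chosen for illustration. Only the 100x38 default comes from the current code. Missing, non-numeric or out-of-range values fall back to the default.

```php
<?php
const DEFAULT_WIDTH = 100;  // current hardcoded size
const DEFAULT_HEIGHT = 38;
const MIN_WIDTH = 50;       // hypothetical limits
const MAX_WIDTH = 200;
const MIN_HEIGHT = 20;
const MAX_HEIGHT = 100;

// Hypothetical helper: returns $value as an int if it lies within [$min, $max], else $default.
function bounded_int($value, int $min, int $max, int $default): int
{
    $result = filter_var($value, FILTER_VALIDATE_INT, [
        'options' => ['min_range' => $min, 'max_range' => $max],
    ]);

    return $result === false ? $default : $result;
}

// Hypothetical helper: reads the requested size from the query parameters.
function captcha_size(array $query): array
{
    return [
        'width'  => bounded_int($query['w'] ?? '', MIN_WIDTH, MAX_WIDTH, DEFAULT_WIDTH),
        'height' => bounded_int($query['h'] ?? '', MIN_HEIGHT, MAX_HEIGHT, DEFAULT_HEIGHT),
    ];
}
```

In the image script this would be called as captcha_size($_GET) before the blank image is created.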

The script works correctly at any sensible size for a captcha, but at unusually large sizes (greater than 200 pixels wide or 100 pixels high), the curve patterns start to break up around the edges. This is because not enough horizontal and vertical curves are generated to fill the image properly. Although the number of curves does scale automatically with changes to image size, the scaling is linear while it should probably be some form of logarithmic. A linear approximation holds up well at small image sizes, but with larger images it starts to fall apart. This shouldn't be a problem as no one needs such big captchas, but it could still be improved to make the code truly robust. I haven't given any real thought to how you'd do that yet.

You could make the font size scale with changes in image size, or the image scale with font size if you prefer. At the moment both are fixed, so if you resize the image you will need to change the font size separately.

Licenses