A few weeks back I posted on why our mailboxes are full of spam. I mentioned there that I had written a small, very simple Perl script to crawl the internet and fish for plain-text emails. It was a pretty easy task, since Perl is designed to make text manipulation easy.

The procedure I used was as follows:

  1. I had a MySQL database for storing links to visit and emails.
  2. I connected to the database.
  3. Selected a link to visit.
  4. Made sure it was not an image / pdf / Google page / Amazon page. If it was, I jumped back to step 3.
  5. Downloaded the page.
  6. Pattern matched for emails.
  7. Got the links within the page.
  8. Inserted the emails and new links into the database.
  9. Went back to step 3.

This is roughly the idea. The small flowchart below gives the general picture.

[Flowchart: crawler]
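To make the flow concrete before touching the database, here is a minimal in-memory sketch of the same loop. It is not the script from this post: the `%fake_web` hash stands in for the internet, the `crawl` function and the example.* addresses are made up for illustration, and the regular expressions are simplified assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the web: url => page body. A real run would
# download each page instead of looking it up here.
my %fake_web = (
    'http://a.example/' => 'mail me: foo@a.example <a href="http://b.example/">b</a>',
    'http://b.example/' => 'bar@b.example <a href="http://a.example/">a</a> <a href="http://b.example/pic.jpg">pic</a>',
);

sub crawl {
    my ($seed, $pages) = @_;
    my @queue = ($seed);   # links still to visit (plays the role of the urls table)
    my %visited;           # links already handled
    my %emails;            # email => page it was seen on (the emails table)

    while (defined(my $url = shift @queue)) {           # step 3: select a link
        next if $visited{$url}++;
        next if $url =~ /\.(?:jpe?g|gif|png|pdf)$/i;    # step 4: skip binary files
        my $content = $pages->{$url};                   # step 5: "download" the page
        next unless defined $content;
        # step 6: pattern match for emails
        $emails{$_} = $url for $content =~ /([\w.+-]+@[\w-]+(?:\.[\w-]+)+)/g;
        # steps 7 and 8: collect the links and queue them
        push @queue, $content =~ /href="([^"]+)"/g;
    }
    return \%emails;
}

my $found = crawl('http://a.example/', \%fake_web);
print "$_ (seen on $found->{$_})\n" for sort keys %$found;
```

The queue and the two hashes play the part of the MySQL tables; swapping them for real tables and the hash lookup for a real download gives the script described in the rest of this post.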

I would like to demonstrate how it’s done with Perl. To run it, you need a machine with Perl installed (preferably Linux). If you don’t have Linux, you will probably have to download a Perl binary for Windows. Check out this page for further info.

Now, let’s cut to the chase. Here are the first lines of the script:

    #!/usr/bin/perl
    use WWW::Mechanize;
    use DBI;

Here we just include a couple of libraries. To be more specific, we are going to use the Mechanize library to retrieve the pages easily and DBI to connect to MySQL.

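    my $host     = "127.0.0.1";
    my $database = "db";
    my $user     = "username";
    my $password = "password";
    my $dbh = DBI->connect("dbi:mysql:database=$database;host=$host", $user, $password)
        or die "Cannot connect: $DBI::errstr";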

With this snippet we connect to the database. Next we send all the error messages to “/dev/null”. I did this because I was getting some stupid errors for 404’s and the like which, once the script was done, I didn’t care about. If you want to play with it, just leave this line out to get all the errors printed on your screen.

    open(STDERR, ">/dev/null");

From here on we are getting to the serious stuff. We start a loop through all the links we need to visit. This loop is effectively endless, since the table of links keeps being populated with new ones. Here is the start:

    while ((my $url = get_url()) ne "") {

I will take a small break here to introduce the “get_url” function. With it I retrieve the next link to visit from the database. Here it is:

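    sub get_url {
        my @results = $dbh->selectrow_array("SELECT url FROM urls WHERE visited = '0' LIMIT 1;");
        return "" unless @results;    # nothing left to visit
        # mark the link as visited so that it is not selected again
        $dbh->do("UPDATE urls SET visited = '1' WHERE url = \"$results[0]\";");
        return $results[0];
    }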

This way I get the next link. Now back to the while loop. We have the link we want to visit, but before we go on we need to make sure of the following:

  1. It’s not some kind of image. If it is, there is no point in getting the contents, since I am surely not going to get any email out of them. It’s all binary.
  2. It’s not a pdf. Same as above: this is binary, so I won’t get any email from that either.
  3. It’s not an Amazon page. Since so many pages carry Amazon ads, I might get links to Amazon products or widget landing pages. There is no point at all in getting those pages.
  4. It’s not a Google page. Same as above: due to the Google ads, many links in blogs lead to Google pages that are of no use to this crawler. So we will ignore those too.

In order to ignore a link, here is what we need to do. Let’s take the first example and ignore a jpg file.

    next if $url =~ /\.jpg$/i;

The same approach lets us ignore all of the above. I won’t go into the details; there is no point in doing so.
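For the curious, here is one way the four checks could be folded into a single helper. This is only a sketch: the `should_skip` name, the list of file extensions and the host pattern are my assumptions, not what the original script used.

```perl
use strict;
use warnings;

# Returns true when a link is not worth downloading.
# The extension list and the host pattern are assumptions.
sub should_skip {
    my ($url) = @_;
    # images and pdfs: binary content, no emails to find
    return 1 if $url =~ /\.(?:jpe?g|gif|png|bmp|pdf)(?:[?#]|$)/i;
    # any host whose name contains an "amazon." or "google." label
    return 1 if $url =~ m{^https?://(?:[^/]*\.)?(?:amazon|google)\.}i;
    return 0;
}

for my $url ('http://example.com/photo.JPG',
             'http://www.amazon.com/dp/1',
             'http://example.com/blog/') {
    printf "%-32s %s\n", $url, should_skip($url) ? 'skip' : 'visit';
}
```

Matching on the host part only, rather than on the whole link, avoids throwing away a blog post that merely mentions Google somewhere in its query string.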

Now we are sure we want to visit the link, so we use the Mechanize library to retrieve its content. Here is how to do it:

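    # autocheck => 0 keeps the script running when a page returns an error such as a 404
    my $mech = WWW::Mechanize->new(autocheck => 0);
    $mech->agent_alias('Windows IE 6');
    $mech->get($url);
    my $content = $mech->content();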

Using this simple method we download the page the link points to. Notice that we tell the server we are a “Windows IE 6” browser. That’s nasty, but it makes the request look more legitimate. Now the “$content” variable holds the HTML of the page. Let’s parse it for emails:

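    # the pattern below is a stand-in; any simple user@host.tld expression will do
    while ($content =~ m/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}/g) {
        if (index($&, "jpg") == -1) {
            $dbh->do("INSERT INTO emails(email, site_seen) VALUES(\"$&\", \"$url\")");
        }
    }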

With this snippet we search the page, using a regular expression, for emails. It only matches plain, simple addresses like “foo@bar.com”. It’s not sophisticated, but it works surprisingly well! Do notice one thing: we exclude the matches that contain the word “jpg”. I do this because I noticed that all the images on Flickr have a name that looks like an email. This saved me from a lot of 404’s indeed!
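If you want to try the matching on its own, without a database, something along these lines should do. Treat it as a sketch: the `extract_emails` name and the pattern are my own choices for illustration, and lowercasing plus de-duplicating the matches is an extra step the script above does not take.

```perl
use strict;
use warnings;

# Returns the distinct plain-text emails found in a chunk of HTML.
sub extract_emails {
    my ($content) = @_;
    my (%seen, @emails);
    while ($content =~ /([A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,})/g) {
        my $email = lc $1;          # treat Foo@Bar.com and foo@bar.com as one address
        next if $email =~ /jpg/;    # Flickr-style image names that look like emails
        next if $seen{$email}++;    # already collected from this page
        push @emails, $email;
    }
    return @emails;
}

my $html = 'Write to Foo@Bar.com or foo@bar.com, not to 1234_abcd@N00.jpg';
print "$_\n" for extract_emails($html);
```

Collecting the matches into a list first also makes it easy to test the pattern against saved pages before pointing the crawler at live sites.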

Now, onto the links that are contained in this page. We need to extract them first. Check this out:

    my @links = $mech->links();

With this simple line the array “links” holds all the links contained in the page. Isn’t Mechanize awesome? Next we loop through them and add those that are not images, Amazon pages etc. to the database in order to visit them later.

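    foreach my $link (@links) {
        my $new_url = $link->url_abs();
        # same filters as before; the exact condition shown here is an assumption
        if ($new_url !~ /\.(jpe?g|gif|png|pdf)$|amazon|google/i) {
            $dbh->do("INSERT INTO urls(url) VALUES(\"" . $new_url . "\")");
        }
    }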

We are done! Once the closing brace of the while loop is reached, the script goes back to the top and selects the next link to visit. As simple as that!

From the above code you can clearly see how easy it is to write a simple crawler. When I put it into action, the urls table contained only one link: that of a blog listing site. In a couple of hours it contained thousands of links, most of which were valid. This is essentially how a crawler works: it starts from somewhere and expands its “web” to neighbouring sites, then to the neighbours of the neighbours, and so on.
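To get to that starting point you need the two tables and one seed link. Here is a rough sketch of how that could look. Only the table and column names come from the queries in this post; the column types, the `setup_crawler` name and the seed URL in the usage note are assumptions.

```perl
use strict;
use warnings;

# Assumed schema: the names match the queries used by the crawler,
# the types and sizes are guesses.
my @schema = (
    "CREATE TABLE IF NOT EXISTS urls (
         url     VARCHAR(255) NOT NULL,
         visited TINYINT      NOT NULL DEFAULT 0
     )",
    "CREATE TABLE IF NOT EXISTS emails (
         email     VARCHAR(255) NOT NULL,
         site_seen VARCHAR(255) NOT NULL
     )",
);

# $dbh is a connected DBI handle; $seed is the first link to visit.
sub setup_crawler {
    my ($dbh, $seed) = @_;
    $dbh->do($_) for @schema;
    # The ? placeholder lets DBI quote the value, so a link that contains
    # a quote character cannot break the statement.
    $dbh->do("INSERT INTO urls(url) VALUES(?)", undef, $seed);
    return;
}
```

With a connected handle, `setup_crawler($dbh, 'http://www.example.com/')` would create the tables and queue the first link; the URL is just a placeholder. The same placeholder style could be used for the INSERT statements in the crawler itself.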

One more thing I’d like to point out is that this crawler is not sophisticated. It doesn’t check for pages with the same content, or for links that look very much alike and might point to the same content. For instance, if a site lets you change the background color through a GET parameter on the link, that is something we would like to dodge. The links could be “http://www.mysite.com/” and “http://www.mysite.com/?color=blue”. The content behind the two seemingly different links is the same.
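One cheap way to dodge such duplicates would be to fingerprint each downloaded page and skip the ones already seen. The sketch below is not part of the original script; the `seen_before` name and the whitespace normalization are my assumptions, and it only catches pages whose HTML is identical apart from whitespace.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

my %seen_content;   # fingerprint => first URL that produced it

# Returns the URL that already served this content, or undef if it is new.
sub seen_before {
    my ($url, $content) = @_;
    (my $normalized = $content) =~ s/\s+/ /g;    # ignore whitespace-only differences
    # md5_hex needs bytes, so encode the (possibly decoded) text first
    my $fingerprint = md5_hex(encode_utf8($normalized));
    return $seen_content{$fingerprint} if exists $seen_content{$fingerprint};
    $seen_content{$fingerprint} = $url;
    return;
}
```

Inside the loop, a call such as `next if defined seen_before($url, $content);` right after the download would skip the email and link extraction for a page that has already been processed under another address.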

All in all it was a simplistic crawler, but the principles of crawling are there. If you have any questions / suggestions or find any error, I’d be glad to hear from you!