Well, have you ever wondered, while surfing the net, what the hell happens when you type a website address and click visit or whatever? How does the browser talk to the remote computer and get the page you want? I mean, all you do is type in an address and then everything is taken care of. But don’t you really want to know what happens from the time you hit enter till you get the page? If so then keep reading…

Contacting a web server works the same way as contacting any other network service, such as FTP. All of them are based on what we call a protocol. In fact the names are descriptive enough: HTTP = Hyper Text Transfer Protocol. The protocols are described in very long documents called RFCs (Request For Comments), which spell out the protocols and standards in great detail. What I am trying to do here is a small crash tutorial on the HTTP protocol and its headers. If you need the full RFC on HTTP 1.1 then you should go here.

So, it seems you are still here rather than reading the RFC. Fine, let’s take a quick look at HTTP 1.1. What is involved in HTTP is a web server, such as Apache (plus the other Apache projects like Tomcat, Tapestry etc), IIS and all the others. When you hit http://www.google.com the following steps happen:

  • www.google.com is translated to the IP address of the server.
  • The http at the beginning of the address means that the protocol to be used is HTTP.
  • A socket connection is opened to the host (at the IP found above) on port 80 (since no other port was given and the default for http is 80).
  • Then the browser tells the web server who it is and what it wants, and the web server replies with either an OK message and the content the browser asked for, or a not-OK message along with an error code.
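
The first three steps can be sketched in a few lines of Python. This is only an illustration under some assumptions: describe_target is a made-up helper name, and the actual DNS lookup and socket connection are left as comments so the snippet runs without a network.

```python
from urllib.parse import urlsplit

# Default ports per scheme; used when the address does not name a port.
DEFAULT_PORTS = {"http": 80, "https": 443}

def describe_target(url):
    """Return (scheme, host, port) for a URL, filling in the default port."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return parts.scheme, parts.hostname, port

scheme, host, port = describe_target("http://www.google.com")
print(scheme, host, port)

# With a network connection, the remaining steps would be roughly:
#   ip = socket.gethostbyname(host)              # name -> IP address
#   conn = socket.create_connection((ip, port))  # open the socket
```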

Let’s focus on the last step. The other ones are beyond the scope of this post. So, when we say the browser tells the server who it is and what it wants, how does it do that exactly? Let’s clarify that the request line, the status line and the headers exchanged between the two ends are plain text. Those lines are called “HTTP headers”. Every header ends with a CRLF (or \r\n), which is the new line separator. After the last header the client sends one more CRLF, so the header block ends with two new line separators in a row (an empty line).
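
To make the CRLF business concrete, here is a minimal sketch of how a client could put its headers together. build_request is a hypothetical helper of mine, not part of any library, and the header values are just examples.

```python
CRLF = "\r\n"

def build_request(method, path, headers):
    """Serialize a request line plus headers, ending with the empty line."""
    lines = [f"{method} {path} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # Every line ends with CRLF; one extra CRLF marks the end of the headers.
    return (CRLF.join(lines) + CRLF + CRLF).encode("ascii")

raw = build_request("GET", "/", {"Host": "localhost:2020", "Connection": "close"})
print(raw)
```

Printing the bytes makes the otherwise invisible \r\n pairs show up, including the double one at the very end.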

From here on, the server starts parsing the headers the client sent and, first of all, determines whether the request is valid HTTP 1.1 (or 1.0). If it is not, a 400 error code is sent back. In general 4XX codes are sent when the error is on the client’s side (5XX codes mean the server itself failed). For instance you must have seen the 404 error, which means that the page requested does not exist. If a 200 code is sent this means that everything went just fine, so be happy for it. For a full explanation of the status codes you can refer to this page.
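
Reading the code out of the first line of a reply is simple enough. The following is a rough sketch; parse_status_line and status_class are names I made up for the example, and the family names follow the first digit of the code.

```python
def parse_status_line(line):
    """Split a status line such as 'HTTP/1.1 200 OK' into its three parts."""
    parts = line.strip().split(" ", 2)
    version, code = parts[0], int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""  # the reason text may be missing
    return version, code, reason

def status_class(code):
    """Map a status code to the family named by its first digit."""
    families = {1: "informational", 2: "success", 3: "redirection",
                4: "client error", 5: "server error"}
    return families.get(code // 100, "unknown")

version, code, reason = parse_status_line("HTTP/1.1 404 Not Found")
print(code, reason, "->", status_class(code))
```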

If the request was successful, the server will send back what we asked for. If, for instance, we asked for foo.html then it will send the contents of the foo.html file. Among the response headers there will be a very important one, “Content-Length”, which tells how many bytes are coming our way after the headers. Then the browser reads the contents, parses them accordingly and renders the page.
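
Here is a small sketch of how a client could cut a reply into headers and content using Content-Length. split_response is a hypothetical helper, and the reply bytes are invented for the example (a tiny stand-in for foo.html), not a real server dump.

```python
def split_response(raw):
    """Split raw response bytes into (status line, headers dict, body)."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    # Only the first Content-Length bytes after the empty line are the body.
    length = int(headers.get("content-length", len(rest)))
    return status_line, headers, rest[:length]

raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Type: text/html\r\n"
       b"Content-Length: 10\r\n"
       b"\r\n"
       b"<p>foo</p>")
status_line, headers, body = split_response(raw)
print(status_line, headers["content-length"], body)
```

Header names are lowercased on the way in because they are case-insensitive, so “Content-length” and “Content-Length” end up under the same key.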

But what if the request was for an image? Or even a binary file of an unknown format? Here comes the “Content-Type” header. If we ask for a plain HTML file this header will have the value “text/html”. If we ask for an image (PNG for instance) the header will be “Content-Type: image/png”.
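
As a rough illustration, this is how a client might look at Content-Type before deciding what to do with the bytes. It borrows the header parser from Python’s standard email package; describe_content_type and handling_for are made-up names, and the three “handling” strings are just placeholders for whatever a real browser would do.

```python
from email.message import Message

def describe_content_type(value):
    """Return (media type, charset or None) for a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_type(), msg.get_content_charset()

def handling_for(value):
    """Pick a (placeholder) action based on the media type."""
    media_type, _ = describe_content_type(value)
    if media_type == "text/html":
        return "render as a page"
    if media_type.startswith("image/"):
        return "display as an image"
    return "offer as a download"

print(describe_content_type("text/html; charset=UTF-8"))
print(handling_for("image/png"))
```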

One important thing is that if you want to be HTTP 1.1 compliant (which means if you make an HTTP/1.1 request) then you should send at least the request line and the Host header:

  • GET /the/page/you/want HTTP/1.1
  • Host: thehost.com

Besides that, all the other headers are optional, but it is really helpful to set Connection: close/keep-alive, which tells the server whether to keep the connection open after the response, and Keep-Alive, which hints for how many seconds.

Following is a dump of the headers sent by a client when it makes a request.

GET / HTTP/1.1
Host: localhost:2020
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; el; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: el-gr,el;q=0.7,en-us.;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-7,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

As you can see, things are pretty straightforward. The client says who it is, what it accepts, blah-blah, and then it awaits a response. The following is a dump from an Apache server after a valid HTTP 1.1 request.

HTTP/1.1 200 OK
Date: Thu, 28 Feb 2008 19:00:18 GMT
Server: Apache/2.2.3 (Debian) mod_python/3.2.10 Python/2.4.4 PHP/5.2.0-8+etch10 mod_ssl/2.2.3 OpenSSL/0.9.8c mod_perl/2.0.2 Perl/v5.8.8
X-Powered-By: PHP/5.2.0-8+etch10
Content-Length: 10
Connection: close
Content-Type: text/html; charset=UTF-8

1204225218

So we made a valid request and the server responded. As you can see, every header in both of the above dumps is separated by a CRLF (although you can’t see them). The content of the server’s reply is separated from the headers by two CRLFs (an empty line), just like we said. Notice the header Content-Length: 10; this means that the content is 10 bytes long. Count the digits of the response 😉

But here comes a tricky one. What if the server uses sessions? Well here is the response of the server on the same page as above. This time the page uses sessions.

HTTP/1.1 200 OK
Date: Thu, 28 Feb 2008 19:00:08 GMT
Server: Apache/2.2.3 (Debian) mod_python/3.2.10 Python/2.4.4 PHP/5.2.0-8+etch10 mod_ssl/2.2.3 OpenSSL/0.9.8c mod_perl/2.0.2 Perl/v5.8.8
X-Powered-By: PHP/5.2.0-8+etch10
Set-Cookie: PHPSESSID=4541b461b998bdfd2295534bdc372862; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Content-Length: 10
Connection: close
Content-Type: text/html; charset=UTF-8

1204225208

So as you can see the difference is in four headers. Set-Cookie: PHPSESSID=4541b461b998bdfd2295534bdc372862; path=/ tells the browser to set a cookie (this is how a session works: a special cookie carrying the session ID). Expires: Thu, 19 Nov 1981 08:52:00 GMT is a date in the past, which marks the page itself (not the cookie) as already expired so that it does not get cached; the cookie has no expiry of its own here, so it lasts until the browser is closed. Finally, Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0 and Pragma: no-cache tell the browser how to handle the cache.
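
If you are writing your own client, picking the session ID out of that Set-Cookie header could look something like this sketch, which uses SimpleCookie from Python’s standard http.cookies module. The cookie_header variable is just my name for the line the client would send back on its next request.

```python
from http.cookies import SimpleCookie

# The value of the Set-Cookie header from the dump above.
cookie = SimpleCookie()
cookie.load("PHPSESSID=4541b461b998bdfd2295534bdc372862; path=/")

session = cookie["PHPSESSID"]
print(session.value, session["path"])

# On the next request the client sends the pair back in a Cookie header,
# which is how the server recognizes the session.
cookie_header = f"Cookie: {session.key}={session.value}"
print(cookie_header)
```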

From all the above, one can figure out how easy it is to create a simple browser or a simple HTTP server of one’s own. Of course there are a lot of headers to implement on a server and a lot of features in a browser (JavaScript etc etc), which makes it just something you can do for fun.

Anyhow, one more use I found for this is that I can do some nice things if I make PHP send requests over HTTP. Some would argue, “hey, use fopen with a URL… duh!” but with the way I suggest you can do almost anything. For instance, if there is a script on another server that takes some POST variables and does something with them, you can easily post stuff through PHP. Let’s say that the script gets your name via POST and does some stuff with it (whatever, it could just say hello; the posting procedure is what matters here). This is what the post would look like.

POST / HTTP/1.1
Host: localhost:2020
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; el; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: el-gr,el;q=0.7,en-us.;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-7,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 12

name=Stratos

So, as you can see, all the browser does is announce 12 bytes of content and then, in the content area (separated from the headers by two CRLFs as always), add the URL-encoded values. Beware! The maximum post length the server accepts depends on the PHP configuration (the post_max_size setting, 8M by default).
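
The same idea works from any language that can open a socket. As a hedged sketch (in Python rather than PHP, and with build_post being a name I made up), this is how the request above could be assembled with only the headers that really matter. Actually sending it is left as a comment, so the snippet runs without a server.

```python
from urllib.parse import urlencode

def build_post(path, host, fields):
    """Build a form POST request; Content-Length is computed from the body."""
    body = urlencode(fields).encode("ascii")
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Connection: close\r\n"
        "\r\n"  # the empty line that separates headers from content
    )
    return head.encode("ascii") + body

request = build_post("/", "localhost:2020", {"name": "Stratos"})
print(request.decode("ascii"))

# To send it for real, something along these lines would do:
#   conn = socket.create_connection(("localhost", 2020))
#   conn.sendall(request)
```

Computing Content-Length from the encoded body, instead of typing it by hand, keeps the two from drifting apart when the values change.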

I’ve had this article as a draft for the past three days. This is because I wanted to make it as easy to understand as I could and pack in as much information as I could. I hope this helps/educates you guys, and be sure more on the same subject is coming (like file uploading this way). For now, if you like the article, find it useful or have some pointers, please leave a comment.

/me out