Software Technologies for the Web
 

HTTP (exploring programmatically)

The Internet is a cornerstone of modern society. We use it for many everyday tasks, such as communicating with others by e-mail, managing finances through Internet banking and e-commerce, and simply for fun.

HTTP

The World Wide Web works by making use of a protocol known as the Hypertext Transfer Protocol (HTTP); this protocol underpins the Web, which in turn runs over the Internet. The development of HTTP was a joint effort by the World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF). Their work resulted in the Request for Comments document RFC 2616 (Wikipedia 2007), which defines the HTTP/1.1 protocol that is in use today.

HTTP uses various request methods to obtain information. A brief summary of some of these follows:

  • HEAD - the HEAD method is very similar to the GET method. The difference is in the information that is returned: a HEAD request asks only for the metadata (the response headers), not the full content of the requested URL (Marshall 1997). It can be used to perform checks on URLs without transferring all of the data, for example to check whether a file has been recently modified (W3.org 2008).

  • GET - whereas the HEAD method retrieves only the headers, the GET method retrieves all the information related to that URI. The GET method comes in various flavours. It can be made conditional: a conditional GET transfers the content only if the condition is fulfilled, for example only if the file has been modified after a particular date and time (Fielding et al. 1999). It can also request only part of the information. A partial GET can be very useful, as it requests only a small portion of the document, saving on resources such as bandwidth.

  • POST - the POST method is a way of sending a block of data to the server. So whereas a GET request asks for information, a POST request often sends data that will be used to generate the output. An example of this is an online form sending data to a CGI script to be processed (Marshall 1997). (The GET method is also commonly misused to do this.) For the POST method to work this way, extra headers must be sent. The most well-known of these is the Content-Type header (W3.org 2008), whose value is declared in the HTML form.
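The three methods summarised above differ mainly in the first line of the request and in the headers that accompany it. The following is a minimal sketch, not code from the example programs, that builds the text of such requests. The class name RequestLineSketch, the paths, and the form data are hypothetical and chosen only for illustration.

```java
public class RequestLineSketch {

    // Builds the text of a simple HTTP/1.0 request: the request line,
    // any extra header lines, and the blank line that ends the headers.
    public static String build(String method, String path, String... headers) {
        StringBuilder sb = new StringBuilder();
        sb.append(method).append(' ').append(path).append(" HTTP/1.0\r\n");
        for (String header : headers) {
            sb.append(header).append("\r\n");
        }
        sb.append("\r\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // HEAD: headers only.
        System.out.print(build("HEAD", "/index.html"));

        // Conditional GET: the body is sent only if the file changed after this date.
        System.out.print(build("GET", "/index.html",
                "If-Modified-Since: Sat, 29 Oct 1994 19:43:31 GMT"));

        // POST: extra headers describe the body, which follows the blank line.
        String body = "name=Ann"; // hypothetical form data, 8 characters long
        System.out.print(build("POST", "/cgi-bin/form",
                "Content-Type: application/x-www-form-urlencoded",
                "Content-Length: " + body.length()) + body);
    }
}
```

Note the spaces on either side of the path and the blank line after the headers; a server cannot parse the request without them.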

Java Classes

We can demonstrate what information these methods return by using a simple Java web browser.
To do this we need to use a few classes in our program. They are:

java.net.URL
As the name suggests, this class deals with the URLs exchanged with the server. It performs two main functions: firstly, to create a new URL, and secondly, to parse a URL delivered to the program. When you build a URL object, the class checks that you have entered a supported protocol such as “http”. As of JDK version 1.1 (distributed by Sun), 10 protocols are supported. These are:
  • file
  • ftp
  • gopher
  • http
  • mailto
  • appletresource
  • doc
  • netdoc
  • systemresource
  • verbatim (Harold 1999)

Error-checking is done within this class: if none of the above protocols is found, an error in the form of a MalformedURLException is thrown. When parsing a URL, this class again checks for an appropriate protocol. The protocol determines what data is sent back to the program. In the example program that we are going to use, the data is delivered to the program through an input stream. The same error-checking that is done when you construct a URL is done when parsing: an exception is thrown when an unsupported protocol is detected.
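The protocol check described above can be observed directly. The following is a small sketch, assuming a made-up protocol name ("bogus") for which no handler exists; the class name UrlCheckSketch and the addresses are hypothetical. Recent JDK releases mark the URL constructors as deprecated, but the check behaves the same way.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlCheckSketch {

    // Returns true if java.net.URL accepts the address, false if the
    // constructor rejects it with a MalformedURLException.
    public static boolean isSupported(String address) {
        try {
            new URL(address);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isSupported("http://example.com/index.html")); // true
        System.out.println(isSupported("bogus://example.com/"));          // false
    }
}
```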


Creating a URL object (an example of this can be found in the HttpClient.java program):

// Create new URL object
URL url = new URL(address);

Parsing a URL (an example of this can be found in the HttpClient.java program):

// Split a URL into the protocol, host, and filename.
String protocol = url.getProtocol();
String host = url.getHost();
String filename = url.getFile();

java.net.Socket
A socket is a mechanism by which networked computers can “talk” to each other by means of the TCP/IP suite of protocols. It is one of the fundamental mechanisms that underpin the Internet. The Java class Socket allows us to create, use, and close sockets. To receive data using the Socket class you use its getInputStream method; the getOutputStream method is used for sending data. It is important to remember to close a socket when you have finished with it.
If you wish to connect to more than one machine you will need to create a separate Socket object for each of them; the rule is one socket per connection (Harold 1999).

// Create socket, passing host and port
Socket socket = new Socket(host, port);
// Get data
Scanner serverOut = new Scanner(socket.getInputStream());
// Send data
PrintWriter to_server = new PrintWriter(
        new OutputStreamWriter(socket.getOutputStream()));
// Close socket
socket.close();

Examples of this code in action can be found in the HttpClient.java program.

java.net.URLConnection
In the GetURLInfo.java program we are going to use the URLConnection class to access the URL, in much the same way as we used the Socket class. It can retrieve a variety of the headers that form the HTTP response. Depending on the request method, different headers are available.
For example, the methods getContentLength and getContentType return the Content-Length and Content-Type values to the program.
This class is also needed if, as discussed earlier, you wish to use a conditional GET request, as it gives you access to the Last-Modified value (Sun Microsystems 2003).

An example of this code can be found in the GetURLInfo.java program.

// url holds variable to be processed
URLConnection connection = url.openConnection();
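As a rough illustration of the header access described above, the sketch below collects three values from a URLConnection. It is not the GetURLInfo.java source; the class name ConnectionInfoSketch, the describe helper, and the default address are assumptions made for this example. The method names getContentType, getContentLengthLong, getLastModified, and setIfModifiedSince are part of java.net.URLConnection.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.LinkedHashMap;
import java.util.Map;

public class ConnectionInfoSketch {

    // Opens a connection to the address and collects a few header values.
    public static Map<String, String> describe(String address) {
        try {
            URLConnection connection = new URL(address).openConnection();
            // Conditional GET: for HTTP, ask for the body only if it changed
            // within the last 24 hours. Must be set before connecting.
            connection.setIfModifiedSince(System.currentTimeMillis() - 24L * 60 * 60 * 1000);

            Map<String, String> info = new LinkedHashMap<>();
            info.put("Content-Type", String.valueOf(connection.getContentType()));
            info.put("Content-Length", String.valueOf(connection.getContentLengthLong()));
            info.put("Last-Modified", String.valueOf(connection.getLastModified()));
            return info;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical default address; pass a real URL as the first argument.
        String address = args.length > 0 ? args[0] : "http://example.com/";
        describe(address).forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```

A value of -1 for Content-Length, or 0 for Last-Modified, means that the header was not available.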

Now that we have a bit of background knowledge, we can take a look at the example programs.

HttpClient.java

View Source Code

You can download and run any of these Java programs (assuming that you have some way of compiling them!).
If you have the SDK from Sun Microsystems you can compile them from the command prompt by typing:

javac prog_name.java

This will produce a class file.
You can run the program in a similar way from the command line by typing:

java prog_name

This program is a VERY simple web browser written in the Java language. It uses the three classes that we have already looked at.
It works by prompting the user for a URL. This must be given in its full format or the java.net.URL class will throw an exception. It extracts four main elements: protocol, port, host, and filename. The first check is on the protocol: our program deals only with HTTP and no other protocol.

URL url = new URL(path);
String host = url.getHost();
String filename = url.getFile();
String protocol = url.getProtocol();
if (!protocol.equals("http")) {
    throw new IllegalArgumentException("URL must use 'http:' protocol");
}

Next it checks for the port of the request; if one is not found it uses the default port, 80.

int port = url.getPort();
if (port == -1) {
    // if no port, use the default HTTP port
    port = 80;
}
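The parsing and default-port steps above can be tried in isolation. This is a minimal sketch with hypothetical names (UrlPartsSketch, effectivePort) and a made-up address. It uses getDefaultPort, which returns 80 for "http" URLs, in place of the hard-coded value.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlPartsSketch {

    // Returns the port to connect to: the explicit one if the URL names it,
    // otherwise the default port for the URL's protocol.
    public static int effectivePort(String address) {
        try {
            URL url = new URL(address);
            int port = url.getPort(); // -1 when the URL names no port
            return port == -1 ? url.getDefaultPort() : port;
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) throws MalformedURLException {
        URL url = new URL("http://example.com/docs/index.html"); // hypothetical address
        System.out.println(url.getProtocol());
        System.out.println(url.getHost());
        System.out.println(url.getFile());
        System.out.println(effectivePort(url.toString()));
    }
}
```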

Now that we have checked that the request is as we want it, the next step is to open a connection to the server. We do this by using the java.net.Socket class to create a new socket, passing it the host and port number to be used.

Socket socket = new Socket(host, port);

Once a connection has been established, we send the command to get the data using the desired HTTP request method. The first example uses the HEAD method.

// request data with method (note the spaces around the filename
// and the blank line that ends the request)
to_server.print("HEAD " + filename + " HTTP/1.0\r\n\r\n");
to_server.flush();
// get data from server
Scanner serverOut = new Scanner(socket.getInputStream());

The program then iterates over the returned data outputting its contents to the screen (STDOUT).

while (serverOut.hasNextLine()) {
    System.out.println(serverOut.nextLine());
}

When all the data has been delivered the program then closes the connection.

socket.close();
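The steps above can be put together in one place. The following is a sketch under stated assumptions rather than the HttpClient.java source: the class name HeadRequestSketch and the default host are hypothetical, the Host header is an addition that many servers expect, and try-with-resources is used so that the socket is closed even if an error occurs. The request logic works on plain streams so that it can be exercised without a network.

```java
import java.io.*;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class HeadRequestSketch {

    // Sends a HEAD request over the given streams and returns the response
    // lines up to (not including) the blank line that ends the headers.
    public static List<String> head(InputStream fromServer, OutputStream toServer,
                                    String host, String path) {
        try {
            Writer out = new OutputStreamWriter(toServer, StandardCharsets.US_ASCII);
            out.write("HEAD " + path + " HTTP/1.0\r\n");
            out.write("Host: " + host + "\r\n");
            out.write("\r\n");
            out.flush();

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(fromServer, StandardCharsets.ISO_8859_1));
            List<String> lines = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null && !line.isEmpty()) {
                lines.add(line);
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        String host = args.length > 0 ? args[0] : "example.com"; // hypothetical host
        try (Socket socket = new Socket(host, 80)) {
            for (String line : head(socket.getInputStream(),
                                    socket.getOutputStream(), host, "/")) {
                System.out.println(line);
            }
        } // the socket is closed here automatically
    }
}
```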

The source code has been fully annotated, showing where in the program this all happens.

Firstly we will run it using the HEAD method.

Output using HEAD

A text file of the full output can be found here

Let's take a look at the responses we get.

HTTP status code
This is returned for every URL requested. In this case 200 means "OK": the page was delivered with no problems.
Date
The date and time at which the server generated the response.
Server
This gives information about the server that served the page. In our example it shows that it was a UNIX machine using the Apache software (version 2.2.4). It also lists the extra modules that have been compiled into that server, for example PHP version 5.2.1.
Last-Modified
This is the date stamp of when the file was last modified on the server.
ETag
This stands for Entity Tag. It is an identifier for a particular version of the page. Its primary use is in the caching mechanism.
Accept-Ranges
This shows the unit that the server accepts for Range requests. In HTTP/1.1 the standard unit is bytes.
Content-Length
This shows the size of the body of the requested page. It is given as a decimal number of octets (bytes).
Content-Type
As with Content-Length, this gives information about the body. It gives the media type of the requested page and its character encoding. A common example is text/html.
Connection
Defines how the connection is handled. In our example the value close means that the connection will be closed after the response, not held open.
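To make the structure of these response lines concrete, here is a small sketch that pulls the status code out of the status line and splits the header lines into name and value pairs. The class name HeaderParseSketch and the sample lines are made up for illustration and are not taken from the example output.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class HeaderParseSketch {

    // Extracts the numeric status code from a line such as "HTTP/1.1 200 OK".
    public static int statusCode(String statusLine) {
        String[] parts = statusLine.split(" ", 3);
        return Integer.parseInt(parts[1]);
    }

    // Splits "Name: value" lines into a map. Names are lower-cased because
    // HTTP header names are not case-sensitive.
    public static Map<String, String> headers(String... lines) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                map.put(name, line.substring(colon + 1).trim());
            }
        }
        return map;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the same shape as a real response.
        System.out.println(statusCode("HTTP/1.1 200 OK"));
        Map<String, String> h = headers(
                "Content-Type: text/html; charset=UTF-8",
                "Content-Length: 1270",
                "Connection: close");
        // Content-Length is a decimal count of bytes.
        long length = Long.parseLong(h.get("content-length"));
        System.out.println(length + " bytes of " + h.get("content-type"));
    }
}
```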

Now we will run it again using the GET method and look at the differences in the output.

to_server.print("GET " + filename + " HTTP/1.0\r\n\r\n");
to_server.flush();

Output using GET

A text file of the full output can be found here

As you can see, it returns ALL the data about the page, including the HTML tags! A conventional browser (like Firefox) parses this output and delivers a formatted page. Any anchor tags with an href attribute are displayed as hyperlinks on the page.

Java isn't the only language you can do this in. Here is the code of a similar program written in Perl.
Notice how it performs the same tasks as the Java program, but calls only one external library compared with Java's three classes. It also does so in fewer lines of code. Arguably more efficient, at least in terms of code length!

GetURLInfo.java

View Source Code (V3)

As users we don't want to see all this data; we are interested in the information in the page. By using HTML tags a browser is able to display a variety of media in a friendly fashion. We are going to modify this program to list the hyperlinks (if any) from a URL that we request, like a simple search engine.
To do this we have to make a few changes to the original code.

N.B. The URL used for this example will be
http://java.sun.com/docs/books/tutorial/networking/index.html

We will start with a base program, and add some enhancements until we get our basic search engine.

Step 1

We shall start by jazzing up the user interface. We are all used to graphical user interfaces, so our program should have one too. Java has a rich set of graphical components in the Swing library. We will use the showInputDialog method of JOptionPane (part of the Java API) to create a GUI prompt.

String address = javax.swing.JOptionPane.showInputDialog(
null, "Please enter the FULL URL>");
Java GUI prompt

View source of GetURLInfo #1

Step 2

Now we need to configure the program to access the World Wide Web, including sites outside of the UWE system. To do this we need to set up our program to pass through the UWE firewall and proxy server.
This is done by setting some system properties.

System.getProperties().put( "proxySet", "true" );
System.getProperties().put( "proxyHost", "proxysg.uwe.ac.uk" );

We now have access to the outside web.
View source of GetURLInfo #2
A text file of the full output can be found here
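The snippet in this step uses the property names proxySet and proxyHost. Java's networking documentation lists http.proxyHost and http.proxyPort as the properties for an HTTP proxy, so a variant along the following lines may be needed on some JDKs. This is a sketch only: the class name ProxySketch is hypothetical, and the port number 8080 is an assumption, as the original does not state one.

```java
public class ProxySketch {

    // Points the JDK's HTTP handler at the given proxy.
    public static void useProxy(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", String.valueOf(port));
    }

    public static void main(String[] args) {
        useProxy("proxysg.uwe.ac.uk", 8080); // port 8080 is an assumption
        System.out.println(System.getProperty("http.proxyHost")
                + ":" + System.getProperty("http.proxyPort"));
    }
}
```

These properties affect connections made through URL and URLConnection, not raw Socket connections.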

Step 3

Finally we want to filter the results returned by the request to output only the hyperlinks (the href attributes of the anchor tags).
We can do this by using a powerful feature called regular expressions. To find out more about using regexes with Java go here. A regex is basically a pattern-matching tool. We will give it the pattern of an anchor tag and ask the program to output the characters contained within the href attribute.

// define a regex to search for
String regex = "<a href=\"(.*?)\"(.*?)</a>";
Pattern p = Pattern.compile(regex);
// set up a matcher object
Matcher m;

We can then make the program iterate (loop) through the data, checking each line against our regex, and when it finds a match output the captured characters to the screen.

while (in.hasNextLine())
{
    m = p.matcher(in.nextLine());
    while (m.find())
    {
        System.out.println(" " + m.group(1));
    }
}
Screenshot of GetURLInfo3

We now have our basic search engine.
View source of GetURLInfo #3
A text file of the full output can be found here
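The link-matching step can be tried on its own, without a network connection. The sketch below applies the same regular expression to a string of HTML; the class name LinkExtractSketch and the sample markup are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractSketch {

    // The same pattern as in the step above: group 1 captures the href value.
    private static final Pattern LINK = Pattern.compile("<a href=\"(.*?)\"(.*?)</a>");

    // Returns every href value found in the given HTML text.
    public static List<String> links(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"a.html\">A</a> and "
                + "<a href=\"http://example.com/b\">B</a></p>"; // made-up markup
        for (String link : links(html)) {
            System.out.println(link);
        }
    }
}
```

This pattern is deliberately simple. It misses links whose tag spans more than one line, is written in upper case, uses single quotes, or places another attribute before href, so it should be treated as a demonstration rather than a general HTML parser.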

The Perl program (http.pl) performs a similar job to the last GetURLInfo example. The main differences between the two are a result of the language type. Java is an object-oriented programming language; Perl is a procedural language (although you can do OOP with it). This difference is reflected in the number of lines required for the same program. Java is longer because it has to create objects, which takes up lines of code. There are other subtle differences, including the syntax used for pattern matching.

References
