HTML Document Processor

Due Date:
To Be Determined
Deliverables:
When the assignment is complete, send me an e-mail message telling me the path to the project directory. The project directory is to contain an RCS subdirectory, and nothing else. In addition, the ~/man directory tree is to contain a man page for the executable program. You may also leave a README file in the project directory if you think that one is needed, but I neither expect nor require you to do so.
Requirements:
In addition to the information given here, be sure to consult the Grading Form for this assignment, which includes additional information about the requirements for this project.

Project Overview

Complete the exercise as the sequence of steps listed below. The code for each step is to have its own RCS major version number. Because you will have to add modules in some of the later steps, there should be a different version of the project Makefile for each step, too.

Project Steps

  1. Write a function, parseURL(), which takes a pointer to a C string as an argument and returns the following struct:
              struct parsedURL {
                char*    protocol;
                char*    hostname;
                char*    portNumber;
                char*    pathName;
                };
    
    When the struct is returned, the protocol field will normally point to the string "http" or "file". However, it will be a NULL pointer if the string begins with a different protocol name or if there is an error in the string that makes it impossible to parse. If the protocol is http, the hostname and portNumber fields must be defined, although the portNumber field will be NULL if the URL does not specify a port number. If the protocol field is a non-NULL pointer, pathName must be a non-NULL pointer too. If the pathname part of the URL string is missing, make the pathName member point to the string "/".

    Write a main program that accepts a URL string as a command line argument. (If no command line argument is given, the program simply terminates.) Pass the command line argument to parseURL() and use ddd to verify that the string is parsed correctly. Try different URL strings to test your program fully. Have the program exit when parseURL() returns.

    At this point, your project directory should have the following files:

    Makefile
        Typing make with no arguments should build the executable program.
    parseURL.h
        This file should contain the declaration for the parsedURL data structure and the function prototype for parseURL().
    parseURL.cc
        Source code for the parseURL() function definition. (Since you are using the g++ compiler, you might as well use the .cc extension instead of .c.)
    main.cc
        Source code for your main() function definition.

    You may use different file names if you wish, but you should have the equivalent of these four files. When the program works, check these four files into RCS as version 1.1.
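
    The parsing rules in step 1 can be sketched roughly as follows. This is one possible reading of the requirements, not a reference solution. The helper copyRange() is hypothetical, the "://" separator check and the heap-allocated result strings are assumptions, and the port string is copied without checking that it is numeric.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

struct parsedURL {
  char* protocol;
  char* hostname;
  char* portNumber;
  char* pathName;
};

// Hypothetical helper: returns a heap copy of the n characters starting at s.
static char* copyRange(const char* s, size_t n) {
  char* out = static_cast<char*>(std::malloc(n + 1));
  assert(out != nullptr);
  std::memcpy(out, s, n);
  out[n] = '\0';
  return out;
}

parsedURL parseURL(const char* url) {
  parsedURL result = { nullptr, nullptr, nullptr, nullptr };
  if (url == nullptr) return result;

  // Assumption: the protocol name is everything before the first "://".
  const char* sep = std::strstr(url, "://");
  if (sep == nullptr) return result;
  size_t schemeLen = static_cast<size_t>(sep - url);
  bool isHttp = schemeLen == 4 && std::strncmp(url, "http", 4) == 0;
  bool isFile = schemeLen == 4 && std::strncmp(url, "file", 4) == 0;
  if (!isHttp && !isFile) return result;  // other protocol: protocol stays NULL

  // The host[:port] part runs up to the first '/' or the end of the string.
  const char* rest = sep + 3;
  const char* slash = std::strchr(rest, '/');
  const char* authEnd = slash ? slash : rest + std::strlen(rest);
  const char* colon = static_cast<const char*>(
      std::memchr(rest, ':', static_cast<size_t>(authEnd - rest)));
  const char* hostEnd = colon ? colon : authEnd;

  // Assumption: an http URL with an empty host name is a parse error.
  if (isHttp && hostEnd == rest) return result;

  if (hostEnd != rest)
    result.hostname = copyRange(rest, static_cast<size_t>(hostEnd - rest));
  if (colon != nullptr && colon + 1 < authEnd)
    result.portNumber = copyRange(colon + 1, static_cast<size_t>(authEnd - colon - 1));

  result.protocol = copyRange(url, schemeLen);
  // A missing path defaults to "/", as the step requires.
  result.pathName = slash ? copyRange(slash, std::strlen(slash)) : copyRange("/", 1);
  return result;
}
```

    A main() built on this sketch would pass argv[1] to parseURL() and let you inspect the four fields in ddd. Because every non-NULL field here is allocated with malloc(), the caller would be responsible for freeing them.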
  2. Create a skeleton implementation of the entire browser project. Include the following modules:

    Here is a data structure you might use to hold the information about a URL at the various stages of its processing:

    struct urlInfo_t {
    
      // The original URL string.
      const char*   urlString;
    
      // Parsed Substrings of the URL string.
      char*         protocolName;
      char*         hostName;
      char*         portNumber;
      char*         pathName;
    
      // Raw Document Information.
      caddr_t       rawStart;
      size_t        rawLength;
      freeDoc_f*    freeDoc;
    
      // Parsed Document Information.
      paraInfo_t   *firstPara;
    
      };
    
    When you have this program working, check in all files, including the Makefile, as version 2.1 of your project.
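
    The urlInfo_t structure above refers to two types, freeDoc_f and paraInfo_t, that are not defined in this handout. The sketch below shows one way they might be declared so that the structure compiles. Everything other than urlInfo_t itself is an assumption made for illustration: the layout of paraInfo_t, the signature of freeDoc_f, and the helpers makeUrlInfo(), releaseDocument() and freeMallocDoc().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <sys/types.h>  // caddr_t on most Unix systems

struct urlInfo_t;  // forward declaration so freeDoc_f can mention it

// Assumed signature: a function that releases the raw document of one URL.
typedef void freeDoc_f(urlInfo_t* info);

// Assumed layout: one node per paragraph, chained into a singly linked list.
struct paraInfo_t {
  const char* text;
  size_t      length;
  paraInfo_t* next;
};

struct urlInfo_t {
  const char* urlString;     // The original URL string.
  char*       protocolName;  // Parsed substrings of the URL string.
  char*       hostName;
  char*       portNumber;
  char*       pathName;
  caddr_t     rawStart;      // Raw document information.
  size_t      rawLength;
  freeDoc_f*  freeDoc;
  paraInfo_t* firstPara;     // Parsed document information.
};

// Hypothetical helper: an urlInfo_t with only the URL string filled in.
urlInfo_t makeUrlInfo(const char* urlString) {
  urlInfo_t info = { urlString, nullptr, nullptr, nullptr, nullptr,
                     nullptr, 0, nullptr, nullptr };
  return info;
}

// Hypothetical helper: release the raw document through whatever function
// the protocol handler stored, then reset the raw document fields.
void releaseDocument(urlInfo_t* info) {
  assert(info != nullptr);
  if (info->freeDoc != nullptr) info->freeDoc(info);
  info->rawStart = nullptr;
  info->rawLength = 0;
  info->freeDoc = nullptr;
}

// Example freeDoc_f for a handler that obtained the document with malloc().
void freeMallocDoc(urlInfo_t* info) {
  std::free(info->rawStart);
}
```

    Storing a function pointer next to the raw document lets each protocol handler decide how its memory is released, so the rest of the program does not need to know how the document was obtained.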
  3. Complete a working Curses user interface for the project. The user is to be able to enter the following command characters, with the indicated effects:

    If the user presses any other key, the program should beep. This is version 3.1 of the project.
  4. Add the 'g' command to your Curses user interface that prompts for a URL string and processes it. This is version 4.1 of the project.
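
    The key handling in steps 3 and 4 can be kept separate from Curses itself, which makes it easier to test. The sketch below is only an illustration: the Action enum, classifyKey() and countBeeps() are hypothetical names, and only the 'g' command from step 4 is handled, since the step 3 command characters would need their own cases.

```cpp
#include <cassert>

// Hypothetical set of actions the interface can take for one key press.
enum Action { ACTION_GOTO_URL, ACTION_BEEP };

// Map one key code to an action. Unrecognized keys map to ACTION_BEEP.
Action classifyKey(int key) {
  switch (key) {
    case 'g':
      return ACTION_GOTO_URL;  // prompt for a URL string and process it
    // The command characters from step 3 would get their own cases here.
    default:
      return ACTION_BEEP;
  }
}

// Hypothetical test driver: feed a string of key presses through
// classifyKey() and count how many of them would make the program beep.
int countBeeps(const char* keys) {
  assert(keys != nullptr);
  int beeps = 0;
  for (const char* p = keys; *p != '\0'; ++p) {
    if (classifyKey(static_cast<unsigned char>(*p)) == ACTION_BEEP) ++beeps;
  }
  return beeps;
}
```

    In the real program the key would come from the Curses getch() function, and ACTION_BEEP would be carried out by calling beep().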
  5. Implement the http protocol handler so that documents can be retrieved from World Wide Web servers. This is version 5.1 of the project.
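
    Before the handler in step 5 can read a document, it has to send a request to the server. The sketch below builds such a request from the fields produced by parseURL(). It assumes an HTTP/1.0 GET request with a Host header, which this handout does not specify, and the function names buildGetRequest() and effectivePort() are hypothetical. Opening the socket and reading the reply are not shown.

```cpp
#include <cassert>
#include <string>

// Build the text of a GET request for the given host, port and path.
// portNumber may be NULL, matching the parseURL() contract in step 1.
std::string buildGetRequest(const char* hostname,
                            const char* portNumber,
                            const char* pathName) {
  assert(hostname != nullptr && pathName != nullptr);
  std::string request = "GET ";
  request += pathName;
  request += " HTTP/1.0\r\n";
  request += "Host: ";
  request += hostname;
  if (portNumber != nullptr) {
    request += ":";
    request += portNumber;
  }
  request += "\r\n\r\n";  // a blank line ends the request headers
  return request;
}

// The port to connect to: the one from the URL, or 80, the default for HTTP.
const char* effectivePort(const char* portNumber) {
  return portNumber != nullptr ? portNumber : "80";
}
```

    For the URL http://www.example.com/ this produces the request line "GET / HTTP/1.0", a Host header naming www.example.com, and the terminating blank line, and the handler would connect to port 80.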
  6. Parse HTML documents into paragraphs. Remove all HTML tags, and use <P> and <Hn> pairs to divide the document into paragraphs. Modify your Curses renderer to display paragraphs properly. This is version 6.1 of the project.
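
    One way to approach step 6 is a single pass over the document that drops every tag and starts a new paragraph whenever a <P> or <Hn> tag, opening or closing, is seen. The sketch below takes that approach under some simplifying assumptions: it returns a std::vector of strings rather than a paraInfo_t list, it does not decode character entities or collapse whitespace inside a paragraph, it does not treat HTML comments specially, and it discards everything after an unterminated '<'. The function names are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// True for the tag names that separate paragraphs: p and h1 through h6.
static bool isParagraphTag(const std::string& name) {
  if (name == "p") return true;
  return name.size() == 2 && name[0] == 'h' && name[1] >= '1' && name[1] <= '6';
}

// Trim the accumulated text, store it if anything is left, and start over.
static void flushParagraph(std::string& current, std::vector<std::string>& out) {
  size_t first = current.find_first_not_of(" \t\r\n");
  if (first != std::string::npos) {
    size_t last = current.find_last_not_of(" \t\r\n");
    out.push_back(current.substr(first, last - first + 1));
  }
  current.clear();
}

std::vector<std::string> splitParagraphs(const std::string& html) {
  std::vector<std::string> paragraphs;
  std::string current;
  size_t i = 0;
  while (i < html.size()) {
    if (html[i] != '<') {        // ordinary text: keep it
      current += html[i++];
      continue;
    }
    size_t close = html.find('>', i);
    if (close == std::string::npos) break;  // unterminated tag: drop the rest

    // Extract the lower-cased tag name, skipping the '/' of a closing tag.
    size_t j = i + 1;
    if (j < close && html[j] == '/') ++j;
    std::string name;
    while (j < close && std::isalnum(static_cast<unsigned char>(html[j]))) {
      name += static_cast<char>(std::tolower(static_cast<unsigned char>(html[j])));
      ++j;
    }
    if (isParagraphTag(name)) flushParagraph(current, paragraphs);
    i = close + 1;               // the tag itself is always removed
  }
  flushParagraph(current, paragraphs);
  return paragraphs;
}
```

    Given "<H1>Title</H1><P>First <B>bold</B> text.</P>", this returns the two paragraphs "Title" and "First bold text.". The Curses renderer would then wrap and display each paragraph separately.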
  7. ...


Christopher Vickery
Queens College of CUNY