AI version of Assignment 4
by Nirmal Mukhi

 

Much of what KnownSpace does comes down to comparing pieces of information. For example, cluster agents compare two items of information to see whether they are similar, and filter agents examine an item of information to check whether certain characteristics are present or absent.

In this assignment, you will implement an algorithm to perform part of this functionality. Specifically, you will implement an algorithm which, given information on two web pages, compares them and returns a measure of the similarity of the information on these pages.

Information on a web page is given to you in the form of a file. This file contains some attributes of the page (such as its size, date of modification, author, title, url, and so on) and the page content itself. Here is a sample file.

Write a class (or a set of classes, if you like) that examines two such files and returns a number between 0 and 1 indicating how similar the two pages are. Return 0 if the pages are completely dissimilar and 1 if they are very similar or identical. Use any algorithm(s) you wish.

Some possibilities are:

* Using neural networks.
* Doing some kind of language processing of the page content using a dictionary or thesaurus.
* Comparing each attribute separately, then computing a measure of similarity based on these comparisons.
* Searching the web for algorithms which address this (there will be many) and using or implementing them. The Assignment Gallery might give you some useful links or ideas.
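
As an illustration of the third possibility (comparing each attribute separately), here is a minimal sketch in Python. It is not a prescribed solution. The attribute names (`title`, `author`, `content`), the weights, and the use of `difflib` for string comparison are all assumptions made for the example; the real sample file format may name and structure its attributes differently, and parsing that file is left out.

```python
from difflib import SequenceMatcher

# Hypothetical attribute names and weights, chosen only for illustration.
WEIGHTS = {"title": 0.3, "author": 0.2, "content": 0.5}


def text_similarity(a, b):
    """Return difflib's similarity ratio in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def page_similarity(page_a, page_b, weights=WEIGHTS):
    """Weighted average of per-attribute similarities, in [0, 1].

    Each page is a dict mapping attribute name to a string value.
    """
    score = 0.0
    total = 0.0
    for name, weight in weights.items():
        a, b = page_a.get(name), page_b.get(name)
        if a is None and b is None:
            continue  # attribute absent from both pages: no evidence either way
        # An attribute present on only one page is compared against "".
        score += weight * text_similarity(a or "", b or "")
        total += weight
    return score / total if total else 0.0
```

Calling `page_similarity(page, page)` on any page with at least one of these attributes gives 1, and two pages whose attributes share no characters give 0. Choosing the weights is a design decision: giving the content more weight than the title or author is one plausible starting point, not a requirement.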

To test your program, pick some pages off the web that you know to be similar in content, create sample files in the given format, and run your program on those files. You should get a similarity value fairly close to 1. Also check that pages which are really dissimilar are detected as such by your program. Try checking the similarity of a page against itself (needless to say, the result should be at or very near 1). Document your tests and include all your source code.
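
The checks described above could be written down roughly as follows. This is only a sketch: the similarity function is a stand-in (cosine similarity over word counts of the page text), the sample sentences replace real page files, and the 0.8 and 0.2 thresholds are arbitrary assumptions you would tune for your own algorithm.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase alphanumeric words in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two word-count vectors, in [0, 1]."""
    a, b = word_counts(text_a), word_counts(text_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values()) * sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0  # empty text: treat as dissimilar


def run_checks(similarity):
    """Run the three sanity checks on stand-in page contents."""
    page = "the cat sat on the mat"
    near_copy = "the cat sat on the red mat"
    unrelated = "stock prices fell sharply today"

    assert similarity(page, page) > 0.99       # a page against itself
    assert similarity(page, near_copy) > 0.8   # pages known to be similar
    assert similarity(page, unrelated) < 0.2   # pages known to be dissimilar
    return True
```

Running `run_checks(cosine_similarity)` raises an `AssertionError` if any check fails. For the documented tests, the same three checks would be run on real sample files, with the actual similarity values recorded alongside them.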

This is a fun and challenging assignment and your code will likely be used in the KnownSpace implementation if it works well, so give it a good shot.