The part of the page below the line is the required page information format. If you feel the set of attributes given is not sufficient to detect similarity, feel free to add your own attributes and explain why you did so.

The information here is for the page titled "Personalized Information Maps", accessible through the main KS page. Each of the attributes will make more sense to you when you look at the HTML source for the page and the directory the page is in.

Each attribute of the page is listed as the attribute name followed by the attribute value. For a multi-valued attribute, each value is in quotes and the values are separated by commas.
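If you want to read the attribute lines programmatically, the following is a minimal sketch, not a required implementation. It assumes each attribute sits on one line in the form `Name: value` and that quoted values contain no embedded double quotes. The function name `parse_attribute` is made up for illustration.

```python
import re

def parse_attribute(line):
    """Split one 'Name: value' line into (name, list_of_values)."""
    # Split on the first colon only, so a URL in the value keeps its colons.
    name, _, raw = line.partition(":")
    raw = raw.strip()
    # Quoted values: take the text between each pair of double quotes.
    quoted = re.findall(r'"([^"]*)"', raw)
    if quoted:
        return name.strip(), quoted
    # Unquoted value: keep it whole, or return no values if it is empty.
    return name.strip(), ([raw] if raw else [])

print(parse_attribute('Images: 5'))
print(parse_attribute('Citations: "Self-Organizing Maps", "Wired"'))
```

Keying on the quotes rather than on the separator means the sketch does not care whether values are separated by commas or by spaces.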

Most of the attributes are self-explanatory. The Images attribute gives the number of images on the page. Citations lists the text that appears inside cite HTML tags. FancyTags lists HTML tags that don't appear on most pages, such as tables, JavaScript, audio/video, and so on. FreqWords gives a list of important words that appear on the site. The Size attribute gives the size of the HTML file alone; SizeOfImages gives the combined size of all the images used on this page.
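As one illustration of how a multi-valued attribute such as FreqWords might feed a similarity check, here is a hedged sketch using Jaccard overlap (shared words divided by total distinct words). The assignment does not prescribe this measure; it is just one simple option. The second word list below is invented for the example, and `jaccard` is a hypothetical helper name.

```python
def jaccard(words_a, words_b):
    """Return the Jaccard overlap of two word lists, from 0.0 to 1.0."""
    # Compare case-insensitively and ignore repeated words.
    a = {w.lower() for w in words_a}
    b = {w.lower() for w in words_b}
    if not a and not b:
        return 0.0  # Two empty lists: treat as no evidence of similarity.
    return len(a & b) / len(a | b)

# FreqWords of the page described in this handout.
this_page = ["personalized", "information", "maps", "processing", "human",
             "index", "pages", "book", "web", "data", "locality"]
# Made-up FreqWords for some other page, for illustration only.
other_page = ["information", "maps", "web", "search", "engine"]

print(jaccard(this_page, other_page))
```

A score near 1 means the two pages share most of their frequent words; a score near 0 means they share few. Other attributes (Links, Citations, Author) could be compared the same way and the scores combined.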


Title: "Personalized Information Maps"

Links: "Library of Congress", "AltaVista", "140 million pages", "But First, a Fable", "The Uses of Locality", "Hypothetical Interactions", "Mapping Information", "First Steps", "The Future"

Url: http://www.cs.indiana.edu/~rawlins/website/overview/overview.html

Images: 5

Citations: "Self-Organizing Maps", "The New York Times,", "Wired"

FancyTags: "table"

Comments: "navigation========================================================"

FreqWords: "personalized", "information", "maps", "processing", "human", "index", "pages", "book", "web", "data", "locality"

Author: rawlins

ModificationDate: 09/20

Size: 4422

SizeOfImages: 834

Content: 
< html>

< head>
        < title>
                Personalized Information Maps
        < /title>

        < link rel=stylesheet type="text/css" href="../stylesheets/standard-page.css">
< /head>

< body>

< table width="100%" bgcolor="#ccccff" border=0 cellspacing=0 cellpadding=0>
        < tr>
        < td align=left valign=bottom width=32>
        < img src="../images/overview.gif" width=32 height=32 alt="">
        < /td>

        < td align=right>
        < font face="Arial,Helvetica" size="+2">
        personalized information maps
        < /font>
        < /td>
        < /tr>
< /table>
< br>
< br>

< div class=content>

< blockquote class=quote>
        The purpose of intelligent information processing in general seems to be
        creation of simplified images of the observable world at various levels of
        abstraction, in relation to a particular subset of received data.
        < br>
        < div class=author>
                Teuvo Kohonen, < cite>Self-Organizing Maps< /cite>
        < /div>
< /blockquote>

< p>
Trying to find information on the web is like trying to find
something at a huge jumble sale: it's fun, and you can make serendipitous
discoveries, but for directed search it's better to go to a department
store; there, someone has already done much of the arranging for you.
Unfortunately, the web's growth, diversity, and volatility
make human indexing impossible.

< p>
The
< a href="gopher://marvel.loc.gov:70/00/loc/facil/25.faqs">
Library of Congress< /a>,
one of the world's most comprehensive collections of
human knowledge, holds 112 million items
(17 million books, 95 million maps,
manuscripts, photos, films, tapes, paintings, prints, drawings,
and other items)
stretching over 532 miles of shelves. As of May 1998, however,
the
< a href="http://altavista.digital.com/av/content/about_our_strengths.htm">
AltaVista< /a>
search engine indexed over
< a href="http://searchenginewatch.com/reports/sizes.html">
140 million pages< /a>,
which at that time was probably only around a third of the entire web
(< cite>The New York Times,< /cite>
April 9, 1998, estimated the total then as 320 million pages).

< p>
Further, the Library
of Congress collection is only growing by 7,000 items every working day;
the web is growing by better than 1,000 pages a minute
(< cite>Wired< /cite>, July 1998, page 59).
The number of pages should cross a billion well before
January 1< sup>< small>st< /small>< /sup>, 2000.
And many of those pages are constantly changing---and constantly moving.

< p>
The information overload problem isn't restricted to the web---the
desktop itself is rapidly approaching the breaking point as well.
As of September 15< sup>< small>th< /small>< /sup> 1998,
a 12 gigabyte disk drive costs $300.
Since a 500-page textbook takes up about 1/2 megabyte (compressed),
$1,250 buys storage for the text of about 100,000 books.
Next year it will buy space for at least 200,000 books.

< p>
Further, most text pages average only between 3 and 4 kilobytes,
so $300 buys storage for 3 million text pages.
Today's operating systems, however, were designed in an era
when managing a few hundred pages was all that was required.
So today's users are given the hardware to store millions of pages
and the software to manage only a few hundred.

< p>
As the web grows it is becoming easier and easier to become lost in it.
To avoid that it seems necessary that the web first be mapped,
which is probably impossible. Even if it could be done,
any such map would not be of great use to all users since it would have to be
very general. And of course, it would always be out of date.

< p>
It is possible, however, to map relevant portions of the web incrementally
by starting with a nucleus of pages that a particular user
has already demonstrated interest in, then branching out from there.
Each new page can then be placed relative to the other pages
in a two- or three-dimensional space of pages,
thereby aiding search, organization, and recall.
The same thing can be done for the desktop itself.
In both cases, the key is to focus on the interests of each single user,
and to map pages to a navigable space.

< p>
Within this website, "pages" refers to webpages, documents, executables,
images, audio or video clips, symbolic links, or directories---any
digital data at all.

< br>
< br>
< table align=center>
        < colgroup>
                < col align=left>
        < /colgroup>

        < tr>
        < td>
        < a href="fable.html">
        But First, a Fable< /a>
        < /td>
        < /tr>

        < tr>
        < td>
        < a href="locality.html">
        The Uses of Locality< /a>
        < /td>
        < /tr>

        < tr>
        < td>
        < a href="interaction.html">
        Hypothetical Interactions< /a>
        < /td>
        < /tr>

        < tr>
        < td>
        < a href="mapping.html">
        Mapping Information< /a>
        < /td>
        < /tr>

        < tr>
        < td>
        < a href="steps.html">
        First Steps< /a>
        < /td>
        < /tr>

        < tr>
        < td>
        < a href="future.html">
        The Future< /a>
        < /td>
        < /tr>
< /table>

< /div>

< !--navigation========================================================-->

< table cols=4 width="100%">
< tr>
< td align=left>
< img src="../images/dot_clear.gif" width=30 height=30 border=0 alt="">

< td align=center>
< a href="../site-map.html">
< img src="../images/sitemap.gif" width=30 height=30 border=0 alt="| to sitemap |">< /a>

< td align=center>
< img src="../images/dot_clear.gif" width=30 height=30 border=0 alt="">

< td align=right>
< a href="fable.html">
< img src="../images/next.gif" width=30 height=30 border=0 alt="| next">< /a>
< /table>

< /body>

< /html>