‘What is this document about?’ is a common
question when navigating large document databases.
Overviews of document content have been an active
area of research in information visualization for
many years. Most reported works do not make use
of human-annotated linguistic structure in the visualization,
providing detail on topic content without
a consistent view that can be compared across documents.
We provide a visualization of document content
based on the human-annotated IS-A noun hierarchy
of WordNet
and embedded in the multi-view visualization system WordNet Explorer. DocuBurst uses the IS-A relation in WordNet to cluster related terms and to propagate counts to more general concepts. For example, if the relation "robin IS-A bird" occurs in WordNet, then every occurrence counted for "robin" is also counted for "bird". In this way, more specific terms contribute to the visual significance of general themes.
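The count propagation described above can be illustrated with a minimal sketch. It assumes a toy hierarchy in which each term has a single parent (WordNet itself allows several), and the names `propagate_counts`, `hypernym_of`, and `word_counts` are hypothetical, not part of DocuBurst or the Java WordNet Library.

```python
from collections import defaultdict


def propagate_counts(hypernym_of, word_counts):
    """Credit each term's count to itself and to every ancestor.

    hypernym_of maps a term to its single IS-A parent (assumption:
    one parent per term, no cycles). word_counts maps a term to the
    number of times it occurs in the document.
    """
    totals = defaultdict(int)
    for term, count in word_counts.items():
        node = term
        while node is not None:
            totals[node] += count
            node = hypernym_of.get(node)  # None once the root is passed
    return dict(totals)


# Toy data: "robin IS-A bird", "sparrow IS-A bird", "bird IS-A animal".
hypernym_of = {"robin": "bird", "sparrow": "bird", "bird": "animal"}
word_counts = {"robin": 3, "sparrow": 2, "bird": 1}
totals = propagate_counts(hypernym_of, word_counts)
```

With this toy input, "bird" accumulates its own occurrence plus those of "robin" and "sparrow", and "animal" inherits the same total even though it never appears in the text.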
The combined structure of WordNet hyponymy and
document lexical content is visualized in WordNet
Explorer as the DocuBurst visualization, which uses
a radial space filling layout technique. The root node
is shown as a circle. Every other node is assigned a
sector of an annulus whose angular width is a portion
of its parent node's width. Angular width can
be either (a) proportional to the number of leaves
in the subtree rooted at that node (leaf count) or
(b) proportional to the number of word occurrences
counted for synsets in the subtree rooted at that node
(word count).
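As a rough sketch of option (a), the following code divides each parent's angular extent among its children in proportion to their leaf counts. It assumes that the children share the parent's full extent and that angles are measured in degrees; `leaf_count`, `assign_angles`, and the toy tree are hypothetical names for illustration, not the prefuse layout code. Passing a word-count function as `weight` would give option (b).

```python
def leaf_count(tree, node):
    """Number of leaves in the subtree rooted at node."""
    children = tree.get(node, [])
    if not children:
        return 1
    return sum(leaf_count(tree, child) for child in children)


def assign_angles(tree, node, weight, start=0.0, extent=360.0, out=None):
    """Map every node to (start_angle, angular_width) in degrees.

    Each child receives a share of its parent's extent proportional
    to weight(child). Assumption: children fill the whole parent extent.
    """
    if out is None:
        out = {}
    out[node] = (start, extent)
    children = tree.get(node, [])
    total = sum(weight(child) for child in children)
    angle = start
    for child in children:
        share = extent * weight(child) / total if total else 0.0
        assign_angles(tree, child, weight, angle, share, out)
        angle += share
    return out


# Toy hierarchy rooted at "idea"; keys map a node to its children.
tree = {"idea": ["plan", "concept"], "plan": ["plot", "scheme", "program"]}
angles = assign_angles(tree, "idea", lambda n: leaf_count(tree, n))
```

Here "plan" has three leaves and "concept" has one, so "plan" receives three quarters of the circle and each of its children receives an equal third of that sector.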
The width of each annulus is maximized to
allow for all visible graph elements to fit within the
display space (on initial display with neutral zoom
factor).
Document content is visualized through the transparency
of the fill colour of the nodes: strongly coloured nodes
have many occurrences, and almost transparent nodes have
few. A gray hue distinguishes nodes with zero occurrence counts.
Words and senses that are more
prominent in the document of interest stand out easily
against a more transparent context.
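One way to picture this encoding is a function from occurrence count to an RGBA fill. The sketch below assumes a linear scale, a minimum opacity of 0.1, and placeholder colour values; none of these are stated in the text, and `node_fill` is a hypothetical name.

```python
def node_fill(count, max_count, base_rgb=(0, 128, 0), min_alpha=0.1):
    """Return an (r, g, b, alpha) fill for a node.

    Assumptions: opacity grows linearly with the occurrence count,
    zero-count nodes are drawn in opaque gray, and base_rgb and
    min_alpha are illustrative values only.
    """
    if count == 0 or max_count <= 0:
        return (128, 128, 128, 1.0)  # gray marks zero occurrences
    alpha = min_alpha + (1.0 - min_alpha) * count / max_count
    return (*base_rgb, alpha)


fills = {c: node_fill(c, 10) for c in (0, 1, 5, 10)}
```

A nonlinear scale (for example logarithmic) may suit documents with a few very frequent terms; the linear form is chosen here only for simplicity.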
To use DocuBurst, a user loads a document of interest into the visualization and chooses a WordNet node at which to root it. Below, we see that "idea" was chosen as the root, and the occurrences of concepts that fall under "idea" appear in the visualization. The gold coloured nodes indicate search results: nodes whose labels begin with 'pl'.

The visualization can be used to drill down into the loaded document. A paragraph browser with a fish-eye lens, at the right of the DocuBurst display, shows which paragraphs contain the selected node, ordered from the beginning of the document (top) to the end (bottom). Clicking any paragraph number shows the full text of that paragraph in the details window, with occurrences of the selected node highlighted in the text.
Ongoing work includes extending DocuBurst to multi-document comparison, developing an ambient display version of DocuBurst to accept RSS feeds, and planning a study of the effectiveness of DocuBurst for information retrieval.
This work was created with the excellent prefuse information visualization toolkit and the Java WordNet Library.