1 Abstract
In the domain of document visualization, efforts have been made to create efficient text summarization techniques. Within this framework, simple presentation methods (e.g. giving an overview of a document by displaying a few words that represent its subject(s)) are particularly useful. Such methods can rely on topic extraction algorithms, which are essentially statistical and aim at extracting these representative words.
Document structure, which is sometimes given by meta-data in the document (e.g. formatting), can be a hint to its semantic content. Automatic extraction of such structure is often called Text Tiling: semantically homogeneous blocks are identified using statistical methods.
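As an illustration of the idea, the following is a minimal sketch of similarity-based segmentation. It is a simplification and not the algorithm implemented in this thesis: the function names (`bag`, `cosine`, `segment`), the window size and the fixed threshold are assumptions made for the example, whereas published Text Tiling variants typically score the depth of similarity valleys rather than apply a fixed cut-off.

```python
import math
import re
from collections import Counter


def bag(sentences):
    """Build a bag-of-words vector (word -> count) for a list of sentences."""
    words = []
    for sentence in sentences:
        words.extend(re.findall(r"[a-z]+", sentence.lower()))
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def segment(sentences, window=2, threshold=0.1):
    """Split the text at gaps where the windows on each side share little vocabulary.

    `window` and `threshold` are illustrative values, not tuned parameters.
    """
    blocks, current = [], []
    for i, sentence in enumerate(sentences):
        current.append(sentence)
        gap = i + 1
        if gap >= len(sentences):
            break
        left = bag(sentences[max(0, gap - window):gap])
        right = bag(sentences[gap:gap + window])
        if cosine(left, right) < threshold:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


text = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "Stock markets fell sharply.",
    "Investors sold stock as markets fell.",
]
blocks = segment(text)
```

On this toy input the only gap with no shared vocabulary is the one between the second and third sentences, so the text is split into two blocks of two sentences each.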
Performing topic extraction on approximately homogeneous blocks, rather than on the whole document, can ease the computation and spare resources for more complex processing.
The original scope of this project was to study the combination of document segmentation and topic extraction in order to produce a more understandable content visualization. The main challenge was to create a good topic extraction algorithm that would use the document structure. Because of the short duration imposed on the project, the combination of the two techniques for visualization purposes was not carried out. The project goals were therefore reduced to the implementation of a segmentation algorithm and the development of a novel topic extraction algorithm working on small texts.
More precisely, the major goal of this master thesis is to use an available semantic database (the EDR Electronic Dictionary) to construct a semantically aware algorithm for topic extraction. The semantic database contains hyponym and hypernym links between words; these relations are used to generate a synthetic representation of documents by aggregating representative keywords into concepts that preserve the documents' topics.
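The aggregation of keywords into concepts can be pictured with a minimal sketch. It is not the algorithm developed in this thesis: the `HYPERNYM` table is a hypothetical toy hierarchy (not data from the EDR Electronic Dictionary), and the selection rule (`min_cover`, the most specific concept shared by at least that many keywords) is an assumption made for the example.

```python
from collections import Counter

# Hypothetical child -> parent (hypernym) links; NOT taken from the EDR dictionary.
HYPERNYM = {
    "poodle": "dog",
    "beagle": "dog",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}


def ancestors(word):
    """Return the chain [word, parent, grandparent, ...], most specific first."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def aggregate(keywords, min_cover=2):
    """Map each keyword to the most specific concept covering >= min_cover keywords.

    A keyword that shares no concept with any other keyword is kept unchanged.
    """
    cover = Counter()
    for keyword in set(keywords):
        for concept in ancestors(keyword):
            cover[concept] += 1
    result = {}
    for keyword in keywords:
        candidates = [c for c in ancestors(keyword) if cover[c] >= min_cover]
        # The chain is ordered from specific to general, so take the first match.
        result[keyword] = candidates[0] if candidates else keyword
    return result


concepts = aggregate(["poodle", "beagle", "cat", "car"])
```

Here "poodle" and "beagle" are merged into "dog", "cat" is lifted to "animal" (which it shares with the two dogs), and "car" stays as it is because no other keyword shares one of its hypernyms.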
Combining this new topic extraction method with other existing methods (text tiling, visualization) will then lead to an efficient document visualization method that presents the topics found in the different text segments together with a representation of the document structure itself.



