6V8 - Production of my Mind


Semantic Topic Extraction and Segmentation for Efficient Document Visualization

 
I’ve finally finished my report.
Final Report | 21 February 2004, by Mortimer

1 Abstract

In document visualization, considerable effort has gone into creating efficient text summarization techniques. Within this framework, simpler presentation methods (e.g. giving an overview of a document by displaying a few words that represent its subject(s)) are particularly useful and can rely on topic extraction algorithms, which are essentially statistical and aim at extracting these representative words.

Document structure, which is sometimes given by metadata in the document (e.g. formatting), can hint at its semantic content. Automatic extraction of such structure is often called Text Tiling: semantically homogeneous blocks are identified using statistical methods.
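
To make the idea of statistically identified homogeneous blocks concrete, here is a minimal sketch. It is not the segmentation algorithm implemented in the report: it simply assumes that a block boundary falls wherever two adjacent sentences share too little vocabulary, measured by cosine similarity on word counts. The function names and the `threshold` value are hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(sentences, threshold=0.1):
    """Group sentences into blocks, starting a new block whenever two
    adjacent sentences have a lexical similarity below `threshold`.

    Assumption: `threshold` is an arbitrary illustrative value.
    """
    if not sentences:
        return []
    bags = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    blocks = [[sentences[0]]]
    for prev, cur, sentence in zip(bags, bags[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            blocks.append([sentence])    # vocabulary shift: open a new block
        else:
            blocks[-1].append(sentence)  # same topic: extend the current block
    return blocks


text = [
    "the cat chased the mouse",
    "the mouse escaped the cat",
    "stock prices fell sharply",
    "prices of stock rose later",
]
blocks = segment(text)
```

With this toy input, the first two sentences end up in one block and the last two in another, because the second and third sentences have no word in common. A real implementation would typically compare wider windows of text and smooth the similarity curve before placing boundaries.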

Performing topic extraction on approximately homogeneous blocks, instead of on the whole document, can ease the computation and spare resources for more complex processing.

The original scope of this project was to study the combination of document segmentation with topic extraction to produce a more understandable content visualization. The main challenge was to create a good topic extraction algorithm that would use the document structure. Because of the short duration imposed on the project, the two techniques have not been combined into a visualization. The project goals were therefore reduced to the implementation of a segmentation algorithm and the development of a novel topic extraction algorithm working on small texts.

More precisely, the major goal of this master's thesis is to use an available semantic database (the EDR Electronic Dictionary) to construct a semantically aware algorithm for topic extraction. The semantic database contains hyponym and hypernym links between words; these relations are used to generate a synthetic representation of documents by aggregating representative keywords into concepts that preserve the documents' topics.
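
As an illustration of aggregating keywords into a concept through hypernym links, the sketch below finds the most specific concept shared by a set of keywords. It is only a toy under stated assumptions, not the algorithm developed in the report: the `HYPERNYM` table is a hypothetical hand-made hierarchy standing in for the EDR Electronic Dictionary, and each word is assumed to have a single direct hypernym.

```python
# Hypothetical toy hierarchy: word -> its direct hypernym.
# This is NOT EDR data; a single parent per word is assumed for simplicity.
HYPERNYM = {
    "poodle": "dog",
    "beagle": "dog",
    "dog": "animal",
    "cat": "animal",
    "animal": "entity",
    "car": "vehicle",
    "vehicle": "entity",
}


def ancestors(word):
    """Return the word followed by its hypernyms, most specific first."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def common_concept(keywords):
    """Return the most specific concept that equals or is a hypernym of
    every keyword, or None if the keywords share no concept."""
    if not keywords:
        return None
    chains = [ancestors(k) for k in keywords]
    shared = set(chains[0]).intersection(*chains[1:])
    for concept in chains[0]:  # walk upwards from the most specific concept
        if concept in shared:
            return concept
    return None


topic = common_concept(["poodle", "beagle", "cat"])
```

Here the three keywords are summarised by the single concept `animal`. Going higher in the hierarchy covers more keywords but gives a vaguer topic, so a practical method has to balance coverage against specificity.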

Combining this new topic extraction method with other existing methods (text tiling, visualization) will then lead to an efficient document visualization method that presents the topics found in the different text segments together with a representation of the document structure itself.

2 Readable

PDF - 857.3 kb
Final "official" report
This is the report that was delivered for my master's thesis.
Date of online publication: 21 February 2004
Last update: 14 April 2004

Some Rights Reserved: All rights reserved license, (c) 2007 Pierre Andrews
 

P.S.

This paper is copyrighted. © Pierre Andrews 2004 © LIA EPFL 2004

©
Pierre Andrews
York, uk