Several Networks of Excellence have been set up within the framework of the European FP5 research program. Among them, the NEMIS project focuses on the field of Text Mining. Within this field, document processing and visualization were identified as key topics, and the WG1 working group was created within the NEMIS project to carry out a detailed survey of the techniques associated with the text mining process and to identify the relevant research topics in related research areas. This survey identified efficient visualization of documents as a research theme to be explored further. From this perspective, topic extraction is potentially very useful and therefore needs to be studied in more detail.
In the domain of document visualization, some efforts have been made to create efficient text summarization techniques. Within this framework, simpler presentation methods, which provide an overview of a document by displaying a few words that represent its subject(s), are particularly useful. Such methods can rely on topic extraction algorithms, essentially based on statistical methods, that aim at extracting these representative words. Document structure, which is sometimes given by meta-data in the document (e.g. formatting), is an additional element that helps convey the document content to the user by displaying its semantic structure. To this end, semantically homogeneous blocks can be extracted, for instance with Text Tiling, a well-developed technique that uses statistical methods to identify document segments. The scope of this project is therefore to study the combination of document structuring with topic extraction in order to produce a more understandable visualization of content. The main challenge is to create a good topic extraction algorithm that makes use of the document structure. In particular, one of the important goals of this master thesis is to use an available semantic database (the EDR dictionary) to construct a semantically aware algorithm for topic extraction. The semantic database contains hyponym and hyperonym links between words; these relations will be useful for generating a synthetic representation of documents and for computing similarity between words.
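To illustrate how hyponym and hyperonym links can support both generalization and word similarity, here is a minimal sketch in Python. The small hierarchy, the function names and the path-based similarity measure are all assumptions made for illustration; they are not taken from the EDR dictionary, whose concept hierarchy may give a concept several parents, unlike the single-parent tree assumed here.

```python
# Toy hyperonym (is-a) links: child -> parent.
# Hypothetical data for illustration only, not extracted from EDR.
HYPERONYM_OF = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "sparrow": "bird",
    "bird": "animal",
    "animal": "entity",
    "car": "vehicle",
    "vehicle": "entity",
}


def hyperonym_chain(word):
    """Return [word, parent, grandparent, ..., root] following is-a links."""
    chain = [word]
    while chain[-1] in HYPERONYM_OF:
        chain.append(HYPERONYM_OF[chain[-1]])
    return chain


def lowest_common_hyperonym(a, b):
    """Return the most specific concept that subsumes both words, or None."""
    ancestors_of_a = set(hyperonym_chain(a))
    for concept in hyperonym_chain(b):
        if concept in ancestors_of_a:
            return concept
    return None


def path_similarity(a, b):
    """Return 1 / (1 + number of links between a and b via their common hyperonym)."""
    common = lowest_common_hyperonym(a, b)
    if common is None:
        return 0.0
    distance = hyperonym_chain(a).index(common) + hyperonym_chain(b).index(common)
    return 1.0 / (1.0 + distance)
```

In this sketch, the lowest common hyperonym of a group of words is a candidate for a single, more general word that summarizes them, and the path length gives a crude similarity score. Other measures could be substituted without changing the overall idea.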
Combining this new topic extraction method with existing methods (text tiling, visualization) will then lead to an efficient document visualization method that presents the topics found in the different text segments together with a representation of the document structure itself. The research work will be split into the following tasks:
- Definition and implementation of an automated procedure for the identification of semantically homogeneous text segments (text tiling),
- Definition and implementation of an automated procedure for the extraction of topics from the identified segments,
- Visualization of documents that efficiently combines the identified segments and topics,
- Test and evaluation of the obtained prototype on a corpus of documents of substantial size.
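The first two tasks can be sketched as follows, under strong simplifying assumptions. The segmentation below places a boundary wherever the cosine similarity between the word counts of neighbouring sentence windows falls below a fixed threshold; this is a simplification in the spirit of text tiling, not the full algorithm. The topic extraction simply returns the most frequent words of a segment and makes no use of a semantic database. The stop-word list, function names and parameter values are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "is", "are", "to", "its"}


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def cosine(c1, c2):
    """Cosine similarity between two word-count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(
        sum(v * v for v in c2.values())
    )
    return dot / norm if norm else 0.0


def segment(sentences, window=2, threshold=0.1):
    """Split a list of sentences where adjacent windows share little vocabulary."""
    if not sentences:
        return []
    bags = [Counter(tokenize(s)) for s in sentences]
    segments, start = [], 0
    for gap in range(1, len(sentences)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        if cosine(left, right) < threshold:
            segments.append(sentences[start:gap])
            start = gap
    segments.append(sentences[start:])
    return segments


def topics(segment_sentences, k=2):
    """Return the k most frequent non-stop words of a segment."""
    counts = Counter(w for s in segment_sentences for w in tokenize(s))
    return [w for w, _ in counts.most_common(k)]


if __name__ == "__main__":
    text = [
        "The dog chased the cat.",
        "The cat climbed a tree to escape the dog.",
        "Stock markets fell sharply.",
        "Investors sold stock as markets fell.",
    ]
    for seg in segment(text):
        print(topics(seg), seg)
```

A visualization step could then display each segment alongside its topic words. The fixed threshold is the weakest part of this sketch and would need tuning or replacement on real documents.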


