Several Networks of Excellence have been set up within the European FP5 research program. Among them, the NEMIS project focuses on the field of text mining.
Within this field, document processing and visualization was identified as one of the key topics, and the WG1 working group was created within the NEMIS project to carry out a detailed survey of the techniques associated with the text mining process and to identify the relevant research topics in related research areas.
In this document, we present the results of this comprehensive survey. The report includes a description of the current state of the art and practice, a roadmap for follow-up research in the identified areas, and recommendations for anticipated technological development in the domain of text mining.
In the part dedicated to document processing, the discussion focuses on research topics in natural language processing and information retrieval. More precisely, the work covers the tasks related to data selection, filtering and cleaning, morphological normalization and parsing, document representation and similarity computation, and various aspects of data analysis that have all been developed and successfully used in data mining.
In the part dedicated to visualization, the study focuses essentially on the issue of high dimensionality in document representation. Indeed, the high-dimensional representations produced at the various stages of the text mining process are usually not well suited to a simple and easily exploitable presentation of text mining results. These results require specific interpretation techniques, tightly connected to the task of document summarization. In addition, the study identified a clear need for the development of a unified methodology in the field of visualization.