An excess of unstructured data is easily accessible in digital format. This information overload places too heavy a burden on society's capacity to analyze and act on it. Focused (i.e., topic-, query-, question-, or category-driven) multi-document summarization is an information reduction solution whose state of the art has matured to the point that other techniques for modeling human summarization activity now need to be explored. Existing techniques have been mainly extractive, relying on distributional statistics and complex machine learning over corpora to approach the quality of human summaries. Although these techniques remain in use, the field needs to move toward more abstractive approaches that model the human way of summarizing. This work presents a simple, inexpensive and domain-independent system architecture for adding semantic analysis to the summarization process. The proposed system is novel in its use of a new semantic analysis metric to better score sentences for selection into a summary. It also simplifies the semantic processing of sentences in order to capture the most likely semantically related information, reduce redundancy and reduce complexity. The system is evaluated against participants in the Document Understanding Conference and the later Text Analysis Conference using the ROUGE measures of n-gram recall between automated system summaries, human summaries and gold standard baseline summaries. The goal was to show that semantic analysis used for summarization can perform well while remaining simple and inexpensive, without significant loss of recall compared with the foundational baseline system. Current results show improvement over the gold standard baseline when all factors of this work's semantic analysis technique are used in combination. These factors include the semantic cue words feature and semantic class weighting, which identify sentences carrying important information.
In addition, semantic triples clustering, which decomposes natural language sentences into their most basic meaning and selects the most important sentences, contributed to this improvement. Evaluated against the gold standard baseline system on the standardized summarization evaluation metric ROUGE, this work outperforms the baseline by more than ten rank positions. This work shows that semantic analysis and lightweight, open-domain techniques have potential.
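
For readers unfamiliar with the evaluation measure, the n-gram recall underlying ROUGE can be sketched as follows. This is a minimal illustration rather than the official ROUGE toolkit or this work's evaluation setup: the function names `ngrams` and `rouge_n_recall`, the lowercasing, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, references, n=1):
    """Fraction of reference n-grams that also occur in the candidate summary.

    Assumption: text is lowercased and split on whitespace; real evaluations
    typically apply more careful tokenization and optional stemming.
    """
    cand_counts = ngrams(candidate.lower().split(), n)
    matched = 0
    total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        total += sum(ref_counts.values())
        # Clip each match so a repeated n-gram is not credited more often
        # than it appears in the candidate.
        matched += sum(min(count, cand_counts[gram])
                       for gram, count in ref_counts.items())
    return matched / total if total else 0.0


if __name__ == "__main__":
    system_summary = "the cat sat on the mat"
    human_summaries = ["the cat lay on the mat"]
    print(rouge_n_recall(system_summary, human_summaries, n=1))
    print(rouge_n_recall(system_summary, human_summaries, n=2))
```

Because the score is a recall, it rewards a system summary for covering the n-grams of the human reference summaries, which is why the abstract frames the goal as avoiding a significant loss of recall relative to the baseline.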