The availability of the Web and search engines has made finding information easier. Information overload, however, remains a major problem that calls for algorithms and tools offering faster access. Electronic documents are one of the major sources of business and academic information. To use these online documents effectively, it is crucial to be able to extract summaries of them, and a summarization system is one solution to this problem. This project proposes a summarizer system that performs multi-document summarization. The input text documents are passed through a parser, which generates a parse tree for each sentence. RDF triples of the form subject, verb and object are extracted from each sentence by analyzing the typed dependencies generated by the parser. The semantic distance between each pair of sentences is computed and stored in a matrix; the measure adopted is the Wu and Palmer distance. A clustering algorithm is then applied to the extracted subject, verb and object space, grouping the extracted RDF triples into clusters. Finally, a sentence selection algorithm extracts the important sentences for the final summary.
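The distance step described above can be illustrated with a minimal sketch. It assumes a tiny hand-built taxonomy in place of WordNet, takes one already-extracted (subject, verb, object) triple per sentence, and averages the Wu and Palmer similarity slot by slot. The taxonomy, the sample triples and the averaging rule are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical toy taxonomy (child -> parent). A real system would
# look concepts up in a lexical database such as WordNet instead.
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "action": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
    "chase": "action",
    "follow": "action",
}


def path_to_root(concept):
    """Return the concept followed by all of its ancestors up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth in the taxonomy, counting the root as depth 1."""
    return len(path_to_root(concept))


def wu_palmer_similarity(a, b):
    """Wu and Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # Walking upward from b, the first shared node is the deepest common ancestor.
    lcs = next(c for c in path_to_root(b) if c in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def triple_distance(t1, t2):
    """Assumed rule: 1 minus the mean slot-wise similarity of two triples."""
    sims = [wu_palmer_similarity(x, y) for x, y in zip(t1, t2)]
    return 1 - sum(sims) / len(sims)


def distance_matrix(triples):
    """Pairwise semantic distance matrix over a list of triples."""
    return [[triple_distance(a, b) for b in triples] for a in triples]


# Hypothetical triples, one per sentence: (subject, verb, object).
TRIPLES = [
    ("dog", "chase", "cat"),
    ("cat", "follow", "dog"),
    ("car", "follow", "car"),
]

if __name__ == "__main__":
    for row in distance_matrix(TRIPLES):
        print(["%.3f" % value for value in row])
```

A matrix of this kind could then be handed to any clustering algorithm that accepts precomputed distances; the abstract does not say which algorithm the system uses.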
Automatic data extraction from Web pages is a challenging yet significant problem in the fields of Information Retrieval and Data Mining. The problem arises particularly on the World Wide Web because search engines wrap the results of user queries in web response pages. These response pages are often decorated with sidebars, branding banners and advertisements. Automatic data extraction therefore has to extract the relevant data from these pages. Though many automated and manual text analysis solutions to this problem exist, most of them depend heavily on the specifics of HTML and have to be changed whenever the markup language changes. This paper proposes a novel, language-independent technique that solves the data extraction problem with a combined approach that makes use of features of the DOM tree as well as the visual features of HTML elements.
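The abstract does not specify which DOM or visual features are used, so the following is only a rough sketch of how the two kinds of signal might be combined. It assumes a DOM feature based on non-link text length and a visual feature based on rendered area, with box sizes supplied through hypothetical `data-w` and `data-h` attributes standing in for what a rendering engine would report. The scoring rule, the attribute names and the sample page are all assumptions made for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A minimal DOM node: tag, attributes, children and directly owned text."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple tree; assumes reasonably well-formed markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("document", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None and self.current.tag == tag:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def iter_nodes(node):
    """Depth-first traversal of the tree."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def text_stats(node, in_link=False):
    """Return (total text length, text length inside <a> elements) for a subtree."""
    in_link = in_link or node.tag == "a"
    total = len(node.text)
    linked = total if in_link else 0
    for child in node.children:
        child_total, child_linked = text_stats(child, in_link)
        total += child_total
        linked += child_linked
    return total, linked


def score(node, page_area):
    """Assumed rule: non-link text length weighted by the share of page area."""
    total, linked = text_stats(node)
    dom_score = total - linked  # DOM feature: text that is not link text
    # Visual feature: hypothetical rendered box size carried in data attributes.
    width = float(node.attrs.get("data-w", 0))
    height = float(node.attrs.get("data-h", 0))
    return dom_score * (width * height / page_area)


def extract_main_block(html, page_area):
    """Return the highest-scoring <div>, or None if the page has none."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    blocks = [n for n in iter_nodes(builder.root) if n.tag == "div"]
    return max(blocks, key=lambda n: score(n, page_area), default=None)


# Hypothetical response page with a banner, a sidebar and a result block.
SAMPLE_PAGE = """
<body>
<div id="banner" data-w="1000" data-h="80"><a href="/">Brand</a></div>
<div id="sidebar" data-w="200" data-h="600"><a href="/a">Link one</a><a href="/b">Link two</a></div>
<div id="results" data-w="800" data-h="600"><p>First result snippet with descriptive text.</p><p>Second result snippet with more text.</p></div>
</body>
"""

if __name__ == "__main__":
    best = extract_main_block(SAMPLE_PAGE, page_area=1000 * 680)
    print(best.attrs["id"])
```

This toy version still relies on HTML tag names such as `a` and `div`, so it does not demonstrate the language independence the paper claims; it only shows how a structural score and a visual score can be multiplied into a single ranking.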