Abstract: The linked open data cloud, with its huge volume of data from heterogeneous and interlinked datasets, has turned the Web into a large data store. It has attracted the attention of both developers and researchers in recent years, opening up new dimensions in machine learning and knowledge discovery. Information extraction procedures for these processes use different approaches, e.g., template-based extraction, federated queries over multiple sources, and fixed-depth link traversal. These approaches are limited by problems with online access to the datasets' SPARQL endpoints, such as servers being down for maintenance, bandwidth throttling, and limits on the number of requests allowed within a given time slot, which may result in imprecise and incomplete feature vectors and thus affect the quality of the discovered knowledge.

The work presented here addresses the disadvantages of online data retrieval by proposing a simple and automatic way to extract features from the linked open data cloud using a link traversal approach in a local environment with a previously identified and known set of interlinked RDF datasets. The user has the flexibility to set the depth to which neighboring properties are traversed during information extraction, producing a feature vector that can be used for machine learning and knowledge discovery. The experiments are performed locally, with the Virtuoso triple store used to store the datasets and an interface developed to build the feature vector. The evaluation compares the obtained feature vector against gold-standard instances annotated manually and includes a case study estimating the effect of a country's demography on its movie production. The advantages of the proposed approach are that it overcomes the problems of online access to data from the linked data cloud, integrates RDF datasets in both local and web environments to build feature vectors for machine learning, and generates background knowledge from the linked data cloud.
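To make the core idea concrete, the following is a minimal sketch of depth-bounded link traversal over a local SPARQL endpoint, in the spirit of the approach described above. It assumes a Virtuoso instance running at its default endpoint (http://localhost:8890/sparql) with the interlinked RDF datasets already loaded; the endpoint URL, the seed URI, the depth value, and the use of the SPARQLWrapper library are illustrative choices, not details taken from the paper.

```python
# A minimal sketch of depth-bounded link traversal for feature extraction.
# Assumptions (not from the paper): a local Virtuoso SPARQL endpoint at its
# default URL, SPARQLWrapper as the client library, and an illustrative seed URI.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://localhost:8890/sparql"  # default Virtuoso SPARQL endpoint

def traverse(seed_uri: str, depth: int) -> dict:
    """Collect (property, value) pairs reachable from seed_uri within the
    user-chosen traversal depth, as a flat feature dictionary.
    For brevity, only the last value seen per (level, property) key is kept."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setReturnFormat(JSON)
    features = {}
    frontier = [seed_uri]
    for level in range(depth):
        next_frontier = []
        for uri in frontier:
            # Fetch all outgoing properties of the current resource.
            sparql.setQuery(f"SELECT ?p ?o WHERE {{ <{uri}> ?p ?o }}")
            for row in sparql.query().convert()["results"]["bindings"]:
                p, o = row["p"]["value"], row["o"]["value"]
                features[f"{level}:{p}"] = o
                # Follow only URI-valued objects to the next depth level.
                if row["o"]["type"] == "uri":
                    next_frontier.append(o)
        frontier = next_frontier
    return features

# Hypothetical usage, assuming a DBpedia resource is in the local store:
# vec = traverse("http://dbpedia.org/resource/India", depth=2)
```

Because the traversal runs against a local triple store, the sketch sidesteps the endpoint-availability and rate-limit issues of online retrieval, and the depth parameter gives the user direct control over how far the neighborhood is explored and hence over the size of the resulting feature vector.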