<p>This research proposes automated hierarchical classification of scanned documents whose content combines unstructured text with special patterns (specific, short strings), using a convolutional neural network (CNN) and the regular expression method (REM). The research data are digital correspondence documents in PDF image format from the Pusat Data Teknologi dan Informasi (Technology and Information Data Center). The document hierarchy covers type of letter, type of manuscript letter, origin of letter, and subject of letter. The research method consists of preprocessing, classification, and storage in a database. Preprocessing covers text extraction using Tesseract optical character recognition (OCR) and formation of word document vectors with Word2Vec. Hierarchical classification uses the CNN to classify 5 types of letter and regular expressions to classify 4 types of manuscript letter, 15 origins of letter, and 25 subjects of letter. The classified documents are stored in a Hive database in a Hadoop big data architecture. The data set comprises 5200 documents: 4000 for training, 1000 for testing, and 200 for classification prediction. In the trial on the 200 new documents, 188 were classified correctly and 12 incorrectly, giving an accuracy of 94% for the automated hierarchical classification. As future work, content-based search of the classified scanned documents can be developed.</p>
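The regular-expression stage of such a hierarchy can be illustrated with a minimal Python sketch. The pattern table, the labels, and the helper names `classify_by_regex` and `accuracy` are assumptions made for illustration; the abstract does not list the actual expressions, and the CNN stage is not shown.

```python
import re

# Hypothetical patterns: the abstract does not publish its real expressions.
ORIGIN_PATTERNS = {
    "finance_bureau": re.compile(r"\bbiro\s+keuangan\b", re.IGNORECASE),
    "personnel_bureau": re.compile(r"\bbiro\s+kepegawaian\b", re.IGNORECASE),
}


def classify_by_regex(text, patterns, default="unknown"):
    """Return the label of the first pattern found in the OCR text."""
    for label, pattern in patterns.items():
        if pattern.search(text):
            return label
    return default


def accuracy(correct, total):
    """Fraction of documents classified correctly."""
    return correct / total


if __name__ == "__main__":
    sample = "Dari: Biro Keuangan, perihal laporan tahunan"
    print(classify_by_regex(sample, ORIGIN_PATTERNS))
    # The reported trial: 188 of 200 documents correct.
    print(f"{accuracy(188, 200):.0%}")
```

Short, fixed strings such as an office name suit pattern matching, which is presumably why the lower levels of the hierarchy rely on regular expressions instead of a learned model.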
This research aims to analyse 1) the interests of the European Union in the Renewable Energy Directive; 2) the forms of protection the European Union applies to Indonesian palm oil under the Renewable Energy Directive; 3) the transformation of Indonesian palm oil management after the European Union Renewable Energy Directive. The results show that the European Union has two interests in implementing the Renewable Energy Directive: environmental protection with simultaneous criteria, and palm oil protection in the European Union. The trade protection applied to Indonesian palm oil is found to be a form of green protectionism. This protection is implemented through non-tariff measures that act as a trade barrier, and it affects palm oil exports from Indonesia to the European Union. The protection also influenced the transformation of palm oil management policy in Indonesia. The policies concerned are RSPO, ISPO, the Presidential Directive on Primary Forest and Peatland, and the Presidential Directive on Moratorium and Forest Land Allocation. This research shows that palm oil production in Indonesia changed after the implementation of the Renewable Energy Directive in the European Union. This improvement in palm oil production indicates that the policy is influenced by market drivers and not only by environmental issues.
The horror story of Dancer Village in Indonesia is a viral topic that has been widely discussed on Twitter. Various responses and public opinions emerged about the truth of the story, which describes the supernatural experiences of students during a Real Work Lecture in a region of East Java, Indonesia. This study conducted a sentiment analysis of public comments on Twitter about the viral topic using a lexicon-based method. Sentiment is classified into 3 classes: positive, negative, and neutral. The research phases consist of data collection, pre-processing, processing (sentiment analysis), and visualization. Data collection used the Twitter Search API to retrieve 1000 tweets matching the Indonesian keyword "Penari Desa". The lexicon assessment of the 1000 tweets yielded 33 positive, 767 neutral, and 200 negative tweets, that is, 3.3% positive, 76.7% neutral, and 20% negative.
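A lexicon-based classifier of the kind described can be sketched as follows. The two word lists are tiny hypothetical stand-ins, and the scoring rule (positive hits minus negative hits) is an assumed, common variant; the abstract does not specify its lexicon or its scoring.

```python
import re

# Hypothetical mini-lexicons; the study's actual lexicon is not given.
POSITIVE = {"bagus", "seru", "keren"}
NEGATIVE = {"seram", "takut", "bohong"}


def score_tweet(text):
    """Label a tweet by counting lexicon hits among its word tokens."""
    tokens = re.findall(r"\w+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def class_percentages(counts):
    """Convert per-class tweet counts into percentages of the total."""
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


if __name__ == "__main__":
    print(score_tweet("Ceritanya seram banget!"))
    # Counts reported in the abstract for the 1000 collected tweets.
    print(class_percentages({"positive": 33, "neutral": 767, "negative": 200}))
```

A tweet with no lexicon hits falls into the neutral class, which is one plausible reason the neutral share dominates when the lexicon is small relative to the vocabulary of the tweets.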
Digitalization of documents is now being carried out in all fields to reduce paper usage. The availability of modern technology in the form of scanners and cameras supports the growth of multimedia data, especially documents stored as image files. Searching for a particular text in large-scale scanned document images is difficult when the text has not been extracted from the images. In this research, a text extraction method for large-scale scanned document images using Google Vision OCR on the Hadoop architecture is proposed. The objects of the research are student thesis documents, comprising the cover page, the approval page, and the abstract. All documents are stored in the university's digital library. The extraction process begins by preparing an input folder containing image documents (in JPEG format) in the Hadoop Distributed File System (HDFS) of Apache Hadoop, followed by reading each image document. The image document is then processed with Google Vision OCR to obtain a text document (in TXT format), and the result is saved to an output folder in HDFS. The same process is repeated for all documents in the folder. Test results show that the proposed method extracted all test documents successfully. The recognition process achieved 100% accuracy, and extraction was twice as fast as manual extraction. Google Vision OCR also showed better extraction performance than other OCR tools. The proposed automated extraction system can recognize text in large-scale image documents accurately and can be operated in a real-time environment.
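The folder-by-folder extraction loop described above can be sketched in Python. This is a simplified illustration under stated assumptions: a local directory stands in for the HDFS input and output folders, and the OCR engine is passed in as a hypothetical `ocr` callable so that no particular Google Vision client call is asserted here. The function name `extract_folder` is also an assumption.

```python
from pathlib import Path
from typing import Callable

JPEG_SUFFIXES = {".jpg", ".jpeg"}


def extract_folder(input_dir: Path, output_dir: Path,
                   ocr: Callable[[bytes], str]) -> int:
    """Run `ocr` on every JPEG in input_dir and write one TXT per image.

    `ocr` is a placeholder for the real OCR call (image bytes in, text out).
    Returns the number of documents processed.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    processed = 0
    for image_path in sorted(input_dir.iterdir()):
        if image_path.suffix.lower() not in JPEG_SUFFIXES:
            continue  # skip anything that is not a JPEG image
        text = ocr(image_path.read_bytes())
        target = output_dir / (image_path.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        processed += 1
    return processed
```

Injecting the OCR step as a parameter keeps the loop testable with a stub and leaves the choice of engine, and of storage layer, open.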
The growth of digital correspondence documents of various types, with inconsistent naming rules and no adequate search system, complicates searching for specific content. When documents are unclassified, the search becomes inaccurate and takes a long time. This research proposes an archiving method with automatic hierarchical classification and a content-based search method that displays ontology classification information as a solution to these content-based search problems. The method consists of preprocessing (creation of an automatic hierarchical classification model using a combination of a convolutional neural network (CNN) and the regular expression method), archiving (document archiving with automatic classification), and retrieval (content-based search that displays ontology relationships from the document classification). The archiving of 100 documents using the automatic hierarchical classification was found to be 79% accurate, as indicated by the 99% accuracy for the CNN and 80% for Regex. Moreover, the search results for classified content-based documents through the display of ontology relationships were found to be 100% accurate. This research succeeded in improving the quality of search results for digital correspondence documents, as indicated by higher specificity, accuracy, and speed compared with conventional methods based on file names, annotations, and unclassified content.
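One reading consistent with the reported figures, offered here as an assumption and not as the authors' stated calculation, is that a document counts as correctly archived only when both stages are correct, so the stage accuracies multiply. The helper `combined_accuracy` is hypothetical.

```python
def combined_accuracy(stage_accuracies):
    """Multiply per-stage accuracies, assuming stages err independently
    and every stage must be right for the document to count as correct."""
    result = 1.0
    for acc in stage_accuracies:
        result *= acc
    return result


if __name__ == "__main__":
    overall = combined_accuracy([0.99, 0.80])  # CNN stage, regex stage
    print(f"{overall:.1%}")  # prints 79.2%
```

Under this reading the regular-expression stage is the bottleneck: raising it from 80% to 90% would lift the combined figure to about 89%, while the CNN stage has almost no room left to contribute.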