Using a novel database, ProDES, developed by the Crime and Justice Research Center at Temple University, this article investigates the relationship between spatial characteristics and juvenile delinquency and recidivism (the proportion of delinquents who commit crimes following completion of a court-ordered program) in Philadelphia, PA. ProDES was originally a case-based sample, where the cases were adjudicated in family court, 1994-2004. For our analysis, we focused on 6768 juvenile males from the data set. To address the difficult issue of nonstationarity in the data, we considered various two-way clustering algorithms to group the juveniles into 'types' by way of the many variables that described them. Following different modeling scenarios, we applied the plaid biclustering algorithm, in which a sequence of subsets ('layers') of both juveniles and variables is extracted from the data one layer at a time, and in which overlapping layers are allowed. This type of 'biclustering' is a new way of studying juvenile-offense data. We show that the juveniles within each layer can be viewed as spatially clustered. The layers were determined as descriptive tools to aid in identifying subsets of the data that could be useful in policy making. Statistical relationships of the variables and juveniles within each layer are then studied using neural network models. Results indicate that the methods of this paper are more successful in predicting juvenile recidivism in urban environments when different crimes are modeled as separate data sets rather than pooled together as a single data set.
For health and human services, fraud detection and other security services, identity resolution is a core requirement for understanding big data in the cloud. Because there is no globally unique identifier, and because the same identity is often captured with typographic differences, identity resolution has high space and time complexity. We propose a filter-and-verify method to substantially increase the speed of approximate string matching using edit distance. This method has been found to be almost 80 times faster (130 times when combined with other optimizations) than Damerau-Levenshtein edit distance while preserving all approximate matches. Our method creates compressed signatures for data fields and uses Boolean operations and an enhanced bit counter to quickly compare the distance between the fields. This method is intended to be applied to data records whose fields contain relatively short strings, such as those found in most demographic data. Without loss of accuracy, the proposed Fast Bitwise Filter will provide a substantial performance gain for approximate string comparison in database, record linkage and deduplication data processing systems.
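The filter-and-verify idea described above can be illustrated with a minimal Python sketch. This is not the authors' Fast Bitwise Filter: the signature layout (one bit per character code modulo 64), the function names, and the rejection rule are assumptions chosen for illustration. The sketch relies on the fact that a single insertion, deletion, substitution or adjacent transposition can add at most one bit to a character-presence signature and remove at most one, so a pair whose signatures differ by more than k bits in either direction cannot be within edit distance k. The verification step uses the optimal string alignment (restricted Damerau-Levenshtein) distance.

```python
def signature(s: str) -> int:
    """Hypothetical 64-bit signature: one bit per character code modulo 64."""
    sig = 0
    for ch in s:
        sig |= 1 << (ord(ch) % 64)
    return sig


def popcount(x: int) -> int:
    """Count set bits in a non-negative integer."""
    return bin(x).count("1")


def passes_filter(a: str, b: str, k: int) -> bool:
    """Cheap necessary condition for edit distance <= k (never rejects a true match)."""
    if abs(len(a) - len(b)) > k:
        return False
    sa, sb = signature(a), signature(b)
    only_a = popcount(sa & ~sb)  # bits that must be removed to reach b
    only_b = popcount(sb & ~sa)  # bits that must be added to reach b
    return max(only_a, only_b) <= k


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            # Adjacent transposition, e.g. "ht" <-> "th"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def within_k(a: str, b: str, k: int) -> bool:
    """Filter first; run the expensive distance only on surviving pairs."""
    return passes_filter(a, b, k) and osa_distance(a, b) <= k
```

Because the filter only discards pairs that provably exceed the threshold, `within_k(a, b, k)` returns the same answer as `osa_distance(a, b) <= k` for every pair, while most dissimilar pairs are rejected with a few bitwise operations.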
The appropriate choice of a method for imputation of missing data becomes especially important when the fraction of missing values is large and the data are of mixed type. The proposed dynamic clustering imputation (DCI) algorithm relies on similarity information from shared neighbors, where mixed-type variables are considered together. When evaluated on a public social science dataset of 46,043 mixed-type instances with up to 33% missing values, DCI resulted in more than 20% improved imputation accuracy over Multiple Imputation, Predictive Mean Matching, Linear and Multilevel Regression, and Mean Mode Replacement methods. Data imputed by the six methods were used to test NB-Tree, Random Subset Selection and Neural Network-based classification models. In our experiments, classification accuracy obtained using DCI-preprocessed data was markedly higher than when relying on the alternative imputation methods for data preprocessing.
Our research explores Record Linkage (RL), also known as Entity Resolution, Record Matching and the Object Identity Problem, as it is commonly practiced in big health services databases, along with some of the approximate string matching methods used for this purpose. We also propose potential improvements to RL and string matching that have been shown in experiments to increase the quality and efficiency of information systems tasked with this problem. We have developed an in-memory graph-based data model, Aggregate Link and Iterative Match (ALIM), which compresses data by eliminating redundancy and stores alias, approximate and phonetic match links between stored data. We have also developed an enhanced edit-distance optimization, the Probabilistic Signature Hash Filter (PSH), which can perform the Damerau-Levenshtein (DL) edit-distance comparison nearly 6000 times faster than DL alone and produce exactly the same approximate match results. Our experiments show significant accuracy and performance gains over a system currently in use by a local health department.