Investigating The Reproducibility of NPM Packages

Goswami, Pronnoy; Gupta, Saksham; Li, Zhiyuan; Meng, Na; Yao, Daphne

doi:10.1109/icsme46990.2020.00071

Cited by 21 publications

(18 citation statements)

References 5 publications

“…Several researchers have proposed checking for differences between packages hosted on registries and their purported source code as a way of detecting malware. Goswami et al [13] report that this is difficult for npm packages due to many irrelevant but nonmalicious differences, an experience that tallies with ours. Vu et al [28,30] study the same problem for PyPI, and similarly conclude that non-reproducibility by itself is a weak indicator of maliciousness and needs to be combined with other techniques to become effective, which is what we have done in this work.…”

Section: Related Work (supporting)

confidence: 82%

“…Consequently, being able to reproduce a package version from its source code is a good indicator that it is benign. As has been noted previously [13], even perfectly benign packages may fail to reproduce for a variety of reasons, but this is acceptable in our case since we are only using this criterion to filter out benign packages erroneously flagged as malicious, not to detect new ones.…”

Section: Introduction (mentioning)

confidence: 95%

“…On the third day, we retrained the classifiers using the basic corpus as well as both N₁ and N₂, and so forth for each subsequent day. The intuition here is that we want to mimic a usage pattern where results from the classifiers are inspected by a human auditor, and the classifiers are then retrained with the additional ground truth obtained in this way.…”

Section: Experiments 1: Classifying Newly Published Packages (mentioning)

confidence: 99%

“…Malicious-package detection. Previous work in this area can be broadly divided into four categories: general-purpose malicious-package detection approaches using machine learning [10] or program analysis [8,21,23]; techniques for rebuilding packages from source [13,28,30]; and finally work that specifically targets typosquatting [26,31].…”

Section: Related Work (mentioning)

confidence: 99%

“…To eliminate false positives, we borrow another insight from the literature [13,28,30]: malicious package versions tend not to have their source code publicly available, in order to avoid detection. Consequently, being able to reproduce a package version from its source code is a good indicator that it is benign.…”

Section: Introduction (mentioning)

confidence: 99%


Practical Automated Detection of Malicious npm Packages

Sejfia, Schäfer. 2022. Preprint.

The npm registry is one of the pillars of the JavaScript and TypeScript ecosystems, hosting over 1.7 million packages ranging from simple utility libraries to complex frameworks and entire applications. Each day, developers publish tens of thousands of updates as well as hundreds of new packages. Due to the overwhelming popularity of npm, it has become a prime target for malicious actors, who publish new packages or compromise existing packages to introduce malware that tampers with or exfiltrates sensitive data from users who install either these packages or any package that (transitively) depends on them. Defending against such attacks is essential to maintaining the integrity of the software supply chain, but the sheer volume of package updates makes comprehensive manual review infeasible. We present Amalfi, a machine-learning-based approach for automatically detecting potentially malicious packages that comprises three complementary techniques. We start with classifiers trained on known examples of malicious and benign packages. If a package is flagged as malicious by a classifier, we then check whether it includes metadata about its source repository, and if so whether the package can be reproduced from its source code. Packages that are reproducible from source are not usually malicious, so this step allows us to weed out false positives. Finally, we also employ a simple textual clone-detection technique to identify copies of malicious packages that may have been missed by the classifiers, reducing the number of false negatives. Amalfi improves on the state of the art in that it is lightweight, requiring only a few seconds per package to extract features and run the classifiers, and gives good results in practice: running it on 96287 package versions published over the course of one week, we were able to identify 95 previously unknown malware samples, with a manageable number of false positives.

CCS Concepts: Security and privacy → Malware and its mitigation.
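The three stages described in the abstract can be pictured as a small triage function. The following Python sketch is illustrative only and is not Amalfi's implementation: `PackageVersion`, `fingerprint`, `triage`, the classifier callables, and the `rebuild_from_source` callback are all hypothetical names, and treating "reproducible" as byte-for-byte equality of file contents is a simplifying assumption.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional, Set


@dataclass
class PackageVersion:
    """Hypothetical stand-in for one published package version."""
    name: str
    version: str
    files: Dict[str, bytes]          # path -> file contents
    repo_url: Optional[str] = None   # source repository metadata, if any


def fingerprint(files: Dict[str, bytes]) -> str:
    """Assumed clone fingerprint: SHA-256 over sorted (path, content) pairs."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(files[path])
        digest.update(b"\0")
    return digest.hexdigest()


def triage(
    pkg: PackageVersion,
    classifiers: Iterable[Callable[[PackageVersion], bool]],
    rebuild_from_source: Callable[[PackageVersion], Optional[Dict[str, bytes]]],
    known_malicious: Set[str],
) -> str:
    """Return 'flagged', 'clone', or 'clear' for a package version."""
    # Stage 1: any classifier may flag the package.
    flagged = any(clf(pkg) for clf in classifiers)

    # Stage 2: a flagged package that declares a source repository and can be
    # reproduced from it is treated as a likely false positive.
    if flagged and pkg.repo_url is not None:
        rebuilt = rebuild_from_source(pkg)
        if rebuilt is not None and rebuilt == pkg.files:
            flagged = False
    if flagged:
        return "flagged"

    # Stage 3: catch copies of known malware that the classifiers missed.
    if fingerprint(pkg.files) in known_malicious:
        return "clone"
    return "clear"
```

In this sketch the reproducibility check only ever clears packages and never flags them, which mirrors the abstract's use of reproduction as a false-positive filter rather than as a detector.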

CBK model composition using paired web services and executable functions: A demonstration for individualizing preventive services

Flynn, Taksler, Caverly, et al. 2022. Learning Health Systems.

Introduction: Learning health systems are challenged to combine computable biomedical knowledge (CBK) models. Using common technical capabilities of the World Wide Web (WWW), digital objects called Knowledge Objects, and a new pattern of activating CBK models brought forth here, we aim to show that it is possible to compose CBK models in more highly standardized and potentially easier, more useful ways.

Methods: Using previously specified compound digital objects called Knowledge Objects, CBK models are packaged with metadata, API descriptions, and runtime requirements. Using open-source runtimes and a tool we developed called the KGrid Activator, CBK models can be instantiated inside runtimes and made accessible via RESTful APIs by the KGrid Activator. The KGrid Activator then serves as a gateway and provides a means to interconnect CBK model outputs and inputs, thereby establishing a CBK model composition method.

Results: To demonstrate our model composition method, we developed a complex composite CBK model from 42 CBK submodels. The resulting model, called CM-IPP, is used to compute life-gain estimates for individuals based on their personal characteristics. Our result is an externalized, highly modularized CM-IPP implementation that can be distributed and made runnable in any common server environment.

Discussion: CBK model composition using compound digital objects and distributed computing technologies is feasible. Our method of model composition might be usefully extended to bring about large ecosystems of distinct CBK models that can be fitted and re-fitted in various ways to form new composites. Remaining challenges related to the design of composite models include identifying appropriate model boundaries and organizing submodels to separate computational concerns while optimizing reuse potential.

Conclusion: Learning health systems need methods for combining CBK models from a variety of sources to create more complex and useful composite models. It is feasible to leverage Knowledge Objects and common API methods in combination to compose CBK models into complex composite models.
