The open-source software (OSS) supply chain enlarges the attack surface of a software system, which makes package registries attractive targets for attacks. Recently, multiple package registries have faced intensified attacks through malicious packages, and NPM and PyPI are among the most severely affected. Existing malicious package detectors are developed with features from packages of a single ecosystem and deployed exclusively within that ecosystem, so the knowledge gained from a newly detected malicious NPM package cannot be used to detect new malicious packages in PyPI. Moreover, existing detectors lack support for modeling the malicious behavior of OSS packages sequentially.
To address these two limitations, we propose Cerebro, a single detection model based on malicious behavior sequences, to detect malicious packages in NPM and PyPI. We curate a feature set based on a high-level abstraction of malicious behavior to enable multi-lingual knowledge fusion. We organize the extracted features into a behavior sequence to model sequential malicious behavior. We fine-tune a pre-trained language model to understand the semantics of malicious behavior. Extensive evaluation demonstrates the effectiveness of Cerebro over the state of the art as well as its practically acceptable efficiency.
Cerebro has detected 683 and 799 new malicious packages in PyPI and NPM, respectively, and received 707 thank-you letters from the official PyPI and NPM teams.
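
As a rough illustration of the idea of a language-agnostic behavior sequence, the following minimal sketch maps ecosystem-specific API calls to shared behavior labels and orders them into a single sequence. The label names, the `API_TO_BEHAVIOR` table, the `ApiCall` type, and the helper functions are all hypothetical and chosen for illustration only; they are not the feature set or the extraction procedure used by Cerebro.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical mapping from (language, API name) to an abstract behavior label.
# The point is that PyPI and NPM APIs with the same intent share one label.
API_TO_BEHAVIOR = {
    ("python", "os.environ"): "read_sensitive_info",
    ("javascript", "process.env"): "read_sensitive_info",
    ("python", "requests.post"): "network_send",
    ("javascript", "https.request"): "network_send",
    ("python", "subprocess.Popen"): "process_spawn",
    ("javascript", "child_process.exec"): "process_spawn",
}


@dataclass(frozen=True)
class ApiCall:
    language: str   # "python" or "javascript"
    name: str       # fully qualified API name as it appears in the source
    position: int   # assumed order in which the call would be reached


def behavior_sequence(calls: Iterable[ApiCall]) -> List[str]:
    """Order calls by position and translate known ones to abstract labels."""
    ordered = sorted(calls, key=lambda c: c.position)
    return [
        API_TO_BEHAVIOR[(c.language, c.name)]
        for c in ordered
        if (c.language, c.name) in API_TO_BEHAVIOR
    ]


def to_text(sequence: List[str]) -> str:
    """Join labels into one string, e.g. as input for a text classifier."""
    return " ".join(sequence)


pypi_calls = [
    ApiCall("python", "requests.post", 2),
    ApiCall("python", "os.environ", 1),
]
npm_calls = [
    ApiCall("javascript", "process.env", 1),
    ApiCall("javascript", "console.log", 2),   # unmapped call, ignored
    ApiCall("javascript", "https.request", 3),
]

print(to_text(behavior_sequence(pypi_calls)))
print(to_text(behavior_sequence(npm_calls)))
```

Under these assumptions, a PyPI package and an NPM package that both read environment variables and then send data over the network yield the same sequence, which is what would let a single model share knowledge across the two ecosystems.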