2022 IEEE 19th India Council International Conference (INDICON)
DOI: 10.1109/indicon56171.2022.10040147
Analysis of Subword based Word Representations Case Study: Fasttext Malayalam

Cited by 1 publication (1 citation statement, published 2023)
References: 34 publications
“…However, all six models share a similar architecture which is based on 1-D convolutional neural networks (CNNs) which take blocks of raw bytes as input and embed them into a trainable latent space. Shifting individual bytes into a latent space was inspired by the current state-of-the-art natural language processing models where words, or sub-words, are embedded into a common latent space before being sent through a neural network [30]- [34]. The use of byte embeddings instead of 1-hot encoding or hand-crafted features such as input is, arguably, one of the key insights offered by the FiFTy research paper.…”
Section: B. Approaches to File Fragment Type Identification
Confidence: 99%
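The excerpt above describes an architecture in which raw bytes are mapped through a trainable embedding table and then processed by a 1-D CNN, analogous to how words or subwords are embedded before being fed to NLP models such as FastText. Below is a minimal sketch of that idea, assuming PyTorch; the layer sizes, kernel widths, and class count are illustrative assumptions and do not reproduce the cited FiFTy models.

import torch
import torch.nn as nn

class ByteEmbeddingCNN(nn.Module):
    """Sketch: embed raw bytes into a latent space, then apply a 1-D CNN."""
    def __init__(self, embed_dim=32, num_classes=75):  # sizes are assumptions
        super().__init__()
        # 256 possible byte values, each mapped to a learned embedding vector
        self.embed = nn.Embedding(num_embeddings=256, embedding_dim=embed_dim)
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, 64, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),   # pool over the block length
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, byte_block):
        # byte_block: LongTensor of shape (batch, block_len), values 0-255
        x = self.embed(byte_block)      # (batch, block_len, embed_dim)
        x = x.transpose(1, 2)           # (batch, embed_dim, block_len) for Conv1d
        x = self.conv(x).squeeze(-1)    # (batch, 64)
        return self.classifier(x)       # per-class logits

# Usage: classify a batch of two hypothetical 4096-byte fragments
model = ByteEmbeddingCNN()
blocks = torch.randint(0, 256, (2, 4096))
logits = model(blocks)

The key point the citation statement makes is that the embedding table replaces 1-hot encoding or hand-crafted features: the mapping from byte values to the latent space is itself learned during training.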