Content-Based Image Retrieval (CBIR) is a highly active research field with numerous applications that are expanding beyond traditional CBIR methodologies. In this paper, a CBIR methodology is proposed to meet such demands. The query inputs of the proposed methodology are an image and a text: given an image, a user would like to retrieve a similar one with some modification described in text form, which we refer to as a text-modifier. The proposed methodology uses a set of neural networks that operate in feature space and perform feature composition in a single, known domain, namely the textual feature domain. ResNet is used to extract image features and an LSTM to extract text features from the query inputs. The methodology employs three single-hidden-layer non-linear feedforward networks in a cascading structure, labeled NetA, NetC, and NetB. NetA maps image features to corresponding textual features. NetC composes the textual features produced by NetA with the text-modifier features to form the textual features of the target image. NetB maps these target textual features to target image features, which are used to retrieve the target image from the image base by cosine similarity. The proposed architecture was tested with ResNet-18, ResNet-50, and ResNet-152 as image feature extractors. The results are promising and, to our knowledge, competitive with the most recent approaches, as listed in Section 5.
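The NetA → NetC → NetB cascade described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' trained model: the feature dimensions, the concatenation-based composition in NetC, and the random untrained weights are all assumptions standing in for real ResNet/LSTM features and learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(in_dim, hid_dim, out_dim):
    """A single-hidden-layer non-linear feedforward net (random, untrained weights)."""
    W1 = rng.standard_normal((in_dim, hid_dim)) * 0.1
    W2 = rng.standard_normal((hid_dim, out_dim)) * 0.1
    return lambda x: np.maximum(x @ W1, 0.0) @ W2  # ReLU hidden layer

IMG_DIM, TXT_DIM, HID = 512, 300, 256  # hypothetical feature sizes

net_a = mlp(IMG_DIM, HID, TXT_DIM)      # image features -> textual features
net_c = mlp(2 * TXT_DIM, HID, TXT_DIM)  # compose query text + text-modifier features
net_b = mlp(TXT_DIM, HID, IMG_DIM)      # target textual features -> image features

def retrieve(query_img_feat, modifier_feat, image_base):
    """Run the cascade, then rank the image base by cosine similarity."""
    txt = net_a(query_img_feat)
    target_txt = net_c(np.concatenate([txt, modifier_feat]))
    target_img = net_b(target_txt)
    sims = image_base @ target_img / (
        np.linalg.norm(image_base, axis=1) * np.linalg.norm(target_img) + 1e-9)
    return int(np.argmax(sims))  # index of the best-matching image

# Toy usage with random stand-ins for ResNet image features and LSTM text features:
image_base = rng.standard_normal((100, IMG_DIM))
best = retrieve(rng.standard_normal(IMG_DIM), rng.standard_normal(TXT_DIM), image_base)
```

In the actual methodology each network would be trained so that NetA's output lands in the same textual feature space as the LSTM embeddings, making the composition in NetC well-defined.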