Recent development trends in artificial intelligence applications have seen increasing interest in the design of automated systems for depression detection and diagnosis among the affective computing community. Particularly, active research has been conducted in depression diagnosis, based on multi-modal approaches in deep learning technology, which enable utilization of various information through fusion of varied data types. This study proposes a four-stream-based depression diagnosis model consisting of Bidirectional Long Short-Term Memory (Bi-LSTM) and convolutional neural networks (CNN), using speech and text data. One-dimensional features of audio signals are extracted using Mel Frequency Cepstral Coefficients and Gammatone Cepstral Coefficients, and two-dimensional features are extracted from Bark, equivalent rectangular bandwidth, and Log-Mel spectrograms, based on time-frequency transform. The extracted features are applied to Bi-LSTM and CNN-based transfer learning models. Word encoding was used for mapping of text to sequences with numeric indices, and word embedding used for representation of all words in numeric dense vectors. These were applied to Bi-LSTM and ngram-based CNN models. Finally, an ensemble of the softmax values output from the four deep learning models was used to perform depression diagnosis, based on the proposed four-stream model. Using the proposed model, experiments were performed with the Extended Distress Analysis Interview Corpus Wizard of Oz depression database and other datasets. Experimental results showed improved performance by 10.7% to 11.9% over twostream-based state-of-the-art methods. This demonstrates that the proposed model is effective for depression diagnosis.