Metal-organic frameworks (MOFs) are a class of crystalline porous materials that span a vast chemical space owing to their tunable molecular building blocks and diverse topologies. Given that an essentially unlimited number of MOFs can, in principle, be synthesized, constructing structure-property relationships through machine learning allows for efficient exploration of this space and identification of optimal candidates with desired properties. In this work, we introduce MOFTransformer, a multi-modal Transformer encoder pre-trained on 1 million hypothetical MOFs. The model integrates atom-based graph embeddings and energy-grid embeddings to capture the local and global features of MOFs, respectively. When fine-tuned with small datasets of 5,000 to 20,000 MOFs, it achieves state-of-the-art performance in predicting diverse properties, including gas adsorption, diffusion, electronic properties, and even text-mined data. Beyond its universal transfer-learning capability, MOFTransformer provides chemical insight by revealing feature importance through the attention scores of its self-attention layers. As such, the model can serve as a bedrock platform for MOF researchers who seek to develop new machine learning models for their work.
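To make the multi-modal fusion described above concrete, the following is a minimal PyTorch sketch, not the published MOFTransformer implementation: it assumes each MOF has already been featurized into per-atom graph embeddings (local features) and flattened energy-grid patch embeddings (global features), prepends a [CLS] token, adds modality-type embeddings, and passes the combined token sequence through a standard Transformer encoder. All module names, dimensions, and the regression head are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

class MultiModalMOFEncoder(nn.Module):
    """Illustrative sketch of a multi-modal Transformer encoder for MOFs.

    Assumes two pre-computed token sequences per structure:
      - atom_tokens: (B, N_atoms, D) atom-based graph embeddings (local)
      - grid_tokens: (B, N_patches, D) energy-grid patch embeddings (global)
    Names and sizes are hypothetical, not the actual MOFTransformer code.
    """

    def __init__(self, dim=768, depth=12, heads=12):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Modality-type embeddings let the encoder distinguish graph tokens
        # from energy-grid tokens after the two sequences are concatenated.
        self.type_embed = nn.Embedding(2, dim)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Lightweight task head swapped in during fine-tuning,
        # e.g. a single scalar for gas-uptake regression.
        self.head = nn.Linear(dim, 1)

    def forward(self, atom_tokens, grid_tokens):
        b = atom_tokens.size(0)
        atom_tokens = atom_tokens + self.type_embed.weight[0]
        grid_tokens = grid_tokens + self.type_embed.weight[1]
        cls = self.cls_token.expand(b, -1, -1)
        # Joint self-attention over both modalities in one sequence.
        x = torch.cat([cls, atom_tokens, grid_tokens], dim=1)
        x = self.encoder(x)
        return self.head(x[:, 0])  # predict from the [CLS] token


# Usage: a batch of 32 MOFs with 100 atom tokens and 216 grid patches.
model = MultiModalMOFEncoder()
atoms = torch.randn(32, 100, 768)
grid = torch.randn(32, 216, 768)
print(model(atoms, grid).shape)  # torch.Size([32, 1])
```

In the transfer-learning workflow the abstract describes, an encoder of this form would be pre-trained on the hypothetical-MOF corpus, after which only a small task head (and optionally the encoder weights) is fine-tuned on the labeled dataset of 5,000 to 20,000 MOFs; the self-attention weights over atom and grid tokens are what enable the feature-importance analysis mentioned above.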