Humans naturally use referring expressions, combining verbal utterances and nonverbal gestures, to refer to objects and events. As these referring expressions can be interpreted differently from the speaker's or the observer's perspective, people must effectively decide on a perspective when comprehending them. However, existing models do not explicitly learn perspective grounding, which often causes them to perform poorly at understanding embodied referring expressions. To make matters worse, these models are often trained on datasets collected in non-embodied settings, without nonverbal gestures, and curated from an exocentric perspective. To address these issues, in this paper, we present a perspective-aware multitask learning model, called PATRON, for relation and object grounding tasks in embodied settings that utilizes verbal utterances and nonverbal cues. In PATRON, we have developed a guided fusion approach, in which a perspective grounding task guides the relation and object grounding tasks. Through this approach, PATRON learns disentangled task-specific and task-guidance representations, where the task-guidance representations guide the extraction of salient multimodal features to ground relations and objects accurately. Furthermore, we have curated a synthetic dataset of embodied referring expressions with multimodal cues, called CAESAR-PRO. The experimental results suggest that PATRON outperforms the evaluated state-of-the-art visual-language models. Additionally, the results indicate that learning to ground perspective helps machine learning models improve their performance on relation and object grounding tasks. Finally, the insights from the extensive experimental results and the proposed dataset will enable researchers to evaluate visual-language models' effectiveness in understanding referring expressions in other embodied settings.
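To make the guided fusion idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes a gating-style mechanism in which a perspective-grounding representation modulates shared multimodal features before they reach the relation- and object-grounding heads. All module names, dimensions, and the use of classification heads are hypothetical illustration choices.

```python
import torch
import torch.nn as nn


class GuidedFusionSketch(nn.Module):
    """Hypothetical sketch of perspective-guided fusion.

    A perspective-grounding branch produces a task-guidance representation
    that gates the shared multimodal features passed to the relation- and
    object-grounding heads. This mirrors the high-level idea in the abstract,
    not the actual PATRON architecture.
    """

    def __init__(self, feat_dim: int = 512, num_perspectives: int = 3,
                 num_relations: int = 8, num_objects: int = 80):
        super().__init__()
        # Task-specific representation for perspective grounding (assumed).
        self.perspective_encoder = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.perspective_head = nn.Linear(feat_dim, num_perspectives)

        # Task-guidance representation: a gate derived from the perspective
        # branch that modulates the shared multimodal features (assumed).
        self.guidance_gate = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.Sigmoid())

        # Downstream grounding heads (assumed to be classification heads).
        self.relation_head = nn.Linear(feat_dim, num_relations)
        self.object_head = nn.Linear(feat_dim, num_objects)

    def forward(self, multimodal_feat: torch.Tensor):
        # multimodal_feat: fused verbal + nonverbal + visual features, shape (B, feat_dim).
        persp_repr = self.perspective_encoder(multimodal_feat)
        perspective_logits = self.perspective_head(persp_repr)

        # Guidance step: gate the shared features with the perspective representation.
        guided_feat = multimodal_feat * self.guidance_gate(persp_repr)

        relation_logits = self.relation_head(guided_feat)
        object_logits = self.object_head(guided_feat)
        return perspective_logits, relation_logits, object_logits


if __name__ == "__main__":
    model = GuidedFusionSketch()
    feats = torch.randn(4, 512)  # a batch of fused multimodal features
    persp, rel, obj = model(feats)
    print(persp.shape, rel.shape, obj.shape)  # (4, 3), (4, 8), (4, 80)
```

In a multitask setup such as the one described, the perspective, relation, and object heads would typically be trained jointly with a weighted sum of per-task losses, so that the guidance signal is learned alongside the grounding tasks rather than fixed in advance.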