The authors propose a method to improve activity recognition by incorporating contextual information from first-person vision (FPV). Adding this context, i.e. the objects seen while performing an activity, increases activity recognition precision because, in goal-oriented tasks, human gaze precedes the action and tends to focus on task-relevant objects. The method extracts object information from FPV images and combines it with activity information from external or FPV videos to train an Artificial Neural Network (ANN). The authors evaluate four camera configurations, combining a gaze/eye-tracker camera, a head-mounted camera, and externally mounted cameras, on three standard cooking datasets: the Georgia Tech Egocentric Activities Gaze dataset, the Technische Universität München Kitchen dataset, and the CMU Multi-Modal Activity Database. Adding object information when training the ANN increases average precision from 58.02% to 74.03% and average accuracy from 89.78% to 93.42%. The experiments also show that when objects are not considered, an external camera is necessary, whereas when objects are considered, the combination of internal and external cameras is optimal because of their complementary advantages in observing hands and objects. Adding object information also reduces the average number of ANN training cycles from 513.25 to 139, indicating that the object context supplies critical information that speeds up training.
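The paper itself specifies the exact features and network layout; as a rough illustration of the idea, the sketch below shows one plausible way to append an object-presence vector from FPV frames to activity features before training a small feed-forward ANN. All names, dimensions, the random placeholder data, and the use of scikit-learn's MLPClassifier are assumptions for illustration, not the authors' implementation, so no performance or training-speed effect should be expected from this toy data.

```python
# Illustrative sketch only: combining an FPV object-context vector with
# activity features for an ANN classifier. Data here is random placeholder.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_samples, n_activity_feats, n_objects, n_classes = 200, 32, 10, 5

# Activity features, e.g. motion descriptors from external or FPV video (assumed).
activity_feats = rng.normal(size=(n_samples, n_activity_feats))

# Binary object-context vector: which objects were seen in the FPV frames (assumed).
object_context = rng.integers(0, 2, size=(n_samples, n_objects)).astype(float)

labels = rng.integers(0, n_classes, size=n_samples)

# Configuration without object context: activity features only.
ann_no_obj = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
ann_no_obj.fit(activity_feats, labels)

# Configuration with object context: concatenate object vector with activity features.
combined = np.hstack([activity_feats, object_context])
ann_with_obj = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
ann_with_obj.fit(combined, labels)

# Number of training iterations each model actually ran.
print("iterations without objects:", ann_no_obj.n_iter_)
print("iterations with objects:   ", ann_with_obj.n_iter_)
```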