An important and challenging research problem in the web of things is how to select an appropriate composition of concrete services in a dynamic and unpredictable environment. The main goal of this article is to select the optimal composition from all possible ones without knowing the users' quality of service (QoS) preferences a priori. From a theoretical point of view, we give bounds on the size of the problem's search space. Because the user's QoS preferences are unknown, we propose a vector-valued Markov decision process (MDP) approach for finding the optimal QoS-aware services composition. The algorithm alternates between solving the MDP with dynamic programming and learning the preferences via direct queries to the user. An important feature of the proposed algorithm is that it finds the optimal composition while limiting the number of interactions with the user. Experiments on a large real-world dataset with more than 3500 web services show that our algorithm finds the optimal composite services with around 50 interactions with the user.

Index Terms-Quality of services (QoS), reinforcement learning (RL), services composition, web of things (WoT).

I. INTRODUCTION

The web of things (WoT) opens the way for the development of new intelligent applications that can provide innovative and valuable services in several domains, such as smart cities, smart homes, ambient assisted living, and connected cars [11], [35]. One of the WoT challenges is handling services composition so as to ensure suitable flexibility and customization of services. This is crucial to meet the evolving needs of users, which can be expressed through several quality of service (QoS) parameters, such as response time, throughput, availability, price, and popularity. Following the web service paradigm [4], services composition problems are specified as a workflow involving abstract and concrete services.
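To make the notions of workflow, abstract service, and concrete service more tangible, the following minimal Python sketch models a purely sequential workflow in which each abstract service has a few candidate concrete services, each annotated with a QoS vector. Everything in it is an illustrative assumption rather than material from this article: the service names, the two QoS attributes (response time in ms and price in cents, both to be minimized), the additive aggregation, and above all the fixed weight vector. The article's setting is precisely the one where such weights are not known in advance, so the brute-force enumeration below only illustrates how the search space grows and is not the proposed algorithm.

```python
from itertools import product
from math import prod

# Hypothetical sequential workflow: abstract service -> candidate concrete
# services, each with an illustrative QoS vector (response_time_ms, price_cents).
# Lower is better for both attributes.
workflow = {
    "sense":   {"s1": (120, 50), "s2": (80, 90)},
    "process": {"p1": (200, 100), "p2": (150, 150), "p3": (300, 40)},
    "notify":  {"n1": (50, 20), "n2": (30, 60)},
}

def search_space_size(wf):
    """Number of distinct compositions: product of candidate counts."""
    return prod(len(candidates) for candidates in wf.values())

def aggregate(wf, composition):
    """Sum the QoS vectors component-wise (assumes additive attributes)."""
    vectors = [wf[abstract][concrete] for abstract, concrete in zip(wf, composition)]
    return tuple(sum(column) for column in zip(*vectors))

def best_composition(wf, weights):
    """Brute-force search for the composition with the lowest weighted cost.

    The weights stand in for user preferences and are assumed to be known
    here, which is exactly what the article does not assume.
    """
    def cost(composition):
        return sum(w * q for w, q in zip(weights, aggregate(wf, composition)))
    # Iterating each inner dict yields the names of its concrete services.
    return min(product(*wf.values()), key=cost)

if __name__ == "__main__":
    print(search_space_size(workflow))
    for weights in [(1, 0), (0, 1)]:
        chosen = best_composition(workflow, weights)
        print(weights, chosen, aggregate(workflow, chosen))
```

Even in this toy setting, different weight vectors select different compositions, and the number of compositions is the product of the candidate counts per abstract service, so exhaustive enumeration quickly becomes impractical as the workflow and the candidate sets grow.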