Consider a transmission scheme with a single transmitter and multiple receivers over a faulty broadcast channel. For each receiver, the transmitter has a unique infinite stream of packets, and its goal is to deliver them at the highest possible throughput. While such multiple-unicast models are unsolved in general, several network-coding-based schemes have been suggested. In such schemes, the transmitter can send either an uncoded packet or a coded packet, which is a function of a few packets. A sent packet can be received by its designated receiver (with some probability) or overheard and stored by other receivers. We consider two operating modes: in the first, the storage time is unlimited, while in the second it is limited by a given Time To Live (TTL) parameter. We model the transmission process as an infinite-horizon Markov Decision Process (MDP). Since the large state space renders exact solutions computationally impractical, we introduce policy-restricted and induced MDPs with a significantly reduced state space, which, with a properly chosen reward, have the same optimal value function. We then derive a reinforcement learning algorithm that approximates the optimal strategy and significantly improves over uncoded schemes. The algorithm adapts to packet loss rates that are unknown in advance, attains a high gain over the uncoded setup, and is comparable with the upper bound by Wang, which was derived for a much stronger coding scheme.
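To make the coding opportunity concrete, the following is a minimal illustrative sketch and not the MDP formulation or the reinforcement learning algorithm of this work. It assumes two receivers, independent erasures, unlimited storage time (the first mode), at most one stored packet per receiver, and a simple greedy rule: send the XOR of the two head-of-line packets whenever each has been overheard by the other receiver, and otherwise send an uncoded packet. The function name `simulate` and all parameters are hypothetical.

```python
import random


def simulate(erasure, slots, coded, seed=0):
    """Packets delivered per slot for two receivers on an erasure broadcast channel.

    erasure[i] is the assumed loss probability of receiver i (illustrative model).
    """
    rng = random.Random(seed)
    # overheard[i] is True when receiver i's head-of-line packet was missed
    # by receiver i but is stored at the other receiver.
    overheard = [False, False]
    delivered = 0
    turn = 0  # round-robin pointer for uncoded transmissions
    for _ in range(slots):
        # got[i] is True when receiver i hears this slot's transmission.
        got = [rng.random() >= erasure[i] for i in range(2)]
        if coded and all(overheard):
            # Send the XOR of both head-of-line packets: a receiver that hears
            # it cancels the stored packet and decodes its own.
            for i in range(2):
                if got[i]:
                    delivered += 1
                    overheard[i] = False
        else:
            # Send an uncoded packet; in coded mode, prefer the receiver whose
            # head-of-line packet has not been overheard yet.
            target = 1 - turn if (coded and overheard[turn]) else turn
            turn = 1 - turn
            if got[target]:
                delivered += 1
                overheard[target] = False
            elif got[1 - target]:
                overheard[target] = True
    return delivered / slots


if __name__ == "__main__":
    loss = (0.5, 0.5)  # assumed loss rates, chosen only for illustration
    print("uncoded:", simulate(loss, 100_000, coded=False))
    print("coded:  ", simulate(loss, 100_000, coded=True))
```

Under these assumptions, a coded slot can serve both receivers at once, which is where the gain over the uncoded baseline comes from. The greedy rule here is fixed by hand and knows nothing about the loss rates; the learned policy described above would instead choose actions from experience.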