Physics-based manipulation in clutter involves complex interaction between multiple objects. In this paper, we consider the problem of learning, from interaction in a physics simulator, manipulation skills to solve this multi-step sequential decision making problem in the real world. Our approach has two key properties: (i) the ability to generalize and transfer manipulation skills (over the type, shape, and number of objects in the scene) using an abstract imagebased representation that enables a neural network to learn useful features; and (ii) the ability to perform look-ahead planning in the image space using a physics simulator, which is essential for such multi-step problems. We show, in sets of simulated and real-world experiments (video available on https://youtu.be/EmkUQfyvwkY), that by learning to evaluate actions in an abstract image-based representation of the real world, the robot can generalize and adapt to the object shapes in challenging real-world environments.