As robots move off factory floors and into our homes and workplaces, they face the challenge of interacting with the articulated objects frequently found in environments built by and for humans (e.g., drawers, ovens, refrigerators, and faucets). Typically, this interaction is predefined in the form of a manipulation policy that must be (manually) specified for each object that the robot is expected to interact with. Such an approach may be reasonable for robots that interact with a small number of objects, but human environments contain a large number of diverse objects. In an effort to improve efficiency and generalizability, recent work employs visual demonstrations to learn representations that describe the motion of an object's parts in the form of kinematic models that express the rotational, prismatic, and rigid relationships between those parts. These structured object-relative models, which constrain the object's motion manifold, are suitable for trajectory controllers, provide a common representation amenable to transfer between objects, and allow for manipulation policies that are more efficient and deliberate than reactive policies (Fig. 1). However, such visual cues may be too time-consuming to provide or may not be readily available, such as when a user is remotely commanding a robot over a bandwidth-limited channel (e.g., for disaster relief). Further, reliance solely on vision makes these methods sensitive to common errors in data association, object segmentation, and tracking (e.g., tracking features over time and associating them with the correct object part) that occur as a result of clutter, occlusions, and a dearth of visual features. Consequently, most existing systems require that scenes be free of distractors and that object parts be labeled with fiducial markers.
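As a rough illustrative sketch (not the implementation of any particular system cited here), such a kinematic model can be viewed as a graph whose nodes are object parts and whose edges are labeled with the rotational, prismatic, or rigid relationship between a pair of parts; all class and parameter names below are hypothetical:

```python
from dataclasses import dataclass, field
from enum import Enum


class JointType(Enum):
    ROTATIONAL = "rotational"  # e.g., an oven door swinging on a hinge
    PRISMATIC = "prismatic"    # e.g., a drawer sliding along one axis
    RIGID = "rigid"            # two parts fixed relative to each other


@dataclass
class Joint:
    """Edge of the kinematic graph relating a pair of object parts."""
    parent: str
    child: str
    joint_type: JointType
    # Joint parameters (e.g., axis direction, origin) would be estimated
    # from demonstrations; stored here as a free-form dict for brevity.
    params: dict = field(default_factory=dict)


@dataclass
class KinematicModel:
    """Kinematic graph: object parts as nodes, inter-part joints as edges."""
    parts: set = field(default_factory=set)
    joints: list = field(default_factory=list)

    def add_joint(self, joint: Joint) -> None:
        self.parts.update({joint.parent, joint.child})
        self.joints.append(joint)


# A drawer unit: the drawer translates (prismatic) relative to the frame.
model = KinematicModel()
model.add_joint(Joint("frame", "drawer", JointType.PRISMATIC,
                      params={"axis": (1.0, 0.0, 0.0)}))
print(sorted(model.parts))  # → ['drawer', 'frame']
```

Because the model is object-relative, the same graph structure transfers across instances of a category (e.g., any two-part drawer unit), which is what makes it a convenient target representation for learning from demonstrations.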