The invention introduces a system and a methodology that combine scene understanding and natural language understanding to resolve ambiguities in the user's spoken (natural) interaction with the in-vehicle virtual assistant, with a special focus on POIs (places of interest) on the road that are visible to both the user and the vehicle. The proposal assumes that the vehicle is equipped with external sensors (including scene perception software capabilities) for automated and/or autonomous driving. We address ambiguities in the user's spoken intent when interacting with the in-vehicle virtual assistant and propose to resolve them by understanding the scene outside the vehicle, leveraging the vehicle's external visual sensors. Examples of such queries are "What are the opening hours of this restaurant?" and "Is there a parking space over there?". Understanding the scene (1) enables the resolution of queries that refer not only to a particular location but also to other objects in the scene that change dynamically, such as people, cars, and objects on the street, and (2) reduces the need for an in-vehicle driver monitoring (gaze/head pose) system to infer the intent from gaze.

User intent understanding is a key challenge for conversational virtual assistants. Existing systems address this challenge with multi-modal user data, focusing specifically on gaze information. Our invention proposes to use the external sensors and scene perception capabilities that already exist in the vehicle for automated driving to understand the user's spoken interaction. We propose to fuse the real-time output of external scene understanding with the user's spoken query and thus enhance the natural language understanding ability of the in-vehicle virtual assistant.
We also show how to automatically delegate the resolution of the intent to the right module in the system (e.g., automatic park assist).

"Disclosed anonymously"

This invention is a new and non-obvious methodology that performs intent resolution in speech queries that relate to POIs using multi-modal (speech and visual) data:

1. Determines, using the NLU (natural language understanding) system, whether the query contains any reference to an external object that requires scene perception (e.g., a specific restaurant, a person on the street, a vehicle).
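
The flow described above (detect a reference to an external object, ground it in the perceived scene, then delegate to a module) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the disclosure: the keyword-based cue list stands in for a real NLU system, `SceneObject` stands in for the output of the scene perception software, the module names are hypothetical, and picking the object closest to straight ahead is just one possible disambiguation rule.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical deictic cues. A real NLU system would use a trained model;
# this toy pattern also fires on non-deictic uses of "that".
DEICTIC_PATTERN = re.compile(r"\b(this|that|these|those|over there)\b")

# Hypothetical mapping from object category to the module that handles it.
MODULE_FOR_CATEGORY = {
    "restaurant": "poi_information",
    "parking space": "automatic_park_assist",
}


@dataclass
class SceneObject:
    """One object reported by the (assumed) scene perception stack."""
    category: str       # e.g. "restaurant"
    label: str          # illustrative identifier
    bearing_deg: float  # angle relative to the vehicle heading, 0 = ahead


def needs_scene_perception(query: str) -> bool:
    """Return True if the query contains a deictic cue."""
    return DEICTIC_PATTERN.search(query.lower()) is not None


def resolve(query: str, scene: List[SceneObject]) -> Optional[Tuple[SceneObject, str]]:
    """Ground the query in the scene and pick a module to delegate to.

    Returns None when the query needs no scene perception or when no
    detected object matches a category mentioned in the query.
    """
    if not needs_scene_perception(query):
        return None
    text = query.lower()
    candidates = [obj for obj in scene if obj.category in text]
    if not candidates:
        return None
    # Assumed rule: prefer the object closest to straight ahead.
    target = min(candidates, key=lambda obj: abs(obj.bearing_deg))
    module = MODULE_FOR_CATEGORY.get(target.category, "virtual_assistant")
    return target, module


if __name__ == "__main__":
    scene = [
        SceneObject("restaurant", "Trattoria A", 25.0),
        SceneObject("restaurant", "Bistro B", -5.0),
        SceneObject("parking space", "slot 1", 40.0),
    ]
    for q in ("What are the opening hours of this restaurant?",
              "Is there a parking space over there?",
              "What is the weather tomorrow?"):
        print(q, "->", resolve(q, scene))
```

In this sketch a query without a deictic cue falls through to the ordinary assistant path (`None`), so scene perception is consulted only when the utterance appears to point at something outside the vehicle.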