Described embodiments provide systems and methods for capture and translation of voice commands into user interface commands and gestures. A transducer of a device, such as a microphone, may receive an audible or spoken command from a user, and the input audio may be translated into text via a speech-to-text engine, either as part of the operating system of the device or via a separate agent (which may be executed by the device or a remote server). The text may be interpreted via a natural language parser (either on the device or the remote server) to identify a command, such as scrolling, panning, zooming, or other such gestures. A context may be retrieved, such as coordinates of a cursor or other interface element within a hosted application or SaaS application, and the command may be applied based on the coordinates of the cursor.
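
By way of illustration only, the flow described above (recognized text, then command identification, then application at the cursor coordinates) may be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `Context`, `Gesture`, `parse_command`, and `to_gesture` are hypothetical, the keyword table is a simple stand-in for a natural language parser, and the speech-to-text step is assumed to have already produced the input string.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Context:
    """Hypothetical context: cursor coordinates within the application."""
    cursor_x: int
    cursor_y: int


@dataclass(frozen=True)
class Gesture:
    """Hypothetical gesture record to be dispatched to the user interface."""
    kind: str       # "scroll", "pan", or "zoom"
    direction: str  # e.g. "up", "down", "left", "right", "in", "out"
    x: int
    y: int


# Assumption: a keyword table stands in for the natural language parser.
_DIRECTIONS = {
    "scroll": ("up", "down"),
    "pan": ("left", "right", "up", "down"),
    "zoom": ("in", "out"),
}


def parse_command(text: str) -> Optional[Tuple[str, str]]:
    """Return (kind, direction) for the first known command found, else None."""
    # Lowercase and keep alphabetic words only, so punctuation is ignored.
    words = re.findall(r"[a-z]+", text.lower())
    for kind, directions in _DIRECTIONS.items():
        if kind in words:
            for direction in directions:
                if direction in words:
                    return kind, direction
    return None


def to_gesture(text: str, context: Context) -> Optional[Gesture]:
    """Combine the parsed command with the retrieved cursor coordinates."""
    parsed = parse_command(text)
    if parsed is None:
        return None
    kind, direction = parsed
    return Gesture(kind, direction, context.cursor_x, context.cursor_y)
```

For example, under these assumptions `to_gesture("Please zoom in here", Context(120, 340))` would yield a zoom-in gesture anchored at coordinates (120, 340), while text containing no recognized command would yield no gesture.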