Apple’s Siri and Microsoft’s Cortana are well-known implementations of so-called “Intelligent Personal Assistants” (IPAs), giving users voice interaction with a more or less “intelligent” system capable of delivering information about the weather, common facts or traffic jams.

But both Siri and Cortana are proprietary “black box” systems, so many of you folks (including me) may think: Meh.

I personally love to build my own stuff (especially when it comes to a system that I want to “trust”), so I thought about building my own IPA system. A truly “intelligent” system is way too complex for a single person to implement (especially for a part-time AI expert like me), so I want to focus on intuitive voice commands for interaction. Therefore, I will not refer to the system as intelligent but as interactive.

How cool would an IPA be that is not tied to a single device (like Siri), but runs on different systems (a Raspberry Pi, your phone, or maybe soon your car)? A system that is modular and allows easy customization of its features?

I decided to start working on such a system (working title: butler.js), and came up with three basic requirements:

  • Implemented as a web application (platform independence)
  • Voice interaction (input & output)
  • Modularity & customizability as architecture principles (each module implements a single feature, possibly backed by a third-party system, e.g. a reminder, news, translation, etc.)
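
To make the modularity requirement a bit more tangible, here is a minimal sketch of what a module registry could look like. This is not the actual butler.js design; the `Butler` class, the module shape (`name`, `pattern`, `handle`) and the reminder regex are all assumptions made up for illustration.

```javascript
// Hypothetical sketch: every name below is invented for illustration.
class Butler {
  constructor() {
    this.modules = [];
  }

  // A module is { name, pattern (RegExp), handle(match) -> reply string }.
  register(module) {
    this.modules.push(module);
  }

  // Returns the reply of the first module whose pattern matches the
  // utterance, or a fallback sentence if no module feels responsible.
  respond(utterance) {
    const text = utterance.trim();
    for (const module of this.modules) {
      const match = module.pattern.exec(text);
      if (match) {
        return module.handle(match);
      }
    }
    return "Sorry, I did not understand that.";
  }
}

// One feature per module: this one only knows about reminders.
const reminderModule = {
  name: "reminder",
  pattern: /^i want to (.+) at (\d{1,2}(?::\d{2})?\s?(?:am|pm))$/i,
  handle: (match) => `Okay, I will remind you to ${match[1]} at ${match[2]}.`,
};

const butler = new Butler();
butler.register(reminderModule);
console.log(butler.respond("I want to call Nik at 4pm"));
```

Adding a feature would then mean registering one more object, without touching the core. A real module would of course do more than build a reply string, for example schedule the reminder.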

After a first review of the state of the art in speech recognition, I was pretty impressed by annyang!, a JS library for speech-to-text that builds on the browser’s SpeechRecognition interface (unfortunately, the only browser supporting it is Google’s “black box” Chrome). I also discovered ResponsiveVoiceJS (free for non-commercial use) as a reliable technology for voice output (text-to-speech).
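
As a rough idea of how the two libraries could be wired together, here is a small sketch. It only uses `annyang.addCommands`, `annyang.start` and `responsiveVoice.speak`; the `startButler` function, the command phrases and the replies are assumptions of mine. The two library objects are passed in as parameters so the wiring does not depend on browser globals.

```javascript
// Sketch only: `annyang` and `responsiveVoice` are the objects the two
// libraries expose in the browser. startButler itself is a made-up helper.
function startButler(annyang, responsiveVoice) {
  // annyang's own examples guard with `if (annyang)`, since the object is
  // not usable in browsers without speech recognition support.
  if (!annyang) {
    return false;
  }

  const commands = {
    "hello butler": () => responsiveVoice.speak("Hello, how can I help?"),
    // "*term" is an annyang splat: it captures the rest of the phrase
    // and hands it to the callback as an argument.
    "tell me something about *term": (term) =>
      responsiveVoice.speak(`I will look up ${term} for you.`),
  };

  annyang.addCommands(commands);
  annyang.start();
  return true;
}

// In a page that loads both libraries via <script> tags:
// startButler(window.annyang, window.responsiveVoice);
```

The return value lets the page show a hint when voice input is not available instead of failing silently.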

Here are some cool things that came to my mind; things that I want the assistant to be capable of:

  • Reading me the news (“What happened today?”)
  • Giving me information about routes & navigation (“How long do I need to get to Moe’s house?”)
  • Answering my questions about common facts (“Tell me something about World War II.”)
  • (FYI: I’m German) Translating English words for me (“What does ‘sophisticated’ mean in German?”)
  • Reminding me of things that I tell it (“I want to call Nik at 4pm.”)
  • Controlling my Spotify (“Play my ‘Classic Rock’ playlist”)

Using modern REST/Web APIs, most of those features can be implemented with little effort (especially when using JavaScript, because most of today’s web APIs “talk” JSON by default).
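
To illustrate why JSON-speaking APIs fit so well, here is a small sketch of a news module that turns a JSON response into a sentence for voice output. The endpoint `https://news.example.com/api/headlines`, its query parameters and the response shape `{ articles: [{ title }] }` are purely hypothetical; a real news API will look different.

```javascript
// Hypothetical endpoint and response shape, for illustration only.
const NEWS_URL = "https://news.example.com/api/headlines";

// Build the request URL with a properly encoded query string.
function buildNewsUrl(topic, limit) {
  const params = new URLSearchParams({ topic, limit: String(limit) });
  return `${NEWS_URL}?${params.toString()}`;
}

// Turn the (assumed) JSON payload into one sentence that can be spoken.
function headlinesToSpeech(payload) {
  const titles = (payload.articles || []).map((article) => article.title);
  if (titles.length === 0) {
    return "I could not find any news.";
  }
  return `Here are today's headlines: ${titles.join(". ")}.`;
}

// fetchImpl defaults to the global fetch; passing it in keeps the function
// usable in environments without one.
async function readNews(topic, fetchImpl = fetch) {
  const response = await fetchImpl(buildNewsUrl(topic, 3));
  if (!response.ok) {
    throw new Error(`News request failed with status ${response.status}`);
  }
  // The parsed JSON is already a plain JavaScript object, no mapping needed.
  return headlinesToSpeech(await response.json());
}
```

The string returned by `readNews` could then be handed straight to the text-to-speech layer.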


To be continued.

(I will push my contributions to this public GitHub repository. Feel free to check it out.)
