Seeing with Sound: Empowering the Visually Impaired with GPT-4V(ision) and Text-to-Speech


Towards Data Science

This post was co-authored with Rafael Guedes.

OpenAI’s latest developments have taken AI’s usability to a whole different level with the release of the GPT-4V(ision) and Text-to-Speech (TTS) APIs. Why? Let’s motivate their usefulness with a use case. Walking down the street is a simple task for most of us, but for those with visual impairments, every step can be a challenge. Traditional aids like guide dogs and canes have been useful, but the integration of AI technologies opens up a new chapter in improving the independence and mobility of the blind community. Simple glasses equipped with a discreet camera could be enough to change how the visually impaired experience their surroundings. We will explain how such a system can be built using the latest releases from OpenAI.

Another interesting use case is changing how we experience museums and similar venues. Imagine that the audio guide systems commonly found in museums are replaced by a discreet camera pinned to your shirt. Suppose you are visiting an art museum. As you walk through it, this technology can tell you about each painting, in a style of your choosing. If you are a bit tired and want something engaging and lightweight, you could prompt it with ‘Give me some historical context on the painting but make it engaging and fun, you can even add some jokes to it’.

What about Augmented Reality (AR)? Can this new technology improve or even replace it? Right now, AR is seen as a digital layer overlaid on our visual perception of the real world. The problem is that this layer can quickly become noisy. These new technologies could replace AR in some use cases. In others, they could make AR personalized, so that each of us can experience the world at our own pace.

In this post, we will explore how to combine GPT-4V(ision) and Text-to-Speech to make the world more inclusive and navigable for the visually impaired. We will start by explaining the architecture of GPT-4V(ision) and how it works (we will use some open-source counterparts to get the intuition since…

