But - jokes aside - computers all around us are growing ever more intelligent (so we hear). They can play chess. Drive cars. Some might even do these things at the same time. Without crashing (quite literally).
Likewise, we've been making steady progress equipping Rosie Patrol with some useful skills. Skills needed to bring much needed law and order to the world. She can move. She can see. She can sense. And in our last episode, she began to read. With a little helping hand from some (considerably) bigger computers at the mothership that is Google.
But can she really read? You know. Read out aloud?
True to our style, there is only one way to find out. It's time to invoke Code:
All superheroes need:
- One Raspberry Pi 3, running Raspbian OS. Connected to the Internet.
- Computer from which you are connecting to the Raspberry Pi. Do we have to keep reminding you about this one?
- Yes, you'll need a speaker. Clearly, because we want the Pi to make some noise (that we can hear). We used one from Betron.
Already completed these missions? You'll need to have completed the Rosie series. Then:
- Lights in shining armour
- I'm just angling around
- Eye would like to have some I's
- Eh, P.I?
- Lights, camera, satisfaction
- Beam me up, Rosie!
- a, b, see, d
Your mission, should you accept it, is to:
- Connect your speaker to the Pi. We used the Pi's headphone jack. Other methods of connectivity are possible.
- Download and install the Python Google Text-to-Speech (gTTS) module. This helpful Python module allows us to convert text into an mp3 file of it being spoken. Pretty much the whole point of this task really.
- Play the mp3 file back through the speaker using omxplayer
- Write some Python code to read out aloud a Moomin book. Other children's books are available.
The brief: Text-to-Speech (TTS) is Speech Synthesis technology that allows computers to say human words - funnily enough, like humans. The idea is that computers can interpret text, and generate audio of 'human-like' speech with the appropriate intonation. And with the ability to speak back to us (potentially in many different languages), robots can tell us what they think. Acknowledge our wholly irresponsible commands. Or tease us for making them. All human-like behaviour that we like robots to
And like most things that are useful, people far cleverer than us have already had a go at this. A very good go, in fact. So like with Optical Character Recognition (OCR), our best chances of success (in the time available) rely on using something that someone else has developed, using a REST API, to generate the speech from our text, on our behalf.
Thankfully, Google comes to the rescue (again). Because the Python Google Text-to-Speech (gTTS) module allows us to do just that. It allows us to interact with Google's Text-to-Speech service, and usefully generate an mp3 file in Python that we can play back through the speaker.
So here's our basic blueprint: Raspberry Pi Camera -> Google Cloud Vision API (for OCR) -> Google Text-to-Speech -> Play back using speaker. All sounds rather plausible, doesn't it?
The devil is in the detail: We helped Rosie Patrol read in our previous task, using Google Cloud Vision API and a Raspberry Pi Camera. But we now want to take this obsession of ours further. Specifically, we want her to read out aloud.
And in order to do this, first of all, we need a device that produces sound. One that allows us to play back audio from the Pi. Thankfully, there is little actual work required to get this to work. Raspberry Pi 3 has a number of ways it can be hooked up to some speakers: via HDMI cable, Bluetooth or headphone jack to name a few. We have a USB-powered speaker from Betron, which can be connected to the headphone jack. And we've decided to proceed with this simple setup.
In short, there is a USB cable to power our little speaker. And a connection to the headphone jack for the audio. With the speaker set to 'aux' mode, we can play back sounds (using the omxplayer command). Not at all complicated so far.
Next, we install gTTS using pip3.
    sudo pip3 install gtts
...installs the Google Text-to-Speech (gTTS) library using pip3
Why? Because gTTS is the clever bit that we actually need for this mission. It basically allows us to create audio files (in mp3), based on text that we want to have spoken. gTTS stands for Google Text-to-Speech so the clue was actually in the name.
Behind the scenes, gTTS is yet another API that allows us to interact with a useful Google service living somewhere out there in the Cloud. Only this time, this library does most of the work for us (without us having to manually construct our REST API calls using the Requests module).
We'll see how this works in a minute. But for now, let's get it installed.
Once installed, we can test it out. And what better way to test it than to use IPython. With just four lines of code, we can use gTTS to produce an mp3 audio file based on the contents of a string that we'd like to have spoken. From importing the gTTS module, to creating a gTTS object (tts) from the content of the speech string, everything is fairly self-explanatory. The very final thing we do is to save() the gTTS object to an mp3 file. Because that's how we play it back later.
    from gtts import gTTS

    speech = "Hello, my name is Rosie Patrol. Nice to meet you!"
    tts = gTTS(text=speech, lang="en")
    tts.save("speech/speech.mp3")
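One small catch with the snippet above: save() writes to speech/speech.mp3, so the speech folder has to exist first. Here is a minimal sketch of a helper that creates the folder before saving. The name save_speech and its synthesizer parameter are our own inventions for illustration (not part of gTTS); the parameter simply lets us swap in a stand-in when there is no Internet connection.

```python
import os


def save_speech(text, path, lang="en", synthesizer=None):
    """Convert text to an mp3 file at path, creating the folder if needed.

    synthesizer is a hypothetical hook: any callable that accepts
    text=... and lang=... and returns an object with a save(path) method.
    When it is left as None, we fall back to gTTS itself.
    """
    if not text.strip():
        raise ValueError("Nothing to say: the text is empty")
    if synthesizer is None:
        # Imported lazily, because gTTS needs to be installed (and online)
        from gtts import gTTS
        synthesizer = gTTS
    folder = os.path.dirname(path)
    if folder:
        # Create speech/ (or whichever folder was asked for) if it is missing
        os.makedirs(folder, exist_ok=True)
    synthesizer(text=text, lang=lang).save(path)
    return path
```

Calling save_speech("Hello, my name is Rosie Patrol. Nice to meet you!", "speech/speech.mp3") should then behave like the four lines above, minus the missing-folder surprise.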
To play the mp3 file in Raspbian OS, we'll use omxplayer. There are other audio players out there - as you can imagine - if you decide to choose a different tool for whatever reason (we couldn't think of any). If the speaker is correctly attached, and audio is working, you'll hear a soothing voice say the words that you've had stored in your speech string variable.
    omxplayer speech/speech.mp3
...plays the mp3 file using omxplayer
Hello, my name is Rosie Patrol. Nice to meet you!
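Since the rest of this mission is in Python, it may be handy to launch omxplayer from Python too. Below is a rough sketch using the standard subprocess module. The function names player_command and play are ours, and we are assuming the speaker is on the headphone jack, which is what omxplayer's -o local option selects (use hdmi instead if your sound goes over HDMI).

```python
import shutil
import subprocess


def player_command(mp3_path, output="local"):
    """Build the omxplayer command line; 'local' means the headphone jack."""
    return ["omxplayer", "-o", output, mp3_path]


def play(mp3_path, output="local"):
    """Play an mp3 file and wait until omxplayer has finished."""
    if shutil.which("omxplayer") is None:
        raise RuntimeError("omxplayer is not installed on this machine")
    # check=True raises an error if omxplayer exits with a failure code
    subprocess.run(player_command(mp3_path, output), check=True)
```

With this in place, play("speech/speech.mp3") does the same job as typing the command by hand.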
Can speech be extracted from a text file instead? Sure. Here's a little masterpiece that has been written for this experiment. It's about a gingerbread man. And (spoiler alert) it doesn't end well for the little brown biscuit.
Here's the cover, that plays no part in this experiment.
The story has been typed in by the talented author, and content stored in a text file. Once it's uploaded to the Pi, we're ready to use gTTS to have the Pi read these words out aloud.
We use open() and read() to read in the content of the text file. Once this is stored in our speech string variable, we can use gTTS to store the resulting audio as an mp3 file, just like before. Notice that we swap each \n new-line character for a space using replace(). This is done so that we end up with one long string variable, without the last word of one line running into the first word of the next.
    from gtts import gTTS

    with open("speech/gingerbreadman.txt", "r") as file:
        # Swap new-lines for spaces so words on adjacent lines stay separate
        speech = file.read().replace("\n", " ")

    tts = gTTS(text=speech, lang="en")
    tts.save("speech/speech.mp3")
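If the text file contains blank lines between paragraphs, tabs, or Windows-style line endings, a single replace() may not tidy everything up. As a slightly more defensive sketch, we could collapse every run of whitespace into one space before handing the text to gTTS. The helper name load_story is hypothetical, and we are assuming the file is saved as UTF-8.

```python
def load_story(path):
    """Read a text file and return its words joined by single spaces."""
    with open(path, "r", encoding="utf-8") as file:
        # split() with no arguments breaks on any whitespace (spaces, tabs,
        # new-lines) and drops empty pieces; join() glues the words back
        return " ".join(file.read().split())
```

The resulting string can be passed to gTTS(text=..., lang="en") exactly as before.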
We can now play the mp3 file using omxplayer, and sit back and listen to the adventures of the gingerbread man. And how he (apparently) meets his dreadful end.
Did we promise that Rosie Patrol would be reading pages from a Moomin picture book? We think we did. Once you've fully recovered from learning the fate of our biscuit hero, let's look at what we need to build to get Rosie Patrol to narrate to us the goings-on in Moomin Valley.
The aim is to combine our gTTS code with our little Python application from before. The one which used Google Cloud Vision API and picamera to get Rosie Patrol to recognise the text in front of the camera, using OCR. Remember that? Good. Well, with a few new lines of code, we can store the result from the OCR task as a string variable, and use gTTS to produce the audio in mp3, exactly like we did before.
...Which means that, at whatever intervals we think most appropriate, photos are taken of our Moomin book's pages using the Raspberry Pi Camera and picamera. Each image is base64 encoded and sent to Google Cloud Vision API for OCR using Requests, and we store the result as a string variable - discovered.
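The application from the previous mission isn't reproduced here, so what follows is only a sketch of the two fiddly bits: wrapping the base64 encoded photo in the JSON body that the Cloud Vision images:annotate endpoint expects, and pulling the recognised text back out of the reply. The function names build_ocr_request and extract_text are our own, and we are assuming the TEXT_DETECTION feature is the one being used.

```python
import base64


def build_ocr_request(image_bytes):
    """Wrap raw image bytes in a Cloud Vision text detection request body."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "requests": [{
            "image": {"content": encoded},
            "features": [{"type": "TEXT_DETECTION"}],
        }]
    }


def extract_text(response_json):
    """Return the full block of detected text, or "" if nothing was found."""
    responses = response_json.get("responses", [])
    if not responses:
        return ""
    annotations = responses[0].get("textAnnotations", [])
    if not annotations:
        return ""
    # The first annotation holds the whole text; later ones are single words
    return annotations[0].get("description", "")
```

The dictionary from build_ocr_request() is what would be sent as JSON using Requests, and discovered would be whatever extract_text() returns for the parsed reply.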
Here is an example of a photo taken by the Raspberry Pi Camera.
Finally, we use gTTS to have Google Text-to-Speech API convert our discovered string into an mp3 audio file. And, of course, we play it back using omxplayer. We repeat this for every page. And with luck, we can observe Rosie Patrol appearing to read the text in front of her out aloud, albeit with a slight delay while photos are being taken, and API calls are being made to Google (for both OCR and Text-to-Speech).
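To make the page-by-page loop concrete, here is one way it could be laid out. This is a sketch rather than our actual application: read_book and its four parameters (capture, recognise, speak, play) are hypothetical names, each standing in for one stage of the blueprint (camera, OCR, gTTS, omxplayer), passed in as plain Python functions.

```python
import time


def read_book(capture, recognise, speak, play, pages, pause=0):
    """Photograph, recognise and read out aloud a number of pages.

    capture()         -> image bytes from the camera
    recognise(image)  -> text found in the image (may be empty)
    speak(text)       -> path of the mp3 file that was produced
    play(mp3_path)    -> plays the file through the speaker
    """
    spoken = []
    for _ in range(pages):
        image = capture()
        # Collapse new-lines and extra spaces into one tidy string
        discovered = " ".join(recognise(image).split())
        if not discovered:
            # Nothing legible on this page, so skip the Text-to-Speech call
            continue
        play(speak(discovered))
        spoken.append(discovered)
        # Give someone time to turn the page before the next photo
        time.sleep(pause)
    return spoken
```

Keeping each stage as a separate function means the loop can be tried out with pretend stages first, before any photos are taken or API calls are made.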
The results can be a little hit and miss, depending on the quality of the photos and legibility of the text. Nonetheless, our little Python application seems to trundle through our Moomin picture book relatively well.
Of course, behind the scenes, we're using API calls to Google. Specifically, our Google Cloud Vision API call is linked to our Google Cloud Platform account. Don't forget to stop your script when finished, and to monitor usage of the API through the GCP Console, so that you don't exceed your quota.
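As a belt-and-braces measure, the script itself could refuse to make more than a set number of API calls per run. The sketch below is only a local counter, under the assumption that we call spend() once before each request to Google. The class name CallBudget is made up for this example, and it is no substitute for checking the GCP Console.

```python
class CallBudget:
    """Count API calls and stop the script once a limit is reached."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def spend(self):
        """Record one call and return how many calls remain."""
        if self.used >= self.limit:
            raise RuntimeError("API call budget used up for this run")
        self.used += 1
        return self.limit - self.used
```

For example, creating budget = CallBudget(50) at the start and calling budget.spend() before each OCR request would halt a forgotten script after fifty pages.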
Does Rosie Patrol now read books? She sure can!