
Code: read

Humans are said to have 5 senses in total.  Common, clearly isn't one of them.

But - jokes aside - computers all around us are growing ever more intelligent (so we hear).  They can play chess.  Drive cars.  Some might even do these things at the same time.  Without crashing (quite literally).

Likewise, we've been making steady progress equipping Rosie Patrol with some useful skills.  Skills needed to bring much needed law and order to the world.  She can move.  She can see.  She can sense.  And in our last episode, she began to read.  With a little helping hand from some (considerably) bigger computers at the mothership that is Google.

But can she really read?  You know.  Read out aloud?

True to our style, there is only one way to find out.  It's time to invoke Code: Red Read.  And take the power of reading to the next level (or page).

All superheroes need:

  • One Raspberry Pi 3, running Raspbian OS.  Connected to the Internet.
  • Computer from which you are connecting to the Raspberry Pi.  Do we have to keep reminding you about this one?
  • Yes, you'll need a speaker.  Clearly, because we want the Pi to make some noise (that we can hear).  We used one from Betron.

Already completed these missions?

You'll need to have completed the Rosie series.  Then:
  1. Lights in shining armour 
  2. I'm just angling around 
  3. Eye would like to have some I's
  4. Eh, P.I?  
  5. Lights, camera, satisfaction
  6. Beam me up, Rosie!
  7. a, b, see, d 

Your mission, should you accept it, is to:

    • Connect your speaker to the Pi.  We used the Pi's headphone jack.  Other methods of connectivity are possible.
    • Download and install the Python Google Text-to-Speech (gTTS) module.  This helpful Python module allows us to convert text into an mp3 file of it being spoken.  Pretty much the whole point of this task really.
    • Play the mp3 file back through the speaker using omxplayer
    • Write some Python code to read out aloud a Moomin book.  Other children's books are available.

    The brief:

    Text-to-Speech (TTS) is Speech Synthesis technology that allows computers to say human words - funnily enough, like humans.  The idea is that computers can interpret text, and generate audio of 'human-like' speech with the appropriate intonation.  And with the ability to speak back to us (potentially in many different languages), robots can tell us what they think.  Acknowledge our wholly irresponsible commands.  Or tease us for making them.  All human-like behaviour that we'd rather robots didn't have.

    And like most things that are useful, people far cleverer than us have already had a go at this.  A very good go, in fact.  So like with Optical Character Recognition (OCR), our best chances of success (in the time available) rely on using something that someone else has developed, using a REST API, to generate the speech from our text, on our behalf.

    Thankfully, Google comes to the rescue (again).  Because the Python Google Text-to-Speech (gTTS) module allows us to do just that.  It allows us to interact with Google's Text-to-Speech service, and usefully generate an mp3 file in Python that we can play back through the speaker.

    So here's our basic blueprint: Raspberry Pi Camera -> Google Cloud Vision API (for OCR) -> Google Text-to-Speech -> Play back using speaker.  All sounds rather plausible, doesn't it?

    The devil is in the detail:

    We helped Rosie Patrol read in our previous task, using Google Cloud Vision API and a Raspberry Pi Camera.  But we now want to take this obsession of ours further.  Specifically, we want her to read out aloud.

    And in order to do this, first of all, we need a device that produces sound.  One that allows us to play back audio from the Pi.  Thankfully, there is little actual effort required to get this working.  Raspberry Pi 3 has a number of ways it can be hooked up to some speakers: via HDMI cable, Bluetooth or headphone jack to name a few.  We have a USB-powered speaker from Betron, which can be connected to the headphone jack.  And we've decided to proceed with this simple setup.


    In short, there is a USB cable to power our little speaker.  And a connection to the headphone jack for the audio.  With the speaker set to 'aux' mode, we can play back sounds (using the omxplayer command).  Not at all complicated so far.

    Next, we install gTTS using pip3.

    sudo pip3 install gtts
    
    ...installs the Google Text-to-Speech (gTTS) library using pip3
     
    Why?  Because gTTS is the clever bit that we actually need for this mission.  It basically allows us to create audio files (in mp3), based on text that we want to have spoken.  gTTS stands for Google Text-to-Speech so the clue was actually in the name.

    Behind the scenes, gTTS is yet another API that allows us to interact with a useful Google service living somewhere out there in the Cloud.  Only this time, this library does most of the work for us (without us having to manually construct our REST API calls using the Requests module).

    We'll see how this works in a minute.  But for now, let's get it installed.


    Once installed, we can test it out.  And what better way to test it than to use IPython.  With just 4 lines of code, we can use gTTS to produce an mp3 audio file based on the contents of a string that we'd like to have spoken.  From importing the gTTS module, to creating a gTTS object (tts) from the content of the speech string, everything is fairly self-explanatory.  The very final thing we do is to save() the gTTS object to an mp3 file.  Because that's how we play it back later.

    from gtts import gTTS
    speech = "Hello, my name is Rosie Patrol. Nice to meet you!"
    tts = gTTS(text=speech, lang="en")
    tts.save("speech/speech.mp3") 
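
One thing the 4 lines quietly rely on is that the speech folder already exists, otherwise save() has nowhere to write to.  Here is a minimal sketch that creates the folder first.  The function names prepare_output_path and save_speech are our own inventions for illustration (they are not part of gTTS), and the gTTS import is tucked inside the function purely so that the folder helper can be tried out on its own.

```python
import os


def prepare_output_path(path):
    """Create the parent folder of `path` if needed, then return `path`."""
    folder = os.path.dirname(path)
    if folder:
        # exist_ok=True means an existing folder is not treated as an error
        os.makedirs(folder, exist_ok=True)
    return path


def save_speech(text, path="speech/speech.mp3", lang="en"):
    """Convert `text` to speech with gTTS and save it as an mp3 at `path`."""
    # Imported here so the helper above can be used without gTTS installed
    from gtts import gTTS

    tts = gTTS(text=text, lang=lang)
    tts.save(prepare_output_path(path))
    return path
```

Calling save_speech("Hello, my name is Rosie Patrol. Nice to meet you!") should then leave speech/speech.mp3 ready to be played, assuming the Pi can reach Google at the time.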
    


    To play the mp3 file in Raspbian OS, we'll use omxplayer.  There are other audio players out there - as you can imagine - if you decide to choose a different tool for whatever reason (we couldn't think of any).  If the speaker is correctly attached, and audio is working, you'll hear a soothing voice say the words that you've had stored in your speech string variable.

    omxplayer speech/speech.mp3
    
    ...plays the mp3 file using omxplayer

    Hello, my name is Rosie Patrol. Nice to meet you!
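
Typing the omxplayer command by hand is fine for a test, but eventually we'll want Python to do the playing.  A small sketch of one way to do it, using subprocess from the standard library.  The function names are hypothetical, and we are assuming the speaker sits on the headphone jack, hence omxplayer's "-o local" option.

```python
import subprocess


def build_player_command(mp3_path, audio_out="local"):
    """Return the omxplayer command line as a list of arguments."""
    # "-o local" asks omxplayer for the analogue (headphone jack) output;
    # "hdmi" and "both" are the other common choices
    return ["omxplayer", "-o", audio_out, mp3_path]


def play(mp3_path, audio_out="local"):
    """Play the file and wait until omxplayer has finished."""
    subprocess.run(build_player_command(mp3_path, audio_out), check=True)
```

With this in place, play("speech/speech.mp3") should behave much like running the command ourselves, and check=True raises an error if omxplayer exits unhappily.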



    Can speech be extracted from a text file instead?  Sure.  Here's a little masterpiece that has been written for this experiment.  It's about a gingerbread man.  And (spoiler alert) it doesn't end well for the little brown biscuit.

    Here's the cover, that plays no part in this experiment.


    The story has been typed in by the talented author, and content stored in a text file.  Once it's uploaded to the Pi, we're ready to use gTTS to have the Pi read these words out aloud.


    We use open() and read() to read in the content of the text file.  Once this is stored in our speech string variable, we can use gTTS to store the resulting audio as an mp3 file, just like before.  Notice that we swap each \n new-line character for a space using replace().  This is done so that we end up with one long string variable, without the last word of one line running into the first word of the next.

    from gtts import gTTS
    # Read the whole story into one string
    with open("speech/gingerbreadman.txt", "r") as file:
        # Swap new-lines for spaces so words on adjacent lines stay separate
        speech = file.read().replace("\n", " ")
    tts = gTTS(text=speech, lang="en")
    tts.save("speech/speech.mp3")
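
A slightly more defensive alternative to a single replace() is to split the text on any whitespace and join it back up with single spaces.  This is only a sketch, and flatten_text is a name we've made up, but it copes with blank lines, tabs and double spaces in one go.

```python
def flatten_text(raw):
    """Collapse new-lines, tabs and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    return " ".join(raw.split())
```

The result of flatten_text(file.read()) can be handed to gTTS exactly as the speech string was before.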
    


    We can now play the mp3 file using omxplayer, and sit back and listen to the adventures of the gingerbread man.  And how he (apparently) meets his dreadful end.



    Did we promise that Rosie Patrol would be reading pages from a Moomin picture book?  We think we did.  Once you've fully recovered from learning the fate of our biscuit hero, let's look at what we need to build to get Rosie Patrol to narrate to us the goings-on in Moomin Valley.

    The aim is to combine our gTTS code with our little Python application from before.  The one which used Google Cloud Vision API and picamera to get Rosie Patrol to recognise the text in front of the camera, using OCR.  Remember that?  Good.  Well, with a few new lines of code, we can store the result from the OCR task as a string variable, and use gTTS to produce the audio in mp3, exactly like we did before.

    ...Which means that, at whatever regular interval we think is most appropriate, a photo is taken of a page of our Moomin book using the Raspberry Pi Camera and picamera.  The image is base64 encoded and sent to Google Cloud Vision API for OCR using Requests, and we store the result as a string variable - discovered.
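
For anyone who has misplaced the code from the previous mission, here is a rough sketch of the base64-and-Requests step.  It is not necessarily how our earlier application was written.  We are assuming an API key passed as a query parameter and the TEXT_DETECTION feature, and the function names are ours.  The requests import lives inside ocr_image() only so that the two helper functions can be tried without it.

```python
import base64

VISION_URL = "https://vision.googleapis.com/v1/images:annotate"


def build_ocr_request(image_bytes):
    """Build the JSON body for a Cloud Vision TEXT_DETECTION request."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "requests": [
            {
                "image": {"content": encoded},
                "features": [{"type": "TEXT_DETECTION"}],
            }
        ]
    }


def extract_text(response_json):
    """Pull the full block of recognised text out of a Vision response."""
    annotations = response_json["responses"][0].get("textAnnotations", [])
    # The first annotation holds the whole text; later ones are single words
    return annotations[0]["description"] if annotations else ""


def ocr_image(image_bytes, api_key):
    """Send the image to Cloud Vision and return the discovered text."""
    import requests  # imported here so the helpers above work without it

    reply = requests.post(
        VISION_URL, params={"key": api_key}, json=build_ocr_request(image_bytes)
    )
    reply.raise_for_status()
    return extract_text(reply.json())
```

Something like discovered = ocr_image(photo_bytes, api_key) would then give us the string to pass on to gTTS, where photo_bytes and api_key are placeholders for the captured image and your own key.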

    Here is an example of a photo taken by the Raspberry Pi Camera.


    Finally, we use gTTS to have Google Text-to-Speech API convert our discovered string into an mp3 audio file.  And, of course, we play it back using omxplayer.  We repeat this for every page.  And with luck, we can observe Rosie Patrol appearing to read the text in front of her out aloud, albeit with a slight delay while photos are being taken, and API calls are being made to Google (for both OCR and Text-to-Speech).
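
The page-by-page routine could be sketched along these lines.  To keep it short (and possible to try without a camera), the four steps are passed in as functions: capture, ocr, speak and play are all hypothetical names standing in for the picamera, Cloud Vision, gTTS and omxplayer pieces.  The max_pages limit is our own addition, there to put a ceiling on the number of API calls made.

```python
import time


def read_pages(capture, ocr, speak, play, max_pages=10, pause_seconds=5,
               sleep=time.sleep):
    """Photograph, OCR and read aloud up to `max_pages` pages.

    Returns the list of text strings that were read out.
    """
    spoken = []
    for page in range(max_pages):
        image_bytes = capture()        # photo of the current page
        discovered = ocr(image_bytes)  # one Cloud Vision call per page
        if discovered.strip():
            mp3_path = speak(discovered)  # one Text-to-Speech call per page
            play(mp3_path)
            spoken.append(discovered)
        # Leave time to turn the page, except after the last one
        if page < max_pages - 1:
            sleep(pause_seconds)
    return spoken
```

Pages where nothing legible is discovered are skipped rather than sent to Text-to-Speech, which saves a pointless call.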


    The results can be a little hit and miss, depending on the quality of the photos and legibility of the text.  Nonetheless, our little Python application seems to trundle through our Moomin picture book relatively well.

    Of course, behind the scenes, we're using API calls to Google.  Specifically, our Google Cloud Vision API call is linked to our Google Cloud Platform account.  Don't forget to stop your script when finished, and to monitor usage of the API through the GCP Console, so that you don't exceed your quota.


    Does Rosie Patrol now read books?  She sure can!

    Information overload:

    Raspberry Pi's guide to using omxplayer to play audio:
    gTTS documentation can be found here:
