
a, b, see, d



Here's a deeply philosophical conundrum that we've been struggling with in between Rosie Patrol's crime-fighting escapades:

Do robots need to be able to read their own instruction manuals?  And if so, do they need to read an instruction manual first in order to learn how to read one?

We'll be honest.  We're not quite sure what the answer should be.  Besides, the question is actually quite silly.  And probably pointless.  But wouldn't it be handy if your robot superhero could read human text?  Because, then, it could read danger signs posted outside alligator pens (where evil masterminds apparently hang out).  After all, Pattern Recognition and Computer Vision are all the rage these days.  It helps computers make sense of the world around them, and possibly, makes them seem more intelligent.  Well, more intelligent than us humans, supposedly.

And that's not hard.

So without further delay, let's get Rosie Patrol to read some signs on her own.  Well - as it turns out - with a helping hand from a little outfit called Google.

All superheroes need:

  • One Raspberry Pi 3, running Raspbian OS.  Internet connectivity is a must.
  • Computer from which you are connecting to the Raspberry Pi
  • A Google Cloud Platform account.  We'll find out why, later.

Already completed these missions?

You'll need to have completed the Rosie series.  Then:
  1. Lights in shining armour 
  2. I'm just angling around 
  3. Eye would like to have some I's
  4. Eh, P.I?  
  5. Lights, camera, satisfaction
  6. Beam me up, Rosie!

Your mission, should you accept it, is to:

  • Set up a Google Cloud Platform (GCP) account (if you haven't got one already)
  • Create a Project in the GCP Console.  Enable Google Cloud Vision API.  Create an API key.
  • Test the API, using Python's Requests module
  • Do something useful with the API (if you can think of anything)

The brief:

Simply put, the task here is to get our Python code to identify text found in images.  Images taken by the Pi using a camera, that is.  Sounds fun.  Where do we sign up?  As it happens, we sign up with the Google Cloud Platform, but let's first explore the other options out there.

Recognising text in images could be accomplished in a number of different ways.  You could:
  • Build your own Machine Learning algorithm to do Optical Character Recognition (OCR), using Python machine learning tools such as scikit-learn.  You could develop your own algorithm, use it to train your computer using training data, and test your algorithm's effectiveness, over and over (and over...) again.   Probably not the best use of your time, unless you want to be a data scientist.
  • Use specialist OCR tools, such as Tesseract OCR.  Less fiddly compared to the first option.  But still relatively time consuming.  And why bother, if you can...
  • Be very lazy.  Use publicly available APIs to get someone else to recognise text in your images for you.
The decision for us in the end was quite simple.  We have a limited amount of time.  We don't have PhDs (but - rather selfishly - would like to use systems developed by people who do).  Besides, we only wanted to prove very basic character recognition, using deliberately legible text, before swiftly moving onto our next assignment.

And that's why we looked towards the Cloud, and at the number of API-based OCR services that familiar Internet giants offer, to get the job done. 

The devil is in the detail:

In the end we decided to use Google Cloud Vision API.  Why?  Because it looks rather easy to use according to its documentation.  It does OCR.  And - at the time of writing - it's free to use up to 1000 units a month according to its pricing.  Microsoft Azure appears to have a similar service.  So does IBM Watson Visual Recognition (although no OCR capability as far as we can tell).  It's always best to check that the Cloud-based provider you choose is appropriate for your project.  And that you won't be hit with a hefty charge at the end of the month.  Cue the first scary-looking warning.


Always check the small print of your chosen 'Cloud-based' service.  Most have a very generous trial scheme that allows you to use it for free for very low volumes, or for a fixed duration.  Always set a billing alert so that you don't get sent an unexpected invoice at the end of the month.


Still with us?  Great. Let's proceed with this computer vision malarkey.

First things first, you'll need to set up a Google Cloud Platform account (if you haven't got one already).  It's best to refer to the most up-to-date instructions on how to create yourself an account, and what's available to use for free (and for how long).  All, of course, on the Google Cloud Platform website.

Once you have yourself an account, you're able to navigate to its Console.  From there, you'll want to create yourself a Project.  We found a 'Select a project' menu somewhere in the top navigation bar, from which we were able to create a new Project.  Give it a meaningful name, so you don't forget what it is.


After a while, your project will emerge in the list of Projects.

You can now 'open' this Project, and view its details.  But you won't find anything interesting in there. Not just yet.  Because nothing has been set up.

What we want is the ability to use Google Cloud Vision API.  This means that we need to enable it.  Go to the 'APIs and services' Dashboard in your Project. 


'Enable APIs and services' looks like a good bet.  Select it, then search for the 'Google Cloud Vision API', and finally, enable it.

Now, we need an API key for when we fire off our REST API request at Google.  It's how they know it's us.  From Credentials, you can create one and give it a memorable name.


The newly created API key should now appear under Credentials.  Don't try and remember it.  It's meant to be complex for a reason.


It's not often we flash two warnings in a post.  It must be your (un)lucky day!

PINs.  Passwords.  Keys to your granny's house.  What do they all have in common?  That's right.  You never share them with anyone.  You can add API keys to that list as well.  These keys are not for sharing, and in the wrong hands, they can be used to access services that are enabled in your account without your knowledge, or permission.  Bad.  Very bad.

Also, never store your API keys directly in your Python code.  Remember - quite often - code ends up getting shared (for good reason).  You don't want your sensitive keys being shared around at the same time.


Believe it or not, we are now all ready to go.

The official guide here tells us how to use the REST API.  There are two important bits of information.

Firstly, the URL to which we send our HTTP POST requests is mentioned:

https://vision.googleapis.com/v1/images:annotate?key=YOUR_API_KEY

Through the wonders of the Internet, our request will reach a bored Google server somewhere, eager to respond to it.  By the way, did you notice the YOUR_API_KEY string at the end?  This is where we inject our very own API key we generated earlier in the Console.  This is how Google identifies that it is us (more specifically in our case, rosie-02) sending it our request.  Can you see how easy it would be to pretend to be someone else if you had their key?

But... what data do we include in our request?  What about the image we want analysed?  The API documentation tells us that Google is expecting JSON data in our POST request.  And it needs to be formed like this:

{
  "requests": [
    {
      "image": {
        "content": "/9j/7QBEUGhvdG9zaG9...base64-encoded-image-content...fXNWzvDEeYxxxzj/Coa6Bax//Z"
      },
      "features": [
        {
          "type": "TEXT_DETECTION"
        }
      ]
    }
  ]
}

We played around with JSON before, using the Requests module, so a lot of this should look vaguely familiar.  The bit that interests us the most for this task is the content key, where we need to inject our very own image taken by Rosie Patrol.  The image file needs to be base64-encoded, which can be done using a simple function in Python.
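
It's also worth knowing roughly what comes back.  The response is JSON too, and - trimmed right down, with bounding boxes and page breakdowns removed - it looks something like this for an image containing the word STOP (an abbreviated illustration, not the full response):

{
  "responses": [
    {
      "textAnnotations": [
        {
          "locale": "en",
          "description": "STOP\n"
        }
      ],
      "fullTextAnnotation": {
        "text": "STOP\n"
      }
    }
  ]
}

The bit we'll be fishing out later is the text value inside fullTextAnnotation.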

We believe we now know just about enough about what Google is expecting, so let's fire up Python.

Firstly, we'll set up a private function - _construct_google_vision_json() - to form a skeleton of our JSON data as a dictionary data structure.  We'll leave the content key empty to start off with, but will populate it with the result from our _encode_image() function.  Throughout, image is a string with the location of Rosie Patrol's latest .jpg photo.

def _construct_google_vision_json(image=None):
    # Skeleton of the JSON data that Google Cloud Vision API expects,
    # built as a Python dictionary
    data = {
        "requests": [
            {
                "image": {
                    "content": ""
                },
                "features": [
                    {
                        "type": "TEXT_DETECTION"
                    }
                ]
            }
        ]
    }
    # Populate the content key with the base64-encoded image
    data["requests"][0]["image"]["content"] = _encode_image(image)
    return data

The _encode_image() private function doesn't contain much: it reads in the content of the image file and uses the base64.b64encode() function to encode it in base64 bytes format.  The result needs to be formatted into a string before sending via the API, hence the decode() function cheekily appended to the end.  Note that base64 is a module that needs to be imported.

def _encode_image(image=None):
    # Read the image file and return its content base64-encoded, as a string
    with open(image, "rb") as file:
        return base64.b64encode(file.read()).decode('UTF-8')

We'll already be familiar with using the Requests and json modules.  Here's a function we'll use to send our POST requests with the JSON data.

def _post_json_request(url=None, data=None):
    # Send the dictionary as JSON in an HTTP POST request
    head = {'Content-Type': 'application/json'}
    try:
        return requests.post(url, data=json.dumps(data), headers=head)
    except requests.exceptions.RequestException as e:
        print("ROSIE: Failed to connect to", url)
        print(e)

The final(-ish) function we have fires off our HTTP POST request, and interprets the result that comes back.  In it, we check that the image file exists, then launch an HTTP POST request using our JSON data.  If the request was received successfully (status_code of 200), we look at the specific key / value that contains Google Cloud Vision API's conclusion on what the word is.  Notice that we also have a little function called _tidy_response(), as we wanted to remove any new line characters and spaces.  This might not be appropriate for you if you are analysing sentences, or even paragraphs.

def _find_text_in_image(image=None, url=None, token=None):
    # Check that the image file actually exists before doing anything else
    if not path.exists(image):
        print("File", image, "does not exist")
        sys.exit()
    # Fire off the HTTP POST request, with our API key appended to the URL
    r = _post_json_request(url + token, _construct_google_vision_json(image))
    # Guard against a failed request (in which case r will be None)
    if r is not None and r.status_code == 200:
        if not len(r.json()["responses"][0]) == 0:
            discovered_text = _tidy_response(r.json()["responses"][0]["fullTextAnnotation"]["text"])
            return discovered_text
        else:
            return None
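
We haven't listed _tidy_response() above, but it doesn't need to be anything clever.  Something along these lines does the job (a minimal sketch; adapt it if you care about preserving the spaces between words):

def _tidy_response(text=None):
    # Strip out new line characters and spaces, leaving just the word itself
    return text.replace("\n", "").replace(" ", "")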

Let's say that the latest picture that Rosie Patrol has captured is always named latest.jpg (and found in the /capture sub-directory, relative to the current location).  We've set up the following constants:

# Location of the latest photo taken by Rosie Patrol
SOURCE_IMAGE = "capture/latest.jpg"
GOOGLE_VISION_API = "https://vision.googleapis.com/v1/images:annotate?key="
# Read the API key from an environment variable, never from the code itself
if "ROSIE_GOOGLE_API_TOKEN" in environ:
    GOOGLE_VISION_TOKEN = environ["ROSIE_GOOGLE_API_TOKEN"]
else:
    print("Environment variable not found")
    sys.exit()
ROSIE_CONTROL_API = "http://rosie-01:5000/api/v1/control"

Do you remember that we never store API tokens (or any other type of password) in our code?  What we are doing here with the GOOGLE_VISION_TOKEN constant is reading the value from a Raspbian OS (Linux) environment variable.  To store our API key as an environment variable called ROSIE_GOOGLE_API_TOKEN in Linux, do this:

export ROSIE_GOOGLE_API_TOKEN='your_secret_key'
env | grep ROSIE_GOOGLE_API_TOKEN 
The export command sets the environment variable (for the current shell session, at least), while env | grep lets you check afterwards that it's set.


Our entire code, which we called rosie-02-text-detect1.py, now looks like this.

Every time it is run, provided that Rosie Patrol has produced a photo called latest.jpg in our /capture sub-directory (relative to the current path), it returns us a string with the text found by Google Cloud Vision API.
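
In outline, rosie-02-text-detect1.py simply pulls together the imports, constants and functions above, along these lines (a rough sketch of the structure rather than the full listing - the print at the end is just one way of using the returned string):

import base64
import json
import sys
from os import environ, path

import requests

# ... the constants and the functions shown above go here:
# SOURCE_IMAGE, GOOGLE_VISION_API, GOOGLE_VISION_TOKEN, ROSIE_CONTROL_API,
# _encode_image(), _construct_google_vision_json(), _post_json_request(),
# _tidy_response() and _find_text_in_image() ...

if __name__ == "__main__":
    discovered_text = _find_text_in_image(SOURCE_IMAGE, GOOGLE_VISION_API, GOOGLE_VISION_TOKEN)
    print("ROSIE: Text found in image:", discovered_text)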

OK.  This is all very interesting.  But how does Rosie Patrol get to use any of this?  Do we have proof that it works?

We've already learnt how to take pictures using a Raspberry Pi Camera and Python, and to store them in the /capture directory.  The key decision is, therefore, how often we take these photos and send them via REST API for analysis.  Doing this every second would result in 60 API requests a minute.  Probably enough to reach your quota sharpish if you like to spend a lot of time with your robot roaming about.  However, doing this less frequently, or even on-demand, will make it less responsive to the world around it.  In other words, it might be a while before your robot notices the warning sign on the alligator pen.  Too late, in fact.  Snap!

Let's therefore move onto our very scientific experiment.  We asked Rosie Patrol to take photos of two different words printed on A4 paper: START and STOP.  They are in a ridiculously large font, and in black and white.

Here's a totally unnecessary photo of Rosie Patrol looking at the printouts, which have been stuck to an equally plain wall.  If this character recognition experiment were to be compared to a penalty shoot-out in football, we're standing a metre away from the goal, with the goalkeeper away on a lunch break.  Yes, failing this task would simply be embarrassing.


We use Python to take a photo using the Raspberry Pi Camera.  As per before, the file appears in our /capture directory.  We also make sure it's called latest.jpg.  Let's download it from the Pi, and take a look at what it is that Rosie took a photo of, while we were taking a photo of her taking the photo.


Apart from the nasty shadow of Rosie Patrol's massive head, the text in the photo is perfectly clear to us humans.  There isn't a lot of visual noise either (like a sudden burst of wedding confetti), and you'd hope that any half decent character recognition tool or service would identify this text.

Let's run rosie-02-text-detect1.py and see if Google Cloud Vision API can spot the text in the image.  Off goes our REST API request into the murky world of the Internet, and back comes Google's response.


Good news.  Google, sitting comfortably somewhere amongst the Clouds, has reported back to Rosie Patrol.  And its response is exactly what the word states in our printout - 'STOP'.  This is promising.

By the way, checking GCP's Console now will show you statistics on the REST API traffic you have been generating.


Let's repeat this with our other word - 'START'.


Let's wait (again) for our Python program's REST API request to traverse the Internet.  And for the response to return.


Great!  2 out of 2.  We have our answer, and it's correct.

Clearly Rosie Patrol is benefiting from big (deliberately) legible letters.  If the environment has a lot of visual noise, or if the picture quality is low, the results are likely to be a lot more erratic.  More so, if we're trying to comprehend paragraphs of continuous text, split out across many lines.  And it wouldn't help if those words had been handwritten in the snow by a Yeti, in a language that only Yetis use to talk to each other about the wonders of robotics.  That might be one example where developing your own machine learning algorithm would be needed.

So how do we best demonstrate all this with Rosie Patrol?

There was a reason for picking those two words.  Now that Rosie Patrol can take simple visual cues using her camera, it's not very difficult to wrap this all up in wider Python code.  Here is an example where we flash our START and STOP cards in front of her camera to get her to... can you guess... start and stop moving.
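
We won't reproduce the whole program here, but the gist is a loop along these lines.  It's a rough sketch only: it reuses the functions and constants from earlier, assumes the picamera module from our previous missions, and the "action" payload sent to ROSIE_CONTROL_API is entirely made up for illustration - your own robot's control API will expect something different.

from time import sleep

from picamera import PiCamera

CHECK_INTERVAL = 5  # seconds between photos - mind your monthly API quota

camera = PiCamera()

while True:
    # Take a fresh photo for Google Cloud Vision API to analyse
    camera.capture(SOURCE_IMAGE)
    discovered_text = _find_text_in_image(SOURCE_IMAGE, GOOGLE_VISION_API, GOOGLE_VISION_TOKEN)
    # The payloads below are purely illustrative - substitute whatever
    # your robot's control API actually expects
    if discovered_text == "START":
        _post_json_request(ROSIE_CONTROL_API, {"action": "start"})
    elif discovered_text == "STOP":
        _post_json_request(ROSIE_CONTROL_API, {"action": "stop"})
    sleep(CHECK_INTERVAL)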


Information overload:

Google Cloud Vision API documentation can be found here:
Python Requests docs, if you need a refresher:
