Google Indexes Scanned Docs – Brings Light To Dark Web


When you search on Google, you now also get results from scanned documents in PDF format. This is a major leap for Google when you consider that scanned documents are typically images and do not contain any text data. Google has apparently been working on this using OCR (Optical Character Recognition) for quite some time.
OCR by itself is not a new technology, but being able to implement it on such a large scale deserves merit. Accuracy, however, has been an issue with most OCR software, and it will be interesting to see how much accuracy Google is able to bring to the table.
As the Google Blog states:

In the past, scanned documents were rarely included in search results as we couldn’t be sure of their content. We had occasional clues from references to the document– so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR on any scanned documents that we find stored in Adobe’s PDF format. This Optical Character Recognition (OCR) technology lets us convert a picture (of a thousand words) into a thousand words — words that can be searched and indexed, so that these valuable documents are more easily found. This is a small but important step forward in our mission of making all the world’s information accessible and useful.
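
To make the "words that can be searched and indexed" part concrete, here is a minimal sketch of what happens after OCR has produced text. It is not Google's implementation. It assumes the OCR step has already run, and the file names, the sample text and the helper names (tokenize, build_index, search) are all made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index mapping each word to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical OCR output: scanned PDF name -> text recognised from its pages.
ocr_output = {
    "scan-001.pdf": "Steady success in a volatile world",
    "scan-002.pdf": "Annual report for a volatile market",
}

index = build_index(ocr_output)
print(sorted(search(index, "volatile world")))
```

Before OCR, neither of these files would contribute any words to the index. Once the text has been extracted, a scanned document can be matched like any other page.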

As Brigid Gaffikin at GigaOm says, this is also a step towards lighting up the dark web – the huge mass of data that is unsearchable either because it is password protected, behind peer-to-peer networks, or in formats such as scanned images or PDFs that are difficult to search.
Check out the search results for [Steady success in a volatile world]: the first result is a scanned document.


