Blog | Google Now Indexes Scanned Documents

Posted on Friday, 31 October 2008

Google has announced that it will now include scanned documents in its search results, a feat that requires an immense amount of processing power and advanced image recognition technology. Unlike standard text documents, scanned files are just images of pages and contain no text data that Google's spiders can index. To get around this, Google uses Optical Character Recognition (OCR) technology, which converts images of words into digital text.

 

In the past, Google would index these image files as well as it could, but could typically search only file titles and nearby metadata, not the contents of the documents. From now on, Google searches will include the text within these scanned images in normal search results. When you encounter a scanned document, you'll be able to view it in its original form as a PDF, or as a converted text version (click "View As HTML").
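To make the difference concrete, here is a rough Python sketch of a toy search index. It is only an illustration and not how Google's systems work: the document id, the file name, the `ocr_text` field and the helper functions are all made up for this example, and the OCR step itself is assumed to have already run.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, include_ocr_text):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        searchable = doc["title"]
        if include_ocr_text:
            # Assumption: OCR has already turned the page image into text.
            searchable += " " + doc["ocr_text"]
        for word in tokenize(searchable):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical scanned document; "ocr_text" stands in for real OCR output.
docs = {
    "scan-001": {
        "title": "scan0001.pdf",
        "ocr_text": "Repairing Aluminum Wiring",
    },
}

title_only = build_index(docs, include_ocr_text=False)
with_ocr = build_index(docs, include_ocr_text=True)

print(search(title_only, "aluminum wiring"))  # no match: only the file name is indexed
print(search(with_ocr, "aluminum wiring"))    # matches the scanned document
```

With only the file name indexed, a search for the document's subject finds nothing. Once the recognised text is added to the index, the same search returns the scan.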

Such technology has existed for quite a while, but accuracy has always been an issue, and the fact that Google is doing it on such a massive scale makes it a very impressive accomplishment. It also opens the door to much more thorough searching, especially for content that is often found in printed documents (like academic papers).

Here's an example (the first result is a scanned document): Repairing Aluminum Wiring

For more, check out the announcement here.

Source: http://www.washingtonpost.com
