Tag Archive for OCR

Best in class OCR: Leafdoc set new standard in hosted conversion of scanned images to PDF documents

PRLog (Press Release) - Mar. 12, 2013 - BROMLEY, Kent, U.K. - Leafdoc announce a significant upgrade to their on-line OCR service, which converts scanned images into searchable PDFs that benefit businesses by allowing users to search for words and phrases within the image.
Following an intensive period of research and development, Leafdoc can now claim their service is best in class: no other supplier is able to offer this level of conversion accuracy along with access via email, file transfer (FTP) and a programmable interface (web services API).
For users throughout Europe and the Americas, Leafdoc’s conversion service is both fast and efficient. Customers need not install any software in advance to create searchable PDF documents, and they save money by using the service on demand rather than investing in applications that require constant upgrading and maintenance payments.

Leafdoc, a company formed in 2011, have concentrated on the on-line conversion of images since their inception, and the company is set up to take full advantage of cloud computing and virtual services; their structure and cost base reflect this approach, giving them a unique position in the marketplace.

Article source: http://www.prlog.org/12096963-best-in-class-ocr-leafdoc-set-new-standard-in-hosted-conversion-of-scanned-images-to-pdf-documents.html

Bibliotheca Alexandrina Chooses NovoVerus OCR Software to Digitize a Precious Collection of Egyptian Press Articles


- NovoDynamics is contributing to the digital preservation of more than 800,000 articles as part of a project by the Bibliotheca Alexandrina and the Centre d’Etudes et de Documentation Economiques, Juridiques et Sociales to create a publicly accessible online archive.

ANN ARBOR, Michigan, March 6, 2013 /PRNewswire/ – NovoDynamics, a leading developer of pattern recognition and analytics technologies, and the Egyptian library Bibliotheca Alexandrina (BA) announced today that NovoDynamics® NovoVerus™ software will be used to digitize more than 800,000 Egyptian newspaper clippings collected since 1976 by the Centre d’Etudes et de Documentation Economiques, Juridiques et Sociales (CEDEJ), which is affiliated with the Centre National de la Recherche Scientifique (CNRS). The primary objective of this collaboration between the BA and the CEDEJ is to give the public access to this rare collection of articles through a searchable online archive. Once digitized, the articles can be preserved indefinitely in perfect condition, indexed for online search and made available worldwide.

NovoVerus was selected for this enormous digitization effort because of its ability to fully support Middle Eastern languages and its extreme accuracy in optical character recognition (OCR), particularly with poor-quality documents. The automatic image enhancement features in NovoVerus will ease the processing of aging newspaper clippings, most of which are yellowed, damaged or otherwise deteriorated.

“The BA is renowned for its preeminence in the digital processing of Arabic text, and NovoVerus is reliable software for handling ambitious OCR projects that has proven its high performance on scanned images of originals that are damaged and problematic because of, among other things, poor text quality,” said Rami K. Rouchdi, digital processing expert at the Bibliotheca Alexandrina.

David Rock, President and CEO of NovoDynamics, said: “We are proud to take part in this major international initiative through our flagship product, NovoVerus. Our new software enables faster processing and a significant reduction in the memory required while offering greater accuracy and a whole range of other improvements that, combined, make NovoVerus 4.0 the industry’s leading international-language OCR solution.”

About NovoVerus 4.0

Able to process documents containing terms in several languages, NovoVerus 4.0 supports languages from around the world, including Arabic, Persian (Farsi, Dari), Pashto, Hebrew, Urdu, Chinese, Korean, Russian, Spanish, French and English. Widely deployed in demanding applications used in academia and in the private and public sectors, NovoVerus automatically cleans and converts the most complex documents, including damaged originals and degraded copies, into digital text ready for post-processing and analysis. For more information, please visit www.NovoDynamics.com/NovoVerus.

About the Bibliotheca Alexandrina

The new Library of Alexandria, the Bibliotheca Alexandrina (BA), was inaugurated in October 2002. It aims to restore the spirit of openness of the city’s ancient library by encouraging the participation of a worldwide audience through the technological innovations of the digital age. More than just a library, the BA is a vast complex that offers scientific, artistic, technological and historical resources through a single platform on the shores of the Mediterranean, and that serves as a center for research and dialogue. Since its creation, it has aimed to be a universal digital library, largely focused on the digitization and preservation of regional and world heritage, both independently and in collaboration with international partners. The BA works to experiment with digital concepts for the benefit of scholars around the world. For more information, please visit www.bibalex.org.

About NovoDynamics

NovoDynamics, Inc., an In-Q-Tel company incorporated in 2001, designs intelligent information capture software and provides advanced analytics solutions that transform data into valuable information for better decision-making. NovoDynamics products and solutions are used worldwide in academia and in the public and private sectors. To learn more, visit www.NovoDynamics.com.

Media contact:
Karen Zanon
NovoDynamics, Inc.
+1-734-205-9162
kzanon@novodynamics.com

SOURCE NovoDynamics

RELATED LINKS
http://www.novodynamics.com

Article source: http://www.prnewswire.co.uk/news-releases/la-bibliotheca-alexandrina-choisit-le-logiciel-ocr-novoverus-pour-numeriser-une-precieuse-collection-darticles-de-presse-egyptiens-195700081.html

Bibliotheca Alexandrina Selects NovoVerus OCR Software to Digitize Valuable Collection of Egyptian Press Articles


- NovoDynamics collaborates on the digital preservation of more than 800,000 articles as Bibliotheca Alexandrina and Centre d’Etudes et de Documentation Economiques, Juridiques et Sociales create online archive for public access.

ANN ARBOR, Michigan, March 5, 2013 /PRNewswire/ – NovoDynamics, a leading developer of pattern recognition and analytics technologies, and Egypt’s Bibliotheca Alexandrina (BA) today announced that NovoDynamics® NovoVerus™ software will be used in the digitization process for more than 800,000 Egyptian newspaper clippings that have been collected since 1976 by the Centre d’Etudes et de Documentation Economiques, Juridiques et Sociales (CEDEJ, the Center for Economic, Legal and Social Studies and Documentation), an affiliate of the Centre National de la Recherche Scientifique (CNRS, the French National Scientific Research Center). The primary objective of the collaborative endeavor between the BA and the CEDEJ is to make this rare collection accessible to the public in a searchable online archive. Once digitized, the articles can be preserved indefinitely in perfect condition, indexed for online search and made available internationally.

NovoVerus was selected for this massive digitization initiative given its comprehensive support for Middle Eastern languages and its superior optical character recognition (OCR) accuracy, especially when it comes to degraded text quality. The automated image enhancement capabilities built into NovoVerus will facilitate the processing of aging newsprint clippings, many of which are yellowed, damaged or otherwise degraded.

“The BA is well-known for its leadership in the digital handling of Arabic text, and NovoVerus has been a reliable choice when it comes to ambitious OCR undertakings, demonstrating high performance on images scanned from challenging, degraded originals with poor text quality,” said Bibliotheca Alexandrina Senior Digital Production Engineer Rami K. Rouchdi.

NovoDynamics President and CEO David Rock added, “We are proud to be supporting this important international effort with our flagship product, NovoVerus. Our new release enables faster processing, increased accuracy, a significant reduction in memory usage and a host of other enhancements that combine to make NovoVerus 4.0 the industry-leading global language OCR solution.”

About NovoVerus 4.0

Including support for mixed language documents, NovoVerus 4.0 handles global languages including Arabic, Persian (Farsi, Dari), Pashto, Hebrew, Urdu, Chinese, Korean, Russian, Spanish, French and English. Widely deployed in rigorous government, commercial and academic applications, NovoVerus automatically cleans and converts even the most challenging documents — including damaged originals and degraded copies — into digital text, ready for post-processing and analysis. Please visit www.NovoDynamics.com/NovoVerus for more information.

About the Bibliotheca Alexandrina

The new Library of Alexandria, Bibliotheca Alexandrina (BA), was inaugurated in October 2002. The Library endeavors to recapture the spirit of openness of the ancient Library, engaging global audiences through the technological innovations of the digital age. It is much more than a library of books; rather, it is a vast complex providing science, research, art, history, technology and dialogue through a single hub located on the Mediterranean shore. Ever since its inception, the BA has been dedicated to serving as a universal digital library, largely focusing on the digitization and preservation of regional and global heritage, both independently and in collaboration with international partners. The BA strives to pioneer digital concepts for the benefit of the international knowledge community. For more information, visit www.bibalex.org.  

About NovoDynamics

NovoDynamics, Inc., an In-Q-Tel portfolio company incorporated in 2001, develops intelligent information capture software and provides advanced analytics solutions that transform data into actionable insights needed to make better decisions. NovoDynamics products and solutions are used worldwide by commercial industries, governments and academia. Learn more at www.NovoDynamics.com.

Media Contact:
Karen Zanon
NovoDynamics, Inc.
+1-734-205-9162
kzanon@novodynamics.com

SOURCE NovoDynamics

RELATED LINKS
http://www.novodynamics.com

Article source: http://www.prnewswire.co.uk/news-releases/bibliotheca-alexandrina-selects-novoverus-ocr-software-to-digitize-valuable-collection-of-egyptian-press-articles-195330471.html

Computer science part of EBacc

Computer science will now become part of a group of core subjects measured in league tables

Computer science is going to become part of the English Baccalaureate – one of the measures used in school league tables in England.

It will be included as one of the science options that count towards this measure.

The English Baccalaureate (EBacc) requires pupils to get good GCSE grades in English, maths, sciences, history or geography and a language.

Technology firms have been calling for a bigger role for studying computing.

Microsoft’s education director Steve Beswick welcomed the announcement as the “start of a journey” in changing how computer science is taught.

He wants the subject to be taught to even younger children, including in primary school.

A Google spokeswoman said this “marks a significant further investment in the next generation of British computer scientists”.

Core subjects

The decision by Education Secretary Michael Gove will mean that computing will count as a science in the English Baccalaureate for secondary school league tables from January 2014 – alongside physics, chemistry, biology and pupils taking double science.

The Department for Education says the change is intended to reflect the “importance of computer science to both education and the economy”.

In January 2012, Mr Gove announced he was replacing the information and communications technology (ICT) curriculum in schools with a more challenging computer science curriculum, developed to meet the needs of technology firms.

In October, a panel of technology experts, including representatives of Google and Microsoft, called for the inclusion of computer science in the English Baccalaureate.

The English Baccalaureate was introduced as a measure of school performance, appearing in league tables, and showing the proportion of pupils achieving GCSEs grade C and above and some AS-levels in specified key subjects.

The planned changes in qualifications in England will see some of these core subjects becoming English Baccalaureate Certificates, replacing the current GCSEs.

There have been several lobbying campaigns to add further subjects to the English Baccalaureate – including arts and religious education – with concerns that subjects outside this group could be marginalised.

Computer science will be the first extra subject to be added.

The subject is being offered by the OCR and AQA exam boards.

AQA’s first cohort will take the exam in summer 2014 – with 231 schools signed up to teach the subject. OCR introduced the subject in 2010 with a first 50 candidates taking the exam in 2011. This had risen to 2,000 candidates for the subject in summer 2012.

Alongside sciences, the English Baccalaureate comprises English, maths and humanities – which is a choice of history or geography – and a language.

Languages can be either ancient or modern, drawn from a list of 172 course options, ranging from classical Greek to Japanese and Urdu.

The introduction of computer science prompted arts campaigners to call for a further widening of the English Baccalaureate.

Deborah Annetts, chief executive of the Incorporated Society of Musicians, called on the government to address “the needs of the creative economy and introduce rigorous creative subjects”.

Stephen Twigg, Labour’s education spokesman, said: “Adding computer science into the EBacc is too little, too late. Gove’s exams still place no value on creative subjects like art, music and drama, and no value on practical subjects like engineering, design and technology and construction.”

A Department for Education spokesman said: “We need to bring computational thinking into our schools. Having computer science in the EBacc will have a big impact on schools over the next decade.

“It will mean millions of children learning to write computer code so they are active creators and controllers of technology instead of just being passive users.”

Article source: http://www.bbc.co.uk/news/education-21261442#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa

Shorten Your Book Stacks With 1DollarScan


This was a good year in terms of getting my home office less cluttered, but I’m still fighting a major battle when it comes to my books. Paperback, hardback, non-fiction, and fiction… I finally exceeded my bookshelf space this year. It got bad enough with the stacks that I finally took a weekend and went through everything, dividing it all up into KEEP or DONATE. Six boxes of books made their way to Goodwill and I’ve still got a few more boxes that are going to a local high school’s library. Even with the culling, however, I still identified a number of KEEP books that I really wished I had purchased in digital format. Many of these books weren’t in digital format when I originally purchased them, and others were available as ebooks but only after reading the print version did I realize a digital version would have been a better investment. It’s a rare book that I purchase both in print and digital, and since 99.9% of publishers won’t provide a digital copy when you purchase a print copy, I now have a large number of print books that I can only reference from home (or by lugging them wherever I might wish to access them).

Another large stack that exists in my home can be found in my workshop — about five years’ worth of The Family Handyman (TFH) magazine. I can now get TFH in digital format on my iPad, but here’s the rub — I hate digital magazines! Yes, I read the occasional digital issues of Wired, Popular Mechanics, and Men’s Health on my iPad, and I love the multimedia features they offer such as animations, interactive elements, and other special effects… but each issue takes up a lot of space. You can delete an issue and reload it at any time, but that’s a hassle when I’d just like to reference a single article I read (not to mention the hassle in trying to remember the correct issue to reload). Until digital magazines offer me the ability to rip out an article and save it to my iPad so I can ditch the rest of the issue, I’m sticking with print magazines. Every month, I sit down with my personal scanner and scan in a small stack of articles that I’ve ripped from my print subscriptions — I convert them with OCR so I can search using keywords. It works great!

But the problem with The Family Handyman is that I often wish to keep 75% or more of the magazine. The magazine has tool reviews, How-To articles, special hands-on projects, and much more. That’s why I don’t rip up my TFH magazines! I just can’t predict when I might need that How-To article on repairing a toilet or adding additional wiring to a closet or finding the Editor’s Best circular saw. I still end up having to hunt through the issues occasionally, trying to find some special article that I remember reading without being sure which issue it was in. TFH doesn’t have a year-end Index in the December issue, so I frequently find myself wishing I had the digital versions so I could do keyword searches.

So, to summarize, I’ve got hardback and paperback books I wish I had in digital format, but I don’t want to spend more money buying the digital version. I’ve also got TFH magazines taking up space because I don’t have the storage space on my iPad to store years’ worth of issues. Add to this the fact that I can’t search the TFH magazines quickly for a particular review or article.

As a self-described efficiency ninja, these issues have been driving me crazy for some time. I’ve investigated building myself a book scanner — there are plenty of plans out there for building your own, but none that have impressed me in terms of easy-to-build or easy-to-use… or both. So I started looking for alternatives.

I can’t remember where I read about 1DollarScan.com, but I owe someone out there a big thanks for the recommendation. 1DollarScan is a service that will take your books (hardback and paperback), magazines, and other business documents and convert them for you at very reasonable prices. I decided to put 1DollarScan to the test a few weeks ago, and I’m now going to share with you my experiences.

The Process

1DollarScan.com requires you to create an account before you start using their service. All the services they offer must be paid for up-front; 1DollarScan makes it pretty easy to figure out the total cost by charging $1 for every 100 pages of a book (rounded up). Want to scan a 325 page book? $4 will get you a basic scan of that book (cover included for paperbacks, not for hardbacks). This is the basic, no-frills scan. You won’t get OCR, high-resolution scan (600DPI versus the basic 300DPI), or insurance (a rescan if you’re unhappy with the results). But if you’ve got a $10, $15, or $30 book that you’d like converted to digital, $1 per 100 pages will get it done. 1DollarScan calls that 100 pages a set, so start thinking in terms of sets. A 380 page book = 4 sets. A 135 page magazine = 2 sets. And so on.

It’s the frills where 1DollarScan really makes its money. You select from various options such as OCR (so you can search the PDF scans using keywords), high-resolution scan (600DPI versus 300DPI), document compression, and more. Magazines are charged just like books, but you pay $1 extra per set, so a 120 page magazine will cost you $4 ($2 for the two sets of pages, plus the $2 magazine surcharge on those two sets). You can pay $2 per set for Express Service (faster scanning and delivery) and $2 per set for high resolution scanning. If you’re not careful, you’ll quickly find a 400 page book costing $10 or more for the scan and frills, fast approaching the potential price of the actual digital version you can purchase from Amazon or elsewhere.
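
The per-set pricing described above reduces to a little arithmetic. As a rough sketch in Python (the function and parameter names are hypothetical, not anything 1DollarScan publishes, and it covers only the options whose per-set prices are quoted in this article), an estimate might look like this:

```python
PAGES_PER_SET = 100  # 1DollarScan's "set": 100 pages, rounded up


def sets_for(pages):
    """Number of 100-page sets needed, rounded up (integer ceiling)."""
    return (pages + PAGES_PER_SET - 1) // PAGES_PER_SET


def scan_cost(pages, magazine=False, express=False, high_res=False):
    """Estimated scan price in dollars (hypothetical helper).

    Uses only the per-set prices quoted in the article; other options
    and shipping are not modeled.
    """
    per_set = 1          # basic scan: $1 per set
    if magazine:
        per_set += 1     # magazines: $1 extra per set
    if express:
        per_set += 2     # Express Service: $2 per set
    if high_res:
        per_set += 2     # high-resolution (600DPI) scan: $2 per set
    return sets_for(pages) * per_set
```

With these assumptions, scan_cost(325) gives 4 and scan_cost(120, magazine=True) gives 4, matching the examples above, while a 400 page book with both Express Service and high resolution comes to 20.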

[Image: The Frills]

After picking your options, you pay, get a special Scan ID number, and print out a couple of forms to go with your box of books (one being a signed document that you are the legal owner of the books — 1DollarScan covering themselves legally). You then ship your books and magazines to 1DollarScan, they scan them, and then they recycle the books. You get an email that provides you with links to download your PDF files.

Pretty simple, but there’s one more factor you need to consider — shipping costs. Unless you go with the basic no-frills scan, by the time you’ve added in a few frills per set and divided the cost of shipping by the number of books you’re sending, you may find that the average cost of each scan is $15 or higher… much higher than the cost to just buy a digital version online. Use the USPS for slow-shipping to save some money and only initiate the service when you have a large number of books to scan so you keep the average shipping cost per book as low as possible.
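
To see why batching matters, here is a minimal sketch of the per-book average once shipping is spread across the box. The function name is hypothetical, and the dollar figures in the usage note are made-up illustration values, not quotes from the USPS or 1DollarScan.

```python
def average_cost_per_book(scan_costs, shipping):
    """Average all-in cost per book: (total scan costs + shipping) / book count.

    scan_costs is a list with one scan price per book in the box;
    shipping is the cost of mailing the whole box.
    """
    if not scan_costs:
        raise ValueError("need at least one book")
    return (sum(scan_costs) + shipping) / len(scan_costs)
```

For example, two books at an assumed $8 each with an assumed $12 of shipping average $14 apiece, while ten such books sharing an assumed $20 of shipping average $10.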

My Test

My test of the 1DollarScan process began by picking out three different items. First, I picked an 84 page copy of The Family Handyman magazine. The second item was a large paperback book titled How to Keep Your Volkswagen Alive. (An excellent book for any kid – great drawings, great explanations of mechanical concepts, and just a classic book to own!) The third book was a hardback titled The Next Decade. Here’s how the basic no-frills scan breaks down:

[Image: Three Test Items]

The Results

The three PDFs did not have the names of the books/magazine as the filenames. 1DollarScan charges extra for this ($1 per set!!), but honestly… just open the document, see what it is, and then rename the file yourself. After opening them and renaming the files, here are the file sizes:

[Image: Mag scan -- Hi Res]

[Image: Volkswagen scan]

If you look very carefully, you might notice a slight angle on the image. I didn’t pay for the Angle Correction, so not every scanned page is perfectly vertical. Looking through the book, I’d estimate about 1 out of every 20 pages or so has a slight angle to the page. I can live with that, but if you cannot you’ll want to pay the $2 per set for the High Quality Touch Up that includes OCR, Angle Correction, and Compression.

Speaking of Compression, there’s another really cool feature that 1DollarScan offers at no additional cost. It’s called Fine-Tuning, and it’s a service that you can use to apply some additional work to the PDF if you know you’ll be displaying it on a specific digital device. Included in this Fine-Tuning is a bit of compression, reducing the file size. How much? I ran all three of my PDFs through the Fine-Tune process (you can only submit one PDF at a time, and it seems to take 30 minutes to an hour for the process to be completed… you get another email when the document is done). Here are the results:

Now, here’s the deal… the PDFs were greatly reduced in size, but the quality of the scan is also degraded a little bit… or a lot. I think it depends on the digital reader you select to use for the Fine-Tune (you choose from iPad, iPhone, iPod, Kindle 4, Kindle 3, Nook, Android Tablet, and many more options). I’ve taken a screenshot of the Basic versus Enhanced (Fine-Tuned) versions side by side that I’m sharing here. It might be a bit difficult for you to see, but the text on the right with the Fine-Tuned version is a little fuzzy around the edges if you look closely. The original, hi-resolution version is on the left. On some pages, it’s not very noticeable… on others, it’s quite obvious. On my iPad, however, the Fine-Tuned version is acceptable. And since it’s about 1/10th the filesize, if I wanted to keep these files on my iPad, I could store hundreds of issues without worrying about running out of space.

[Image: side by side]

[Image: VW side by side]

Article source: http://www.wired.com/geekdad/2012/12/stacks-with-1dollarscan/