The Great War (1914-1918) Forum


JPEG to Word or TEXT document


John Gilinsky

First please accept my apologies but I could not figure out the BEST place in this large discussion group to post this.

Second, I am very, very serious about getting the best up-to-date knowledge about the following, viz.:

I have scanned thousands (and counting) of images of archival documents with my digital camera. The files are of course graphic files (that is, JPEGs). HOWEVER, I wish and need to convert individual files, small batches and large batches from JPEG to Microsoft Word .DOC or text files, rapidly and accurately, without file, image or text distortions. I have checked the internet and apparently there is some software that advertises to do this. I know that there is software that will convert a PDF to a DOC file, but I do not wish to go through multiple steps (that is, JPEG to PDF and then PDF to DOC, etc.!).

Can ANYONE please help me out by giving me specifics, data on particularly effective and safe software, and their own experiences?

Thanks,

John

Toronto, Ontario

CANADA

B)

P.S. Today November 5 is my birthday!

:rolleyes:


The only thing that I can think of is to print out each one and scan them using OCR software, but I don't think that will be easy, quick or very accurate.

p.s. Happy Birthday!


Yeah, I get a dozen free bagels today (yeah!) but would rather have a positive answer to my query above. At any rate, thanks for the birthday wishes!

BIG birthday boy!

:lol:


John,

Happy birthday. :)

My only thoughts were exactly the same as Ken - i.e. to use OCR software on prints. :(


John,

Likewise Happy Birthday.

Converting your jpegs to pdf won't help, as they will still be images rather than 'live' text. The programs that convert pdfs to doc format only work with files that contain 'real' text as opposed to images of text. 'Real' text will have been input via a wordprocessing, typesetting or web-authoring program and be encapsulated in the pdf format together with its fonts, layout parameters, etc. Your only solution, unfortunately, is to re-key the text or try OCR-scanning it — both of which are very time-consuming and error-prone.

Enjoy your dozen bagels — hope you have something tasty to put in them.

Mick


Are you simply trying to insert the pictures into a word document? If so use the insert drop down menu and follow the options to insert a picture file into the document. You may want to ensure that the picture is not too large for the page first though. If so resize it and save the resized version as a copy. It's always good practice to keep the original photofile completely untampered with and work with copies when altering an image.

Hope that helps.

By the way why do you need to do this? A bit more background may provoke a few more suggestions.


It depends on your images and the original documents. Forgive me if I am stating the obvious, but if the documents include handwriting, your only solution is manual transcription. You could maybe reduce the work by dictating the text to speech recognition software like Dragon Naturally Speaking. If the documents are printed, you can use optical character recognition (OCR) software to translate pixels into characters and text that you can edit and work with in Word and other text programs.

On a good image, the accuracy of OCR software is now very high. So this is where image quality comes in. If your images are sharp, have decent contrast, and are a good size, then you have a chance. But this is quite hard to achieve with a hand-held camera in indifferent lighting. Any ‘noise’ in the image will reduce the accuracy of the OCR conversion, and you soon get to a point where you have to do so much correcting to the result that it is no longer worthwhile. OCR software can cope with some skew and perspective in the image but not too much. Leading programs like Abbyy and Omnipage can handle quite a range of image formats, including PDF, JPG, BMP, PCX and so on.

There is an added complication if your documents are forms. If it is not to deliver a jumble of words, the OCR program needs to recognise and ‘understand’ the layout. Omnipage claims to do this but I have not tried the feature. I am not sure about Abbyy.

Your best bet is to download a trial copy of one of these programs, and see how it gets on with your images.
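For anyone who would rather script the trial than click through each image, here is a minimal sketch of batch conversion straight from JPEG to text files. It assumes the free Tesseract OCR engine is installed, which is a different program from the Abbyy and Omnipage packages named above and is used here only as an illustration. The function names and folder names are made up for the example.

```python
import subprocess
from pathlib import Path


def plan_ocr_commands(image_dir, out_dir, lang="eng"):
    """Build one tesseract command (as an argument list) per JPEG in image_dir."""
    image_dir, out_dir = Path(image_dir), Path(out_dir)
    commands = []
    for image in sorted(image_dir.iterdir()):
        # Skip anything that is not a JPEG (notes, thumbnails, etc.).
        if image.suffix.lower() not in {".jpg", ".jpeg"}:
            continue
        # Tesseract adds the ".txt" extension to the output base name itself.
        out_base = out_dir / image.stem
        commands.append(["tesseract", str(image), str(out_base), "-l", lang])
    return commands


def run_ocr(commands):
    """Run each planned command; requires the tesseract program on the PATH."""
    for cmd in commands:
        # Make sure the output folder exists before tesseract writes to it.
        Path(cmd[2]).parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(cmd, check=True)
```

Calling `run_ocr(plan_ocr_commands("scans", "text"))` would leave one .txt file per photograph in the hypothetical `text` folder. The accuracy caveats above still apply: this only automates the conversion, it does not improve it.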



I am using my digital camera as a way to take notes in archives. For example, in 10 minutes or less I copied a 41-page, legal-sized, detailed list of neurological cases treated in a major Canadian neurological hospital in WWI! This includes their full names, units, medical conditions (diagnoses) and military CEF serial numbers (as well as civilians who were obviously discharged soldiers). I have hospital admission register books, manuscripts, forms, handwriting, ledgers, photographs, postcards, etc. to copy and then use as my research notes. I need to process some 10,000 (or even more) personnel files in RG 150 at the National Archives in Ottawa. I do hope someone can help me, as most of the responses tend to be a little pessimistic as to my being able to do conversions accurately and fairly quickly.

:unsure:


I had the same problem. I couldn't find anything for a PC to address this issue. Most software you will find requires the image to be scanned by an OCR reader, which means JPEG photos won't work. However, I did find some software for my Apple Mac. It's called either Iris or Readiris; I forget the version. If you find software for this purpose that works on a PC, let me know.

Andy M


John,

The phrase 'I need to process ...' says everything about your impatience to get on with your project, and your frustration at the 'pessimism' of the answers you have received is understandable, but I'm afraid your predicament reminds me of the story of the man who built a yacht in his basement.

Out of interest, how would you have tackled this project without the aid of your digital camera?

Mick


... I need to process some 10,000 (or even more) personnel files ...

John

You have taken on a major task, which will require a major effort on your part. The ability to photograph the documents at the Archive is a major plus in your favour - but it really only means that the effort of transcribing the files against an artificial time limit (the Archive's closing time) is removed. You can now apply your own time limit (bed, food, work, etc allowing) to the task of transcribing.

The task remains, and as Pals above have said - realistically, not pessimistically - the computer software to do this from very variable hand-written sources I'm sure just doesn't exist. You may though have some joy with OCR software with any type-written documents, as Clive has said.

Best of luck, and hope your birthday bagels went down well :) .

Jim


John

Further to my previous post and JC's comments - I can confirm that the bundled sw that came with my AIO has the functionality to take a JPG or TIF from your hard drive and convert it directly into a word document.

HOWEVER

the jpg must have a resolution of 300dpi or better and, of course, as Jim says it will depend on the quality of the original document.

I have never had the need to try this myself, so cannot comment on the quality of the results.

Hope that this does not muddy the waters too much.

Rgds

Andy

JC - I hope that you don't get Muddy Waters Blues as a result of reading this :lol: .


I have never had the need to try this myself, so cannot comment on the quality of the results.

I have. It drove me crackers.

Readiris was bundled with my HP software. It is good, but only as good as the document will allow. On older documents, such as archive newspaper cuttings, I found I needed to make so many corrections that it was simply not cost-effective in terms of my time.

Gwyn


I think you'll find that what you call pessimism is in fact realism!

To quote a famous American admiral of the Civil War: (roughly of course)

"Damn those jpegs! Full scans ahead!"

:D



Impatience is not quite the accurate word to infer from my diction.

Eagerness beyond measure. For the first time we have, in large measure due to modern technology, the ability to access and use many records of all sorts to study shell shock academically (and of course all sorts of other subjects). Digitized newspapers online, for example, add local colour and invaluable details. Even with my 60 to 70 wpm typing skills, the 41 pages I photographed would have taken me about a week, or at least a few days, to copy, compared to the under 10 minutes it took to acquire readable information from invaluable contemporary sources. By the way, can you or anyone else confirm whether the National Archives in Kew, Richmond, Surrey AND/OR the National Archives in Washington, D.C. ALSO now allow hand-held, no-flash digital photography by archival users?

Tx for your thoughts and consideration everyone! This shell shock research is quite dear to my heart and mind!


Non-flash Digital Photography at the NA at Kew. Absolutely YES. (Subject to fragility of records I presume)

Steve.



300 to 600 dpi have been referred to. How do you generally convert megapixels to dpi and vice versa, though? I have a 6 megapixel camera that was used carefully but almost exclusively hand-held. The images are fairly good. Any ideas or suggestions for converting these thousands of images directly from JPEGs to DOCs? Is there any similar NORTH AMERICAN conversion software besides Omnipage?

Tx all!

John

Damned colonial! :lol:


John,

If you would like to email me a typical image, I will be happy to see what results I can get with Abbyy. By typical I mean as to resolution, contrast, sharpness, skew, perspective, noise and original document. No handwriting of course!


As several pals have said, there are OCR packages that claim to be able to convert images of text into 'real' text in Word format, but their success rate varies enormously according to the quality of the image and the quality and characteristics of the original document. I downloaded the fully-functional free trial version of Readiris a couple of days ago and tried it on several SHQ jpeg images of WW1 era documents, both plain text and form-format, containing print alone, typescript alone, manuscript alone, and combinations thereof. Plain text printed material works reasonably well, but still contains some errors and requires careful checking, especially if you are concerned with the rigorous accuracy of spellings (names) and numbers (dates, etc). Plain text typescript works slightly less well, likewise contains errors, and approaches the 'not worth it' point on documents typed on a machine with an old ribbon or keys that do not strike cleanly. The other combinations were basically not worth the bother — mending them would take as long as re-keying them.

So, I echo Clive's advice to John — try it yourself on a selection of your image files.

Good luck

Mick
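One way to put a number on "not worth it" when trying a package on your own files: re-key one page by hand, run the same page through the OCR program, and compare the two texts. The sketch below does this with Python's standard difflib module. The function names are made up for the example, and the similarity score is only a rough guide, not a formal OCR accuracy measure.

```python
import difflib


def ocr_similarity(ocr_text, reference_text):
    """Return a 0..1 similarity score between OCR output and a hand-typed reference."""
    # Collapse runs of whitespace so line breaks do not count as errors.
    a = " ".join(ocr_text.split())
    b = " ".join(reference_text.split())
    return difflib.SequenceMatcher(None, a, b).ratio()


def differing_words(ocr_text, reference_text):
    """List (reference words, OCR words) pairs wherever the two texts disagree."""
    ref_words = reference_text.split()
    ocr_words = ocr_text.split()
    matcher = difflib.SequenceMatcher(None, ref_words, ocr_words)
    pairs = []
    for tag, r1, r2, o1, o2 in matcher.get_opcodes():
        if tag != "equal":
            pairs.append((" ".join(ref_words[r1:r2]), " ".join(ocr_words[o1:o2])))
    return pairs
```

The list from `differing_words` is the useful part for names and serial numbers, since a single wrong digit barely moves the overall score but still ruins the record.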



Open the JPEGs in, say, Photoshop and use the image sizing dialogue to up the DPI to 300 (you will probably find the default of the image is around 72 dpi), then save as TIFF.

Then apply a small amount of Unsharp-mask (commonly found under the Filters drop-down menu).

Adjustment of contrast may help ease out the worst of small blemishes in the paper (if the media is coloured try removing colour and adjusting contrast) such as creases, spill marks etc.

I don't know what your camera type is, but it may be able to shoot RAW image files (some handily can store a RAW and a JPEG image simultaneously). From these, TIFFs can be extracted using software that is either provided by the camera manufacturer or supplied as a Photoshop plug-in. This will assist in providing image files with higher resolution and gamma control capability.

As others have indicated, and as my experience with OCR informs, you may well be better off biting the bullet and entering details into a WP or spreadsheet using the keyboard, as surface blemishes can impede OCR. Be very careful when transcribing details in obscure handwriting, and allow a space for notes on alternative readings.
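On the megapixels-to-dpi question, a rough worked example may help. The dpi figure stored in a camera file (often 72) is only a label; what an OCR program actually has to work with is the number of pixels spread across each inch of the original paper. The sketch below assumes a 6 megapixel frame of 2816 x 2112 pixels and a legal-size page (14 x 8.5 inches) that exactly fills the frame. Both figures are assumptions for illustration, so substitute your own.

```python
def effective_dpi(pixels_long, pixels_short, page_long_in, page_short_in):
    """Pixels per inch of paper when the page exactly fills the frame."""
    # The tighter of the two directions limits the usable resolution.
    return min(pixels_long / page_long_in, pixels_short / page_short_in)


def megapixels_needed(page_long_in, page_short_in, dpi):
    """Megapixels required to cover a page of the given size at the given dpi."""
    return (page_long_in * dpi) * (page_short_in * dpi) / 1_000_000


# Assumed 6 MP frame (2816 x 2112) photographing a legal page (14 x 8.5 in).
print(round(effective_dpi(2816, 2112, 14.0, 8.5)))   # 201
# Pixels a camera would need to reach a true 300 dpi on the same page.
print(round(megapixels_needed(14.0, 8.5, 300), 2))   # 10.71
```

Under those assumptions a 6 megapixel photo of a full legal page works out to about 200 dpi of real detail. Resizing the file to 300 dpi in an image editor changes the label or interpolates extra pixels, but it cannot add detail the camera did not capture, so filling the frame with a smaller area of the page is the surer way to raise the effective figure.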



My camera is the Nikon COOLPIX 6 (megapixel 6) digital camera (black body). I will have to double-check its capabilities regarding your second-to-last paragraph above, though. I just want something that works all the time and gives excellent quality.

Thank you for your information.

John


For what it is worth I copied several hundred pages of documents at the NA this summer using a digital camera. I have tried three ways of turning the typed documents into a Word file:

1. OCR scanning;

2. Dictation using Dragon Naturally Speaking;

3. Typing in manually.

I suspect that, overall, either 2 or 3 was the best.

Typed documents of this era are so blurry, even when originals rather than copies, that almost whatever I tried with changing contrast, sharpness, etc., the results were still poor and took a lot of manual correction.

Dictation was not bad, better than scanning and I was able to use this on things like war diaries pretty well. It is frustrating though as, however well you train and teach the software, some basic words are always wrong. It is not good at contextual interpretation. On the other hand, teach it to recognise odd names, like French villages, etc., and it is very good.

If you can type reasonably quickly then this is OK too. I am a four finger typist and found I got along quite quickly, certainly better than OCR which seems only to be good on modern crisp printing (but not some newsprint) and books.

Dragon Naturally Speaking Standard is not that expensive and is available in Canada. Amazon.ca has it for CDN$129.99.
