Friday, April 2, 2010

Space between two words is not...

Hi, in my Acrobat plug-in, I am using the following lines to determine whether there is a space between two words, because I need to join words based on this.

attrWord = PDWordGetAttr(pdWord);

bHasAdjacentSpace = (attrWord & WXE_AdjacentSpace);

But for two words in a particular PDF, it does not give the correct result. Visually I can see a space between these words, but the check above reports that there is NO space between them.

Could you please let me know what's going wrong here?

Or is there any other way to correctly judge the spacing between two words and whether they need to be joined?


It may be an issue of HOW MUCH space is between the words...

In an untagged/unstructured PDF, Acrobat can only guess at what things are "words" and how they go together. So the amount of "white space" between two sets of text drawing instructions is used as a guide. Too much, and they probably aren't connected (e.g. two separate columns).


Thanks for your reply Leonard!

OK, so what is the foolproof way of judging the space between the words?

Is there any other way that I can take to achieve this?

Please note that I am judging the space this way because I join two words based on the result.

If there is another criterion I could use to decide when to join words, please suggest that as well.

You can get the bounding box (Quads) for each word that you get back and then make your own decision about how much white space is acceptable for joining...
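As a rough illustration of that idea, here is a minimal sketch in plain C. It is not SDK code: `WordBox`, `GapKind` and `classify_gap` are hypothetical names, and the box is assumed to have been filled in from the word's quads beforehand. It also assumes unrotated, left-to-right text, and the two ratio thresholds are arbitrary values you would have to tune against your own documents.

```c
#include <assert.h>

/* Hypothetical stand-in for a word's bounding box in page units.
   In a real plug-in these values would be derived from the word's quads. */
typedef struct {
    double left, bottom, right, top;
} WordBox;

typedef enum {
    GAP_JOIN,     /* gap is negligible: treat as one word */
    GAP_SPACE,    /* ordinary inter-word space */
    GAP_SEPARATE  /* different line, or gap too wide (e.g. another column) */
} GapKind;

/* Classify the horizontal gap between word a and the word b that follows it.
   The gap is compared against the height of a, so the thresholds scale
   with the text size. Both ratios are assumptions, not SDK constants. */
static GapKind classify_gap(const WordBox *a, const WordBox *b,
                            double join_ratio, double separate_ratio)
{
    double height = a->top - a->bottom;
    double gap = b->left - a->right;

    assert(join_ratio <= separate_ratio);

    /* Degenerate box: no basis for a decision, so do not join. */
    if (height <= 0.0)
        return GAP_SEPARATE;

    /* No vertical overlap means the words sit on different lines. */
    if (b->bottom >= a->top || b->top <= a->bottom)
        return GAP_SEPARATE;

    if (gap <= join_ratio * height)
        return GAP_JOIN;
    if (gap <= separate_ratio * height)
        return GAP_SPACE;
    return GAP_SEPARATE;
}
```

For example, with 10-unit-high text, `classify_gap(&a, &b, 0.15, 2.0)` would join words that are up to 1.5 units apart, report a normal space up to 20 units, and treat anything wider as unrelated text. Like the word finder itself, this is still a guess for untagged content.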

Hey, thanks Leonard. I was thinking about going for the same approach, i.e. approximately judging the space between the words based on their quads.

I am just not sure whether this would work in all possible cases. Is there any other approach or SDK method that would tell us if the words need to be joined?

For untagged documents, it's all a guessing game.

When the document is tagged, then you'll get the correct information.

Ah! It is actually related to tagging. I added tags to my PDF from the Advanced > Accessibility > Add Tags menu and it properly considered the spacing between the words.

Could you please explain to me what a tagged document is and how exactly it differs from an untagged one?

Also, if this is the case, can I, for any PDF, find out whether it is tagged and, if not, programmatically add tags to it? Is this possible?

If so, what are the APIs to find out if a PDF is tagged and to add tags to it? Also, tagging applies at the document level and not the page level, correct?

Thank you Leonard.

Also, what does "Advanced > Accessibility > Add Tags" do internally to judge the space between two words? I am sure it is not considering the word attribute and WXE_AdjacentSpace to find this. Any ideas?

Correct.

The process of tagging "recodes" the PDF content stream into something that it believes is the "most optimal" and correct to enable proper content extraction.

For details on tagged PDF, look at the PDF Reference/ISO 32000. There are two entire chapters on the subject.

You can even programmatically do the "Add Tags" feature from your own plug-in. I believe it's an AVCommand.

Each page is tagged individually, but there is also document-level "common stuff" as well.
