PDF to HTML convert using ctxsys.auto_filter different result db 11.2 and 12.1 (Database 11.2.0.1.0 / 12.1.0.1.0) [message #650141] Sun, 17 April 2016 04:25
bwelter
Messages: 4
Registered: January 2012
Location: Netherlands
Junior Member
Converting the same PDF document gives different results in Oracle 11.2 and Oracle 12.1.
I am using plaintext => false to get HTML output.

Code:
declare
  l_blob blob; -- holds the PDF
  l_clob clob; -- result of the conversion
begin
  -- load l_blob with the PDF:
  ...
  -- set policy:
  ctx_ddl.create_policy('test_policy', 'ctxsys.auto_filter');
  ......
  -- convert PDF to HTML:
  ctx_doc.policy_filter(policy_name => 'test_policy', document => l_blob, restab => l_clob, plaintext => false);
  -- normalize CR to LF, then mark the end and beginning of each line
  l_clob := replace(trim(l_clob), chr(13), chr(10));
  l_clob := replace(l_clob, chr(10), chr(32) || '<<EOL>>' || chr(10) || '<<BOL>>');
  ....
end;

In the Oracle 12 database I get in l_clob:
<<BOL>><div class="c" style="top:592px;left:218px;font-size:9px;font-family:Arial, sans-serif;" <<EOL>>
<<BOL>>>TRANSFORMER SINGLE PHASE, PR AC440V SEC AC220/5,</div> <<EOL>>
<<BOL>><div class="c" style="top:592px;left:38px;font-size:9px;font-family:Arial, sans-serif;" <<EOL>>


In the Oracle 11 database, the same PDF gives the following result in l_clob:
<<BOL>> <<EOL>>
<<BOL>><p><font size="1" face="Arial">TRANSFORMER SINGLE PHASE, PR AC440V SEC AC220/5,</font></p> <<EOL>>
<<BOL>> <<EOL>>

I explicitly need this part of the converted PDF content:
..top:592px;left:218px..

Maybe it has something to do with settings?
What is the solution?

NB: I am aware of the fact that not all PDF documents contain nicely formatted texts and x-y positions. For my purpose now this is however a good solution.
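
For reference, here is a minimal sketch of how the top/left values could be pulled out of the 12.1-style HTML with REGEXP_SUBSTR. The sample string is copied from the 12.1 output above. The pattern and the variable names (l_html, l_pattern, l_occ, l_top) are illustrative assumptions, not part of the original code. It only finds something when the filter emits positioned div elements, so it would return nothing for the 11.2-style output.

```plsql
declare
  -- Sample fragment copied from the 12.1 output; in practice this would be
  -- the CLOB returned by ctx_doc.policy_filter (before adding the markers).
  l_html clob :=
       '<div class="c" style="top:592px;left:218px;font-size:9px;font-family:Arial, sans-serif;"' || chr(10)
    || '>TRANSFORMER SINGLE PHASE, PR AC440V SEC AC220/5,</div>';
  -- Assumed layout: style starts with top:<n>px;left:<n>px; as in the sample.
  -- Group 1 = top, group 2 = left, group 3 = text inside the div.
  l_pattern constant varchar2(200) :=
    '<div[^>]*style="top:([0-9]+)px;left:([0-9]+)px;[^>]*>([^<]*)</div>';
  l_occ pls_integer := 1;
  l_top varchar2(10);
begin
  loop
    -- fetch the l_occ-th matching div; null means no more matches
    l_top := regexp_substr(l_html, l_pattern, 1, l_occ, 'n', 1);
    exit when l_top is null;
    dbms_output.put_line(
         'top='   || l_top
      || ' left=' || regexp_substr(l_html, l_pattern, 1, l_occ, 'n', 2)
      || ' text=' || regexp_substr(l_html, l_pattern, 1, l_occ, 'n', 3));
    l_occ := l_occ + 1;
  end loop;
end;
/
```

Run it with SET SERVEROUTPUT ON in SQL*Plus or SQL Developer to see the extracted values. Because the pattern is tied to the exact attribute order in the sample, it would need adjusting if the filter output changes.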

[Updated on: Sun, 17 April 2016 04:27]


Re: PDF to HTML convert using ctxsys.auto_filter different result db 11.2 and 12.1 [message #650154 is a reply to message #650141] Sun, 17 April 2016 09:49
Barbara Boehmer
Messages: 9077
Registered: November 2002
Location: California, USA
Senior Member
As far as I know, there is no setting that affects that. I have heard that Oracle uses a third party auto_filter and there have been changes between versions. So, it is probably just a difference between versions. To verify this, you might try posting your question on the OTN Oracle Text forum. Oracle Text product manager Roger Ford usually responds there.

https://community.oracle.com/community/database/text/content

[Updated on: Sun, 17 April 2016 09:50]


Re: PDF to HTML convert using ctxsys.auto_filter different result db 11.2 and 12.1 [message #650155 is a reply to message #650154] Sun, 17 April 2016 10:48
bwelter
I posted the question there. Thanks!

Re: PDF to HTML convert using ctxsys.auto_filter different result db 11.2 and 12.1 [message #650171 is a reply to message #650155] Mon, 18 April 2016 07:20
bwelter
The answer I got from Roger:
AUTO_FILTER is designed to create indexable text from formatted files. It makes no claims to produce any specific layout in the output files.

I don't think there are any settings which will enable 12c to produce the same output as 11g. However, it might be possible to take the ctxhx executable from an 11g installation and put it into the 12c environment. I'm not sure if there are library files that might need to be transferred as well.

Re: PDF to HTML convert using ctxsys.auto_filter different result db 11.2 and 12.1 [message #650181 is a reply to message #650171] Mon, 18 April 2016 13:37
Barbara Boehmer
Thanks for letting us know. So, do you plan to try using the 11g ctxhx in 12c? If so, please let us know if that works or not.