
Converting hex string to binary file makes it corrupt and unable to open

When I convert the hex-encoded contents of a PDF file back to binary, the resulting file is corrupted.
This is the beginning of the hex content of a simple PDF file I want to convert:

0x255044462D312E370D0A25B5B5B5B50D0A312030206F626A0D0A3C3C2F547970652F436174

Full string: jsfiddle, pastebin

This question is a continuation of this question, where I said that I have to do a data migration between two programs that handle files differently. The source program stores the files hex encoded in the database.

I could successfully extract and convert text files to binary files with the following code:

file_put_contents(
    'document.pdf', 
    hex2bin(str_replace('0x', '', $hexPdfString))
);

But when I run this code on a PDF file or another binary file, the result is corrupted.
My question is pretty much the same as this one, but the discussion over there was unfortunately discontinued.


Answer

The result of hex decoding your string is corrupted because your string is incomplete: it contains only the first 65535 characters. After hex decoding, one can see that the PDF is cut off inside a metadata stream:

20 0 obj
<</Type/Metadata/Subtype/XML/Length 3064>>
stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?><x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="3.1-701">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""  xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Producer>Microsoft® Word 2019</pdf:Producer></rdf:Description>
<rdf:Description rdf:about=""  xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:creator><rdf:Seq><rdf:li>Samuel Gfeller</rdf:li></rdf:Seq></dc:creator></rdf:Description>
<rdf:Description rdf:about=""  xmlns:xmp="http://ns.adobe.com/xap/1.0/">
<xmp:CreatorTool>Microsoft® Word 2019</xmp:CreatorTool><xmp:CreateDate>2021-06-17T13:00:19+02:00</xmp:CreateDate><xmp:ModifyDate>2021-06-17T13:00:19+02:00</xmp:ModifyDate></rdf:Description>
<rdf:Description rdf:about=""  xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
<xmpMM:DocumentID>uuid:C29344F5-3E78-414A-B4E3-775A853B1A0C</xmpMM:DocumentID><xmpMM:InstanceID>uuid:C29344F5-3E78-414A-B4E3-775A853B1A0C</xmpMM:InstanceID></rdf:Description>
                                                                                                    
                                                                                                    
                                                                          

The length 65535 is no coincidence: it is 0xFFFF, the largest value an unsigned 16-bit integer can hold. Apparently some mechanism involved in retrieving that string could not handle strings longer than 65535 characters. Thus, you have to investigate the source of that string.
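To check a retrieved string for these signs of truncation before writing a file, something like the following sketch might help. The function name `inspectHexDump` and the report keys are made up for illustration. The `%%EOF` check assumes the payload is a PDF, since PDF files end with that marker.

```php
<?php
/**
 * Hypothetical helper (not part of the original question or answer):
 * reports signs that a hex dump was truncated before it is decoded.
 */
function inspectHexDump(string $hex): array
{
    // Strip an optional leading "0x" prefix only, not every occurrence.
    $digits = (strncasecmp($hex, '0x', 2) === 0) ? (string) substr($hex, 2) : $hex;

    $report = [
        'totalLength' => strlen($hex),
        // Exactly 0xFFFF characters hints at a 16-bit length limit.
        'hitsLimit'   => strlen($hex) === 0xFFFF,
        'validHex'    => $digits !== '' && ctype_xdigit($digits),
        // hex2bin() returns false for an odd number of hex digits.
        'evenLength'  => strlen($digits) % 2 === 0,
        'hasPdfEof'   => false,
    ];

    if ($report['validHex'] && $report['evenLength']) {
        $binary = hex2bin($digits);
        // Assumption: the payload is a PDF, which ends with an %%EOF marker.
        $report['hasPdfEof'] = strpos(substr($binary, -1024), '%%EOF') !== false;
    }

    return $report;
}
```

If `hitsLimit` is true or `hasPdfEof` is false, the string was most likely cut off before it reached `hex2bin`, and decoding it cannot produce a valid file.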

Given the earlier question that this one continues, I’d assume that either the field in the MS SQL database you retrieve the data from is limited to 65535 bytes, or your database retrieval code cuts the value down.

In the former case there’d be nothing you can do: the database contents would simply be incomplete. In the latter case you’d have to enable your database access code to handle long strings.
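One way to tell the two cases apart is to ask the server how long the stored value is and compare that with what arrives in PHP. The following is only a sketch under several assumptions. It assumes access via PDO and a hex string stored in a `varchar` column, so that one character is one byte. The table `documents`, the columns `id` and `content`, and both function names are hypothetical. `DATALENGTH` and `SET TEXTSIZE` are T-SQL features, but whether raising the text size removes the limit depends on the driver in use.

```php
<?php
/**
 * Hypothetical helper: fetches the value together with the byte length
 * the server reports for it. Table and column names are assumptions.
 */
function fetchWithStoredLength(PDO $pdo, int $id): ?array
{
    // Ask the server not to cap large values returned in this session.
    $pdo->exec('SET TEXTSIZE 2147483647');

    $stmt = $pdo->prepare(
        'SELECT DATALENGTH(content) AS stored_bytes, content
           FROM documents
          WHERE id = ?'
    );
    $stmt->execute([$id]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    return $row === false ? null : $row;
}

/**
 * True if fewer bytes arrived in PHP than the server says are stored,
 * i.e. the value was cut down during retrieval, not in the database.
 */
function wasCutOnRetrieval(int $storedBytes, string $retrieved): bool
{
    return strlen($retrieved) < $storedBytes;
}
```

If `stored_bytes` itself is 65535, the value is already incomplete in the database. If it is larger than the length of the retrieved string, the loss happens in the retrieval code or driver configuration.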

User contributions licensed under: CC BY-SA