Skip to content
Advertisement

How to decode a specific byte string of varints in PHP

I’m trying to decode a string in a specific format (Hearthstone deck code), using PHP, like this:

AAEBAc2xAgjAAe0E7QX3DdYRh6wC8fsCoIADC8kDqwTLBPsMhRDH0wKW6AK0/ALNiQPXiQOfmwMA

or

AAEBAf0GBAD6DoyAA6CAAw37AZwCigbJB/gHlA+CEIUQrRDy0AL2/QKJgAPRgAMA

The specification (original description) is:

The datastring is a base64-encoded byte string.

Unless specified otherwise, every value that follows is an integer, encoded as an unsigned varint.

  1. Header block

    • Reserved byte 0x00
    • Version (1)
    • Format
  2. Data block
    The data block is split in four pairs of length + array in the following order:

    • Heroes
    • Single-copy cards
    • 2-copy cards
    • n-copy cards

Each pair has a leading varint specifying the number of items in the array. For the first three blocks, those are arrays of varints. For the last block, it is an array of pairs of varints. The goal of this structure is to make the datastring as compact as possible.

I have begun to put something together, but I’m a novice when it comes to handling raw bytes. My code is:

    // I found this to decode Variable-length quantity (varint)
    function vlq_decode(array $bytes) {
        $result = [];
        $integer = 0;
        foreach ($bytes as $byte) {
            if ($integer > 0xfffffff - 0x7f) {
                throw new OverflowException('The value exceeds the maximum allowed.');
            }
            $integer <<= 7;
            $integer |= 0x7f & $byte;

            if (($byte & 0x80) === 0) {
                $result[] = $integer;
                $integer = 0;
            }
        }
        if (($byte & 0x80) !== 0) {
            throw new InvalidArgumentException('Incomplete byte sequence.');
        }
        return $result;
    }

    $datastring = 'AAEBAc2xAgjAAe0E7QX3DdYRh6wC8fsCoIADC8kDqwTLBPsMhRDH0wKW6AK0/ALNiQPXiQOfmwMA';

    $raw = base64_decode($datastring);

    $byte_array = unpack('C*', $raw);

    $result = vlq_decode($byte_array);

    print_r($result);

The only thing I’m certain about is the base64_decode. I can’t tell, if the unpack parameters are the correct ones, or whether the vlq_decode function works as intended, because I didn’t write it myself.

On the original site there are reference implementations in Python and Javascript, but they are over my head and I couldn’t use the code to make mine work.

Update:

The code does indeed produce an array that looks similar to what I’d expect, but many of the values don’t seem correct. I’m thinking the conversion from varint is still somewhat off.

// this is the $result I get (wrong)
Array (
    [0] => 0 // this is always 0
    [1] => 1 // Version
    [2] => 1 // Format
    [3] => 1 // What follows is an array of length 1 (data block Heroes)
    [4] => 1267842
    [5] => 8 // What follows is an array of length 8 (data block single-copy cards)
    [6] => 8193
    [7] => 13956
    [8] => 13957
    [9] => 15245
    [10] => 11025
    [11] => 120322
    [12] => 1867138
    [13] => 524291
    [14] => 11 // What follows is an array of length 11 (data block 2-copy cards)
    [15] => 9347
    [16] => 5508
    [17] => 9604
    [18] => 15756
    [19] => 656
    [20] => 1173890
    [21] => 373762
    [22] => 867842
    [23] => 1262723
    [24] => 1426563
    [25] => 511363
    [26] => 0  // What follows is an array of length 0 (data block n-copy cards)
)

The Python implementation (Gist) produces different numbers, in a slightly different format, which match nicely to the database containing the data for the IDs (in the dbfId field)

// this is the expected (correct) $result
Array (
    [0] => 0
    [1] => 1
    [2] => 1
    [3] => 1
    [4] => 39117
    [5] => 8
    [6] => 192 
    [7] => 621 
    [8] => 749 
    [9] => 1783 
    [10] => 2262 
    [11] => 38407 
    [12] => 48625 
    [13] => 49184 
    [14] => 11
    [15] => 457 
    [16] => 555 
    [17] => 587 
    [18] => 1659 
    [19] => 2053 
    [20] => 43463 
    [21] => 46102 
    [22] => 48692 
    [23] => 50381 
    [24] => 50391 
    [25] => 52639
    [26] => 0
)

Any help is appreciated!

There already is a question about this topic, but it was badly written with no code samples, so I’m giving this another go.

Advertisement

Answer

It’s an endian problem, aka you need to push the bits from each varint byte in reverse order. The clue is that the values below 128 make it through unmolested.

The below is an illustrative hack and should not be used in actual code:

str_split(decbin(1267842),7)

Yields:

array(3) {
  [0]=>
  string(7) "1001101"
  [1]=>
  string(7) "0110001"
  [2]=>
  string(7) "0000010"
}

Super convenient that it’s already a multiple of 7 bits, but probably also a symptom of the endian problem.

Reverse, implode, convert back:

bindec(implode('', array_reverse(str_split(decbin(1267842),7))))

Yields:

int(39117)

I re-jiggered that function to be able to address this:

function vlq_decode(array $bytes, $swap_endian=false) {
    $result = [];
    $segments = [];
    foreach ($bytes as $byte) {
        if( $swap_endian ) {
            array_unshift($segments, 0x7f & $byte);
        } else {
            $segments[] = ( 0x7f & $byte );
        }

        if (($byte & 0x80) === 0) {
            $integer = 0;
            foreach($segments as $segment) {
                $integer <<= 7;
                $integer |= ( 0x7f & $segment );
            }
            $result[] = $integer;
            $segments = [];
        }
    }
    if (($byte & 0x80) !== 0) {
        throw new InvalidArgumentException('Incomplete byte sequence.');
    }
    return $result;
}

and then vlq_decode($byte_array, true); will get you what you want.

I pared out that bunk overflow code as it will never actually detect an actual one, and also hobbles you to 32-bit ints. If you do want to detect an overflow during decoding you need to count the bits you’re unpacking, and that’s just a pain in the ass 😛

User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement