Python fastest way to read binary file

Major Update: Modified to use proper code for reading in a preprocessed array file (function using_preprocessed_file() below), which dramatically changed the results.

To determine which method is fastest in Python (using only built-ins and the standard library), I wrote a script that benchmarks (via timeit) the different techniques that could be used. It's on the long side, so to avoid distraction I'm posting only the code tested and the related results. (If there's sufficient interest in the methodology, I'll post the whole script.)

Here are the snippets of code that were compared:

@TESTCASE('Read and constuct piecemeal with struct')
def read_file_piecemeal():
    structures = []
    with open(test_filenames[0], 'rb') as inp:
        size = fmt1.size
        while True:
            buffer = inp.read(size)
            if len(buffer) != size:  # EOF?
                break
            structures.append(fmt1.unpack(buffer))
    return structures

@TESTCASE('Read all-at-once, then slice and struct')
def read_entire_file():
    offset, unpack, size = 0, fmt1.unpack, fmt1.size
    structures = []
    with open(test_filenames[0], 'rb') as inp:
        buffer = inp.read()  # read entire file
        while True:
            chunk = buffer[offset: offset+size]
            if len(chunk) != size:  # EOF?
                break
            structures.append(unpack(chunk))
            offset += size

    return structures

@TESTCASE('Convert to array (@randomir part 1)')
def convert_to_array():
    data = array.array('d')
    record_size_in_bytes = 9*4 + 16*8   # 9 ints + 16 doubles (standard sizes)

    with open(test_filenames[0], 'rb') as fin:
        for record in iter(partial(fin.read, record_size_in_bytes), b''):
            values = struct.unpack("<2i5d2idi3d2i3didi3d", record)
            data.extend(values)

    return data

@TESTCASE('Read array file (@randomir part 2)', setup='create_preprocessed_file')
def using_preprocessed_file():
    data = array.array('d')
    with open(test_filenames[1], 'rb') as fin:
        n = os.fstat(fin.fileno()).st_size // 8
        data.fromfile(fin, n)
    return data

def create_preprocessed_file():
    """ Save array created by convert_to_array() into a separate test file. """
    test_filename = test_filenames[1]
    if not os.path.isfile(test_filename):  # doesn't already exist?
        data = convert_to_array()
        with open(test_filename, 'wb') as file:
            data.tofile(file)
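
The snippets above depend on names from the unposted part of the script (fmt1, test_filenames and the TESTCASE decorator). What follows is only a sketch of how such a setup might look, not the original harness. It assumes fmt1 is a struct.Struct built from the same format string that convert_to_array() uses, and that "10 executions, best of 3 repetitions" maps onto timeit.repeat(). The helpers make_test_file and best_time are hypothetical names.

```python
import os
import random
import struct
import tempfile
import timeit

# Assumption: fmt1 wraps the 164-byte record layout used in convert_to_array().
fmt1 = struct.Struct("<2i5d2idi3d2i3didi3d")


def make_test_file(path, count):
    """Write `count` records of random values (hypothetical helper)."""
    # Unpacking a zero-filled record yields a tuple with the right field types.
    template = fmt1.unpack(bytes(fmt1.size))
    with open(path, "wb") as out:
        for _ in range(count):
            values = [
                random.randint(0, 1000) if isinstance(v, int) else random.random()
                for v in template
            ]
            out.write(fmt1.pack(*values))


def read_file_piecemeal(path):
    """Same technique as the first benchmarked snippet, taking the path as an argument."""
    structures = []
    with open(path, "rb") as inp:
        while True:
            buffer = inp.read(fmt1.size)
            if len(buffer) != fmt1.size:  # EOF?
                break
            structures.append(fmt1.unpack(buffer))
    return structures


def best_time(func, *args, executions=10, repetitions=3):
    """Best total time for `executions` calls, taken over `repetitions` runs."""
    timings = timeit.repeat(lambda: func(*args), number=executions, repeat=repetitions)
    return min(timings)


test_path = os.path.join(tempfile.mkdtemp(), "records.bin")
make_test_file(test_path, 1000)  # the original test file held 40,000 records
elapsed = best_time(read_file_piecemeal, test_path)
print("{:.5f} secs".format(elapsed))
```

Taking the minimum over the repetitions is the usual way to read timeit results, since the slower runs mostly reflect interference from other processes.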

And here were the results running them on my system:

Fastest to slowest execution speeds using Python 3.6.1
(10 executions, best of 3 repetitions)
Size of structure: 164
Number of structures in test file: 40,000
file size: 6,560,000 bytes

     Read array file (@randomir part 2): 0.06430 secs, relative  1.00x (   0.00% slower)
Read all-at-once, then slice and struct: 0.39634 secs, relative  6.16x ( 516.36% slower)
Read and constuct piecemeal with struct: 0.43283 secs, relative  6.73x ( 573.09% slower)
    Convert to array (@randomir part 1): 1.38310 secs, relative 21.51x (2050.87% slower)

Interestingly, three of the four snippets are actually faster in Python 2 (only the convert-to-array version is slightly slower)...

Fastest to slowest execution speeds using Python 2.7.13
(10 executions, best of 3 repetitions)
Size of structure: 164
Number of structures in test file: 40,000
file size: 6,560,000 bytes

     Read array file (@randomir part 2): 0.03586 secs, relative  1.00x (   0.00% slower)
Read all-at-once, then slice and struct: 0.27871 secs, relative  7.77x ( 677.17% slower)
Read and constuct piecemeal with struct: 0.40804 secs, relative 11.38x (1037.81% slower)
    Convert to array (@randomir part 1): 1.45830 secs, relative 40.66x (3966.41% slower)
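
One further variant that was not part of the benchmark above, and so has no measured timing here: in Python 3.4 and later, struct.Struct.iter_unpack() walks a buffer record by record without explicit slicing. The sketch below assumes the same record layout. The name read_with_iter_unpack and the zero-filled demo file are illustrative only.

```python
import os
import struct
import tempfile

# Assumption: same 164-byte record layout as in the benchmarked snippets.
record = struct.Struct("<2i5d2idi3d2i3didi3d")


def read_with_iter_unpack(path):
    """Read the whole file, then unpack every complete record in it."""
    with open(path, "rb") as inp:
        buffer = inp.read()
    # iter_unpack() requires a length that is a multiple of the record size,
    # so drop any trailing partial record first.
    usable = len(buffer) - len(buffer) % record.size
    return list(record.iter_unpack(memoryview(buffer)[:usable]))


# Demo file: three zero-filled records followed by two stray bytes.
demo_path = os.path.join(tempfile.mkdtemp(), "records.bin")
with open(demo_path, "wb") as out:
    out.write(bytes(record.size * 3) + b"\x00\x01")

records = read_with_iter_unpack(demo_path)
print(len(records), "records of", len(records[0]), "fields each")
```

This is not available in Python 2.7, so it cannot be compared across both interpreters the way the four snippets above were.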

Binary files are files that do not contain plain text, for example an image file. Like every file, they are stored on disk as a sequence of bytes, but those bytes do not represent characters in a text encoding, so they should not be opened in the default text mode and read as text.

You can read a binary file by opening it in binary mode with open('filename', 'rb').

When working on problems such as image classification in machine learning, you may need to open a file in binary mode and read its bytes to build a model. In binary mode, no attempt is made to decode the bytes into characters. In contrast, when you open a file in the default text mode, the bytes are decoded to a string using the specified (or default) encoding.
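
To make the difference concrete, here is a small sketch that writes a few bytes to a temporary file and reads them back in both modes. The file name and contents are made up for the illustration.

```python
import os
import tempfile

sample_path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(sample_path, "wb") as out:
    # "café" plus a newline: the é takes two bytes in UTF-8.
    out.write("caf\u00e9\n".encode("utf-8"))

with open(sample_path, "rb") as f:
    raw = f.read()   # bytes object, exactly as stored on disk

with open(sample_path, "r", encoding="utf-8") as f:
    text = f.read()  # str object, decoded from UTF-8

print(type(raw).__name__, len(raw))
print(type(text).__name__, len(text))
```

The binary read returns six bytes, while the text read returns a five-character string, because the two bytes of the é are decoded into one character.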

If You’re in a Hurry…

You can open the file with open(), adding b to the mode string so that it is opened in binary mode, and then read the file’s bytes.

open('filename', "rb") opens the binary file in read mode.

r – opens the file for reading
b – marks the file as binary; no attempt is made to decode the bytes to a string

Example

The example below reads the file one byte at a time and prints each byte.

try:
    # Raw string, so that \t in the path is not read as a tab character.
    with open(r"c:\temp\Binary_File.jpg", "rb") as f:
        byte = f.read(1)
        while byte:
            # Do stuff with byte.
            print(byte)
            byte = f.read(1)
except IOError:
    print('Error While Opening the file!')

If You Want to Understand Details, Read on…

In this tutorial, you’ll learn how to read binary files in different ways.

  • Read binary file byte by byte
  • Python Read Binary File into Byte Array
  • Python read binary file into numpy array
  • Read binary file Line by Line
  • Read Binary File Fully in One Shot
  • Python Read Binary File and Convert to Ascii
  • Read binary file into dataframe
  • Read binary file skip header
  • Reading Binary file using Pickle
  • Conclusion

Read binary file byte by byte

In this section, you’ll learn how to read a binary file byte by byte and print it. This is the simplest way to walk through a file, but not a fast one: every byte costs a separate read() call, so prefer larger reads for big files.

The file is opened using open() with the mode “rb”, which means reading mode and binary. No decoding of the bytes to a string is attempted; the data is simply read as bytes.

The example below shows how the file is read byte by byte using file.read(1).

The argument 1 ensures that one byte is read on each read() call.

Example

try:
    # Raw string, so that \t in the path is not read as a tab character.
    with open(r"c:\temp\Binary_File.jpg", "rb") as f:
        byte = f.read(1)
        while byte:
            # Do stuff with byte.
            print(byte)
            byte = f.read(1)
except IOError:
    print('Error While Opening the file!')

Output

    b'\xff'
    b'\xd8'
    b'\xff'
    b'\xe0'
    b'\x00'
    b'\x10'
    b'J'
    b'F'
    b'I'
    b'F'
    b'\x00'
    b'\x01'
    b'\x01'
    b'\x00'
    b'\x00'
    b'\x01'
    b'\x00'
    b'\x01'
    b'\x00'
    b'\x00'
    b'\xff'
    b'\xed'
    b'\x00'
    b'|'
    b'P'
    b'h'
    b'o'
    b't'
    b'o'
    b's'
    b'h'
    b'o'
    b'p'
    b' '
    b'3'
    b'.'
    b'0'
    b'\xc6'
    b'\xb3'
    b'\xff'
    b'\xd9'

Python Read Binary File into Byte Array

In this section, you’ll learn how to read a binary file into a byte array.

First, the file is opened in rb mode.

A byte array called mybytearray is initialized using bytearray().

Then the file is read one byte at a time using f.read(1), and each byte is appended to the byte array with the += operator.

Finally, you can print the bytearray to display the bytes that were read.

Example

try:
    # Raw string, so that \t in the path is not read as a tab character.
    with open(r"c:\temp\Binary_File.jpg", "rb") as f:

        mybytearray = bytearray()

        # Read the first six bytes, one at a time, and append each one.
        for _ in range(6):
            mybytearray += f.read(1)

        print(mybytearray)

except IOError:
    print('Error While Opening the file!')

Output

    bytearray(b'\xff\xd8\xff\xe0\x00\x10')
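
Appending one byte per call is fine for a handful of bytes, but slow for a whole file. As a sketch of an alternative, assuming the file fits in memory, you can preallocate the bytearray and let readinto() fill it in a single call. The helper name read_into_bytearray and the temporary demo file are illustrative.

```python
import os
import tempfile


def read_into_bytearray(path):
    """Read a whole file into a preallocated bytearray."""
    size = os.path.getsize(path)
    buffer = bytearray(size)         # preallocate room for the whole file
    with open(path, "rb") as f:
        filled = f.readinto(buffer)  # number of bytes actually read
    del buffer[filled:]              # trim if fewer bytes arrived than expected
    return buffer


# Demo file holding the same six bytes as the output above.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo_path, "wb") as out:
    out.write(b"\xff\xd8\xff\xe0\x00\x10")

mybytearray = read_into_bytearray(demo_path)
print(mybytearray)
```

If you do not need to preallocate, bytearray(f.read()) gives the same result with one extra copy.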

Python read binary file into numpy array

In this section, you’ll learn how to read the binary file into a NumPy array.

First, import the NumPy library with import numpy as np.

Then define the element type as unsigned 8-bit integers (single bytes) using np.dtype('B').

Next, open the binary file in reading mode.

Now, create the NumPy array with np.fromfile().

Its arguments are the file object and the dtype defined above. This creates a NumPy array of bytes.

numpy_data = np.fromfile(f,dtype)

Example

import numpy as np

dtype = np.dtype('B')
try:
    with open(r"c:\temp\Binary_File.jpg", "rb") as f:
        numpy_data = np.fromfile(f,dtype)
    print(numpy_data)
except IOError:
    print('Error While Opening the file!')    

Output

[255 216 255 ... 179 255 217]


The bytes are read into the NumPy array and then printed.

Read binary file Line by Line

In this section, you’ll learn how to read a binary file line by line.

You can read the file line by line using the readlines() method of the file object. In binary mode, a "line" is simply a run of bytes ending in the newline byte b'\n', so this is only meaningful for files that really are line-oriented.

Each line is stored as an item in a list, which can be iterated to access each line of the file.

rstrip() removes trailing whitespace bytes (including the newline) from each line before it is printed. It does not touch the beginning of the line.

Example

# Raw string, so that \t in the path is not read as a tab character.
with open(r"c:\temp\Binary_File.jpg", 'rb') as f:
    lines = f.readlines()

for line in lines:
    print(line.rstrip())

Output

    b'\x07\x07\x07\x07'
    b''
    b''
    b''
    b''
    b''
    b'\x0c\x0f\x0c\x0c\x0c\x0c\x0c\x0c\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x12\x12\x12\x12\x12\x12\x15\x15\x15\x15\x15\x17\x17\x17\x17\x17\x17\x17\x17\x17\x17\xff\xdb\x00C\x01\x04\x04\x04\x06\x06\x06'
    b'\x06\x06'
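
One caveat is worth a sketch: on bytes, rstrip() with no argument strips every trailing ASCII whitespace byte, so it can remove bytes that are real data (0x20, 0x09, 0x0c and so on). The byte string below is made up for the illustration, and io.BytesIO stands in for a file opened in rb mode.

```python
import io

# Made-up binary data containing newline bytes and a trailing space byte.
data = b"\x07\x07 \n\n\x0c\x0f\t\n\x06\x06"

lines = io.BytesIO(data).readlines()   # split after every b"\n"

stripped_all = [line.rstrip() for line in lines]           # strips any trailing whitespace
stripped_newline = [line.rstrip(b"\n") for line in lines]  # strips only the newline

print(lines)
print(stripped_all)
print(stripped_newline)
```

Passing b"\n" to rstrip() keeps the trailing space and tab bytes intact, which matters when those bytes belong to the data.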

Read Binary File Fully in One Shot

In this section, you’ll learn how to read a binary file in one shot.

You can do this by calling file.read() with no argument or with -1. Either form reads the file to its end in a single call, as shown below.

Example

try:
    # Raw string, so that \t in the path is not read as a tab character.
    with open(r"c:\temp\Binary_File.jpg", 'rb') as f:
        binarycontent = f.read(-1)  # -1 (the default) reads to end of file
    print(binarycontent)
except IOError:
    print('Error While Opening the file!')

Output

 b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xed\x00|Photoshop 3.0\x008BIM\x04\x04\x00\x00\x00\x00\x00\x1c\x02(\x00ZFBMD2300096c010000fe0e000032160000051b00003d2b000055300000d6360000bb3c0000ce4100008b490000\x00\xff\xdb\x00C\x00\x03\x03\x03\x03\x03\x03\x05\x03\x03\x05\x07\x05\x05\x05\x07\n\x07\x07\x07\x07\n\x0c\n\n\n\n\n\x0c\x0f\x0c\x0c\x0c\x0c\x0c\x0c\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x12\x12\x12\x12\x12\x12\x15\x15\x15\x15\x15\x17\x17\x17\x17\x17\x17\x17\x17\x17\x17\xff\xdb\x00C\x01\x04\x04\x04\x06\x06\x06\n\x06\x06\n\x18\x11\x0e\x11\x18\x18\x18\x18\x18\x18\x18\x18\x18\x18\x18\x18\x18\x18\x18
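
On Python 3.5 and later, pathlib offers the same one-shot read without an explicit open(). A minimal sketch, using a made-up temporary file in place of the image:

```python
import tempfile
from pathlib import Path

# Demo file; in practice this would be the path to your binary file.
demo_file = Path(tempfile.mkdtemp()) / "demo.bin"
demo_file.write_bytes(b"\xff\xd8\xff\xe0\x00\x10JFIF")

# read_bytes() opens the file in binary mode, reads it fully and closes it.
binarycontent = demo_file.read_bytes()
print(len(binarycontent), binarycontent[:4])
```

Either way the whole file ends up in memory, so this suits files that are small relative to available RAM.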

Python Read Binary File and Convert to Ascii

In this section, you’ll learn how to read a binary file and convert its bytes to ASCII text using the binascii library. The conversion used here is uuencoding, which represents arbitrary bytes with printable ASCII characters.

Read the file as binary as explained in the previous section.

Next, call binascii.b2a_uu(bytes). It converts the bytes into one line of uuencoded ASCII and returns it as a bytes object ending in a newline. A single call accepts at most 45 bytes, which is why the example reads only 45.

Then you can print the result to check the ASCII characters.

Example

import binascii

try:
    # Raw string, so that \t in the path is not read as a tab character.
    with open(r"c:\temp\Binary_File.jpg", "rb") as f:
        mybytes = f.read(45)  # b2a_uu() accepts at most 45 bytes per call
        data_bytes2ascii = binascii.b2a_uu(mybytes)
        print("Binary String to Ascii")
        print(data_bytes2ascii)
except IOError:
    print("Error While opening the file!")

Output

 Binary String to Ascii
 b'M_]C_X  02D9)1@ ! 0   0 !  #_[0!\\4&AO=&]S:&]P(#,N,  X0DE-! 0 \n'
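
Because b2a_uu() takes at most 45 bytes per call, converting more data means encoding it in 45-byte pieces. The sketch below shows one way to do that and to reverse it with binascii.a2b_uu(). The helper names uuencode_lines and uudecode_lines are hypothetical, and the payload is made-up test data.

```python
import binascii


def uuencode_lines(data, chunk_size=45):
    """Uuencode `data` in pieces of at most 45 bytes, one ASCII line per piece."""
    return [
        binascii.b2a_uu(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def uudecode_lines(lines):
    """Reverse uuencode_lines() and return the original bytes."""
    return b"".join(binascii.a2b_uu(line) for line in lines)


payload = bytes(range(256))       # made-up binary data
encoded = uuencode_lines(payload)
decoded = uudecode_lines(encoded)

print(len(encoded), "lines")
print(decoded == payload)
```

If all you need is a printable view of the bytes, binascii.hexlify() or the base64 module may be simpler choices; uuencoding is shown here only because it is what the example above uses.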

Read binary file into dataframe

In this section, you’ll learn how to read a binary file into a pandas dataframe.

First, read the binary file into a NumPy array, because pandas has no method that reads a raw binary file into a dataframe directly.

Once you have the NumPy array, pass it to pd.DataFrame(). You’ll then have a dataframe containing the bytes read from the binary file.

Example

import numpy as np
import pandas as pd

try:
    # 'B' reads each byte as an unsigned 8-bit integer.
    dt = np.dtype('B')
    # Raw string, so that \t in the path is not read as a tab character.
    data = np.fromfile(r"c:\temp\Binary_File.jpg", dtype=dt)
    df = pd.DataFrame(data)
    print(df)
except IOError:
    print("Error while opening the file!")

Output

             0
    0      255
    1      216
    2      255
    3      224
    4        0
    ...    ...
    18822    0
    18823  198
    18824  179
    18825  255
    18826  217

    [18827 rows x 1 columns]

This is how you can read a binary file using NumPy and use that NumPy array to create the pandas dataframe.

From the NumPy array, you can also load the bytes into a dictionary.

Read binary file skip header

In this section, you’ll learn how to read a binary file while skipping its header line. Some binary files begin with an ASCII header.

Skipping the header is useful when reading such files.

You can call readlines() on the file object and slice the returned list with [1:], which keeps the lines from index 1 onward.

The ASCII header, line 0, is dropped.

Example

# Raw string, so that \t in the path is not read as a tab character.
with open(r"c:\temp\Binary_File.jpg", 'rb') as f:
    lines = f.readlines()[1:]

for line in lines:
    print(line.rstrip())

Output

    b'\x07\x07\x07\x07'
    b''
    b''
    b''
    b''
    b''
    b'\x0c\x0f\x0c\x0c\x0c\x0c\x0c\x0c\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x0f\x12\x12\x12\x12\x12\x12\x15\x15\x15\x15\x15\x17\x17\x17\x17\x17\x17\x17\x17\x17\x17\xff\xdb\x00C\x01\x04\x04\x04\x06\x06\x06'
    b'\x06\x06'

    b"\x93\x80\x18\x98\xc9\xdc\x8bm\x90&'\xc5U\xb18\x81\xc7y\xf0\x80\x00\x14\x1c\xceQd\x83\x13\xa0\xbf-D9\xe0\xae;\x8f\\LK\xb8\xc3\x8ae\xd4\xd1C\x10\x7f\x02\x02\xa6\x822K&D\x9a\x04\xd4\xc8\xfbC\x87\xf2\x8d\xdcN\xdes)rq\xbbI\x92\xb6\xeeu8\x1d\xfdG\xabv\xe8q\xa5\xb6\xb56\xe0\xa1\x06\x84n#\xf0\x1c\x86\xb0\x83\xee\x99\xe7\xc6\xaaN\xafY\xdf\xd9\xcfe\xd5\x84"

    b'\xd9\x0b\xc2\x1b0\xa1Q\x17\x88\xb4et\x81u8\xed\xf5\xe8\xd9#c\t\xf9\xc0\xa7\x06\xa2/={\x87l\x01K\x870\xe3\xa1\x024\xdc^\x11\x96\x96\xba\[email protected]\x91A\xd6U\xea\xe1\xbb\xb733'
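
When the part after the header is raw binary data rather than lines, splitting it with readlines() breaks the payload apart at every 0x0A byte. A sketch of an alternative: consume just the header with readline(), then read the rest in one piece. The file layout here (one ASCII header line followed by raw bytes) and the name read_payload are assumptions made for the illustration.

```python
import os
import tempfile


def read_payload(path):
    """Return (header, payload) for a file with a one-line ASCII header."""
    with open(path, "rb") as f:
        header = f.readline()  # consumes everything through the first b"\n"
        payload = f.read()     # the rest of the file, untouched
    return header.rstrip(b"\r\n"), payload


# Demo file: a made-up header line, then four raw bytes (one of them is 0x0A).
demo_path = os.path.join(tempfile.mkdtemp(), "with_header.bin")
with open(demo_path, "wb") as out:
    out.write(b"DEMO v1\n" + bytes([0, 10, 255, 128]))

header, payload = read_payload(demo_path)
print(header)
print(payload)
```

For a header of known fixed size, f.seek(header_size) followed by f.read() does the same job without looking for a newline.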

Reading Binary file using Pickle

In this section, you’ll learn how to read binary files in Python using pickle.

This is tricky, because pickle can only load files that were written in the pickle format. Trying to unpickle any other binary file, such as an image, fails with errors like "invalid load key".

Hence this method is not recommended for reading arbitrary binary files.

Example

import pickle


file_to_read = open(r"c:\temp\Binary_File.jpg", "rb")

loaded_dictionary = pickle.load(file_to_read)

print(loaded_dictionary)

Output

    ---------------------------------------------------------------------------

    UnpicklingError                           Traceback (most recent call last)

    <ipython-input-23-dea0d83e3f49> in <module>
          7 file_to_read = open("E:\Vikram_Blogging\Stack_Vidhya\Python_Notebooks\Read_Binary_File_Python\Binary_File.jpg", "rb")
          8 
    ----> 9 loaded_dictionary = pickle.load(file_to_read)
         10 
         11 print(loaded_dictionary)


    UnpicklingError: invalid load key, '\xff'.
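
For contrast, here is a sketch of the case pickle is meant for: reading back a file that was itself written with pickle.dump(). The dictionary contents and file name are made up. The last part reproduces the error above by feeding pickle the first bytes of a JPEG.

```python
import os
import pickle
import tempfile

record = {"name": "sample", "values": [1, 2, 3]}  # made-up data

pickle_path = os.path.join(tempfile.mkdtemp(), "record.pkl")
with open(pickle_path, "wb") as out:
    pickle.dump(record, out)    # writes the pickle format

with open(pickle_path, "rb") as file_to_read:
    loaded_dictionary = pickle.load(file_to_read)

print(loaded_dictionary)

# Bytes that are not a pickle stream (here, the start of a JPEG) are rejected.
error_message = None
try:
    pickle.loads(b"\xff\xd8\xff\xe0")
except pickle.UnpicklingError as exc:
    error_message = str(exc)
print(error_message)
```

Only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.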

Conclusion

Reading binary files is a common task. For example, reading the bytes of an image file is useful when you are working on image classification problems: you can open the image in binary mode and use its bytes to build the model.

In this tutorial, you’ve learned the different methods and libraries available for reading binary files in Python.

If you have any questions, feel free to comment below.

How do I read a binary file in Python?

To open a file in binary format, add 'b' to the mode parameter. Hence the "rb" mode opens the file in binary format for reading, while the "wb" mode opens the file in binary format for writing. Unlike text files, binary files are not human-readable. When opened using any text editor, the data is unrecognizable.

How do you read a binary file in chunks in Python?

To read a binary file in chunks:

  • Assign the size of each chunk to the variable chunk_size.
  • Assign the file to be read to the variable image_file.
  • Open image_file in binary mode.
  • Start a while True loop.
  • On each pass of the loop, read one chunk of the open file, and stop when nothing is left.
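
The steps above can be sketched as follows. The names chunk_size and image_file come from the description. The generator read_in_chunks, the 4096-byte chunk size and the temporary demo file are assumptions made for the illustration.

```python
import os
import tempfile


def read_in_chunks(image_file, chunk_size=4096):
    """Yield successive chunks of at most chunk_size bytes from image_file."""
    with open(image_file, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty bytes object means end of file
                break
            yield chunk


# Demo file of 10,000 zero bytes standing in for an image.
image_file = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(image_file, "wb") as out:
    out.write(bytes(10000))

chunk_sizes = [len(chunk) for chunk in read_in_chunks(image_file)]
print(chunk_sizes)
```

Reading in chunks keeps memory use bounded by chunk_size, which is the main reason to prefer it over a one-shot read for large files.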