Unique File Format
The unique file format is a simple binary file format that stores, for each
position of a sequence (identified by its sequence ID, e.g. chr1), the minimum
length k at which the k-mer at that position is unique when matched against a
generated index (see index). The range of values for k depends on the
parameters passed to search. A value of 0 marks a position in the sequence
where no unique length was found.
Unique files can be used to generate mappability datasets for a given k-mer
size, since a k-mer that is unique in the sequence is assumed to remain unique
when extended to a (k+1)-mer. The conversion to a mappability dataset for a
given k-mer length is done with the track tool.
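The rule above can be illustrated with a minimal NumPy sketch. It is not the track tool's implementation; it only assumes, from the description above, that a position counts as unique for a given k when its stored minimum length is nonzero and no greater than k. The function name `kmer_unique_mask` and the sample values are hypothetical.

```python
import numpy as np


def kmer_unique_mask(unique_lengths: np.ndarray, k: int) -> np.ndarray:
    """Return a boolean mask of positions whose k-mer is assumed unique.

    A position counts as unique for k when a minimum unique length was
    found there (nonzero) and that length is no longer than k.
    """
    return (unique_lengths > 0) & (unique_lengths <= k)


# Hypothetical minimum unique lengths for six positions (0 = none found).
example = np.array([0, 20, 24, 30, 0, 50], dtype=np.uint8)
mask = kmer_unique_mask(example, k=24)
print(mask.astype(int))  # [0 1 1 0 0 0]
```

Only the positions with stored lengths 20 and 24 pass for k = 24; the zeros and the longer lengths do not.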
Data types
The suffix of the filename specifies the type of the underlying binary data.
The suffix uint8 specifies that each minimum length is stored as a single
unsigned 8-bit integer (1 byte each); uint16 likewise stores each length as an
unsigned 16-bit integer (2 bytes each). No other data is stored in the file.
The data type is chosen based on the maximum length in the range passed to
search. For example, with a search range from 20 to 255, every minimum unique
length is less than or equal to 255 (the maximum value that can be represented
with an unsigned byte), so the uint8 format is used.
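As a rough sketch of the selection rule described above, the helpers below pick a type from a maximum length and infer the type from a filename suffix. Both function names are hypothetical. The sketch also assumes that any maximum above 255 maps to uint16 and that values above 65535 are rejected; the text above only states the uint8 case explicitly. The byte order of uint16 files is not specified here, so NumPy's native order is assumed.

```python
import numpy as np


def dtype_for_max_length(max_length: int) -> np.dtype:
    """Pick the narrower of the two documented types that can hold max_length."""
    if max_length <= np.iinfo(np.uint8).max:  # 255
        return np.dtype(np.uint8)
    if max_length <= np.iinfo(np.uint16).max:  # 65535
        return np.dtype(np.uint16)
    # Assumption: lengths beyond uint16 are not representable in this format.
    raise ValueError(f"max_length {max_length} does not fit in uint16")


def dtype_from_filename(filename: str) -> np.dtype:
    """Infer the element type from the filename suffix (uint8 or uint16)."""
    suffix = filename.rsplit(".", 1)[-1]
    if suffix not in ("uint8", "uint16"):
        raise ValueError(f"unrecognized suffix: {suffix!r}")
    return np.dtype(suffix)  # native byte order assumed


print(dtype_for_max_length(255))
print(dtype_for_max_length(256))
print(dtype_from_filename("chr1.unique.uint16").itemsize)
```

Because no other data is stored in the file, the file size in bytes divided by the item size should equal the number of positions in the sequence, which makes a convenient sanity check after loading.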
Usage
Since the file format is simple, it can be easily processed in any language.
Below is an example of how to read a unique.uint8 file in Python using
numpy.
import numpy as np
unique_lengths = np.fromfile('chr1.unique.uint8', dtype=np.uint8)
# Print the fraction of positions where a unique length was found (nonzero entries)
print(np.count_nonzero(unique_lengths) / len(unique_lengths))
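Building on the example above, the following sketch summarizes the nonzero entries of an already loaded array. The helper name `summarize_unique_lengths` is hypothetical, and the sample array merely stands in for the result of `np.fromfile`; it is not real data.

```python
import numpy as np


def summarize_unique_lengths(unique_lengths: np.ndarray) -> dict:
    """Summarize the nonzero entries of a unique-length array."""
    found = unique_lengths[unique_lengths > 0]
    if found.size == 0:
        # No position has a unique length (or the array is empty).
        return {"fraction_found": 0.0, "min": None, "max": None}
    return {
        "fraction_found": found.size / unique_lengths.size,
        "min": int(found.min()),
        "max": int(found.max()),
    }


# Hypothetical data standing in for
# np.fromfile('chr1.unique.uint8', dtype=np.uint8).
unique_lengths = np.array([0, 20, 24, 30, 0, 50], dtype=np.uint8)
print(summarize_unique_lengths(unique_lengths))
```

The reported minimum and maximum should fall within the range passed to search, since zeros are excluded before they are computed.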