Safe Haskell | Safe-Inferred |
---|
Bio.Sequence
Contents
Description
This is a meta-module importing and re-exporting sequence-related stuff.
It encompasses the Bio.Sequence.SeqData, Bio.Sequence.Fasta, and Bio.Sequence.TwoBit modules.
- data Sequence t = Seq !SeqData !SeqData !(Maybe QualData)
- data Unknown
- type Offset = Int64
- type SeqData = ByteString
- type Qual = Word8
- type QualData = ByteString
- seqlength :: Sequence a -> Offset
- seqlabel :: Sequence a -> SeqData
- seqheader :: Sequence a -> SeqData
- seqdata :: Sequence a -> SeqData
- seqqual :: Sequence a -> QualData
- (!) :: Sequence a -> Offset -> Char
- appendHeader :: Sequence a -> String -> Sequence a
- setHeader :: Sequence a -> String -> Sequence a
- fromStr :: String -> SeqData
- toStr :: SeqData -> String
- compl :: Char -> Char
- revcompl :: Sequence Nuc -> Sequence Nuc
- revcompl' :: SeqData -> SeqData
- data Nuc
- castToNuc :: Sequence a -> Sequence Nuc
- data Amino
- translate :: Sequence Nuc -> Offset -> [Amino]
- fromIUPAC :: SeqData -> [Amino]
- toIUPAC :: [Amino] -> SeqData
- castToAmino :: Sequence a -> Sequence Amino
- defragSeq :: Sequence t -> Sequence t
- seqmap :: ((Char, Qual) -> (Char, Qual)) -> Sequence t -> Sequence t
- readNuc :: FilePath -> IO [Sequence Nuc]
- readProt :: FilePath -> IO [Sequence Amino]
- readFasta :: FilePath -> IO [Sequence Unknown]
- hReadFasta :: Handle -> IO [Sequence Unknown]
- writeFasta :: FilePath -> [Sequence a] -> IO ()
- hWriteFasta :: Handle -> [Sequence a] -> IO ()
- readQual :: FilePath -> IO [Sequence Unknown]
- writeQual :: FilePath -> [Sequence a] -> IO ()
- hWriteQual :: Handle -> [Sequence a] -> IO ()
- readFastaQual :: FilePath -> FilePath -> IO [Sequence Unknown]
- writeFastaQual :: FilePath -> FilePath -> [Sequence a] -> IO ()
- hWriteFastaQual :: Handle -> Handle -> [Sequence a] -> IO ()
- readFastQ :: FilePath -> IO [Sequence Nuc]
- writeFastQ :: FilePath -> [Sequence Nuc] -> IO ()
- hReadFastQ :: Handle -> IO [Sequence Nuc]
- hWriteFastQ :: Handle -> [Sequence Nuc] -> IO ()
- readSangerQ :: FilePath -> IO [Sequence Nuc]
- writeSangerQ :: FilePath -> [Sequence Nuc] -> IO ()
- hReadSangerQ :: Handle -> IO [Sequence Nuc]
- hWriteSangerQ :: Handle -> [Sequence Nuc] -> IO ()
- readIllumina :: FilePath -> IO [Sequence Nuc]
- writeIllumina :: FilePath -> [Sequence Nuc] -> IO ()
- hReadIllumina :: Handle -> IO [Sequence Nuc]
- hWriteIllumina :: Handle -> [Sequence Nuc] -> IO ()
- readPhd :: FilePath -> IO (Sequence Nuc)
- hReadPhd :: Handle -> IO (Sequence Nuc)
- decode2Bit :: ByteString -> [Sequence Nuc]
- read2Bit :: FilePath -> IO [Sequence Nuc]
- hRead2Bit :: Handle -> IO [Sequence Nuc]
- data HashF k = HF {}
- contigous :: (Integral k, Show k, Eq k) => Int -> HashF k
- rcontig :: (Integral k, Show k, Eq k) => Int -> HashF k
- rcpacked :: (Integral k, Show k, Eq k) => Int -> HashF k
- class KWords s where
- entropy :: (Ord str, KWords str) => Int -> str -> Double
Data structures etc (Bio.Sequence.SeqData)
data Sequence t
A sequence consists of a header, the sequence data itself, and optional quality data. The type parameter is a phantom type to separate nucleotide and amino acid sequences
data Unknown
type SeqData = ByteString
The basic data type used in Sequence
s
Basic type for quality data. Range 0..255. Typical Phred output is in the range 6..50, with 20 as the line in the sand separating good from bad.
type QualData = ByteString
Quality data is a Qual
vector, currently implemented as a ByteString
.
Accessor functions
seqqual :: Sequence a -> QualData
Return the quality data, or error if none exist. Use hasqual if in doubt.
appendHeader :: Sequence a -> String -> Sequence a
Modify the header by appending text, or by replacing all but the sequence label (i.e. first word).
setHeader :: Sequence a -> String -> Sequence a
Modify the header by appending text, or by replacing all but the sequence label (i.e. first word).
Converting to and from String.
Nucleotide functionality.
Complement a single character. I.e. identify the nucleotide it
can hybridize with. Note that for multiple nucleotides, you usually
want the reverse complement (see revcompl
for that).
revcompl :: Sequence Nuc -> Sequence Nuc
Calculate the reverse complement. This is only relevant for the nucleotide alphabet, and it leaves other characters unmodified.
Protein sequence functionality
data Amino
translate :: Sequence Nuc -> Offset -> [Amino]
Translate a nucleotide sequence into the corresponding protein sequence. This works rather blindly, with no attempt to identify ORFs or otherwise QA the result.
castToAmino :: Sequence a -> Sequence Amino
Other utility functions
defragSeq :: Sequence t -> Sequence t
Returns a sequence with all internal storage freshly copied and with sequence and quality data present as a single chunk.
By freshly copying internal storage, defragSeq
allows garbage
collection of the original data source whence the sequence was
read; otherwise, use of just a short sequence name can cause an
entire sequence file buffer to be retained.
By compacting sequence data into a single chunk, defragSeq
avoids
linear-time traversal of sequence chunks during random access into
sequence data.
seqmap :: ((Char, Qual) -> (Char, Qual)) -> Sequence t -> Sequence t
map over sequences, treating them as a sequence of (char,word8) pairs. This will work on sequences without quality, as long as the function doesn't try to examine it. The current implementation is not very efficient.
File IO
Generic sequence reading
readNuc :: FilePath -> IO [Sequence Nuc]
Read nucleotide sequences in any format - Fasta, SFF, FastQ, 2bit, PHD... Todo: detect Illumina vs Sanger FastQ, transparent compression
readProt :: FilePath -> IO [Sequence Amino]
Read protein sequences in any supported format (i.e. Fasta)
The Fasta file format (Bio.Sequence.Fasta)
hReadFasta :: Handle -> IO [Sequence Unknown]
Lazily read sequence from handle
writeFasta :: FilePath -> [Sequence a] -> IO ()
Write sequences to a FASTA-formatted file. Line length is 60.
hWriteFasta :: Handle -> [Sequence a] -> IO ()
Write sequences in FASTA format to a handle.
Quality data
Not part of the Fasta format, and treated separately.
hWriteQual :: Handle -> [Sequence a] -> IO ()
readFastaQual :: FilePath -> FilePath -> IO [Sequence Unknown]
Read sequence and associated quality. Will error if the sequences and qualites do not match one-to-one in sequence.
writeFastaQual :: FilePath -> FilePath -> [Sequence a] -> IO ()
Write sequence and quality data simulatnously This may be more laziness-friendly.
hWriteFastaQual :: Handle -> Handle -> [Sequence a] -> IO ()
The FastQ format (Bio.Sequence.FastQ)
readFastQ :: FilePath -> IO [Sequence Nuc]
Deprecated: FastQ assumes Sanger-style quality info use {read,write}SangerQ or -Illumina instead
writeFastQ :: FilePath -> [Sequence Nuc] -> IO ()
Deprecated: FastQ assumes Sanger-style quality info use {read,write}SangerQ or -Illumina instead
hReadFastQ :: Handle -> IO [Sequence Nuc]
Deprecated: FastQ assumes Sanger-style quality info use {read,write}SangerQ or -Illumina instead
hWriteFastQ :: Handle -> [Sequence Nuc] -> IO ()
Deprecated: FastQ assumes Sanger-style quality info use {read,write}SangerQ or -Illumina instead
readSangerQ :: FilePath -> IO [Sequence Nuc]
writeSangerQ :: FilePath -> [Sequence Nuc] -> IO ()
hReadSangerQ :: Handle -> IO [Sequence Nuc]
hWriteSangerQ :: Handle -> [Sequence Nuc] -> IO ()
readIllumina :: FilePath -> IO [Sequence Nuc]
writeIllumina :: FilePath -> [Sequence Nuc] -> IO ()
hReadIllumina :: Handle -> IO [Sequence Nuc]
hWriteIllumina :: Handle -> [Sequence Nuc] -> IO ()
The phd file format (Bio.Sequence.Phd)
These contain base (nucleotide) calling information,
and are generated by phred
.
TwoBit file format support (Bio.Seqeunce.TwoBit)
Used by BLAT
and related tools.
decode2Bit :: ByteString -> [Sequence Nuc]
Parse a (lazy) ByteString as sequences in the 2bit format.
read2Bit :: FilePath -> IO [Sequence Nuc]
Read sequences from a file in 2bit format and | unmarshall/deserialize into Sequence format.
hRead2Bit :: Handle -> IO [Sequence Nuc]
Read sequences from a file handle in the 2bit format and | unmarshall/deserialze into Sequence format.
Hashing functionality (Bio.Sequence.HashWord)
Packing words from sequences into integral data types
data HashF k
This is a struct for containing a set of hashing functions
contigous :: (Integral k, Show k, Eq k) => Int -> HashF k
Contigous constructs an int/eger from a contigous k-word.
rcontig :: (Integral k, Show k, Eq k) => Int -> HashF k
Like contigous
, but returns the same hash for a word and its reverse complement.
rcpacked :: (Integral k, Show k, Eq k) => Int -> HashF k
Like rcontig
, but ignoring monomers (i.e. arbitrarily long runs of a single nucelotide
are treated the same a single nucleotide.