User’s Guide
- dataset-creator needs a list of SeqRecordExpanded objects:
>>> from seqrecord_expanded import SeqRecord
>>>
>>> # `table` is the Translation Table code based on NCBI
>>> seq_record1 = SeqRecord('ACTACCTA', reading_frame=2, gene_code='RpS5',
... table=1, voucher_code='CP100-10',
... taxonomy={'genus': 'Aus', 'species': 'bus'})
>>>
>>> seq_record2 = SeqRecord('ACTACCTA', reading_frame=2, gene_code='RpS5',
... table=1, voucher_code='CP100-10',
... taxonomy={'genus': 'Aus', 'species': 'bus'})
>>>
>>> seq_record3 = SeqRecord('ACTACCTA', reading_frame=2, gene_code='wingless',
... table=1, voucher_code='CP100-10',
... taxonomy={'genus': 'Aus', 'species': 'bus'})
>>>
>>> seq_record4 = SeqRecord('ACTACCTA', reading_frame=2, gene_code='wingless',
... table=1, voucher_code='CP100-10',
... taxonomy={'genus': 'Aus', 'species': 'bus'})
>>>
>>> seq_records = [
... seq_record1, seq_record2, seq_record3, seq_record4,
... ]
- Create an aminoacid dataset from your nucleotide sequences:
>>> from dataset_creator import Dataset
>>> dataset = Dataset(seq_records, format='NEXUS', partitioning='by gene',
... codon_positions='ALL', aminoacids=True)
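As a rough illustration of what an aminoacid dataset involves, the sketch below translates a nucleotide string with NCBI translation table 1 (the standard code). It is not dataset-creator's implementation: the `translate` helper is hypothetical, and it assumes that `reading_frame=N` means the first complete codon starts at base N.

```python
from itertools import product

# NCBI translation table 1 (standard code), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, reading_frame=1):
    """Translate a nucleotide string (hypothetical helper, not the library API).

    Assumption: reading_frame=N means the first complete codon starts at base N.
    Codons containing ambiguous bases become 'X'; a trailing partial codon is dropped.
    """
    coding = seq.upper()[reading_frame - 1:]
    usable = len(coding) - len(coding) % 3
    return "".join(
        CODON_TABLE.get(coding[i:i + 3], "X") for i in range(0, usable, 3)
    )


protein = translate("ACTACCTA", reading_frame=2)
```

Under that reading-frame assumption, the sequence used in the records above yields the codons CTA and CCT, and the final unpaired base is ignored.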
- Create a dataset with degenerated nucleotide sequences using the method by Zwick et al.:
>>> # The degenerate method can be 'S', 'Z', 'SZ' or 'normal'
>>> dataset = Dataset(seq_records, format='NEXUS', partitioning='by gene',
... codon_positions='ALL', degenerate='S')
- Create dataset specifying the outgroup by its voucher code:
>>> dataset = Dataset(seq_records, format='NEXUS', partitioning='by gene',
... outgroup='CP100-10')
- Codon positions can be 1st, 2nd, 3rd, 1st-2nd or ALL (default):
>>> dataset = Dataset(seq_records, format='TNT', partitioning='by codon position',
... codon_positions='ALL')
>>> dataset = Dataset(seq_records, format='PHYLIP', partitioning='1st-2nd, 3rd',
... codon_positions='ALL')
>>> dataset = Dataset(seq_records, format='NEXUS', partitioning='by gene',
... codon_positions='1st')
- The dataset is returned as a string:
>>> print(dataset.dataset_str)
#NEXUS
blah blah ...
API Guide
dataset_creator package
Submodules
dataset_creator.base_dataset module
- class dataset_creator.base_dataset.BasePairCount(reading_frame=None, codon_positions=None, partitioning=None, count_start=None, count_end=None)
Bases: builtins.object
Uses reading frame info, partitioning method and number of codon positions to return corrected base pair count for charset lines.
Example
>>> bp_count = BasePairCount(reading_frame=1, codon_positions='1st-2nd',
... partitioning='by codon position',
... count_start=100, count_end=512)
>>> bp_count.get_corrected_count()
[
    '100-512',
    '101-513',
]
- class dataset_creator.base_dataset.DatasetBlock(data, codon_positions, partitioning, aminoacids=None, degenerate=None, format=None, outgroup=None)
Bases: builtins.object
By default, the data sequences block generated is NEXUS and we use BioPython tools to convert it to other formats such as FASTA. However, sometimes the blo
Parameters: - data (named tuple) – containing:
    * gene_codes: list
    * number_chars: string
    * number_taxa: string
    * seq_records: list of SeqRecordExpanded objects
    * gene_codes_and_lengths: OrderedDict
- codon_positions (str) – Can be 1st, 2nd, 3rd, 1st-2nd, ALL (default).
- partitioning (str) –
- aminoacids (boolean) –
- degenerate (str) –
- format (str) – NEXUS, PHYLIP or FASTA.
- outgroup (str) – Specimen code of taxon that should be used as outgroup.
- convert_block_dicts_to_string(block_1st2nd, block_1st, block_2nd, block_3rd)
Takes into account whether we need to output all codon positions.
- convert_to_string(block)
Makes gene_block as str from list of SeqRecordExpanded objects of a gene_code.
Override this function if the dataset block needs to be different due to file format.
This block will need to be split further if the dataset is FASTA or TNT and the partitioning scheme is 1st-2nd, 3rd.
As the dataset is split into several blocks due to 1st-2nd, 3rd we cannot translate to aminoacids or degenerate the sequences.
- dataset_block()
Creates the block with taxon names and their sequences.
Override this function if the dataset block needs to be different due to file format.
Example
CP100_10_Aus_aus ACGATRGACGATRA...
CP100_11_Aus_bus ACGATRGACGATRA...
...
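A minimal sketch of how one line of such a block could be assembled from a voucher code, taxonomy and sequence. The helper names and the column width are assumptions made for illustration; the library's own formatting may differ.

```python
def taxon_name(voucher_code, genus, species):
    """Join voucher code and taxonomy, replacing dashes with underscores."""
    parts = [voucher_code, genus, species]
    return "_".join(part.replace("-", "_") for part in parts if part)


def block_line(name, sequence, width=30):
    """Pad the taxon name to a fixed-width column (width is an assumption)."""
    return name.ljust(width) + sequence


line = block_line(taxon_name("CP100-10", "Aus", "aus"), "ACGATRGACGATRA")
```

Padding the name column keeps the sequences aligned when several taxa are written one per line.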
- split_data()
Splits the list of SeqRecordExpanded objects into lists, which are kept in a bigger list.
If the file format is NEXUS, the data is partitioned only by gene. If it is FASTA, it is also partitioned by codon positions when required.
Example
>>> blocks = [
... [SeqRecord1, SeqRecord2], # for gene 1
... [SeqRecord1, SeqRecord2], # for gene 2
... [SeqRecord1, SeqRecord2], # for gene 3
... [SeqRecord1, SeqRecord2], # for gene 4
... ]
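The per-gene grouping described above can be sketched with `itertools.groupby`. This is only an illustration: `Record` is a stand-in for SeqRecordExpanded with just the fields needed here, and `split_by_gene` is a hypothetical helper that relies on the input already being sorted by gene_code.

```python
from collections import namedtuple
from itertools import groupby

# Stand-in for SeqRecordExpanded (hypothetical, only the fields used here).
Record = namedtuple("Record", ["gene_code", "voucher_code", "seq"])


def split_by_gene(seq_records):
    """Group consecutive records sharing a gene_code into sub-lists.

    groupby only merges adjacent items, so the input must be sorted by gene_code.
    """
    return [
        list(group)
        for _, group in groupby(seq_records, key=lambda record: record.gene_code)
    ]


records = [
    Record("RpS5", "CP100-10", "ACTACCTA"),
    Record("RpS5", "CP100-11", "ACTACCTA"),
    Record("wingless", "CP100-10", "ACTACCTA"),
    Record("wingless", "CP100-11", "ACTACCTA"),
]
blocks = split_by_gene(records)
```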
Bases: builtins.object
Builds charset block:
Parameters: - data (namedtuple) – with necessary info for dataset creation.
- codon_positions (str) – 1st, 2nd, 3rd, 1st-2nd, ALL.
- partitioning (str) – by gene, by codon position, 1st-2nd, 3rd.
- outgroup (str) – voucher code to be used as outgroup for NEXUS and TNT files.
Example
begin mrbayes;
    charset ArgKin = 1-596;
    charset COI-begin = 597-1265;
    charset COI_end = 1266-2071;
    charset ef1a = 2072-3311;
    charset RpS2 = 3312-3722;
    charset RpS5 = 3723-4339;
    charset wingless = 4340-4739;
    set autoclose=yes;
    prset applyto=(all) ratepr=variable brlensp=unconstrained:Exp(100.0) shapepr=exp(1.0) tratiopr=beta(2.0,1.0);
    lset applyto=(all) nst=mixed rates=gamma [invgamma];
    unlink statefreq=(all);
    unlink shape=(all) revmat=(all) tratio=(all) [pinvar=(all)];
    mcmc ngen=10000000 printfreq=1000 samplefreq=1000 nchains=4 nruns=2 savebrlens=yes [temp=0.11];
    sump relburnin=yes [no] burninfrac=0.25 [2500];
    sumt relburnin=yes [no] burninfrac=0.25 [2500] contype=halfcompat [allcompat];
END;
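The charset ranges in the example are cumulative: each gene starts one position after the previous gene ends. A minimal sketch of that arithmetic for the "by gene" case follows. `charset_lines` is a hypothetical helper, and the gene lengths are derived from the ranges shown in the example rather than taken from real data.

```python
from collections import OrderedDict


def charset_lines(gene_codes_and_lengths):
    """Return one charset line per gene, using 1-based inclusive ranges."""
    lines = []
    start = 1
    for gene_code, length in gene_codes_and_lengths.items():
        end = start + length - 1
        lines.append("charset {} = {}-{};".format(gene_code, start, end))
        start = end + 1  # the next gene begins right after this one
    return lines


# Lengths inferred from the example ranges (596, 669 and 806 characters).
lengths = OrderedDict([("ArgKin", 596), ("COI-begin", 669), ("COI_end", 806)])
lines = charset_lines(lengths)
```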
Appends pos1, pos2, etc to the gene_code if needed.
Generates the outgroup line from the voucher code specified by the user.
Override this function for Phylip dataset as the content is different and goes into a separate file.
Override this function for Phylip dataset as the content is different and goes into a separate file.
Charset lines take a different suffix depending on the type of partitioning and the codon positions requested for the dataset.
Returns:
dataset_creator.creator module
- class dataset_creator.creator.Creator(data, format=None, codon_positions=None, partitioning=None, aminoacids=None, degenerate=None, outgroup=None)
Bases: builtins.object
Create dataset and extra files for formats FASTA, NEXUS, PHYLIP, TNT and MEGA. We will create a NEXUS formatted dataset first and use BioPython tools to convert to FASTA and PHYLIP formats.
Parameters: - data (named tuple) – containing:
    * gene_codes: list
    * number_chars: string
    * number_taxa: string
    * seq_records: list of SeqRecordExpanded objects
    * gene_codes_and_lengths
- format (str) – NEXUS, PHYLIP, TNT, MEGA
- codon_positions (str) – Can be 1st, 2nd, 3rd, 1st-2nd, ALL (default).
- partitioning (str) – ‘by gene’, ‘by codon position’, ‘1st-2nd, 3rd’
- aminoacids (boolean) – To create aminoacid sequences instead of returning nucleotides.
- degenerate (str) – Method to degenerate nucleotide sequences, following Zwick et al. Can be S, Z, SZ and normal.
- outgroup (str) – voucher code to be used as outgroup for NEXUS and TNT files.
- extra_dataset_str
str – Charset block in Phylip formatted datasets.
Example
>>> dataset_creator = Creator(data, format='NEXUS', codon_positions='ALL',
... partitioning='by gene')
>>> dataset_creator.dataset_str
'#NEXUS blah blah '
dataset_creator.dataset module
- class dataset_creator.dataset.Dataset(seq_records, format=None, partitioning=None, codon_positions=None, aminoacids=None, degenerate=None, outgroup=None)
Bases: builtins.object
User’s class for making datasets of several formats. It needs as input a list of SeqRecordExpanded objects with as much info as possible.
Parameters: - seq_records (list) – SeqRecordExpanded objects. The list should be sorted by gene_code and then voucher code.
- format (str) – NEXUS, PHYLIP, TNT, MEGA, GenBankFASTA.
- partitioning (str) – Partitioning scheme: by gene (default), by codon position (each) and 1st-2nd, 3rd.
- codon_positions (str) – Can be 1st, 2nd, 3rd, 1st-2nd, ALL (default).
- aminoacids (boolean) – Returns the dataset as aminoacid sequences.
- degenerate (str) – Method to degenerate nucleotide sequences, following Zwick et al. Can be S, Z, SZ and normal.
- outgroup (str) – voucher code to be used as outgroup for NEXUS and TNT files.
- _gene_codes_and_lengths
dict – in the form gene_code: list. The list contains the lengths of the sequences for that gene_code. We assume the longest to be the real gene_code sequence length.
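The mapping described above can be sketched as follows. This is an illustration only: `gene_lengths` is a hypothetical helper that takes plain `(gene_code, sequence)` pairs instead of SeqRecordExpanded objects.

```python
from collections import OrderedDict


def gene_lengths(records):
    """Collect sequence lengths per gene_code and pick the longest for each.

    `records` is an iterable of (gene_code, sequence) pairs (an assumption
    made to keep the sketch self-contained).
    """
    lengths = OrderedDict()
    for gene_code, seq in records:
        lengths.setdefault(gene_code, []).append(len(seq))
    longest = OrderedDict(
        (gene_code, max(values)) for gene_code, values in lengths.items()
    )
    return lengths, longest


all_lengths, longest = gene_lengths([
    ("RpS5", "ACTACCTA"),
    ("RpS5", "ACTACC"),
    ("wingless", "ACTA"),
])
```

Taking the maximum treats shorter sequences as incomplete rather than as evidence of a shorter gene.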
Example
>>> dataset = Dataset(seq_records, format='NEXUS', codon_positions='1st',
... partitioning='by gene',
... )
>>> print(dataset.dataset_str)
'#NEXUS blah blah '
>>> dataset = Dataset(seq_records, format='PHYLIP', codon_positions='ALL',
... partitioning='by gene',
... )
>>> print(dataset.dataset_str)
'100 10 blah blah '
dataset_creator.exceptions module
dataset_creator.genbank_fasta module
dataset_creator.mega module
- class dataset_creator.mega.MegaDatasetBlock(data, codon_positions, partitioning, aminoacids=None, degenerate=None, format=None, outgroup=None)
dataset_creator.phylip module
Bases: dataset_creator.base_dataset.DatasetFooter
Overridden function for Phylip dataset as the content is different and goes into a separate file.
Overridden function for Phylip dataset as the content is different and goes into a separate file.
dataset_creator.tnt module
- class dataset_creator.tnt.TntDatasetBlock(data, codon_positions, partitioning, aminoacids=None, degenerate=None, format=None, outgroup=None)
dataset_creator.utils module
- dataset_creator.utils.convert_nexus_to_format(dataset_as_nexus, dataset_format)
Converts a NEXUS-formatted dataset to PHYLIP or FASTA using Biopython tools.
Parameters: - dataset_as_nexus –
- dataset_format –
Returns:
- dataset_creator.utils.get_seq(seq_record, codon_positions, aminoacids=False, degenerate=None)
Checks parameters such as codon_positions, aminoacids... to return the required sequence as string.
Parameters: - seq_record (SeqRecordExpanded object) –
- codon_positions (str) –
- aminoacids (boolean) –
Returns: Namedtuple containing seq (str) and warning (str).
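To make the `codon_positions` options concrete, here is a minimal sketch of selecting bases by codon position. It is not the library's `get_seq`: `codon_position_bases` is a hypothetical helper, and it assumes that `reading_frame=N` means the first complete codon starts at base N.

```python
WANTED_POSITIONS = {
    "1st": {1},
    "2nd": {2},
    "3rd": {3},
    "1st-2nd": {1, 2},
    "ALL": {1, 2, 3},
}


def codon_position_bases(seq, codon_positions="ALL", reading_frame=1):
    """Keep only the bases at the requested codon positions.

    Assumption: reading_frame=N means the first complete codon starts at
    base N, so bases before it belong to the tail of a previous codon.
    """
    try:
        wanted = WANTED_POSITIONS[codon_positions]
    except KeyError:
        raise ValueError("Unknown codon_positions: {!r}".format(codon_positions))
    offset = reading_frame - 1
    return "".join(
        base
        for index, base in enumerate(seq)
        if (index - offset) % 3 + 1 in wanted
    )


first_and_second = codon_position_bases("ATGCCC", "1st-2nd")
```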
- dataset_creator.utils.make_dataset_header(data, file_format, aminoacids)
Creates the dataset header for NEXUS files from #NEXUS to MATRIX.
Parameters: - data (namedtuple) – with necessary info for dataset creation.
- file_format (str) – TNT, PHYLIP, NEXUS, FASTA
- aminoacids (boolean) – If aminoacids is True the header will show DATATYPE=PROTEIN otherwise it will be DNA.
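A sketch of what a NEXUS header running from `#NEXUS` to `MATRIX` can look like, with the datatype switching on `aminoacids`. The `nexus_header` helper and the exact FORMAT options are assumptions for illustration; the header emitted by `make_dataset_header` may differ in detail.

```python
def nexus_header(number_taxa, number_chars, aminoacids=False):
    """Build a NEXUS DATA block header up to and including MATRIX."""
    datatype = "PROTEIN" if aminoacids else "DNA"
    return (
        "#NEXUS\n"
        "\n"
        "BEGIN DATA;\n"
        "DIMENSIONS NTAX={} NCHAR={};\n"
        "FORMAT INTERLEAVE DATATYPE={} MISSING=? GAP=-;\n"
        "MATRIX\n"
    ).format(number_taxa, number_chars, datatype)


header = nexus_header(number_taxa=10, number_chars=100, aminoacids=True)
```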
Module contents
- class dataset_creator.Dataset(seq_records, format=None, partitioning=None, codon_positions=None, aminoacids=None, degenerate=None, outgroup=None)
Bases: builtins.object
User’s class for making datasets of several formats. It needs as input a list of SeqRecordExpanded objects with as much info as possible.
Parameters: - seq_records (list) – SeqRecordExpanded objects. The list should be sorted by gene_code and then voucher code.
- format (str) – NEXUS, PHYLIP, TNT, MEGA, GenBankFASTA.
- partitioning (str) – Partitioning scheme: by gene (default), by codon position (each) and 1st-2nd, 3rd.
- codon_positions (str) – Can be 1st, 2nd, 3rd, 1st-2nd, ALL (default).
- aminoacids (boolean) – Returns the dataset as aminoacid sequences.
- degenerate (str) – Method to degenerate nucleotide sequences, following Zwick et al. Can be S, Z, SZ and normal.
- outgroup (str) – voucher code to be used as outgroup for NEXUS and TNT files.
- _gene_codes_and_lengths
dict – in the form gene_code: list. The list contains the lengths of the sequences for that gene_code. We assume the longest to be the real gene_code sequence length.
Example
>>> dataset = Dataset(seq_records, format='NEXUS', codon_positions='1st',
... partitioning='by gene',
... )
>>> print(dataset.dataset_str)
'#NEXUS blah blah '
>>> dataset = Dataset(seq_records, format='PHYLIP', codon_positions='ALL',
... partitioning='by gene',
... )
>>> print(dataset.dataset_str)
'100 10 blah blah '
- sort_seq_records(seq_records)
Checks that the SeqRecordExpanded objects are sorted by gene_code and then by voucher code.
The dashes in taxon names need to be converted to underscores so the dataset will be accepted by Biopython to do format conversions.
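The ordering requirement can be checked, or enforced, before building a dataset. The sketch below uses a stand-in `Record` type with only the two relevant fields; the helper names are hypothetical and do not mirror the library's internals.

```python
from collections import namedtuple

# Stand-in for SeqRecordExpanded (hypothetical, only the sort keys).
Record = namedtuple("Record", ["gene_code", "voucher_code"])


def sort_key(record):
    """Sort by gene_code first, then by voucher code."""
    return (record.gene_code, record.voucher_code)


def is_sorted(seq_records):
    """Return True if records are already ordered by gene_code, then voucher code."""
    keys = [sort_key(record) for record in seq_records]
    return keys == sorted(keys)


unsorted_records = [
    Record("wingless", "CP100-10"),
    Record("RpS5", "CP100-11"),
    Record("RpS5", "CP100-10"),
]
ordered = sorted(unsorted_records, key=sort_key)
```

Note that plain string comparison is case-sensitive, so uppercase gene codes sort before lowercase ones.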