User’s Guide

dataset-creator needs a list of SeqRecordExpanded objects:

>>> from seqrecord_expanded import SeqRecord
>>>
>>> # `table` is the Translation Table code based on NCBI
>>> seq_record1 = SeqRecord('ACTACCTA', reading_frame=2, gene_code='RpS5',
...                         table=1, voucher_code='CP100-10',
...                         taxonomy={'genus': 'Aus', 'species': 'bus'})
>>>
>>> seq_record2 = SeqRecord('ACTACCTA', reading_frame=2, gene_code='RpS5',
...                         table=1, voucher_code='CP100-10',
...                         taxonomy={'genus': 'Aus', 'species': 'bus'})
>>>
>>> seq_record3 = SeqRecord('ACTACCTA', reading_frame=2, gene_code='wingless',
...                         table=1, voucher_code='CP100-10',
...                         taxonomy={'genus': 'Aus', 'species': 'bus'})
>>>
>>> seq_record4 = SeqRecord('ACTACCTA', reading_frame=2, gene_code='wingless',
...                         table=1, voucher_code='CP100-10',
...                         taxonomy={'genus': 'Aus', 'species': 'bus'})
>>>
>>> seq_records = [
...    seq_record1, seq_record2, seq_record3, seq_record4,
... ]

Create an aminoacid dataset from your nucleotide sequences:

>>> from dataset_creator import Dataset
>>> dataset = Dataset(seq_records, format='NEXUS', partitioning='by gene',
...                   codon_positions='ALL', aminoacids=True)
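To make the translation step concrete, here is a minimal, standalone sketch of translating a nucleotide string with NCBI translation table 1 (the standard code). It is not dataset-creator's implementation; the `translate` helper is hypothetical, and it assumes that `reading_frame=2` means the first complete codon starts at the second base.

```python
# Standard genetic code (NCBI translation table 1).
# Codons are enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, reading_frame=1):
    """Translate a nucleotide string (hypothetical helper).

    Assumption: reading_frame is 1-based and marks where the first
    complete codon starts. Codons with ambiguous bases become 'X';
    a trailing incomplete codon is dropped.
    """
    start = reading_frame - 1
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


print(translate("ACTACCTA", reading_frame=2))  # codons CTA, CCT -> LP
```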

Create a dataset with degenerated nucleotide sequences using the method by Zwick et al.:

>>> # The degenerate method can be 'S', 'Z', 'SZ' and 'normal'
>>> dataset = Dataset(seq_records, format='NEXUS', partitioning='by gene',
...                   codon_positions='ALL', degenerate='S')

Create dataset specifying the outgroup by its voucher code:

>>> dataset = Dataset(seq_records, format='NEXUS', partitioning='by gene',
...                   outgroup='CP100-10')

Codon positions can be 1st, 2nd, 3rd, 1st-2nd, ALL (default).

>>> dataset = Dataset(seq_records, format='TNT', partitioning='by codon position',
...                   codon_positions='ALL')

>>> dataset = Dataset(seq_records, format='PHYLIP', partitioning='1st-2nd, 3rd',
...                   codon_positions='ALL')

>>> dataset = Dataset(seq_records, format='NEXUS', partitioning='by gene',
...                   codon_positions='1st')
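The `codon_positions` values can be pictured as a filter over the bases of each sequence. The sketch below is a standalone illustration under stated assumptions, not the library's code: `pick_codon_positions` is a hypothetical helper, `reading_frame` is taken to be the 1-based index of the first base that sits in codon position 1, and bases before that point are assigned positions by counting backwards. The library may treat incomplete codons differently.

```python
def pick_codon_positions(seq, reading_frame, codon_positions="ALL"):
    """Keep only the bases at the requested codon positions (hypothetical helper)."""
    wanted = {
        "1st": {0},
        "2nd": {1},
        "3rd": {2},
        "1st-2nd": {0, 1},
        "ALL": {0, 1, 2},
    }[codon_positions]
    offset = reading_frame - 1
    # (i - offset) % 3 is 0, 1 or 2 for codon positions 1, 2 and 3.
    return "".join(
        base for i, base in enumerate(seq) if (i - offset) % 3 in wanted
    )


for choice in ("1st", "2nd", "3rd", "1st-2nd", "ALL"):
    print(choice, pick_codon_positions("ACTACCTA", 2, choice))
```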

The dataset is returned as a string:

>>> print(dataset.dataset_str)
#NEXUS
blah blah ...

API Guide

dataset_creator package

Submodules

dataset_creator.base_dataset module

class dataset_creator.base_dataset.BasePairCount(reading_frame=None, codon_positions=None, partitioning=None, count_start=None, count_end=None)

Bases: builtins.object

Uses reading frame info, partitioning method and number of codon positions to return corrected base pair count for charset lines.

Example

>>> bp_count = BasePairCount(reading_frame=1, codon_positions='1st-2nd',
...                          partitioning='by codon position',
...                          count_start=100, count_end=512)
>>> bp_count.get_corrected_count()
[
    '100-512',
    '101-513',
]

get_corrected_count()

class dataset_creator.base_dataset.DatasetBlock(data, codon_positions, partitioning, aminoacids=None, degenerate=None, format=None, outgroup=None)

Bases: builtins.object

By default, the data sequences block generated is NEXUS and we use BioPython tools to convert it to other formats such as FASTA. However, sometimes the blo

Parameters:

data (named tuple) – containing:
    * gene_codes: list
    * number_chars: string
    * number_taxa: string
    * seq_records: list of SeqRecordExpanded objects
    * gene_codes_and_lengths: OrderedDict
codon_positions (str) – Can be 1st, 2nd, 3rd, 1st-2nd, ALL (default).
partitioning (str) –
aminoacids (boolean) –
degenerate (str) –
format (str) – NEXUS, PHYLIP or FASTA.
outgroup (str) – Specimen code of taxon that should be used as outgroup.

convert_block_dicts_to_string(block_1st2nd, block_1st, block_2nd, block_3rd): Takes into account whether we need to output all codon positions.

convert_to_string(block)

Makes gene_block as str from list of SeqRecordExpanded objects of a gene_code.

Override this function if the dataset block needs to be different due to file format.

This block will need to be split further if the dataset is FASTA or TNT and the partitioning scheme is 1st-2nd, 3rd.

As the dataset is split into several blocks due to 1st-2nd, 3rd we cannot translate to aminoacids or degenerate the sequences.

dataset_block()

Creates the block with taxon names and their sequences.

Override this function if the dataset block needs to be different due to file format.

Example

CP100_10_Aus_aus ACGATRGACGATRA...
CP100_11_Aus_bus ACGATRGACGATRA...
...
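A block of this shape can be sketched in a few lines. The helpers below are hypothetical and only illustrate the idea: the taxon label is assumed to be the voucher code followed by genus and species, joined with underscores, and records are represented as plain tuples rather than SeqRecordExpanded objects.

```python
def taxon_label(voucher_code, taxonomy):
    """Build a label such as CP100_10_Aus_aus (hypothetical helper)."""
    ranks = ("genus", "species")
    parts = [voucher_code] + [taxonomy[r] for r in ranks if taxonomy.get(r)]
    # Dashes and spaces are replaced so the label is a single token.
    return "_".join(parts).replace("-", "_").replace(" ", "_")


def block_lines(records):
    """Return one 'label sequence' line per (voucher, taxonomy, seq) tuple."""
    labels = [taxon_label(voucher, taxonomy) for voucher, taxonomy, _ in records]
    width = max(len(label) for label in labels) + 1
    return [
        label.ljust(width) + seq
        for label, (_, _, seq) in zip(labels, records)
    ]


records = [
    ("CP100-10", {"genus": "Aus", "species": "aus"}, "ACGATRGACGATRA"),
    ("CP100-11", {"genus": "Aus", "species": "bus"}, "ACGATRGACGATRA"),
]
print("\n".join(block_lines(records)))
```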

flatten_taxonomy(seq_record)

make_datablock_by_gene(block)

make_datablock_considering_codon_positions_as_fasta_format(block)

split_data()

Splits the list of SeqRecordExpanded objects into lists, which are kept into a bigger list.

If the file_format is Nexus, then it is only partitioned by gene. If it is FASTA, then it needs partitioning by codon positions if required.

Example

>>> blocks = [
...     [SeqRecord1, SeqRecord2],  # for gene 1
...     [SeqRecord1, SeqRecord2],  # for gene 2
...     [SeqRecord1, SeqRecord2],  # for gene 3
...     [SeqRecord1, SeqRecord2],  # for gene 4
... ]
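A minimal sketch of this kind of split, grouping records by gene code with `itertools.groupby`. The `Record` namedtuple and `split_by_gene` function are stand-ins invented for illustration; the real method works on SeqRecordExpanded objects and may also split by codon position.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical stand-in for SeqRecordExpanded, with only the fields used here.
Record = namedtuple("Record", ["gene_code", "voucher_code", "seq"])


def split_by_gene(records):
    """Return a list of lists, one inner list per gene_code."""
    # groupby only merges adjacent items, so sort by gene_code first.
    ordered = sorted(records, key=lambda r: (r.gene_code, r.voucher_code))
    return [list(group) for _, group in groupby(ordered, key=lambda r: r.gene_code)]


records = [
    Record("wingless", "CP100-10", "ACTACCTA"),
    Record("RpS5", "CP100-11", "ACTACCTA"),
    Record("RpS5", "CP100-10", "ACTACCTA"),
    Record("wingless", "CP100-11", "ACTACCTA"),
]
blocks = split_by_gene(records)
for block in blocks:
    print(block[0].gene_code, [r.voucher_code for r in block])
```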

class dataset_creator.base_dataset.DatasetFooter(data, codon_positions=None, partitioning=None, outgroup=None)

Bases: builtins.object

Builds charset block:

Parameters:

data (namedtuple) – with necessary info for dataset creation.
codon_positions (str) – 1st, 2nd, 3rd, 1st-2nd, ALL.
partitioning (str) – by gene, by codon position, 1st-2nd, 3rd.
outgroup (str) – voucher code to be used as outgroup for NEXUS and TNT files.

Example

begin mrbayes;
charset ArgKin = 1-596;
charset COI-begin = 597-1265;
charset COI_end = 1266-2071;
charset ef1a = 2072-3311;
charset RpS2 = 3312-3722;
charset RpS5 = 3723-4339;
charset wingless = 4340-4739;
set autoclose=yes;
prset applyto=(all) ratepr=variable brlensp=unconstrained:Exp(100.0) shapepr=exp(1.0) tratiopr=beta(2.0,1.0);
lset applyto=(all) nst=mixed rates=gamma [invgamma];
unlink statefreq=(all);
unlink shape=(all) revmat=(all) tratio=(all) [pinvar=(all)];
mcmc ngen=10000000 printfreq=1000 samplefreq=1000 nchains=4 nruns=2 savebrlens=yes [temp=0.11];
 sump relburnin=yes [no] burninfrac=0.25 [2500];
 sumt relburnin=yes [no] burninfrac=0.25 [2500] contype=halfcompat [allcompat];
END;
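The charset ranges in a block like this follow from the gene lengths: each gene starts one position after the previous gene ends. A minimal sketch, assuming an ordered mapping of gene code to alignment length (`charset_lines` is a hypothetical helper, not the library's `format_charset_line`):

```python
from collections import OrderedDict


def charset_lines(gene_lengths):
    """Turn an ordered {gene_code: length} mapping into charset lines."""
    lines = []
    start = 1
    for gene_code, length in gene_lengths.items():
        end = start + length - 1
        lines.append("charset {} = {}-{};".format(gene_code, start, end))
        start = end + 1
    return lines


# Lengths chosen to reproduce the first three ranges of the example above.
gene_lengths = OrderedDict([("ArgKin", 596), ("COI-begin", 669), ("COI_end", 806)])
print("\n".join(charset_lines(gene_lengths)))
```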

add_suffixes_to_gene_codes(): Appends pos1, pos2, etc to the gene_code if needed.

correct_count_using_reading_frames(gene_code, count_start, count_end)

dataset_footer()

format_charset_line(gene_code, count_start, count_end)

get_outgroup(): Generates the outgroup line from the voucher code specified by the user.

make_charset_block(): Override this function for Phylip dataset as the content is different and goes into a separate file.

make_charsets(): Override this function for Phylip dataset as the content is different and goes into a separate file.

make_footer()

make_gene_code_suffixes()

make_partition_line()

make_slash_number()

Charset lines have \2 or \3 depending on the type of partitioning and the codon positions requested for our dataset.

Returns:
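In NEXUS and MrBayes charset lines, a trailing `\3` means "every third site in this range", which is how a gene is split into codon positions. The sketch below is an illustration only: `codon_position_charsets` is a hypothetical helper, the `_pos1` style suffix is an assumed naming, and the gene is assumed to start in reading frame 1.

```python
def codon_position_charsets(gene_code, count_start, count_end):
    """Return one charset line per codon position, using the \\3 step notation."""
    return [
        "charset {}_pos{} = {}-{}\\3;".format(
            gene_code, position, count_start + position - 1, count_end
        )
        for position in (1, 2, 3)
    ]


print("\n".join(codon_position_charsets("ArgKin", 1, 596)))
```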

suffix_for_one_codon_position()

suffix_for_several_codon_positions()

dataset_creator.creator module

class dataset_creator.creator.Creator(data, format=None, codon_positions=None, partitioning=None, aminoacids=None, degenerate=None, outgroup=None)

Bases: builtins.object

Create dataset and extra files for formats FASTA, NEXUS, PHYLIP, TNT and MEGA. We will create a NEXUS formatted dataset first and use BioPython tools to convert to FASTA and PHYLIP formats.

Parameters:

data (named tuple) – containing:
    * gene_codes: list
    * number_chars: string
    * number_taxa: string
    * seq_records: list of SeqRecordExpanded objects
    * gene_codes_and_lengths
format (str) – NEXUS, PHYLIP, TNT, MEGA
codon_positions (str) – Can be 1st, 2nd, 3rd, 1st-2nd, ALL (default).
partitioning (str) – ‘by gene’, ‘by codon position’, ‘1st-2nd, 3rd’
aminoacids (boolean) – To create aminoacid sequences instead of returning nucleotides.
degenerate (str) – Method to degenerate nucleotide sequences, following Zwick et al. Can be S, Z, SZ and normal.
outgroup (str) – voucher code to be used as outgroup for NEXUS and TNT files.

extra_dataset_str: str – Charset block in Phylip formatted datasets.

Example

>>> dataset_creator = Creator(data, format='NEXUS', codon_positions='ALL',
...                           partitioning='by gene')
>>> dataset_creator.dataset_str
'#NEXUS\nblah blah\n'

create_dataset_block()

create_dataset_footer()

create_dataset_header()

create_extra_dataset_file()

put_everything_together()

dataset_creator.dataset module

class dataset_creator.dataset.Dataset(seq_records, format=None, partitioning=None, codon_positions=None, aminoacids=None, degenerate=None, outgroup=None)

Bases: builtins.object

User’s class for making datasets in several formats. It takes as input a list of SeqRecordExpanded objects with as much info as possible.

Parameters:

seq_records (list) – SeqRecordExpanded objects. The list should be sorted by gene_code and then voucher code.
format (str) – NEXUS, PHYLIP, TNT, MEGA, GenBankFASTA.
partitioning (str) – Partitioning scheme: by gene (default), by codon position (each) and 1st-2nd, 3rd.
codon_positions (str) – Can be 1st, 2nd, 3rd, 1st-2nd, ALL (default).
aminoacids (boolean) – Returns the dataset as aminoacid sequences.
degenerate (str) – Method to degenerate nucleotide sequences, following Zwick et al. Can be S, Z, SZ and normal.
outgroup (str) – voucher code to be used as outgroup for NEXUS and TNT files.

_gene_codes_and_lengths: dict – in the form gene_code: list. The list contains the lengths of the sequences for that gene_code. We assume the longest to be the real sequence length for the gene_code.
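The "longest sequence wins" rule can be sketched as follows. This is an illustration with hypothetical helper names and plain `(gene_code, seq)` tuples, not the attribute's actual construction inside `Dataset`.

```python
from collections import OrderedDict


def collect_lengths(records):
    """Map each gene_code to the list of its sequence lengths, in input order."""
    lengths = OrderedDict()
    for gene_code, seq in records:
        lengths.setdefault(gene_code, []).append(len(seq))
    return lengths


def gene_length(lengths, gene_code):
    """Assume the longest observed sequence gives the real gene length."""
    return max(lengths[gene_code])


records = [
    ("RpS5", "ACTACCTA"),
    ("RpS5", "ACTACC"),
    ("wingless", "ACTACCTAGG"),
]
lengths = collect_lengths(records)
print(dict(lengths))
print(gene_length(lengths, "RpS5"))
```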

Example

>>> dataset = Dataset(seq_records, format='NEXUS', codon_positions='1st',
...                   partitioning='by gene',
...                   )
>>> print(dataset.dataset_str)
#NEXUS
blah blah
>>> dataset = Dataset(seq_records, format='PHYLIP', codon_positions='ALL',
...                   partitioning='by gene',
...                   )
>>> print(dataset.dataset_str)
100 10
blah blah

sort_seq_records(seq_records)

Checks that the SeqRecordExpanded objects are sorted by gene_code and then by voucher code.

The dashes in taxon names need to be converted to underscores so the dataset will be accepted by Biopython to do format conversions.
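The ordering check and the dash replacement can be sketched as below. The helpers are hypothetical and use plain dicts in place of SeqRecordExpanded objects; they show the idea rather than the method's actual behaviour.

```python
def sort_key(record):
    """Order by gene_code first, then by voucher code."""
    return (record["gene_code"], record["voucher_code"])


def is_sorted(records):
    """True if the records are already in gene_code, voucher_code order."""
    keys = [sort_key(record) for record in records]
    return keys == sorted(keys)


def clean_taxon_name(name):
    """Replace dashes with underscores, as needed for format conversion."""
    return name.replace("-", "_")


records = [
    {"gene_code": "wingless", "voucher_code": "CP100-10"},
    {"gene_code": "RpS5", "voucher_code": "CP100-11"},
    {"gene_code": "RpS5", "voucher_code": "CP100-10"},
]
print(is_sorted(records))
records = sorted(records, key=sort_key)
print([(r["gene_code"], clean_taxon_name(r["voucher_code"])) for r in records])
```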

dataset_creator.exceptions module

exception dataset_creator.exceptions.WrongParameterFormat

Bases: builtins.Exception

dataset_creator.genbank_fasta module

class dataset_creator.genbank_fasta.GenBankFASTADatasetBlock(data, codon_positions, partitioning, aminoacids=None, degenerate=None, format=None, outgroup=None)

Bases: dataset_creator.base_dataset.DatasetBlock

convert_to_string(block)

Takes a list of SeqRecordExpanded objects corresponding to a gene_code and produces the gene_block as string.

Parameters:	block –
Returns:	str.

dataset_creator.mega module

class dataset_creator.mega.MegaDatasetBlock(data, codon_positions, partitioning, aminoacids=None, degenerate=None, format=None, outgroup=None)

Bases: dataset_creator.base_dataset.DatasetBlock

convert_blocks_to_string()

New method, only in MegaDatasetBlock class.

Returns:	flattened data blocks as string

dataset_block()

dataset_creator.phylip module

class dataset_creator.phylip.PhylipDatasetFooter(data, codon_positions=None, partitioning=None, outgroup=None)

Bases: dataset_creator.base_dataset.DatasetFooter

make_charset_block(): Overridden function for Phylip dataset as the content is different and goes into a separate file.

make_charsets(): Overridden function for Phylip dataset as the content is different and goes into a separate file.

dataset_creator.tnt module

class dataset_creator.tnt.TntDatasetBlock(data, codon_positions, partitioning, aminoacids=None, degenerate=None, format=None, outgroup=None)

Bases: dataset_creator.base_dataset.DatasetBlock

convert_to_string(block)

Takes a list of SeqRecordExpanded objects corresponding to a gene_code and produces the gene_block as string.

Parameters:	block –
Returns:	str.

dataset_block()

put_outgroup_at_start_of_block(block)

dataset_creator.utils module

dataset_creator.utils.convert_nexus_to_format(dataset_as_nexus, dataset_format)

Converts nexus format to Phylip and Fasta using Biopython tools.

Parameters:	dataset_as_nexus – dataset_format –
Returns:

dataset_creator.utils.get_seq(seq_record, codon_positions, aminoacids=False, degenerate=None)

Checks parameters such as codon_positions, aminoacids... to return the required sequence as string.

Parameters:

seq_record (SeqRecordExpanded object) –
codon_positions (str) –
aminoacids (boolean) –

Returns:	Namedtuple containing `seq (str)` and `warning (str)`.

dataset_creator.utils.make_dataset_header(data, file_format, aminoacids)

Creates the dataset header for NEXUS files from #NEXUS to MATRIX.

Parameters:

data (namedtuple) – with necessary info for dataset creation.
file_format (str) – TNT, PHYLIP, NEXUS, FASTA
aminoacids (boolean) – If `aminoacids is True` the header will show `DATATYPE=PROTEIN` otherwise it will be `DNA`.
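For orientation, a NEXUS header running from `#NEXUS` to `MATRIX` can be sketched as below. The `nexus_header` function is hypothetical, and the exact FORMAT options the library writes (missing and gap symbols, interleaving) are assumptions here; only the DATATYPE switch is taken from the description above.

```python
def nexus_header(number_taxa, number_chars, aminoacids=False):
    """Return a minimal NEXUS header ending at the MATRIX keyword."""
    datatype = "PROTEIN" if aminoacids else "DNA"
    return (
        "#NEXUS\n"
        "\n"
        "BEGIN DATA;\n"
        "DIMENSIONS NTAX={} NCHAR={};\n"
        "FORMAT DATATYPE={} MISSING=? GAP=-;\n"
        "MATRIX\n"
    ).format(number_taxa, number_chars, datatype)


print(nexus_header(10, 4739))
```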

dataset_creator.utils.make_random_filename()

dataset_creator.utils.read_and_delete_tmp_file(filename)

Module contents

class dataset_creator.Dataset(seq_records, format=None, partitioning=None, codon_positions=None, aminoacids=None, degenerate=None, outgroup=None)

The same class as dataset_creator.dataset.Dataset, documented above, exposed at package level.