Enter your name and EID here
After completing this Jupyter notebook, please convert it to pdf and submit both the pdf and the original notebook on Canvas no later than 4:00 pm on May 9, 2019. The two documents will be graded jointly, so they must be consistent (as in, don't change the Jupyter notebook without also updating the converted pdf!).
All results presented must have corresponding code. Any answers/results given without the corresponding python code that generated the result will be considered absent. All code reported in your final project document should work properly.
Before submitting the Jupyter notebook part, please re-run all cells by clicking "Kernel" and selecting "Restart & Run All."
The project consists of two problems. For both problems, please follow these guidelines:
Salmonella enterica is a species of pathogenic bacteria closely related to E. coli. There are many different S. enterica strains, and here we will compare two of them, LT2 and CT18. LT2 (serovar Typhimurium) is the canonical strain that is most commonly used as a reference. CT18 (serovar Typhi, the cause of typhoid fever in humans) is another widely used reference.
Before we can work with these two genomes, we need to download them. Note: Running the next cell may take a few minutes.
from Bio import Entrez
Entrez.email = ... # put your email here
# Download S. enterica strain LT2 and write into file "S_enterica_LT2.gb":
download_handle = Entrez.efetch(db="nucleotide", id="NC_003197", rettype="gbwithparts", retmode="text")
out_handle = open("S_enterica_LT2.gb", "w")
out_handle.write(download_handle.read())
download_handle.close()
out_handle.close()
print("Downloaded S. enterica LT2")
# Download S. enterica strain CT18 and write into file "S_enterica_CT18.gb":
download_handle = Entrez.efetch(db="nucleotide", id="NC_003198", rettype="gbwithparts", retmode="text")
out_handle = open("S_enterica_CT18.gb", "w")
out_handle.write(download_handle.read())
download_handle.close()
out_handle.close()
print("Downloaded S. enterica CT18")
Problem 1a (30 pts): How many named protein-coding genes are in S. enterica LT2? And how many of these genes have synonyms in S. enterica CT18?
Hint: Gene names have been defined for the LT2 strain. You can find these names in the "gene" qualifier of CDS features. When equivalent genes exist in CT18, they are listed under the "gene_synonym" qualifier of the CDS features. As an example, manually open the two genome files and look for the "thrL" gene in each genome.
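As a minimal sketch of how these qualifiers can be inspected, the snippet below tallies CDS features by whether they carry a "gene" and a "gene_synonym" qualifier. It assumes only that each feature exposes a `type` string and a `qualifiers` dictionary mapping qualifier names to lists of strings, as Biopython's feature objects do. The helper name `tally_cds_qualifiers`, the toy record, and all of its values other than "thrL" are hypothetical stand-ins, not data from either genome.

```python
from types import SimpleNamespace


def tally_cds_qualifiers(record):
    """Count CDS features, those with a "gene" qualifier, and those
    that additionally carry a "gene_synonym" qualifier."""
    counts = {"cds": 0, "named": 0, "named_with_synonym": 0}
    for feature in record.features:
        if feature.type != "CDS":
            continue  # skip gene, tRNA, rRNA, and other feature types
        counts["cds"] += 1
        if "gene" in feature.qualifiers:
            counts["named"] += 1
            if "gene_synonym" in feature.qualifiers:
                counts["named_with_synonym"] += 1
    return counts


# Toy stand-in for a parsed GenBank record (made-up values for illustration).
toy_record = SimpleNamespace(features=[
    SimpleNamespace(type="gene", qualifiers={"gene": ["thrL"]}),
    SimpleNamespace(type="CDS", qualifiers={"gene": ["thrL"],
                                            "gene_synonym": ["placeholder_synonym"]}),
    SimpleNamespace(type="CDS", qualifiers={"gene": ["geneB"]}),
    SimpleNamespace(type="CDS", qualifiers={"locus_tag": ["placeholder_tag"]}),
])

print(tally_cds_qualifiers(toy_record))
# {'cds': 3, 'named': 2, 'named_with_synonym': 1}
```

The same function could be called on a record read with SeqIO. Which of the two genome files actually carries the "gene_synonym" entries is best confirmed by opening the files, as the hint suggests.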
Provide a brief introduction (1 paragraph max) explaining how you are going to answer the questions.
from Bio import SeqIO
# read in the LT2 genome
in_handle = open("S_enterica_LT2.gb", "r")
record_LT2 = SeqIO.read(in_handle, "genbank")
in_handle.close()
# read in the CT18 genome
in_handle = open("S_enterica_CT18.gb", "r")
record_CT18 = SeqIO.read(in_handle, "genbank")
in_handle.close()
# your code goes here
Provide a brief conclusion (1 paragraph max) explaining what your code shows.
Problem 1b (20 pts): How many of the named genes in LT2 without a synonym in CT18 have their product listed as "hypothetical protein"?
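One possible way to express this kind of filter is sketched below. It assumes feature objects with a `type` string and a `qualifiers` dictionary of lists (the layout Biopython uses), and it treats "without a synonym" as "has no gene_synonym qualifier", which is an assumption to verify against the actual files. The function name and the toy features are hypothetical.

```python
from types import SimpleNamespace


def count_unmatched_hypothetical(features):
    """Count named CDS features that lack a "gene_synonym" qualifier
    and whose first "product" entry is exactly "hypothetical protein"."""
    total = 0
    for feature in features:
        quals = feature.qualifiers
        if feature.type != "CDS" or "gene" not in quals or "gene_synonym" in quals:
            continue
        # Qualifier values are lists; fall back to [""] if no product is given.
        if quals.get("product", [""])[0] == "hypothetical protein":
            total += 1
    return total


# Made-up features for illustration only.
toy_features = [
    SimpleNamespace(type="CDS", qualifiers={"gene": ["geneA"],
                                            "product": ["hypothetical protein"]}),
    SimpleNamespace(type="CDS", qualifiers={"gene": ["geneB"],
                                            "gene_synonym": ["placeholder_synonym"],
                                            "product": ["hypothetical protein"]}),
    SimpleNamespace(type="CDS", qualifiers={"gene": ["geneC"],
                                            "product": ["example enzyme"]}),
    SimpleNamespace(type="CDS", qualifiers={"product": ["hypothetical protein"]}),
    SimpleNamespace(type="gene", qualifiers={"gene": ["geneA"]}),
    SimpleNamespace(type="CDS", qualifiers={"gene": ["geneD"]}),
]

print(count_unmatched_hypothetical(toy_features))
# 1
```

Only the first toy feature is counted: the others have a synonym, a different product, no gene name, a non-CDS type, or no product at all.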
Provide a brief introduction (1 paragraph max) explaining how you are going to answer the question.
# your code goes here
Provide a brief conclusion (1 paragraph max) explaining what your code shows.
Problem 2 (50 pts):
Ask a question about the genomes from Problem 1 and then write python code that generates an answer. The question does not have to be conceptual, and it can be about only one of the two genomes or about the two genomes jointly.
For full credit, the answer code must meet the following conditions:
- include a for loop
- include an if statement
Question: your question goes here
Provide a brief introduction (1-2 paragraphs) explaining how you are going to answer the question.
# your code goes here
Provide a brief conclusion (1-2 paragraphs) explaining what you have learned about your question from your code.