Project 3

Enter your name and EID here

Instructions

After completing this Jupyter notebook, please convert it to pdf and submit both the pdf and the original notebook on Canvas no later than 4:00 pm on May 9, 2019. The two documents will be graded jointly, so they must be consistent (as in, don't change the Jupyter notebook without also updating the converted pdf!).

All results presented must have corresponding code. Any answers/results given without the corresponding Python code that generated the result will be considered absent. All code reported in your final project document should work properly.

Before submitting the Jupyter notebook part, please re-run all cells by clicking "Kernel" and selecting "Restart & Run All."

The project consists of two problems. For both problems, please follow these guidelines:

  • Final output needs to be nicely formatted and human readable. For example, if your result is a count, don't just print the value of the count, print "The count is: ...".
  • For each problem, limit your total code to fewer than 100 lines.
  • Write comments and explanatory text, so we understand what you are doing.
  • Do not print out large datasets, such as an entire genome, or a list of all genes in a genome, etc.
  • Verify that nothing of importance (code, comments, other text) is cut off in your final pdf.
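
As a minimal sketch of the first guideline, a result can be wrapped in a descriptive sentence with an f-string. The variable name `gene_count` and its value are made up for illustration.

```python
# Hypothetical result; in the project this would come from your own analysis.
gene_count = 1234

# Build a human-readable sentence instead of printing the bare number.
message = f"The count is: {gene_count}"
print(message)
```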

Problem 1

Salmonella enterica are pathogenic bacteria closely related to E. coli. There are many different S. enterica strains, and here we will compare two of them, LT2 and CT18. LT2 belongs to serovar Typhimurium and is the canonical strain most commonly used as a reference. CT18 belongs to serovar Typhi, the cause of typhoid fever in humans, and is another widely used reference.

Before we can work with these two genomes, we need to download them. Note: Running the next cell may take a few minutes.

In [1]:
from Bio import Entrez
Entrez.email = ... # put your email here

# Download S. enterica strain LT2 and write into file "S_enterica_LT2.gb":
download_handle = Entrez.efetch(db="nucleotide", id="NC_003197", rettype="gbwithparts", retmode="text")
with open("S_enterica_LT2.gb", "w") as out_handle:
    out_handle.write(download_handle.read())
download_handle.close()
print("Downloaded S. enterica LT2")

# Download S. enterica strain CT18 and write into file "S_enterica_CT18.gb":
download_handle = Entrez.efetch(db="nucleotide", id="NC_003198", rettype="gbwithparts", retmode="text")
with open("S_enterica_CT18.gb", "w") as out_handle:
    out_handle.write(download_handle.read())
download_handle.close()
print("Downloaded S. enterica CT18")
Downloaded S. enterica LT2
Downloaded S. enterica CT18

Problem 1a (30 pts): How many named protein-coding genes are in S. enterica LT2? And how many of these genes have synonyms in S. enterica CT18?

Hint: Gene names have been defined for the LT2 strain. You can find these names in the "gene" qualifier of CDS features. When equivalent genes exist in CT18, they are listed under the "gene_synonym" qualifier of the CDS features. As an example, manually open the two genome files and look for the "thrL" gene in each genome.
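
To illustrate the hint, here is a minimal sketch of how qualifiers can be read. In Biopython, each feature has a `type` string and a `qualifiers` dictionary that maps a qualifier name to a list of strings. So that the sketch runs without the genome files, it uses `SimpleNamespace` objects as stand-ins for real features. The gene names and the helper `first_qualifier` are hypothetical and chosen only for illustration.

```python
from types import SimpleNamespace

# Stand-ins for Biopython feature objects (hypothetical toy data).
# Real features expose the same two attributes: .type and .qualifiers.
toy_features = [
    SimpleNamespace(type="gene", qualifiers={"gene": ["abcA"]}),
    SimpleNamespace(type="CDS", qualifiers={"gene": ["abcA"], "gene_synonym": ["xyzA"]}),
    SimpleNamespace(type="CDS", qualifiers={"gene": ["abcB"]}),
    SimpleNamespace(type="CDS", qualifiers={"product": ["hypothetical protein"]}),
]

def first_qualifier(feature, name):
    """Return the first value of a qualifier, or None if it is absent."""
    values = feature.qualifiers.get(name)
    return values[0] if values else None

results = []
for feature in toy_features:
    if feature.type != "CDS":
        continue  # skip everything that is not a CDS feature
    gene = first_qualifier(feature, "gene")
    synonym = first_qualifier(feature, "gene_synonym")
    results.append((gene, synonym))
    print(f"gene: {gene}, synonym: {synonym}")
```

With the real data, the same kind of loop would run over the features of a parsed genome record in place of `toy_features`.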

Provide a brief introduction (1 paragraph max) explaining how you are going to answer the questions.

In [2]:
from Bio import SeqIO

# read in the LT2 genome
with open("S_enterica_LT2.gb", "r") as in_handle:
    record_LT2 = SeqIO.read(in_handle, "genbank")

# read in the CT18 genome
with open("S_enterica_CT18.gb", "r") as in_handle:
    record_CT18 = SeqIO.read(in_handle, "genbank")

# your code goes here

Provide a brief conclusion (1 paragraph max) explaining what your code shows.

Problem 1b (20 pts): How many of the named genes in LT2 without a synonym in CT18 have their product listed as "hypothetical protein"?

Provide a brief introduction (1 paragraph max) explaining how you are going to answer the question.

In [3]:
# your code goes here

Provide a brief conclusion (1 paragraph max) explaining what your code shows.

Problem 2

(50 pts)

Ask a question about the genomes from Problem 1 and then write Python code that generates an answer. The question does not have to be conceptual, and it can be about only one of the two genomes or about the two genomes jointly.

For full credit, the answer code must meet the following conditions:

  • contains at least one for loop
  • contains at least one if statement
  • uses at least one list or dictionary
  • uses at least one regular expression
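
As a hedged illustration of how the four required constructs can appear together, the sketch below counts gene names by their three-letter prefix. The short hand-written list and the pattern are assumptions chosen for illustration, not results from either genome.

```python
import re

# Small hand-written list for illustration only (not read from a genome file).
gene_names = ["thrL", "thrA", "thrB", "yaaA", "STM0001", "dnaK"]

# Regex: three lowercase letters followed by one uppercase letter, e.g. "thrA".
pattern = re.compile(r"^([a-z]{3})[A-Z]$")

prefix_counts = {}  # dictionary: prefix -> number of matching names
for name in gene_names:  # for loop
    match = pattern.match(name)
    if match:  # if statement: keep only names that fit the pattern
        prefix = match.group(1)
        prefix_counts[prefix] = prefix_counts.get(prefix, 0) + 1

# Report each prefix in a readable sentence.
for prefix, count in sorted(prefix_counts.items()):
    print(f"Prefix {prefix}: {count} gene(s)")
```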

Question: your question goes here

Provide a brief introduction (1-2 paragraphs) explaining how you are going to answer the question.

In [4]:
# your code goes here

Provide a brief conclusion (1-2 paragraphs) explaining what you have learned about your question from your code.