# tocID <- "BIN-Storing_data.R"
#
# Purpose:  A Bioinformatics Course:
#              R code accompanying the BIN-Storing_data unit
#
# Version:  1.3.2
#
# Date:     2017-10  -  2021-09
# Author:   Boris Steipe (boris.steipe@utoronto.ca)
#
# V 1.3.2   2021 minimal maintenance
# V 1.3.1   add overlooked jsonlite:: prefix to fromJSON()
# V 1.3     Made file locations more consistent. All student-edited files
#           go into the myScripts directory
# V 1.2     2020 updates. Finally removed stringsAsFactors :-)
# V 1.1     Add instructions to retrieve UniProt ID from ID mapping service.
# V 1.0     First live version, completely rebuilt. Now using JSON data sources.
# V 0.1     First code copied from BCH441_A03_makeYFOlist.R
#
# TODO:
#    The sameSpecies() approach is a bit of a hack - can we solve the
#    species vs. strain issue in a more principled way?
#
# == HOW TO WORK WITH LEARNING UNIT FILES ======================================
#
# DO NOT SIMPLY source() THESE FILES!
#
# If there are portions you don't understand, use R's help system, Google for an
# answer, or ask your instructor. Don't continue if you don't understand what's
# going on. That's not how it works ...
#
# ==============================================================================


#TOC> ==========================================================================
#TOC>
#TOC>   Section  Title
#TOC> -----------------------------------------------------------------------
#TOC>   1        A Relational Datamodel in R: review
#TOC>   1.1        Building a sample database structure
#TOC>   1.1.1          completing the database
#TOC>   1.2        Querying the database
#TOC>   1.3        Task: submit for credit (part 1/2)
#TOC>   2        Implementing the protein datamodel
#TOC>   2.1        JSON formatted source data
#TOC>   2.2        "Sanitizing" sequence data
#TOC>   2.3        Create a protein table for our data model
#TOC>   2.3.1          Initialize the database
#TOC>   2.3.2          Add data
#TOC>   2.4        Complete the database
#TOC>   2.4.1          Examples of navigating the database
#TOC>   2.5        Updating the database
#TOC>   3        Add your own data
#TOC>   3.1        Find a protein
#TOC>   3.2        Put the information into JSON files
#TOC>   3.3        Create an R script to create your own database
#TOC>   3.3.1          Check and validate
#TOC>   3.4        Task: submit for credit (part 2/2)
#TOC>
#TOC> ==========================================================================


# = 1 A Relational Datamodel in R: review =================================

# A disclaimer at first: we are not building an industry-strength database at
# all here - but we are employing principles of such a database to keep common
# types of lab-data well organized. Don't think of this as emulating or even
# replacing a "real" database, but think of this as improving the ad hoc
# approaches we normally employ to store data in the lab. That does not mean
# such ad hoc approaches are necessarily bad - the best solution always depends
# on your objectives, the details of your tasks, and the context in which you
# are working.

# The principle we follow in implementing a relational data model is to build a
# list of dataframes.
# This list is our "database":
#  - Each _entity_ of the datamodel is a dataframe. In an SQL database, these
#    would also be called "tables". In a spreadsheet this would be a "sheet".
#  - Each instance of an entity, i.e. one stored _item_, is a row of the data
#    frame. In an SQL database this would be a record. In a spreadsheet this is
#    a row.
#  - Each _attribute_ of an entity is a column of the dataframe. In an SQL
#    database this is a column, in a spreadsheet too.
#  - This doesn't necessarily solve the question of how we will store and curate
#    our source data - we will defer that to later. At first we talk only about
#    data representation internal to our R session, where we need it for
#    processing and analysis.

# Let's review syntax for creating and accessing such a structure, a list of
# data frames. You'll have to be absolutely confident with this, or you'll get
# lost in all the later learning units. We'll start from a compact example, a
# tiny database of philosophers to keep things brief. That database will have
# three tables: person, works and books. Person stores biographical data, books
# stores books, and works is a join table associating persons with their work.
# You should already be familiar with "join tables" and why we need them. This
# is the structure:
#
#   person: id, name, born, died, school
#   books:  id, title, published
#   works:  id, personID, bookID

# Perhaps draw out this schema to make things more clear.


# == 1.1 Building a sample database structure ==============================

# Let's build this structure.

philDB <- list()    # This is an empty list

# This is a data frame that we initialize with two philosophers
x <- data.frame(id = c(1,2),
                name = c("Laozi", "Martin Heidegger"),
                born = c(NA, "1889"),
                died = c("531 BCE", "1976"),
                school = c("Daoism", "Phenomenology"))
str(x)

# Let's add the dataframe to the philDB list and call it "person" there.
philDB[["person"]] <- x
str(philDB)

# and let's remove x so we don't mix up things later.
rm(x)

# We can address elements with the usual subsetting operators. I will use
# the $ operator for tables and columns, the [] operator for elements in
# columns. For example ...
philDB$person$name[1]    # Laozi

# task: Write an expression that returns all "school" entries from the
#       person table.

# Let's now add another person. There are several ways to do this, the
# conceptually cleanest is to create a one-row dataframe with the data, and
# rbind() it to the existing dataframe. Doing this, we must take care that
# the data frame column names are identical. What happens if they are not?
# Let's find out:

(x <- data.frame(a=1:4, b=11:14))
(y <- data.frame(a=6, c=17))
rbind(x, y)
(y <- data.frame(a=6, b=17))
rbind(x, y)

# All clear? That's good - this behaviour provides us with a sanity check on the
# operation. Incidentally: rbind(x, y) did NOT change the table ...
x
# rather rbind() had the changed table as its return value and that's why it
# was printed. To actually change the table, you need to ASSIGN the return
# value of rbind() ... like so:
x <- rbind(x, y)

# To continue ...
(x <- data.frame(id = 2,
                 name = "Zhuangzi",
                 born = "369 BCE",
                 died = "286 BCE",
                 school = "Daoism"))

# Add this to the "person" table in our database with rbind() ...
philDB$person <- rbind(philDB$person, x)

# ... and examine the result:
str(philDB)

# We made a serious error in our data! Did you spot it?
#
# If not, look at ...
philDB$person$id
# ... does that look OK?
#
# Absolutely not! "id" is the Primary Key in the table, and it has to be
# unique. How can we guarantee it to be unique? Certainly not when we
# enter it by hand. We need a function that generates a unique key. Here's
# a simple version, without any error-checking.
# It assumes that a column named "id" exists in the table, and that it holds
# the Primary Keys:
autoincrement <- function(table) {
  return(max(table$id) + 1)
}

# Try it:
autoincrement(philDB$person)

# Once that is clear, let's remove the Zhuangzi entry and recreate it correctly.
# Many ways to remove, here we use a logical expression to select matching
# record(s), apply the results to subset the data frame, and overwrite the
# existing table with the new one.

sel <- !(philDB$person$name == "Zhuangzi")    # select ...
philDB$person <- philDB$person[sel, ]         # ... and replace

str(philDB)

# Now let's add Zhuangzi with correct data. Note how we use the autoincrement
# function for the id

x <- data.frame(id = autoincrement(philDB$person),
                name = "Zhuangzi",
                born = "369 BCE",
                died = "286 BCE",
                school = "Daoism")
philDB$person <- rbind(philDB$person, x)
str(philDB)

# So far so good. Be honest with yourself. If you didn't follow any of this,
# go back, re-read, play with it, and ask for help. These are the foundations.


# === 1.1.1 completing the database

# Next I'll add one more person, and create the other two tables:
x <- data.frame(id = autoincrement(philDB$person),
                name = "Kongzi",
                born = "551 BCE",
                died = "479 BCE",
                school = "Confucianism")
philDB$person <- rbind(philDB$person, x)

# a table of major works ...
philDB[["books"]] <- data.frame(id = 1:5,
                                title = c("Zhuangzi",
                                          "Analects",
                                          "Being and Time",
                                          "Daodejing",
                                          "On the Way to Language"),
                                published = c("300 BCE",
                                              "220 BCE",
                                              "1927",
                                              "530 BCE",
                                              "1959"))

# a "join" table that links works and their author ...
philDB[["works"]] <- data.frame(id = 1:5,
                                personID = c(3, 4, 2, 1, 2),
                                bookID = c(1, 2, 3, 4, 5))

str(philDB)


# == 1.2 Querying the database =============================================

# To retrieve data, we need to subset tables, possibly based on conditions we
# find in other tables. Sometimes we can simply get the information, e.g.
# all names ...
philDB$person$name
# ... or all book titles ...
philDB$books$title
# ... but sometimes we need to cross-reference information via join tables. Here
# is an example where we list authors and their works, sorted alphabetically by
# author:

(sel <- order(philDB$person$name)) # check out ?order and describe to
                                   # someone you know what it does, so that
                                   # you are sure you understand it. Its
                                   # indirection can be a bit tricky to
                                   # understand.
( pID <- philDB$person$id[sel] )

for (ID in pID) {
  sel <- philDB$works$personID == ID    # TRUE for all works rows of this person
  bID <- philDB$works$bookID[sel]       # the book IDs recorded in these rows
  # Look up names and titles via their keys, not via row numbers: keys and
  # row numbers are only the same by coincidence.
  cat(sprintf("%s: ", philDB$person$name[philDB$person$id == ID]))      # person
  cat(sprintf("\"%s\" ", philDB$books$title[philDB$books$id %in% bID])) # books
  cat("\n")
}

# Examine the intermediate results and trace the logic until this is clear.


# == 1.3 Task: submit for credit (part 1/2) ================================

# Write code that adds another philosopher to the datamodel:
#    Immanuel Kant, (1724 - 1804), Enlightenment Philosophy.
#    Works: Critique of Pure Reason (1781), Critique of Judgement (1790)
# Paste your code into your submission page. Enclose it in
#     ...
# tags.
#
# Write and submit code that lists the philosophical schools in
# alphabetical order, and the books associated with them, also
# alphabetically. Format your output like:
#    Confucianism
#      Analects - (220 BCE)
#    Daoism
#      Daodejing - (530 BCE)
#      ... etc.
#
# Show the output of your code. Make sure the code itself is enclosed
# in
#     ...
# tags. DO NOT POST A SCREENSHOT OF YOUR OUTPUT,
# BUT COPY THE EXACT, COMPLETE OUTPUT, PASTE IT INTO YOUR SUBMISSION,
# AND FORMAT IT CORRECTLY.

# == Submission: Code to list the philosophical schools in alphabetical order
#    as well as their respective books in alphabetical order.

schools <- sort(unique(philDB$person$school))
for (s in schools) {
  cat(sprintf("%s\n", s))
  # IDs of all persons of this school (keys, not row numbers)
  authors <- philDB$person$id[philDB$person$school == s]
  # IDs of all books linked to these persons via the join table
  bookIDs <- philDB$works$bookID[philDB$works$personID %in% authors]
  books <- philDB$books[philDB$books$id %in% bookIDs, ]
  books <- books[order(books$title), ]    # sort the books alphabetically
  cat(sprintf("\t%s - (%s)\n", books$title, books$published), sep = "")
}


# = 2 Implementing the protein datamodel ==================================

# Working with the code above has probably illustrated a few concerns about
# curating data and storing it for analysis. In particular the join tables
# seem problematic: it is easy to make mistakes when figuring out the correct
# IDs.
#  - Data needs to be captured in a human-readable form so it can be verified
#    and validated;
#  - Some aspects of the database should _never_ be done by hand because
#    errors are easy to make and hard to see. That essentially includes
#    every operation that has to do with abstract, primary keys;
#  - Elementary operations we need to support are: adding data, selecting
#    data, modifying data and deleting data.

# We will therefore construct our protein database in the following way:
#  - For each table, we will keep the primary information in JSON files. There
#    it is easy to read, and to edit if needed.
#  - We will use simple scripts to read the JSON data and assemble it in
#    our database for further analysis.
#  - I have constructed initial files for yeast Mbp1 and nine other reference
#    species.
#  - I have written a small number of utility functions to read those files
#    and assemble them into a database.
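
# The four elementary operations named above can be sketched on a toy table.
# This is a minimal sketch only, assuming a data frame with an integer "id"
# column as Primary Key. The names demoDB, nextID() and addRow() are
# hypothetical; they are not part of the course's utility functions.

nextID <- function(tbl) {
  # Return 1 for an empty table, otherwise one more than the largest key.
  # Note: after deleting the row with the largest key, that key is reused.
  if (nrow(tbl) == 0) { return(1L) }
  return(as.integer(max(tbl$id)) + 1L)
}

addRow <- function(tbl, ...) {
  # Build a one-row data frame from the named values in "..." and append it.
  # The key is generated here, never typed by hand.
  newRow <- data.frame(id = nextID(tbl), ..., stringsAsFactors = FALSE)
  return(rbind(tbl, newRow))
}

demoDB <- list()
demoDB$person <- data.frame(id = integer(),
                            name = character(),
                            stringsAsFactors = FALSE)

# add
demoDB$person <- addRow(demoDB$person, name = "Laozi")
demoDB$person <- addRow(demoDB$person, name = "Zhuangzi")

# select: the key of the person named "Zhuangzi"
demoDB$person$id[demoDB$person$name == "Zhuangzi"]

# modify
sel <- demoDB$person$name == "Laozi"
demoDB$person$name[sel] <- "Lao Tzu"

# delete
sel <- demoDB$person$name == "Zhuangzi"
demoDB$person <- demoDB$person[! sel, ]

str(demoDB)
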


# == 2.1 JSON formatted source data ========================================

# Have a look at the structure of the yeast Mbp1 protein data:
file.show("./data/MBP1_SACCE.json")

# - The whole thing is an array: [ ... ]. This is not necessary for a single
#   object, but we will have more objects in other files. And it's perfectly
#   legal to have an array with a single element.
# - The data is formatted as "key" : "value" pairs inside an object { ... }.
#   This keeps the association between data items and their semantics
#   explicit.
# - All keys are strings and they are unique in the object.
# - Values are mostly single strings and integers ...
# - ... except for "sequence". That one is an array of strings. Why? This is to
#   make it easier to format and maintain the data. JSON does not allow line
#   breaks within strings, but the strings we copy/paste from Genbank or other
#   sources might have line breaks, sequence numbers etc. So we need to
#   sanitize the sequence at some point. But since we need to do that
#   anyway, it is easier to see the whole sequence if we store it in chunks.

# The .utilities.R script that gets loaded whenever you open this project
# has already made sure the "jsonlite" package exists on your computer. This
# package supports our work with .json formatted data.

if (!
requireNamespace("jsonlite", quietly = TRUE)) {
  install.packages("jsonlite")
}
# Package information:
#  library(help = jsonlite)       # basic information
#  browseVignettes("jsonlite")    # available vignettes
#  data(package = "jsonlite")     # available datasets

x <- jsonlite::fromJSON("./data/MBP1_SACCE.json")
str(x)

x$name
unlist(x$sequence)


# == 2.2 "Sanitizing" sequence data ========================================

# Examine the dbSanitizeSequence() function:

dbSanitizeSequence

# Try:

dbSanitizeSequence(c("GAA", "ttc"))
dbSanitizeSequence("MsnQ00%0 I@#>YSary S G1 V2DV3Y>")
x <- "
  1 msnqiysary sgvdvyefih stgsimkrkk ddwvnathil kaanfakakr trilekevlk
 61 ethekvqggf gkyqgtwvpl niakqlaekf svydqlkplf dftqtdgsas pppapkhhha
121 skvdrkkair sastsaimet krnnkkaeen qfqsskilgn ptaaprkrgr pvgstrgsrr
...
"   # copy/paste from Genbank

dbSanitizeSequence(x)


# == 2.3 Create a protein table for our data model =========================

# === 2.3.1 Initialize the database

# The function dbInit contains all the code to return a list of empty
# data frames for our data model.

dbInit

myDB <- dbInit()
str(myDB)


# === 2.3.2 Add data

# fromJSON() returns a dataframe that we can readily process to add data
# to our table. Have a look at the function to add protein entries:

dbAddProtein

myDB <- dbAddProtein(myDB, jsonlite::fromJSON("./data/MBP1_SACCE.json"))
str(myDB)

# Let's check that the 833 amino acids of the yeast MBP1 sequence have
# safely arrived. Note the general idiom we use here to retrieve the data:
# we define a logical vector that is TRUE where a condition holds, then we
# subset a column with that vector.

sel <- myDB$protein$name == "MBP1_SACCE"
nchar(myDB$protein$sequence[sel])


# == 2.4 Complete the database =============================================

# Completing the database with Mbp1 data and data for 9 other "reference"
# species is more of the same. I have assembled the code in a script
# "./scripts/ABC-createRefDB.R" - open it, check it out, and then source it.
# It's really very simple, just reading some prepared files of data I have
# formatted with JSON, and assembling the data in our data model.
#
# The code is also very simple and in particular there is no checking for errors
# or inconsistencies. Have a look:

# Totally straightforward ...
dbAddTaxonomy
dbAddFeature

# Just slightly more complex, since we need to match the protein or feature
# name in the JSON file with its internal ID, and, when doing that, confirm
# that it CAN be matched and that the match is UNIQUE
dbAddAnnotation

# Now: create the database
source("./scripts/ABC-createRefDB.R")

str(myDB)


# === 2.4.1 Examples of navigating the database

# You can look at the contents of the tables in the usual way we access
# elements from lists and dataframes. Here are some examples:
myDB$protein
myDB$protein$RefSeqID
myDB$protein[,"name"]
myDB$taxonomy
myDB$taxonomy$species
biCode(myDB$taxonomy$species)

# Comparing two tables:
# Are all of the taxonomyIDs in the protein table present in the
# taxonomy table? We ought to check, because the way we imported the
# data from JSON objects, we could have omitted or forgotten some. But we can
# check this with one simple expression. Unravel it and study its components.

all(myDB$protein$taxonomyID %in% myDB$taxonomy$ID)

# If this is not TRUE, you MUST fix the problem before continuing.

# Cross-referencing information:
# What is the species name of the protein whose name is "MBP1_COPCI"?

sel <- myDB$protein$name == "MBP1_COPCI"
x <- myDB$protein$taxonomyID[sel]

sel <- myDB$taxonomy$ID == x
myDB$taxonomy$species[sel]


# == 2.5 Updating the database =============================================

# Basic tasks for databases include retrieving data, selecting data, updating
# and deleting data. Here we will take a simple, pedestrian approach:
#
# In case we need to modify any of the data, we modify it in the JSON file,
# save that, and recreate the database. The myDB database will only be
# used for analysis.
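
# The edit-and-recreate cycle described above can be sketched as follows. This
# is a minimal sketch under stated assumptions: the data lives in a temporary
# file, the record "MBP1_TEST", the object jsonDemo and the function
# rebuildDemo() are hypothetical, and the jsonlite package is installed.

if (requireNamespace("jsonlite", quietly = TRUE)) {

  rebuildDemo <- function(fn) {
    # Recreate the whole "database" from the JSON source file.
    db <- list()
    db$protein <- jsonlite::fromJSON(fn)
    return(db)
  }

  tmpJSON <- tempfile(fileext = ".json")
  writeLines('[{"name": "MBP1_TEST", "taxonomyID": 1}]', tmpJSON)
  jsonDemo <- rebuildDemo(tmpJSON)
  print(jsonDemo$protein$taxonomyID)    # value from the first version

  # Simulate editing and saving the JSON file, then recreate the database.
  writeLines('[{"name": "MBP1_TEST", "taxonomyID": 2}]', tmpJSON)
  jsonDemo <- rebuildDemo(tmpJSON)
  print(jsonDemo$protein$taxonomyID)    # value from the edited file

  unlink(tmpJSON)                       # remove the temporary file
}

# The object in memory is never patched: the JSON file remains the only
# place where data is edited.
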
#


# = 3 Add your own data ===================================================

# You have defined a genome-sequenced fungus as "MYSPE", and your final task
# will be to find the protein in MYSPE that is most similar to yeast Mbp1, and
# to enter its information into the database.


# == 3.1 Find a protein ====================================================

# The BLAST algorithm will be properly introduced in a later learning unit -
# for now just use it in the following way:
#
#   - Navigate to https://blast.ncbi.nlm.nih.gov/Blast.cgi and click on
#       Protein BLAST.
#   - Enter NP_010227 into the "Query Sequence" field.
#   - Choose "Reference proteins (refseq_protein)" as the "Database" in the
#       "Choose Search Set" section.
#   - Paste the MYSPE species name into the "Organism" field.
#
#   - Click the "BLAST" button.

# You will probably get more than one result. If you get dozens of results or
# more, or if you get no results, something went wrong. Reconsider whether the
# problem was with your input, try something different, or ask for help.

# Otherwise, look for the top-hit in the "Descriptions" tab. In some cases
# there will be more than one hit with similar E-values. If this is the
# case for MYSPE, choose the one with the higher degree of similarity (more
# identities) with the N-terminus of the query - i.e. the first ~ 100 amino
# acids of the Query sequence.

# - If you are submitting this unit for credit, you will need to paste the
#   relevant section of the BLAST results into your submission page (see task).
# - Follow the link to the protein data page, linked from "Accession".
# - From there, in a separate tab, open the link to the taxonomy database page
#   for MYSPE which is linked from the "ORGANISM" record.


# == 3.2 Put the information into JSON files ===============================

# - Next make a copy of the file "./data/MBP1_SACCE.json" in the "data"
#   directory and give it a new name that corresponds to MYSPE - e.g.
#   if MYSPE is called "Cryptococcus neoformans", your file should be called
#   "MBP1_CRYNE.json"; in that case "MBP1_CRYNE" would also be the
#   "name" of your protein. Open the file in the RStudio editor and replace
#   all of the MBP1_SACCE data with the corresponding data of your protein.
#
#   Note: The UniProt ID may not be listed on the NCBI page. To retrieve
#   it, navigate to http://www.uniprot.org/mapping/ , paste your RefSeq ID
#   into the query field, make sure "RefSeqProtein" is selected for "From"
#   and "UniProtKB" is selected for "To", and click "Go". In case this does
#   not retrieve a single UniProt ID, contact me.
#
#   Save your .json file into your myScripts directory.
#
# Confirm this step:

if (file.exists(sprintf("./myScripts/MBP1_%s.json", biCode(MYSPE)))) {
  cat("Excellent - all good to continue.\n")
} else {
  stop(sprintf(" The file \"./myScripts/MBP1_%s.json\" does not exist",
               biCode(MYSPE)))
}

#
#
# - Do a similar thing for the MYSPE taxonomy entry. Copy
#   "./data/refTaxonomy.json" and make a new file named "MYSPEtaxonomy.json",
#   where MYSPE stands for the five-letter code of your species as returned by
#   biCode(MYSPE) (e.g. "CRYNEtaxonomy.json"). Save it in your myScripts
#   directory. Create a valid JSON file with only one single entry - that of
#   MYSPE.
#
# Confirm this step:

if (file.exists(sprintf("./myScripts/%staxonomy.json", biCode(MYSPE)))) {
  cat("Excellent - all good to continue.\n")
} else {
  stop(sprintf(" The file \"./myScripts/%staxonomy.json\" does not exist",
               biCode(MYSPE)))
}

# - Validate your two files online at https://jsonlint.com/


# == 3.3 Create an R script to create your own database ====================

# Next: to create your own database.
# - Make a new R script, call it "makeProteinDB.R"
# - enter the following expression as the first command:
#       source("./scripts/ABC-createRefDB.R")
# - then add the two commands that add your protein and taxonomy data,
#   they should look like the following, where <MYSPE> stands for the code
#   of your species that you used in your file names (e.g. CRYNE):
#
# myDB <- dbAddProtein(myDB,
#                      jsonlite::fromJSON("./myScripts/MBP1_<MYSPE>.json"))
# myDB <- dbAddTaxonomy(myDB,
#                       jsonlite::fromJSON("./myScripts/<MYSPE>taxonomy.json"))
#
#
# - save the script in the ./myScripts/ folder and source() it:
#
# source("./myScripts/makeProteinDB.R")    # <<<- This command ...
#
# ... needs to be executed whenever you recreate the database. In particular,
# whenever you have added or modified data in any of the JSON files. Later you
# will add more information.

# Remember this principle. Don't rely on objects in memory - you might
# "break" them with a code experiment. But always have a script with
# which you can create what you need.


# === 3.3.1 Check and validate

# Is your protein named according to the pattern "MBP1_MYSPE"? It should be.
# And does the taxonomy table contain the systematic name? It should be the same
# that you get when you type MYSPE into the console.

# Let's compute sequence lengths on the fly (with the function nchar() ), and
# open this with the table viewer function View()

View(cbind(myDB$protein[ , c("ID", "name", "RefSeqID")],
           length = nchar(myDB$protein$sequence)))

# Does your protein appear in the last row of this table? Where does your
# protein's length fall relative to the reference proteins? About the same? Much
# shorter? Much longer? If it is less than 500 amino acids long, I would suspect
# an error. Contact me for advice.

# Is that the right sequence? Is it the same as the one on the NCBI protein
# database page?
myDB$protein$sequence[nrow(myDB$protein)]

# If not, don't continue! Fix the problem first.
# Let me repeat: If this does not give you the right sequence of the MYSPE
#                Mbp1 homologue, DO NOT CONTINUE. Fix the problem.
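
# One way to make the sequence comparison above less error-prone than checking
# by eye is to normalize both versions and compare them in code. This is a
# minimal sketch, assuming that only letters are sequence characters.
# cleanSeq() is a hypothetical stand-in written for this illustration; it is
# NOT the course's dbSanitizeSequence() function. The example strings are the
# first 30 residues of the Genbank excerpt shown in section 2.2.

cleanSeq <- function(s) {
  # Collapse chunks, drop everything that is not a letter, convert to uppercase.
  s <- paste(s, collapse = "")
  return(toupper(gsub("[^A-Za-z]", "", s)))
}

pasted <- "  1 msnqiysary sgvdvyefih stgsimkrkk"    # as copied from a web page
stored <- "MSNQIYSARYSGVDVYEFIHSTGSIMKRKK"          # as held in a database

identical(cleanSeq(pasted), stored)    # TRUE if the two sequences agree
nchar(cleanSeq(pasted))                # length of the normalized sequence

# With real data, "pasted" would be the sequence copied from the NCBI page and
# "stored" would be myDB$protein$sequence[nrow(myDB$protein)].
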


# Is that the right taxonomy ID and binomial name for MYSPE?
# This question may be a bit non-trivial ... MYSPE is a species, but the
# recorded taxonomy ID may be a strain. We have a utility function,
# sameSpecies() that normalizes organism name to the binomial species.
#
sel <- sameSpecies(myDB$taxonomy$species, MYSPE)
myDB$taxonomy[sel, ]

# If not, or if the result was "<0 rows> ... " then DO NOT CONTINUE.
# Fix the problem first.

# Does this give you the right refseq ID for MBP1_MYSPE?
sel <- myDB$protein$name == paste0("MBP1_", biCode(MYSPE))
myDB$protein$RefSeqID[sel]

# If not, or if the result was empty ("character(0)"), then DO NOT CONTINUE.
# Fix the problem first.


# == 3.4 Task: submit for credit (part 2/2) ================================

# - On your submission page, copy/paste the BLAST result headers from the
#   "Alignments" tab, to demonstrate that the data justifies your choice of
#   protein; you don't need to paste the whole alignment, just the header(s).
#   Note the relevant values separately: eValue, coverage, %ID etc. and link
#   to your protein's NCBI protein database page. (Note: in case there is
#   more than one high-scoring segment included for the SAME protein, you
#   need to show the results for all of its high-scoring segments.)
# - Copy and paste the contents of your two JSON files on your submission
#   page on the Student Wiki. Make sure they are enclosed in
#     ...
#   tags.
# - Execute the three commands below and show the result on your submission
#   page

biCode(myDB$taxonomy$species) %in% biCode(MYSPE)
sel <- sameSpecies(myDB$taxonomy$species, MYSPE)
myDB$protein$taxonomyID %in% myDB$taxonomy$ID[sel]

# That is all.


# [END]