Introduction To Data Modeling and Data Access Methods
John "Scooter" Morris
April 5, 2017
Overview
- Limitations
- Data Modeling
- ER Diagrams
- ER Diagrams - Examples
- Data Access Methods
Limitations (Data Modeling)
- Data modeling is a large topic
- We're going to focus on one data modeling technique (Entity-Relationship Diagrams)
- What am I not telling you about?
- Other data modeling techniques (see Data Modeling on Wikipedia for a more complete list)
- Application modeling techniques like UML
- User modeling techniques that attempt to document the user interaction
- This is an introduction
- enough to get started and to know what you don't know (I hope)
- Ask questions!
Example Problem
A system to automate the tracking and documentation of plasmid construction
- Terminology:
- fragment: a length of double-stranded DNA
- plasmid: a circular fragment
- recipe: a series of manipulations of the DNA to produce a new plasmid with cDNA of interest inserted
- Needs:
- Data processing -- convert raw data into results
- Visualization -- a way to visualize the results
- Data storage -- store the results (and perhaps the raw data)
Example problem
Implementation Approaches
- Incremental implementation
- Start coding right away with small parts of system
- Add complexity as you go along
- Pros:
- get something done quickly
- learn by doing
- Cons:
- will probably have to throw out a lot of code
- early data model will constrain your implementation
- changes to data model will require significant refactoring
- over time, will become unmaintainable
- Only recommended for quick-and-dirty throw-away code
Implementation Approaches
- Detailed design
- Produce detailed design
- Data model
- Class diagrams
- UML, etc.
- Code only after design complete
- Pros:
- probably get cleaner implementation
- design documentation will serve to assist in maintenance
- better able to scope project and estimate resources
- Cons:
- very time consuming
- changes in research may happen too quickly to make this practical
- users may get impatient
- Only recommended for very limited, stable projects
- Data model is key
Implementation Approaches
- Hybrid approach
- Produce data model design
- Do incremental implementation
- Pros:
- the data model is the hardest thing to change later and probably has the largest impact on your code, so it is designed first
- data model documentation is useful for discussing the system with colleagues
- get the benefit of incremental implementation
- Cons:
- still have to spend some up-front design time
- will (undoubtedly) need to throw out some code or refactor
- Recommended approach for most projects
Data Modeling
- The FIRST Step
- Structured way to understand the data semantics
- Independent of underlying platform
- Way to communicate with team members (including users)
- Excellent (minimal?) documentation
- Example: ER Diagrams
ER Diagrams: Notation
- Entity (Entity Type)
- A collection of entities that share common properties (a thing)
- e.g. Fragment, Recipe, Gene
- Attribute
- Property of an entity that is of interest
- e.g. Name, File, Sequence
- Relationship
- An association between entities
- Degree
- Number of entities on each side of the relationship (often called cardinality)
- one-to-many, one-to-one, many-to-many
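One way to make the notation concrete is to map it onto relational tables. The sketch below is only an illustration under stated assumptions: the table and column names (fragment, recipe, recipe_fragment) and the sample rows are hypothetical, and treating recipe-to-fragment as many-to-many is an assumption rather than something taken from the lecture's diagrams. It uses Python's built-in sqlite3 module.

```python
import sqlite3

# Hypothetical schema: names and sample data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Entity types become tables; attributes become columns.
    CREATE TABLE fragment (
        name     TEXT PRIMARY KEY,
        sequence TEXT
    );
    CREATE TABLE recipe (
        name TEXT PRIMARY KEY,
        file TEXT
    );
    -- A many-to-many relationship becomes its own table:
    -- a recipe may use many fragments, and a fragment may appear in many recipes.
    CREATE TABLE recipe_fragment (
        recipe_name   TEXT REFERENCES recipe(name),
        fragment_name TEXT REFERENCES fragment(name),
        PRIMARY KEY (recipe_name, fragment_name)
    );
""")

conn.execute("INSERT INTO fragment VALUES ('pBR322.f1', 'ACGT')")
conn.execute("INSERT INTO fragment VALUES ('pBR322.f2', 'TTGA')")
conn.execute("INSERT INTO recipe VALUES ('clone-1', 'clone-1.txt')")
conn.executemany(
    "INSERT INTO recipe_fragment VALUES (?, ?)",
    [("clone-1", "pBR322.f1"), ("clone-1", "pBR322.f2")],
)

# Follow the relationship from a recipe to its fragments.
fragments_in_recipe = [
    row[0]
    for row in conn.execute(
        "SELECT fragment_name FROM recipe_fragment "
        "WHERE recipe_name = ? ORDER BY fragment_name",
        ("clone-1",),
    )
]
print(fragments_in_recipe)
```

A one-to-many relationship would not need the extra table: a foreign-key column on the "many" side is enough.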
ER Diagrams: Extended Example
- Extend the system....
- Add the ability to extract the experimental details
- Add more information about the gene: promoters, enhancers, RBS, introns, exons, CDS, etc.
- Add information about the protein: structure, function, sequence, etc.
ER Diagrams: Extended Example - 1
One possible design:
ER Diagrams: Extended Example - 2
Drop "Feature" entity:
ER Diagrams: Extended Example - 3
Expand structure:
ER Diagrams: Other Examples
- The canonical: employee/employer/department system
- Another database "favorite": sales/parts/inventory
- More relevant: on-line laboratory information management system (LIMS)
- Modeling systems:
- apoptosis signaling pathway
- ascending pain pathway
ER Diagrams: Apoptosis Signaling Pathway
ER Diagrams: Ascending Pain Pathway
ER Diagrams
- Recommended Reading:
- Chen, P. S. The entity-relationship model: toward a unified view of data. ACM Transactions on Database Systems 1(1), pp. 9-36 (March 1976)
Data Modeling Assignment
Put together an ER diagram for a database system for cellular pathways. Include information
about the proteins, metabolites, functions, interactions, cellular locations, and evidence codes. Don't attempt
to be complete -- focus on the major entities and their relationships.
Data Access Methods
- How is the data accessed?
- Why do we care?
- Important for special-purpose databases
- Some systems give you choices
- Terminology:
- Index: an access path into the data
- Key: a field (or fields) used to access the data
- Primary key: a field (combination) whose values uniquely identify the record
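The three terms can be illustrated with plain Python dictionaries standing in for indexes. This is a minimal sketch, not how any particular database implements them; the records and the field names (id, name, length) are made up for the example.

```python
# Hypothetical fragment records; field names and values are made up.
records = [
    {"id": 1, "name": "pBR322.f1", "length": 1200},
    {"id": 2, "name": "pBR322.f2", "length": 800},
    {"id": 3, "name": "pHR5CV", "length": 1200},
]

# Primary key: "id" uniquely identifies a record, so each key maps to one record.
by_id = {rec["id"]: rec for rec in records}

# Non-unique key: several records may share a length,
# so each key maps to a list of records.
by_length = {}
for rec in records:
    by_length.setdefault(rec["length"], []).append(rec)

print(by_id[2]["name"])
print([rec["name"] for rec in by_length[1200]])
```

Here by_id and by_length are the indexes (access paths), and id and length are the keys they are built on.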
Data Access Methods - Linear
- Simple record-oriented view
- Access is through sequential reads
- OK for small data stores -- very slow when the number of records gets large
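A minimal sketch of linear access, assuming records are dictionaries with a "name" field (an assumption made for the example). The function name linear_find is hypothetical. It counts how many records are read so the cost is visible.

```python
def linear_find(records, name):
    """Return (record, reads) for the first record with the given name.

    Every record before the match is read, so the cost grows in
    proportion to the number of records.
    """
    reads = 0
    for rec in records:
        reads += 1
        if rec["name"] == name:
            return rec, reads
    return None, reads


# Made-up data: 1000 records named frag0 .. frag999.
records = [{"name": f"frag{i}"} for i in range(1000)]

last, reads_for_last = linear_find(records, "frag999")
missing, reads_for_missing = linear_find(records, "no-such-fragment")
print(reads_for_last, reads_for_missing)
```

Finding the last record, or discovering that a record is absent, requires reading the whole store.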
Data Access Methods - Hash
- Compute a function to access the data
- e.g. add up the characters to produce an integer
- Usually requires a separate index
- The "goodness" of the hash function is important
- A perfect hash function would result in a direct access to the data (i.e. a one-to-one relationship)
- Perfect hash functions are almost never possible
- This results in the possibility of multiple "hits" per hash value (or bucket)
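The "add up the characters" idea can be sketched as follows. This is an illustrative toy, not a recommended hash function: the bucket count, the function names (char_sum_hash, insert, lookup) and the sample keys are all assumptions made for the example.

```python
def char_sum_hash(key, n_buckets):
    """Add up the character codes and fold the sum into a bucket number."""
    return sum(ord(ch) for ch in key) % n_buckets


N_BUCKETS = 8
buckets = [[] for _ in range(N_BUCKETS)]


def insert(key, value):
    buckets[char_sum_hash(key, N_BUCKETS)].append((key, value))


def lookup(key):
    # Only the one matching bucket is scanned, not the whole store.
    for k, v in buckets[char_sum_hash(key, N_BUCKETS)]:
        if k == key:
            return v
    return None


insert("abc", 1)
insert("cba", 2)  # same characters, same sum: lands in the same bucket as "abc"

print(char_sum_hash("abc", N_BUCKETS) == char_sum_hash("cba", N_BUCKETS))
print(lookup("abc"), lookup("cba"))
```

Because this hash ignores character order, any two keys that are rearrangements of each other collide, which is one reason the quality of the hash function matters.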
Data Access Methods - Hash
- Simple (and silly) example:
- Hash on the first letter of the recipe name:
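A minimal sketch of this first-letter hash, using made-up recipe names (assumptions for the example, not names from the lecture):

```python
from collections import defaultdict

# Hypothetical recipe names.
recipe_names = ["alpha-insert", "amp-swap", "beta-ligation", "gfp-fusion"]

# The hash value is simply the first letter; each letter is one bucket.
index = defaultdict(list)
for name in recipe_names:
    index[name[0].lower()].append(name)

print(dict(index))
```

Names that start with the same letter share a bucket, so a lookup still has to scan within that bucket.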
Data Access Methods - BTree
Data Access Methods - BTree
- Example: find pBR322.f2 assuming a BTree index on fragment name
- pBR322.f2 > pBR322 and <= pHR5CV
- so we take the middle node, which contains pBR322.f2
- If there are more layers, repeat the comparison at each layer until you reach the sequence set
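The lookup above can be sketched with a toy two-level tree. This is a simplified illustration, not a full BTree (no insertion, splitting or balancing). The separator keys pBR322 and pHR5CV come from the example; the other leaf entries and the function name find are assumptions made for the sketch.

```python
from bisect import bisect_left

# Hypothetical two-level tree. An index node holds separator keys and children:
# child i covers keys <= keys[i], and the last child covers everything larger.
# Leaves (the sequence set) are sorted lists of fragment names.
tree = {
    "keys": ["pBR322", "pHR5CV"],
    "children": [
        ["pAT153", "pBR322"],
        ["pBR322.f1", "pBR322.f2", "pHR5CV"],
        ["pUC18", "pUC19"],
    ],
}


def find(node, key):
    """Descend through index nodes, then binary-search the leaf."""
    while isinstance(node, dict):
        # bisect_left picks the first separator that is >= key.
        node = node["children"][bisect_left(node["keys"], key)]
    i = bisect_left(node, key)
    return i < len(node) and node[i] == key


print(find(tree, "pBR322.f2"))
```

With more layers, the children of an index node would themselves be index nodes, and the same loop keeps descending until it reaches a leaf.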
Data Access Methods - BTree
Data Access Methods
- There are many other indexing techniques
- Indexing can substantially improve access times
- Deciding what field to index on depends on usage patterns
- You can have multiple indices, but that substantially increases insert time and space requirements
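The cost of multiple indices can be made visible with a toy store that counts index updates. This is a sketch under stated assumptions: the class name IndexedStore, the field names and the sample records are hypothetical, and dictionaries stand in for real index structures.

```python
class IndexedStore:
    """Toy record store that maintains one dictionary index per indexed field."""

    def __init__(self, indexed_fields):
        self.records = []
        self.indexes = {field: {} for field in indexed_fields}
        self.index_updates = 0  # counts the extra work done on inserts

    def insert(self, record):
        self.records.append(record)
        # Every index must be updated on every insert.
        for field, index in self.indexes.items():
            index.setdefault(record[field], []).append(record)
            self.index_updates += 1

    def find(self, field, value):
        return self.indexes[field].get(value, [])


one_index = IndexedStore(["name"])
three_indexes = IndexedStore(["name", "length", "owner"])

for i in range(100):
    rec = {"name": f"frag{i}", "length": 100 * (i % 5), "owner": f"user{i % 3}"}
    one_index.insert(rec)
    three_indexes.insert(rec)

print(one_index.index_updates, three_indexes.index_updates)
```

Each extra index adds one more update per insert and one more structure to store, which is the trade-off against faster lookups on that field.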
Data Access Methods
- Questions?
- Recommended Reading:
- Knuth, D. E. The Art of Computer Programming, Volume III: Sorting and Searching. Reading, Mass.: Addison-Wesley (1973)