If you are new to working with databases, then this article is for you. In this post I will break down what databases are, how they are created, how you should create them and how websites use them to store everything from user credentials to the products that you add in your shopping cart, to these very words that you are reading right now.
If you are familiar with databases already, then still stick around as you might still learn something new as I try to introduce more complex topics into the mix further down below.
But first, let's start off with a few quick definitions.
What is a database?
Databases are software systems that can store data in a structured format that is ideal for data lookup, retrieval, addition, deletion and for managing relationships between this data.
A few examples of such systems are SQL Server, MySql and PostgresSQL, to name just a few. These systems are referred to as RDBMS's, or Relational Database Management Systems.
This is essentially software that you install on the host computer, typically a web server, and that you configure with the appropriate security protocols for your organization and your own needs. If you are working on a project locally for example, then you would go ahead and install any of the aforementioned software on your machine and you would be able to store and manage data on that particular machine.
Typically however, especially these days, we tend to see more and more of our database infrastructure running on cloud based architecture such as Amazon's AWS or Microsot's Azure platform.
Also note that there are other kinds of databases as well that are not relational in nature.
Databases are themselves comprised of multiple tables of data, which can pretty much resemble an Excel sheet.
What are tables?
What are tables?
Tables are grouped and named collections of rows and columns. Each row in a table is a single record of data and each column is a single property of that record.
Take the following User table as an example:
ID |
Firstname |
Lastname |
Email |
DateCreated |
Bob |
Bob |
G. |
bob@email.com |
8/17/2020 |
That would be the data-view of the User table. But there is another perspective, and that is the actual structure of the table.
The structure in this particular case would be:
User
ID : int
Firstname : string
Lastname : string
Email : string
DateCreated: datetime
Note that a real User table would have many more columns typically and can have many other relationships to other tables, so do NOT use that example as your de facto standard. But it should give you an idea of how tables look internally.
What is a column?
Think of a column as a single property in your data set. In most databases there are 3 key components in each column. Those would be:
- The column name
- The column type
- The default value for that column
Let's take an Address table as an example. Typically, your normal address would be comprised of the following fields:
- Address1
- Address2
- City
- State
- Zipcode
- Country
Those would be your typical column names for this kind of table. Ideally, you always want to be as specific as you can be with your naming convention to avoid confusion and to make development easier. By that I mean use standard naming convention and avoid shortening words such as using z_code instead of zipcode. While it is allowed, it might cause confusion on the code side of things.
A standard column that you should add to every single table in your databases is IsActive (bit) and it's essentially a flag that determines whether an item is still valid or not. You could consider this a deletion in a sense. For the sake of data integrity you normally do not want to do hard deletes on your records.
Notice that I prefix the name with "Is". This informs me that this column is a boolean value without me having to log in to the database to determine its type. I follow this same naming convention for any boolean columns in my database tables usually.
Column types
The column types available to you will mainly depend on the particular RDBMS that you are using, but the following list is relatively standard and can be found in practically every database server.
- varchar
- int
- double
- bit
- date
- datetime
Ideally you are free to use whichever type makes the most sense to you and to your project based on your given options.
For example, a zipcode will for the most part always be a numeric value, so you can store this column as an int value. However, there would be no harm in storing it as a varchar (string), particularly if you are never using it as an integer in your application.
Default values
Default values are just that. Assuming that you enter a record into the database without a given value, you have the option of designating this column with a value of the given type.
If you are using SQL Server as your database, you can set more complex values as defaults. For example, if you have a column such as DateCreated you can set a default to one of the many built-in SQL Server functions such as getdate() or getutcdate(). These functions will generate the appropriate values.
This is mainly useful for preventing excessive null values in your data, which could potentially increase the likelihood of errors in your code.
What is SQL?
SQL, or Structure Query Language, is the language of your database. It is mainly comprised of 4 main commands:
- SELECT
- INSERT
- UPDATE
- DELETE
There are dozens of other keywords in the SQL specification, however those 4 would make up your standard CRUD type implementation. For those unfamiliar with the CRUD acronym, that would be:
- Create
- Read
- Update
- Delete
Essentially, the fundamentals needed to create a strong user interface. Also note that each database system will have its own set of SQL statements and commands as well.
While the SQL specification includes certain keywords that must be implemented by each database vendor, each system can also have its own set of statements unique to that software.
SQL Server for example uses T-SQL as its subset query language which is specifically designed for SQL Server.
Primary keys (PK's)
In order to have a valid dataset, you need a way to identify each column uniquely. If for example you want to retrieve data for a particular user in your User table with a particular email, you would not expect to get back 3 or 4 records. Primary keys, while not required, are useful for identifying unique records in your database. Primary keys can either be their own columns, typically named ID or tablenameID and given a dynamically created and unique ID, or they can be combinations or other columns.
Given the Address table above, adding a primary key would be as simple as adding a new column to the table:
- AddressID
- Address1
- Address2
- City
- State
- Zipcode
- Country
Most RDBMS's allow you to specifically set a column as a PK. Doing this will automatically populate the column with a unique sequential ID each time a new record is added.
You also have the option of using multiple existing columns as the primary key as well. As long as these columns meet the criteria of being unique without the chance of duplication, you can consider it a primary key.
Having multiple columns as a primary key is referred to as a Composite Key.
Note that for the sake of simplicity and to truly avoid duplicates, I always ensure to add a new ID column to each of my database tables.
Referential integrity
Everything mentioned so far has been a good baseline for creating tables of data in a database. These tables however are sometimes related to each other.
Let's take the Address table mentioned above for example. Typically, you don't just store random addresses in a database. They normally belong to a user in your system. In this case you would need to add a UserID to your Address table that would link to the particular User in your database. In which case, your new table would look like the following:
- AddressID
- Address1
- Address2
- City
- State
- Zipcode
- Country
- UserID
Because an Address must belong to a user, the UserID column can not be null and must point to a valid UserID in the User table. This relationship is referrered to as Referential Integrity.
If you attempted to add a new Address given an invalid UserID, the database should indeed reject the INSERT as it violates referential integrity rules. The same would apply if you tried to DELETE a User without first deleting its related Addresses.
Also note that the UserID column in the Address table is referred to as a Foreign Key. However, in its own User table, it would be considered the Primary Key.
Normalization
Normalization is regards to a database is the process of creating relationships so that you reduce overall data redundancy and duplication. Using primary keys and foreign keys as mentioned above is one way to go about doing so, but there are many other methods that can be done as well.
One common pattern that is often seen in database development is the use of sequential columns in a table to store the same kind of data. Take the following table for example:
- UserID
- FirstName
- LastName
- Email
- Phone1
- Phone2
- Phone3
Specifically let's focus on how we are storing phone numbers in this case. While you can indeed store these values in the User table, note that in this particular case you will only be able to store up to 3 phone numbers. Not only is this limiting, but the only buyable solution in fixing it is to add yet another column Phone4 to the table.
Yet another issue to notice with this is that given that most users only have 1 phone number in the database, you will end up with needless 'null' values per user in the Phone2, Phone3 columns.
The solution to this is to create another table solely for storing phone numbers and to link each record to a particular user by adding the UserID foreign key.
- TelephoneID
- PhoneType
- Number
- UserID
Again, notice how we use the UserID foreign key (FK) to link this record to a particular user. In this case we would be able to add as many phone records to a user as is required without having to worry about null values or about being limited.
Those are the core concepts when it comes to database normalization. Just to recap:
1. Use primary keys to ensure uniqueness
2. Use foreign keys to maintain refential integrity
3. Do not use repeated columns for related data
4. Create separate tables to store related data
While there are even stricter protocols that you can follow in order to ensure higher levels of normalization, there is such as a thing as 'too much normalization' in which the development time and complexity will far exceed any benefit that you may get.
If you can get these core principles down then you will be one step closer to creating more robust and scalale software in no time.