FIU Demography Census Database Administrator Project

Question Description

I this lab, you are going to work with a real “big data” project. A big task for a database administrator is to ensure that the user queries work properly, and they do not take too long to execute. So far we have been working with small data sets, so even complex queries run very fast. However, let us see what happens when we run into really large data sets.

For this lab, I have obtained permission from PRDH (Historical Demography Research Programme) from Université de Montréal to use the 1881 Canada census with the Complete census data of Canada. This data set has a sample of over 4 million rows from the full census database. Yep. you read that right. 4 million. I have taken the initial steps of downloading the data from their site, cleaning it out a little bit, and building an SQL server database containing the raw data as well as associated lookup values. The following is the reference to the database:

Lisa Dillon, 1881 Canadian Census Project, North Atlantic Population Project, and Minnesota Population Center. National Sample of the 1881 Census of Canada (version 2.0). Montréal, QC: Département de Démographie, Université de Montréal [distributor], 2008.

Given the licensing agreements of this project, I am not allowed to install it permanently in a lab and can only distribute it via a password-protected process. Since the data is too large to be placed in Canvas, I have created a OneDrive link for you to download the file:

Link to OneDrive file (Links to an external site.)

This file is a backup file created using Microsoft SQL Server 2012, so it can be restored as a database in any SQL Server version 2012 or higher. To perform this lab, you will need SQL Server 2012 or higher. SQL Server 2019 community edition is available for free download from Microsoft from: https://www.microsoft.com/en-us/sql-server/sql-server-downloads (Links to an external site.)

You can download the free “express edition” which is sufficient for this lab.

Install SQL Server on your computer. You will need a 64 bit Windows computer (or a Windows virtual machine if you have a Mac). If you don’t have either let me know so I can arrange an alternative.
Fire up SQL Server Management Studio (you can just search in the start menu)
Log in to the local SQL server instance using the default credentials
Right-click Databases – Restore Database…
Under Source, select Device, click on the […] next to the file entry, and select the .bak file you downloaded above.
You can verify the backup media before restoring, and click Ok to restore.
You will now see a database called canadacensus in your list of databases.

Your task in this assignment.

A database administrator does a lot of things, including maintaining users, databases, taking backups and restoring, installing databases, updates, patches, etc. But one of the most important is to ensure that the databases run efficiently. One way to do it is by using an index (or indices).

In the class, you will learn about indices and how indices can speed up certain types of queries, and operations like order by, group by, etc. To see the effect of indices, we are going to look at a few fairly simple queries (at least simple compared to the last few labs). There are only 5, 4 fairly simple – the fifth one a little more complex, still should be pretty easy once you figure out how to do it (see the hint). Write the queries and try them out with the database exactly the way it is provided.

Find the information on the person with the ID 180211209
Select all persons with a specific ethnic origin (take your pick). Note that you may be better off using the origincode attribute (and join with the originlooiup table) rather than using the ethnicorigin column. The ethnicorigin column is user-entered and has a lot of errors. Try both queries and see how a join affects performance.
Select all persons practicing some form of Christianity, sorted by their last names, then first names. Again, you may find it easier to join with the religion lookup table rather than the “religion” attribute but try both of them to see the performance.
Find the number of households with more than 5 members, for each province and district.
Find the provinces where there are more women than men. This is a slightly difficult query – think through this a little bit. How do you count the number of women? What about number of men? Can you put both of them in the same query (you may want to think about using subqueries here).

Once you have come up with the above queries, you will need to determine how long each query takes to execute on the original database. To do this, you will need to do this:

set statistics time on;[your query];

Run your query – along with the results tab, you will now also see additional information in a “Messages” tab. You should see something like below:

SQL Server parse and compile time:    CPU time = 0 ms, elapsed time = 1 ms. SQL Server Execution Times:   CPU time = 0 ms,  elapsed time = 2 ms.(1000 rows affected) SQL Server Execution Times:   CPU time = 11 ms,  elapsed time = 140 ms.

The first reported timing is how long SQL server took to parse and compile your script. Should be pretty quick, 1-2ms. The second timing is for the set command, which should also be pretty quick, 1-2 ms. Then you will see a summary of the results of your query (in my case, I had a query that returns 1000 rows). Then you see the server execution time for the query. So the query I ran here took almost negligible CPU time, and about 140ms real-time. Note that since many different things in your system are using the CPU, your real elapsed time will always be greater than the CPU time. Depending on the machine you run your query on, your timings will change, so I recommend doing the complete assignment on the same machine if possible. Or at least you should do each query comparison in one machine.

Since many different things can affect the running of a query (including the system caching some of the data, running the same query multiple times may affect the timing. So run each query 5 times, and report the CPU time and elapsed time for each execution, and take the average CPU time and average Elapsed time.

Now, given what you have learned in this class, try to come up with what you think should speed up your queries. Some examples would be – to create a primary key (or creating a unique index) which will create a clustered index on your database. Make sure you keep the statistics on, and for each optimization, record the time it took to create the index (or key). Then run the query you were trying to optimize (5 times) and record the results.

See the Excel spreadsheet attached for a simple template for your results. Create appropriate graphs or charts to present your results. Note that for each query, I have data points for unoptimized, as well as two different optimizations – you can think about different indexes to create and run the query based on the index. Note that sometimes creating one index will affect potentially all the queries, so do the unoptimized versions first, and run all the queries that you think will be affected by your optimization.

ExperimentResultsTemplate.xlsx

Submit a Word document containing a narrative of what you did, and what type of performance improvement, if any, you were able to get. Copy and paste any table or chart from your excel spreadsheet. Submit the excel spreadsheet as well.

Note – this assignment requires SQL Server 2012 or higher. You can install it in Windows, or under a virtual machine in a Mac. message me if you want to follow along with the installation process if you do not already have access to SQL Server 2012 (or higher).

^{"Place your order now for a similar assignment and have exceptional work written by our team of experts, guaranteeing you "A" results."}