Project 0: Computing Environment
Over the course of the semester, you will work with a variety of software packages, including PostgreSQL, Apache Spark, Python Django, and others. Installing those packages and getting started can often be a hassle, because of software dependencies. You have three choices.
- Install the different software packages on your own machine (most of these packages should have tutorials to install them on different OSs). If you have a Linux box or a Mac, this should be possible; it may be more difficult with Windows. In any case, although we will try our best, we would likely not be able to help you with any problems.
- (Preferred Option) Use Vagrant with Virtual Box (as discussed below). If you have a reasonably modern machine (within last 3-4 years), VirtualBox should generally work fine, but with older laptops, the performance may not be as good. See below for more details on this.
- Use a VM in the CS Department's Horvitz Cluster. Last year, only one person chose to use this, so we will likely do this on demand, if anyone is really interested. You can also similarly use a VM in Amazon or Microsoft Azure clusters -- both of them provide free VMs for beginning users. The problem with this approach is that you only have
ssh
access into that machine, so you can't run a web browser, etc., without some work.
Git & Github
Git is one of the most widely used version control management systems today, and invaluable when working in a team. GitHub is a web-based hosting service built around git -- it supports hosting git repositories, user management, etc. There are other similar services, e.g., bitbucket.
We will use GitHub to distribute the assignments, and other class materials. Our use of git/github for the class will be minimal; however, we encourage you to use it for collaboration for your class project, or for other classes.
Cloning the Class Repository
We will be using the department gitlab
for downloading project materials,
and Elms for uploading. Clone Project0
by installing git
, and then:
git clone https://gitlab.cs.umd.edu/keleher/p0.git
NOTE: If you are having trouble installing git
, you can just download the files instead (as a zipfile), although updating may become tedious.
Vagrant + Virtual Box
Virtual Box is a virtualization software package (similar to VMWare or Parallels), which allows you to construct and run virtual machines on your computer. Vagrant enables users to create and configure lightweight, reproducible, and portable development environments using VirtualBox. We will provide VagrantFiles
for different assignments, that will help you start the VMs with the requisite packages installed.
- In order to use this option, you have to first install Vagrant and VirtualBox on your machine (called
Host
henceforth). See the instructions on the two websites above to do that. - Vagrant makes things super-easy. We will provide you with appropriate setup files -- all you need to do is
vagrant up
to start the virtual machine. - More specifically: in the git sub-directory
project0
, runvagrant up
. This will start the virtual machine in the background. - By default, vagrant only provides ssh access into the virtual machine (called
Guest VM
henceforth), usingvagrant ssh
. This will work as if you are doingssh
into a remote machine. - The Guest VM will have access to the files in the current directory in the host machine (i.e., the files in the
project0
directory and its subdirectories). These are mounted in the guest VM at/vagrant
. It would be best if you only make edits to that directory -- since those edits will survive avagrant destroy
. In fact, you can continue using your favorite text editor (in the host machine) to edit files, and only use the VM for running specific programs (likepostgres
,psql
, oripython notebook
below). - If the Guest VM has a program (e.g., a Web Server) running and listening on a specific port (e.g., 80), you can access those ports from the host machine by adding them to the
VagrantFile
. The provided VagrantFile has two such mappings: for port 8888, used by iPython (mapped to port 8888 on the host machine), and for port 80, used by a Web server (mapped to port 8080 on the host machine). - If you just exist
ssh
, then the VM continues running in the background. - Some other vagrant commands that you would need to be familiar with:
-
vagrant destroy
: will remove all traces of the guest machine from your system. It'll stop the guest machine, power it down, and remove all of the guest hard disks. Any data stored on the VM will be deleted (except in/vagrant/
). When you are ready to work again, just dovagrant up
. -
vagrant suspend
: will save the current running state of the machine and stop it. When you dovagrant up
again, it will restart with that state. -
vagrant halt
: will shutdown the machine.
-
PostgreSQL
PostgreSQL is a full-fledged and powerful relational database system, and will be used for several assignments.
PostgreSQL is already installed on your virtual machine. To get started, start the virtual machine using vagrant up
.
The current version of PostgreSQL is 9.5.4. You will find the detailed documentation at:
http://www.postgresql.org/docs/9.5/interactive/index.html. However, the version installed on the VMs is 9.3.14, the one available through apt-get
right now.
Following steps will get you started with creating a database and populating it with the University
dataset provided on the book website: http://www.db-book.com
-
You will be using PostgreSQL in the client-server mode. Recall that the server is a continuously running process that listens on a specific port (the actual port would differ, and you can usually choose it when starting the server). In order to connect to the server, the client will need to know the port. The client and server are often on different machines, but for you, it may be easiest if they are on the same machine (i.e., the virtual machine).
-
Using the psql client is the easiest -- it provides a command-line access to the database. But there are other clients too, including a GUI (although that would require starting the VM in a GUI mode, which is a bit more involved). We will assume psql here. If you really want to use the graphical interfaces, we recommend trying to install PostgreSQL directly on your machine.
-
Important: The server should be already started on your virtual machine -- you do not need to start it. However, the following two help pages discuss how to start the server: Creating a database cluster and Starting the server
-
PostgreSQL server has a default superuser called postgres. You can do everything under that username, or you can create a different username for yourself. If you run a command (say
createdb
) without any options, it uses the same username that you are logged in under (i.e.,vagrant
). However, if you haven't created a PostgreSQL user with that name, the command will fail. You can either create a user (by logging in as the superuser), or run everything as a superuser (typically with the option: -U postgres). -
After the server has started, the first step is to create a database, using the createdb command. PostgreSQL automatically creates one database for its own purpose, called postgres. It is preferable you create a different database for your data. Here are more details on createdb: http://www.postgresql.org/docs/current/static/tutorial-createdb.html
-
We will create a database called university.
createdb university
-
Once the database is created, you can connect to it. There are many ways to connect to the server. The easiest is to use the commandline tool called psql. Start it by:
psql university
psql takes quite a few other options: you can specify different user, a specific port, another server etc. See documentation: http://www.postgresql.org/docs/current/static/app-psql.html
-
Note: you don't need a password here because PostgreSQL uses what's called
peer authentication
by default. You would typically need a password for other types of connections to the server (e.g., through JDBC).
Now you can start using the database.
-
The psql program has a number of internal commands that are not SQL commands; such commands are often client and database specific. For psql, they begin with the backslash character:
\
. For example, you can get help on the syntax of various PostgreSQL SQL commands by typing:\h
. -
\d
: lists out the tables in the database. -
All commands like this can be found at: http://www.postgresql.org/docs/current/static/app-psql.html.
\?
will also list them out. -
To populate the database using the provided university dataset, use the following:
\i DDL.sql
, followed by\i smallRelationsInsertFile.sql
-
For this to work, the two .sql files must be in the same directory as the one where you started psql. The first command creates the tables, and the second one inserts tuples in it.
-
Create a different database
university_large
for the larger dataset provided (largeRelationsInsertFile.sql
). Since the table names are identical, we need a separate database. You would need this for the reading homework.
Python and Jupyter/IPython
We will be using Python for most of the assignments; you wouldn't typically use Python for systems development, but it works much better as an instructional tool. Python is easy to pick up, and we will also provide skeleton code for most of the assignments.
IPython is an enhanced command shell for Python, that offers enhanced introspection, rich media, additional shell syntax, tab completion, and rich history.
IPython Notebook started as a web browser-based interface to IPython, and proved especially popular with Data Scientists. A few years ago, the Notebook functionality was forked off as a separate project, called Jupyter. Jupyter provides support for many other languages in addition to Python.
-
Start the VM using
vagrant up
. Python, IPython, and Jupyter are already loaded. -
To use Python, you can just do
python
(oripython
), and it will start up the shell. -
To use Jupyter Notebook, do
cd /vagrant
followed by:jupyter notebook --port=8881 --no-browser --ip=0.0.0.0
This will start a server on the VM, listening on port 8881. We will access it from the host (as discussed above, the VagrantFile maps the 8881 port on the guest VM to the 8881 port on the host VM). To do that, simply start the browser, and point it to: http://127.0.0.1:8881
-
You should see the Notebooks in the displayed directory. Click to open the "Getting Started" Notebook, and follow the instructions therein.
-
The second Notebook ("Basics of SQL") covers basics of SQL, by connecting to your local PostgreSQL instance. The Notebook also serves as an alternative mechanism to run queries. However, in order to use that, you must set up a password in
psql
using\password
(set the password to beubuntu
).
Notes
You might see reboot required
on logging in. You can either ignore this, or
reboot via vagrant halt
and then vagrant up
again. If you reboot from
/sbin/reboot
you might lose the sync'd /vagrants
folder. If so, retrieve
via vagrant reload
.