Skip to content
Snippets Groups Projects
Code owners
Assign users and groups as approvers for specific file changes. Learn more.

Project 0: Computing Environment

Over the course of the semester, you will work with a variety of software packages, including PostgreSQL, Apache Spark, Python Django, and others. Installing those packages and getting started can often be a hassle, because of software dependencies. You have three choices.

  • Install the different software packages on your own machine (most of these packages should have tutorials to install them on different OSs). If you have a Linux box or a Mac, this should be possible; it may be more difficult with Windows. In any case, although we will try our best, we would likely not be able to help you with any problems.
  • (Preferred Option) Use Vagrant with Virtual Box (as discussed below). If you have a reasonably modern machine (within last 3-4 years), VirtualBox should generally work fine, but with older laptops, the performance may not be as good. See below for more details on this.
  • Use a VM in the CS Department's Horvitz Cluster. Last year, only one person chose to use this, so we will likely do this on demand, if anyone is really interested. You can also similarly use a VM in Amazon or Microsoft Azure clusters -- both of them provide free VMs for beginning users. The problem with this approach is that you only have ssh access into that machine, so you can't run a web browser, etc., without some work.

Git & Github

Git is one of the most widely used version control management systems today, and invaluable when working in a team. GitHub is a web-based hosting service built around git -- it supports hosting git repositories, user management, etc. There are other similar services, e.g., bitbucket.

We will use GitHub to distribute the assignments, and other class materials. Our use of git/github for the class will be minimal; however, we encourage you to use it for collaboration for your class project, or for other classes.

Cloning the Class Repository

We will be using the department gitlab for downloading project materials, and Elms for uploading. Clone Project0 by installing git, and then:

git clone https://gitlab.cs.umd.edu/keleher/p0.git

NOTE: If you are having trouble installing git, you can just download the files instead (as a zipfile), although updating may become tedious.

Vagrant + Virtual Box

Virtual Box is a virtualization software package (similar to VMWare or Parallels), which allows you to construct and run virtual machines on your computer. Vagrant enables users to create and configure lightweight, reproducible, and portable development environments using VirtualBox. We will provide VagrantFiles for different assignments, that will help you start the VMs with the requisite packages installed.

  • In order to use this option, you have to first install Vagrant and VirtualBox on your machine (called Host henceforth). See the instructions on the two websites above to do that.
  • Vagrant makes things super-easy. We will provide you with appropriate setup files -- all you need to do is vagrant up to start the virtual machine.
  • More specifically: in the git sub-directory project0, run vagrant up. This will start the virtual machine in the background.
  • By default, vagrant only provides ssh access into the virtual machine (called Guest VM henceforth), using vagrant ssh. This will work as if you are doing ssh into a remote machine.
  • The Guest VM will have access to the files in the current directory in the host machine (i.e., the files in the project0 directory and its subdirectories). These are mounted in the guest VM at /vagrant. It would be best if you only make edits to that directory -- since those edits will survive a vagrant destroy. In fact, you can continue using your favorite text editor (in the host machine) to edit files, and only use the VM for running specific programs (like postgres, psql, or ipython notebook below).
  • If the Guest VM has a program (e.g., a Web Server) running and listening on a specific port (e.g., 80), you can access those ports from the host machine by adding them to the VagrantFile. The provided VagrantFile has two such mappings: for port 8888, used by iPython (mapped to port 8888 on the host machine), and for port 80, used by a Web server (mapped to port 8080 on the host machine).
  • If you just exist ssh, then the VM continues running in the background.
  • Some other vagrant commands that you would need to be familiar with:
    • vagrant destroy: will remove all traces of the guest machine from your system. It'll stop the guest machine, power it down, and remove all of the guest hard disks. Any data stored on the VM will be deleted (except in /vagrant/). When you are ready to work again, just do vagrant up.
    • vagrant suspend: will save the current running state of the machine and stop it. When you do vagrant up again, it will restart with that state.
    • vagrant halt: will shutdown the machine.

PostgreSQL

PostgreSQL is a full-fledged and powerful relational database system, and will be used for several assignments.

PostgreSQL is already installed on your virtual machine. To get started, start the virtual machine using vagrant up.

The current version of PostgreSQL is 9.5.4. You will find the detailed documentation at: http://www.postgresql.org/docs/9.5/interactive/index.html. However, the version installed on the VMs is 9.3.14, the one available through apt-get right now.

Following steps will get you started with creating a database and populating it with the University dataset provided on the book website: http://www.db-book.com

  • You will be using PostgreSQL in the client-server mode. Recall that the server is a continuously running process that listens on a specific port (the actual port would differ, and you can usually choose it when starting the server). In order to connect to the server, the client will need to know the port. The client and server are often on different machines, but for you, it may be easiest if they are on the same machine (i.e., the virtual machine).

  • Using the psql client is the easiest -- it provides a command-line access to the database. But there are other clients too, including a GUI (although that would require starting the VM in a GUI mode, which is a bit more involved). We will assume psql here. If you really want to use the graphical interfaces, we recommend trying to install PostgreSQL directly on your machine.

  • Important: The server should be already started on your virtual machine -- you do not need to start it. However, the following two help pages discuss how to start the server: Creating a database cluster and Starting the server

  • PostgreSQL server has a default superuser called postgres. You can do everything under that username, or you can create a different username for yourself. If you run a command (say createdb) without any options, it uses the same username that you are logged in under (i.e., vagrant). However, if you haven't created a PostgreSQL user with that name, the command will fail. You can either create a user (by logging in as the superuser), or run everything as a superuser (typically with the option: -U postgres).

  • After the server has started, the first step is to create a database, using the createdb command. PostgreSQL automatically creates one database for its own purpose, called postgres. It is preferable you create a different database for your data. Here are more details on createdb: http://www.postgresql.org/docs/current/static/tutorial-createdb.html

  • We will create a database called university.

     createdb university
  • Once the database is created, you can connect to it. There are many ways to connect to the server. The easiest is to use the commandline tool called psql. Start it by:

     psql university

    psql takes quite a few other options: you can specify different user, a specific port, another server etc. See documentation: http://www.postgresql.org/docs/current/static/app-psql.html

  • Note: you don't need a password here because PostgreSQL uses what's called peer authentication by default. You would typically need a password for other types of connections to the server (e.g., through JDBC).

Now you can start using the database.

  • The psql program has a number of internal commands that are not SQL commands; such commands are often client and database specific. For psql, they begin with the backslash character: \. For example, you can get help on the syntax of various PostgreSQL SQL commands by typing: \h.

  • \d: lists out the tables in the database.

  • All commands like this can be found at: http://www.postgresql.org/docs/current/static/app-psql.html. \? will also list them out.

  • To populate the database using the provided university dataset, use the following: \i DDL.sql, followed by

    \i smallRelationsInsertFile.sql
  • For this to work, the two .sql files must be in the same directory as the one where you started psql. The first command creates the tables, and the second one inserts tuples in it.

  • Create a different database university_large for the larger dataset provided (largeRelationsInsertFile.sql). Since the table names are identical, we need a separate database. You would need this for the reading homework.


Python and Jupyter/IPython

We will be using Python for most of the assignments; you wouldn't typically use Python for systems development, but it works much better as an instructional tool. Python is easy to pick up, and we will also provide skeleton code for most of the assignments.

IPython is an enhanced command shell for Python, that offers enhanced introspection, rich media, additional shell syntax, tab completion, and rich history.

IPython Notebook started as a web browser-based interface to IPython, and proved especially popular with Data Scientists. A few years ago, the Notebook functionality was forked off as a separate project, called Jupyter. Jupyter provides support for many other languages in addition to Python.

  • Start the VM using vagrant up. Python, IPython, and Jupyter are already loaded.

  • To use Python, you can just do python (or ipython), and it will start up the shell.

  • To use Jupyter Notebook, do cd /vagrant followed by:

     jupyter notebook --port=8881 --no-browser --ip=0.0.0.0

This will start a server on the VM, listening on port 8881. We will access it from the host (as discussed above, the VagrantFile maps the 8881 port on the guest VM to the 8881 port on the host VM). To do that, simply start the browser, and point it to: http://127.0.0.1:8881

  • You should see the Notebooks in the displayed directory. Click to open the "Getting Started" Notebook, and follow the instructions therein.

  • The second Notebook ("Basics of SQL") covers basics of SQL, by connecting to your local PostgreSQL instance. The Notebook also serves as an alternative mechanism to run queries. However, in order to use that, you must set up a password in psql using \password (set the password to be ubuntu).

Notes

You might see reboot required on logging in. You can either ignore this, or reboot via vagrant halt and then vagrant up again. If you reboot from /sbin/reboot you might lose the sync'd /vagrants folder. If so, retrieve via vagrant reload.