R & SQL: Simple Data Science with R and SQL

Mar 2, 2015

R & SQL: Simple Data Science with R and SQL



Hello World!

The topic of today's article is databases.

As Data Scientists and Statisticians work with data everyday, they wont actually use that 50 lines text file data-sets provided by teachers in the statistical analysis courses in a real-world applications.

Statisticians work with massive amounts of data, whether this data is stored in flat-files or in databases, the size of the analyzed data will definitely be more then a couple of hundred records.
Thus the need for a way to extract data from large tables stored in databases in a simple, intuitive way.


Whats is SQL ?


SQL means Structured Query Language. For me, it has always been the "Simple Query Language".

I've always used the term "Simple" to describe the simplicity of learning and using basic sql functions.



SQL is the standard language for relational database management systems (Here I am referring to the ANSI definition ).

So it is used to work with databases, performing various operations like creating tables, extracting data, updating tables, deleting and dropping entire databases...etc

You can fined more about the basics of this awesome language in the links bellow.

what are the different R & sql packages? 


As most of you know, R is a statistical analysis software. It has an amazing community, a wide range of packages (now 5000+ packages available ) and a strong documentation and it is used in almost all domains.

The old way of working with data stored in databases was to separate the extraction of the data and the analysis in two operations that are made in two or more software and that require time to first extract this data, aggregate it and save it in flat files and then open it with an analysis software and work on it.

With R, you can do all these operations in one interface, with the help of packages that will provide a way to connect to databases, use the SQL language to extract your data ( and also to add, delete and updates your tables) from your Rstudio interface ( I use Rstudio for it's simplicity and the various cool things that I can do,in just one interface).

The package:


For this task, we will use the ODBC Connectivity package "RODBC" 

This package is great for managing connections not only with SQL server databases but also with Mysql, 
PostgreSQL,Oracle and SQLite.   So it covers almost all the basic database sources that we need to use.

Simple example :


The RODBC package offers multiple possibilities to query the database and do many operations.


Making a connection with the database. The logic of the package is that the user need to make a connection channel from the client, in our case the R package, to the database management system.
And this connection need to be closed every time the work with the database is over.

the code bellow will give you a basic look on how to do the connection and the simple manipulation of a database in R.

I will try to enrich this post with more examples in the near future.