2015

Dec 29, 2015

Intro to SAS Macro Programming (Part 1)


Trying different statistical languages and being comfortable with them is a big advantage for today's statisticians and data scientists.
This is why I wanted to do this series of posts, which will include (besides R and Python) SAS, SAS Macro, Julia, etc.


What is SAS?

I will try to be as unbiased as possible here. According to its Wikipedia page, SAS is a software suite that can mine, alter, manage and retrieve data from a variety of sources and perform statistical analysis on it.

SAS has a multitude of procedures specialized in statistical analysis, data manipulation and visualisation.

SAS programming consists of writing code composed of SAS Data Steps and SAS Procedures, and then executing it.

The concepts of parameters and variables are included in SAS Macros.



Dec 19, 2015

The hard thing about hard things: Teaching experience Feedback


Today was the last day of the Statistical Analysis Software course I gave at the Higher School of Statistics and Data Analysis of Tunis.

The only thing that's left is the final exam, so the work is not over... yet!

It is important to mention that teaching is a -very- hard thing to do, especially if you are a "caring" person and you do CARE about perfection and small details.

I have tried to be as reachable as possible to all my students, especially at the beginning of the class, when they were new and did not yet have any significant statistical or computing background.


A Summary of the points I have given in the class this semester



Making everybody happy and all the course material easy is not the goal here. The most important thing is to make sure that all the details are delivered to all students and that all the questions they ask about the course are answered.

Oct 20, 2015

Data Mining basic and advanced concepts - Part 1 : Variables





It is important to get to know your data. It's very tempting to jump straight to analyzing and forecasting, or to building predictive or regression models. But the data needs to be ready for that!

Real-world data is typically noisy, imperfect and heterogeneous. So cleaning this data and getting familiar with its structure is a crucial task.

Usually, this task of data cleaning is time-consuming.

I would say from personal experience that about 75 to 80 percent of the time spent with data is dedicated to getting it into a shape suitable for analysis and model building.
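As a rough illustration of what "getting to know your data" can look like in R, here is a minimal sketch. It uses the `airquality` data set that ships with base R purely as a stand-in (it is my assumption, not the data discussed in the post), and dropping incomplete rows is only one of many possible cleaning choices.

```r
# Quick first look at a data set before any modeling.
# `airquality` ships with base R and is used here only as a stand-in.
data(airquality)

str(airquality)       # variable names, types and a preview of the values
summary(airquality)   # per-variable summary statistics, including NA counts

# Count the missing values in each variable
missing_per_variable <- colSums(is.na(airquality))
print(missing_per_variable)

# Keep only the complete rows as a (very) simple cleaning step
clean_data <- na.omit(airquality)
rows_dropped <- nrow(airquality) - nrow(clean_data)
print(rows_dropped)
```

In a real project you would usually look at why values are missing before dropping anything.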

Oct 13, 2015

Sep 24, 2015

September Talk: Higher School of Statistics and Data analysis


Hello there!

I have been invited this year by the Junior Enterprise of the Higher School of Statistics and Data Analysis to talk about my (little) experience as a statistician and former student of this school.

As usual, the crowd was interesting, and the current students of the school were very interested in what I had to say.



Aug 30, 2015

Analyzing Twitter data with R (part 3: Cleaning & organizing the Data)


After explaining in the previous parts how to set up access to Twitter's API and how to import tweets with a simple R command, in this third part we will try to organize and clean the data we have imported into a good, reusable format.

So let's pick up from where we left off. After finishing the authentication process, we now want to import certain tweets. The command searchTwitter from the package twitteR will do the job for us.
We need to create a new object that will contain the data we are about to import, and assign it the result of our searchTwitter function.

In our example, we will query Twitter for tweets containing the hashtag #Tunisia. We will import 1000 English tweets and preview the imported data with the head() function:




Aug 17, 2015

String manipulation functions in R ( part 2)


First part of this series of posts.


This is the second part of a series of posts that cover strings and text manipulation in R. Each time, a few functions are picked and shared here with basic examples. I hope that at the end of this series, I'll have a good archive of lessons from beginner level to expert level.

The points we will see in this part 2 are:

  1. character encoding 
  2. length of a string 
  3. lowercase and uppercase conversion 
  4. basic string comparison 
  5. concatenating strings
  6. extracting sub-strings 
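Here is a minimal sketch of the six points above using only base R functions. The example string and the variable names are my own choices for illustration; the original examples from the post may differ.

```r
x <- "Statistics with R"

# 1. Character encoding
declared_encoding <- Encoding(x)   # "unknown" for a plain ASCII string
x_utf8 <- enc2utf8(x)              # convert the string to UTF-8

# 2. Length of a string (number of characters)
n_chars <- nchar(x)

# 3. Lowercase and uppercase conversion
x_lower <- tolower(x)
x_upper <- toupper(x)

# 4. Basic string comparison (case-sensitive by default)
same_raw   <- x == "statistics with r"
same_lower <- tolower(x) == "statistics with r"

# 5. Concatenating strings
with_space    <- paste("Hello", "World")   # separated by a space
without_space <- paste0("part", 2)         # no separator

# 6. Extracting sub-strings (start and stop positions, both inclusive)
first_word <- substr(x, 1, 10)

print(list(n_chars, x_lower, x_upper, same_raw, same_lower,
           with_space, without_space, first_word))
```

Note that `==` compares strings exactly, so lowering both sides first is a simple way to get a case-insensitive comparison.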

character encoding 



Aug 8, 2015

Mathematical Symbols in LATEX


I have been going through my hard drive and organizing some of the files I have (PDFs, eBooks and articles), and I found a very interesting document about LATEX math symbols.

Please deactivate ad-blockers to be able to see the PDF document!

UPDATE:

I published this article a few years ago and it has attracted the most attention of all my blog posts, so I'd like to thank all of the visitors who shared this post.

Please take some time to check out other articles I have shared about Data Science, Machine learning and Big Data:

Data Mining basic and advanced concepts - Part 2 : Describing Data




Here are some of the most interesting parts of it:


Aug 5, 2015

M&M&M : Mean ~ Median ~ Mode : Uncovered



Is it just me, or are people still confusing the 3 M's?

 


I've made a general observation about a common mistake some folks make when talking about statistical concepts: they cannot differentiate between the mean, the mode and the median of a particular data variable.

For statisticians and domain experts, this might seem like a "stupid" topic to talk about, since statistics is actually a very deep field, and those 3 points are just small dots in an ocean of concepts and knowledge.

But it is very important for those who work with numbers to really get the idea behind some of the basic concepts they see or use every day. As confusing as it might seem for some people, an argument where you use the "mean" to justify a decision you made will be totally wrong if what you were really referring to in your reasoning was the "mode".
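To make the difference concrete, here is a small sketch in R on a made-up sample with one large outlier. The numbers and the helper `stat_mode` are hypothetical, written only for this illustration. Base R has no built-in function for the statistical mode (its `mode()` function returns the storage type of an object), so the helper counts frequencies instead.

```r
# Made-up sample with one large outlier (30)
values <- c(2, 3, 3, 3, 4, 5, 6, 7, 30)

mean_value   <- mean(values)    # sum of the values divided by their count
median_value <- median(values)  # middle value once the data is sorted

# Hypothetical helper: most frequent value(s) in a vector.
# Returns several values when there is a tie.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}
mode_value <- stat_mode(values)

cat("mean:", mean_value, " median:", median_value, " mode:", mode_value, "\n")
```

Here the mean is 7, the median is 4 and the mode is 3: three different answers to "what is a typical value?", with the mean pulled upward by the single outlier.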

Jul 20, 2015

Analyzing Twitter data with R (part 2: Importing Tweets )



Twitter is a magnificent source of very interesting data about the world, trends, products, celebrities, current and past events, etc.

This is why I have been interested in analyzing and working with Twitter data for a long time.




In part one we learned how to set up an application and get some codes and keys to use later on.
In this part of the tutorial, we will look at the ways of searching and importing data from Twitter.

The authentication process is very easy. It has mainly 2 parts:


- Entering the keys and secrets you got from Twitter

- Submitting them for authentication and getting access to Twitter's API.


Jul 18, 2015

String manipulation functions in R ( part 1)





I wanted to do this quick tutorial because of an observation I made while working with some new R users who struggle with operations that do not involve numbers and statistical calculations.

Handling text, images and files of all formats is made possible within R by its numerous packages.



This time I started with basic strings, and I will probably add to this series later.

We will learn 4 new functions:

  • grep
  • grepl
These 2 functions search for matches of a given pattern within each element of a character vector. The only difference between them is the output: the first one gives the positions (indices) of the elements that match the search, the second one gives a logical result of TRUE or FALSE for every element of the vector.
  • gsub
This function replaces every occurrence of a given pattern with another string, in all the strings given as input to the function.
  • str_replace
This function, from the stringr package, replaces the first occurrence of a matched pattern in a string.



Let's start!
Here we input some text into a vector:
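The original listing is not reproduced here, so what follows is only a sketch of how these functions behave, with a made-up vector of my own. To keep the example free of extra packages, I use base R's `sub()` as a stand-in for `str_replace`: both replace only the first match, and the equivalent stringr call is shown in a comment.

```r
# Made-up example vector (not the one from the original post)
text_vector <- c("I love statistics", "R is great",
                 "Data science with R", "SAS and R")

# grep: indices of the elements that contain the pattern
match_positions <- grep("R", text_vector)

# grepl: one TRUE/FALSE per element
match_flags <- grepl("R", text_vector)

# gsub: replace every occurrence of the pattern in every element
replaced_all <- gsub("R", "Python", text_vector)

# First occurrence only: base R sub() used as a stand-in for str_replace.
# With stringr installed, the equivalent call would be:
#   stringr::str_replace("banana", "a", "A")
replaced_first <- sub("a", "A", "banana")
replaced_every <- gsub("a", "A", "banana")

print(match_positions)
print(match_flags)
print(replaced_all)
print(c(replaced_first, replaced_every))
```

Keep in mind that the pattern is treated as a regular expression by default; pass `fixed = TRUE` to `grep`, `grepl`, `sub` or `gsub` if you want a literal match.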


Jun 15, 2015

My trainee graduated ! Hola Big Data!



Today was a very happy day, for me and for many people in my professional circle.












It was the presentation day of a fellow statistician, who presented a project I have worked on with her for the past four months. This time I was on the enterprise side of the project.





Jun 10, 2015

What are the words they use? #Winou_Pétrole



After a long and quite popular online campaign about natural resources in Tunisia, online activists have made it a national movement that has spread very fast.

The media didn't start covering the online campaign until it got very big on Facebook.
One of the very positive effects of such a democratic form of expression is a hearing session of the minister of industry, energy and mines in the house of representatives, on the subject of natural resources in Tunisia.


Apr 2, 2015

#AttaqueBardo: Analyzing Bardo museum terrorist attack on Twitter




The recent terrorist attack in Tunisia has been the most violent attack on tourists in the country's history.
I have tried to analyze how this event impacted Tunisia's image in social media, and in particular on Twitter.

Twitter has been used to post and share information about major events in Tunisia, starting from the revolution, when internet activists shared info about the protests, providing the world with real-time updates of the situation in local towns where there were protests and violent events.

Tunisia has been under major threats from terrorist groups. These extremists have been planning and executing violent attacks on police forces, military and national guard agents throughout the country.

These attacks have a direct negative impact on tourism and investment in the country.

Terrorist attacks create panic just a few months before the summer season, which is the peak period for hotel reservations during the year.


Photos of the hostages being held inside the Bardo Museum


Mar 17, 2015

Analyzing Twitter data with R (part 1: connecting to Twitter API )



All "smart" businesses are looking to understand social media trends, to analyze the massive amount of public data available online, and to make insightful decisions based on these analyses.


Source: http://www.slashgear.com/twitter-data-grants-introduced-to-offer-select-institutes-data-trove-05315867/


For any statistician or future data scientist freshly graduating from university, it is very important to have certain practical skills alongside statistical modeling and mathematical knowledge.

In this series of posts, I will detail the steps you will need to access Twitter, import data, clean it, analyze it and draw a conclusion based on the data you have extracted.

It will be a simple step-by-step tutorial, if you'd like to call it that.

This series of posts is intended for students currently taking data science classes, and for anyone interested in the R language and social media in particular.

I've been asked to do the same thing with Python as well. I will try to make time and share some Python practices for those using it for data analysis. But I encourage you to learn R!


Mar 2, 2015

R & SQL: Simple Data Science with R and SQL



Hello World!

The topic of today's article is databases.

As Data Scientists and Statisticians work with data every day, they won't actually use, in real-world applications, the 50-line text file data sets provided by teachers in statistical analysis courses.

Statisticians work with massive amounts of data. Whether this data is stored in flat files or in databases, the size of the analyzed data will definitely be more than a couple of hundred records.
Thus the need for a way to extract data from large tables stored in databases in a simple, intuitive way.


What is SQL?


SQL means Structured Query Language. For me, it has always been the "Simple Query Language".

I've always used the term "Simple" to describe how easy it is to learn and use basic SQL functions.

Jan 8, 2015

Quick distribution plotting with R



I was asked by new students in the Statistics and Data Analysis School of Tunis (and by other friends) about ways to plot densities and the best software to do that.

I will try to give some examples of how to plot a density with R.

If you don't know what R is, you can take a look at my old article that explains that very well, with some good R learning resources (here, in French).

You know that densities are all about random data, and the first thing that comes to mind is the histogram.

To start this task, we will first set up the data sample that we will work with, which should be random, and then we will start plotting some nice graphics with the ggplot2 library.

Basic things: Histogram and density plots



regular histogram
where is the mean!

density plot
Overlay of density and histogram
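
The four plots listed above can be sketched with ggplot2 roughly as follows. This is a minimal sketch under my own assumptions, not the original code: the random sample (1000 draws from a normal distribution), the seed, the colours and the object names are all illustrative choices, and `after_stat()` requires ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Random sample to work with (distribution and parameters are assumptions)
set.seed(123)
sample_data <- data.frame(value = rnorm(1000, mean = 50, sd = 10))

# Regular histogram
p_hist <- ggplot(sample_data, aes(x = value)) +
  geom_histogram(bins = 30, fill = "steelblue", colour = "white")

# Where is the mean? Add a dashed vertical line at the sample mean
p_mean <- p_hist +
  geom_vline(xintercept = mean(sample_data$value),
             linetype = "dashed", colour = "red")

# Density plot
p_density <- ggplot(sample_data, aes(x = value)) +
  geom_density(fill = "steelblue", alpha = 0.4)

# Overlay of density and histogram: the histogram is drawn on the
# density scale so that both layers share the same y axis
p_overlay <- ggplot(sample_data, aes(x = value)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "grey80", colour = "white") +
  geom_density(colour = "red")

print(p_hist)
print(p_mean)
print(p_density)
print(p_overlay)
```

Drawing the histogram on the density scale is what makes the overlay meaningful; with raw counts on the y axis, the density curve would look flat next to the bars.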