In 2016, there are many options when it comes to platforms for conducting analytical projects. There are Python, R, SAS, and SPSS for statistical work. You can use a multitude of SQL flavors for databases (mySQL, SQLite, SQL Server, etc.), and to distribute the analysis to the masses, the number and types of reporting systems are too many to list and too varied to represent with any examples. However, the plethora and flexibility of solution available shouldn't make you believe that there are no wrong answers. Many analytics professionals, especially folks who are focused on pure mathematics or statistics, believe platform agnosticism is next to godliness.
"Many analytics professionals ... believe platform agnosticism is next to godliness."
While it isn't fair to say that choosing one of these platforms over another will actually stop you from conducting one or more kinds of analysis, the choice will have a profound impact on how easy certain types of analysis will be to conduct, how scalable your analytical projects will be over time, and how the results can be communicated through your organization.
For the sake of transparency (if, by chance, you missed the title), I am a huge supporter of R and Microsoft SQL Server. There are a few reasons I prefer this duo:
mro and r: a sequel that doesn't suck
A few years ago, Microsoft released their own version of R, which enables parallel execution and adds enhanced matrix algebra utilities from Intel, but, most importantly, they left everything else alone. MRO (Microsoft R Open) and the Intel Math Kernel Library (MKL) greatly enhance performance of matrix algebra on a single core, and add to this advantage when more cores are available.
Shiny: Get ready to look like a hero
Unlike other analysis-centric programming languages, R has Shiny, which lets you deploy applications to people who need customized results. Python has a few users working on something sort of similar, but it is in a fledgling stage by comparison.
This is the death of BI tools. Tableau is down on its knees and just heard the hammer click back. I'll be dedicating a post to exactly why BI is doomed anyway, but Shiny is a significant nail in the coffin. By allowing non-technical users to parameterize complex problems and retrieve their own solutions, an analytics group is able to deliver more general solutions instead of targeted reports (which must always be tweaked if they are not entirely ignored).
Shiny application development feels markedly different from standard R programming, but if you can get a meaningful project and a little time, you'll be functionally proficient in a week or two.
This is huge for businesses. There are people who build Tableau reports for a living making $90k/year just to show the results from analytical work. That is an insane use of money when you can just modify the original scripts in the original environment and get a perfectly functional, professional business solution.
You get to steal from smart people
R lets you harness the work of leading academics. Thanks to its popularity in scientific and academic spheres, R has packages that let you employ even the most obscure research techniques simply and effectively. Need a GMM solution to Weighted Least Squares? You're covered in one line of code.
The sqldf package is a trump card if you don't have the time or interest to master the data shuffling idiosyncrasies of a new language. This package allows you to use SQL syntax directly on data frames (the workhorse of object types in R), which greatly decreases learning time while increasing code transparency throughout your analytics group.
One sql to rule them all...
First, I want to say that I am not being (and have never been) paid by Microsoft to endorse their products. Second, I want to say that I am super, super open to that. If anyone from Microsoft is reading, I would genuinely say anything you want for money...Bill Gates fights crime. Apple plastics are made from puppies.
Hit me up.
SQL Server, which uses Microsoft's T-SQL (Transact SQL) language, has several advantages for analytical work (besides being an extremely slick piece of technical goodness under the hood for computer-sciencey reasons). Besides a sweet, new set of features called R Services, which lets you run R inside your database environment.
Could you mess around with C# and make something similar appear to happen with Python or SAS? If one of your chief concerns on a daily basis is the rapidly declining amount of wall space available to continue hanging your honorary degrees and your shed is filling up with patent approvals, Maybe. However, if your main focus is on the analysis and not on systems engineering (or you're a mere mortal), maybe go for the off-the-shelf solution.
Besides R Services, SQL Server has a few other advantages that come up on a daily basis.
without geography, you're nowhere
SQL Server implements Geography Functions and Geography Indexing in a user-friendly way, meaning spatial calculations become infinitely easier. Gone are the days of using 15 libraries to grid out your x-y points while dealing with map projections, 2D/3D conversion, etc.: It's all handled. Spatial statistics and optimization problems that used to be untenable and beautiful maps with analysis layers become manageable when you combine SQL Server's intuitive geography features with R's functionality.
Semantic Search: Another day at the text mine
SQL Server's Semantic Search capabilities are absolutely awesome. Without too much trouble, you'll be writing queries that are terminator-esque. By utilizing thesauruses (Yes, that's the plural - I Googled.) and a host of other nifty text-mining assistance tools, you can write your own code to do feature extraction, score documents on how similar they are, match & extract key phrases, and more. This is way beyond REGEX.
Though not truly text mining on its own, these features provide you a jumping-off point way ahead of where you would start if you were only using your analytical language.
it's (sort of) free
Using SQL Server Express is a great way to bridge the gap between budget and high-end functionality, and it isn't a bait-and-switch. This isn't one of those "This is totally free...I just need your credit card info" deals. There are a couple of restrictions on Express vs. the full distribution, but they really are minor and related to DB size. It's free. Forever. Period.
Also - this usually goes unmentioned - a consequence of this "free" distribution is that people can install it on their personal computers at home, making it easier for people to pick up and practice if they're new to databases, database management systems, or just new to SQL.
Perfection and other things you won't find in this world
Are there weaknesses to this approach? Of course. Depending on how you divide up the work between SQL Server and R, you may run into the fact that R's flow control is hideously slow (the new parallel execution accelerates calculations, not flow control). And those convenient/awesome/magical geography types from SQL Server? They'll cause R to fail if you try to read them in (it doesn't take many records to exceed R's memory allotment).
Developing packages for R can also be frustrating, as it requires an arcane (Rcane?) series of component installations which, due to reliance on the specified PATH and the fact that every installed component seems to have a different default installation destination.
Java is also finicky on R (this comes up more rarely now, but it does come up). Let me save you some heartache: You need to go find the Java distribution that matches your R architecture (64- or 32-bit). The default Java install won't work.
The dynamic duo I recommend has served me incredibly well over the years. I have not, to this day, run into an analytical problem that could not be solved and solved well with these tools, and they both keep getting better. Picking up the ability to employ these solutions in tandem makes you more valuable, your group far more productive, and your work incredible.
Copyright © 2018