Supporting & Validating Big Data Architecture with Scripting Languages
By Stephen Dillon | Schneider Electric | Global Solutions | Engineering Fellow | Data Architect
I have spent the majority of my time in 2015 performing hands-on validation and optimization of architectures that support analytics. Early on, this led me to realize I needed to accomplish a wide variety of associated tasks, including [deep breath…] regularly deploying up to 50 virtual machines in the cloud, deploying large database clusters, creating storage pools, modifying database configurations, creating custom backup solutions, remotely creating performance monitors in parallel across many VMs, generating test data, executing parallel performance benchmark tests, and finally logging the various KPIs associated with all aspects of the cluster and each test instance. Yes, that's a long list!
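To give a flavor of what the first of those tasks looks like in practice, here is a minimal sketch of provisioning a batch of VMs in parallel. It uses today's Azure PowerShell (Az) cmdlet names for illustration; the resource group, image, and size names are hypothetical, and it assumes the Az module is installed and you have already signed in with Connect-AzAccount.

```powershell
# Hypothetical sketch: provision 50 identical VMs for a test cluster.
# Assumes the Az module and an authenticated session (Connect-AzAccount);
# resource group, image, and size names below are illustrative only.
$resourceGroup = 'bigdata-bench-rg'
$location      = 'eastus'

1..50 | ForEach-Object {
    $vmName = 'bench-node-{0:d2}' -f $_
    # -AsJob returns immediately, so all 50 deployments run concurrently
    New-AzVM -ResourceGroupName $resourceGroup `
             -Location $location `
             -Name $vmName `
             -Image 'Ubuntu2204' `
             -Size 'Standard_D4s_v3' `
             -AsJob
}

# Block until every provisioning job finishes, then surface any failures
Get-Job | Wait-Job | Where-Object State -eq 'Failed' | Receive-Job
```

The same loop-plus-job pattern applies to tearing clusters down again, which matters when you are re-provisioning daily.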
Admittedly, each of those tasks already has some form of solution available today across various independent technologies that I'm sure Operations professionals know well. However, as a Data Architect my focus is on data storage and architectures, not administration or DevOps, and I am sure the same can be said for many Data Scientists. This was not an issue initially, when we had a smaller cluster (12 nodes), but eventually the project had been tuned well enough at small scale to justify testing at greater scale (50 nodes).
Thus the routine, often daily, need to re-provision these experimental clusters was daunting if done manually, and learning five separate tools or languages was undesirable and unrealistic. After all, I only needed to accomplish these tasks in a pre-production environment, prove the experimental architectures, develop the initial benchmarks, and then hand over control to others who focus on such things professionally.
I then recalled a book I had read in 2014 entitled “Data Science at the Command Line” [J. Janssens, O’Reilly] and was inspired by some of the related efforts I had made with Perl, Python, and [gulp] ultimately PowerShell. My hope was to leverage some of the same knowledge and skills for supporting Big Data infrastructure and architecture. Thus I set off to find a tool or scripting language, as few as possible, that I could reuse across the aforementioned efforts.
After much initial investigation, a moderate amount of experimentation, and at least a dozen books, I narrowed my choices down to Python and PowerShell (PS) for the DevOps-related tasks, i.e. anything not related to test data generation and performance testing. Based on my research into what others in the community were using, these two scripting languages proved to be the key contenders. I will be completely candid: my company is a Microsoft partner and we use Azure heavily, so PowerShell naturally came up often in my investigation, and our partnership with Microsoft heavily influenced my decision.
If you’re not familiar with PowerShell, it is Microsoft’s automation scripting language. Historically, I tried to steer clear of PS because I viewed it as a sysadmin or DBA tool rather than a scripting language like Python or Perl.
I’ve learned that is no longer accurate, as PS has evolved and now possesses the key features you would expect of a typical scripting language, including advanced functions, workflows, parallel code execution, and even modular code. Furthermore, it was built atop the .NET Framework, allowing one to integrate many of the same libraries found in .NET and Windows. If by chance PS does not do something out of the box, you can import other libraries. So PS showed great promise and offered more than simply VM provisioning capabilities, including the ability to work with disks and storage pools, create performance monitors, and build custom backup solutions. The only real omission at that point was an efficient way to generate a lot of data and execute performance tests in parallel, which required further investigation.
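Two of the features just mentioned, workflows and parallel execution, combine naturally for the remote performance-monitor task. Below is a minimal sketch, not my production script: the workflow name, node names, output path, and counter choices are all illustrative assumptions, and it presumes PowerShell remoting is enabled on the target machines.

```powershell
# Hypothetical sketch: a PowerShell workflow that samples performance
# counters on many cluster nodes concurrently. Node names, counters,
# and the output path are illustrative; assumes remoting is enabled.
workflow Start-ClusterMonitors {
    param([string[]] $Nodes)
    # foreach -parallel fans the loop body out across all nodes at once
    foreach -parallel ($node in $Nodes) {
        InlineScript {
            # Sample CPU and disk-queue counters every 5 seconds for 5 minutes,
            # then persist them to a binary log for later analysis
            Get-Counter -ComputerName $using:node `
                        -Counter '\Processor(_Total)\% Processor Time',
                                 '\PhysicalDisk(_Total)\Current Disk Queue Length' `
                        -SampleInterval 5 -MaxSamples 60 |
                Export-Counter -Path "C:\perf\$using:node.blg"
        }
    }
}

Start-ClusterMonitors -Nodes @('node01', 'node02', 'node03')
```

The workflow engine, rather than hand-rolled jobs, handles the fan-out and resumability here, which is exactly the kind of capability I had not expected from a "sysadmin tool."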
To my surprise, PowerShell has become a fundamental part of my day-to-day efforts to validate and support Big Data solutions and architectures. It’s an evolving technology and, like any other, it is not without its pain points, but the positives have outweighed the negatives for me. I would encourage anyone, especially those working with Azure, to consider PowerShell as more than just a sysadmin tool.
It’s relatively easy to get up to speed on, and with some moderate investment you can begin using the more advanced features of PS to support your Big Data initiatives. If you’re interested in learning more about PowerShell, there is a burgeoning user community to explore and many great learning resources, such as Microsoft’s video tutorial series offered via the Microsoft Virtual Academy.