Supporting & Validating Big Data Architecture with Scripting Languages

By Stephen Dillon | Schneider Electric | Global Solutions | Engineering Fellow | Data Architect

I have spent the majority of my time in 2015 performing hands-on validation and optimization of architectures that support analytics. This led me to realize, early on, a need to accomplish a wide variety of associated tasks including [deep breath…] the regular deployment of up to 50 virtual machines in the Cloud, deploying large database clusters, creating storage pools, modifying database configurations, creating custom backup solutions, remotely creating performance monitors in parallel across many VMs, generating test data, executing parallel performance benchmark tests, and finally logging various KPIs associated with all aspects of the cluster and each test instance. Yes, that’s a long list!
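To give a flavor of the provisioning piece alone, here is a minimal sketch of that kind of batch deployment in PowerShell, written against today’s Az module (New-AzVM); the resource group, location, image, size, and naming scheme are placeholders of my own, not the actual configuration used here:

    # Minimal sketch: stamp out a batch of identical VMs for a test
    # cluster. Resource group, location, image, size, and the naming
    # scheme are all placeholders.
    $cred = Get-Credential    # admin credential reused by every node

    1..50 | ForEach-Object {
        $name = "bench-node-{0:D2}" -f $_    # bench-node-01 .. bench-node-50
        New-AzVM -ResourceGroupName 'BenchRG' `
                 -Name $name `
                 -Location 'EastUS' `
                 -Image 'Win2016Datacenter' `
                 -Size 'Standard_D4s_v3' `
                 -Credential $cred
    }

Even a loop this simple beats clicking through a portal fifty times, and it can be re-run whenever the cluster needs re-provisioning.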

Admittedly, each of those tasks already has some form of solution available across various independent technologies today that I’m sure Operations professionals know well. However, as a Data Architect, my focus is on data storage and architectures, not administration or DevOps, which I am sure can also be said of many Data Scientists. This was not an issue initially, when we had a smaller cluster, i.e. 12 nodes, but eventually the project had been tuned well enough at small scale to justify testing at greater scale, i.e. 50 nodes.
Thus the routine, often daily, need to re-provision these experimental clusters was daunting if done manually, and learning five tools or languages was undesirable and unrealistic. After all, I only needed to accomplish these tasks in a pre-production environment, prove the experimental architectures, develop the initial benchmarks, and then hand over control to others who focus on such things professionally.

I then recalled a book I had read in 2014 entitled “Data Science at the Command Line” [J. Janssens, O’Reilly] and was inspired by some of the related efforts I had made with Perl, Python, and [gulp] ultimately PowerShell. My hope was to leverage some of the same knowledge and skills for supporting Big Data infrastructure and architecture. Thus I set off to find tools or scripting languages that would let me use as few of them as possible and reuse them across the aforementioned efforts.
I investigated a range of technologies including Perl, Python, JavaScript, Go, Scala, and PowerShell, as well as tools such as Chef and Puppet for comparison. Covering each of these is well beyond the scope of this one-page article, but in summary I quickly deemed tools like Chef to be out of scope based on my goals. I required more control and reusability from fewer technologies than a prepackaged tool offered, so I focused instead on scripting languages that could afford greater control and solve more problems.

After much initial investigation, a moderate amount of experimentation, and at least a dozen books, I narrowed my choices down to Python and PowerShell (PS) for the DevOps-related tasks, i.e. anything not related to test data generation and performance testing. These proved to be the two key contenders per my research into what others in the community were using. To be completely candid, my company is a Microsoft partner and we use Azure heavily, so PowerShell naturally came up often in my investigation, and our partnership with Microsoft heavily influenced my decision.

If you’re not familiar with PowerShell, it is Microsoft’s automation scripting language. Historically, I tried to steer clear of PS because I viewed it as a sysadmin or DBA tool rather than a scripting language like Python or Perl.
I’ve learned that is no longer accurate: PS has evolved and now possesses the key features you would expect of a typical scripting language, including advanced functions, workflows, parallel code execution, and even modular code. Furthermore, it is built atop the .NET Framework, allowing one to integrate many of the same libraries found in .NET and Windows, and if by chance PS does not do something out of the box you can import other libraries. So PS showed great promise and offered more than simply VM provisioning capabilities, including the ability to work with disks and storage pools, create performance monitors, and build custom backup solutions. The only real omission at that point was an efficient way to generate a lot of data and execute performance tests in parallel, which required further investigation.
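As one concrete illustration of those parallel features, a workflow can fan a command out across every node at once. The sketch below uses node names and a single CPU counter that are my own stand-ins, sampling performance data from several machines in parallel:

    # Minimal sketch: a PS workflow that samples a CPU counter from
    # many nodes in parallel. Node names and the counter path are
    # illustrative only.
    workflow Get-ClusterCpu {
        param([string[]] $Nodes)
        foreach -parallel ($node in $Nodes) {
            InlineScript {
                Get-Counter -ComputerName $using:node `
                            -Counter '\Processor(_Total)\% Processor Time'
            }
        }
    }

    Get-ClusterCpu -Nodes 'node01','node02','node03'

The same foreach -parallel pattern applies to any per-node task, from formatting disks to registering performance monitors.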

My major challenge for performance tests was not necessarily writing the tests or finding the most “performant” scripting language; I was already doing this years ago with JavaScript against MongoDB clusters. Historically, my pain point was managing the various instances of the JavaScript templates being executed: editing the parameters, such as the number of meters and data points, as well as database-level configurations per test; executing them in parallel; and logging the results. Besides, an initial assessment of PowerShell’s ability to generate test data and serve as the performance scripting language showed it was not as fast as JavaScript or Perl. I did, however, have an opportunity to integrate PowerShell’s ability to execute jobs in parallel with my existing JavaScript performance templates.
With a little effort, I wrote a tag-substitution engine in PS that modified a JavaScript template and instantiated various instances of it with some common settings as well as per-instance information. I was able to utilize the PS workflow functionality to execute the test instances remotely and in parallel, log each instance as a separate test for auditing, and aggregate the KPIs as if they came from one test instance. I cannot emphasize enough how much easier this made my life when executing benchmarks.
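The idea can be sketched roughly as follows; the tag names, file paths, and mongo invocation below are illustrative stand-ins rather than the actual engine:

    # Hypothetical sketch of the tag-substitution idea: stamp out one
    # JS file per test instance from a shared template, then run the
    # instances in parallel as background jobs.
    $template = Get-Content -Raw '.\benchmark.template.js'

    $jobs = 1..10 | ForEach-Object {
        $js = $template -replace '\{\{TEST_ID\}\}', $_ `
                        -replace '\{\{METER_COUNT\}\}', 1000
        $file = ".\benchmark_$_.js"
        Set-Content -Path $file -Value $js

        # each instance executes against the mongo shell in its own job
        Start-Job -ScriptBlock { param($f) & mongo --quiet $f } `
                  -ArgumentList $file
    }

    $jobs | Wait-Job | Receive-Job    # gather each instance's KPI output

Because each instance is an ordinary job, its output can be logged separately for auditing and then aggregated, which is exactly what made the benchmarks manageable.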

Surprisingly to me, PowerShell has become a fundamental part of my day-to-day efforts to validate and support Big Data solutions and architectures. It is an evolving technology and, like any other, it is not without its pain points, but the positives have outweighed the negatives for me. I would encourage anyone, especially those working with Azure, to consider PowerShell as more than just a sysadmin tool.
It’s relatively easy to get up to speed on, and with some moderate investment you can begin using the more advanced features of PS to support your Big Data initiatives. If you’re interested in learning more about PowerShell, there is a burgeoning user community to explore and many great learning resources, such as Microsoft’s video tutorial series offered via the Microsoft Virtual Academy.

Other Articles from Stephen Dillon

Rethinking Contemporary Operations Teams by Stephen Dillon. ODBMS.org

Embracing the evolution of Graphs by Stephen Dillon. ODBMS.org
