Jonathan Ortiz is a data scientist and knowledge engineer at data.world, an Austin startup helping data people solve problems faster by building the most meaningful, collaborative, and abundant data resource in the world. Through a 2016 DataStart grant fellowship, Jonathan worked with the US Census Bureau to make Census data more accessible and easier to combine with other data sets via data.world. The fellowship was managed by RENCI and the South Big Data Hub and supported by the National Science Foundation.
Jonathan is a former student at the University of Texas – Austin Data Analytics and Big Data Program. Prior to his work in data science, Jonathan accumulated extensive experience in media, marketing, and consulting. Most recently, Jonathan served as Director at Scratch, a division of Viacom that drives innovation by channeling the power of brands like MTV, BET, Comedy Central, and Nickelodeon. Jonathan received a Bachelor of Science in Economics with concentrations in Marketing and Management from the Wharton School at the University of Pennsylvania.
Q1. Is domain knowledge necessary for a data scientist?
Strictly speaking? Yes. Practically speaking? No. While it’s true that domain knowledge is necessary to understand the issue at hand, develop hypotheses, and gauge model performance, I would argue that domain knowledge is not necessary from the outset of an analysis. In fact, sometimes it’s better to develop this domain familiarity as one progresses through a project, because when we learn something for the first time we are typically more rigorous in our approach. Data scientists with a lot of domain experience can become lazy and take too much for granted. Plus, I’m a firm believer that one can learn anything by talking to experts, reading books, using the Internet, and looking through historical data projects.
So, to answer the question in its strictest sense, yes, domain knowledge is necessary, but to say that all data scientists must develop a bevy of domain experience before practicing any analytics is to assume one cannot learn by doing.
Q2. What should every data scientist know about machine learning?
Oh, I have a lot of these:
● It’s not a silver bullet
● If you put garbage in, you get garbage out
● Start small and simple
● Have a baseline for comparison
● Align model evaluation with the objective of the prediction (e.g. the business question)
● Keep up-to-date on the latest methods (they’ve already changed as you read this sentence)
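To make the "have a baseline for comparison" point above concrete, here is a minimal Python sketch. It assumes a toy classification task with invented data; the threshold rule stands in for whatever model you would actually train, and all names (majority_baseline, threshold_model, the sample rows) are hypothetical.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a predictor that always guesses the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda features: most_common


def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) equals the true label."""
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(labels)


# Invented toy data: one feature (an amount), label 1 = flagged.
train_rows = [[12.0], [15.5], [980.0], [22.0], [1500.0], [18.0]]
train_labels = [0, 0, 1, 0, 1, 0]
test_rows = [[14.0], [1200.0], [30.0], [875.0]]
test_labels = [0, 1, 0, 1]

baseline = majority_baseline(train_labels)


def threshold_model(features):
    """Hand-picked rule standing in for a real trained model (assumption)."""
    return 1 if features[0] > 500 else 0


baseline_acc = accuracy(baseline, test_rows, test_labels)
model_acc = accuracy(threshold_model, test_rows, test_labels)
print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"model accuracy:    {model_acc:.2f}")
```

If the candidate model cannot clearly beat the majority-class baseline on held-out data, the added complexity is probably not paying for itself.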
Also, many people ask me whether machine learning and automation will do away with the data science profession, and I don’t think so. Machine learning is being “democratized” to the point where most people will be able to use it through some point-and-click UI very soon, and more and more methods will become fully automated in the future, but this is not to say that the role of the data scientist will cease to exist. As with all things that get automated, automation in data science will lighten the load for experts to focus on the more creative work.
Q3. What are the most effective machine learning algorithms?
This really just depends. There is a tradeoff between accuracy and complexity. There are resource issues to consider. There are also categorically different types of problems: text, clustering, classification, regression, time series, reinforcement learning…and these different types may require different algorithms. So, I tend to think the most effective algorithm is the one that surpasses the bar for accuracy while being comparatively easy to explain and cheap to implement with the data and compute resources at one’s disposal.
Q4. What is your experience with data blending?
This is one of the core skills that online courses, bootcamps, and even some college curricula typically don’t prioritize; there is a tendency to focus on data mining and machine learning techniques. But experienced data scientists will tell you that anywhere from 50-80% of their time is spent doing this data janitorial work. For me, that number is even higher–closer to 98%–because my particular role at data.world is more that of a data provider than that of a data user. I make datasets available to our users in clean, linked formats such that users can take them and run with them, creating innovative solutions to the world’s problems. Therefore, my main focus is on data engineering and semantic modeling using RDF technologies.
Semantics and data blending go hand-in-hand, because easily (or automatically) linking data requires uniform vocabularies, and semantic technologies can be used to assign precise meaning to data resources. With semantic modeling, data discovery is easier because correlations are explicit in the data itself. Interoperability increases because we use universal identifiers to denote data resources, and machines can process the data better because the connections are explicitly stated in a standard format – just as web browsers can show you any web page without needing custom code for each one. This is an exciting field, because semantics provide the basis for contextualized data linking, which is integral to machine learning and artificial intelligence.
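As a rough illustration of why universal identifiers make blending easier, the Python sketch below models two datasets as plain (subject, predicate, object) triples, in the spirit of RDF but without any RDF library. The example.org URIs, the property names, and the numeric values are all invented for illustration; this is not how data.world stores or links data.

```python
# Hypothetical identifiers: the example.org namespace is a placeholder.
PLACE_A = "http://example.org/place/A"
POPULATION = "http://example.org/prop/population"
MEDIAN_INCOME = "http://example.org/prop/medianIncome"

# Two independently published datasets that happen to use the same
# identifier for the same place. Values are made up.
population_data = [(PLACE_A, POPULATION, 1000)]
income_data = [(PLACE_A, MEDIAN_INCOME, 50000)]


def merge_graphs(*graphs):
    """Blend datasets by taking the union of their triples."""
    merged = set()
    for graph in graphs:
        merged.update(graph)
    return merged


def describe(graph, subject):
    """Collect every property known about one subject."""
    return {pred: obj for subj, pred, obj in graph if subj == subject}


merged = merge_graphs(population_data, income_data)
print(describe(merged, PLACE_A))
```

Because both datasets name the place with the same identifier, blending is just a set union; no column matching or custom join logic is needed.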
Q5. Can data ingestion be automated?
Not only can it be, but, at this point, it needs to be (at least if you plan to stay competitive)! Automated data ingestion processes enable agile analytics, which is the foundation of an information-based organization. Companies that get this right can spin up and sustain a never-ending feedback loop of data collection, ingestion, measurement, and improvement. These companies scale at alarming rates due to this data-driven feedback loop.
Q6. How do you ensure data quality?
The world is a messy place, and, therefore, so is the web and so is data. No matter what you do, there’s always going to be dirty data lacking attributes entirely, missing values within attributes, and riddled with inaccuracies. The best way to alleviate this is for all data users to track provenance of their data and allow for reproducibility of their analyses and models. The open-source software development philosophy will be co-opted by data scientists as more and more of them collaborate on data projects. By storing source data files, scripts, and models on open platforms, data scientists enable reproducibility of their research and allow others to find issues and offer improvements.
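One lightweight way to track provenance, sketched here in Python under stated assumptions, is to store a content hash and source metadata next to every dataset so that others can confirm they are working from the same bytes. The function names, the record fields, the example URL, and the script name are all hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(data: bytes, source: str, script: str) -> dict:
    """Describe where a dataset came from and fingerprint its exact contents."""
    return {
        "source": source,
        "script": script,
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def verify(data: bytes, record: dict) -> bool:
    """Check that the data still matches the fingerprint in its record."""
    return hashlib.sha256(data).hexdigest() == record["sha256"]


# Invented sample contents standing in for a real source file.
raw = b"account,amount\n1001,25.00\n1002,310.50\n"
record = provenance_record(
    raw,
    source="https://example.org/transactions.csv",  # placeholder URL
    script="clean_transactions.py",  # hypothetical script name
)
print(json.dumps(record, indent=2))
print("unchanged:", verify(raw, record))
```

Committing a record like this alongside the scripts and models lets a collaborator detect silently changed source data before trying to reproduce an analysis.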
Q7. When is data mining a suitable technique to use and when not?
Either of two conditions would rule out its use for me:
1. Not having enough data collected
2. Not having enough attributes in the data
Data cannot tell you what’s not there. For example, let’s say a bank wants to mine their electronic transaction data to determine whether or not charges are fraudulent, but they only have 1 day of transactions and the data only contain an account number, a time of day, and a transaction amount. No amount of data mining will answer this question with the data provided.
Q8. What techniques can you use to leverage interesting metadata?
Linked data! I wrote about this in Q4, but it bears repeating here. The state of the world today is a collection of individual data silos meant to address the needs of the person, organization, or enterprise that stood them up, with little thought given to how their data is related to other data. Linked data is the next-generation data architecture that makes individual data points as addressable and linkable as documents on the web are today. At data.world, we’re helping to bridge the gap from the current state of the world to that future by building a social network that connects folks collaborating on data projects, and automating much of the data linking process.
Q9. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
I think “good” insights are those that are both “relevant” and “correct,” and those are the ones you want to shoot for. As I wrote in Q2, always have a baseline for comparison.
You can do this either by experimenting, where you actually run a controlled test between different options and determine empirically which is the preferred outcome (like when A/B testing or using a Multi-armed Bandit algorithm to determine optimal features on a website), or by comparing predictive models to the current ground truth or projected outcomes from current data.
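For the A/B testing case, a minimal Python sketch of one common way to compare two options is a two-proportion z-test. The visitor and conversion counts below are invented, and the choice of this particular test is an assumption for illustration; a multi-armed bandit would allocate traffic adaptively instead.

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: variant A is the current page (the baseline),
# variant B is the proposed change.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

A small p-value suggests the observed difference is unlikely to be noise, but whether the lift is large enough to matter is still a judgment to make against the business question.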
Also, solicit feedback about your results early and often by showing your customers, clients, and domain experts. Gather as much feedback as you can throughout the process in order to iterate on the models.