Build a data product that matters.
TL;DR: To become maximally effective, all data scientists should know the flow of producing & maintaining a data product to better understand how to fit their piece into the puzzle.
Links to other posts on this series can be found at the bottom of this page (as they are written).
The most important skill of any aspiring data professional is the ability to make kick butt, automated data features (and ultimately, products).
This fact has given rise to what I’ll dub here as the “full stack” data scientist. A full stack data scientist is someone comfortable with building the entire data product “circuit” him/herself. At a high level, this entails a command of comfort in building data pipelines, synthesizing readable code-bases, employing relevant statistical/machine learning paradigms, and interacting with an appropriate database that can empower the delivery of value that you ultimately want to produce. (Perhaps even a touch of front end web development if it fits your problem space 😀)
The truth is, in most positions, you wont have the capacity or expectations to build the entire data product yourself. Large teams of capable statisticians, developers, & engineers make this possible. However while one may eventually become a specialist in a subset of data science over his/her career, very (VERY) few will ever have to conduct all of their work in complete isolation. Nearly every data scientist will work in a team that must co-ordinate their work into complete value-adding services. Thus everyone with hands in a data project should be familiar enough with the core processes to be able to answer general questions like:
- Why does this need to be built in the first place?
- Who are the stakeholders in this service/product/feature?
- How am I getting this data in (& where does it come from)?
- What other services & messaging queues will I need to interact with?
- What language is the best to write part x in?
- Which algorithm will you pick to tackle the problem?
- What data structures play best with my problem?
- What database will be best to persist data to?
- How should I structure my schema?
- How can I improve the quality of the output via statistical insight &&/|| machine learning?
- How performant is my solution, and where might bottlenecks arise?
- How is this going to be deployed, and how can it be most effectively monitored?
- What tests do I need to build to make sure this will hold up against the test of time?
- Is what I’ve written readable?
- Can others easily use, understand, & interact with my code?
- How does this solution scale?
The ability to speak intelligently about all of these prompts with your colleagues will not only make you a more effective developer, but will also inevitably lead to better & more maintainable products/services. You gain the knowledge of how to best fit your piece of the puzzle in with those of the team that may have been built 20 years ago, 20+ years in the future, or any other point in between.
The goal of this series is to go over my approach & methodology to tackling these tricky questions as I have come to understand them. These posts will be loosely tied around my own experiences, word collages (in paragraph form!) of info, tips, tricks, & facts I’ve picked up along my ride, and a case study of building a micro service-based data feature that we are using at Atomata today for work with time series data streams to optimize indoor environments for cannabis cultivation.
I hope the process can be informative at all steps, but feel free to skip around. I’ll be writing to this series over the upcoming weeks/months as time permits, hitting all of the questions posted above.
- Do your research, define your motivation (& eat your vegetables).
- On building a path for a pilgrimage. (Coming soon!)
- Getting cozy with an algorithm.
- Tuning exotic instruments to play pretty music.
- Waterproofing your stack
- Take it live!
I love feedback, so feel free to post any ideas, criticisms, or even requests for additional subjects below.