Preface

At some point, most data scientists want to show their work to others. But the skills and tools needed to deploy data science are completely different from the skills and tools needed to do data science.

If you’re a data scientist who wants to get your work in front of the right people, this book aims to equip you with all the technical things you need to know that aren’t data science.

Hopefully, once you’ve read this book, you’ll understand how to deploy your data science, whether you’re building a DIY deployment system or trying to work with your organization’s IT/DevOps/SysAdmin/SRE group to make that happen.

0.1 Moving Data Science to a Server

In recent years, as data science has become more central to organizations, many have been moving their operations off individual contributors’ laptops and onto centralized servers. Depending on your organization, the centralization of data science operations can make your life way easier – or it can be kind of a bummer.

Server migrations can work well regardless of whether they’re instigated by the data science or the IT organization. The biggest determinant is how well the data science and IT/DevOps teams can collaborate.

Data scientists are good at manipulating and using data, but most have little expertise in SysAdmin work, and aren’t really that interested. On the flip side, IT/DevOps organizations usually don’t really understand data science workflows, the data science development process, or how data scientists use R and Python.

Often, migrations to a server are instigated by the data scientists themselves, usually because they’ve run out of horsepower on their laptops. If you or one of your teammates enjoys SysAdmin work and is good at it, this can be a great situation! You get the hardware you need for your project quickly and with minimal interference.

On the other hand, most data scientists don’t really want to be SysAdmins, and these systems are often fragile, isolated from other corporate systems, and potentially susceptible to security vulnerabilities.

In other organizations, the move to servers is led by the IT group. For many IT groups, it’s way easier to maintain a centralized server environment than to help each data scientist maintain their own environment on their laptop.

Having just one platform makes it much easier to give shared access to more powerful computing platforms, to data sources that require some configuration, and to R and Python packages that wrap around system libraries and can be a pain to configure (looking at you, rJava).

This can be a great situation for data scientists! If the platform is well-configured and scoped, you get instant access to more compute resources through your web browser. You don’t have to worry about maintaining local installations of data science tools like R, Python, RStudio, and Jupyter, or about how to connect to important data sources. Those things are just available for use.

But this can also be a bad experience. Long wait times for hardware or software updates, overly restrictive policies – especially around package management – and misunderstandings of what data scientists are trying to do on the platforms can lead to servers going largely unused.

Whether the server-based experience is good depends largely on the relationship between the data science and IT/Admin groups. In organizations where these groups work together smoothly, a server can be a huge win for everyone involved. However, there are some organizations where IT/Admins are so concerned with stability and security that they make it impossible to do data science, and the data scientists spend all their time playing cat-and-mouse games to try to get work done behind IT/Admin’s backs.

If you work at such a place, it’s frankly hard to get much done on the server. It’s probably worth investing some time in improving your relationship with your favorite person on the IT/Admin team. Hopefully, this book will help you understand a little of what’s on the minds of people in the IT group, and give you a sense of how to talk to them better.

Software information and conventions

I used the knitr package (Xie 2015) and the bookdown package (Xie 2021) to compile my book. My R session information is shown below:

xfun::session_info()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Locale:
##   LC_CTYPE=C.UTF-8       LC_NUMERIC=C          
##   LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8    
##   LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##   LC_PAPER=C.UTF-8       LC_NAME=C             
##   LC_ADDRESS=C           LC_TELEPHONE=C        
##   LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## Package version:
##   base64enc_0.1.3 bookdown_0.24.4 brio_1.1.2     
##   bslib_0.3.1     cachem_1.0.6    compiler_4.1.2 
##   crayon_1.4.2    desc_1.4.0      digest_0.6.25  
##   downlit_0.4.0   ellipsis_0.3.2  evaluate_0.14  
##   fansi_0.5.0     fastmap_1.1.0   fs_1.5.0       
##   glue_1.4.2      graphics_4.1.2  grDevices_4.1.2
##   highr_0.8       htmltools_0.5.2 jquerylib_0.1.3
##   jsonlite_1.7.1  knitr_1.36      magrittr_1.5   
##   memoise_2.0.0   methods_4.1.2   R6_2.5.1       
##   rappdirs_0.3.3  rlang_0.4.10    rmarkdown_2.11 
##   rprojroot_2.0.2 sass_0.4.0      stats_4.1.2    
##   stringi_1.4.6   stringr_1.4.0   tinytex_0.35   
##   tools_4.1.2     utils_4.1.2     vctrs_0.3.8    
##   xfun_0.28       xml2_1.3.2      yaml_2.2.1

Package names are in bold text (e.g., rmarkdown), and inline code and filenames are formatted in a typewriter font (e.g., knitr::knit('foo.Rmd')). Function names are followed by parentheses (e.g., bookdown::render_book()).

Acknowledgments

A lot of people are helping me write this book.

This book is published to the web using GitHub Actions from rOpenSci.