An (im)perfect romance: Amazon SageMaker and Jupyter Notebooks

In my role as a data engineer, I’ve been working with Amazon SageMaker and the Jupyter Notebook app almost every day since I joined the company — mostly coding in pandas to perform data transformations and manipulation, or using SPARQL to query our AWS Neptune graph databases.
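To give a flavour of the kind of pandas tidy-up I mean, here’s a toy sketch — the dataset, column names and values are all made up for illustration:

```python
import pandas as pd

# Hypothetical orders extract: amounts arrive as strings, countries in mixed case.
orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "amount": ["12.50", "7.00", "3.25"],
    "country": ["uk", "UK", "de"],
})

# Normalise types and values, then aggregate per country.
orders["amount"] = orders["amount"].astype(float)
orders["country"] = orders["country"].str.upper()
totals = orders.groupby("country", as_index=False)["amount"].sum()
```

In a notebook, ending a cell with `totals` would render the resulting frame inline — which is exactly the quick feedback loop that makes this style of work pleasant.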

Early on, I was tasked with presenting on “What is Jupyter Notebooks and why do you use it?” to our Analytics team in an internal showcase session we run to get to know one another’s work, processes and projects better.

In doing so, I delved into an unexpected rabbit hole of debate on Notebooks and SageMaker: why you should use them together, why you shouldn’t use them at all, what you can and can’t do with them, and why that’s good or bad.

If you work in data science, engineering or machine learning, you’ll likely have some interaction with Jupyter Notebooks in your job (if not every day like me!). So I thought it would be useful to explain what they are, why they’re useful (from my perspective in data engineering) and the main debates you may encounter.

What is Amazon SageMaker?

SageMaker is a cloud machine-learning platform that was launched by Amazon to help developers, data scientists and data engineers process and transform data. SageMaker also allows you to create, train, and deploy machine-learning models in the cloud.

Why is it useful?

In short, SageMaker is useful because it allows you to visualise, process, clean and transform data into your required forms using the traditional methods you would normally use (in Python), but more efficiently and safely than if you were to run everything on your local machine. It also runs models and provides a single platform to get to production faster with less effort (and at lower cost).

As an engineer, you get a scalable and reliable means of running data processing workloads. You’re able to connect to existing storage (note: your data must be stored in an Amazon S3 bucket), run your projects efficiently, and save the output to your preferred storage. SageMaker also provides logs and metrics, which are very useful when working in team structures with multiple contributors at different points in the pipeline.
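The connect–transform–save loop above can be sketched from inside a notebook. This is only a sketch under assumptions: the bucket name and keys are hypothetical, and it relies on the fact that a SageMaker notebook instance’s IAM role supplies the S3 credentials, so pandas can read and write `s3://` paths directly (via s3fs):

```python
# Hypothetical bucket and keys, purely for illustration.
def s3_path(bucket: str, *parts: str) -> str:
    """Build an s3:// URI from a bucket name and key parts."""
    return "s3://" + "/".join([bucket, *parts])

raw_path = s3_path("my-team-bucket", "raw", "events.csv")
out_path = s3_path("my-team-bucket", "processed", "events_clean.csv")

# Inside a SageMaker notebook, the workflow then looks like:
# import pandas as pd
# df = pd.read_csv(raw_path)            # connect to existing storage
# df = df.dropna(subset=["user_id"])    # ...your transformations here...
# df.to_csv(out_path, index=False)      # save output to your preferred storage
```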

Importantly, it’s well integrated with other AWS products and services – meaning you don’t have to give external access to your virtual private cloud (VPC) when running SageMaker, as it’s already internal. This can be particularly useful when running data science and engineering projects, since you don’t expose your database externally, which helps keep your projects secure. It also has various other security features which your team can enable as part of AWS’s shared responsibility model, which you can find out more about here.

However, SageMaker doesn’t often like to work alone, and requires a friend to help us navigate it well:

Jupyter Notebooks and JupyterLab

Jupyter Notebook (notebooks) and JupyterLab are open-source web applications created by Jupyter (or Project Jupyter) where you can create and share documents that contain live code, equations, visualisations and narrative text.

The most common uses include:

  • 🧼 data cleaning and transformation
  • 📈 data visualisation
  • 🧮 numerical simulation
  • 🔮 statistical modelling
  • 🤖 machine learning

What is a notebook?

Essentially, a notebook is a document that includes code that you will be using for your project alongside its output. This can include visualisations (e.g. graphs or database extracts), narrative text (e.g. instructions), mathematical equations, and more.

It is different from a typical coding file as it runs a series of mutually-independent cells. This can be useful when testing and applying processing tasks or machine learning models, as you are able to test each step as you go or (more commonly) edit and refine each step without having to run the full document. It helps make your daily tasks more efficient to run, and allows others to test and check your work without deleting (or even commenting out) your code.

It’s worth noting that while I’m talking about Jupyter’s use with Amazon SageMaker, you can run notebooks locally and with other platforms and cloud providers.

What is JupyterLab?

JupyterLab is Jupyter’s next-generation web app, which includes notebooks but allows you to open several notebooks and files at once (including HTML, text, .csv and other files data professionals interact with often). You open all these files in a single tab in your browser, whereas notebooks open separately. There’s a great resource explaining the differences (and similarities) between notebooks and JupyterLab here.

Why use it with SageMaker?

Well, first of all, it’s the interface AWS recommends you use with it:

Taken from AWS ML for startups presentation by Allie K. Miller and Sean Wilkinson – 2021

There’s loads of great information about why Amazon adopted Jupyter as SageMaker’s primary interface, and you can find out more about this (in much better words than mine!) and walk-throughs on how to start your SageMaker Notebook instance here.

Lastly, Jupyter’s apps are extensible and flexible, and support various workflows within data science and scientific computing. They also support over 40 programming languages, including Python, R, Julia, and Scala.

Why do I like it?

As a relative newbie to professional engineering, well-documented code is a huge help. Notebooks enable you to intersperse markdown with code blocks to help explain what the steps are or remind your future self why you set up the notebook in this way.

As you can imagine, I query databases a lot and use notebooks to create various data frames, transformations and files which are entered into our processing and ingestion pipelines at various stages. Output previews in notebooks mean I’m able to have a quick look at what I’m creating as I go. I find this particularly useful when querying from our graph using SPARQL – a language I hadn’t heard of until six months ago – so I made many an error or returned nothing in my first few goes.
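For the curious, here’s roughly what that looks like. This is a sketch under assumptions — the endpoint URL and query are hypothetical — relying on the fact that Neptune exposes SPARQL over plain HTTP, so a standard HTTP client is enough before the results land in pandas:

```python
# Hypothetical Neptune SPARQL endpoint (Neptune serves SPARQL on port 8182).
NEPTUNE_SPARQL = "https://my-neptune-cluster:8182/sparql"

query = """
SELECT ?product ?label
WHERE {
  ?product <http://www.w3.org/2000/01/rdf-schema#label> ?label .
}
LIMIT 10
"""

# From a notebook with network access to the cluster's VPC, you could then run:
# import requests, pandas as pd
# resp = requests.post(NEPTUNE_SPARQL, data={"query": query})
# rows = resp.json()["results"]["bindings"]
# df = pd.DataFrame([{k: v["value"] for k, v in row.items()} for row in rows])
# df.head()  # the inline preview that saves me from silent empty results
```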

However, Jupyter isn’t always the perfect partner…

You can’t have multiple people working in the same notebook at the same time. If you want someone to edit or check your work, or if you’re about to run a notebook, you must check others have shut down their notebook and saved any changes, to avoid overwriting their work. You are able to save checkpoints (.ipynb_checkpoints), which let you revert to earlier versions of the file, but it’s useful to be aware of this when using notebooks within teams.

Whilst the mutually-independent code blocks can be very useful in editing and testing, it’s important to remember that if the cells haven’t been run (and saved) in order, you’ll be brought into error-central when you try to run the notebook top to bottom next time. Notebooks also don’t show you when a code block has been edited: the execution counter (e.g. the [1] next to your code block) and the output won’t change until it’s run again, and nothing in the UI flags that the text or code has changed.
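The out-of-order pitfall is easy to demonstrate outside a notebook. In this minimal sketch each “cell” is a function sharing state, just as notebook cells share the kernel’s namespace — run them in the wrong order and the second one fails, much like the NameError you’d see in a real notebook:

```python
# Shared notebook-style state: cells all read and write the same namespace.
state = {}

def cell_1():
    state["row_count"] = 100  # pretend this cell loads some data

def cell_2():
    return state["row_count"] * 2  # this cell silently assumes cell_1 already ran

# Running cell 2 first (as a reader opening a saved notebook might)...
try:
    cell_2()
except KeyError:
    ran_out_of_order = True  # ...fails, like a NameError in a fresh kernel

# Running top to bottom works as intended.
cell_1()
result = cell_2()  # 200
```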

There’s loads of debate on this point, with many saying that notebooks can be difficult for code beginners for this reason – though I haven’t found this to be too much of a problem.

I hope this article inspires you to give Amazon SageMaker and Jupyter Notebook App a go for your data science projects! Some really useful links I’ve used to get started with (and form some of these opinions on) Amazon SageMaker and Jupyter Notebooks can be found below:

If you enjoyed this post – please let me know! I’ll be trialling a few different introductions to AWS (and other) tools, and it would be great to hear what you want to see more of 📧 contact@rkulidzan.com and hit the socials below to follow my #60DaysOfCloud journey for more open learning.
