Lab 1 Discussion - Coding for ML Basics
1 Overall Lab logistics
The labs will consist of practical challenges related to the course material. Typical challenges will span more than one week; the first lab is an exception. Following the course policy, 15% of the grade will be based on attendance, and another 15% on the lab submissions, graded by completion. You are allowed 4 emergency lab attendance drops and 1 lab submission drop. If you have DSP accommodations, extended rules will apply.
The lab assignments will be mostly based on Python, so we expect you to know how to write Python code.
2 Coding for ML
That said, coding for machine learning requires some specialized considerations.
This lab is intended as a warm-up exercise and also to provide some tips regarding coding for machine learning.
3 Python
Python is the standard language for machine learning. Some of the key strengths of Python are:
3.1 Readability and simplicity
It has a very readable and easy-to-learn syntax. This makes collaboration and code reuse easier. It also makes it easier to develop new packages and build on top of others' code.
3.2 High-level language
Unlike “low-level” languages, which give you control over fine details of how your code will be run (e.g. C), Python takes care of that for you by default. This is what makes the first item possible.
However, this comes at a cost which you should always keep in mind: in terms of speed and resource allocation, Python can be very wasteful. This is one of the reasons why certain deployments may require you to move to a low-level language like C.
An important note, however, is that this issue has been partially mitigated by specialized packages. NumPy, PyTorch, JAX, etc. are examples of packages that not only offer ready-made implementations, but also take great care with the details of how things are run in order to gain efficiency. Since the kinds of computations relevant in ML are mostly contained in these packages, this reduces the need to explicitly migrate to low-level languages.
3.3 Popularity
Due to the first point, and the fact that a dedicated community of programmers worked hard to mitigate the performance issues, Python has become very popular. This is a big strength. It means that for almost anything related to machine learning and adjacent fields, you should expect someone to have already implemented a package that does what you want. The popularity also ensures that these packages are regularly maintained. Some examples are:
- librosa for audio-processing methods.
- OpenCV for image processing.
- sympy for symbolic mathematics.
- pytorch for methods which require automatic differentiation.
- sklearn for machine learning methods.
- pandas for data analysis.
- geopandas for pre-acquired geographical data.
And of course, many more smaller packages for even more niche uses.
3.4 Conclusion
Machine learning is a field that requires a mix of quick individual experimentation, building on strangers' code, and high-performance numerical computation. Over time, Python has developed into a language that is well-rounded on all these requirements.
4 Coding environments
We highly recommend using coding IDEs for machine learning, such as VS Code and its forks. The reason is that machine learning usually requires working with both Python scripts and Jupyter notebooks, and a coding IDE provides an easily accessible platform where you can juggle between the two.
But other environments are also possible, as long as this last point is addressed.
If you only need notebooks, you might consider using platforms such as Google Colab. Although we will mostly deal with notebooks during the course, we highly recommend working in an IDE, as you might need to use scripts in your future ML applications.
5 Notebooks vs. Scripts
Machine learning applications overall use both notebooks and scripts. Each has different strengths and weaknesses that you should keep in mind, and that you will also learn over time. Here are some differences:
5.0.1 Notebooks
Strengths
- Easier to do quick and flexible experimentation.
- Easier to communicate and share with collaborators.
- Easier to debug and less high-level planning required.
Weaknesses
- Worse to run long duration experiments.
- Difficult to structure for re-usage and performance. Less modular.
- Harder to do version control and build on top of, or share with others to build on top of.
5.0.2 Scripts
Strengths
- Easier to setup long and complex experiments.
- Easier to modularize, re-use, and plan how things will be executed and organized.
- Easier to do version control and share to build on top of.
Weaknesses
- Slower development cycle. Small tweaks require planning ahead.
- Harder to communicate and present results to collaborators.
- Debugging can be annoying. Requires some high-level planning and care to mitigate bugs.
As a rule of thumb, initial experimentation is usually done in notebooks. If you have a simple idea for a method and just want to check whether it is worth investing time in, do a notebook first. Over cycles of experimentation, trends will usually emerge regarding which things stand out as more promising. You should then consider turning those into scripts. Moreover, full-fledged machine learning pipelines of model training and evaluation are usually written as scripts. Think of notebooks as useful for drafts, prototypes, and presentations; scripts for the rest.
You ideally want to use both. Using only notebooks makes everything messy and feel like an unending collage of drafts. Using only scripts makes it very painful and costly to try out new things and present them.
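To make the transition from notebook to script concrete, here is a minimal sketch of what a hypothetical experiment script might look like. The argument names and the toy "experiment" are made up for illustration; the point is the structure: a reusable function, a command-line entry point, and a `__main__` guard.

```python
import argparse

def run_experiment(lr: float, epochs: int) -> dict:
    # Placeholder for a real training loop; here we just compute a toy
    # "loss" so the script is runnable end to end (hypothetical logic).
    loss = 1.0 / (1.0 + lr * epochs)
    return {"lr": lr, "epochs": epochs, "final_loss": loss}

def main(argv=None):
    # Command-line interface: lets you launch variations without editing code.
    parser = argparse.ArgumentParser(description="Toy experiment script")
    parser.add_argument("--lr", type=float, default=0.1)
    parser.add_argument("--epochs", type=int, default=10)
    args = parser.parse_args(argv)
    result = run_experiment(args.lr, args.epochs)
    print(result)
    return result

if __name__ == "__main__":
    main()
```

A script like this can be launched as `python experiment.py --lr 0.5 --epochs 4`, left to run for hours, and committed to version control, none of which is comfortable with a notebook.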
6 Structuring your code
Coding in machine learning is modular in a very specific way. Unlike certain software applications where you have a nearly fully-developed idea of what you will need and can do top-down planning, in machine learning there is a lot of adaptive experimentation. You usually try different variations of the same idea and keep what works. So it is important to strike a balance between doing things from scratch and seeing trends in your usage, to avoid repeating yourself too much.
As a rule of thumb, whenever you are starting to implement a method, try to think a couple of steps ahead (e.g. if two methods I want to try require the same custom mathematical operation, implement that operation once first), but do not try to think too far ahead, as that will be wasteful (e.g. creating a method right away with 10 different parameters corresponding to all conceivable variations you can think of, even if most of them seem unpromising).
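As a small illustration of this kind of look-ahead, suppose two method variants you want to try both need a numerically stable softmax. Implementing it once and sharing it avoids duplication without over-engineering. The method names and the temperature variant here are hypothetical, purely for illustration:

```python
import numpy as np

def softmax(z):
    # Shared building block, implemented once and reused by both variants.
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def method_a(scores):
    # Variant A: plain softmax weighting (illustrative).
    return softmax(scores)

def method_b(scores, temperature=0.5):
    # Variant B: temperature-scaled softmax (illustrative).
    return softmax(scores / temperature)
```

If a third variant later needs the same operation, it reuses `softmax` too; if no variant survives, only one helper was written, not ten.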
Usually the necessary long-term patterns reveal themselves as you experiment. Worrying too much about them beforehand will just make you slower.
7 Testing
Often in machine learning, your method will underperform. As methods become more complicated, the number of possible reasons behind the underperformance grows. Ideally you want to write your code in such a way that you can rule out the underperformance being due to some bug in the code, especially in order to apply the ideas from the course: you should be able to trust that the math that is implemented is the math you are using to reason theoretically about the model's failure. It can be very frustrating to elaborate a sophisticated, theoretically grounded hypothesis as to why the method is not working, only to find out it was actually just a bug in the code.
With this in mind, try to be preventive. It is better to implement things slowly, making sure each part is doing what you think it is doing, than to rush. This will save you time once you need to improve your method. Regularly testing intermediary aspects of your implementation is good practice: checking whether the input and output dimensions of a function correspond to what you think they are, devising simple cases where you know what a key function should output and checking that it outputs the right thing, or trying your method first on a simple problem whose behavior you understand before moving to a complicated one.
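For instance, a few such intermediate checks might look like this. The `normalize_rows` function is a made-up example; the checks verify the shape is preserved and test a case whose answer is known by hand (a 3-4-5 triangle):

```python
import numpy as np

def normalize_rows(X):
    """Scale each row of X to unit Euclidean norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / norms

# Sanity checks on inputs where we know the expected output.
X = np.array([[3.0, 4.0], [0.0, 2.0]])
Y = normalize_rows(X)
assert Y.shape == X.shape                           # dimensions preserved
assert np.allclose(np.linalg.norm(Y, axis=1), 1.0)  # each row has norm 1
assert np.allclose(Y[0], [0.6, 0.8])                # known 3-4-5 case
```

Checks this cheap can stay in the code (or move into a test file) so that a later refactor cannot silently break the function.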
But, as in the previous section, keep in mind that there is always room to catch errors later on, so do not over-invest in prevention.
8 Performance
As mentioned previously, several Python packages, such as NumPy, are implemented in a way that is optimized for numerical computation. So whenever there is some complex numerical computation you would like to use, prefer the NumPy implementation if one exists. This is not only about saving time by not re-writing the methods from scratch: the NumPy implementation will usually use memory and resources much more efficiently than anything you could write yourself. The same applies to packages such as PyTorch, and to some extent also SciPy and sklearn.
Machine learning involves a lot of operations that need to be done in a ‘batched’ way: the same operation is applied individually to all entries of a vector or matrix. Although one can in principle write these using loops, whenever possible try to write them without loops, using vectorized operations. We will see more about this in the practical part next.
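As a quick preview, here is a sketch of the same batched computation written both ways: a per-row mean computed with a Python loop versus a single vectorized call. Both give the same result, but the vectorized form pushes the loop down into NumPy's compiled code:

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(4, 3)  # 4 "samples", 3 features each

# Loop version: apply the operation to each row individually in Python.
row_means_loop = np.array([X[i].mean() for i in range(X.shape[0])])

# Vectorized version: one call, no Python-level loop.
row_means_vec = X.mean(axis=1)

assert np.allclose(row_means_loop, row_means_vec)
```

On arrays this small the difference is negligible, but on arrays with millions of rows the vectorized version is typically orders of magnitude faster.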
There are some other ways of improving performance which we might not have time to cover, but which you should keep in mind and check out if you have time. Some of these are: pre-compiling the code using wrappers such as jit from JAX, and running code in parallel, which is natively supported by sklearn for certain models.
9 Commenting your code
Readability is one of Python’s strengths, but it is only a strength if used correctly. Python as a language is structured so that your syntax should encode what your code does as directly as possible, requiring fewer comments. For example, instead of writing a code snippet like:
## set up constant pi
x = np.pi
## compute integer part of pi
y = np.floor(x)
one should prefer:
pi = np.pi
floor_pi = np.floor(pi)
Try to use comments to provide high-level summaries of what large chunks of code do, how they relate to other things, and to clarify or note subtle details that are harder to infer from just reading the explicit Python syntax. An important example of this is documenting functions:
def product(a: float, b: float) -> float:
"""
Compute the product of two real numbers a and b.
Parameters
----------
a: float corresponding to first term of product.
b: float corresponding to second term of product.
Returns
------
c: float corresponding to product of a and b
"""
return a*b
In this regard, LLMs are your allies. They are usually very good at writing these kinds of documentation.
9.1 A concrete example
Take a moment to check this excerpt from the PyTorch package, and see how the concepts mentioned are employed.
10 Classes and methods
Classes are an aspect of Python coding that often shows up in machine learning packages. The simplest example is models in sklearn. You can always implement a machine learning model as a collection of functions that you call individually, but often it is more convenient to wrap everything up using classes. Since classes are a more advanced concept, you might not be as familiar with them as with other concepts in Python. It is worth revising them, for example here.
Deciding whether or not to wrap a model into a class involves considerations similar to the previous ones: it depends on context. Classes make your code more organized, but less flexible, and so on.
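To make the pattern concrete, here is a sketch of a toy model wrapped in a class that mimics the sklearn fit/predict style. `MeanClassifier` is a made-up name, not an actual sklearn estimator; it classifies points by the nearest class mean, purely for illustration:

```python
import numpy as np

class MeanClassifier:
    """Toy model in the sklearn style: fit() then predict().
    (Hypothetical illustration, not part of sklearn.)"""

    def fit(self, X, y):
        # Learn and store the mean of each class's training points.
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self  # returning self allows MeanClassifier().fit(X, y) chaining

    def predict(self, X):
        # Assign each point to the class with the nearest learned mean.
        dists = np.linalg.norm(X[:, None, :] - self.means_[None, :, :], axis=2)
        return self.classes_[np.argmin(dists, axis=1)]
```

The class bundles the learned state (`means_`) with the methods that use it, so callers interact with one object instead of threading arrays between standalone functions, which is exactly the convenience sklearn's interface provides.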
11 LLM policy for the lab
You may use LLM assistance for specific commands and coding-language syntax. With the exception of this first lab, you may also use it for commenting your code. Do not use it for auto-completion or for writing entire code excerpts, unless otherwise specified.