Appendix - Plotting in Python

Matplotlib Plot A Line (Detailed Guide) - Python Guides: this is the best guide on drawing lines in matplotlib. As an addition, for drawing smooth lines, see: python - Plot smooth line with PyPlot - Stack Overflow.
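As a quick reminder of what the smooth-line approach looks like, here is a minimal sketch using a cubic spline from SciPy. The data points are made up for illustration, and `make_interp_spline` is only one of several ways to smooth a line; treat it as a starting point rather than the method from the linked answer.

```python
import numpy as np
from scipy.interpolate import make_interp_spline
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; drop this in a notebook
import matplotlib.pyplot as plt

# made-up sample points (x must be strictly increasing for the spline)
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([0.0, 0.8, 0.9, 0.1, -0.8, -1.0])

# fit a cubic (k=3) interpolating spline, then evaluate it on a dense grid
spline = make_interp_spline(x, y, k=3)
x_smooth = np.linspace(x.min(), x.max(), 200)
y_smooth = spline(x_smooth)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, y, "o", label="original points")
ax.plot(x_smooth, y_smooth, label="cubic spline")
ax.legend()
```

The spline passes through every original point, so this smooths the look of the line without changing the data.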

From the Matplotlib docs, here's the color guide: Specifying Colors — Matplotlib 3.4.3 documentation. To find the arguments of a specific function, say on axes, search for the full path, like so: "matplotlib.axes.Axes.axhline". In this example, we're looking for the axhline arguments on the axes. For all functions available on an axes object such as ax[0], see the main page: matplotlib.axes — Matplotlib 3.4.3 documentation.
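The same lookup can also be done without leaving Python. A minimal sketch, assuming matplotlib is installed, that lists the arguments of `axhline` from its full path:

```python
import inspect
from matplotlib.axes import Axes

# same full path as in the docs search: matplotlib.axes.Axes.axhline
sig = inspect.signature(Axes.axhline)
param_names = list(sig.parameters)

print(sig)          # the call signature, including default values
print(param_names)  # just the argument names

# help(Axes.axhline) prints the full docstring, if the argument descriptions are needed
```

Note that the defaults for `xmin` and `xmax` are 0 and 1: they are fractions of the axes width, not data values.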

Plotting in a loop

Example code - Plotting in a loop: plotting errors (predicted - actual) after I've already sorted the dataframe by actual value. Sorting is an important step for seeing whether the error, and thus the model, is biased towards high or low values of the target. Another option is to skip the sort and draw a scatter plot of actual vs. predicted values, to check visually for homoscedasticity.

In the code below, I divided the range of the actual target variable into 7 intervals, and I'm plotting 7 subplots, each with one line for the predicted values and one for the actual values. The buckets are those intervals; they are based on the histogram bins of the target variable's distribution, so they are case-dependent. You can use the target max instead of float("inf").

import numpy as np
import matplotlib.pyplot as plt

# pandas_df is already sorted by 'target' and has 'target' and 'prediction' columns
buckets = [pandas_df['target'].min(), 3000, 40_000, 100_000, 320_000, 500_000, 1_000_000, float("inf")]

fig, ax = plt.subplots(nrows=4, ncols=2, figsize=(12, 20))
ax = ax.flatten()  # important step, so you don't have to worry about 2 for-loops for rows and columns

for i, m in enumerate(buckets[:-1]):
    M = buckets[i + 1]  # the next element
    arr = pandas_df[(pandas_df['target'] >= m) & (pandas_df['target'] <= M)]
    ax[i].plot(np.array(arr['target']), c='b', alpha=0.8, label='actual')
    ax[i].plot(np.array(arr['prediction']), c='g', alpha=0.5, label='prediction')
    # wrapping in np.array so that plt ignores the Pandas index and plots rows in their sorted order;
    # otherwise plt uses the original index values as x and the lines come out of order
    ax[i].set_title("target between {} and {}".format(m, M))
    ax[i].legend()

# 7 buckets leave the 8th subplot free; I use it to plot the errors
ax[7].plot(np.array(pandas_df['prediction'] - pandas_df['target']), c='darkred', label='y-hat - y')
ax[7].axhline(y=0, color='skyblue', linestyle='--')
# axhline spans the full width by default; its xmin/xmax are fractions of the axes width (0 to 1), not data values.
# To draw between two data x-values instead, use ax[7].hlines(y=0, xmin=..., xmax=...).
ax[7].set_title("Errors \nwith dashed line at zero")
ax[7].legend()
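The unsorted scatter check mentioned earlier (actual vs. predicted, to eyeball homoscedasticity) can be sketched as follows. The dataframe here is synthetic and `toy_df` is a made-up name; with real data, use the same 'target' and 'prediction' columns as above.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# synthetic stand-in for pandas_df: predictions are the actuals plus constant-variance noise
rng = np.random.default_rng(0)
actual = rng.uniform(0, 100, size=300)
toy_df = pd.DataFrame({"target": actual,
                       "prediction": actual + rng.normal(0, 5, size=300)})

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(toy_df["target"], toy_df["prediction"], s=10, alpha=0.5)

# diagonal reference line: points on it are perfect predictions
lims = [toy_df[["target", "prediction"]].min().min(),
        toy_df[["target", "prediction"]].max().max()]
ax.plot(lims, lims, color="darkred", linestyle="--", label="perfect prediction")

ax.set_xlabel("actual")
ax.set_ylabel("prediction")
ax.legend()
```

If the vertical spread of the points around the diagonal stays roughly the same along the x-axis, the errors look homoscedastic; a funnel shape suggests the error variance changes with the target.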

Density (probability) plot

Seaborn's distribution plots can draw the histogram, the KDE curve (kdeplot), or both, using the deprecated distplot or the new displot introduced in version 0.11. A better solution, with rescaling on the fly, is to use NumPy:

import numpy as np
import matplotlib.pyplot as plt

counts, bins = np.histogram(df['colname'], bins=50)
counts = counts / np.sum(counts)  # rescale the counts so the bar heights sum to 1
plt.hist(x=bins[:-1], bins=bins, weights=counts)  # one weighted point per bin redraws the rescaled histogram
# counts always has one element fewer than bins
# to see the output of np.histogram, test with a toy list like [4, 6, 23, 70, 1, 89, 30]

Wonders of NumPy in plotting: how this code works. I really need to study NumPy in detail; it has magnificent applications and properties you can use. One of them is useful for plotting with matplotlib: check out np.histogram(), which returns two arrays, 1) the data count in each bin of the histogram, and 2) the bin edges. The latter has one more element than the former, of course, because it includes the leftmost and rightmost edges of all bins.
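To see the two outputs concretely, here is a small sketch with the toy list suggested in the code comment above. The choice of 5 bins is arbitrary, just to keep the output short.

```python
import numpy as np

toy = [4, 6, 23, 70, 1, 89, 30]

# 5 equal-width bins between the min (1) and the max (89), i.e. 17.6 wide each
counts, edges = np.histogram(toy, bins=5)
print(counts)  # [3 2 0 1 1]
print(edges)   # 6 edges for 5 bins

# rescale on the fly: the bin counts become probabilities that sum to 1
probs = counts / counts.sum()
print(probs)
```

The rescaled values are what gets passed as `weights` to `plt.hist` in the snippet above.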

Bar plot: two series on the same graph, with each bar of the first right next to the matching bar of the second

This code example is repeated in the "Create Categories/Buckets Manually & KS Test" note in here. In the example below, I also run the Kolmogorov-Smirnov test on the two arrays I want to bar plot, and I show the results on the plot itself. I also format the results to three decimal places.

import matplotlib.pyplot as plt
from scipy import stats

def test_and_plot(arr1, arr2):
    # arr1 and arr2 are pandas Series of bucket counts that share the same index
    arr1_norm = arr1 / arr1.sum()
    arr2_norm = arr2 / arr2.sum()

    statistic, p_value = stats.ks_2samp(arr1_norm, arr2_norm)

    # plotting
    fig, ax = plt.subplots(figsize=(9, 6))
    bar_width = 0.4
    ax.set_title("over the past two years".title())
    fig.suptitle("p-value= {:1.3f}, KS-stat= {:1.3f}".format(p_value, statistic), fontsize=14)

    x = list(range(arr1_norm.shape[0]))  # both are of the same length now, arr1 and arr2
    x_offset = [i + bar_width for i in x]  # shift the second set of bars one bar width to the right

    ax.bar(x, arr1_norm, width=bar_width, label="arr1")
    ax.bar(x_offset, arr2_norm, width=bar_width, label="arr2")
    ax.legend()  # call once, after both sets of bars have their labels

    # fix the x-axis: centre each tick between the two bars of a pair
    ax.set_xticks([i + bar_width / 2 for i in x])
    ax.set_xticklabels(arr1_norm.index.to_list(), rotation=90)  # both arrays have the same index
    ax.set_xlabel("net paid amount".title())
    ax.set_ylabel("Probabilities")

    return None

VERY IMPORTANT NOTE: The two arrays must be of the same length, with the same index, before they go into the function above (scipy's ks_2samp itself accepts samples of different sizes, but the side-by-side bars need one value per shared bucket). If the two arrays are two columns with different values, i.e. two different distributions, then we need to create a new array for each, putting their values into the same "bucket" values, BEFORE proceeding to plot them with the plotting function above. Find how to do that in the "Create Categories/Buckets Manually & KS Test" note in here.

Resources for this example: numpy set ufuncs; numpy combining and reshaping (scroll down to the NumPy concatenate section in the middle of the page); grouped charts example.
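The bucketing step that the note above describes can be sketched as follows. This is not necessarily the method from the "Create Categories/Buckets Manually & KS Test" note; it is a minimal stand-in that assumes equal-width buckets over the combined range, and `to_shared_buckets` and `n_buckets` are made-up names.

```python
import numpy as np
import pandas as pd

def to_shared_buckets(col1, col2, n_buckets=10):
    """Count two columns of different lengths into the same equal-width buckets."""
    col1 = np.asarray(col1, dtype=float)
    col2 = np.asarray(col2, dtype=float)

    # bucket edges cover the combined range, so every value of both columns falls in a bucket
    combined = np.concatenate([col1, col2])
    edges = np.linspace(combined.min(), combined.max(), n_buckets + 1)
    labels = ["{:.1f} to {:.1f}".format(lo, hi) for lo, hi in zip(edges[:-1], edges[1:])]

    # the same edges for both columns give two count arrays of equal length
    counts1, _ = np.histogram(col1, bins=edges)
    counts2, _ = np.histogram(col2, bins=edges)
    return pd.Series(counts1, index=labels), pd.Series(counts2, index=labels)

# synthetic example: two samples of different sizes and different distributions
rng = np.random.default_rng(0)
a = rng.normal(50, 10, size=500)
b = rng.normal(55, 15, size=800)
arr1, arr2 = to_shared_buckets(a, b, n_buckets=8)
print(arr1)
print(arr2)
```

The two returned Series have the same length and the same index, which is what the plotting function above expects for `arr1` and `arr2`.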
