Summary
Streamlit is an open-source Python library that streamlines the creation of web applications for machine learning and data science. Tailored for data scientists and ML engineers, it enables the quick development of interactive, data-driven web interfaces without requiring in-depth web development knowledge. Streamlit integrates well with various data visualization libraries and offers features like real-time updates, interactive widgets, and customization options, making it an excellent choice for showcasing and exploring data insights.
In this notebook, we’ll start with an introduction to Streamlit, covering data loading and visualization, followed by deploying machine learning models. Here are two deployed Streamlit models for customer churn prediction:
This app uses a pre-trained model to predict the likelihood of customer churn: Customer Churn Prediction App
This app provides a complete walkthrough for training a machine learning model for customer churn prediction: End-to-End Churn Model Training App
Python code and data files needed to run this notebook are available via this link.
Streamlit is the fastest way to make data apps. It is an open-source Python library that helps you build web applications to be used for sharing analytical results, building complex interactive experiences, and iterating on top of new machine learning models. On top of that, developing and deploying Streamlit apps is incredibly fast and flexible, often reducing the application development time from days to hours.
Over the past decade, data scientists have become increasingly valuable assets for companies and nonprofits. They assist in making data-driven decisions, enhancing process efficiency, and deploying machine learning models to optimize these decisions on a larger scale. One pain point for data scientists is the process just after they have found a new insight or made a new model. What is the best way to show a dynamic result, a new model, or a complicated piece of analytics to a data scientist’s colleagues? They can send a static visualization, which works in some cases but fails for complicated analyses that build on each other or on anything that requires user input. They can create a Word document (or export their Jupyter notebook as a document) that combines text and visualizations, which also doesn’t incorporate user input and makes reproducible results much harder. Another option still is to build out an entire web application from scratch using a framework such as Flask or Django, and then figure out how to deploy the entire app in a cloud provider.
None of these options are particularly effective. Many are slow, lack user input capabilities, or are suboptimal for the decision-making process crucial to data science.
Streamlit is all about speed and interaction. It is a web application framework that helps you build and develop Python web applications. It has built-in and convenient methods for everything from taking in user inputs like text and dates to showing interactive graphs using the most popular and powerful Python graphing libraries.
Retrieved from Streamlit for Data Science, Second Edition by Tyler Richards.
To run Streamlit apps, you must first install Streamlit using a package manager like pip or brew. The book will guide you on when to use terminal commands and when to write Python scripts, providing clear instructions for both. To install Streamlit, execute the specified code in a terminal.
pip install streamlit
streamlit hello
import streamlit as st
import time
import numpy as np
import warnings
warnings.filterwarnings('ignore')
# Initialize the progress bar and status text in the sidebar
progress_bar = st.sidebar.progress(0)
status_text = st.sidebar.empty()
# Generate initial random data for the line chart
last_rows = np.random.randn(1, 1)
chart = st.line_chart(last_rows)
# Update the line chart in a loop
for i in range(1, 101):
    new_rows = last_rows[-1, :] + np.random.randn(5, 1).cumsum(axis=0)
    status_text.text(f"{i}% Complete")
    chart.add_rows(new_rows)
    progress_bar.progress(i)
    last_rows = new_rows
    time.sleep(0.5)
# Clear the progress bar after completion
progress_bar.empty()
# Add a button to rerun the script
st.button("Re-run")
streamlit run example.py
Make a new folder called stltm_app, and toss in a new file named stltm_app_demo.py.
mkdir stltm_app
cd stltm_app
touch stltm_app_demo.py
touch creates an empty Python file. The touch command is used in Unix-like operating systems (such as Linux and macOS) to create a new, empty file in the current directory. If the file already exists, touch updates the file's "last modified" timestamp to the current date and time.
Streamlit provides unique functions for different types of content, such as text, graphs, pictures, and other media, which serve as building blocks for apps. The function st.write() is one of the first you'll use; it takes a string or various Python objects (like dictionaries) and writes them directly into the web app in the order they are called. As Streamlit processes the Python script, it assigns a sequential slot to each function, making it easy to integrate content into your app simply by calling st.write().
import streamlit as st
st.write('Hello World')
The URL localhost:8501 indicates that the app is hosted locally on your computer via port 8501, meaning it is not accessible on the internet. The hamburger icon at the top right provides additional options when clicked.
We have already discussed the st.write() function for displaying text. Next, st.pyplot() lets us use Matplotlib to create and display graphs in Streamlit. For example, the app below simulates 500 coin flips, repeatedly draws samples of 100 flips with replacement, computes each sample's mean, repeats this 1,000 times, and plots a histogram of the means, producing a bell-shaped distribution.
import streamlit as st
import numpy as np
import matplotlib.pyplot as plt
# Simulate 500 coin flips, then compute 1,000 sample means of 100 flips each
binom_dist = np.random.binomial(1, .5, 500)
list_of_means = []
for i in range(0, 1000):
    list_of_means.append(np.random.choice(binom_dist, 100, replace=True).mean())
fig, ax = plt.subplots()
ax.hist(list_of_means)
st.pyplot(fig)
# Add a button to rerun the script
st.button("Re-run")
Currently, our app only displays visualizations, but most web apps require user input and dynamic content. Streamlit offers many functions for user input, such as st.text_input() for text, st.radio() for radio buttons, and st.number_input() for numeric input, among others. We will explore these throughout the book, starting with numeric input. For example, we can let users decide the probability of heads in a coin flip and use that as input to our binomial distribution. The st.number_input() function takes a label, minimum and maximum values, and a default value.
import streamlit as st
import numpy as np
import matplotlib.pyplot as plt
# User input for the probability of heads
prob_heads = st.number_input('Chance of Coins Landing on Heads', min_value=0.0, max_value=1.0, value=0.5)
# User input for the graph title
graph_title = st.text_input('Graph Title')
# User input for the number of samples to draw from the binomial distribution
num_samples = st.radio('Number of Samples', options=[50, 100, 200], index=1)
# Generate a binomial distribution
binomial_dist = np.random.binomial(n=1, p=prob_heads, size=1000)
# Calculate the means of random samples from the binomial distribution
means_list = [np.random.choice(binomial_dist, size=num_samples, replace=True).mean() for _ in range(1000)]
# Create a histogram of the means
fig, ax = plt.subplots()
fig.set_size_inches(5, 2) # Adjust the figure size
ax.hist(means_list, bins=np.arange(0, 1.1, 0.05), range=[0, 1])
ax.set_title(graph_title)
# Display the plot in Streamlit
st.pyplot(fig)
Our app works, but it lacks some finishing touches. We've discussed the versatility of st.write(), which works with almost any content and should be our default option. Additionally, we can use st.title(), st.header(), st.markdown(), and st.subheader() to format text easily and maintain consistency in larger apps. These functions place text with different font sizes, and st.markdown() accepts Markdown for familiar formatting. Let's try some of these in the following code:
import streamlit as st
import numpy as np
import matplotlib.pyplot as plt
st.title('Illustrating the Central Limit Theorem with Streamlit')
st.subheader('An App by Tyler Richards')
st.write(('This app simulates a thousand coin flips using the chance of heads input below, '
          'and then samples with replacement from that population and plots the histogram of the'
          ' means of the samples in order to illustrate the central limit theorem!'))
# User input for the probability of heads
prob_heads = st.number_input('Chance of Coins Landing on Heads', min_value=0.0, max_value=1.0, value=0.5)
# User input for the graph title
graph_title = st.text_input('Graph Title')
# User input for the number of samples to draw from the binomial distribution
num_samples = st.radio('Number of Samples', options=[50, 100, 200], index=1)
# Generate a binomial distribution
binomial_dist = np.random.binomial(n=1, p=prob_heads, size=1000)
# Calculate the means of random samples from the binomial distribution
means_list = [np.random.choice(binomial_dist, size=num_samples, replace=True).mean() for _ in range(1000)]
# Create a histogram of the means
fig, ax = plt.subplots()
fig.set_size_inches(5, 2) # Adjust the figure size
ax.hist(means_list, bins=np.arange(0, 1.1, 0.05), range=[0, 1])
ax.set_title(graph_title)
# Display the plot in Streamlit
st.pyplot(fig)
# Add a button to rerun the script
st.button("Re-run")
We can find various real datasets on the UCI Machine Learning repository, which are preprocessed and ready for use with Machine Learning algorithms. I found the Energy Efficiency Data Set particularly useful. This dataset includes energy analyses of 768 simulated building shapes based on 8 features such as Wall Area, Overall Height, Glazing Area, and Orientation, aimed at predicting Heating Load and Cooling Load. The work, published by Tsanas and Xifara in 2012 in the Energy and Buildings Journal, can be used for both regression and classification tasks. In this lecture, we will focus on binary classification of Heating Load, which measures the heating required to maintain indoor temperature at set levels. I have added two columns to the dataset, dividing Heating Load into binary and multiclass categories. Let's examine the dataset.
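Since two label columns were added to the dataset, here is a hedged sketch of how such columns could be derived from Heating Load. The column names 'Binary Classes' and 'Multi Classes' and the median/quartile thresholds are illustrative assumptions, not necessarily the ones actually used:
import pandas as pd

df = pd.read_csv('energyeff.csv')

# Hypothetical binary label: split Heating Load at its median
df['Binary Classes'] = (df['Heating Load'] > df['Heating Load'].median()).astype(int)

# Hypothetical multiclass label: bin Heating Load into four quartiles (0-3)
df['Multi Classes'] = pd.qcut(df['Heating Load'], q=4, labels=False)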
mkdir energy_eff
cd energy_eff
touch energyeff.py
In the last section, we learned about a Streamlit input called st.number_input(). This won’t help us here, but Streamlit has a very similar function called st.selectbox(), which asks the user to pick one option from several and returns whatever the user selects. We will use it to get the three inputs for our scatterplot:
import streamlit as st
import pandas as pd
import altair as alt
st.title("Energy Efficiency Dataset")
st.markdown('Use this Streamlit app to create scatterplots between different attributes!')
st.markdown('Here is the head of the dataset:')
# Load data
df_energy_eff = pd.read_csv('energyeff.csv')
# Display the first few rows of the dataset
st.write(df_energy_eff.head())
# Dropdown for selecting orientation
selected_orientation = st.selectbox('Select an Orientation to visualize:', [2, 3, 4, 5])
# Dropdown for selecting the x variable
selected_x_var = st.selectbox('Choose the x variable:',
['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area', 'Overall Height',
'Glazing Area', 'Glazing Area Distribution', 'Heating Load'])
# Dropdown for selecting the y variable
selected_y_var = st.selectbox('Choose the y variable:',
['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area', 'Overall Height',
'Glazing Area', 'Glazing Area Distribution', 'Heating Load'])
# Filter the dataset based on the selected orientation
df_filtered = df_energy_eff[df_energy_eff['Orientation'] == selected_orientation]
# Create and display the Altair scatterplot
scatterplot = alt.Chart(
df_filtered, title=f"Scatterplot of Orientation {selected_orientation} for Energy Efficiency"
).mark_circle().encode(
x=selected_x_var,
y=selected_y_var
)
st.altair_chart(scatterplot)
# Add a button to rerun the script
st.button("Re-run")
This looks great, but we can make a few more improvements. Currently, we can't zoom into our chart, leaving much of it blank. We can address this by either adjusting the axes in Altair or making the Altair chart interactive, allowing users to zoom in on any part of the graph. The following code makes the Altair chart zoomable and extends the graph to fit the entire screen using the use_container_width parameter:
import streamlit as st
import pandas as pd
import altair as alt
st.title("Energy Efficiency Dataset")
st.markdown('Use this Streamlit app to create scatterplots between different attributes!')
st.markdown('Here is the head of the dataset:')
# Load data
df_energy_eff = pd.read_csv('energyeff.csv')
# Display the first few rows of the dataset
st.write(df_energy_eff.head())
# Dropdown for selecting orientation
selected_orientation = st.selectbox('Select an Orientation to visualize:', [2, 3, 4, 5])
# Dropdown for selecting the x variable
selected_x_var = st.selectbox('Choose the x variable:',
['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area', 'Overall Height',
'Glazing Area', 'Glazing Area Distribution', 'Heating Load'])
# Dropdown for selecting the y variable
selected_y_var = st.selectbox('Choose the y variable:',
['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area', 'Overall Height',
'Glazing Area', 'Glazing Area Distribution', 'Heating Load'])
# Filter the dataset based on the selected orientation
df_filtered = df_energy_eff[df_energy_eff['Orientation'] == selected_orientation]
# Create and display the Altair scatterplot
scatterplot = (alt.Chart(
df_filtered, title=f"Scatterplot of Orientation {selected_orientation} for Energy Efficiency"
).mark_circle().encode(
x=selected_x_var,
y=selected_y_var,
color="Heating Load"
).interactive()
)
st.altair_chart(scatterplot, use_container_width=True)
# Add a button to rerun the script
st.button("Re-run")
The final step for this app is to allow users to upload their own data. This enables the research team to upload data at any time and view the results, or for multiple research groups with different data and column names to use a common method. We’ll tackle this step-by-step, starting with accepting data from users.
Streamlit's st.file_uploader() function allows users to upload files up to 200 MB by default. Unlike other interactive widgets, st.file_uploader() defaults to None until the user interacts with it, as there can't be a pre-existing default file.
This introduces an important concept in Streamlit development: flow control. Flow control involves carefully planning each step of the application. Without explicit instructions, Streamlit will attempt to run the entire app at once, so we need to ensure that the app waits for a user to upload a file before creating a graphic or manipulating a DataFrame.
As we discussed earlier, there are two solutions to the default data upload situation. We can either provide a default file to use until the user uploads their own, or we can pause the app until a file is uploaded. Let's start with the first option. The following code uses the st.file_uploader() function within an if statement. If the user uploads a file, the app uses that file; otherwise, it defaults to the pre-existing file we have been using:
import streamlit as st
import pandas as pd
import altair as alt
st.title("Energy Efficiency Dataset")
st.markdown('Use this Streamlit app to create scatterplots between attributes!')
df_energy_file = st.file_uploader("Select Local Energy Efficiency CSV (default provided)")
if df_energy_file is not None:
    df_energy_eff = pd.read_csv(df_energy_file)
else:
    df_energy_eff = pd.read_csv('energyeff.csv')
# Display the first few rows of the dataset
st.write(df_energy_eff.head())
# Dropdown for selecting orientation
selected_orientation = st.selectbox('Select an Orientation to visualize:', [2, 3, 4, 5])
# Dropdown for selecting the x variable
selected_x_var = st.selectbox('Choose the x variable:',
['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area', 'Overall Height',
'Glazing Area', 'Glazing Area Distribution', 'Heating Load'])
# Dropdown for selecting the y variable
selected_y_var = st.selectbox('Choose the y variable:',
['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area', 'Overall Height',
'Glazing Area', 'Glazing Area Distribution', 'Heating Load'])
# Filter the dataset based on the selected orientation
df_filtered = df_energy_eff[df_energy_eff['Orientation'] == selected_orientation]
# Create and display the Altair scatterplot
scatterplot = (alt.Chart(
df_filtered, title=f"Scatterplot of Orientation {selected_orientation} for Energy Efficiency"
).mark_circle().encode(
x=selected_x_var,
y=selected_y_var,
color="Heating Load"
).interactive()
)
st.altair_chart(scatterplot, use_container_width=True)
# Add a button to rerun the script
st.button("Re-run")
The clear advantage of this approach is that there are always results shown in this application, but the results may not be useful to the user! For larger applications, this is a subpar solution as well because any data stored inside the app, regardless of use, is going to slow the application down.
Our second option is to halt the application entirely until the user uploads a file. For this, we can use the Streamlit function st.stop(), which halts the app's execution wherever it is called. This approach is also useful for flagging errors and prompting users to take action or report issues. While it isn't necessary for our current scenario, it's a valuable technique for future applications. The following code demonstrates how to use an if-else statement with st.stop() in the else block to prevent the app from running until st.file_uploader() has been used:
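Here is a minimal sketch of that pattern, reusing the DataFrame names from the app above (the "(default provided)" label is dropped, since this variant has no default):
import streamlit as st
import pandas as pd

df_energy_file = st.file_uploader("Select Local Energy Efficiency CSV")
if df_energy_file is not None:
    df_energy_eff = pd.read_csv(df_energy_file)
else:
    # Halt the script here; nothing below runs until a file is uploaded
    st.stop()
st.write(df_energy_eff.head())
The rest of the app (the select boxes and the Altair scatterplot) would follow unchanged after the st.write() call.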
As we develop more computationally intensive Streamlit apps and handle larger datasets, it's important to focus on runtime efficiency. One effective way to improve efficiency is through caching, which involves storing results in memory to avoid repeated computations.
Caching in Streamlit works similarly to human short-term memory, where frequently accessed information is kept readily available. When a function's result is cached, Streamlit retrieves the stored result from memory if the function is called again with the same parameters, rather than recomputing it.
To demonstrate caching, we'll create a function for data upload and use the time library to artificially delay its execution. We'll then use st.cache_data to see whether it improves the app's performance. Note that Streamlit also has st.cache_resource for caching resources like database connections and machine learning models. For now, we'll focus on caching data. The following code defines a load_file() function that simulates a delay of 3 seconds to test whether caching effectively speeds up the app.
import streamlit as st
import pandas as pd
import altair as alt
import seaborn as sns
import time
st.title("Energy Efficiency Dataset")
st.markdown('Use this Streamlit app to create scatterplots between attributes!')
energy_file = st.file_uploader("Select Local Energy Efficiency CSV (default provided)")
def load_file(energy_file):
    time.sleep(3)
    if energy_file is not None:
        df = pd.read_csv(energy_file)
    else:
        df = pd.read_csv('energyeff.csv')
    return df
df_energy_eff = load_file(energy_file)
sns.set_style('darkgrid')
markers = {2: "X", 3: "s", 4:'o', 5:'*'}
selected_ori = st.selectbox('What Orientation would you like to visualize?',
[2, 3, 4, 5])
selected_x_var = st.selectbox('What do you want the x variable to be?',
['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area', 'Overall Height',
'Glazing Area', 'Glazing Area Distribution', 'Heating Load'])
selected_y_var = st.selectbox('What about the y?',
['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area', 'Overall Height',
'Glazing Area', 'Glazing Area Distribution', 'Heating Load'])
df_energy_eff = df_energy_eff[df_energy_eff['Orientation'] == selected_ori]
alt_chart = (
alt.Chart(df_energy_eff, title=f"Scatterplot of Orientation {selected_ori} for Energy Efficiency")
.mark_circle()
.encode(
x=selected_x_var,
y=selected_y_var,
color="Orientation"
)
.interactive()
)
st.altair_chart(alt_chart, use_container_width=True)
Now, let’s run this app and then select the hamburger icon in the top right and press the rerun button (we can also just press the R key to rerun).
We notice that each time we rerun the app, it takes at least 3 seconds. Now, let's add our cache decorator on top of the load_file() function and run the app again:
import streamlit as st
import pandas as pd
import altair as alt
import seaborn as sns
import time
st.title("Energy Efficiency Dataset")
st.markdown('Use this Streamlit app to create scatterplots between attributes!')
energy_file = st.file_uploader("Select Local Energy Efficiency CSV (default provided)")
@st.cache_data()
def load_file(energy_file):
    time.sleep(3)
    if energy_file is not None:
        df = pd.read_csv(energy_file)
    else:
        df = pd.read_csv('energyeff.csv')
    return df
df_energy_eff = load_file(energy_file)
sns.set_style('darkgrid')
markers = {2: "X", 3: "s", 4:'o', 5:'*'}
selected_ori = st.selectbox('What Orientation would you like to visualize?',
[2, 3, 4, 5])
selected_x_var = st.selectbox('What do you want the x variable to be?',
['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area', 'Overall Height',
'Glazing Area', 'Glazing Area Distribution', 'Heating Load'])
selected_y_var = st.selectbox('What about the y?',
['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area', 'Overall Height',
'Glazing Area', 'Glazing Area Distribution', 'Heating Load'])
df_energy_eff = df_energy_eff[df_energy_eff['Orientation'] == selected_ori]
alt_chart = (
alt.Chart(df_energy_eff, title=f"Scatterplot of Orientation {selected_ori} for Energy Efficiency")
.mark_circle()
.encode(
x=selected_x_var,
y=selected_y_var,
color="species"
)
.interactive()
)
st.altair_chart(alt_chart, use_container_width=True)
In the code provided, the @st.cache_data() decorator is used to cache the results of the load_file() function. Here's what happens with @st.cache_data():

- Caching: Streamlit stores the results of the load_file() function in memory. If the function is called again with the same energy_file parameter, Streamlit retrieves the cached result rather than re-running the function, which reduces redundant computation and speeds up the app.
- Simulated delay: the time.sleep(3) line in the function simulates a delay, which is only experienced during the initial run. If the function is called again with the same parameters, the cached result is used and the delay is avoided.
- Cache invalidation: if energy_file changes, the cache is not used, and the function runs again to load the new file.

Overall, @st.cache_data() improves the app's performance by avoiding unnecessary re-computation and data loading delays for the same input parameters.
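Though not needed here, st.cache_resource follows the same decorator pattern for objects that should be created once and shared across reruns, such as models or database connections. A minimal sketch, assuming the pickled model file used later in this notebook:
import pickle
import streamlit as st

@st.cache_resource
def load_model(path):
    # Load the model once; subsequent calls with the same path reuse the cached object
    with open(path, "rb") as f:
        return pickle.load(f)

model = load_model("random_forest_churn.pickle")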
One of the most challenging aspects for new developers working with Streamlit is dealing with these two facts:

- Streamlit reruns your entire script from top to bottom on every user interaction.
- Ordinary Python variables are re-created on each rerun, so nothing persists between runs by default.
These two facts make it difficult to build certain types of apps! This is best shown with an example. Let's say we want to make a to-do app that makes it easy to add items to a to-do list. Adding user input in Streamlit is really simple, so we can create one quickly in a new file called session_state_example.py that looks like the following:
import streamlit as st
st.title('My To-Do List Creator')
my_todo_list = ["Go to Costco", "Got to swimming pool", "Learn English"]
st.write('My current To-Do list is:', my_todo_list)
new_todo = st.text_input("What do you need to do?")
if st.button('Add the new To-Do item'):
    st.write('Adding a new item to the list')
    my_todo_list.append(new_todo)
    st.write('My new To-Do list is:', my_todo_list)
Once you try to add more than one item to the list, you will notice that it resets the original list and forgets what the first item you entered was!
Enter st.session_state. Session State is a Streamlit feature that acts as a global dictionary persisting throughout a user's session. It lets us overcome the two issues above by storing user input in this global dictionary. To add Session State, we first check whether our to-do list is already in the session_state dictionary and, if not, set the default value. On each button click, we update the list stored in the session_state dictionary.
import streamlit as st
st.title('My To-Do List Creator')
if 'my_todo_list' not in st.session_state:
    st.session_state.my_todo_list = ["Go to Costco", "Go to swimming pool", "Learn English"]
new_todo = st.text_input("What do you need to do?")
if st.button('Add the new To-Do item'):
    st.write('Adding a new item to the list')
    st.session_state.my_todo_list.append(new_todo)
st.write('My To-Do list is:', st.session_state.my_todo_list)
Whenever you want to keep information from the user across runs, st.session_state can help you out.
Visualization is an essential tool for modern data scientists, often serving as the primary method for understanding elements such as statistical models (e.g., through an AUC chart), the distribution of key variables (via histograms), or important business metrics.
We explored two popular Python graphing libraries, Matplotlib and Altair, through various examples. This session expands on that by introducing a wider range of Python graphing libraries, including some native to Streamlit. We will cover Streamlit's built-in charting functions (st.line_chart(), st.bar_chart(), st.area_chart(), and st.map()), Plotly, Matplotlib and Seaborn, Bokeh, Altair, and PyDeck.
Environment and Climate Change Canada requires certain facilities across Canada to report their greenhouse gas (GHG) emissions annually through the Greenhouse Gas Reporting Program. The reported emissions can be accessed freely at https://www.canada.ca/en/environment-climate-change/services/climate-change/greenhouse-gas-emissions/facility-reporting/data. The values are in Megatonnes CO2/year. All facilities that emit the equivalent of 50 kilotonnes (kt) or more of GHGs in carbon dioxide equivalent units (CO2 eq.) per year are required to submit a report.
Streamlit offers four built-in functions for graphing: st.line_chart(), st.bar_chart(), st.area_chart(), and st.map(). These functions automatically infer the variables you want to graph and display them as a line chart, bar chart, area chart, or map.
import streamlit as st
import pandas as pd
st.title('Canada GHG Emission')
st.write(
"""This app analyzes GHG emission for Environment and Climate Change of Canada."""
)
ghg_df = pd.read_csv('./GHG/GHGEmissions2004-Present.csv')
ghg_df_grouped = pd.DataFrame(ghg_df.groupby(['ReferenceYear',
'FacilityProvince']).sum()['TotalEmissions']).reset_index()
ghg_2022 = ghg_df_grouped[ghg_df_grouped.ReferenceYear==2022]
st.line_chart(data=ghg_2022, x='FacilityProvince', y='TotalEmissions', color='#ff2b2b')
st.bar_chart(data=ghg_2022, x='FacilityProvince', y='TotalEmissions', color='#09ab3b')
st.area_chart(data=ghg_2022, x='FacilityProvince', y='TotalEmissions', color='#0068c9')
Each of these charts is also interactive by default! We can zoom in or out, roll the mouse over points/bars/lines to see each data point, and even view the full screen out of the box. These Streamlit functions call a popular graphing library called Altair.
There is one more built-in Streamlit graphing function we should discuss: st.map(). Like the previous functions, it wraps another Python graphing library, this time PyDeck instead of Altair. It searches the DataFrame for columns named longitude, long, latitude, or lat to identify the coordinates, plots each row as a point on a map, auto-zooms and focuses the map, and displays it in our Streamlit app. Note that visualizing detailed maps is much more computationally intensive than the other visualizations we've used so far, so we will sample 1,000 random rows from our DataFrame, remove null values, and call st.map() with the following code:
import streamlit as st
import pandas as pd
st.title('Canada GHG Emission')
st.write(
"""This app analyzes GHG emission for Environment and Climate Change of Canada."""
)
ghg_df = pd.read_csv('GHGEmissions2004-Present.csv')
ghg_df_grouped = pd.DataFrame(ghg_df.groupby(['ReferenceYear',
'FacilityProvince',
'latitude',
'longitude']).mean()['TotalEmissions']).reset_index()
ghg_2022 = ghg_df_grouped[ghg_df_grouped.ReferenceYear==2022]
ghg_2022 = ghg_2022.dropna(subset=['longitude', 'latitude'])
ghg_2022 = ghg_2022.sample(n = 1000)
st.map(ghg_2022)
As we've observed, these built-in functions are useful for quickly creating Streamlit apps, but there is a trade-off between speed and customizability. In practice, we seldom use these functions when developing Streamlit apps, although we often rely on them for quick visualizations of data within Streamlit. For production, more powerful libraries like Matplotlib, Plotly, Seaborn, and PyDeck provide the flexibility and customizability we need.
Plotly is an interactive visualization library widely used by data scientists to visualize data in Jupyter notebooks, either locally in the browser or on a web platform such as Dash (created by the developers of Plotly). This library shares a similar purpose with Streamlit, primarily focusing on creating internal or external dashboards (hence the name Dash).
Streamlit allows us to integrate Plotly graphs within Streamlit apps using the st.plotly_chart() function, making it easy to port any Plotly or Dash dashboards. We'll demonstrate this by creating a histogram of the 2022 total emissions, similar to the graph we made earlier. The following code produces our Plotly histogram:
import streamlit as st
import pandas as pd
import plotly.express as px
st.title('Canada GHG Emission')
st.write(
"""This app analyzes GHG emission for Environment and Climate Change of Canada."""
)
ghg_df = pd.read_csv('GHGEmissions2004-Present.csv')
ghg_df_grouped = pd.DataFrame(ghg_df.groupby(['ReferenceYear',
'FacilityProvince']).mean()['TotalEmissions']).reset_index()
ghg_2022 = ghg_df_grouped[ghg_df_grouped.ReferenceYear==2022]
fig = px.histogram(ghg_2022['TotalEmissions'], nbins=500)
st.plotly_chart(fig)
Matplotlib and Seaborn figures can also be embedded with st.pyplot(). The following code plots the same histogram of 2022 total emissions with both libraries:
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
st.title('Canada GHG Emission')
st.write(
"""This app analyzes GHG emission for Environment and Climate Change of Canada."""
)
ghg_df = pd.read_csv('GHGEmissions2004-Present.csv')
ghg_df_grouped = pd.DataFrame(ghg_df.groupby(['ReferenceYear',
'FacilityProvince']).mean()['TotalEmissions']).reset_index()
ghg_2022 = ghg_df_grouped[ghg_df_grouped.ReferenceYear==2022]
st.subheader('Seaborn Chart for GHG Emission 2022')
fig_sb, ax_sb = plt.subplots(figsize=(4, 2))
ax_sb = sns.histplot(ghg_2022['TotalEmissions'])
plt.xlabel('TotalEmissions_2022')
st.pyplot(fig_sb)
st.subheader('Matplotlib Chart for GHG Emission 2022')
fig_mpl, ax_mpl = plt.subplots(figsize=(4, 2))
ax_mpl = plt.hist(ghg_2022['TotalEmissions'])
plt.xlabel('TotalEmissions_2022')
st.pyplot(fig_mpl)
Bokeh is another web-based interactive visualization library that also offers dashboarding capabilities built on top of it. It is a direct competitor to Plotly and is quite similar in use, with some stylistic differences. Bokeh is an extremely popular Python visualization package that many Python users find comfortable to use.
We can integrate Bokeh graphs into Streamlit in a manner similar to Plotly. First, we create the Bokeh graph, and then use the st.bokeh_chart() function to display it in the Streamlit app. In Bokeh, we need to instantiate a Bokeh figure object and configure its properties before plotting it. It's important to note that any changes made to the Bokeh figure object after calling the st.bokeh_chart() function will not be reflected in the graph displayed in the Streamlit app.
import streamlit as st
import pandas as pd
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
st.title('Canada GHG Emission')
st.write(
    """This app analyzes GHG emission for Environment and Climate Change of Canada."""
)
st.subheader('Bokeh Chart')
ghg_df = pd.read_csv('GHGEmissions2004-Present.csv')
ghg_df_grouped = pd.DataFrame(ghg_df.groupby(['ReferenceYear',
'FacilityProvince']).mean()['TotalEmissions']).reset_index()
ghg_2022 = ghg_df_grouped[ghg_df_grouped.ReferenceYear==2022]
# Create the blank plot
provinces = list(ghg_2022.FacilityProvince.drop_duplicates())
histogram = figure(x_range=provinces,
                   title='Bokeh_TotalEmissions',
                   x_axis_label='FacilityProvince',
                   y_axis_label='TotalEmissions')
histogram.vbar(x=provinces, top=ghg_2022.TotalEmissions, width=0.9)
histogram.xgrid.grid_line_color = None
histogram.y_range.start = 0
st.bokeh_chart(histogram)
Altair has already been used indirectly through built-in Streamlit functions such as st.line_chart(), and it can be used directly through st.altair_chart():
import streamlit as st
import pandas as pd
import altair as alt
st.title('Canada GHG Emission')
st.write(
"""This app analyzes GHG emission for Environment and Climate Change of Canada."""
)
ghg_df = pd.read_csv('GHGEmissions2004-Present.csv')
ghg_df_grouped = pd.DataFrame(ghg_df.groupby(['ReferenceYear',
'FacilityProvince']).sum()['TotalEmissions']).reset_index()
ghg_2022 = ghg_df_grouped[ghg_df_grouped.ReferenceYear==2022]
fig = alt.Chart(ghg_2022).mark_bar().encode(x ='FacilityProvince',
y='TotalEmissions').properties(
width=800,
height=500
)
st.altair_chart(fig)
Streamlit also allows us to use more complex visualization libraries, such as PyDeck for geographical mapping. In fact, we already used PyDeck indirectly through the native st.map() function.
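Streamlit also exposes PyDeck directly through st.pydeck_chart(). Here is a hedged sketch of plotting the facility coordinates ourselves; the layer type, point radius, and initial view below are illustrative choices, and the latitude/longitude columns are assumed from the st.map() example above:
import streamlit as st
import pandas as pd
import pydeck as pdk

ghg_df = pd.read_csv('GHGEmissions2004-Present.csv')
ghg_df = ghg_df.dropna(subset=['longitude', 'latitude'])

# One point per facility; the radius is an arbitrary illustrative value
layer = pdk.Layer(
    'ScatterplotLayer',
    data=ghg_df[['longitude', 'latitude']],
    get_position=['longitude', 'latitude'],
    get_radius=20000,
)
# Roughly center the initial view on Canada
view_state = pdk.ViewState(latitude=56.1, longitude=-106.3, zoom=2.5)
st.pydeck_chart(pdk.Deck(layers=[layer], initial_view_state=view_state))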
Data scientists often face a common challenge at the end of the model creation process: figuring out how to convince non-data scientists of their model's value. They might have performance metrics and static visualizations but lack an easy way to allow others to interact with their model.
Before Streamlit, the main options were creating a full-fledged app with Flask or Django or turning a model into an Application Programming Interface (API) for developers. While effective, these methods are time-consuming and not ideal for quick prototyping.
The incentives for teams can be misaligned. Data scientists aim to create the best models, but building a Flask or Django app requires significant time (a day or two, or a few hours for experienced developers), making it impractical to do so until the modeling process is nearly complete. However, it would be beneficial for data scientists to involve stakeholders early and often to ensure they are building solutions that meet actual needs.
Streamlit simplifies this process, turning the arduous task of app creation into a seamless experience. In this section, we'll cover how to create Machine Learning (ML) prototypes in Streamlit, add user interaction to ML apps, and interpret ML results. We'll use popular ML libraries such as PyTorch, Hugging Face, OpenAI, and scikit-learn.
The first step in creating an app that utilizes machine learning (ML) is developing the ML model itself. There are numerous popular workflows for creating ML models, and you likely already have your own! This process consists of two main parts:
If the plan is to train the model once and then use it in our Streamlit app, the best approach is to create the model outside of Streamlit first (e.g., in a Jupyter notebook or a standard Python file), and then incorporate the model into the app.
If the plan involves using user input to train the model within the app, the model must be trained inside the Streamlit app rather than externally.
We will begin by building our ML models outside of Streamlit and then progress to training our models within Streamlit apps.
Retrieved from Streamlit for Data Science, Second Edition by Tyler Richards.
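For context, the pre-trained model and scaler loaded in the app below could have been produced offline by a script along these lines. This is a hedged sketch: the preprocessing mirrors the app's input handling, but the CSV file name and the hyperparameters are assumptions, and the actual training script is not shown here.
import pickle
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('Churn_Modelling.csv')  # Bank Turnover Dataset from Kaggle (file name assumed)

# One-hot encode Geography and binarize Gender, mirroring the app's preprocessing
geog = pd.get_dummies(df['Geography'], prefix='Is')
gender = (df['Gender'] == 'Male').astype(int)

# Standardize the continuous predictors only
scaler = StandardScaler()
cont = scaler.fit_transform(df[['CreditScore', 'Age', 'Tenure', 'Balance', 'EstimatedSalary']])

# Feature order matches the app: standardized columns first, then the rest
X = np.concatenate([cont,
                    geog[['Is_France', 'Is_Germany', 'Is_Spain']].values,
                    gender.values.reshape(-1, 1),
                    df[['NumOfProducts', 'HasCrCard', 'IsActiveMember']].values], axis=1)
y = df['Exited'].values

rfc = RandomForestClassifier(random_state=42).fit(X, y)

# Persist the model and scaler for the Streamlit app
with open('random_forest_churn.pickle', 'wb') as f:
    pickle.dump(rfc, f)
with open('std.pickle', 'wb') as f:
    pickle.dump(scaler, f)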
import pickle
import streamlit as st
import numpy as np
st.title('Customer Churn Prediction')
st.write(
    """This app is created with [Streamlit](https://streamlit.io/) and uses a pre-trained model to predict the likelihood that bank
    customers will churn next cycle. A random forest classifier was trained on the Bank Turnover Dataset from
    [Kaggle](https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling/version/1).
    The model uses 10 inputs (predictors): `Geography`, `CreditScore`,
    `Gender`, `Age`, `Tenure`, `Balance`, `NumOfProducts`, `HasCrCard`, `IsActiveMember`, `EstimatedSalary`."""
)
rf_pickle = open("random_forest_churn.pickle", "rb")
rfc = pickle.load(rf_pickle)
rf_pickle.close()
Geography = st.selectbox("Geography", options=["France", "Germany", "Spain"])
CreditScore = st.number_input("CreditScore", min_value=300)
Gender = st.selectbox("Gender", options=['Male', 'Female'])
Age = st.number_input("Age", min_value=18)
Tenure = st.number_input("Tenure", min_value=2)
Balance = st.number_input("Balance", min_value=500)
NumOfProducts = st.number_input("NumOfProducts", min_value=1)
HasCrCard = st.selectbox("HasCrCard", options=[0, 1])
IsActiveMember = st.selectbox("IsActiveMember", options=[0, 1])
EstimatedSalary = st.number_input("EstimatedSalary", min_value=1000)
user_inputs = [Geography, CreditScore, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary]
st.write("The user inputs are `Geography`, `CreditScore`, `Gender`, `Age`, `Tenure`, `Balance`, `NumOfProducts`, `HasCrCard`, `IsActiveMember`, `EstimatedSalary`")
Is_France, Is_Germany, Is_Spain = 0, 0, 0
if Geography == 'France':
    Is_France = 1
elif Geography == 'Germany':
    Is_Germany = 1
elif Geography == 'Spain':
    Is_Spain = 1
if Gender == 'Male':
    Gender = 1
elif Gender == 'Female':
    Gender = 0
std_pickle = open("std.pickle", "rb")
scaler = pickle.load(std_pickle)
std_pickle.close()
clmn_std = np.array([CreditScore, Age, Tenure, Balance, EstimatedSalary]).reshape(1, 5)
clmn_not_std = np.array([Is_France, Is_Germany, Is_Spain,
Gender, NumOfProducts, HasCrCard, IsActiveMember]).reshape(1, 7)
feat_std = scaler.transform(clmn_std)
to_pred = np.concatenate((feat_std, clmn_not_std), axis=1)
if st.button("Predict", type="primary"):
y_pred = int(rfc.predict_proba(to_pred)[0][0]*100)
st.markdown(f"""<p style='font-size:24px;'>The likelihood of churn for this customer is <strong>{y_pred}%</strong></p>.""", unsafe_allow_html=True)
We often want user input to influence how our model is trained, whether it's by providing their own data, selecting specific features, or even choosing the type of machine learning algorithm. Streamlit makes all these options possible.
As noted earlier, if a model only needs to be trained once, it's usually better to train it outside of Streamlit and then import the trained model. However, consider a scenario where churn prediction data is stored locally, or the user knows how to retrain the model and has the data in the correct format. In such cases, we can use the st.file_uploader() feature to let users upload their own data and get a custom model deployed without writing any code.
The code below allows users to upload their data and run the preprocessing/training steps to create a unique model for them. It's important to note that this will only work if the user's data matches the exact format and style we used, which may not always be the case. A potential improvement would be to display the required data format to the user, ensuring the app can train the model correctly.
import streamlit as st
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.model_selection import train_test_split
from ydata_profiling import ProfileReport
import shap
import time
from streamlit_pandas_profiling import st_profile_report
from matplotlib.ticker import PercentFormatter
st.title('Customer Churn Prediction')
st.write(
    """This app is created with [Streamlit](https://streamlit.io/) to train a model that predicts
    customer turnover for the next cycle. A random forest classifier is trained on the Bank Turnover Dataset from
    [Kaggle](https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling/version/1).
    The model is trained on up to 10 inputs (predictors): `Geography`, `CreditScore`,
    `Gender`, `Age`, `Tenure`, `Balance`, `NumOfProducts`, `HasCrCard`, `IsActiveMember`, `EstimatedSalary`."""
)
st.image('DataTable.jpg')
with st.form('input'):
    churn_file = st.file_uploader('Upload your churn data')
    st.form_submit_button()
if churn_file is None:
    st.stop()  # Halt the app until the user uploads a file
df = pd.read_csv(churn_file)
# Shuffle the data
np.random.seed(42)
df  # Streamlit "magic": a bare variable is rendered in the app
if st.button("profiling", type="primary"):
    profile = df.profile_report(title='Pandas Profiling Report')
    st_profile_report(profile)
df = df.reindex(np.random.permutation(df.index))
df.reset_index(inplace=True, drop=True)  # Reset index
# Remove 'RowNumber', 'CustomerId', 'Surname' features
df = df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1, inplace=False)
# Training and test split
st.subheader("Split Data to Training and Test")
test_size = st.number_input("proportion of test data", min_value=0.1, max_value=0.4)
spt = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=42)
for train_idx, test_idx in spt.split(df, df['Exited']):
    train_set_strat = df.loc[train_idx]
    test_set_strat = df.loc[test_idx]
train_set_strat.reset_index(inplace=True, drop=True)  # Reset index
test_set_strat.reset_index(inplace=True, drop=True)  # Reset index
st.subheader("Select Input")
features = ["Geography", "CreditScore", "Gender", "Age", "Tenure",
"Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary"]
selected_features = st.multiselect('Select features:', features)
selected_features_test = selected_features.copy()
st.subheader("RandomForest Hyperparameters")
n_estimators = st.number_input("n_estimators", min_value=10, max_value=200)
max_depth = st.number_input("max_depth", min_value=5, max_value=30)
min_samples_split = st.number_input("min_samples_split", min_value=5, max_value=30)
bootstrap = st.selectbox("bootstrap", options=[True, False])
random_state = st.number_input("random_state", min_value=1)
# Random Forest for training set
if st.button("Train Random Forest", type="primary"):
    # Text handling
    # Convert Geography to one-hot encoding
    clmn = []
    # Convert Gender to 0 and 1
    if 'Gender' in selected_features:
        ordinal_encoder = OrdinalEncoder()
        clmn.append('Gender')
        train_set_strat['Gender'] = ordinal_encoder.fit_transform(train_set_strat[['Gender']])
    # Remove 'Geography'
    if 'Geography' in selected_features:
        Geog_1hot = pd.get_dummies(train_set_strat['Geography'], prefix='Is')
        clmn.append(list(Geog_1hot.columns))
        selected_features.remove('Geography')
        train_set_strat = train_set_strat.drop(['Geography'], axis=1, inplace=False)
        train_set_strat = pd.concat([Geog_1hot, train_set_strat], axis=1)  # Concatenate columns
    if 'NumOfProducts' in selected_features:
        clmn.append('NumOfProducts')
    if 'HasCrCard' in selected_features:
        clmn.append('HasCrCard')
    if 'IsActiveMember' in selected_features:
        clmn.append('IsActiveMember')
    # Standardization
    # Make training features and target
    X_train = train_set_strat.drop("Exited", axis=1)
    y_train = train_set_strat["Exited"].values
    selected_features_con = [i for i in selected_features if i not in clmn]
    if len(selected_features_con) == 0 and len(clmn) == 0:
        raise ValueError("Please select at least one input!")
    # Divide into two training sets (with and without standardization)
    X_train_for_std = X_train[selected_features_con]
    clmn = [item for sublist in clmn for item in (sublist if isinstance(sublist, list) else [sublist])]
    X_train_not_std = X_train[clmn]
    st.session_state.clmns_all = selected_features_con + clmn
    #
    scaler = StandardScaler()
    scaler.fit(X_train_for_std)
    #
    df_train_std = scaler.transform(X_train_for_std)
    X_train_std = np.concatenate((df_train_std, X_train_not_std), axis=1)
    # Initialize the progress bar
    progress_bar = st.progress(0)
    progress_step = 100 / 3  # Assuming 3 main steps in the process
    rnd = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                 min_samples_split=min_samples_split, bootstrap=bootstrap,
                                 random_state=random_state)
    rnd.fit(X_train_std, y_train)
    st.session_state.X_train_std = X_train_std
    st.session_state.y_train = y_train
    progress_bar.progress(100)  # Complete the progress bar
    # Convert Gender to 0 and 1
    if 'Gender' in selected_features_test:
        ordinal_encoder = OrdinalEncoder()
        test_set_strat['Gender'] = ordinal_encoder.fit_transform(test_set_strat[['Gender']])
    # Remove 'Geography'
    if 'Geography' in selected_features_test:
        Geog_1hot = pd.get_dummies(test_set_strat['Geography'], prefix='Is')
        test_set_strat = test_set_strat.drop(['Geography'], axis=1, inplace=False)
        test_set_strat = pd.concat([Geog_1hot, test_set_strat], axis=1)  # Concatenate columns
    # Standardize data
    X_test = test_set_strat.drop("Exited", axis=1)
    y_test = test_set_strat["Exited"].values
    #
    X_test_for_std = X_test[selected_features_con]
    X_test_not_std = X_test[clmn]
    #
    df_test_std = scaler.transform(X_test_for_std)
    X_test_std = np.concatenate((df_test_std, X_test_not_std), axis=1)
    # Random Forest for test set
    y_test_pred = rnd.predict(X_test_std)
    y_test_proba_rnd = rnd.predict_proba(X_test_std)
    score = accuracy_score(y_test_pred, y_test)
    st.markdown(f"""<p style='font-size:24px;'>Random Forest
    model was trained. The accuracy score for the test set is <strong>{int(score*100)}%</strong></p>""", unsafe_allow_html=True)
    st.session_state.rnd = rnd
# Apply feature importance with Shapley values (SHAP)
if st.button("Shapley Feature Importance", type="secondary"):
    # Plot the importance of features
    font = {'size': 7}
    plt.rc('font', **font)
    fig, ax1 = plt.subplots(figsize=(6, 3), dpi=180, facecolor='w', edgecolor='k')
    progress_step = 100 / 2
    explainer = shap.TreeExplainer(st.session_state.rnd)
    X_train_std = pd.DataFrame(st.session_state.X_train_std,
                               columns=st.session_state.clmns_all)
    shap_values = explainer(X_train_std)
    shap_values_for_class = shap_values[..., 0]
    progress_bar = st.progress(int(progress_step))
    shap.plots.beeswarm(shap_values_for_class,
                        max_display=len(st.session_state.clmns_all))
    st.pyplot(fig)
    # Complete the progress bar
    progress_bar.progress(100)
# Apply feature importance with Random Forest
if st.button("Random Forest Feature Importance", type="secondary"):
    class prfrmnce_plot(object):
        """Plot performance of features to predict a target"""
        def __init__(self, importance: list, title: str, ylabel: str, clmns: str,
                     titlefontsize: int = 10, xfontsize: int = 5, yfontsize: int = 8) -> None:
            self.importance = importance
            self.title = title
            self.ylabel = ylabel
            self.clmns = clmns
            self.titlefontsize = titlefontsize
            self.xfontsize = xfontsize
            self.yfontsize = yfontsize
        #########################
        def bargraph(self, select: bool = False, fontsizelable: bool = False, xshift: float = -0.1, nsim: int = False,
                     yshift: float = 0.01, perent: bool = False, xlim: list = False, axt=None,
                     ylim: list = False, y_rot: int = 0, graph_float: bool = True) -> pd.DataFrame:
            ax1 = axt or plt.axes()
            if not nsim:
                # Make all negative coefficients positive
                sort_score = sorted(zip(abs(self.importance), self.clmns), reverse=True)
                Clmns_sort = [sort_score[i][1] for i in range(len(self.clmns))]
                sort_score = [sort_score[i][0] for i in range(len(self.clmns))]
            else:
                importance_agg = []
                importance_std = []
                for iclmn in range(len(self.clmns)):
                    tmp = []
                    for isim in range(nsim):
                        tmp.append(abs(self.importance[isim][iclmn]))
                    importance_agg.append(np.mean(tmp))
                    importance_std.append(np.std(tmp))
                # Make all negative coefficients positive
                sort_score = sorted(zip(importance_agg, self.clmns), reverse=True)
                Clmns_sort = [sort_score[i][1] for i in range(len(self.clmns))]
                sort_score = [sort_score[i][0] for i in range(len(self.clmns))]
            index1 = np.arange(len(self.clmns))
            # Select the most important features
            if (select):
                Clmns_sort = Clmns_sort[:select]
                sort_score = sort_score[:select]
            ax1.bar(Clmns_sort, sort_score, width=0.6, align='center', alpha=1, edgecolor='k', capsize=4, color='b')
            plt.title(self.title, fontsize=self.titlefontsize)
            ax1.set_ylabel(self.ylabel, fontsize=self.yfontsize)
            ax1.set_xticks(np.arange(len(Clmns_sort)))
            ax1.set_xticklabels(Clmns_sort, fontsize=self.xfontsize, rotation=90, y=0.02)
            if (perent): plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
            ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.2)
            if (xlim): plt.xlim(xlim)
            if (ylim): plt.ylim(ylim)
            if (fontsizelable):
                for ii in range(len(sort_score)):
                    if (perent):
                        plt.text(xshift+ii, sort_score[ii]+yshift, f'{"{0:.1f}".format(sort_score[ii]*100)}%',
                                 fontsize=fontsizelable, rotation=y_rot, color='k')
                    else:
                        if graph_float:
                            plt.text(xshift+ii, sort_score[ii]+yshift, f'{"{0:.3f}".format(sort_score[ii])}',
                                     fontsize=fontsizelable, rotation=y_rot, color='k')
                        else:
                            plt.text(xshift+ii, sort_score[ii]+yshift, f'{"{0:.0f}".format(sort_score[ii])}',
                                     fontsize=fontsizelable, rotation=y_rot, color='k')
            dic_Clmns = {}
            for i in range(len(Clmns_sort)):
                dic_Clmns[Clmns_sort[i]] = sort_score[i]

    # Plot the importance of features
    font = {'size': 7}
    plt.rc('font', **font)
    fig, ax1 = plt.subplots(figsize=(6, 3), dpi=180, facecolor='w', edgecolor='k')
    # Calculate importance
    importance = abs(st.session_state.rnd.feature_importances_)
    df_most_important = prfrmnce_plot(importance, title='Feature Importance by Random Forest',
                                      ylabel='Random Forest Score', clmns=st.session_state.clmns_all, titlefontsize=9,
                                      xfontsize=7, yfontsize=8).bargraph(perent=True, fontsizelable=8, xshift=-0.25, axt=ax1,
                                                                         yshift=0.01, ylim=[0, max(importance)+0.05],
                                                                         xlim=[-0.5, len(st.session_state.clmns_all)+0.5], y_rot=0)
    st.pyplot(fig)
Up to this point, we've focused on building Streamlit apps, covering everything from creating intricate visualizations to developing and deploying machine learning models. Now, let's turn our attention to deployment, enabling these applications to be shared with anyone who has internet access. Deployment is essential for Streamlit apps because, without it, users may face obstacles accessing your work. If we believe that Streamlit removes barriers between creating data science analyses/products/models and sharing them, then the ability to share these apps widely is just as crucial as ease of development.
There are three primary methods for deploying Streamlit apps:

- On Streamlit Community Cloud
- On Hugging Face Spaces
- With a cloud provider such as AWS or Heroku
AWS and Heroku are paid options, while Streamlit Community Cloud and Hugging Face Spaces are free! The most straightforward and popular choice for many Streamlit users is Streamlit Community Cloud, so we’ll focus on it first and cover Hugging Face Spaces afterward.
Streamlit Community Cloud is Streamlit's streamlined solution for quick and easy deployment, and it is highly recommended for deploying Streamlit applications. After initially enjoying the experience of developing and deploying apps locally with Streamlit, I found trying to deploy on AWS somewhat discouraging. Then Streamlit introduced its own deployment solution, now known as Streamlit Community Cloud. I was initially skeptical about its simplicity, but it turned out to be as easy as pushing code to a GitHub repository and linking Streamlit to it; Streamlit takes care of the rest.
Though there are cases where configuring storage or memory may be needed, Streamlit Community Cloud usually handles deployment, resources, and sharing, simplifying the development workflow.
The goal now is to deploy the customer churn model on Streamlit Community Cloud. Before we start, remember that Streamlit Community Cloud integrates with GitHub. If you're already familiar with Git and GitHub, you can skip ahead.
git init
git add Churn_Modelling.py
git commit -m 'our first repo commit'
git branch -M main
git remote add origin https://github.com/MehdiRezvandehy/Customer_Chrun_Prediction.git
git push -u origin main
Once the app is built, it’s fully deployed as a Streamlit app. Any changes made to the GitHub repository will automatically be reflected in the app. For instance, the following code snippet changes the app's title (only the essential commands are shown for brevity):
git add .
git commit -m 'updated the title'
git push
The app will have its own unique URL, and if you ever need to locate your Streamlit apps, you can always find them at share.streamlit.io. The top of the app should now appear as shown in the following screenshot
With all the necessary files now in the GitHub repository, we're almost ready to deploy our application. The remaining steps, described below, complete the deployment.
When deploying to Streamlit Community Cloud, Streamlit uses its own servers to host the app. This requires us to specify the Python libraries our app needs to run. The following commands install the helpful pipreqs library and create a requirements.txt file in the format required by Streamlit:
pip install pipreqs
pipreqs .
!pipreqs . # in jupyter notebook
To generate the requirements.txt file, pipreqs scans all the Python files, identifies the imported libraries, and lists them with the specific versions needed. This ensures that Streamlit installs the correct libraries, minimizing the risk of errors.
Since the requirements.txt file is new, we need to add it to the GitHub repository. Use the following commands to do so:
git add requirements.txt
git commit -m 'add requirements file'
git push
The final step is to sign up for Streamlit Community Cloud at https://share.streamlit.io/. Once logged in, click on the New App button. For more details, refer to Deploy your app. You can then point Streamlit Community Cloud to the Python file that contains your app's code, which in this case is Churn_Modelling_Train.py.
The requirements file is very important. Make sure it pins the right versions of the libraries used by this script:
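As an illustration, a requirements.txt for the prediction app might look like the following. The package names match the script's imports (scikit-learn is needed to unpickle the model), but the version pins are placeholders, so substitute the versions you actually used:
streamlit==1.31.0
numpy==1.26.4
scikit-learn==1.4.0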
Here is the app: https://customerchrunprediction-59.streamlit.app/. And here is the churn_modelling_app.py Python script in GitHub used to make this app:
import pickle
import streamlit as st
import numpy as np
st.title('Customer Churn Prediction')
# Typing effect that stops at the author's name length and repeats from the beginning
st.markdown(
"""
<style>
.author-title {
font-size: 1.3em;
font-weight: bold;
color: #007acc; /* Color for "Author:" */
white-space: nowrap;
vertical-align: middle; /* Ensures alignment with animated text */
}
.author-name {
font-size: 1.2em;
font-weight: bold;
color: red; /* Color for the author's name */
overflow: hidden;
white-space: nowrap;
border-right: 3px solid;
display: inline-block;
vertical-align: middle; /* Aligns with the static "Author:" text */
animation: typing 5s steps(20, end) infinite, blink-caret 0.75s step-end infinite;
max-width: 10ch; /* Limit width to fit text length */
}
/* Typing effect */
@keyframes typing {
0% { max-width: 0; }
50% { max-width: 30ch; } /* Adjust to match the name's length */
100% { max-width: 0; } /* Reset back to zero */
}
/* Blinking cursor animation for the author's name */
@keyframes blink-caret {
from, to { border-color: transparent; }
50% { border-color: red; }
}
</style>
<p><span class="author-title">Author:</span> <span class="author-name">Mehdi Rezvandehy</span></p>
""",
unsafe_allow_html=True
)
st.write("""""")
st.write(
    """This app is created with [Streamlit](https://streamlit.io/) to predict the likelihood that bank
    customers will churn next cycle. A random forest classifier was trained on the Bank Turnover Dataset from
    [Kaggle](https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling/version/1).
    The model uses 10 inputs (predictors): `Geography`, `CreditScore`,
    `Gender`, `Age`, `Tenure`, `Balance`, `NumOfProducts`, `HasCrCard`, `IsActiveMember`, `EstimatedSalary`."""
)
st.image('DataTable.jpg')
# load the trained random forest classifier
with open("random_forest_churn.pickle", "rb") as rf_pickle:
    rfc = pickle.load(rf_pickle)
st.write('')
col1, col2, col3 = st.columns(3)
col1.subheader("Input Data")
Geography = col1.selectbox("Geography", options=["France", "Germany", "Spain"])
CreditScore = col1.number_input("CreditScore", min_value=300)
Gender = col1.selectbox("Gender", options=['Male', 'Female'])
Age = col1.number_input("Age", min_value=18)
Tenure = col1.number_input("Tenure", min_value=2)
col2.subheader(" ")
col2.subheader(" ")
Balance = col2.number_input("Balance", min_value=500)
NumOfProducts = col2.number_input("NumOfProducts", min_value=1)
HasCrCard = col2.selectbox("HasCrCard", options=[0, 1])
IsActiveMember = col2.selectbox("IsActiveMember", options=[0, 1])
EstimatedSalary = col2.number_input("EstimatedSalary", min_value=1000)
user_inputs = [Geography, CreditScore, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary]
Is_France, Is_Germany, Is_Spain = 0, 0, 0
if Geography == 'France':
    Is_France = 1
elif Geography == 'Germany':
    Is_Germany = 1
elif Geography == 'Spain':
    Is_Spain = 1
if Gender == 'Male':
    Gender = 1
elif Gender == 'Female':
    Gender = 0
# load the scaler used to standardize the numeric features during training
with open("std.pickle", "rb") as std_pickle:
    scaler = pickle.load(std_pickle)
clmn_std = np.array([CreditScore, Age, Tenure, Balance, EstimatedSalary]).reshape(1, 5)
clmn_not_std = np.array([Is_France, Is_Germany, Is_Spain,
Gender, NumOfProducts, HasCrCard, IsActiveMember]).reshape(1, 7)
feat_std = scaler.transform(clmn_std)
to_pred = np.concatenate((feat_std, clmn_not_std), axis=1)
# predict_proba returns [P(class 0), P(class 1)]; assuming the model was trained with
# the dataset's Exited column (1 = churned) as the positive class, churn probability is column 1
y_pred = int(rfc.predict_proba(to_pred)[0][1] * 100)
col3.subheader("Prediction")
col3.write(f"The likelihood of churn for this customer is predicted **{y_pred}**%")
When building and deploying Streamlit apps, you might need to incorporate sensitive information, like a password or API key, that should remain hidden from users. However, Streamlit Community Cloud defaults to public GitHub repositories, where all code, data, and models are visible. If you need to work with a private API key—required by many APIs such as Twitter's scraping API or Google Maps—or if you need to programmatically access data from a password-protected database, or even password-protect your Streamlit app, it’s essential to securely expose this private data to Streamlit. Streamlit addresses this with Streamlit Secrets, allowing you to set hidden, private “secrets” for each app. Let’s begin by setting a password to secure our existing Streamlit application.
First, we can modify the beginning of our app to prompt users to enter a password before loading the rest of the application. Using the st.stop() function, we can halt the app if the password entered is incorrect, as shown in the following code:
import pickle
import streamlit as st
import numpy as np
st.title('Customer Churn Prediction')
password_guess = st.text_input('What is the Password?')
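# Optional refinement (not in the original script): st.text_input can mask the
# password as the user types via its type parameter:
# password_guess = st.text_input('What is the Password?', type='password')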
if password_guess != st.secrets["password"]:
    st.stop()
st.write(
    """This app was created with [streamlit](https://streamlit.io/) to predict the likelihood that bank
customers will churn in the next cycle. A random forest classifier was trained on the Bank Turnover dataset from
[Kaggle](https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling/version/1).
The model uses 10 inputs (predictors): `Geography`, `CreditScore`,
`Gender`, `Age`, `Tenure`, `Balance`, `NumOfProducts`, `HasCrCard`, `IsActiveMember`, `EstimatedSalary`."""
)
st.image('DataTable.jpg')
with open("random_forest_churn.pickle", "rb") as rf_pickle:
    rfc = pickle.load(rf_pickle)
This code creates a password-protected app using a Streamlit Secret, which we set as follows:
To create a Streamlit Secret, go to the Streamlit Community Cloud main page at https://share.streamlit.io/ and select the "Edit secrets" option, as illustrated in the following screenshot:
The password for this website is churn-modeling-banking-7968
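In the secrets editor, secrets are written in TOML format. The minimal entry matching the st.secrets["password"] lookup in our code is a single line:
password = "churn-modeling-banking-7968"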
In this section, we’ll explore elements like sidebars, tabs, columns, and colors to enhance our ability to create visually appealing Streamlit applications.
We can split our Streamlit app into multiple columns of varying widths using st.columns(), with each column acting as a distinct container in our app to display text, graphs, images, or any other elements we choose.
In Streamlit, columns are addressed using with notation. This notation creates self-contained code blocks, specifying exactly where to position elements within the app’s layout.
import streamlit as st
st.title("Churn prediction")
st.write(
    """
    This app predicts the likelihood of a customer leaving the business
    """
)
col1, col2, col3 = st.columns(3)
with col1:
    st.write("Column 1")
with col2:
    st.write("Column 2")
with col3:
    st.write("Column 3")
st.columns() also accepts a tuple of relative widths, so we can let the user control the column proportions interactively:
import streamlit as st
f_width = st.number_input('First Width', min_value=5, value=5)
s_width = st.number_input('Second Width', min_value=5, value=5)
t_width = st.number_input('Third Width', min_value=5, value=5)
col1, col2, col3 = st.columns((f_width, s_width, t_width))
with col1:
    st.write('First column')
with col2:
    st.write('Second column')
with col3:
    st.write('Third column')
Columns can hold more than just text; the next example plots a histogram of a user-selected variable in each of three columns:
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
st.title('Customer Churn')
st.write(
    """This app analyzes the user inputs `Geography`, `CreditScore`,
`Gender`, `Age`, `Tenure`, `Balance`, `NumOfProducts`, `HasCrCard`, `IsActiveMember`, `EstimatedSalary`"""
)
df = pd.read_csv('Churn_Modelling.csv')
col1, col2, col3 = st.columns(3)
col1.subheader("Input 1")
selected_var1 = col1.selectbox('What do you want the x variable to be?',
                               ["Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary"],
                               key="selected_var1")
with col1:
    col1.write(df[selected_var1].head())
    # use a separate figure per column so the histograms do not stack up
    fig1, ax1 = plt.subplots()
    ax1.hist(df[selected_var1])
    ax1.set_xlabel(selected_var1)
    col1.pyplot(fig1)
col2.subheader("Input 2")
selected_var2 = col2.selectbox('What do you want the x variable to be?',
                               ["Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary"],
                               key="selected_var2")
with col2:
    col2.write(df[selected_var2].head())
    fig2, ax2 = plt.subplots()
    ax2.hist(df[selected_var2])
    ax2.set_xlabel(selected_var2)
    col2.pyplot(fig2)
col3.subheader("Input 3")
selected_var3 = col3.selectbox('What do you want the x variable to be?',
                               ["Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary"],
                               key="selected_var3")
with col3:
    col3.write(df[selected_var3].head())
    fig3, ax3 = plt.subplots()
    ax3.hist(df[selected_var3])
    ax3.set_xlabel(selected_var3)
    col3.pyplot(fig3)
Tabs are helpful when content is too wide to fit neatly into columns, even in wide mode, or when you want to focus the user’s attention on a single piece of content at a time.
st.tabs functions similarly to st.columns, but instead of specifying the number of tabs, we provide the tab names and use the familiar with statements to add content to each tab. The following code transforms the columns in our recent Streamlit app into tabs:
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
st.title('Customer Churn')
st.write(
    """This app analyzes the user inputs `Geography`, `CreditScore`,
`Gender`, `Age`, `Tenure`, `Balance`, `NumOfProducts`, `HasCrCard`, `IsActiveMember`, `EstimatedSalary`"""
)
df = pd.read_csv('Churn_Modelling.csv')
tab1, tab2, tab3 = st.tabs(["Input 1", "Input 2", "Input 3"])
tab1.subheader("Input 1")
selected_var1 = tab1.selectbox('What do you want the x variable to be?',
                               ["Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary"],
                               key="selected_var1")
with tab1:
    # one figure per tab so each tab shows only its own histogram
    fig1, ax1 = plt.subplots()
    ax1.hist(df[selected_var1])
    ax1.set_xlabel(selected_var1)
    tab1.pyplot(fig1)
tab2.subheader("Input 2")
selected_var2 = tab2.selectbox('What do you want the x variable to be?',
                               ["Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary"],
                               key="selected_var2")
with tab2:
    fig2, ax2 = plt.subplots()
    ax2.hist(df[selected_var2])
    ax2.set_xlabel(selected_var2)
    tab2.pyplot(fig2)
tab3.subheader("Input 3")
selected_var3 = tab3.selectbox('What do you want the x variable to be?',
                               ["Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary"],
                               key="selected_var3")
with tab3:
    fig3, ax3 = plt.subplots()
    ax3.hist(df[selected_var3])
    ax3.set_xlabel(selected_var3)
    tab3.pyplot(fig3)
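A side note on the wide mode mentioned above: Streamlit apps default to a centered, fixed-width layout, but you can opt into the full browser width with st.set_page_config, which must be the first Streamlit command in the script:
import streamlit as st
st.set_page_config(layout="wide")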
We might want to give users the ability to alter or edit the underlying data in a very user-friendly way. To help solve this, Streamlit released st.data_editor. See the code below:
import pandas as pd
import streamlit as st
import matplotlib.pyplot as plt
st.title("Customer Churn")
st.write(
    """This app analyzes the user inputs `Geography`, `CreditScore`,
`Gender`, `Age`, `Tenure`, `Balance`, `NumOfProducts`, `HasCrCard`, `IsActiveMember`, `EstimatedSalary`"""
)
df = pd.read_csv('Churn_Modelling.csv')
# st.data_editor displays the DataFrame as an editable grid and returns the edited copy
df = st.data_editor(df)
col1, col2, col3 = st.columns(3)
col1.subheader("Input 1")
selected_var1 = col1.selectbox('What do you want the x variable to be?',
                               ["Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary"],
                               key="selected_var1")
with col1:
    col1.write(df[selected_var1].head())
    fig1, ax1 = plt.subplots()
    ax1.hist(df[selected_var1])
    ax1.set_xlabel(selected_var1)
    col1.pyplot(fig1)
col2.subheader("Input 2")
selected_var2 = col2.selectbox('What do you want the x variable to be?',
                               ["Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary"],
                               key="selected_var2")
with col2:
    col2.write(df[selected_var2].head())
    fig2, ax2 = plt.subplots()
    ax2.hist(df[selected_var2])
    ax2.set_xlabel(selected_var2)
    col2.pyplot(fig2)
col3.subheader("Input 3")
selected_var3 = col3.selectbox('What do you want the x variable to be?',
                               ["Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary"],
                               key="selected_var3")
with col3:
    col3.write(df[selected_var3].head())
    fig3, ax3 = plt.subplots()
    ax3.hist(df[selected_var3])
    ax3.set_xlabel(selected_var3)
    col3.pyplot(fig3)
if st.button("Save data and overwrite:"):
    df.to_csv("data.csv", index=False)
    st.write("Saved!")
This section highlights community-driven development using Streamlit Components. When building Streamlit, the team developed a structured way, called Components, for other developers to create additional features on top of the existing open-source Streamlit framework. Streamlit Components give developers the flexibility to create tools that are essential to their workflows or just enjoyable and innovative.
As Streamlit’s popularity as a framework has grown, so has the range of its Components. It seems like there’s always a new, interesting component to explore and try in my own apps! This section will focus on discovering and using community-made Streamlit Components:
streamlit-aggrid
streamlit-plotly-events
streamlit-lottie
pandas-profiling
st-folium
streamlit-extras
We've already explored a few methods for displaying DataFrames in our Streamlit apps, like the built-in st.write and st.dataframe functions. However, streamlit-aggrid takes it a step further by providing a visually appealing, interactive, and editable version of st.dataframe. This library is built on top of the JavaScript framework AgGrid (https://www.ag-grid.com/).
The best way to get familiar with streamlit-aggrid is to dive in and give it a try! Let’s start with an example using the churn dataset, aiming to create an interactive and editable DataFrame, something AgGrid excels at.
import pandas as pd
import streamlit as st
from st_aggrid import AgGrid
st.title("Streamlit Churn Example")
df = pd.read_csv("Churn.csv")
AgGrid(df)
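AgGrid(df) renders a read-only grid by default. Below is a short sketch of enabling cell editing with the GridOptionsBuilder helper that ships with streamlit-aggrid (usage as I understand the library's API; check the docs for your installed version):
from st_aggrid import AgGrid, GridOptionsBuilder
gb = GridOptionsBuilder.from_dataframe(df)
gb.configure_default_column(editable=True)  # allow in-place cell edits
grid_response = AgGrid(df, gridOptions=gb.build())
edited_df = grid_response["data"]  # the grid returns the (possibly edited) data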
We previously discussed deploying Streamlit apps with Streamlit Community Cloud, learning how to make deployment quick, easy, and effective for most applications. However, Streamlit Community Cloud has limitations, such as a maximum of 1 GB RAM per app, making it unsuitable for resource-intensive applications.
This brings us to an alternative approach: integrating Streamlit with Snowflake. The paid Streamlit version is now part of the Snowflake ecosystem, which may initially seem restrictive. However, Snowflake’s popularity and performance can be advantageous, especially if your organization already utilizes Snowflake. Beyond Snowflake, we will also look at two other deployment options, Heroku and Hugging Face; their prerequisites are listed below.
The prerequisites for Heroku deployment are:
Heroku account: Heroku is a widely used platform for hosting applications, models, and APIs, and is owned by Salesforce. To create a free account, go to https://signup.heroku.com.
Heroku Command-Line Interface (CLI): The Heroku CLI is essential for running Heroku commands. Download it by following the instructions at https://devcenter.heroku.com/articles/heroku-cli.
The prerequisite for Hugging Face deployment is a Hugging Face account, which you can create for free at https://huggingface.co/join.
At a high level, deploying a Streamlit application for internet users to access essentially means renting a remote computer and instructing it to launch your app. Deciding which platform to use can be challenging without experience in system deployment or testing each option firsthand, but a few general guidelines can help.
The two primary considerations in this choice are system flexibility and setup time. These factors are often inversely related. With Streamlit Community Cloud, for instance, you can’t specify “Run this on GPUs with 30 GiB of memory,” but in return, you get a streamlined process—simply point Streamlit Community Cloud to your GitHub repository, and it handles the setup. In contrast, Hugging Face and Heroku offer more flexibility via paid plans but require a bit more initial configuration (as you’ll see!).
In short, if you’re already using a platform (such as Snowflake, Hugging Face, or Heroku), it’s best to continue with that. If you’re not yet on a platform or are a hobbyist programmer, Streamlit Community Cloud is an excellent choice.
For applications needing higher computational power in machine learning or NLP, Hugging Face is ideal. If you need extensive compute resources on a versatile platform with many integrations, Heroku is a great choice.
Let’s dive into setting up with Hugging Face!
Hugging Face provides a comprehensive suite of tools tailored to machine learning, widely favored by machine learning engineers and NLP professionals. It enables developers to easily access pre-trained models via the transformers library (which we’ve already used!) and also supports hosting custom models, datasets, and even data apps through its feature called Hugging Face Spaces. A Space is essentially a place to deploy an app on Hugging Face’s infrastructure, making it simple to get started.
To begin, visit https://huggingface.co/spaces and click on the "Create new Space" button.
Once logged in, you'll see several setup options. Here, you can name your Space, select a license, pick the Space type (Gradio is another popular data app tool owned by Hugging Face), choose the hardware (both free and paid options are available), and decide whether to make the Space public or private. The screenshot below shows the options I've selected (you can choose any name for your Space, but the remaining settings should align with these).
Now, click the Create Space button at the bottom of the page (the Space creation form lives at https://huggingface.co/new-space). Once the Space is created, clone it to your personal computer using the following Git command:
git clone https://huggingface.co/spaces/mehdi59/new-churn-modeling
Create your Streamlit app.py file, then commit and push it:
git add app.py
git commit -m "Add application file"
git push
Note that you must create a new Hugging Face access token to use as your Git credential. You may also need to add the token to the remote URL in .git/config inside your repo; see https://discuss.huggingface.co/t/cant-push-to-new-space/35319.
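A sketch of that workaround, with <user> and <token> as placeholders for your Hugging Face username and access token:
git remote set-url origin https://<user>:<token>@huggingface.co/spaces/mehdi59/new-churn-modeling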
If we go back to our code and look at the README.md file, we will notice that there are a bunch of useful configuration options, such as changing the emoji or the title. Hugging Face also allows us to specify other parameters, like the Python version. The full documentation is linked in your README.md.
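For reference, the front matter at the top of a Streamlit Space's README.md looks roughly like this (the values shown are illustrative for our Space):
---
title: New Churn Modeling
emoji: 📊
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.10.0
app_file: app.py
pinned: false
---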
Here is the deployed app on Hugging Face: https://huggingface.co/spaces/mehdi59/new-churn-modeling
And that is it for deploying Streamlit apps on Hugging Face!
One drawback you might notice with deploying on Hugging Face Spaces is that it involves a few more setup steps than Streamlit Community Cloud, and Hugging Face's branding occupies a significant portion of the app’s layout. This is understandable, as Hugging Face aims to ensure that users recognize the app as hosted on their platform. For those already using Hugging Face, this branding can be beneficial, as it allows them to easily clone the Space and explore popular models. However, when sharing apps with non-ML colleagues or friends, this branding may detract from the viewing experience.
Another limitation of Hugging Face Spaces is that they often lag behind in supporting the latest Streamlit versions. Currently, Spaces are using Streamlit 1.10.0, while the latest release is version 1.16.0. So, if you’re looking for the latest Streamlit features, Spaces might not yet support them. Although this won’t impact most Streamlit apps, it’s something to consider when selecting a deployment platform.