archive/
folder contains examples that work with older versions of hillmaker.
In this notebook we'll focus on basic use of hillmaker for analyzing arrivals, departures, and occupancy by time of day and day of week for a typical discrete entity flow system. A few examples of such systems include:
Basically, any sort of discrete stock and flow system for which you are interested in time of day and day of week specific statistical summaries of occupancy, arrivals and departures, and have raw data on the arrival and departure times, is fair game for hillmaker.
Patients flow through a short stay unit for a variety of procedures, tests or therapies. Let's assume patients can be classified into one of five categories of patient types: ART (arterialgram), CAT (post cardiac-cath), MYE (myelogram), IVT (IV therapy), and OTH (other). From one of our hospital information systems we were able to get raw data about the entry and exit times of each patient and exported the data to a csv file. We call each row of such data a stop (as in, the patient stopped here for a while). Let's take a peek at the data.
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
from IPython.display import Image
ssu_stopdata = '../data/ShortStay.csv'
stops_df = pd.read_csv(ssu_stopdata, parse_dates=['InRoomTS','OutRoomTS'])
stops_df.info() # Check out the structure of the resulting DataFrame
stops_df.head()
As part of an operational analysis we would like to compute a number of relevant statistics, such as:
In addition to tabular summaries, plots are needed. Like this:
Image(filename="images/ssu-occ.png")
Hillmaker was designed for precisely this type of problem. In fact, the very first version of hillmaker was written for analyzing an SSU when the author was an undergraduate interning at a large health care system. That very first version was written in BASIC on a DECwriter!
Image(filename="images/DECwriter,_Tektronix,_PDP-11_(192826605).jpg")
Source: By Wolfgang Stief from Tittmoning, Germany - DECwriter, Tektronix, PDP-11, CC0, https://commons.wikimedia.org/w/index.php?curid=105322423
Over the years, hillmaker was migrated to FoxPro, and then to MS Access where it lived for many years. In 2016, I moved it to Python.
Version 0.4.3 was just released and is available on PyPI or source from https://github.com/misken/hillmaker. This version is much faster than previous versions (thank you numpy) and includes a CLI that also supports TOML formatted config files, flow conservation checks and better logging. It does however revert back to only allowing a single category field (multiple category fields can easily be handled by constructing composite category strings). You can read more about this latest release of hillmaker at https://misken.github.io/blog/hillmaker-python-released/. It's free and open source.
The new hillmaker is implemented as an importable Python module and as a runnable script with a simple CLI. This new version of hillmaker is still in what I'd call a pre-release state. The output does match the Access version for the ShortStay database that I included in the original Hillmaker. I've been actively using it to process thousands of simulation output log files as part of a research project on OB patient flow. More testing is needed before I release it as version 1.0, but it does appear to be doing its primary job correctly. Please open an issue on GitHub if you think it's computing something incorrectly. Before using for any real project work, you should do your own testing to confirm that it is working appropriately for your needs. Use at your own risk - see LICENSE file in GitHub.
Over the years, I (and many others) have used the old Hillmaker in a variety of ways, including:
I'd like users to be able to use the new Python based version in a number of different ways as well. As I'll show in this Jupyter notebook, it can be used by importing the hillmaker
module and then calling hillmaker functions via:
While these two options provide tons of flexibility for power users, I have also added a CLI. The CLI is demo'd in this notebook as well.
This is uncharted territory for me. Python has a number of frameworks/toolkits for creating GUI apps. This is not the highest priority for me but I do plan on creating a GUI for hillmaker.
Whereas the old Hillmaker required MS Access, the new one requires an installation of Python 3.7+ along with several Python modules that are widely used for analytics and data science work. The free and open source Anaconda Distribution of Python is a great way to get started with Python for analytics work. It is available on all platforms. Once you've got a working version of Python, you can install into a Python (or Conda) virtual environment.
Option 1 - standard Python virtual environment
$ pip install hillmaker
Option 2 - conda Python virtual environment
$ ~/anaconda3/envs/<conda venv name>/bin/python -m pip install hillmaker
The reason for the more convoluted install command when using conda virtual environments is that it turns out to be rather tricky to properly pip install something into a conda virtual environment. This is the source of much confusion and much discussion on StackOverflow. It boils down to making sure that you are using the pip
executable in your conda virtual environment and NOT in your base conda environment. This SO post explains how the use of the -m
flag helps in this regard. However, this post doesn't appear to mention that the same issue applies to the python
executable itself. Again, you MUST make sure you are using the python
executable in your conda virtual environment and not in your conda base environment. If you use Option 2 above, you'll be able to run the hillmaker
script and its CLI and you'll also be able to import it if you wish to use hillmaker that way.
Obviously if you are comfortable working with source code, you can also install hillmaker
from its GitHub repo or a clone/fork of it.
To run hillmaker we only need to import a few modules. Since the main hillmaker function uses pandas DataFrames for both data input and output, we need to import pandas
in addition to hillmaker
.
#import pandas as pd
import hillmaker as hm
Here's the first few lines from our csv file containing the patient stop data:
PatID,InRoomTS,OutRoomTS,PatType
1,1/1/1996 7:44,1/1/1996 8:50,IVT
2,1/1/1996 8:28,1/1/1996 9:20,IVT
3,1/1/1996 11:44,1/1/1996 13:30,MYE
4,1/1/1996 11:51,1/1/1996 12:55,CAT
5,1/1/1996 12:10,1/1/1996 13:00,IVT
6,1/1/1996 14:16,1/1/1996 15:35,IVT
7,1/1/1996 14:40,1/1/1996 15:25,IVT
We have already read this data into a pandas DataFrame named stops_df
. Each record is a "stop" at the SSU.
Check out the top and bottom of stops_df
.
stops_df.head(7)
stops_df.tail(7)
Let's compute some basic summary statistics such as the earliest and latest arrival and departure as well as counts by patient type.
print(f'Earliest arrival = {stops_df["InRoomTS"].min()}')
print(f'Latest departure = {stops_df["OutRoomTS"].max()}')
stops_df.groupby(['PatType'])['PatID'].count()
Let's get a sense of the number of patient visits by month.
stops_df['InRoomTS'].groupby(stops_df.InRoomTS.dt.to_period("M")).agg('count')
You probably want to do some length of stay analysis, so let's compute it in hours and then do describe
by patient type. The plan is to add functions to hillmaker to do length of stay analysis automatically. In addition to statistical summaries as shown below, it would be nice to have histograms.
stops_df['LOS'] = (stops_df['OutRoomTS'] - stops_df['InRoomTS']) / pd.Timedelta(1, "h")
stops_df.groupby(['PatType'])['LOS'].describe()
The primary function in hillmaker is called make_hills
and plays the same role as the Hillmaker
function in the original Access VBA version of Hillmaker. Let's get a little help on this function.
help(hm.make_hills)
Most of the parameters are similar to those in the original VBA version, though a few new ones have been added. For example, the cat_to_exclude
parameter allows you to specify a list of category values for which you do not want occupancy statistics computed. Also, since the VBA version used an Access database as the container for its output, new parameters were added to control output to csv files instead.
Let's create variables for the required inputs as well as a few of the optional inputs.
scenario
value gets used in creating output filenames. in_field_name
, out_field_name
and cat_field_name
are all string variables used to specifiy the arrival time field, departure time field and patient category field, respectively.start_date
and end_date
are also strings that are capabile of being converted to a pandas Timestamp. There is no need to include times as hillmaker will include the entire end date automatically. You need to be careful when specifying the start and end dates for your analysis (we call this the analysis range). You need to consider horizon and warmup effects. In the SSU example, each stop is just a few hours and there aren't any patients who arrive before 1/1/1996 are are still in the SSU on 1/1/1996. However, if we were working with data in which the stops are a few days in length (such as on an inpatient nursing unit), we need to think about what start date we should use and exactly how the original dataset was extracted. Hillmaker is completely capable of properly accounting for patients who arrive before the specified start date for the analysis as well as those who are discharged after the end date. However, it can only work with the stop data provided and you are responsible for considering warmup effects - a transient phase as occupancy builds to some stochastic steady state. Assume you know that you have stop data that was extracted, say, to include all patients discharged between 1/1/2021 and 12/30/2021 and that each stop might last for several days. You wouldn't want to set your hillmaker start date to 1/1/2021 as the system will appear to start out empty and occupancy will have a transient phase until the system fills to some sort of steady state. The longer the length of stay, the longer this warmup phase will take. You might want to experiment with start dates ranging from a few weeks to a few months after your earliest arrival time in your hillmaker stop data to see how long the system takes to reach a steady state. Similarly, if your criteria for selectng the stop data was discharges in 1/1/2021-12/30/2021, your data will not contain records for those patients admitted before 12/30/2021 but discharged after 12/30/2021. So, you might want to set your end date for hillmaker to be a few weeks before 12/20/2021.
For our SSU data, we don't need to worry about this as patients only stay a few hours and the SSU typically only houses patients between ~6am-10pm.
# Required inputs
scenario = 'ss_example_1'
in_field_name = 'InRoomTS'
out_field_name = 'OutRoomTS'
cat_field_name = 'PatType'
start_date = '1/1/1996'
end_date = '9/30/1996'
# Optional inputs
verbosity = 1 # INFO level logging
output_path = './output'
bin_size_minutes = 60
Now we'll call the main make_hills
function. In addition to capturing the return values we will also take the default behavior of having the summaries exported to csv files. You'll see that the filenames will contain the scenario string.
results_ex1 = hm.make_hills(scenario, stops_df, in_field_name, out_field_name,
start_date, end_date, cat_field_name, verbosity=verbosity, output_path=output_path)
There are two main types of output dataframes:
The results dictionary is organized with these two types as the highest level keys.
results_ex1.keys()
Digging into the bydatetime
keys reveals that there is one DataFrame that is by datetime by category and one datetime DataFrame for the totals over all the category values
results_ex1['bydatetime'].keys()
results_ex1['bydatetime']['PatType_datetime'].info()
results_ex1['bydatetime']['PatType_datetime'].head()
results_ex1['bydatetime']['PatType_datetime'].tail()
results_ex1['bydatetime']['PatType_datetime'].iloc[23:40]
Note the multi-index with one level for PatType
and one level for the datetime
, a Timestamp
. Each row is one specific hour on a specific date in the analysis range. While arrivals
and departures
are integer values, occupancy
can be fractional since patients may arrive and depart anywhere within a time bin and are given "occupancy credit" for the fraction of time they are present during that timebin.
The other datetime DataFrame represents the overall totals (over all patient types). Notice that now the index is just a simple DateTimeIndex
.
results_ex1['bydatetime']['datetime'].info()
results_ex1['bydatetime']['datetime'].iloc[23:40]
Creating a time series plot of occupancy over the entire date range is a good check that things make sense overall.
# Create a Figure and Axes object
fig1 = plt.figure()
ax1 = fig1.add_subplot(1,1,1)
# Use datetime index as the X-axis values
timestamps = results_ex1['bydatetime']['datetime'].index
# Create series to plot
occupancy = results_ex1['bydatetime']['datetime']['occupancy']
# Styling of bars, lines, plot area
# Style the bars for mean occupancy
bar_color = 'grey'
bar_opacity = 0.8
# Set the background color of the plot. Argument is a string float in
# (0,1) representing greyscale (0=black, 1=white)
ax1.patch.set_facecolor('0.95')
# Add data to the plot
# Occupancy as bars
ax1.bar(timestamps, occupancy, color=bar_color, alpha=bar_opacity, label='Occupancy')
# Set plot and axis titles
fig1.suptitle('Occupancy by hourly time bins', fontsize=28, fontweight='bold')
ax1.set_title('All patient types', fontsize=24)
ax1.set_xlabel('Time Bin', fontsize=14)
ax1.set_ylabel('Patients', fontsize=14)
# Gridlines
ax1.grid(True, color='k')
# Legend
leg = ax1.legend(loc='best', frameon=True, fontsize=20)
leg.get_frame().set_facecolor('white')
# Plot size
fig1.set_size_inches(16,12)
Now let's dig into the summary dataframes. There are more of these as each metric - arrivals, departure, and occupancy, gets their own DataFrames. In addition, DataFrames are created for various groupings:
The first two items are referred to as nonstationary summaries since the statistics are by day of week and time of day. The latter two items are stationary summaries as they are simply overall, time independent, statistics.
results_ex1['summaries'].keys()
results_ex1['summaries']['nonstationary'].keys()
Within each of these is a dictionary whose keys are flow metrics. To look at overall statistics (i.e. not by PatType
) use the 'dow_binofday'
key.
results_ex1['summaries']['nonstationary']['dow_binofday'].keys()
Each of these three DataFrames has the same structure. We will focus on occupancy as it is the reason we are here. Notice that there are 168 rows in this dataframe, corresponding to the hours of the week. Of course, you can choose a different time bin size when running hillmaker. The multi-index contains day of week and time of bin of day levels in both integer string formats to facilitate plotting.
Most of the column names are pretty self-explanatory:
count
- number of occurances of this datetime bin in the analysis rangemean
- mean occupancy in this datetime binmin
- minimum occupancy in this datetime binmax
- maximum occupancy in this datetime binstdev
- standard deviation occupancy in this datetime binsem
- standard error of mean occupancy in this datetime bin (stdev / sqrt(mean)
)var
- variance of occupancy in this datetime bin (stdev ** 2
)cv
- coefficient of variation of occupancy in this datetime bin (stdev / mean
)skew
- skew of occupancy in this datetime binkurtosis
- kurtosis of occupancy in this datetime binp25
- 25th percentile of occupancy in this datetime binp50
- 50th percentile (median) of occupancy in this datetime binp75
- 75th percentile of occupancy in this datetime binp95
- 95th percentile of occupancy in this datetime binp99
- 99th percentile of occupancy in this datetime binYou can specify which percentiles are computed through the percentiles
argument of the make_hills
function. Those shown above are the default percentiles that are computed.
results_ex1['summaries']['nonstationary']['dow_binofday']['occupancy'].info()
results_ex1['summaries']['nonstationary']['dow_binofday']['occupancy'].head(10)
Using the occupancy summary DataFrame, occupancy plots can be created. Here's a matplotlib approach.
# Create a Figure and Axes object
fig1 = plt.figure()
ax1 = fig1.add_subplot(1,1,1)
# Create a list to use as the X-axis values
num_bins = 7 * 1440 / bin_size_minutes
base_date_for_first_dow = '01/05/2015' # Pick any date with associated DOW you want to appear first on plot
timestamps = pd.date_range(base_date_for_first_dow, periods=num_bins, freq=f'{bin_size_minutes}Min').tolist()
# Choose appropriate major and minor tick locations
major_tick_locations = pd.date_range(f'{base_date_for_first_dow} 12:00:00', periods=7, freq='24H').tolist()
minor_tick_locations = pd.date_range(f'{base_date_for_first_dow} 06:00:00', periods=42, freq='4H').tolist()
# Set the tick locations for the axes object
ax1.set_xticks(major_tick_locations)
ax1.set_xticks(minor_tick_locations, minor=True)
ax1.tick_params(which='both', direction='in')
# Specify the mean occupancy and percentile values
mean_occ = results_ex1['summaries']['nonstationary']['dow_binofday']['occupancy']['mean']
pctile_occ = results_ex1['summaries']['nonstationary']['dow_binofday']['occupancy']['p95']
# Styling of bars, lines, plot area
# Style the bars for mean occupancy
bar_color = 'grey'
bar_opacity = 0.8
# Style the line for the occupancy percentile
pctile_line_style = '-'
pctile_color = 'black'
pctile_line_width = 1
# Set the background color of the plot. Argument is a string float in
# (0,1) representing greyscale (0=black, 1=white)
ax1.patch.set_facecolor('0.95')
# Add data to the plot
# Mean occupancy as bars - here's the GOTCHA involving the bar width
bar_width = 1 / (1440 / bin_size_minutes)
ax1.bar(timestamps, mean_occ, color=bar_color, alpha=bar_opacity, label='Mean occupancy', width=bar_width)
# Some percentile as a line
ax1.plot(timestamps, pctile_occ, linestyle=pctile_line_style, linewidth=pctile_line_width, color=pctile_color, \
label='95th %ile occupancy')
# Create formatter variables
dayofweek_formatter = DateFormatter('%a')
qtrday_formatter = DateFormatter('%H')
# Format the tick labels
ax1.xaxis.set_major_formatter(dayofweek_formatter)
ax1.xaxis.set_minor_formatter(qtrday_formatter)
# Slide the major tick labels underneath the default location by 20 points
ax1.tick_params(which='major', pad=25)
# Add other chart elements
# Set plot and axis titles
fig1.suptitle('SSU Occupancy by Time of Day and Day of Week', fontsize=28, fontweight='bold')
ax1.set_title('All patient types', fontsize=24)
ax1.set_xlabel('Time Bin of Week', fontsize=14)
ax1.set_ylabel('Patients', fontsize=14)
# Gridlines
ax1.grid(True, color='k')
# Legend
leg = ax1.legend(loc='best', frameon=True, fontsize=20)
leg.get_frame().set_facecolor('white')
# Plot size
fig1.set_size_inches(16,12)
Detailed plots by patient type can be created from the occupancy dataframe associated with the 'PatType_dow_binofday'
key.
results_ex1['summaries']['nonstationary']['PatType_dow_binofday']['occupancy'].head(10)
Each dataframe in the results dictionary can be exported to a csv file using the export_bydatetime_csv
and export_summaries_csv
arguments.
# *nix
!ls ./output/ss_example_1_*.csv
# Windows
# !dir .\output\ss_example_1_*.csv
Those csv files (and associated dataframes) not having 'dow_binofday'
in their name are stationary summaries.
# *nix
!cat ./output/ss_example_1_occupancy.csv
# Windows
# !type .\output\ss_example_1_occupancy.csv
# *nix
!cat ./output/ss_example_1_occupancy_PatType.csv
# Windows
# !type .\output\ss_example_1_occupancy_PatType.csv
For the stationary summaries, the count
value of 6576 represents the total number of hourly time bins in the analysis range.
num_days = (pd.Timestamp(end_date) - pd.Timestamp(start_date)) / pd.Timedelta(1, "d") + 1
num_total_bins = num_days * 24
print(f'Total number of hourly bins in analysis range: {num_total_bins:.0f}')
Instead of importing hillmaker and calling the make_hills
function, you can also use its CLI. Not every input argument to make_hills
is exposed in the CLI.
!hillmaker -h
!hillmaker --scenario cli_test_ssu --stop_data_csv ../data/ShortStay.csv --in_field InRoomTS --out_field OutRoomTS --start_analysis_dt 1996-01-01 --end_analysis_dt 1996-09-30 --cat_field PatType --verbosity 1 --output_path output
Instead of using all those named command line arguments, you can also use a simple configuration file. I decided to use TOML for the configuration file format as it's very readable and only a quite simple configuration file is needed. Furthermore, TOML has been becoming more widely used in the Python community and a TOML parser will become part of the standard library in Python 3.11. For now, we can use the well known tomli
library. Here's what the config file looks like. You'll notice it looks a bit like an INI file. See https://toml.io/en/ for more info on TOML.
!cat ../ssu_example.toml
It's just a simple way of grouping the command line arguments and specifying their values.
!hillmaker --config ../ssu_example.toml
I will be creating additional examples to illustrate the capabilities of hillmaker. I will also write some actual documentation.