Author Response:
Reviewer #1 (Public Review):
This paper is of potential interest to researchers performing animal behavioral quantification with
computer vision tools. The manuscript introduces 'BehaviorDEPOT', a MATLAB application and GUI intended
to facilitate quantification and analysis of freezing behavior from behavior movies, along with several other
classifiers based on movement statistics calculated from animal pose data. The paper describes how the tool
can be applied to several specific types of experiments, and emphasizes the ease of use - particularly for
groups without experience in coding or behavioral quantification. While these aims are laudable, and the
software is relatively easy to use, further improvements to make the tool more automated would substantially
broaden the likely user base.
In this manuscript, the authors introduce a new piece of software, BehaviorDEPOT, that aims to serve as an
open source classifier in service of standard lab-based behavioral assays. The key arguments the authors
make are that 1) the open source code allows for freely available access, 2) the code doesn't require any
coding knowledge to build new classifiers, 3) it is generalizable to behaviors other than freezing and to other
species (although this latter point is not shown), 4) that it uses posture-based tracking
that allows for higher resolution than centroid-based methods, and 5) that it is possible to isolate features used
in the classifiers. While these aims are laudable, and the software is indeed relatively easy to use, I am not
convinced that the method represents a large conceptual advance or would be highly used outside the rodent
freezing community.
Major points:
1) I'm not convinced by one of the key arguments the authors make - that the limb tracking produces
qualitatively/quantitatively better results than centroid/orientation tracking alone for the tasks they measure. For
example, angular velocities could be used to identify head movements. It would be good to test this with their
data (could you build a classifier using only the position/velocity/angular velocities of the main axis of the
body?)
2) This brings me to the point that the previous state-of-the-art open-source methodology, JAABA, is barely
mentioned, and I think that a more direct comparison is warranted, especially since this method has been
widely used/cited and is also aimed at a non-coding audience.
Here we address points 1 and 2 together. JAABA has been widely adopted by the Drosophila community with
great success. However, we noticed that fewer studies use JAABA to study rodents. The ones that did typically
examined social behaviors or gross locomotion, usually in an empty arena such as an open field or a standard
homecage. In a study of mice performing reaching/grasping tasks against complex backgrounds, investigators
modified the inner workings of JAABA to classify behavior (Sauerbrei et al., 2020), an approach that is largely
inaccessible to inexperienced coders. This suggested to us that it may be challenging to implement JAABA for
many rodent behavioral assays.
We directly compared BehaviorDEPOT to JAABA and determined that BehaviorDEPOT outperforms JAABA in
several ways. First, we used MoTr and Ctrax (the open-source centroid tracking software packages that are
typically used with JAABA) to track animals in videos we had recorded previously. Both MoTr and Ctrax could
fit ellipses to mice in an open field, in which the mouse is small relative to the environment and runs against a
clean white background. However, consistent with previous reports (Geuther et al., Comm. Bio, 2019), MoTr
and Ctrax performed poorly when rodents were in fear conditioning chambers, which have high-contrast bars on
the floor (Fig. 10A–C). These tracking-related hurdles may explain, at least in part, why relatively few rodent
studies have employed JAABA.
We next tried to import our DeepLabCut (DLC) tracking data into JAABA. The JAABA website instructs users
to employ Animal Part Tracker (https://kristinbranson.github.io/APT/) to convert DLC outputs into a format that
is compatible with JAABA. We discovered that APT was not compatible with the current version of DLC, an
insurmountable hurdle for labs with limited coding expertise. We wrote our own code to estimate a centroid
from DLC keypoints and fed the data into JAABA to train a freezing classifier. Even when we gave JAABA
more training data than we used to develop BehaviorDEPOT classifiers (6 videos vs. 3 videos),
BehaviorDEPOT achieved higher Recall and F1 scores (Fig. 10D).
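For illustration, a minimal MATLAB sketch of this kind of conversion step is shown below; the file name and the likelihood-weighted averaging are assumptions for the sketch rather than the exact script we used, but they convey the idea of collapsing DLC keypoints to a single centroid per frame.
    % Minimal sketch (an illustrative assumption, not the exact script we used) of
    % estimating a per-frame centroid from DeepLabCut keypoints via a
    % likelihood-weighted average. Assumes a standard single-animal DLC CSV:
    % three header rows, then one row per frame of [index, x, y, likelihood, ...].
    T = readmatrix('dlc_output.csv', 'NumHeaderLines', 3);  % file name is a placeholder
    x = T(:, 2:3:end);                                       % x columns for each keypoint
    y = T(:, 3:3:end);                                       % y columns
    p = T(:, 4:3:end);                                       % DLC likelihood columns
    w = p ./ sum(p, 2);                                      % normalize likelihoods to per-frame weights
    centroid = [sum(w .* x, 2), sum(w .* y, 2)];             % nFrames x 2 centroid estimate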
In response to point 1, we also trained a VTE classifier with JAABA. When we tested its performance on a
separate set of test videos, JAABA could not distinguish VTE vs. non-VTE trials. It labeled every trial as
containing VTE (Fig. 10E), indicating that a fitted ellipse is not sufficient to detect fine angular head
movements. JAABA has additional limitations as well. For instance, JAABA reports the occurrence of behavior
in a video timeseries but does not allow researchers to analyze the results of experiments. BehaviorDEPOT
shares features of programs like Ethovision or ANYmaze in that it can classify behaviors and also report their
occurrence with reference to spatial and temporal cues. These direct comparisons address some of the key
concerns centered on the advances BehaviorDEPOT offers beyond JAABA. They also highlight the need
for new behavioral analysis software targeted towards a non-coding audience, particularly in the rodent domain.
3) Remaining on JAABA: while the authors' classification approach appeared to depend mostly on a relatively
small number of features, JAABA uses boosting to build a very good classifier out of many not-so-good
classifiers. This approach is well-worn in machine learning and has been used to good effect in high-throughput behavioral data. I would like the authors to comment on why they decided on the classification
strategy they have.
We built algorithmic classifiers around keypoint tracking because of the accuracy, flexibility, and speed it affords.
Like many behavior classification programs, JAABA relies on tracking algorithms that use background
subtraction (MoTr) or pattern classifiers (Ctrax) to segment animals from the environment and then abstract
their position to an ellipse. These methods are highly sensitive to changes in the experimental arena and cannot
resolve fine movement of individual body parts (Geuther et al., Comm. Bio, 2019; Pennington et al., Sci. Rep.
2019; Fig. 10A). Keypoint tracking is more accurate and less sensitive to environmental changes. Models can
be trained to detect animals in any environment, so researchers can analyze videos they have already
collected. Any set of body parts can be tracked and fine movements such as head turns can be easily resolved
(Fig. 10E).
Keypoint tracking can be used to simultaneously track the location of animals and classify a wide range of
behaviors. Integrated spatial-behavioral analysis is relevant to many assays including fear conditioning,
avoidance, T-mazes (decision making), Y-mazes (working memory), open field (anxiety, locomotion), elevated
plus maze (anxiety), novel object exploration, and social memory. Quantifying behaviors in these assays
requires analysis of fine movements (we now show Novel Object Exploration, Fig. 5, and VTE, Fig. 6, as
examples). These behaviors have been carefully defined by expert researchers. Algorithmic classifiers can be
created quickly and intuitively based on small amounts of video data (Table 4) and easily tweaked for
out-of-sample data (Fig. 9). Additional rounds of machine learning are time-consuming, computationally intensive,
and unnecessary, and we show in Figure 10 that JAABA classifiers have higher error rates than
BehaviorDEPOT classifiers, even when provided with a larger set of training data. Moreover, while JAABA
reports behaviors in video timeseries, BehaviorDEPOT has integrated features that report behavior occurring
at the intersection of spatial and temporal cues (e.g. ROIs, optogenetics, conditioned cues), so it can also
analyze the results of experiments. The automated, intuitive, and flexible way in which BehaviorDEPOT
classifies and quantifies behavior will propel new discoveries by allowing even inexperienced coders to
capitalize on the richness of their data.
Thank you for raising these questions. We did an extensive rewrite of the intro and discussion to ensure these
important points are clear.
4) I would also like more details on the classifiers the authors used. There is some detail in the main text, but a
specific section in the Methods section is warranted, I believe, for transparency. The same goes for all of the
DLC post-processing steps.
Apologies for the lack of detail. We included much more detail in both the results
and methods sections that describe how each classifier works, how they were developed and validated, and
how the DLC post-processing steps work.
5) It would be good for the authors to compare the Inter-Rater Module to the methods described in the MARS
paper (reference 12 here).
We included some discussion of how the BehaviorDEPOT Inter-Rater Module
compares to MARS.
6) More quantitative discussion about the effect of tracking errors on the classifier would be ideal. No tracking
is perfect, so an end-user will need to know "how good" they need to get the tracking to get the results
presented here.
We included a table detailing the specs of our DLC models and the videos that we used for
validating our classifiers (Table 4). We also added a paragraph about designing video ‘training’ and test sets to
the methods.
Reviewer #2 (Public Review):
BehaviorDEPOT is a Matlab-based user interface aimed at helping users
interact with animal pose data without significant coding experience. It is composed of several tools for
analysis of animal tracking data, as well as a data collection module that can interface via Arduino to control
experimental hardware. The data analysis tools are designed for post-processing of DeepLabCut pose
estimates and manual pose annotations, and include four modules: 1) a Data Exploration module for
visualizing spatiotemporal features computed from animal pose (such as velocity and acceleration), 2) a
Classifier Optimization module for creating hand-fit classifiers to detect behaviors by applying windowing to
spatiotemporal features, 3) a Validation module for evaluating performance of classifiers, and 4) an Inter-Rater
Agreement module for comparing annotations by different individuals.
A strength of BehaviorDEPOT is its combination of many broadly useful data visualization and evaluation
modules within a single interface. The four experimental use cases in the paper nicely showcase various
features of the tool, walking the user from the simplest example (detecting optogenetically induced freezing) to
a more sophisticated decision-making example in which BehaviorDEPOT is used to segment behavioral
recordings into trials, and within trials to count head turns per trial to detect deliberative behavior (vicarious trial
and error, or VTE.) The authors also demonstrate the application of their software using several different
animal pose formats (including from 4 to 9 tracked body parts) from multiple camera types and framerates.
1) One point that confused me when reading the paper was whether BehaviorDEPOT was using a single, fixed
freezing classifier, or whether the freezing classifier was being tuned to each new setting (the latter is the
case.) The abstract, introduction, and "Development of the BehaviorDEPOT Freezing Classifier" sections all
make the freezing classifier sound like a fixed object that can be run "out-of-the-box" on any dataset. However,
the subsequent "Analysis Module" section says it implements "hard-coded classifiers with adjustable
parameters", which makes it clear that the freezing classifier is not a fixed object, but rather it has a set of
parameters that can (must?) be tuned by the user to achieve desired performance. It is important to note that
the freezing classifier performances reported in the paper should therefore be read with the understanding that
these values are specific to the particular parameter configuration found (rather than reflecting performance a
user could get out of the box.)
Our classifier does work quite well “out of the box”. We developed our freezing classifier based on a small
number of videos recorded with a FLIR Chameleon3 camera at 50 fps (Fig. 2F). We then demonstrated its high
accuracy in three separately acquired data sets (webcam, FLIR+optogenetics, and Minicam+Miniscope, Fig.
2–4, Table 4). The same classifier also had excellent performance in mice and rats from external labs. With
minor tweaks to the threshold values, we were able to classify freezing with F1>0.9 (Fig. 9). This means that
the predictive value of the metrics we chose (head angular velocity and back velocity) generalizes across
experimental setups.
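As a minimal MATLAB sketch of the underlying framewise logic (the placeholder inputs and threshold values below are assumptions, not the shipped defaults; in practice the thresholds are set and optimized through the GUI):
    % Minimal sketch of the framewise thresholding idea behind the freezing
    % classifier (back-point linear velocity and head angular velocity both
    % below threshold). Placeholder inputs and threshold values are assumptions.
    nFrames             = 500;
    backVelocity        = rand(1, nFrames);   % placeholder per-frame back-point velocity
    headAngularVelocity = rand(1, nFrames);   % placeholder per-frame head angular velocity
    velThresh = 0.25;                         % assumed; arena- and camera-dependent
    angThresh = 0.10;                         % assumed
    isFreezing = (backVelocity < velThresh) & (headAngularVelocity < angThresh);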
Popular freezing detection software, including FreezeFrame and VideoFreeze, as well as the newly created ezTrack,
also allows users to adjust freezing classifier thresholds. Allowing users to adjust thresholds ensures that the
BehaviorDEPOT freezing classifier can be applied to videos that have already been recorded with different
resolutions, lighting conditions, rodent species, etc. Indeed, the ability to easily adjust classifier thresholds for
out-of-sample data represents one of the main advantages of hand-fitting classifiers. Yet BehaviorDEPOT
offers additional advantages over FreezeFrame, VideoFreeze, and ezTrack. For one, it adds a level of rigor
to the optimization step by quantifying classifier performance over a range of threshold values, helping users
select the best ones. Also, it is free, it can quantify behavior with reference to user-defined spatiotemporal
filters, and it can classify and analyze behaviors beyond freezing. We updated the results and
discussion sections to make these points clear.
2) This points to a central component of BehaviorDEPOT's design that makes its classifiers different from
those produced by previously published behavior detection software such as JAABA or SimBA. So far as I can
tell, BehaviorDEPOT includes no automated classifier fitting, instead relying on the users to come up with
which features to use and which thresholds to assign to those features. Given that the classifier optimization
module still requires manual annotations (to calculate classifier performance, Fig 7A), I'm unsure whether hand
selection of features offers any kind of advantage over a standard supervised classifier training approach. That
doesn't mean an advantage doesn't exist; maybe the hand-fit classifiers require less annotation data than a
supervised classifier, or maybe humans are better at picking "appropriate" features based on their
understanding of the behavior they want to study.
See response to reviewer 1, point 3 above for an extensive discussion of the rationale for our classification
method. See response to reviewer 2 point 3 below for an extensive discussion of the capabilities of the data
exploration module, including new features we have added in response to Reviewer 2’s comments.
3) There is something to be said for helping users hand-create behavior classifiers: it's easier to interpret the
output of those classifiers, and they could prove easier to fine-tune to fix performance when given out-of-sample data. Still, I think it's a major shortcoming that BehaviorDEPOT only allows users to use up to two
parameters to create behavior classifiers, and cannot create thresholds that depend on linear or nonlinear
combinations of parameters (e.g., Figure 6D indicates that the best classifier would take a weighted sum of
head velocity and change in head angle.) Because of these limitations on classifier complexity, I worry that it
will be difficult to use BehaviorDEPOT to detect many more complex behaviors.
To clarify, users can combine as many parameters as they like to create behavior classifiers. However, the
reviewer raises a good point and we have now expanded the functions of the Data Exploration Module. Now,
users can choose ‘focused mode’ or ‘broad mode’ to explore their data. In focused mode, researchers use their
intuition about behaviors to select the metrics to examine. The user chooses two metrics at a time and the
Data Exploration Module compares values between frames where behavior is present or absent and provides
summary data and visual representations in the form of boxplots and histograms. A generalized linear model
(GLM) also estimates the likelihood that the behavior is present in a frame across a range of threshold values
for both selected metrics (Fig. 8A), allowing users to optimize parameters in combination. This process can be
repeated for as many metrics as desired.
In broad mode, the module uses all available keypoint metrics to generate a GLM that can predict behavior. It
also rank-orders metrics based on their predictive weights. Poorly predictive metrics are removed from the
model if their weight is sufficiently small. Users also have the option to manually remove individual metrics from
the model. Once suitable metrics and thresholds have been identified using either mode, users can plug any
number and combination of metrics into a classifier template script that we provide and incorporate their new
classifier into the Analysis Module. Detailed instructions for integrating new classifiers are available in our
GitHub repository (https://github.com/DeNardoLab/BehaviorDEPOT/wiki/Customizing-BehaviorDEPOT).
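For illustration, a minimal MATLAB sketch of the broad-mode idea is shown below; the placeholder data and variable names are assumptions, and the Data Exploration Module performs these steps internally.
    % Minimal sketch: fit a logistic GLM from framewise keypoint metrics to human
    % behavior labels and rank the metrics by the magnitude of their weights.
    metrics = zscore(randn(1000, 5));                    % placeholder nFrames x nMetrics feature matrix
    labels  = double(rand(1000, 1) > 0.7);               % placeholder framewise human annotations (0/1)
    mdl     = fitglm(metrics, labels, 'Distribution', 'binomial');  % logistic regression
    weights = mdl.Coefficients.Estimate(2:end);          % per-metric weights (intercept dropped)
    [~, order] = sort(abs(weights), 'descend');          % most predictive metrics first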
MoSeq, JAABA, MARS, SimBA, B-SOiD, DANNCE, and DeepEthogram are among a group of excellent open-source software packages that already do a great job detecting complex behaviors. They use supervised or
unsupervised machine learning to detect behaviors that are difficult to see by eye including social interactions
and fine-scale grooming behaviors. Instead of trying to improve upon these packages, BehaviorDEPOT is
targeting unmet needs of a large group of researchers that study human-defined behaviors and need a fast
and easy way to automate their analysis. As examples, we created a classifier to detect vicarious trial and error
(VTE), defined by sweeps of the head (Fig. 9). Our revised manuscript also describes our new novel object
exploration classifier (Fig. 5). Both behaviors are defined based on animal location and the presence of fine
movements that may not be accurately detected by algorithms like MoTr and Ctrax (Fig. 10). As discussed in
response to reviewer 1, point 3, additional rounds of machine learning are laborious (humans must label
frames as input), computationally intensive, harder to adjust for out-of-sample videos, and are not necessary to
quantify these kinds of behaviors.
4) Finally, I have some concerns about how performance of classifiers is reported. For example, the authors
describe a "validation" set of videos used to assess freezing classifier performance, but they are very vague
about how the detector was trained in the first place, stating "we empirically determined that thresholding the
velocity of a weighted average of 3-6 body parts ... and the angle of head movements produced the best-performing freezing classifier." What videos were used to come to this conclusion? It is imperative that when
performance values are reported in the paper, they are calculated on a separate set of validation videos,
ideally from different animals, that were never referenced while setting the parameters of the classifier.
Otherwise, there is a substantial risk of overfitting, leading to overestimation of classifier performance.
Similarly, Figure 7 shows the manual fitting of classifiers to rat and mouse data; the fitting process in 7A is
shown to include updating parameters and recalculating performance iteratively. This approach is fine,
however I want to confirm that the classifier performances in panels 7F-G were computed on videos not used
during fitting.
Thank you for pointing this out. We have included detailed descriptions of the classifier development and
validation in the results (149–204) and methods (789–820) sections and added a table that describes videos
used to validate each classifier (Table 4).
To develop the freezing classifier, we explored linear and angular velocity metrics for various keypoints, finding
that angular velocity of the head and linear velocity of a back point tracked best with freezing. Common errors
in our classifiers were identified as short sequences of frames at the beginning or end of a behavior bout. This
may reflect failures in human detection. Other common errors were sequences of false positive or false
negative frames that were shorter than a typical behavior bout. We included the convolution algorithm to
correct these short error sequences.
When developing classifiers (including adjusting the parameters for the external videos), videos were randomly
assigned to classifier development (e.g. ‘training’) and test sets. Dividing up the dataset by video rather than by
frame ensures that highly correlated temporally adjacent frames are not sorted into training and test sets,
which can cause overestimation of classifier accuracy. Since the videos in the test set were separate from
those used to develop the algorithms, our validation data reflects the accuracy levels users can expect from
BehaviorDEPOT.
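As a minimal sketch of this kind of video-level split (the video names and split sizes below are assumptions):
    % Minimal sketch of splitting by video rather than by frame, so that temporally
    % adjacent (highly correlated) frames never straddle the development/test boundary.
    videoIDs  = {'vid01','vid02','vid03','vid04','vid05','vid06'};
    rng(1);                                    % reproducible shuffle
    shuffled  = videoIDs(randperm(numel(videoIDs)));
    trainVids = shuffled(1:3);                 % classifier development ('training') set
    testVids  = shuffled(4:end);               % held-out validation set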
5) Overall, I like the user-friendly interface of this software, its interaction with experimental hardware, and its
support for hand-crafted behavior classification. However, I feel that more work could be done to support
incorporation of additional features and feature combinations as classifier input; it would be great if
BehaviorDEPOT could at least partially automate the classifier fitting process, e.g. by automatically fitting
thresholds to user-selected features, or by suggesting features that are most correlated with a user's provided
annotations. Finally, the validation of classifier performance should be addressed.
Thank you for the positive feedback on the interface. We addressed these comments in response to points 3
and 4. To recap, we updated the Data Exploration Module to include Generalized Linear Models that can
suggest features with the highest predictive value. We also generated template scripts that simplify the process
of creating new classifiers and incorporating them into the Analysis Module. We also included all the details of
the videos we used to validate classifier performance, which were separate from the videos that we used to
determine the parameters (Table 4).
Reviewer #3 (Public Review):
There is a need for standardized pipelines that allow for repeatable robust analysis of behavioral data, and this
toolkit provides several helpful modules that researchers will find useful. There are, however, several
weaknesses in the current presentation of this work.
1) It is unclear what the major advance is that sets BehaviorDEPOT apart from other tools mentioned (ezTrack,
JAABA, SimBA, MARS, DeepEthogram, etc). A comparison against other commonly used classifiers would
speak to the motivation for BehaviorDEPOT - especially if this software is simpler to use and equally efficient at
classification.
We also address this in response to reviewer 1, points 1–3. To summarize, we added direct comparisons with
JAABA to the revised manuscript. In Fig. 10, we show that BehaviorDEPOT outperforms JAABA in several ways.
First, DLC is better at tracking rodents in complex environments than MoTr and Ctrax, which are the most commonly used
JAABA companion software packages for centroid tracking. Second, we show that even when we use DLC to
approximate centroids and use this data to train classifiers with JAABA, the BehaviorDEPOT classifiers
perform better than JAABA’s.
In the revised manuscript, we included more discussion of what sets BehaviorDEPOT apart from other software,
focusing on these main points:
BehaviorDEPOT vs. commercially available packages (Ethovision, ANYmaze, FreezeFrame, VideoFreeze)
1) Ethovision, ANYmaze, FreezeFrame, and VideoFreeze cost thousands of dollars per license, while
BehaviorDEPOT is free.
2) The BehaviorDEPOT freezing classifier performs robustly even when animals are wearing a tethered patch
cord, while VideoFreeze and FreezeFrame often fail under these conditions.
3) Keypoint tracking is more accurate and flexible and can resolve more detail than methods that use
background subtraction or pixel change detection algorithms combined with center-of-mass estimates or fitted ellipses.
BehaviorDEPOT vs. packages targeted at non-coding audiences (JAABA, ezTrack)
1) DLC keypoint tracking performs better than MoTr and Ctrax in complex environments. As a result, JAABA
has not been widely used in the rodent community. Built around keypoint tracking, BehaviorDEPOT will enable
researchers to analyze videos in any type of arena, including videos they have already collected. Keypoint
tracking also allows for detection of finer movements, which is essential for behaviors like VTE and object
exploration.
2) Hand-fit classifiers can be created quickly and intuitively for well-defined laboratory behaviors. Compared to
machine learning-derived classifiers, they are easier to interpret and easier to fine-tune to optimize
performance when given out-of-sample data.
3) Even when using DLC as the input to JAABA, BehaviorDEPOT classifiers perform better (Figure 10).
4) BehaviorDEPOT integrates behavioral classification, spatial tracking, and quantitative analysis of behavior
and position with reference to spatial ROIs and temporal cues of interest. It is flexible and can accommodate
varied experimental designs. In ezTrack, spatial tracking is decoupled from behavioral classification. In JAABA,
spatial ROIs can be incorporated into machine learning algorithms, but users cannot quantify behavior with
reference to spatial ROIs after classification has occurred. Neither JAABA nor ezTrack provide a way to
quantify behavior with reference to temporal events (e.g. optogenetic stimuli, conditioned cues).
5) BehaviorDEPOT includes analysis and visualization tools, providing many features of the costly commercial
software packages for free.
BehaviorDEPOT vs. packages based on keypoint tracking (SimBA, MARS, B-SOiD)
Other software packages based on keypoint tracking use supervised or unsupervised methods to classify
behavior from animal poses. These software packages target researchers studying complex behaviors that are
difficult to see by eye including social interactions and fine-scale grooming behaviors whereas BehaviorDEPOT
targets a large group of researchers that study human-defined behaviors and need a fast and easy way to
automate their analysis. Many behaviors of interest will require spatial tracking in combination with detection of
specific movements (e.g. VTE, NOE). Additional rounds of machine learning are laborious (humans must label
frames as input), computationally intensive, and are not necessary to quantify these kinds of behaviors.
2) While the idea might be that joint-level tracking should simplify the classification process, the number of
markers used in some of the examples is limited to small regions on the body and might not justify using these
markers as input data. The functionality of the tool seems to rely on a single type of input data (a small number
of keypoints labeled using DeepLabCut) and throws away a large amount of information in the keypoint
labeling step. If the main goal is to build a robust freezing detector then why not incorporate image data
(particularly when the best set of key points does not include any limb markers)?
While one main goal was to build a robust freezing detector, BehaviorDEPOT is general-purpose software.
BehaviorDEPOT can classify behaviors from video timeseries and can analyze the results of experiments
similar to Ethovision or FreezeFrame. BehaviorDEPOT is particularly useful for assays in which behavioral
classification is integrated with spatial location, including avoidance, decision making (T maze), and novel
object memory/recognition. While image data is useful for classifying behavior, it cannot combine spatial
tracking with behavioral classification. However, DLC keypoint tracking is well-suited for this purpose. We find
that tracking 4–8 points is sufficient to hand-fit high-performing classifiers for freezing, avoidance, reward
choice in a T-maze, VTE, and novel object recognition. Of course, users always have the option to track more
points because BehaviorDEPOT simply imports the X-Y coordinates and likelihood scores of any keypoints of
interest.
3) Need a better justification of this classification method
See response to reviewer 1, points 1–3 above.
4) Are the thresholds chosen for smoothing and convolution adjusted based on agreement to a user-defined
behavior?
Yes. We added more details in the text. Briefly, users can change the
thresholds used in both smoothing and convolution in the GUI and can optimize the values using the Classifier
Optimization Module. Smoothing is performed once at the beginning of a session and has an adjustable span
for the smoothing window. The convolution is a feature of each classifier, and thus can be adjusted when
adjusting the classifier. When developing the freezing classifier, we started with the largest smoothing window
that did not exceed the rate of motion of the animal and then fine-tuned the value to optimize
smoothing. In the classifiers we have developed, window widths that are the length of the smallest bout of ‘real’
behavior and count thresholds approximately 1/3 the window width yielded the best results.
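A minimal MATLAB sketch of this kind of windowed correction, using the rule of thumb above (the specific numbers and placeholder labels are assumptions), is:
    % Minimal sketch of the convolution step that cleans up short runs of
    % misclassified frames. Window width ~ shortest 'real' bout; count threshold ~1/3 of that.
    windowWidth = 15;                                   % assumed frames in the shortest real behavior bout
    countThresh = round(windowWidth / 3);               % ~1/3 of the window width
    rawLabels   = double(rand(1, 1000) > 0.6);          % placeholder framewise classifier output (0/1)
    counts      = conv(rawLabels, ones(1, windowWidth), 'same');  % positive frames around each frame
    cleanLabels = counts >= countThresh;                % short error runs are absorbed by the window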
5) Jitter is mentioned as a limiting factor in freezing classifier performance - does this affect human scoring as
well?
We were referring to jitter in terms of point location estimates by DeepLabCut. In other words, networks
that are tailored to the specific recording conditions have lower error rates in the estimates of keypoint
positions. Human scoring is an independent process that is not affected by this jitter. We changed the wording
in the text to avoid any confusion.
6) The use of a weighted average of body part velocities again throws away information - if one had a very
high-quality video setup with more markers would optimal classification be done differently? What if the input
instead consisted of 3D data, whether from multi-camera triangulation or other 3D pose estimation? Multi-animal data?
From reviewer 2, point 3: MARS, SimBA, and B-SOiD are excellent open-source software packages that are
also based on keypoint tracking. They use supervised or unsupervised methods to classify complex behaviors
that are difficult to see by eye including social interactions and fine-scale grooming behaviors. Instead of trying
to improve upon these packages, which are already great, BehaviorDEPOT is targeting unmet needs of a large
group of researchers that study human-defined behaviors and need a fast and easy way to automate their
analysis. Additional rounds of machine learning are laborious (humans must label frames as input),
computationally intensive, and are not necessary to quantify these kinds of behaviors. However, keypoint
tracking offers accuracy, precision, and flexibility that are superior to behavioral classification programs that
estimate movement based on background subtraction, center of mass, ellipse fitting, etc.
7) It is unclear where the manual annotation of behavior is used in the tool as currently stands. Is the validation
module used to simply say that the freezing detector is as good as a human annotator? One might expect that
algorithms which use optic flow or pixel-based metrics might be superior to a human annotator, is it possible to
benchmark against one of these? For behaviors other than freezing, a tool to compare human labels seems
useful. The procedure described for converging on a behavioral definition is interesting and an example of this
in a behavior other than freezing, especially where users may disagree, would be informative. It appears that
manual annotation doesn't actually happen in the GUI and a user must create this themselves - this seems
unnecessarily complicated.
Manual annotation of behavior is used in the four classifier development modules: inter-rater, data exploration,
optimization, and validation. The inter-rater module can be used as a tool to refine ground-truth behavioral
definitions. It imports annotations from any number of raters and generates graphical and text-based statistical
reports about overlap, disagreement, etc. Users can use this tool to iteratively refine annotations until they
converge maximally. The inter-rater module can be used to compare human labels (or any reference set of
annotations) for any behavior. To ensure this is clear to the readers, we added more details to the text and a second demonstration of the inter-rater module for novel object exploration annotations (Fig. 7).
The validation module imports reference annotations, which can be produced by a human or another program,
and benchmarks classifier performance against the reference. We added more details to this section as
well.
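For illustration, a minimal MATLAB sketch of this kind of framewise benchmarking (the placeholder vectors are assumptions; in practice both come from real classifier output and imported annotations) is:
    % Minimal sketch of framewise comparison against reference annotations,
    % yielding the precision, recall, and F1 scores reported in the figures.
    predicted = rand(1, 1000) > 0.5;     % placeholder classifier labels
    reference = rand(1, 1000) > 0.5;     % placeholder reference annotations
    tp = sum(predicted & reference);
    fp = sum(predicted & ~reference);
    fn = sum(~predicted & reference);
    precision = tp / (tp + fp);
    recall    = tp / (tp + fn);
    f1        = 2 * precision * recall / (precision + recall);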
Freezing is a straightforward behavior that is easy to detect by eye. Rather than benchmark against an optic
flow algorithm, we benchmarked against JAABA, another user-friendly behavioral classification software that
uses machine learning algorithms. We find that BehaviorDEPOT is easier to use and labels freezing more
accurately than JAABA. We also made a second freezing classifier that uses a changepoint algorithm to
identify transitions from movement to freezing that may accommodate a wider range of video framerates and
resolutions.
We plan to incorporate an annotation feature into the GUI, but in the interest of disseminating our work soon,
we argue that this is not necessary for inclusion now. There are many free or cheap programs that allow
framewise annotation of behavior, including FIJI, QuickTime, VLC, and MATLAB. In fact, users may already
have manual annotations or annotations produced by a different software and BehaviorDEPOT can import
these directly. While machine learning classifiers like JAABA require human annotations to be entered into
their GUI, allowing people to import annotations they collected previously saves time and effort.
8) A major benefit of BehaviorDEPOT seems to be the ability to run experiments, but the ease of programming
specific experiments is not readily apparent. The examples provided use different recording methods and
networks for each experimental context as well as different presentations of data - it is not clear which
analyses are done automatically in BehaviorDEPOT and which require customizing code or depend on the
MiniCAM platform and hardware. For example - how does synchronization with neural or stimulus data occur?
Overall it is difficult to judge how these examples would be implemented without some visual documentation.
We added visual documentation of the Experiment Module graphical interface to Figure 1 and added more
detail to the results, methods and to our GitHub repository
(https://github.com/DeNardoLab/Fear-Conditioning-Experiment-Designer). Synchronization with stimulus data
can occur within the Experiment Module (designed for fear conditioning experiments) or stimuli timestamps can
be easily imported into the Analysis Module. Synchronization with neural data occurs post hoc using the data
structures produced by the BehaviorDEPOT Analysis Module. We include our code for aligning behavior to
Miniscope data on our GitHub repository (https://github.com/DeNardoLab/caAnalyze).