Going Further
[Mathilde]: This is conceptually an exercise for the last session, but because you use the unclean variables in this session we cannot do it before: harmonise "Yes" vs "yes" etc. in all columns where this is pertinent.
: “for this row, is the value of hospitalisation equal to "yes"?”
[Mathilde]: lots of punctuation and quotes going on. Consider updating to:
and testing whether the value of hospitalisation is equal to "yes" in this row.
This will be the focus of our session today
[Mathilde]: depending on whether you think of bricks of needed stuff, or more general concepts (cleaning, recoding...), this might be true or not for them. Maybe rephrase to insist that they will be useful throughout today's session?
column_name
Logic
[Mathilde]: I am not sure about keywords anymore. I feel either they are too precise to be useful (my problem in sessions one and two), or too general (here?). I can't really imagine myself looking for something and these keywords being close enough to be useful in this session, for example.
extension
[Mathilde]: Maybe specify that it's ok to have both commands, to have one dataset ready to be imported in R and one to share with colleagues who do not use R and need to be able to open a dataset in Excel or other software.
filetype
[Mathilde]: file type?
we will go back to the packages {rio} and {here}
[Mathilde]: completely redundant with the following sentence. Remove?
Wouldn’t it be great to save this (more or less clean) dataset? I agree.
[Mathilde]: bit too much that one ;)
sex
No
[Mathilde]: Should we have a tip afterwards that says that in some cases, if the vector is going to be used in several cleaning steps, it can be saved outside of the pipe, and show an example? It's not really necessary, but I like to remind them that they can create objects and use them even in pipes.
case_when()
[Mathilde]: Just so you know, on my screen the background of the inline code is not very distinguishable from the white background of the page. I don't think I have special settings for colours during the day, so it may be a problem on other screens as well. Maybe make it slightly darker?
ie:
[Mathilde]: Ok, I understand that both italicised and non-italicised are accepted; that removing the commas is pretty informal but growing in use (you may consider that most non-native speakers might have encountered it in formal text, and may find it harder to parse without the italics or the commas). But I think the : is weird.
inspect the categorical variables in df_raw
[Mathilde]: ok, this might take some time if they try to redo it all, and they should have done something similar many times (first at length with Elisabeth, then with summary in session 2, then at the start of session 3). So maybe "look at your notes about categorical variables: which of them need standardizing? Hint: if you do not remember, you can use count" or something like that?
Using
[Mathilde]: Is it really a look box? They need to code.
Be careful. The order of your statements is important here. What case_when() will do is go through each statement from top to bottom and assign the first value that is TRUE. So in our above example, case_when() will ask the following questions in sequence: Does this patient have SAM (is muac < 110)? If so, assign the value "SAM". If the patient didn’t have SAM, do they have MAM (is muac < 125)? If so, assign the value "MAM". If none of the above conditions were true, assign the default value "Normal".
[Mathilde]: I would move it before their exercise, when we dissect the example.
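To make the ordering point concrete when the example is dissected, here is a minimal sketch on a made-up three-row data frame. The `muac` values and the `df_toy` and `malnut_swapped` names are invented for illustration, and `.default` assumes {dplyr} 1.1.0 or later. It compares the order used in the text with the two conditions swapped.

```r
library(dplyr)

# Hypothetical toy data, not the Moissala linelist
df_toy <- tibble(muac = c(100, 120, 130))

df_toy <- df_toy |>
  mutate(
    # Order from the text: the strictest condition is checked first
    malnut = case_when(
      muac < 110 ~ "SAM",
      muac < 125 ~ "MAM",
      .default = "Normal"
    ),
    # Conditions swapped: muac < 125 is already TRUE for the first row,
    # so "SAM" is never assigned
    malnut_swapped = case_when(
      muac < 125 ~ "MAM",
      muac < 110 ~ "SAM",
      .default = "Normal"
    )
  )

df_toy
```

With the swapped order the severely malnourished patient is labelled "MAM", which is the mistake the warning is about.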
Try running the above code to see if it successfully creates a new column malnut with the malnutrition status of each case. You should get something like this:
[Mathilde]: I don't think this is very useful, just copying and pasting the code, especially with that edgy formulation. Maybe ask them for an alternative column on MUAC. Aren't there protocol modifications where MSF uses 120 mm? Cannot remember exactly. But that would allow us to keep the code very similar, but make the exercise a bit more complex than just copying and pasting.
So for example, if we want our case_when() to say that anytime a patient had a MUAC less than 110 we want to have a value of "SAM", the first part of our case_when() would be muac < 110 ~ "SAM". Here the left side of the ~ provides the condition and the right side gives the value we want whenever that condition is true. We can add multiple possible outcomes by adding additional lines. So in this case, our next condition might check if the patient is moderately but not severely malnourished using the statement muac < 125 ~ "MAM". The last line, with the argument .default, then gives the value we want case_when() to use when none of the above conditions have been met. In this case, we might give the value "Normal". To put this together, if we wanted to use case_when() to create a variable that classifies the malnutrition status of patients using their MUAC, we would write: df_raw |> mutate(malnut = case_when(muac < 110 ~ 'SAM', muac < 125 ~ 'MAM', .default = 'Normal'))
[Mathilde]: this part is complicated, I think because it describes an example that we don't see yet (and the text description is long and hard to understand) mixed with general information. I advise moving the code example way higher, so that we have the real example in front of us before attempting to dissect it with text, not two paragraphs after.
[
[Mathilde]: go to the next line to avoid very long condition lines that need to be scrolled right?
This is where the {dplyr} function case_when() is here to help us
[Mathilde]: the "this is where" and "is here" feel redundant. Rephrase slightly?
B
[Mathilde]: Should we allow a special case after it for is.na()? It sort of fits there, and we can say we will use it in the next session? At least the table is a bit more complete?
not the same as
under five who were hospitalized
[Mathilde]: Make it bold like in the other examples above.
Easy
[Mathilde]: I find it comical that this is easy when != was complex :D. I would say that the condition is more complex, but fortunately the code is still easy.
a more complex question
[Mathilde]: I do not think it is more complex, nor that we should stress them by saying that. maybe keeps it for when we have several conditions?
using
[Mathilde]: Maybe remove the markdown syntax from the hover, as they will not know what ** means. (This is a general comment, for all hovers.)
selecting particular variables of interest and modifying them to better suit our needs. Beyond selecting variables of interest
[Mathilde]: I would put the bold emphasis on the first part "selecting particular variables of interest", and maybe simplify the start of the second sentence to "Beyond that" to avoid the repetition that sounds a bit heavy and still keep the mirror of variables vs observations, which is nice.
[Mathilde]: little detail, but do we need the "practice" at the end of these scripts? It is literally going to be the case for any script they produce so it feels redundant.
Here is our final table !
[Mathilde]: I don't think we should use DT table to make it look good. It does not correspond to what they are producing, and it will cause questions and frustrations.
Tip You want to count rows (so use sum()) that fill a specific condition for outcome (outcome == "dead"), but only when age_group == "< 6 months"
[Mathilde]: Maybe move the tip out of the box so that they have the time to think about it without reading it first?
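If the tip is moved out of the box, a small worked example could follow it. The sketch below is only an illustration on invented data (the `df_toy` rows and the "1 - 4 years" label are hypothetical, and `.by` assumes {dplyr} 1.1.0 or later): `outcome` is first restricted to the rows of the age group, then the logic test is summed.

```r
library(dplyr)

# Hypothetical toy data, not the real linelist
df_toy <- tibble(
  sub_prefecture = c("A", "A", "A", "B"),
  age_group      = c("< 6 months", "< 6 months", "1 - 4 years", "< 6 months"),
  outcome        = c("dead", "recovered", "dead", "dead")
)

tab <- df_toy |>
  summarize(
    .by = sub_prefecture,
    # Keep only outcomes of patients under 6 months, then count the deaths
    n_death_u6m = sum(outcome[age_group == "< 6 months"] == "dead", na.rm = TRUE)
  )

tab
```

The death in the "1 - 4 years" row is not counted, because that row is dropped by the square-bracket subset before the comparison is made.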
n_patients = n(),
[Mathilde]: We don't need this line for that example. Simplify.
LOGIC_TEST,
[Mathilde]: I would make a note that explains that this works because TRUE is treated as 1 (as we mentioned in session 1). Otherwise, this part is very mysterious and illogical.
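Such a note could be backed by a two-step sketch like the one below, first on a bare logical vector and then inside summarize(). The data are invented for illustration (`df_toy` and its values are hypothetical), and `.by` assumes {dplyr} 1.1.0 or later.

```r
library(dplyr)

# TRUE counts as 1 and FALSE as 0 when a logical vector is summed
n_true <- sum(c(TRUE, FALSE, TRUE))
n_true

# Hypothetical toy data, not the real linelist
df_toy <- tibble(
  sub_prefecture  = c("A", "A", "B"),
  hospitalisation = c("yes", "no", "yes")
)

# The logic test returns TRUE/FALSE for each row; sum() counts the TRUEs
tab <- df_toy |>
  summarize(
    .by = sub_prefecture,
    n_hosp = sum(hospitalisation == "yes", na.rm = TRUE)
  )

tab
```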
ie: rows that have yes in variable hospitalisation)
[Mathilde]: That might be a bit too obvious
df_linelist |> summarize( .by = sub_prefecture, n_patients = n(), mean_age = mean(age) )
[Mathilde]: show output
n():
[Mathilde]: I would add a note at the end of the section to say that count is basically a shortcut for ...
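A note on count() being a shortcut could be illustrated along these lines. This is a sketch on invented data (`df_toy` is hypothetical, `.by` assumes {dplyr} 1.1.0 or later); it only shows that both forms give the same counts, not how count() is implemented.

```r
library(dplyr)

# Hypothetical toy data, not the real linelist
df_toy <- tibble(sub_prefecture = c("A", "B", "A"))

# Short form
with_count <- df_toy |>
  count(sub_prefecture)

# Longer form with summarize() and n()
with_summarize <- df_toy |>
  summarize(.by = sub_prefecture, n = n()) |>
  # count() sorts the groups, while .by keeps the order of appearance
  arrange(sub_prefecture)

with_count
with_summarize
```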
Ok now let’s build a summary table for each sub_prefecture. First start by replicating the above lines
[Mathilde]: this is weird, we should use a different example than them so that they work it out; not just copy code.
df |> summarize( .by = grouping_variable, new_col = summary_function(existing_col) )
[Mathilde]: I would not introduce the .by group here, as i) it makes it different from the mutate and ii) the next example does not use it yet.
we may want to increase the complexity
[Mathilde]: I am not sure taking the mean by group is conceptually more complicated than counting. This sounds scary. I would remove the first sentence and go to the second (tweaking it to say that after counting modalities in categorical data, we can also do stuff on numeric data, by group or not)
df_linelist |> filter( outcome != "left against medical advice", !is.na(outcome) ) |> count(outcome)
[Mathilde]: Again, I would show the head of the output.
sum(n)
[Mathilde]: Have we shown them that it is possible to write something like that? If not, maybe we should make a note.
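If a note is added, it could show that the output of count() is itself a data frame whose `n` column can be used in a following mutate(). A minimal sketch on invented data (`df_toy` and the `prop` column name are hypothetical):

```r
library(dplyr)

# Hypothetical toy data, not the real linelist
df_toy <- tibble(outcome = c("dead", "recovered", "recovered", "recovered"))

tab <- df_toy |>
  count(outcome) |>
  # sum(n) is the total number of rows counted, so n / sum(n) is a proportion
  mutate(prop = n / sum(n))

tab
```

Here sum(n) is computed over the whole `n` column of the counted table, so each row is divided by the same total.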
What is the proportion?
[Mathilde]: Do they know how to tweak the table to do that, or are they supposed to just type in the console to find out the answer?
df_linelist |> count(sub_prefecture, age_group)
[Mathilde]: I would show the head of the output.
contingency tables.
[Mathilde]: The stuff in the hovertip should not be there, it's just the end of the sentence. if we hide it in the hover, the next sentence should not start like this.
values
[Mathilde]: maybe let's be more precise and say "modalities" or "categories", as we are going to use it for categorical variables.
Create summary tables.
[Mathilde]: Maybe flesh it out slightly?
real
[Mathilde]: ahah, that's what we wish, but the real business was mostly cleaning them. Maybe replace by "the most interesting part"
Finally, a reminder that as usual summarise() returns a dataframe, which can be further used and modified using mutate() - this means that we should be able to add a variable about the proportion of female in each sub_prefecture using our newly created n_female ! Can you add a mutate() after your summarise() to make this happen ?
didn't you already do this earlier ?
sub_pref_df <- df_linelist %>% summarise( .by = sub_prefecture, n_patients = n(), mean_age = mean(age), min_admission = min(date_admission, na.rm = TRUE), n_female = sum(sex == "f", na.rm = TRUE), n_hosp = sum(hospitalisation == "yes", na.rm = TRUE), mean_age_hosp = mean(age[hospitalisation == "yes"], na.rm = TRUE), mean_age_female = mean(age[sex == "f"], na.rm = TRUE), n_death_u6m = sum(outcome[age_group == "< 6 months"] == "dead", na.rm = TRUE) ) %>% mutate( prop_female = n_female / n_patients, prop_hosp = n_hosp / n_patients ) sub_pref_df
i wouldn't give them the code for the solution... but i think it can be nice to show the output !
Can you try to use the syntax to calculate the mean age of female ?
it's hard to tell if they are supposed to be building up a pipe or just do one offs
_linelist %>% summarise( .by = sub_prefecture, n_patients = n(), mean_age = mean(age), min_admission = min(date_admission, na.rm = TRUE), n_female = sum(sex == "f", na.rm = TRUE), n_hosp = sum(hospitalisation == "yes", na.rm = TRUE), mean_age_hosp = mean(age[hospitalisation == "yes"], na.rm = TRUE) ) %>% mutate( prop_female = n_female / n_patients, prop_hosp = n_hosp / n_patients )
i don't know that i would show this in full pipe, it may be hard for them to see what you're talking about esp since its not added at the end. maybe just show the specific example you are trying to illustrate ?
hospitalised patients
by site ? is this the same pipe or a new one on the raw df ?
n_female / n_patients
unclear why this is done outside the summarize
also call the proportion of dead patients
i'm not sure what you mean by this... are you trying to check if they know what cfr is ?
That’s the new rows of our data now that we have grouped
i'm not sure what this means
Note. You can write either summarize() (US spelling) or summarise() (British spelling) in R.
i think this is low yield, but up to you. if you want to keep it, you should make it one of the callouts (for example "tip").
Can you think of a way of using filter() to obtain the same results ?
unclear what you are asking them to do here
pipe %>% operator
i think we decided to use the native pipe (|>); this should be updated throughout
Compare the number of rows
[Mathilde]: is nrow() a function they will review in the mini exploration activity? If not, I may add a hint that we saw such a function in the past, because it was a small one we did not use much, so easy to forget (and I would not use it in a reviewing demo to refresh them on R, too small).
reproducability
[Mathilde]: this sounds like a mix between "reproducibility" and "replicability". Are you sure it exists? I would have said reproducibility here.
This is exactly what the pipe operator, |> is for! The pipe has the following basic syntax:
[Mathilde]: Nice explanation of the pipe you did here.
duplicated
[Mathilde]: I am a bit confused as to when it is grammatically correct to say "duplicate observations or entries" instead of "duplicated observations". I always use the latter, and I am always a bit surprised when you use the former (in the objectives, or in the dedicated paragraph).
Date
[Mathilde]: same remark as before.
Hold on, have we just sneakily learned how to do a lot of basic data cleaning? I sure think so!
[Mathilde]: this sounds a bit weird. There was nothing sneaky about spending one hour and half doing it^^ Maybe it's just a bit too much and needs reformulating.
easily done
[Mathilde]: consider adding "at least in the case of fully duplicated lines" to this sentence or as a tooltip. Even with the next paragraph, I want them to realise there that two lines, identical for 32 variables, and with a typo difference in ONE variable are not going to be removed.
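The point about near-duplicates could be shown with a tiny example. The sketch below assumes the session removes duplicates with distinct() from {dplyr}, which is not confirmed in these comments; the data and village names are invented.

```r
library(dplyr)

# Hypothetical toy data: rows 1 and 2 are fully identical,
# rows 3 and 4 differ only by a typo in one variable
df_toy <- tibble(
  id      = c(1, 1, 2, 2),
  village = c("Moissala", "Moissala", "Bouna", "Buona")
)

# distinct() only drops rows that are identical across all columns
df_dedup <- df_toy |>
  distinct()

nrow(df_toy)
nrow(df_dedup)
```

Only the fully duplicated row is removed; the pair that differs by a typo is kept as two separate rows.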
Date
[Mathilde]: decide to use "date" or "Date" for the format name.
ymd()
[Mathilde]: Make it explicit that it is used to obtain the desired year-month-day format.
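A short illustration of what ymd() does could make this explicit. The sketch below uses an invented date string and a hypothetical object name; it only relies on ymd() from {lubridate} parsing text written in year, month, day order into the Date type.

```r
library(lubridate)

# Hypothetical example value, written as text in year-month-day order
date_text <- "2023-01-15"
class(date_text)

# ymd() reads the text as year, then month, then day
date_parsed <- ymd(date_text)
class(date_parsed)
```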
, {lubridate}.
[Mathilde]: maybe select more of the text as the tooltip anchor, because it is not obvious to readers that there is anything to read here. (same for ymd just after, and each time we use a tooltip just on the package name)
We might want to keep age in months as well as years, so we won’t reassign that column. But there are some other columns that could stand to be changed. There are a lot of reasons we might want to change a column, two of the most common ones are: The format of a string needs changing The data type of a column is incorrect
[Mathilde]: I feel around here is the right moment to remind them, either in a tooltip or a note below, that because we import the data in R and create new objects, at no point does modifying df_raw modify our raw data in the data subfolder.
age_years
[Mathilde]: Maybe a nice extra exercise would be to get them to use the round function, first showing them how to use it outside of mutate, and then getting them to update their mutate code to add rounding.
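The suggested exercise could follow the two steps sketched below: round() on its own, then inside mutate(). The ages are invented, and the sketch assumes that `age` is stored in months and that `age_years` is derived by dividing by 12, which should be checked against the session.

```r
library(dplyr)

# Step 1: round() on its own, keeping one decimal place
rounded_value <- round(2.4567, digits = 1)
rounded_value

# Step 2: the same function inside mutate()
# Hypothetical toy data, age assumed to be in months
df_toy <- tibble(age = c(18, 31, 7))

df_toy <- df_toy |>
  mutate(age_years = round(age / 12, digits = 1))

df_toy
```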
MUAC in cm
[Mathilde]: because the tooltip does not render Markdown well, I'd suggest removing the backticks around function names in them, so as not to confuse them since they never see that syntax in the rest of the text.
tmp
[Mathilde]: I would argue that for the French translation we should use another name (temp maybe, I don't think it is reserved in R). I feel that the most complicated English names to parse as a French speaker are the ones where the vowels were removed. We don't use that kind of shortcut so much and it's hard to figure out what the remaining letters mean. Even now, I need to read it 2 or 3 times before my brain stops reading it "te - em - pe" and understands it's a shortcut for "temporary".
Often, we want to keep most of the variables in our dataset and only remove one or two. We can use the above syntax to do this, but it can become pretty tedious to write out every column name. In these cases, instead of telling select() what to keep, we can use a subtraction sign (-) to tell it what to remove. For example, if we wanted to remove the village_commune column from our dataframe we can use the following:
[Mathilde]: I like that we show it.
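For reference, the subtraction syntax described in that paragraph can be sketched on a toy data frame as follows (the `df_toy` rows are invented; only the `village_commune` column name comes from the text):

```r
library(dplyr)

# Hypothetical toy data, not the real linelist
df_toy <- tibble(
  id              = 1:2,
  sex             = c("f", "m"),
  village_commune = c("x", "y")
)

# The minus sign tells select() which column to drop; all others are kept
df_small <- df_toy |>
  select(-village_commune)

names(df_small)
```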
daframe
[Mathilde]: Sophie had pointed out in session 1 that it was "data frame" with a space. I have no idea why we wrote it otherwise in the past, but I checked the help, and I think she is right. I made the change to session 1, session 2, session 7. I'll make the change to this one as well directly in the code, just explaining why here.
select(df_raw, id, sex, age)
[Mathilde]: I would print the head of this so that they see directly what it does.
don’t need to be put into quotation marks
[Mathilde]: should we have a tooltip saying that it is due to non-standard evaluation, and that the details are out of the scope of this lesson? I feel it is referenced much more often now than in the past in the documentation and online, so maybe it would be good that they have the keyword somewhere?
tidyverse
[Mathilde]: should we put this one in curly braces? I agree that there is the concept of the collection of packages, and the actual bundle package, but do we want to get into this here? We told them we would use a convention, so unless we explain to them why we break it here, I would stick to it.
data-verbs-practice.R
[Mathilde]: In the previous session we used underscores in script names. I think we should use them here as well for the sake of homogeneity in their folder. We can make a note somewhere that - are appropriate too, but it is good to be consistent within a folder.
his session will work with the raw Moissala linelist data, which can be downloaded here:
[Mathilde]: the link does not work for now. I thought it should since it references an existing page, but maybe I am misunderstanding something? The culprit is a "blob" in the middle: this address works: https://github.com/epicentre-msf/repicentre/blob/main/data/raw/moissala_linelist_EN.xlsx
Importing .xlsx files
[hugo] I still feel strongly about this. I think we should stick to simple, conventional importing with .csv and move the .xlsx import/export into its own dedicated satellite - happy to hear what people think here. I agree the current session is not so long so .xlsx would fit in there but it's more for organisation purposes. especially because we will probably have a satellite regarding satellites, there is a lot to be said so better not duplicate.
comments. B
[cat] i don't know that this linking is necessary and it adds a breakable dependency.