Hypothesis

80 Matching Annotations

Feb 2025
Local file

Data Manipulation with Conditional Logic – {repicentre}

28
1. mmousset2 12 Feb 2025
  
  in Public
  
  Going Further
  
  [Mathilde]: This is conceptually an exercise for the last session, but because you use the unclean variables in this session we cannot do it earlier: harmonise "Yes" vs "yes" etc. in all columns where this is relevant.
2. mmousset2 12 Feb 2025
  
  in Public
  
  : “for this row, is the value of hospitalisation equal to "yes"?”
  
  [Mathilde]: lots of punctuation and quotes going on. Consider updating to: and testing if the value of hospitalisation equals "yes" in this row.
3. mmousset2 12 Feb 2025
  
  in Public
  
  This will be the focus of our session today
  
  [Mathilde]: depending on whether you think in terms of building blocks of needed skills or of more general concepts (cleaning, recoding...), this might or might not be true for them. Maybe rephrase to insist that they will be useful throughout today's session?
4. mmousset2 12 Feb 2025
  
  in Public
  
  column_name
5. mmousset2 12 Feb 2025
  
  in Public
  
  Logic
  
  [Mathilde]: I am not sure about keywords anymore. I feel they are either too precise to be useful (my problem in sessions one and two) or too general (here?). I can't really imagine myself searching for something and these keywords being close enough to be useful in this session, for example.
6. mmousset2 12 Feb 2025
  
  in Public
  
  extension
  
  [Mathilde]: Maybe specify that it's OK to have both commands: one to have a dataset ready to be imported into R, and one to share with colleagues who do not use R and need to be able to open the dataset in Excel or other software.
7. mmousset2 12 Feb 2025
  
  in Public
  
  filetype
  
  [Mathilde]: file type?
8. mmousset2 12 Feb 2025
  
  in Public
  
  we will go back to the packages {rio} and {here}
  
  [Mathilde]: completely redundant with the following sentence. Remove?
9. mmousset2 12 Feb 2025
  
  in Public
  
  Wouldn’t it be great to save this (more or less clean) dataset? I agree.
  
  [Mathilde]: bit too much that one ;)
10. mmousset2 12 Feb 2025
  
  in Public
  
  sex
11. mmousset2 12 Feb 2025
  
  in Public
  
  No
  
  [Mathilde]: Should we have a tip afterwards saying that, in some cases, if the vector is going to be used in several cleaning steps, it can be saved outside of the pipe, and show an example? It's not really necessary, but I like to remind them that they can create objects and use them even in pipes.
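  
  A minimal sketch of what such a tip could show. The data frame `df_toy`, the vector `yes_values` and the new column `hospitalised` are hypothetical names used for illustration only, not objects from the session:

  ```r
  library(dplyr)

  # Hypothetical toy data: the column and its values are illustrative only
  df_toy <- data.frame(
    id = 1:4,
    hospitalisation = c("Yes", "yes", "No", "no")
  )

  # The vector is saved once, outside the pipe, so that several
  # cleaning steps can reuse it without retyping it
  yes_values <- c("Yes", "yes")

  df_clean <- df_toy |>
    mutate(hospitalised = hospitalisation %in% yes_values)
  ```

  The same `yes_values` object could then be reused in a later `filter()` or `case_when()` step.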
12. mmousset2 12 Feb 2025
  
  in Public
  
  case_when()
  
  [Mathilde]: Just so you know, on my screen the background of the inline code is not very distinguishable from the white background of the page. I don't think I have special colour settings during the day, so it may be a problem on other screens as well. Maybe make it slightly darker?
13. mmousset2 12 Feb 2025
  
  in Public
  
  ie:
  
  [Mathilde]: Ok, I understand that both italicised and non-italicised forms are accepted, and that removing the commas is pretty informal but growing in use (you may consider that most non-native speakers will have encountered it in formal text, and may find it harder to parse without the italics or the commas). But I think the ":" is weird.
14. mmousset2 12 Feb 2025
  
  in Public
  
  inspect the categorical variables in df_raw
  
  [Mathilde]: ok, this might take some time if they try to redo it all, and they should have done something similar many times (first at length with Elisabeth, then with summary in session 2, then at the start of session 3). So maybe "look at your notes about categorical variables: which of them need standardizing? Hint: if you do not remember, you can use count" or something like that?
15. mmousset2 12 Feb 2025
  
  in Public
  
  Using
  
  [Mathilde]: Is it really a look box? They need to code.
16. mmousset2 12 Feb 2025
  
  in Public
  
  Be careful. The order of your statements is important here. What case_when() will do is go through each statement from top to bottom and assign the first value that is TRUE. So in our above example, case_when() will ask the following questions in sequence: (1) Does this patient have SAM (is muac < 110)? If so, assign the value "SAM". (2) If the patient didn’t have SAM, do they have MAM (is muac < 125)? If so, assign the value "MAM". (3) If none of the above conditions were true, assign the default value "Normal".
  
  [Mathilde]: I would move it before their exercise, when we dissect the example.
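  
  To make the point about statement order concrete when dissecting the example, a small sketch along these lines might help. The toy data frame `df_toy` and the object names `ordered` and `swapped` are assumptions for illustration; the thresholds are the ones quoted in the highlighted text:

  ```r
  library(dplyr)

  # Hypothetical MUAC values (mm), one per malnutrition category
  df_toy <- data.frame(muac = c(100, 120, 140))

  # Conditions in the intended order: strictest threshold first
  ordered <- df_toy |>
    mutate(malnut = case_when(
      muac < 110 ~ "SAM",
      muac < 125 ~ "MAM",
      .default = "Normal"
    ))

  # Same conditions with the first two swapped: muac < 125 is already
  # TRUE for the SAM patient, so "SAM" is never assigned
  swapped <- df_toy |>
    mutate(malnut = case_when(
      muac < 125 ~ "MAM",
      muac < 110 ~ "SAM",
      .default = "Normal"
    ))

  ordered$malnut  # "SAM" "MAM" "Normal"
  swapped$malnut  # "MAM" "MAM" "Normal"
  ```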
17. mmousset2 12 Feb 2025
  
  in Public
  
  Try running the above code to see if it successfully creates a new column malnut with the malnutrition status of each case. You should get something like this:
  
  [Mathilde]: I don't think this is very useful, just copying and pasting the code, especially with that edgy formulation. Maybe ask them for an alternative column based on MUAC. Aren't there protocol modifications where MSF uses 120 mm? I cannot remember exactly. That would keep the code very similar while making the exercise a bit more complex than just copying and pasting.
18. mmousset2 12 Feb 2025
  
  in Public
  
  So for example, if we want our case_when() to say that anytime a patient had a MUAC less than 110 we want to have a value of "SAM", the first part of our case_when() would be muac < 110 ~ "SAM". Here the left side of the ~ provides the condition and the right side gives the value we want whenever that condition is true. We can add multiple possible outcomes by adding additional lines. So in this case, our next condition might check if the patient is moderately but not severely malnourished using the statement muac < 125 ~ "MAM". The last line, with the argument .default then gives the value we want case_when() to use when none of the above conditions have been met. In this case, we might give the value "Normal". To put this together, if we wanted to use case_when() to create a variable that classifies the malnutrition status of patients using their MUAC, we would write: df_raw |> mutate(malnut = case_when(muac < 110 ~ 'SAM', muac < 125 ~ 'MAM', .default = 'Normal'))
  
  [Mathilde]: this part is complicated, I think because it mixes general information with the description of an example that we don't see yet (and the text description is long and hard to understand). I advise moving the code example much higher, so that we have the real example in front of us before attempting to dissect it with text, not two paragraphs after.
19. mmousset2 12 Feb 2025
  
  in Public
  
  [
  
  [Mathilde]: go to the next line to avoid very long condition lines that need to be scrolled right?
20. mmousset2 12 Feb 2025
  
  in Public
  
  This is where the {dplyr} function case_when() is here to help us
  
  [Mathilde]: the "this is where" and "is here" feel redundant. Rephrase slightly?
21. mmousset2 12 Feb 2025
  
  in Public
  
  B
  
  [Mathilde]: Should we add a special case after it for is.na()? It sort of fits there, and we can say that we will use it in the next session. At least the table would be a bit more complete.
22. mmousset2 12 Feb 2025
  
  in Public
  
  not the same as
23. mmousset2 12 Feb 2025
  
  in Public
  
  under five who were hospitalized
  
  [Mathilde]: Make it bold like in the other examples above.
24. mmousset2 12 Feb 2025
  
  in Public
  
  Easy
  
  [Mathilde]: I find it comical that this is easy when != was complex :D. I would say that the condition is more complex, but fortunately the code is still easy.
25. mmousset2 12 Feb 2025
  
  in Public
  
  a more complex question
  
  [Mathilde]: I do not think it is more complex, nor that we should stress them by saying that. Maybe keep it for when we have several conditions?
26. mmousset2 12 Feb 2025
  
  in Public
  
  using
  
  [Mathilde]: Maybe remove the markdown syntax from the hover, as they will not know what ** means. (This is a general comment, for all hovers.)
27. mmousset2 12 Feb 2025
  
  in Public
  
  selecting particular variables of interest and modifying them to better suit our needs. Beyond selecting variables of interest
  
  [Mathilde]: I would put the bold emphasis on the first part, "selecting particular variables of interest", and maybe simplify the start of the second sentence to "Beyond that" to avoid the repetition, which sounds a bit heavy, while still keeping the mirror of variables vs observations, which is nice.
28. mmousset2 12 Feb 2025
  
  in Public
  
  [Mathilde]: little detail, but do we need the "practice" at the end of these script names? It is literally going to be the case for any script they produce, so it feels redundant.
Annotators

mmousset2
Local file

Summary tables – {repicentre}

29
1. mmousset2 11 Feb 2025
  
  in Public
  
  Here is our final table !
  
  [Mathilde]: I don't think we should use DT table to make it look good. It does not correspond to what they are producing, and it will cause questions and frustrations.
2. mmousset2 11 Feb 2025
  
  in Public
  
  Tip You want to count rows (so use sum()) that fill a specific condition for outcome (outcome == "dead"), but only when age_group == "< 6 months"
  
  [Mathilde]: Maybe move the tip out of the box so that they have the time to think about it without reading it first?
3. mmousset2 11 Feb 2025
  
  in Public
  
  n_patients = n(),
  
  [Mathilde]: We don't need this line for that example. Simplify.
4. mmousset2 11 Feb 2025
  
  in Public
  
  LOGIC_TEST,
  
  [Mathilde]: I would make a note that explains that this works because TRUE is treated as 1 (as we mentioned in session 1). Otherwise, this part is very mysterious and illogical.
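  
  Such a note could be backed by a very small base R illustration. This is only a sketch; the vector `sex` and its values are made up for the example:

  ```r
  # Hypothetical values, including one missing value
  sex <- c("f", "m", "f", NA, "f")

  # The logical test returns TRUE / FALSE (and NA where sex is missing)
  is_female <- sex == "f"

  # TRUE is treated as 1 and FALSE as 0 when converted to numbers
  as.integer(is_female)  # 1 0 1 NA 1

  # So summing the test counts the rows where it is TRUE
  n_female <- sum(is_female, na.rm = TRUE)
  n_female  # 3
  ```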
5. mmousset2 11 Feb 2025
  
  in Public
  
  ie: rows that have yes in variable hospitalisation)
  
  [Mathilde]: That might be a bit too obvious.
6. mmousset2 11 Feb 2025
  
  in Public
  
  df_linelist |> summarize( .by = sub_prefecture, n_patients = n(), mean_age = mean(age) )
  
  [Mathilde]: show output
7. mmousset2 11 Feb 2025
  
  in Public
  
  n():
  
  [Mathilde]: I would add a note at the end of the section to say that count is basically a shortcut for ...
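  
  As a possible basis for that note, the sketch below compares `count()` with a `summarise()` call that uses `n()`. The toy data frame and its `outcome` values are hypothetical. One caveat worth stating in the note: `count()` sorts the groups, while `summarise(.by = ...)` keeps them in order of first appearance, so row order can differ on real data:

  ```r
  library(dplyr)

  # Hypothetical outcomes, for illustration only
  df_toy <- data.frame(
    outcome = c("dead", "recovered", "recovered", "dead", "recovered")
  )

  # Shortcut version
  with_count <- df_toy |>
    count(outcome)

  # Longer version doing the same counting with n()
  with_summarise <- df_toy |>
    summarise(.by = outcome, n = n())
  ```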
8. mmousset2 11 Feb 2025
  
  in Public
  
  Ok now let’s build a summary table for each sub_prefecture. First start by replicating the above lines
  
  [Mathilde]: this is weird, we should use a different example than them so that they work it out; not just copy code.
9. mmousset2 11 Feb 2025
  
  in Public
  
  df |> summarize( .by = grouping_variable, new_col = summary_function(existing_col) )
  
  [Mathilde]: I would not introduce the .by group here, as i) it makes it different from the mutate and ii) the next example does not use it yet.
10. mmousset2 11 Feb 2025
  
  in Public
  
  we may want to increase the complexity
  
  [Mathilde]: I am not sure taking the mean by group is conceptually more complicated than counting. This sounds scary. I would remove the first sentence and go straight to the second (tweaking it to say that after counting modalities in categorical data, we can also do things with numeric data, by group or not).
11. mmousset2 11 Feb 2025
  
  in Public
  
  df_linelist |> filter( outcome != "left against medical advice", !is.na(outcome) ) |> count(outcome)
  
  [Mathilde]: Again, I would show the head of the output.
12. mmousset2 11 Feb 2025
  
  in Public
  
  sum(n)
  
  [Mathilde]: Have we shown them that it is possible to write something like that? If not, maybe we should make a note.
13. mmousset2 11 Feb 2025
  
  in Public
  
  What is the proportion?
  
  [Mathilde]: Do they know how to tweak the table to do that, or are they supposed to just type in the console to find out the answer?
14. mmousset2 11 Feb 2025
  
  in Public
  
  df_linelist |> count(sub_prefecture, age_group)
  
  [Mathilde]: I would show the head of the output.
15. mmousset2 11 Feb 2025
  
  in Public
  
  contingency tables.
  
  [Mathilde]: The stuff in the hovertip should not be there, it's just the end of the sentence. if we hide it in the hover, the next sentence should not start like this.
16. mmousset2 11 Feb 2025
  
  in Public
  
  values
  
  [Mathilde]: maybe let's be more precise and say "modalities" or "categories", as we are going to use it for categorical variables.
17. mmousset2 11 Feb 2025
  
  in Public
  
  Create summary tables.
  
  [Mathilde]: Maybe flesh it out slightly?
18. mmousset2 11 Feb 2025
  
  in Public
  
  real
  
  [Mathilde]: ahah, that's what we wish, but the real business was mostly cleaning them. Maybe replace by "the most interesting part"
19. mmousset2 11 Feb 2025
  
  in Public
  
  Finally, a reminder that as usual summarise() returns a dataframe, which can be further used and modified using mutate() - this means that we should be able to add a variable about the proportion of female in each sub_prefecture using our newly created n_female ! Can you add a mutate() after your summarise() to make this happen ?
  
  didn't you already do this earlier ?
20. mmousset2 11 Feb 2025
  
  in Public
  
  sub_pref_df <- df_linelist %>% summarise( .by = sub_prefecture, n_patients = n(), mean_age = mean(age), min_admission = min(date_admission, na.rm = TRUE), n_female = sum(sex == "f", na.rm = TRUE), n_hosp = sum(hospitalisation == "yes", na.rm = TRUE), mean_age_hosp = mean(age[hospitalisation == "yes"], na.rm = TRUE), mean_age_female = mean(age[sex == "f"], na.rm = TRUE), n_death_u6m = sum(outcome[age_group == "< 6 months"] == "dead", na.rm = TRUE) ) %>% mutate( prop_female = n_female / n_patients, prop_hosp = n_hosp / n_patients ) sub_pref_df
  
  i wouldn't give them the code for the solution... but i think it can be nice to show the output !
21. mmousset2 11 Feb 2025
  
  in Public
  
  Can you try to use the syntax to calculate the mean age of female ?
  
  it's hard to tell if they are supposed to be building up a pipe or just do one offs
22. mmousset2 11 Feb 2025
  
  in Public
  
  _linelist %>% summarise( .by = sub_prefecture, n_patients = n(), mean_age = mean(age), min_admission = min(date_admission, na.rm = TRUE), n_female = sum(sex == "f", na.rm = TRUE), n_hosp = sum(hospitalisation == "yes", na.rm = TRUE), mean_age_hosp = mean(age[hospitalisation == "yes"], na.rm = TRUE) ) %>% mutate( prop_female = n_female / n_patients, prop_hosp = n_hosp / n_patients )
  
  i don't know that i would show this in full pipe, it may be hard for them to see what you're talking about esp since its not added at the end. maybe just show the specific example you are trying to illustrate ?
23. mmousset2 11 Feb 2025
  
  in Public
  
  hospitalised patients
  
  by site ? is this the same pipe or a new one on the raw df ?
24. mmousset2 11 Feb 2025
  
  in Public
  
  n_female / n_patients
  
  unclear why this is done outside the summarize
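  
  For reference when deciding where the proportion should be computed, both placements appear to give the same result, since `summarise()` can use summary columns created earlier in the same call. A minimal sketch with hypothetical toy data (the sub-prefecture labels "A" and "B" are placeholders):

  ```r
  library(dplyr)

  # Hypothetical toy linelist
  df_toy <- data.frame(
    sub_prefecture = c("A", "A", "A", "B", "B"),
    sex = c("f", "m", "f", "f", "m")
  )

  # Proportion computed inside summarise()
  inside <- df_toy |>
    summarise(
      .by = sub_prefecture,
      n_patients = n(),
      n_female = sum(sex == "f", na.rm = TRUE),
      prop_female = n_female / n_patients
    )

  # Proportion computed afterwards with mutate()
  outside <- df_toy |>
    summarise(
      .by = sub_prefecture,
      n_patients = n(),
      n_female = sum(sex == "f", na.rm = TRUE)
    ) |>
    mutate(prop_female = n_female / n_patients)
  ```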
25. mmousset2 11 Feb 2025
  
  in Public
  
  also call the proportion of dead patients
  
  i'm not sure what you mean by this... are you trying to check if they know what cfr is ?
26. mmousset2 11 Feb 2025
  
  in Public
  
  That’s the new rows of our data now that we have grouped
  
  i'm not sure what this means
27. mmousset2 11 Feb 2025
  
  in Public
  
  Note. You can write either summarize() (US spelling) or summarise() (British spelling) in R.
  
  i think this is low yield, but up to you. if you want to keep it, you should make it one of the callouts (for example "tip").
28. mmousset2 11 Feb 2025
  
  in Public
  
  Can you think of a way of using filter() to obtain the same results ?
  
  unclear what you are asking them to do here
29. mmousset2 11 Feb 2025
  
  in Public
  
  pipe %>% operator
  
  i think we decided to use the native pipe (|>); this should be updated throughout
Annotators

mmousset2
Local file

Basic Data Manipulation – {repicentre}

21
1. mmousset2 11 Feb 2025
  
  in Public
  
  Compare the number of rows
  
  [Mathilde]: is nrow() a function they will review in the mini exploration activity? If not, I may add a hint that we saw such a function in the past, because it was a small one we did not use much, so easy to forget (and I would not use it in a reviewing demo to refresh them on R, too small).
2. mmousset2 11 Feb 2025
  
  in Public
  
  reproducability
  
  [Mathilde]: this sounds like a mix between "reproducibility" and "replicability". Are you sure it exists? I would have said reproducibility here.
3. mmousset2 11 Feb 2025
  
  in Public
  
  This is exactly what the pipe operator, |> is for! The pipe has the following basic syntax:
  
  [Mathilde]: Nice explanation of the pipe you did here.
4. mmousset2 11 Feb 2025
  
  in Public
  
  duplicated
  
  [Mathilde]: I am a bit confused as to when it is grammatically correct to say "duplicate observations or entries" instead of "duplicated observations". I always use the latter, and I am always a bit surprised when you use the former (in the objectives, or in the dedicated paragraph).
5. mmousset2 11 Feb 2025
  
  in Public
  
  Date
  
  [Mathilde]: same remark as before.
6. mmousset2 11 Feb 2025
  
  in Public
  
  Hold on, have we just sneakily learned how to do a lot of basic data cleaning? I sure think so!
  
  [Mathilde]: this sounds a bit weird. There was nothing sneaky about spending an hour and a half doing it^^ Maybe it's just a bit too much and needs reformulating.
7. mmousset2 11 Feb 2025
  
  in Public
  
  easily done
  
  [Mathilde]: consider adding "at least in the case of fully duplicated lines" to this sentence or as a tooltip. Even with the next paragraph, I want them to realise there that two lines, identical for 32 variables, and with a typo difference in ONE variable are not going to be removed.
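  
  A small sketch that could accompany this caveat, assuming the duplicates are removed with `distinct()` from {dplyr}. The toy data frame and the misspelt village name are invented for illustration:

  ```r
  library(dplyr)

  # Hypothetical data: rows 1 and 2 are fully identical,
  # rows 3 and 4 differ only by a typo in one variable
  df_toy <- data.frame(
    id = c(1, 1, 2, 2),
    village = c("Moissala", "Moissala", "Bouna", "Buona"),
    age = c(3, 3, 5, 5)
  )

  # Only the fully duplicated row is dropped; the near-duplicate
  # with the typo is kept
  df_dedup <- df_toy |>
    distinct()

  nrow(df_dedup)  # 3
  ```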
8. mmousset2 11 Feb 2025
  
  in Public
  
  Date
  
  [Mathilde]: decide to use "date" or "Date" for the format name.
9. mmousset2 11 Feb 2025
  
  in Public
  
  ymd()
  
  [Mathilde]: Make it explicit that it is used to obtain the desired year-month-day format.
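  
  If a short illustration is wanted here, something like the following sketch could work. The date strings are made up, and the object names `dates_chr` and `dates` are placeholders:

  ```r
  library(lubridate)

  # Hypothetical dates stored as text, written year-month-day
  dates_chr <- c("2022-08-13", "2022-08-14")

  # ymd() tells R that the text is ordered year, month, day
  # and converts it to the Date type
  dates <- ymd(dates_chr)

  class(dates)  # "Date"
  ```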
10. mmousset2 11 Feb 2025
  
  in Public
  
  , {lubridate}.
  
  [Mathilde]: maybe select more of the text as the tooltip anchor, because it is not obvious to readers that there is anything to read here. (same for ymd just after, and each time we use a tooltip just on the package name)
11. mmousset2 11 Feb 2025
  
  in Public
  
  We might want to keep age in months as well as years, so we won’t reassign that column. But there are some other columns that could stand to be changed. There are a lot of reasons we might want to change a column, two of the most common ones are: The format of a string needs changing The data type of a column is incorrect
  
  [Mathilde]: I feel that around here is the right moment to remind them, either in a tooltip or in a note below, that because we import the data into R and create new objects, at no point does modifying df_raw modify our raw data in the data subfolder.
12. mmousset2 11 Feb 2025
  
  in Public
  
  age_years
  
  [Mathilde]: Maybe a nice extra exercise would be to get them to use the round function, first showing them how to use it outside of mutate, and then getting them to update their mutate code to add rounding.
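  
  The two steps of such an exercise might look like the sketch below. It assumes that `age` is recorded in months and that `age_years` is obtained by dividing by 12; the toy data frame is hypothetical:

  ```r
  library(dplyr)

  # Step 1: round() on its own, outside of mutate()
  round(3.14159, digits = 1)  # 3.1

  # Step 2: the same function inside mutate()
  # Hypothetical ages in months (assumption for this example)
  df_toy <- data.frame(age = c(7, 18, 30))

  df_years <- df_toy |>
    mutate(age_years = round(age / 12, digits = 1))

  df_years$age_years  # 0.6 1.5 2.5
  ```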
13. mmousset2 11 Feb 2025
  
  in Public
  
  MUAC in cm
  
  [Mathilde]: because the tooltip does not render Markdown well, I'd suggest removing the backticks around function names in tooltips, so as not to confuse them, since they never see that syntax in the rest of the text.
14. mmousset2 11 Feb 2025
  
  in Public
  
  tmp
  
  [Mathilde]: I would argue that for the French translation we should use another name (temp maybe, I don't think it is reserved in R). I feel that the English names that are hardest to parse for a French speaker are the ones where the vowels were removed. We don't shorten words that way so much, and it's hard to figure out what the remaining letters mean. Even now, I need to read it 2 or 3 times before my brain stops reading it "te - em - pe" and understands it's a shortcut for "temporary".
15. mmousset2 11 Feb 2025
  
  in Public
  
  Often, we want to keep most of the variables in our dataset and only remove one or two. We can use the above syntax to do this, but it can become pretty tedious to write out every column name. In these cases, instead of telling select what to keep, we can use a subtraction sign (-) to tell it what to remove. For example, if we wanted to remove the village_commune column from our dataframe we can use the following:
  
  [Mathilde]: I like that we show it.
16. mmousset2 11 Feb 2025
  
  in Public
  
  daframe
  
  [Mathilde]: Sophie had pointed out in session 1 that it was "data frame" with a space. I have no idea why we wrote it otherwise in the past, but I checked the help, and I think she is right. I made the change to session 1, session 2, session 7. I'll make the change to this one as well directly in the code, just explaining why here.
17. mmousset2 11 Feb 2025
  
  in Public
  
  select(df_raw, id, sex, age)
  
  [Mathilde]: I would print the head of this so that they see directly what it does.
18. mmousset2 11 Feb 2025
  
  in Public
  
  don’t need to be put into quotation marks
  
  [Mathilde]: should we have a tooltip saying that it is due to non standard evaluation, and that the details are out of the scope of this lesson. I feel it is referenced much more often now than in the past in the documentation and online, so maybe it would be good that they have the key word somewhere?
19. mmousset2 11 Feb 2025
  
  in Public
  
  tidyverse
  
  [Mathilde]: should we put this one in curly braces? I agree that there is the concept of the collection of packages, and the actual bundle package, but do we want to get into this here? We told them we would use a convention, so unless we explain to them why we break it here, I would stick to it.
20. mmousset2 11 Feb 2025
  
  in Public
  
  data-verbs-practice.R
  
  [Mathilde]: In the previous session we used underscores in script names. I think we should use them here as well for the sake of homogeneity in their folder. We can make a note somewhere that hyphens are appropriate too, but it is good to be consistent within a folder.
21. mmousset2 11 Feb 2025
  
  in Public
  
  his session will work with the raw Moissala linelist data, which can be downloaded here:
  
  [Mathilde]: the link does not work for now. I thought it should since it references an existing page, but maybe I am misunderstanding something? The culprit is a "blob" in the middle: this address works: https://github.com/epicentre-msf/repicentre/blob/main/data/raw/moissala_linelist_EN.xlsx
Annotators

mmousset2
Jan 2025
Local file

Import data – Repicentre

2
1. mmousset2 07 Jan 2025
  
  in Public
  
  Importing .xlsx files
  
  [hugo] I still feel strongly about this. I think we should stick to simple, conventional importing with .csv and move the .xlsx import/export into its own dedicated satellite - happy to hear what people think here. I agree the current session is not so long so .xlsx would fit in there, but it's more for organisation purposes, especially because we will probably have a satellite regarding satellites; there is a lot to be said, so better not to duplicate.
2. mmousset2 07 Jan 2025
  
  in Public
  
  comments. B
  
  [cat] i don't know that this linking is necessary and it adds a breakable dependency.
Annotators

mmousset2
