76 Matching Annotations
  1. Feb 2025
    1. [Mathilde]: A small detail, but do we need the "practice" at the end of these script names? It is literally going to be the case for any script they produce, so it feels redundant.

    2. selecting particular variables of interest and modifying them to better suit our needs. Beyond selecting variables of interest

      [Mathilde]: I would put the bold emphasis on the first part, "selecting particular variables of interest", and maybe simplify the start of the second sentence to "Beyond that". This avoids the repetition, which sounds a bit heavy, and still keeps the mirror of variables vs observations, which is nice.

    3. a more complex question

      [Mathilde]: I do not think it is more complex, nor that we should stress them by saying so. Maybe keep it for when we have several conditions?

    4. Easy

      [Mathilde]: I find it comical that this is easy when != was complex :D. I would say that the condition is more complex, but fortunately the code is still easy.

    5. B

      [Mathilde]: Should we add a special case after it for is.na()? It sort of fits there, and we can say that we will use it in the next session. At least the table would be a bit more complete.

    6. So for example, if we want our case_when() to say that anytime a patient had a MUAC less than 110 we want to have a value of "SAM", the first part of our case_when() would be muac < 110 ~ "SAM". Here the left side of the ~ provides the condition and the right side gives the value we want whenever that condition is true. We can add multiple possible outcomes by adding additional lines. So in this case, our next condition might check if the patient is moderately but not severely malnourished using the statement muac < 125 ~ "MAM". The last line, with the argument .default then gives the value we want case_when() to use when none of the above conditions have been met. In this case, we might give the value "Normal". To put this together, if we wanted to use case_when() to create a variable that classifies the malnutrition status of patients using their MUAC, we would write: df_raw |> mutate(malnut = case_when(muac < 110 ~ 'SAM', muac < 125 ~ 'MAM', .default = 'Normal'))

      [Mathilde]: This part is complicated, I think because it mixes general information with the description of an example that we have not seen yet (and the text description is long and hard to follow). I advise moving the code example much higher, so that we have the real example in front of us before attempting to dissect it with text, not two paragraphs later.
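
For reference while discussing this passage, here is a minimal sketch of the quoted case_when() call on a tiny made-up data frame. The object df_demo and its MUAC values are hypothetical (not the course data), and the sketch assumes R 4.1 or later for the native pipe and dplyr 1.1.0 or later for the .default argument.

```r
library(dplyr)

# Hypothetical toy data: MUAC values in mm, not taken from the course dataset
df_demo <- data.frame(
  id = 1:4,
  muac = c(105, 118, 124, 140)
)

# Same conditions as in the quoted text: the left side of ~ is the condition,
# the right side is the value assigned when that condition is TRUE
df_demo <- df_demo |>
  mutate(
    malnut = case_when(
      muac < 110 ~ "SAM",
      muac < 125 ~ "MAM",
      .default = "Normal"
    )
  )

df_demo
```

Having the short example visible first would let the text refer to concrete lines when explaining the condition, the value and the default.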

    7. Try running the above code to see if it successfully creates a new column malnut with the malnutrition status of each case. You should get something like this:

      [Mathilde]: I don't think this is very useful, just copying and pasting the code, especially with that edgy formulation. Maybe ask them for an alternative column based on MUAC. Aren't there protocol modifications where MSF uses 120 mm? I cannot remember exactly. That would keep the code very similar but make the exercise a bit more demanding than just copying and pasting.

    8. ie:

      [Mathilde]: OK, I understand that both the italicised and non-italicised forms are accepted, and that removing the commas is pretty informal but growing in use (though consider that most non-native speakers will have encountered it in formal text and may find it harder to parse without the italics or the commas). But I think the ":" is weird.

    9. Be careful. The order of your statements is important here. What case_when() will do is go through each statement from top to bottom and assign the first value that is TRUE. So in our above example, case_when() will ask the following questions in sequence: Does this patient have SAM (is muac < 110)? If so, assign the value "SAM". If the patient didn’t have SAM, do they have MAM (is muac < 125)? If so, assign the value "MAM". If none of the above conditions were true, assign the default value "Normal".

      [Mathilde]: I would move it before their exercise, when we dissect the example.
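
If the order warning moves next to the dissected example, a short side-by-side comparison might help make the point. This is only a sketch on a made-up vector (muac_demo, right_order and wrong_order are hypothetical names), assuming dplyr 1.1.0 or later for .default.

```r
library(dplyr)

# Hypothetical MUAC values in mm
muac_demo <- c(105, 118, 140)

# Most severe condition first: each value stops at the first TRUE condition
right_order <- case_when(
  muac_demo < 110 ~ "SAM",
  muac_demo < 125 ~ "MAM",
  .default = "Normal"
)

# Conditions swapped: 105 is also < 125, so it is labelled "MAM"
# and the "SAM" line is never reached
wrong_order <- case_when(
  muac_demo < 125 ~ "MAM",
  muac_demo < 110 ~ "SAM",
  .default = "Normal"
)

right_order
wrong_order
```

The first patient is classified as "SAM" with the correct order and as "MAM" with the swapped order, which is exactly the trap the quoted paragraph describes.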

    10. No

      [Mathilde]: Should we have a tip afterwards that says that, in some cases, if the vector is going to be used in several cleaning steps, it can be saved outside of the pipe, and show an example? It's not really necessary, but I like to remind them that they can create objects and use them even in pipes.

    11. inspect the categorical variables in df_raw

      [Mathilde]: OK, this might take some time if they try to redo it all, and they should have done something similar many times (first at length with Elisabeth, then with summary in session 2, then at the start of session 3). So maybe "look at your notes about categorical variables: which of them need standardizing? Hint: if you do not remember, you can use count" or something like that?

    12. case_when()

      [Mathilde]: Just so you know, on my screen the background of the inline code is not very distinguishable from the white background of the page. I don't think I have special colour settings during the day, so it may be a problem on other screens as well. Maybe make it slightly darker?

    13. : “for this row, is the value of hospitalisation equal to "yes"?”

      [Mathilde]: Lots of punctuation and quotes going on. Consider updating to: and testing if the value of hospitalisation is equal to "yes" in this row.

    14. extension

      [Mathilde]: Maybe specify that it's OK to have both commands: one to have a dataset ready to be imported into R, and one to share with colleagues who do not use R and need to be able to open the dataset in Excel or other software.

    15. Logic

      [Mathilde]: I am not sure about keywords anymore. I feel they are either too precise to be useful (my problem in sessions one and two) or too general (here?). I can't really imagine myself looking for something and these keywords being close enough to be useful in this session, for example.

    16. Going Further

      [Mathilde]: This is conceptually an exercise for the last session, but because you use the unclean variables in this session we cannot do it before: harmonise "Yes" vs "yes" etc. in all columns where this is pertinent.

    17. This will be the focus of our session today

      [Mathilde]: Depending on whether you think in terms of bricks of needed stuff or of more general concepts (cleaning, recoding...), this might or might not be true for them. Maybe rephrase to insist that they will be useful throughout today's session?

    Annotators

    1. data-verbs-practice.R

      [Mathilde]: In the previous session we used underscores in script names. I think we should use them here as well for the sake of homogeneity in their folder. We can make a note somewhere that hyphens (-) are appropriate too, but that it is good to be consistent within a folder.

    2. tidyverse

      [Mathilde]: Should we put this one in curly braces? I agree that there is the concept of the collection of packages and the actual bundle package, but do we want to get into this here? We told them we would use a convention, so unless we explain to them why we break it here, I would stick to it.

    3. don’t need to be put into quotation marks

      [Mathilde]: Should we have a tooltip saying that it is due to non-standard evaluation, and that the details are out of the scope of this lesson? I feel it is referenced much more often now than in the past in the documentation and online, so maybe it would be good for them to have the keyword somewhere.

    4. daframe

      [Mathilde]: Sophie had pointed out in session 1 that it is "data frame" with a space. I have no idea why we wrote it otherwise in the past, but I checked the help, and I think she is right. I made the change in session 1, session 2 and session 7. I'll make the change to this one as well directly in the code; I am just explaining why here.

    5. Often, we want to keep most of the variables in our dataset and only remove one or two. We can use the above syntax to do this, but it can become pretty tedious to write out every column name. In these cases, instead of telling select what to keep, we can use a subtraction sign (-) to tell it what to remove. For example, if we wanted to remove the village_commune column from our dataframe we can use the following:

      [Mathilde]: I like that we show it.
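
For readers of this thread without the course page open, the negative selection being praised looks roughly like the sketch below. The toy data frame df_demo and its values are hypothetical; only the column name village_commune comes from the quoted text.

```r
library(dplyr)

# Hypothetical toy data frame with the column named in the quoted passage
df_demo <- data.frame(
  id = 1:3,
  village_commune = c("A", "B", "C"),
  age = c(12, 30, 7)
)

# The minus sign tells select() which column to drop; all others are kept
df_small <- df_demo |>
  select(-village_commune)

names(df_small)
```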

    6. tmp

      [Mathilde]: I would argue that for the French translation we should use another name (temp maybe, I don't think it is reserved in R). I feel that the English names that are hardest to parse for a French speaker are the ones where the vowels were removed. We don't shorten words that way so much, and it's hard to figure out what the remaining letters mean. Even now, I need to read it 2 or 3 times before my brain stops reading it "te - em - pe" and understands that it's a shortcut for "temporary".

    7. MUAC in cm

      [Mathilde]: Because the tooltip does not render Markdown well, I'd suggest removing the backticks around function names in tooltips, so as not to confuse them, since they never see that syntax in the rest of the text.

    8. age_years

      [Mathilde]: Maybe a nice extra exercise would be to get them to use the round function, first showing them how to use it outside of mutate, and then getting them to update their mutate code to add rounding.
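
A rough sketch of what that two-step exercise could look like. It assumes an age column recorded in months (the name age and the toy values are hypothetical); age_years is the column from the quoted text.

```r
library(dplyr)

# Step 1: round() on its own, outside of mutate()
round(7.6666, digits = 1)

# Step 2: the same function inside mutate()
# Hypothetical toy data: ages in months
df_demo <- data.frame(age = c(18, 31, 7))

df_demo <- df_demo |>
  mutate(age_years = round(age / 12, digits = 1))

df_demo
```

Showing round() alone first keeps the new function separate from the mutate() syntax they already know.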

    9. We might want to keep age in months as well as years, so we won’t reassign that column. But there are some other columns that could stand to be changed. There are a lot of reasons we might want to change a column, two of the most common ones are: The format of a string needs changing The data type of a column is incorrect

      [Mathilde]: I feel this is the right moment to remind them, either in a tooltip or in a note below, that because we import the data into R and create new objects, modifying df_raw never modifies our raw data in the data subfolder.

    10. , {lubridate}.

      [Mathilde]: Maybe select more of the text as the tooltip anchor, because it is not obvious to readers that there is anything to read here (same for ymd just after, and each time we put a tooltip on just the package name).

    11. easily done

      [Mathilde]: Consider adding "at least in the case of fully duplicated lines" to this sentence or as a tooltip. Even with the next paragraph, I want them to realise that two lines that are identical for 32 variables but differ by a typo in ONE variable are not going to be removed.
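
To illustrate the point about near-duplicates, here is a small sketch, assuming the quoted passage relies on dplyr's distinct() (the thread does not say which function the course uses). The data frame and the misspelt village name are made up.

```r
library(dplyr)

# Hypothetical toy data: rows 1 and 2 are fully identical,
# rows 3 and 4 differ only by a typo in one variable
df_demo <- data.frame(
  id = c(1, 1, 2, 2),
  sex = c("f", "f", "m", "m"),
  village = c("Moissala", "Moissala", "Bouna", "Buona")
)

# distinct() only drops rows that match on every column
df_dedup <- df_demo |>
  distinct()

nrow(df_demo)   # rows before
nrow(df_dedup)  # rows after: only the exact duplicate is gone
```

Only one of the four rows is removed; the pair that differs by a typo survives, which is the behaviour worth flagging to learners.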

    12. Hold on, have we just sneakily learned how to do a lot of basic data cleaning? I sure think so!

      [Mathilde]: This sounds a bit weird. There was nothing sneaky about spending an hour and a half doing it^^ Maybe it's just a bit too much and needs reformulating.

    13. duplicated

      [Mathilde]: I am a bit confused as to when it is grammatically correct to say "duplicate observations or entries" instead of "duplicated observations". I always use the latter, and I am always a bit surprised when you use the former (in the objectives, or in the dedicated paragraph).

    14. reproducability

      [Mathilde]: this sounds like a mix between "reproducibility" and "replicability". Are you sure it exists? I would have said reproducibility here.

    15. Compare the number of rows

      [Mathilde]: is nrow() a function they will review in the mini exploration activity? If not, I may add a hint that we saw such a function in the past, because it was a small one we did not use much, so easy to forget (and I would not use it in a reviewing demo to refresh them on R, too small).

    Annotators

    1. Finally, a reminder that as usual summarise() returns a dataframe, which can be further used and modified using mutate() - this means that we should be able to add a variable about the proportion of female in each sub_prefecture using our newly created n_female ! Can you add a mutate() after your summarise() to make this happen ?

      didn't you already do this earlier ?

    2. sub_pref_df <- df_linelist %>% summarise( .by = sub_prefecture, n_patients = n(), mean_age = mean(age), min_admission = min(date_admission, na.rm = TRUE), n_female = sum(sex == "f", na.rm = TRUE), n_hosp = sum(hospitalisation == "yes", na.rm = TRUE), mean_age_hosp = mean(age[hospitalisation == "yes"], na.rm = TRUE), mean_age_female = mean(age[sex == "f"], na.rm = TRUE), n_death_u6m = sum(outcome[age_group == "< 6 months"] == "dead", na.rm = TRUE) ) %>% mutate( prop_female = n_female / n_patients, prop_hosp = n_hosp / n_patients ) sub_pref_df

      i wouldn't give them the code for the solution... but i think it can be nice to show the output !

    3. _linelist %>% summarise( .by = sub_prefecture, n_patients = n(), mean_age = mean(age), min_admission = min(date_admission, na.rm = TRUE), n_female = sum(sex == "f", na.rm = TRUE), n_hosp = sum(hospitalisation == "yes", na.rm = TRUE), mean_age_hosp = mean(age[hospitalisation == "yes"], na.rm = TRUE) ) %>% mutate( prop_female = n_female / n_patients, prop_hosp = n_hosp / n_patients )

      i don't know that i would show this in full pipe, it may be hard for them to see what you're talking about esp since it's not added at the end. maybe just show the specific example you are trying to illustrate ?

    4. = min(date_admission, na.rm = TRUE),

      i probably wouldn't put the answers to exercises in these examples. exercises should be separate so that they have to practice

    5. Note. You can write either summarize() (US spelling) or summarise() (British spelling) in R.

      i think this is low yield, but up to you. if you want to keep it, you should make it one of the callouts (for example "tip").

    Annotators

    1. eople with zero prior experience in R, the linear course will walk you through core R concepts using a case study about measles in Chad. The course covers the following concepts:

      [cat] comment example

  2. Dec 2024
    1. Importing .xlsx files

      [hugo] I still feel strongly about this. I think we should stick to simple, conventional importing with .csv and move the .xlsx import/export into its own dedicated satellite. Happy to hear what people think here. I agree the current session is not so long, so .xlsx would fit in there, but it's more for organisational purposes, especially because we will probably have a satellite regarding satellites; there is a lot to be said, so better not to duplicate.

    2. Add the code to create an object called path_data_raw

      [hugo] I would suggest we remove the object here for the variable name. I hear that it makes them practice objects, but I feel it's confusing to introduce here("data", "raw") and then suddenly ask them to do here(path_data_raw, "msf_linelist_moissala_2023-09-24.xlsx"). The variable does not add much here and is only truly important with long automated scripts ...

    3. Foreshadowing. File paths actually work a bit differently in Rmarkdown files than they do in R scripts, but this is something we will talk about much later in the course. If you don’t know what RMarkdown is at the moment, don’t worry about it

      [hugo] I would remove this. It's quite advanced and piles on top of the already huge load of new information they are getting here.

    4. OneDrive doesn’t play well with R as it will attempt to constantly synchronize certain project files in a way that can cause errors or memory problems.

      [cat] i would put this somewhere else, either in text of the setup instruction next to it or as an important callout

    5. OneDrive doesn’t play well with R as it will attempt to constantly synchronize certain project files in a way that can cause errors or memory problems.

      [hugo] I think side notes are distracting, I have not even noticed them until now. I would suggest we move this type of information (definitions/deeper concept) to the tooltips

    Annotators