26 Matching Annotations
  1. Nov 2021
    1. If you don't have that information, you can determine which frequencies are important by extracting features with Fast Fourier Transform. To check the assumptions, here is the tf.signal.rfft of the temperature over time. Note the obvious peaks at frequencies near 1/year and 1/day:

      Do an FFT with TensorFlow:

      import numpy as np
      import matplotlib.pyplot as plt
      import tensorflow as tf

      # Real-valued FFT of the hourly temperature series
      fft = tf.signal.rfft(df['T (degC)'])
      f_per_dataset = np.arange(0, len(fft))
      n_samples_h = len(df['T (degC)'])
      hours_per_year = 24*365.2524
      years_per_dataset = n_samples_h/(hours_per_year)
      f_per_year = f_per_dataset/years_per_dataset
      plt.step(f_per_year, np.abs(fft))
      plt.xscale('log')  # the x-label below assumes a log axis
      plt.ylim(0, 400000)
      plt.xlim([0.1, max(plt.xlim())])
      plt.xticks([1, 365.2524], labels=['1/Year', '1/day'])
      _ = plt.xlabel('Frequency (log scale)')
    2. Now, peek at the distribution of the features. Some features do have long tails, but there are no obvious errors like the -9999 wind velocity value.

      Indeed, just a peek: note that this plot includes the test data too.

      df_std = (df - train_mean) / train_std
      df_std = df_std.melt(var_name='Column', value_name='Normalized')
      plt.figure(figsize=(12, 6))
      ax = sns.violinplot(x='Column', y='Normalized', data=df_std)
      _ = ax.set_xticklabels(df.keys(), rotation=90)
    3. It is important to scale features before training a neural network. Normalization is a common way of doing this scaling: subtract the mean and divide by the standard deviation of each feature. The mean and standard deviation should only be computed using the training data so that the models have no access to the values in the validation and test sets. It's also arguable that the model shouldn't have access to future values in the training set when training, and that this normalization should be done using moving averages.

      Moving average to avoid data leakage
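
      Both ideas from the passage can be sketched in code. This is a minimal sketch on a hypothetical toy series standing in for `df['T (degC)']`; the 70% training fraction comes from the split described elsewhere in the tutorial, and the 240-hour rolling window is an arbitrary illustrative choice:

```python
import numpy as np
import pandas as pd

# Hypothetical toy series standing in for df['T (degC)']
rng = np.random.default_rng(0)
df = pd.DataFrame({'T (degC)': rng.normal(10.0, 5.0, 1000)})

# Standard approach: statistics computed on the training split only,
# so the model never sees validation/test values
n = len(df)
train_df = df[0:int(n * 0.7)]
train_mean = train_df.mean()
train_std = train_df.std()
df_std = (df - train_mean) / train_std

# Alternative hinted at in the note: trailing moving averages, so each
# row is normalized using only values at or before it (no future leak)
roll_mean = df.rolling(window=240, min_periods=1).mean()
roll_std = df.rolling(window=240, min_periods=2).std()
df_roll = (df - roll_mean) / roll_std
```

      The rolling version leaves the first row as NaN (a standard deviation needs at least two points), which would need handling before training.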

    4. You'll use a (70%, 20%, 10%) split for the training, validation, and test sets. Note the data is not being randomly shuffled before splitting. This is for two reasons: It ensures that chopping the data into windows of consecutive samples is still possible. It ensures that the validation/test results are more realistic, being evaluated on the data collected after the model was trained.

      Train, Validation, Test: 0.7, 0.2, 0.1
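
      The chronological split described above can be sketched as follows, on a hypothetical frame standing in for the weather data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the hourly weather data
df = pd.DataFrame({'T (degC)': np.arange(100, dtype=float)})

# Chronological 70/20/10 split: no shuffling, so windows of consecutive
# samples stay intact and evaluation uses data recorded after training
n = len(df)
train_df = df[0:int(n * 0.7)]
val_df = df[int(n * 0.7):int(n * 0.9)]
test_df = df[int(n * 0.9):]
```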

    5. Similarly, the Date Time column is very useful, but not in this string form. Start by converting it to seconds:
      timestamp_s = date_time.map(pd.Timestamp.timestamp)

      and then create "Time of day" and "Time of year" signals:

      day = 24*60*60
      year = (365.2425)*day
      df['Day sin'] = np.sin(timestamp_s * (2 * np.pi / day))
      df['Day cos'] = np.cos(timestamp_s * (2 * np.pi / day))
      df['Year sin'] = np.sin(timestamp_s * (2 * np.pi / year))
      df['Year cos'] = np.cos(timestamp_s * (2 * np.pi / year))
    6. The last column of the data, wd (deg), gives the wind direction in units of degrees. Angles do not make good model inputs: 360° and 0° should be close to each other and wrap around smoothly. Direction shouldn't matter if the wind is not blowing.

      transform WD and WS into (u, v)
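
      A sketch of that (u, v) conversion on a few hypothetical sample rows; the component names Wx/Wy are illustrative. Zero wind speed zeroes out both components, and 0° and 360° land on the same point:

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows: wind speed (m/s) and direction (deg)
df = pd.DataFrame({'wv (m/s)': [1.0, 2.0, 0.0],
                   'wd (deg)': [0.0, 90.0, 270.0]})

# Convert (speed, direction) into wind-vector components
wv = df.pop('wv (m/s)')
wd_rad = df.pop('wd (deg)') * np.pi / 180
df['Wx'] = wv * np.cos(wd_rad)
df['Wy'] = wv * np.sin(wd_rad)
```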

    7. One thing that should stand out is the min value of the wind velocity (wv (m/s)) and the maximum value (max. wv (m/s)) columns. This -9999 is likely erroneous.
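
      One plausible repair for that sentinel, sketched on hypothetical rows: replace -9999.0 with zero, since a wind speed cannot be negative (dropping or interpolating those rows would also be defensible):

```python
import pandas as pd

# Hypothetical column containing the -9999.0 sentinel noted above
df = pd.DataFrame({'wv (m/s)': [1.4, -9999.0, 2.1]})

# Replace the erroneous sentinel with zero
bad = df['wv (m/s)'] == -9999.0
df.loc[bad, 'wv (m/s)'] = 0.0
```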
    8. This tutorial uses a weather time series dataset recorded by the Max Planck Institute for Biogeochemistry.
    9. date_time = pd.to_datetime(df.pop('Date Time'), format='%d.%m.%Y %H:%M:%S')
    10. df.describe().transpose()
  2. Sep 2021
  3. Aug 2021
    1. It’s common to think about modelling as a tool for hypothesis confirmation, and visualisation as a tool for hypothesis generation. But that’s a false dichotomy: models are often used for exploration, and with a little care you can use visualisation for confirmation. The key difference is how often do you look at each observation: if you look only once, it’s confirmation; if you look more than once, it’s exploration.
    2. It’s possible to divide data analysis into two camps: hypothesis generation and hypothesis confirmation (sometimes called confirmatory analysis). The focus of this book is unabashedly on hypothesis generation, or data exploration. Here you’ll look deeply at the data and, in combination with your subject knowledge, generate many interesting hypotheses to help explain why the data behaves the way it does. You evaluate the hypotheses informally, using your scepticism to challenge the data in multiple ways.
    3. We think R is a great place to start your data science journey because it is an environment designed from the ground up to support data science. R is not just a programming language, but it is also an interactive environment for doing data science. To support interaction, R is a much more flexible language than many of its peers. This flexibility comes with its downsides, but the big upside is how easy it is to evolve tailored grammars for specific parts of the data science process. These mini languages help you think about problems as a data scientist, while supporting fluent interaction between your brain and the computer.
    4. If you’re routinely working with larger data (10-100 Gb, say), you should learn more about data.table. This book doesn’t teach data.table because it has a very concise interface which makes it harder to learn since it offers fewer linguistic cues. But if you’re working with large data, the performance payoff is worth the extra effort required to learn it.
    5. Starting with data ingest and tidying is sub-optimal because 80% of the time it’s routine and boring, and the other 20% of the time it’s weird and frustrating. That’s a bad place to start learning a new subject! Instead, we’ll start with visualisation and transformation of data that’s already been imported and tidied. That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth it.
  4. Nov 2020
  5. Sep 2020
    1. PS is a training ground for identifying tacit knowledge. It starts off with the most basic form: recognizing something you know in the experience of another. Using resonance as your filter, you will often highlight things you “already know,” but never quite were able to express. Everything you read or watch becomes a mirror, prompting what you already know tacitly to emerge into consciousness as explicit knowledge, which you can then write down and make use of.

      NB #DEF

    1. So I think there’s definitely a lot of opportunity there for suggesting possibly related notes, but the act of a person seeing the connection themselves as opposed to some algorithm seeing the two words are connected, I think is pretty important, because that’s where you get more insight.


    1. 2. Writing matters more than storing; it lures you into thinking. But writing should not be mere outpouring either; it should also be a process of filtering. We are not trying to rebuild a "mini internet", but to raise the quality, refinement, and usability of the content. What we should write is precisely what is hard for others to write. 3. Build connections between content, not just links between [[keywords]]. A meaningful connection must be "constructed" by a person, not "generated" by a tool. Drawing a line between two notes means nothing by itself, and that line gives you no new insight. The real connection lives in the notes themselves, not in the line between them; the line is only a reminder.


    2. @宽治: In terms of the essence of knowledge management, the problems these tools cannot solve (that is, the ones users must solve through their own thinking) are: 1. Balancing the tension between a note's understandability and its discoverability. 2. Judging the value of content and assigning it a corresponding level of importance. 3. Understanding the connections between pieces of content and expressing them clearly. 4. Discovering, or even anticipating, possible directions in which the content could be extended.
    3. To streamline output, I wrote a few Keyboard Maestro scripts that export content from Roam Research into Textbundle, docx, or reveal.js slide format with one keystroke. That way, both note organizing and writing can be done seamlessly inside Roam Research.


    4. The "Morning Journal" and "Evening Reflection" sections are for personal records and review at the start and end of the day. "Input" covers what I did, learned, and came across that day; "Output" focuses more on what I produced, including "Landmarks", achievements and milestone results worth remembering; "Personal observations" are mostly records of my physical and mental state. If I am working on a concrete, fairly large task, I create a corresponding Page and jump into it to work, returning to the Journal once the task is done. Also, when I don't create a new page, I try to add an appropriate Tag to each record so it can be indexed.


    5. That may sound abstract, so here is a concrete example. I used to write tree-structured notes chapter by chapter, but I found that some books don't need to be read in full, or I would pause for a while and then resume. In those cases the tree-shaped notes often sat in an unfinished state (say, with only one chapter), which looked awkward. Now I do this instead: first create a blank note for the book. When I reach a passage worth excerpting, create a new timestamped note with an easy-to-index title like "chapter - page - summary" and paste the passage in. Then, on a new line, write why I excerpted it and what I think!


    1. Children begin to acquire a taste for pickled egg or fermented lentils early — in the womb, even. Compounds from the foods a pregnant woman eats travel through the amniotic fluid to her baby. After birth, babies prefer the foods they were exposed to in utero, a phenomenon scientists call “prenatal flavor learning.”

      [[Prenatal flavor learning]]