- Jun 2025
-
-
Stateless vs. Stateful Preprocessing: Most PyTorch transforms are stateless (e.g., RandomHorizontalFlip) or configured with fixed parameters (e.g., Normalize with pre-defined mean/std). If you need to compute statistics from your data (like the mean and standard deviation for normalization), you typically do this once offline and then hardcode these values into the Normalize transform. This contrasts with Keras's Normalization layer, which has an adapt() method to compute these statistics online from a batch of data.
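A minimal sketch of that offline workflow, assuming a generic 3-channel image dataset with uniform image sizes (dataset and names are placeholders, not from the quoted text):
```python
import torch
from torch.utils.data import DataLoader

# One-off offline script: compute per-channel mean/std over the training
# set, then hardcode the results into transforms.Normalize.
def compute_mean_std(dataset, batch_size=256):
    loader = DataLoader(dataset, batch_size=batch_size)
    n = 0
    mean = torch.zeros(3)
    sq_mean = torch.zeros(3)
    for images, _ in loader:  # images: (B, 3, H, W), values in [0, 1]
        b = images.size(0)
        # Weighting per-batch means by batch size assumes all images
        # share the same H x W.
        mean += images.mean(dim=(0, 2, 3)) * b
        sq_mean += images.pow(2).mean(dim=(0, 2, 3)) * b
        n += b
    mean /= n
    sq_mean /= n
    std = (sq_mean - mean.pow(2)).sqrt()
    return mean, std

# Run once, then freeze the values into the pipeline:
# transforms.Normalize(mean=mean.tolist(), std=std.tolist())
```
Keras's Normalization layer folds this same full pass into the layer itself via `layer.adapt(data)`, which is the contrast the note above is drawing.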
Additional perspective on preprocessing
-
-
www.tensorflow.org
-
Preprocessing challenges

The following are the primary challenges of implementing data preprocessing:

Training-serving skew. Training-serving skew refers to a difference between effectiveness (predictive performance) during training and during serving. This skew can be caused by a discrepancy between how you handle data in the training and the serving pipelines. For example, if your model is trained on a logarithmically transformed feature, but it's presented with the raw feature during serving, the prediction output might not be accurate. If the transformations become part of the model itself, it can be straightforward to handle instance-level transformations, as described earlier in Option C: TensorFlow. In that case, the model serving interface (the serving_fn function) expects raw data, while the model internally transforms this data before computing the output. The transformations are the same as those that were applied on the raw training and prediction data points.

Full-pass transformations. You can't implement full-pass transformations such as scaling and normalization transformations in your TensorFlow model. In full-pass transformations, some statistics (for example, max and min values to scale numeric features) must be computed on the training data beforehand, as described in Option B: Dataflow. The values then have to be stored somewhere to be used during model serving for prediction to transform the new raw data points as instance-level transformations, which avoids training-serving skew. You can use the TensorFlow Transform (tf.Transform) library to directly embed the statistics in your TensorFlow model. This approach is explained later in How tf.Transform works.

Preparing the data up front for better training efficiency. Implementing instance-level transformations as part of the model can degrade the efficiency of the training process. This degradation occurs because the same transformations are repeatedly applied to the same training data on each epoch. Imagine that you have raw training data with 1,000 features, and you apply a mix of instance-level transformations to generate 10,000 features. If you implement these transformations as part of your model, and if you then feed the model the raw training data, these 10,000 operations are applied N times on each instance, where N is the number of epochs. In addition, if you're using accelerators (GPUs or TPUs), they sit idle while the CPU performs those transformations, which isn't an efficient use of your costly accelerators.

Ideally, the training data is transformed before training, using the technique described under Option B: Dataflow, where the 10,000 transformation operations are applied only once on each training instance. The transformed training data is then presented to the model. No further transformations are applied, and the accelerators are busy all of the time. In addition, using Dataflow helps you to preprocess large amounts of data at scale, using a fully managed service.

Preparing the training data up front can improve training efficiency. However, implementing the transformation logic outside of the model (the approaches described in Option A: BigQuery or Option B: Dataflow) doesn't resolve the issue of training-serving skew. Unless you store the engineered feature in the feature store to be used for both training and prediction, the transformation logic must be implemented somewhere to be applied on new data points coming for prediction, because the model interface expects transformed data.
The TensorFlow Transform (tf.Transform) library can help you to address this issue, as described in the following section.
Challenges with data preprocessing
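To make the Option C idea concrete, here is a minimal sketch of an instance-level transform baked into the model graph, so serving can accept raw data (the feature name and the log transform are illustrative assumptions, not taken from the article):
```python
import tensorflow as tf

# The log transform is part of the model itself, so the exported
# SavedModel applies it identically at training and serving time.
raw = tf.keras.Input(shape=(1,), name='raw_feature')  # hypothetical feature
x = tf.keras.layers.Lambda(lambda t: tf.math.log1p(t))(raw)
out = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(raw, out)

model.compile(optimizer='adam', loss='mse')
# model.fit(raw_train, y_train)  # train directly on raw, untransformed features
```
A full-pass statistic (for example, a mean for z-scoring) can't be computed this way inside the graph, which is exactly the gap tf.Transform fills.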
-
You preprocess the raw training data using the transformation implemented in the tf.Transform Apache Beam APIs, and run it at scale on Dataflow. The preprocessing occurs in the following phases:

Analyze phase: During the analyze phase, the required statistics (like means, variances, and quantiles) for stateful transformations are computed on the training data with full-pass operations. This phase produces a set of transformation artifacts, including the transform_fn graph. The transform_fn graph is a TensorFlow graph that has the transformation logic as instance-level operations. It includes the statistics computed in the analyze phase as constants.

Transform phase: During the transform phase, the transform_fn graph is applied to the raw training data, where the computed statistics are used to process the data records (for example, to scale numerical columns) in an instance-level fashion.
Good dichotomy for data preprocessing
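A minimal sketch of what this looks like in code (column names are hypothetical): the analyzers run as full passes over the training data during the analyze phase, and their results are frozen into the transform_fn graph as constants:
```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Run by tf.Transform on Apache Beam; analyzers do the full passes."""
    return {
        # Analyze phase: mean/variance computed over the whole dataset.
        # Transform phase: applied per instance via the transform_fn graph.
        'amount_scaled': tft.scale_to_z_score(inputs['amount']),
        # Full-pass vocabulary computation, then instance-level lookup.
        'city_id': tft.compute_and_apply_vocabulary(inputs['city']),
    }
```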
-
- Apr 2021
-
arxiv.org
-
Yang, K.-C., Pierri, F., Hui, P.-M., Axelrod, D., Torres-Lugo, C., Bryden, J., & Menczer, F. (2020). The COVID-19 Infodemic: Twitter versus Facebook. arXiv:2012.09353 [cs]. http://arxiv.org/abs/2012.09353
-
- Nov 2020
-
dagster.io
-
self-service data platform
-
- Jul 2020
-
-
Uribe-Tirado, A., del Rio, G., Raiher, S., & Ochoa Gutiérrez, J. (2020). Open Science since Covid-19: Open Access + Open Data [Preprint]. SocArXiv. https://doi.org/10.31235/osf.io/a5nqw
-
-
osf.io
-
Dunn, M., Stephany, F., Sawyer, S., Munoz, I., Raheja, R., Vaccaro, G., & Lehdonvirta, V. (2020). When Motivation Becomes Desperation: Online Freelancing During the COVID-19 Pandemic [Preprint]. SocArXiv. https://doi.org/10.31235/osf.io/67ptf
-
- Jun 2020
-
reutersinstitute.politics.ox.ac.uk
-
Newman, N. (2020). Reuters Institute Digital News Report 2020. Reuters Institute for the Study of Journalism.
-
- Apr 2020
-
en.wikipedia.org
- Mar 2020
-
matomo.org
-
Export and migrate your data between hosting options at any time
-
- Dec 2019
-
zapier.com
-
Most to-do lists give you no control over your data. Your tasks live inside the app, not in a document you can edit, and syncing is handled by whichever company made the app. If you don't like this, todo.txt is a great alternative.
-
-
wellcomeopenresearch.org
-
platform
Does it have a name and online presence? The details provided here go beyond what's given in reference 13, but some more detail would still be useful, e.g., to connect the initiative to efforts directed at data management and curation more generally, for instance within the framework of the Research Data Alliance (https://www.rd-alliance.org/).
-