- Jun 2025
-
-
Stateless vs. Stateful Preprocessing: Most PyTorch transforms are stateless (e.g., RandomHorizontalFlip) or configured with fixed parameters (e.g., Normalize with pre-defined mean/std). If you need to compute statistics from your data (like the mean and standard deviation for normalization), you typically do this once offline and then hardcode these values into the Normalize transform. This contrasts with Keras's Normalization layer, which has an adapt() method to compute these statistics online from a batch of data.
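A minimal sketch of that offline workflow, assuming a generic 3-channel image dataset with uniform image sizes (dataset and names are placeholders, not from the quoted text):
```python
import torch
from torch.utils.data import DataLoader

# One-off offline script: compute per-channel mean/std over the training
# set, then hardcode the results into transforms.Normalize.
def compute_mean_std(dataset, batch_size=256):
    loader = DataLoader(dataset, batch_size=batch_size)
    n = 0
    mean = torch.zeros(3)
    sq_mean = torch.zeros(3)
    for images, _ in loader:  # images: (B, 3, H, W), values in [0, 1]
        b = images.size(0)
        # Weighting per-batch means by batch size assumes all images
        # share the same H x W.
        mean += images.mean(dim=(0, 2, 3)) * b
        sq_mean += images.pow(2).mean(dim=(0, 2, 3)) * b
        n += b
    mean /= n
    sq_mean /= n
    std = (sq_mean - mean.pow(2)).sqrt()
    return mean, std

# Run once, then freeze the values into the pipeline:
# transforms.Normalize(mean=mean.tolist(), std=std.tolist())
```
Keras's Normalization layer folds this same full pass into the layer itself via `layer.adapt(data)`, which is the contrast the note above is drawing.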
Additional perspective on preprocessing
-
-
www.tensorflow.org
-
Preprocessing challenges

The following are the primary challenges of implementing data preprocessing:

Training-serving skew. Training-serving skew refers to a difference between effectiveness (predictive performance) during training and during serving. This skew can be caused by a discrepancy between how you handle data in the training and the serving pipelines. For example, if your model is trained on a logarithmically transformed feature, but it's presented with the raw feature during serving, the prediction output might not be accurate. If the transformations become part of the model itself, it can be straightforward to handle instance-level transformations, as described earlier in Option C: TensorFlow. In that case, the model serving interface (the serving_fn function) expects raw data, while the model internally transforms this data before computing the output. The transformations are the same as those that were applied on the raw training and prediction data points.

Full-pass transformations. You can't implement full-pass transformations such as scaling and normalization transformations in your TensorFlow model. In full-pass transformations, some statistics (for example, max and min values to scale numeric features) must be computed on the training data beforehand, as described in Option B: Dataflow. The values then have to be stored somewhere to be used during model serving for prediction to transform the new raw data points as instance-level transformations, which avoids training-serving skew. You can use the TensorFlow Transform (tf.Transform) library to directly embed the statistics in your TensorFlow model. This approach is explained later in How tf.Transform works.

Preparing the data up front for better training efficiency. Implementing instance-level transformations as part of the model can degrade the efficiency of the training process. This degradation occurs because the same transformations are repeatedly applied to the same training data on each epoch. Imagine that you have raw training data with 1,000 features, and you apply a mix of instance-level transformations to generate 10,000 features. If you implement these transformations as part of your model, and if you then feed the model the raw training data, these 10,000 operations are applied N times on each instance, where N is the number of epochs. In addition, if you're using accelerators (GPUs or TPUs), they sit idle while the CPU performs those transformations, which isn't an efficient use of your costly accelerators.

Ideally, the training data is transformed before training, using the technique described under Option B: Dataflow, where the 10,000 transformation operations are applied only once on each training instance. The transformed training data is then presented to the model. No further transformations are applied, and the accelerators are busy all of the time. In addition, using Dataflow helps you to preprocess large amounts of data at scale, using a fully managed service.

Preparing the training data up front can improve training efficiency. However, implementing the transformation logic outside of the model (the approaches described in Option A: BigQuery or Option B: Dataflow) doesn't resolve the issue of training-serving skew. Unless you store the engineered feature in the feature store to be used for both training and prediction, the transformation logic must be implemented somewhere to be applied on new data points coming for prediction, because the model interface expects transformed data.
The TensorFlow Transform (tf.Transform) library can help you to address this issue, as described in the following section.
Challenges with data preprocessing
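To make the Option C idea concrete, here is a minimal sketch of an instance-level transform baked into the model graph, so serving can accept raw data (the feature name and the log transform are illustrative assumptions, not taken from the article):
```python
import tensorflow as tf

# The log transform is part of the model itself, so the exported
# SavedModel applies it identically at training and serving time.
raw = tf.keras.Input(shape=(1,), name='raw_feature')  # hypothetical feature
x = tf.keras.layers.Lambda(lambda t: tf.math.log1p(t))(raw)
out = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(raw, out)

model.compile(optimizer='adam', loss='mse')
# model.fit(raw_train, y_train)  # train directly on raw, untransformed features
```
A full-pass statistic (for example, a mean for z-scoring) can't be computed this way inside the graph, which is exactly the gap tf.Transform fills.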
-
You preprocess the raw training data using the transformation implemented in the tf.Transform Apache Beam APIs, and run it at scale on Dataflow. The preprocessing occurs in the following phases:

Analyze phase: During the analyze phase, the required statistics (like means, variances, and quantiles) for stateful transformations are computed on the training data with full-pass operations. This phase produces a set of transformation artifacts, including the transform_fn graph. The transform_fn graph is a TensorFlow graph that has the transformation logic as instance-level operations. It includes the statistics computed in the analyze phase as constants.

Transform phase: During the transform phase, the transform_fn graph is applied to the raw training data, where the computed statistics are used to process the data records (for example, to scale numerical columns) in an instance-level fashion.
Good dichotomy for data preprocessing
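A minimal sketch of what this looks like in code (column names are hypothetical): the analyzers run as full passes over the training data during the analyze phase, and their results are frozen into the transform_fn graph as constants:
```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Run by tf.Transform on Apache Beam; analyzers do the full passes."""
    return {
        # Analyze phase: mean/variance computed over the whole dataset.
        # Transform phase: applied per instance via the transform_fn graph.
        'amount_scaled': tft.scale_to_z_score(inputs['amount']),
        # Full-pass vocabulary computation, then instance-level lookup.
        'city_id': tft.compute_and_apply_vocabulary(inputs['city']),
    }
```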
-
- Apr 2021
-
arxiv.org
-
Yang, K.-C., Pierri, F., Hui, P.-M., Axelrod, D., Torres-Lugo, C., Bryden, J., & Menczer, F. (2020). The COVID-19 Infodemic: Twitter versus Facebook. arXiv:2012.09353 [cs]. http://arxiv.org/abs/2012.09353
-
- Nov 2020
-
dagster.io
-
self-service data platform
-
- Jul 2020
-
-
Uribe-Tirado, A., del Rio, G., Raiher, S., & Ochoa Gutiérrez, J. (2020). Open Science since Covid-19: Open Access + Open Data [Preprint]. SocArXiv. https://doi.org/10.31235/osf.io/a5nqw
-
-
osf.io
-
Dunn, M., Stephany, F., Sawyer, S., Munoz, I., Raheja, R., Vaccaro, G., & Lehdonvirta, V. (2020). When Motivation Becomes Desperation: Online Freelancing During the COVID-19 Pandemic [Preprint]. SocArXiv. https://doi.org/10.31235/osf.io/67ptf
-
- Jun 2020
-
reutersinstitute.politics.ox.ac.uk
-
Newman, N. (2020). Reuters Institute Digital News Report 2020. Reuters Institute for the Study of Journalism.
-
- Apr 2020
-
en.wikipedia.org
- Mar 2020
-
matomo.org
-
Export and migrate your data between hosting options at any time
-
- Dec 2019
-
zapier.com
-
Most to-do lists give you no control over your data. Your tasks live inside the app, not in a document you can edit, and syncing is handled by whichever company made the app. If you don't like this, todo.txt is a great alternative.
-
-
wellcomeopenresearch.org
-
platform
Does it have a name and online presence? The details provided here go beyond what's given in reference 13, but some more detail would still be useful, e.g., to connect the initiative to efforts directed at data management and curation more generally, for instance within the framework of the Research Data Alliance (https://www.rd-alliance.org/).
-