29 Matching Annotations
  1. Dec 2024
    1. coalesce reduces the parallelism of all computation in the same stage, leading to low CPU utilization and longer job run times. We currently have an implementation that must write the final result as a single Avro file; the upstream transformations can be anything, so at the last stage we append repartition(1).write().format('avro').mode('overwrite').save('path'). We recently found that when the upstream transformations include a sort, repartition(1) sometimes writes the single file in the wrong order, while coalesce(1) keeps the order correct but has the performance problem above. One idea so far is to collect to the d…
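
      A minimal PySpark sketch (not necessarily the author's final solution) of one way to keep the order without coalesce(1): repartition(1) still shuffles and so loses ordering, but re-sorting inside the single output partition restores it while leaving the earlier stages parallel. df and sort_col are illustrative names standing in for the pipeline's DataFrame and its upstream sort column.

      ```python
      # Assumes a DataFrame `df` that was sorted upstream by a column
      # `sort_col` (both names are illustrative).
      (df.repartition(1)                      # collapse to one output file...
         .sortWithinPartitions("sort_col")    # ...then restore order locally
         .write.format("avro")
         .mode("overwrite")
         .save("path"))
      ```
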
  2. Dec 2022
  3. Sep 2022
  4. Jun 2022
  5. May 2022
    1. Spark properties can mainly be divided into two kinds: one kind is related to deployment, like “spark.driver.memory” and “spark.executor.instances”; these properties may not take effect when set programmatically through SparkConf at runtime, or their behavior depends on which cluster manager and deploy mode you choose, so it is suggested to set them through a configuration file or spark-submit command-line options. The other kind is mainly related to Spark runtime control, like “spark.task.maxFailures”; these can be set either way.

      spark properties
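
      A minimal sketch of the distinction, assuming PySpark (the app name and values are illustrative): runtime-control properties take effect when set programmatically, while deploy-related ones are better passed to spark-submit.

      ```python
      from pyspark.sql import SparkSession

      # Runtime-control properties (e.g. spark.task.maxFailures) can be set
      # programmatically and take effect at runtime.
      spark = (SparkSession.builder
               .appName("config-demo")                  # illustrative name
               .config("spark.task.maxFailures", "8")
               .getOrCreate())

      # Deploy-related properties (e.g. spark.executor.instances,
      # spark.driver.memory) belong in spark-submit or a config file:
      #   spark-submit --conf spark.executor.instances=4 --driver-memory 2g app.py
      ```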

    2. keep only what resonates in a trusted place that you control, and leave the rest aside

      Though it may lead down the road to the collector's fallacy, one should note down, annotate, or highlight the things that resonate. Similar to Marie Kondo's approach to home organization, one ought to collect ideas that "spark joy" or move one internally. These have a reasonable chance of being reused or turned into something with a bit of coaxing and work. Collect now so you can filter later.

  6. Apr 2022
    1. E-tivities generally involve the tutor providing a small piece of information, stimulus or challenge, which Salmon refers to as the 'spark'.

      Effectively, these e-tivities are exactly that: an important stimulus in this new mode of teaching. Students need to feel part of the "classroom" and to feel motivated to learn.

  7. Jan 2022
  8. May 2021
  9. Apr 2021
    1. With Spark 3.1, the Spark-on-Kubernetes project is now considered Generally Available and Production-Ready.

      With Spark 3.1, k8s becomes a production-ready option to replace YARN

  10. Feb 2021
    1. Consider the amount of data and the speed of the data: if low latency is your priority, use Akka Streams; if you have huge amounts of data, use Spark, Flink, or GCP DataFlow.

      For low latency = Akka Streams

      For huge amounts of data = Spark, Flink or GCP DataFlow

  11. Jun 2020
  12. Sep 2019
  13. Mar 2019
  14. Dec 2018
  15. Nov 2018
  16. Oct 2018
  17. Jan 2018
  18. May 2017
  19. Apr 2017
  20. Apr 2014
    1. Mike Olson of Cloudera is on record as predicting that Spark will be the replacement for Hadoop MapReduce. Just about everybody seems to agree, except perhaps for Hortonworks folks betting on the more limited and less mature Tez. Spark’s biggest technical advantages as a general data processing engine are probably:

      • The Directed Acyclic Graph processing model. (Any serious MapReduce-replacement contender will probably echo that aspect.)
      • A rich set of programming primitives in connection with that model.
      • Support also for highly-iterative processing, of the kind found in machine learning.
      • Flexible in-memory data structures, namely the RDDs (Resilient Distributed Datasets).
      • A clever approach to fault-tolerance.

      Spark's advantages:

      • DAG processing model
      • programming primitives for DAG model
      • highly-iterative processing suited for ML
      • RDD in-memory data structures
      • clever approach to fault-tolerance
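
      A minimal PySpark sketch of those traits, assuming nothing beyond a local Spark install: transformations only build a DAG lazily, cache() keeps the RDD in memory for iterative reuse, and a lost partition is recomputed from its lineage rather than restored from a replica.

      ```python
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
      sc = spark.sparkContext

      # Transformations only record lineage (the DAG); nothing runs yet.
      data = sc.parallelize(range(1_000_000)).map(lambda x: x * 2).cache()

      total = data.sum()      # first action executes the DAG and fills the cache
      for _ in range(3):      # iterative passes reuse the in-memory RDD
          evens = data.filter(lambda x: x % 4 == 0).count()
      # If an executor dies, its lost partitions are rebuilt by replaying
      # the recorded lineage — fault tolerance without data replication.
      ```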