23 Matching Annotations
  1. Oct 2023
    1. Customers are often left to cobble together disparate services without tight integration in the way Microsoft might provide, for example. All this makes the introduction of Amazon Aurora zero-ETL integration with Amazon Redshift such a jaw-dropper. Let’s be clear: In essence, AWS announced that two of its services now work well together. It’s more than that, of course. Removing the cost and complexity of ETL is a great way to remove the need to build data pipelines. At heart, this is about making two AWS services work exceptionally well together. For another company, this might be considered table stakes, but for AWS, it’s relatively new and incredibly welcome. It’s also a sign of where AWS may be headed: tighter integration between its own services so that customers needn’t take on the undifferentiated heavy lifting of AWS service integration.
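
      To make "two services now work well together" concrete: the integration is created as a managed resource rather than as a pipeline you build. Here is a minimal sketch, assuming the RDS CreateIntegration API (boto3's `create_integration`) that backs this feature; the ARNs, names, and the response field used are placeholders and assumptions, not taken from the article:

      ```python
      import boto3

      # Placeholder ARNs -- substitute a real Aurora cluster and Redshift namespace.
      SOURCE_ARN = "arn:aws:rds:us-east-1:123456789012:cluster:my-aurora-cluster"
      TARGET_ARN = "arn:aws:redshift-serverless:us-east-1:123456789012:namespace/my-namespace"

      rds = boto3.client("rds")

      # One call stands in for what used to be a hand-built extract/transform/load
      # pipeline: AWS replicates changes from Aurora into Redshift on its own.
      response = rds.create_integration(
          IntegrationName="orders-to-warehouse",  # illustrative name
          SourceArn=SOURCE_ARN,
          TargetArn=TARGET_ARN,
      )

      # Assumed response field; the integration provisions asynchronously.
      print(response.get("Status"))
      ```
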
    2. One of the places where customers spend the most time building and managing ETL pipelines is between transactional databases and data warehouses, which is where AWS set its sights.
    3. One potential solution is the use of a “one big table” (OBT) strategy, where all the raw data is placed into one table. This strategy has both proponents and detractors, but leveraging large language models may overcome some of its challenges, such as discovery and pattern recognition. Super early startups such as Delphi and GetDot.AI, as well as more established players such as AWS QuickSight, Tableau Ask Data, and ThoughtSpot, are driving this trend.
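
      As a toy illustration of the pattern (the table names and columns below are made up, not from the annotated text), this pandas sketch pre-joins a few normalized tables into one wide table, so a natural-language or LLM-backed query layer never has to discover join paths:

      ```python
      import pandas as pd

      # Hypothetical normalized sources, standing in for real warehouse tables.
      orders = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 11],
                             "product_id": [100, 101], "amount": [25.0, 40.0]})
      customers = pd.DataFrame({"customer_id": [10, 11],
                                "customer_name": ["Acme", "Globex"], "region": ["US", "EU"]})
      products = pd.DataFrame({"product_id": [100, 101],
                               "product_name": ["Widget", "Gadget"], "category": ["Hardware", "Hardware"]})

      # "One big table": denormalize everything into a single wide table up front.
      obt = (orders
             .merge(customers, on="customer_id", how="left")
             .merge(products, on="product_id", how="left"))

      print(obt.columns.tolist())
      ```

      The trade-off the proponents and detractors argue over is visible even here: the wide table carries redundant customer and product attributes on every row, but every question can be answered from a single table.
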
  2. May 2020
  3. Jan 2020
    1. I like that the Lambda Architecture emphasizes retaining the input data unchanged. I think the discipline of modeling data transformation as a series of materialized stages from an original input has a lot of merit. This is one of the things that makes large MapReduce workflows tractable, as it enables you to debug each stage independently. I think this lesson translates well to the stream processing domain. I’ve written some of my thoughts about capturing and transforming immutable data streams here.

      Great point 👍

      Something I've thought about and emphasized for doing FDF: the ability to debug each step, or to re-run from a given step (see the sketch below).
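
      A minimal sketch of that materialized-stages idea, with hypothetical file names and record shapes (none of it from the original post): each stage reads only the previous stage's persisted output, so any single stage can be debugged or re-run without touching the immutable raw input.

      ```python
      import json
      from pathlib import Path

      RAW = Path("stage0_raw.jsonl")          # immutable input; written once, never modified
      CLEANED = Path("stage1_cleaned.jsonl")  # materialized output of stage 1
      TOTALS = Path("stage2_totals.json")     # materialized output of stage 2

      def stage1_clean() -> None:
          """Drop malformed records and materialize the cleaned stream."""
          with RAW.open() as src, CLEANED.open("w") as dst:
              for line in src:
                  event = json.loads(line)
                  if "user" in event and "amount" in event:
                      dst.write(json.dumps(event) + "\n")

      def stage2_aggregate() -> None:
          """Read only stage 1's output, never the raw input."""
          totals: dict[str, float] = {}
          with CLEANED.open() as src:
              for line in src:
                  event = json.loads(line)
                  totals[event["user"]] = totals.get(event["user"], 0.0) + event["amount"]
          TOTALS.write_text(json.dumps(totals, indent=2))

      if __name__ == "__main__":
          # Hypothetical sample input so the sketch is self-contained.
          RAW.write_text('{"user": "a", "amount": 3.5}\n'
                         '{"user": "b", "amount": 1.0}\n'
                         '{"malformed": true}\n')
          stage1_clean()
          stage2_aggregate()
          # Because each stage persists its output, stage2_aggregate() can be
          # re-run or debugged on its own, starting from stage1_cleaned.jsonl.
      ```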

  4. May 2018
  5. May 2014
    1. Collaborate for God's sake!: EVERY organization dealing with data is dealing with these problems. And governments need to work together on this. This is where open source presents invaluable process lessons for government: working collaboratively, and in the open, can float all boats much higher than they currently are. Whether it's putting your scripts on GitHub, asking and answering questions on the Open Data StackExchange, or helping out others on the Socrata support forums, collaboration is a key lever for this government technology problem.

      Collaboration is clearly key, but it's not obvious what that means in practice. The suggestions here are a good first step for an organization:

      • putting scripts on GitHub
      • asking and answering questions on Stack Exchange
      • and (for data) joining the Socrata support forums

      What does it take to get organizations on this path?

      And what steps are next once the organization has evolved to this point?