3,386 Matching Annotations
  1. Mar 2021
    1. The urgent argument for turning any company into a software company is the growing availability of data, both inside and outside the enterprise. Specifically, the implications of so-called “big data”—the aggregation and analysis of massive data sets, especially mobile

      Every company is described by a set of data: financial and other operational metrics, alongside message exchanges and paper documents. Whatever else we find that contributes to the simulacrum of an economic narrative will inevitably be constrained by the constitutive forces of its source data.

    1. a data donation platform that allows users of browsers to donate data on their usage of specific services (eg Youtube, or Facebook) to a platform.

      This seems like a really promising pattern for many data-driven problems. Browsers could support opt-in donation so that users contribute their usage data to improve Web search, social media, recommendations, and other services that implicitly require large amounts of operational data.

    1. DataBeers Brussels. (2020, October 26). ⏰ Our next #databeers #brussels is tomorrow night and we’ve got a few tickets left! Don’t miss out on some important and exciting talks from: 👉 @svscarpino 👉 Juami van Gils 👉 Joris Renkens 👉 Milena Čukić 🎟️ Last tickets here https://t.co/2upYACZ3yS https://t.co/jEzLGvoxQe [Tweet]. @DataBeersBru. https://twitter.com/DataBeersBru/status/1320743318234562561

    1. Cailin O’Connor. (2020, November 10). New paper!!! @psmaldino look at what causes the persistence of poor methods in science, even when better methods are available. And we argue that interdisciplinary contact can lead better methods to spread. 1 https://t.co/C5beJA5gMi [Tweet]. @cailinmeister. https://twitter.com/cailinmeister/status/1326221893372833793

    1. These methods should be used with caution, however, because important business rules and application logic may be kept in callbacks. Bypassing them without understanding the potential implications may lead to invalid data.
    1. Erich Neuwirth. (2020, November 11). #COVID19 #COVID19at https://t.co/9uudp013px Zu meinem heutigen Bericht sind Vorbemerkungen notwendig. Das EMS - aus dem kommen die Daten über positive Tests—Hat anscheinend ziemliche Probleme. Heute wurden viele Fälle nachgemeldet. In Wien gab es laut diesem [Tweet]. @neuwirthe. https://twitter.com/neuwirthe/status/1326556742113746950

  2. Feb 2021
    1. Data on blockchains are different from data on the Internet, and in one important way in particular. On the Internet most of the information is malleable and fleeting. The exact date and time of its publication isn't critical to past or future information. On a blockchain, the truth of the present relies on the details of the past. Bitcoins moving across the network have been permanently stamped from the moment of their coinage.

      data on blockchain vs internet

    1. Trailblazer will automatically create a new Context object around your custom input hash. You can write to that without interfering with the original context.
    1. Purely functional programming may also be defined by forbidding state changes and mutable data.
    2. Purely functional data structures are persistent. Persistency is required for functional programming; without it, the same computation could return different results.
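
      A minimal sketch of that persistence idea in TypeScript (names invented for illustration): "adding" to an immutable list builds a new version that shares the old one as its tail, so earlier versions stay valid and the same computation over them keeps returning the same result.

      ```ts
      // Persistent singly linked list: nodes are never mutated, so every
      // earlier version of the list remains usable after an "update".
      type List<T> = { readonly head: T; readonly tail: List<T> } | null;

      const cons = <T>(head: T, tail: List<T>): List<T> => ({ head, tail });

      const sum = (xs: List<number>): number =>
        xs === null ? 0 : xs.head + sum(xs.tail);

      const v1 = cons(2, cons(1, null)); // version 1: [2, 1]
      const v2 = cons(3, v1);            // version 2: [3, 2, 1], sharing v1 as its tail

      console.log(sum(v1)); // 3; v1 is untouched, so this result never changes
      console.log(sum(v2)); // 6
      ```
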
    1. What this means is: I better refrain from writing a new book and we rather focus on more and better docs.

      I'm glad. I didn't like that the book (which is essentially a form of documentation/tutorial) was proprietary.

      I think it's better for documentation and tutorials to be community-driven, free content.

    2. Rather, data is passed around from operation to operation, from step to step. We use OOP and inheritance solely for compile-time configuration. You define classes, steps, tracks and flows, inherit those, customize them using Ruby’s built-in mechanics, but this all happens at compile-time. At runtime, no structures are changed anymore, your code is executed dynamically but only the ctx (formerly options) and its objects are mutated. This massively improves the code quality and with it, the runtime stability.
    1. Miguel Andariego commented at this point:

      It would be better to start with DuckDuckGo so as not to promote the giant entirely.

    1. Kit Yates. (2021, January 22). Is this lockdown 3.0 as tough as lockdown 1? Here are a few pieces of data from the @IndependentSage briefing which suggest that despite tackling a much more transmissible virus, lockdown is less strict, which might explain why we are only just keeping on top of cases. [Tweet]. @Kit_Yates_Maths. https://twitter.com/Kit_Yates_Maths/status/1352662085356937216

    1. A fairly comprehensive list of problems and limitations that are often encountered with data as well as suggestions about who should be responsible for fixing them (from a journalistic perspective).

    2. Benford’s Law is a theory which states that small digits (1, 2, 3) appear at the beginning of numbers much more frequently than large digits (7, 8, 9). In theory Benford’s Law can be used to detect anomalies in accounting practices or election results, though in practice it can easily be misapplied. If you suspect a dataset has been created or modified to deceive, Benford’s Law is an excellent first test, but you should always verify your results with an expert before concluding your data has been manipulated.

      This is a relatively good explanation of Benford's law.

      I've come across the theory in advanced math, but I'm forgetting where I saw the proof. p-adic analysis perhaps? Look this up.
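
      As a rough sketch (TypeScript; the sample values are made up), a first-digit check just compares observed leading-digit frequencies with Benford's expected P(d) = log10(1 + 1/d):

      ```ts
      // Expected Benford frequencies for leading digits 1..9: P(d) = log10(1 + 1/d).
      const benfordExpected = Array.from({ length: 9 }, (_, i) => Math.log10(1 + 1 / (i + 1)));

      // Observed frequencies of the first significant digit in a dataset.
      function firstDigitFrequencies(values: number[]): number[] {
        const counts = new Array(9).fill(0);
        let n = 0;
        for (const v of values) {
          const digits = String(Math.abs(v)).replace(/[^1-9]/g, ""); // drop sign, zeros, '.', 'e'
          const d = digits.length > 0 ? Number(digits[0]) : NaN;
          if (d >= 1 && d <= 9) {
            counts[d - 1]++;
            n++;
          }
        }
        return counts.map((c) => (n > 0 ? c / n : 0));
      }

      // Large gaps between observed and expected are a prompt to dig deeper,
      // not proof of manipulation (the caveat in the quote above).
      const observed = firstDigitFrequencies([120, 13.4, 2900, 0.017, 45, 18, 1_000_000]);
      observed.forEach((f, i) =>
        console.log(`digit ${i + 1}: observed ${f.toFixed(2)}, expected ${benfordExpected[i].toFixed(2)}`)
      );
      ```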

    3. More journalistic outlets should be publishing data explainers about where their data and analysis come from so that readers can double-check them.

    4. There is no worse way to screw up data than to let a single human type it in.
    1. In America, the number of searches at the time of the lockdown in 2020 for boredom rose by 57 percent, loneliness by 16 percent and worry by 12 percent.

      #

    2. in Europe at the time of the lockdown in 2020 for boredom rose by 93 percent, loneliness 40 percent and worry 27 percent

      #

    1. Cytoscape is an open source software platform for visualizing complex networks and integrating these with any type of attribute data. A lot of Apps are available for various kinds of problem domains, including bioinformatics, social network analysis, and semantic web.
  3. Jan 2021
    1. 8/10 to 14/10 - Reading of the Module 8 texts; 14/10, 19:00 to 20:30 - In-person meeting to talk about the

      I believe the dates are wrong. What would the correct ones be?

    1. Data analysis, and the parts of statistics which adhere to it, must…take on the characteristics of science rather than those of mathematics…

      Is data analysis included in data science? If not, what is the relationship between them?

    1. This paper identifies and lists five reasons to follow the money in health care. These reasons are applicable to social services or other areas of philanthropy as well.

    1. ReconfigBehSci on Twitter: ‘RT @NatureNews: COVID curbed carbon emissions in 2020—But not by much, and new data show global CO2 emissions have rebounded: Https://t.c…’ / Twitter. (n.d.). Retrieved 20 January 2021, from https://twitter.com/SciBeh/status/1351840770823757824

    1. We could change the definition of Cons to hold references instead, but then we would have to specify lifetime parameters. By specifying lifetime parameters, we would be specifying that every element in the list will live at least as long as the entire list. The borrow checker wouldn’t let us compile let a = Cons(10, &Nil); for example, because the temporary Nil value would be dropped before a could take a reference to it.
    1. Why is CORS important? Currently, client-side scripts (e.g., JavaScript) are prevented from accessing much of the Web of Linked Data due to "same origin" restrictions implemented in all major Web browsers. While enabling such access is important for all data, it is especially important for Linked Open Data and related services; without this, our data simply is not open to all clients. If you have public data which doesn't require cookie or session based authentication to see, then please consider opening it up for universal JavaScript/browser access. For CORS access to anything other than simple, non auth protected resources
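
      A hedged sketch of what "opening it up" can look like server-side, assuming a Node/Express service publishing public, non-authenticated Linked Data (the path and payload are invented for illustration):

      ```ts
      import express from "express";

      const app = express();

      // Allow any origin to read this public, non-credentialed data from the browser.
      app.use((_req, res, next) => {
        res.setHeader("Access-Control-Allow-Origin", "*");
        next();
      });

      // Hypothetical Linked Data resource; a real service would serve its own graphs.
      app.get("/dataset.jsonld", (_req, res) => {
        res
          .type("application/ld+json")
          .send(JSON.stringify({ "@id": "http://example.org/resource" }));
      });

      app.listen(3000);
      ```
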
    1. Likewise, privacy is an important issue in BCI ethics since the captured neural signals can be used to gain access to a user’s private information. Ethicists have raised concerns about how BCI data is stored and protected.
    1. Alongside the companies that gather data, there are newly powerful companies that build the tools for organizing, processing, accessing, and visualizing it—companies that don’t take in the traces of our common life but set the terms on which it is sorted and seen. The scraping of publicly available photos, for instance, and their subsequent labeling by low-paid human workers, served to train computer vision algorithms that Palantir can now use to help police departments cast a digital dragnet across entire populations. 

      organizing the mass of information is the real tricky part

  4. Dec 2020
    1. What is a data-originated component? It’s a kind of component that is primarily designed and built for either: displaying, entering, or customizing a given data content itself, rather than focusing on the form it takes. For example Drawer is a non data-originated component, although it may include some. Whereas Table, or Form, or even Feed are good examples of data-originated components.
    1. Transitioning from teaching high school to teaching at the university, then coming to the community college, I've become very fascinated with how students move from one to the other.

      Interesting to see trends in the data and identify experiences that indicate continuity from high schools in the area to SPSCC (for instance, as Running Start students), and then to SMU. What are the various pathways by which students who enrolled at SPSCC decide to apply, get admitted, and secure funding? What is the percentage of transfers from SPSCC to SMU?

    1. “provenance” — broadly, where did data arise, what inferences were drawn from the data, and how relevant are those inferences to the present situation? While a trained human might be able to work all of this out on a case-by-case basis, the issue was that of designing a planetary-scale medical system that could do this without the need for such detailed human oversight.

      Data Provenance

      The discipline of thinking about:

      (1) Where did the data arise? (2) What inferences were drawn? (3) How relevant are those inferences to the present situation?

    2. There is a different narrative that one can tell about the current era. Consider the following story, which involves humans, computers, data and life-or-death decisions, but where the focus is something other than intelligence-in-silicon fantasies. When my spouse was pregnant 14 years ago, we had an ultrasound. There was a geneticist in the room, and she pointed out some white spots around the heart of the fetus. “Those are markers for Down syndrome,” she noted, “and your risk has now gone up to 1 in 20.” She further let us know that we could learn whether the fetus in fact had the genetic modification underlying Down syndrome via an amniocentesis. But amniocentesis was risky — the risk of killing the fetus during the procedure was roughly 1 in 300. Being a statistician, I determined to find out where these numbers were coming from. To cut a long story short, I discovered that a statistical analysis had been done a decade previously in the UK, where these white spots, which reflect calcium buildup, were indeed established as a predictor of Down syndrome. But I also noticed that the imaging machine used in our test had a few hundred more pixels per square inch than the machine used in the UK study. I went back to tell the geneticist that I believed that the white spots were likely false positives — that they were literally “white noise.” She said “Ah, that explains why we started seeing an uptick in Down syndrome diagnoses a few years ago; it’s when the new machine arrived.”

      Example of where a global system for inference on healthcare data fails due to a lack of data provenance.

    1. Treemaps are a visualization method for hierarchies based on enclosure rather than connection [JS91]. Treemaps make it easy to spot outliers (for example, the few large files that are using up most of the space on a disk) as opposed to parent-child structure.

      Treemaps visualize enclosure rather than connection. This makes them good for spotting outliers (e.g. large files on a disk), but not for understanding parent-child relationships.
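
      A toy sketch of the enclosure idea in TypeScript: each leaf gets a rectangle whose area is proportional to its size, so an unusually large file is immediately visible. Real treemaps nest rectangles for the hierarchy and use smarter layouts (e.g. squarified treemaps); this one-level "slice" layout and its sample data are only illustrative.

      ```ts
      interface Leaf { name: string; size: number }
      interface Rect extends Leaf { x: number; y: number; width: number; height: number }

      // Partition a width x height rectangle into vertical slices proportional to size.
      function sliceTreemap(leaves: Leaf[], width: number, height: number): Rect[] {
        const total = leaves.reduce((sum, l) => sum + l.size, 0);
        let x = 0;
        return leaves.map((l) => {
          const rect: Rect = { ...l, x, y: 0, width: (l.size / total) * width, height };
          x += rect.width;
          return rect;
        });
      }

      // The big file dominates the picture; the tiny note is barely a sliver.
      console.log(sliceTreemap(
        [{ name: "video.mov", size: 4_000_000 }, { name: "notes.txt", size: 2 }],
        800,
        600
      ));
      ```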

    1. One way to do that is to export them from @sapper/app directly, and rely on the fact that we can reset them immediately before server rendering to ensure that session data isn't accidentally leaked between two users accessing the same server.
    1. ReconfigBehSci @SciBeh (2020) For those who might think this issue isn't settled yet, the piece include below has further graphs indicating just how much "protecting the economy" is associated with "keeping the virus under control" Twitter. Retrieved from: https://twitter.com/i/web/status/1306216113722871808

    1. I haven't met anyone who makes this argument who then says that a one stop convenient, reliable, private and secure online learning environment can’t be achieved using common every day online systems

      Reliable: As a simple example, I'd trust Google to maintain data reliability over my institutional IT support.

      And you'd also need to make the argument for why learning needs to be "private", etc.

    1. And then there was what Lanier calls “data dignity”; he once wrote a book about it, called Who Owns the Future? The idea is simple: What you create, or what you contribute to the digital ether, you own.

      See Tim Berners-Lee's SOLID project.

  5. Nov 2020
    1. Identify, classify, and apply protective measures to sensitive data. Data discovery and data classification solutions help to identify sensitive data and assign classification tags dictating the level of protection required. Data loss prevention solutions apply policy-based protections to sensitive data, such as encryption or blocking unauthorized actions, based on data classification and contextual factors including file type, user, intended recipient/destination, applications, and more. The combination of data discovery, classification, and DLP enable organizations to know what sensitive data they hold and where while ensuring that it's protected against unauthorized loss or exposure.

      [[BEST PRACTICES FOR DATA EGRESS MANAGEMENT AND PREVENTING SENSITIVE DATA LOSS]]

    2. Egress filtering involves monitoring egress traffic to detect signs of malicious activity. If malicious activity is suspected or detected, transfers can be blocked to prevent sensitive data loss. Egress filtering can also limit egress traffic and block attempts at high volume data egress.
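
      A hedged sketch of policy-based egress filtering in TypeScript; the allow-list, volume threshold, and field names are invented for illustration and not taken from any particular product.

      ```ts
      interface OutboundTransfer {
        user: string;
        destination: string; // host the data is being sent to
        bytes: number;       // size of the transfer
        classification: "public" | "internal" | "confidential";
      }

      // Illustrative policy: block confidential data heading to unapproved hosts,
      // and block unusually large transfers regardless of destination.
      const APPROVED_DESTINATIONS = new Set(["backup.example.com", "partner.example.org"]);
      const MAX_BYTES_PER_TRANSFER = 100 * 1024 * 1024; // 100 MB

      function shouldBlock(t: OutboundTransfer): boolean {
        if (t.classification === "confidential" && !APPROVED_DESTINATIONS.has(t.destination)) {
          return true; // sensitive data going somewhere we do not recognise
        }
        if (t.bytes > MAX_BYTES_PER_TRANSFER) {
          return true; // high-volume egress is suspicious on its own
        }
        return false;
      }

      console.log(shouldBlock({
        user: "alice",
        destination: "files.unknown-host.net",
        bytes: 5_000_000,
        classification: "confidential",
      })); // true
      ```
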
    3. Data Egress vs. Data Ingress: While data egress describes the outbound traffic originating from within a network, data ingress, in contrast, refers to the reverse: traffic that originates outside the network that is traveling into the network. Egress traffic is a term used to describe the volume and substance of traffic transferred from a host network to an outside network.

      [[DATA EGRESS VS. DATA INGRESS]]

    4. Data Egress Meaning: Data egress refers to data leaving a network in transit to an external location. Outbound email messages, cloud uploads, or files being moved to external storage are simple examples of data egress. Data egress is a regular part of network activity, but can pose a threat to organizations when sensitive data is egressed to unauthorized recipients. Examples of common channels for data egress include: email, web uploads, cloud storage, removable media (USB, CD/DVD, external hard drives), and FTP/HTTP transfers.

      [[Definition/Data Egress]]

    5. What is Data Egress? Managing Data Egress to Prevent Sensitive Data Loss

      [[What is Data Egress? Managing Data Egress to Prevent Sensitive Data Loss]]

    1. Portable... your .name address works with any email or web service. With our automatic forwarding service on third level domains, you can change email accounts, your ISP, or your job without changing your email address. Any mail sent to your .name address arrives in any email box you choose.
    1. In-depth questions: The following interview questions enable the hiring manager to gain a comprehensive understanding of your competencies and assess how you would respond to issues that may arise at work: What are the most important skills for a data engineer to have? What data engineering platforms and software are you familiar with? Which computer languages can you use fluently? Do you tend to focus on pipelines, databases or both? How do you create reliable data pipelines? Tell us about a distributed system you've built. How did you engineer it? Tell us about a time you found a new use case for an existing database. How did your discovery impact the company positively? Do you have any experience with data modeling? What common data engineering maxim do you disagree with? Do you have a data engineering philosophy? What is a data-first mindset? How do you handle conflict with coworkers? Can you give us an example? Can you recall a time when you disagreed with your supervisor? How did you handle it?

      deeper dive into [[Data Engineer]] [[Interview Questions]]

    1. to be listed on Mastodon’s official site, an instance has to agree to follow the Mastodon Server Covenant which lays out commitments to “actively moderat[e] against racism, sexism, homophobia and transphobia”, have daily backups, grant more than one person emergency access, and notify people three months in advance of potential closure. These indirect methods are meant to ensure that most people who encounter a platform have a safe experience, even without the advantages of centralization.

      Some of these baseline protections are certainly a good idea. Advance notice of closure and daily backups seem particularly valuable.

      I hadn't known of the Mastodon Server Covenant before.

    1. The Hierarchy of Analytics: Among the many advocates who pointed out the discrepancy between the grinding aspect of data science and the rosier depictions that media sometimes portrayed, I especially enjoyed Monica Rogati’s call out, in which she warned against companies who are eager to adopt AI: "Think of Artificial Intelligence as the top of a pyramid of needs. Yes, self-actualization (AI) is great, but you first need food, water, and shelter (data literacy, collection, and infrastructure)." This framework puts things into perspective.

      [[the hierarchy of analytics]]

    1. aberrant behavior

      Is there data on hand that shows these companies actually prevent cheating? How many instances of 'aberrant behavior' actually materialize into cheating offenses?

    2. keystroke biometrics, ID capture, and facial analysis

      I feel like I'm seeing varying responses about what data is actually captured; these companies don't seem consistent about the types of data they collect.

    1. Maybe your dbt models depend on source data tables that are populated by Stitch ingest, or by heavy transform jobs running in Spark. Maybe the tables your models build are depended on by analysts building reports in Mode, or ML engineers running experiments using Jupyter notebooks. Whether you’re a full-stack practitioner or a specialized platform team, you’ve probably felt the pain of trying to track dependencies across technologies and concerns. You need an orchestrator. Dagster lets you embed dbt into a wider orchestration graph.

      It can be common for [[data models]] to rely on other sources; where something like [[Dagster]] fits in is letting your dbt models fit into a wider [[orchestration graph]].

    2. We love dbt because of the values it embodies. Individual transformations are SQL SELECT statements, without side effects. Transformations are explicitly connected into a graph. And support for testing is first-class. dbt is hugely enabling for an important class of users, adapting software engineering principles to a slightly different domain with great ergonomics. For users who already speak SQL, dbt’s tooling is unparalleled.

      When using [[dbt]], the [[transformations]] are [[SQL statements]] - already something our team knows.

    3. What is dbt? dbt was created by Fishtown Analytics to enable data analysts to build well-defined data transformations in an intuitive, testable, and versioned environment. Users build transformations (called models) defined in templated SQL. Models defined in dbt can refer to other models, forming a dependency graph between the transformations (and the tables or views they produce). Models are self-documenting, easy to test, and easy to run. And the dbt tooling can use the graph defined by models’ dependencies to determine the ancestors and descendants of any individual model, so it’s easy to know what to recompute when something changes.

      One of the [[benefits of [[dbt]]]] is that the [[data transformations]] or [[data models]] can refer to other models, which helps show the [[dependency graph]] between transformations.

    1. The attribution data model: In reality, it’s impossible to know exactly why someone converted to being a customer. The best thing that we can do as analysts, is provide a pretty good guess. In order to do that, we’re going to use an approach called positional attribution. This means, essentially, that we’re going to weight the importance of various touches (customer interactions with a brand) based on their position (the order they occur in within the customer’s lifetime). To do this, we’re going to build a table that represents every “touch” that someone had before becoming a customer, and the channel that led to that touch.

      One of the goals of an [[attribution data model]] is to understand why someone [[converted]] to being a customer. This is impossible to do accurately, but this is where analysis comes in.

      There are several [[approaches to attribution]]; one of them is [[positional attribution]].

      [[positional attribution]] weights the importance of touch points (customer interactions) based on their position within the customer's lifetime.
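
      A rough sketch of one common positional scheme, a U-shaped 40/20/40 split, in TypeScript. The weights, field names, and sample channels are assumptions for illustration; as the next excerpt argues, the logic should change based on the business.

      ```ts
      interface Touch { channel: string; occurredAt: string }

      // U-shaped positional attribution: first and last touches get 40% each,
      // the remaining 20% is spread evenly across the middle touches.
      function positionalAttribution(touches: Touch[]): Map<string, number> {
        const credit = new Map<string, number>();
        const add = (channel: string, w: number) =>
          credit.set(channel, (credit.get(channel) ?? 0) + w);

        if (touches.length === 0) return credit;
        if (touches.length === 1) { add(touches[0].channel, 1); return credit; }
        if (touches.length === 2) {
          add(touches[0].channel, 0.5);
          add(touches[1].channel, 0.5);
          return credit;
        }

        add(touches[0].channel, 0.4);
        add(touches[touches.length - 1].channel, 0.4);
        const middle = touches.slice(1, -1);
        for (const t of middle) add(t.channel, 0.2 / middle.length);
        return credit;
      }

      console.log(positionalAttribution([
        { channel: "organic_search", occurredAt: "2020-10-01" },
        { channel: "email", occurredAt: "2020-10-10" },
        { channel: "paid_social", occurredAt: "2020-10-20" },
      ])); // Map { organic_search => 0.4, email => 0.2, paid_social => 0.4 }
      ```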

    2. Marketers have been told that attribution is a data problem -- “Just get the data and you can have full knowledge of what’s working!” -- when really it’s a data modeling problem. The logic of your attribution model, what the data represents about your business, is as important as the data volume. And the logic is going to change based on your business. That’s why so many attribution products come up short.

      [[attribution isn't a data problem, it's a data modeling problem]] - it's not just the data, but what the data represents about your business.

    1. I increasingly don’t care for the world of centralized software. Software interacts with my data, on my computers. Its about time my software reflected that relationship. I want my laptop and my phone to share my files over my wifi. Not by uploading all my data to servers in another country. Especially if those servers are financed by advertisers bidding for my eyeballs.
  6. Oct 2020
    1. This is until you realize you're probably using at least ten different services, and they all have different purposes, with various kinds of data, endpoints and restrictions. Even if you have the capacity and are willing to do it, it's still damn hard.
    2. Hopefully we can agree that the current situation isn't so great. But I am a software engineer. And chances that if you're reading it, you're very likely a programmer as well. Surely we can deal with that and implement, right? Kind of, but it's really hard to retrieve data created by you.
    1. (d) All calculations shown in this appendix shall be implemented on a site-level basis. Site level concentration data shall be processed as follows: (1) The default dataset for PM2.5 mass concentrations for a site shall consist of the measured concentrations recorded from the designated primary monitor(s). All daily values produced by the primary monitor are considered part of the site record; this includes all creditable samples and all extra samples. (2) Data for the primary monitors shall be augmented as much as possible with data from collocated monitors. If a valid daily value is not produced by the primary monitor for a particular day (scheduled or otherwise), but a value is available from a collocated monitor, then that collocated value shall be considered part of the combined site data record. If more than one collocated daily value is available, the average of those valid collocated values shall be used as the daily value. The data record resulting from this procedure is referred to as the “combined site data record.”
      1. Calculate mean of all collocated NON-primary monitors' values per day
      2. Coalesce primary monitor value with this calculated mean
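
      A sketch of those two steps in TypeScript, assuming daily values keyed by date string. The data shapes are invented for illustration; the rule itself (the primary monitor's value wins, otherwise the mean of valid collocated values) follows the regulation text above.

      ```ts
      type DailyValues = Record<string, number | undefined>;

      // Build the "combined site data record": keep the primary monitor's daily value
      // when it exists, otherwise fall back to the mean of the collocated monitors.
      function combinedSiteRecord(primary: DailyValues, collocated: DailyValues[]): DailyValues {
        const dates = new Set([
          ...Object.keys(primary),
          ...collocated.flatMap((m) => Object.keys(m)),
        ]);
        const combined: DailyValues = {};
        for (const date of dates) {
          const primaryValue = primary[date];
          if (primaryValue !== undefined) {
            combined[date] = primaryValue; // primary monitor value takes precedence
            continue;
          }
          const values = collocated
            .map((m) => m[date])
            .filter((v): v is number => v !== undefined);
          if (values.length > 0) {
            combined[date] = values.reduce((a, b) => a + b, 0) / values.length; // mean of collocated
          }
        }
        return combined;
      }
      ```
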
    1. Institutions that were primarily online before the pandemic are also doing well. At colleges where more than 90 percent of students took courses solely online pre-pandemic, enrollments are growing for both undergraduate (6.8 percent) and graduate students (7.2 percent).
    1. If you define a variable outside of your form, you can then set the value of that variable to the handleSubmit function that 🏁 React Final Form gives you, and then you can call that function from outside of the form.
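
      A hedged sketch of that pattern using 🏁 React Final Form's render-prop API; the component, field, and variable names are placeholders.

      ```tsx
      import React from "react";
      import { Form, Field } from "react-final-form";

      // A variable defined outside the form; the render prop assigns handleSubmit to it
      // on every render, so e.g. a toolbar button elsewhere can trigger submission.
      let submitMyForm: (() => void) | undefined;

      export const MyForm = () => (
        <Form onSubmit={(values) => console.log("submitting", values)}>
          {({ handleSubmit }) => {
            submitMyForm = handleSubmit;
            return (
              <form onSubmit={handleSubmit}>
                <Field name="email" component="input" placeholder="Email" />
              </form>
            );
          }}
        </Form>
      );

      // Somewhere outside the form component:
      // <button onClick={() => submitMyForm?.()}>Save</button>
      ```
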
    1. We could freeze the objects in the model but don't for efficiency. (The benefits of an immutable-equivalent data structure will be documented in vtree or blog post at some point)

      first sighting: "immutable-equivalent data"

    2. A VTree is designed to be equivalent to an immutable data structure. While it's not actually immutable, you can reuse the nodes in multiple places and the functions we have exposed that take VTrees as arguments never mutate the trees.
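
      A small sketch of what reusing nodes across trees can look like with virtual-dom's h() helper; illustrative only, not the library's prescribed style.

      ```ts
      import { h } from "virtual-dom";

      // Build a vnode once...
      const icon = h("span.icon", {}, ["★"]);

      // ...and reuse it in several trees. Because the exposed functions never mutate
      // VTrees, sharing the same node is as safe as sharing an immutable value.
      const favouritesView = h("div.favourites", {}, [icon, " Favourites"]);
      const listItemView = h("li", {}, [icon, " Starred item"]);
      ```
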
    1. We don't know if the passed in props is a user created object that can be mutated so we must always clone it once.