3,382 Matching Annotations
  1. Nov 2022
    1. https://whatever.scalzi.com/2022/11/25/how-to-weave-the-artisan-web/

      “But Scalzi,” I hear you say, “How do we bring back that artisan, hand-crafted Web?” Well, it’s simple, really, and if you’re a writer/artist/musician/other sort of creator, it’s actually kind of essential:

    1. Our annotators achieve the highest precision with OntoNotes, suggesting that most of the entities identified by crowdworkers are correct for this dataset.

      Interesting that the mention detection algorithm gives poor precision on OntoNotes while the annotators get high precision. Does this imply that there are a lot of invalid mentions in this data, and that the OntoNotes guidelines are correct to ignore generic pronouns without pronominals?

    2. an algorithm with high precision on LitBank or OntoNotes would miss a huge percentage of relevant mentions and entities on other datasets (constraining our analysis)

      these datasets have the most limited/constrained definitions for co-reference and what should be marked up so it makes sense that precision is poor in these datasets

    3. Procedure: We first launch an annotation tutorial (paid $4.50) and recruit the annotators on the AMT platform. At the end of the tutorial, each annotator is asked to annotate a short passage (around 150 words). Only annotators with a B3 score (Bagga

      Annotators are asked to complete a quality control exercise and only annotators who achieve a B3 score of 0.9 or higher are invited to do more annotation
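The quality gate described here can be sketched concretely. Below is a minimal, illustrative Python implementation of B-cubed (B3) precision/recall — my own toy version, not the paper's code — where a clustering is a list of sets of mention ids:

```python
# Toy B-cubed (B3) scorer: for each mention, compare the cluster the
# annotator put it in against the gold cluster containing it.

def b_cubed(response, key):
    """Return (precision, recall) of `response` clustering against `key`."""
    def cluster_of(clustering):
        # map each mention id to the set (cluster) that contains it
        return {m: c for c in clustering for m in c}

    resp, gold = cluster_of(response), cluster_of(key)
    mentions = resp.keys() & gold.keys()
    precision = sum(len(resp[m] & gold[m]) / len(resp[m]) for m in mentions) / len(mentions)
    recall = sum(len(resp[m] & gold[m]) / len(gold[m]) for m in mentions) / len(mentions)
    return precision, recall

# perfect agreement scores 1.0 on both axes
gold = [{"m1", "m2"}, {"m3"}]
print(b_cubed([{"m1", "m2"}, {"m3"}], gold))  # (1.0, 1.0)
```

With a 0.9 threshold on such a score, annotators who merge or split gold clusters too aggressively would be screened out of further annotation.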

    4. Annotation structure: Two annotation approaches are prominent in the literature: (1) a local pairwise approach, annotators are shown a pair of mentions and asked whether they refer to the same entity (Hladká et al., 2009; Chamberlain et al., 2016a; Li et al., 2020; Ravenscroft et al., 2021), which is time-consuming; or (2) a cluster-based approach (Reiter, 2018; Oberle, 2018; Bornstein et al., 2020), in which annotators group all mentions of the same entity into a single cluster. In ezCoref we use the latter approach, which can be faster but requires the UI to support more complex actions for creating and editing cluster structures.

      ezCoref presents clusters of coreferences all at the same time - this is a nice efficient way to do annotation versus pairwise annotation (like we did for CD^2CR)

    5. However, these datasets vary widely in their definitions of coreference (expressed via annotation guidelines), resulting in inconsistent annotations both within and across domains and languages. For instance, as shown in Figure 1, while ARRAU (Uryupina et al., 2019) treats generic pronouns as non-referring, OntoNotes chooses not to mark them at all

      One of the big issues is that different co-reference datasets have significant differences in annotation guidelines even within the coreference family of tasks - I found this quite shocking as one might expect coreference to be fairly well defined as a task.

    6. Specifically, our work investigates the quality of crowdsourced coreference annotations when annotators are taught only simple coreference cases that are treated uniformly across existing datasets (e.g., pronouns). By providing only these simple cases, we are able to teach the annotators the concept of coreference, while allowing them to freely interpret cases treated differently across the existing datasets. This setup allows us to identify cases where our annotators disagree among each other, but more importantly cases where they unanimously agree with each other but disagree with the expert, thus suggesting cases that should be revisited by the research community when curating future unified annotation guidelines

      The aim of the work is to examine a simplified subset of co-reference phenomena which are generally treated the same across different existing datasets.

      This makes spotting inter-annotator disagreement easier - presumably because for simpler cases there are fewer modes of failure?

    7. In this work, we develop a crowdsourcing-friendly coreference annotation methodology, ezCoref, consisting of an annotation tool and an interactive tutorial. We use ezCoref to re-annotate 240 passages from seven existing English coreference datasets (spanning fiction, news, and multiple other domains) while teaching annotators only cases that are treated similarly across these datasets

      This paper describes a new, efficient coreference annotation tool which simplifies coreference annotation. They use their tool to re-annotate passages from widely used coreference datasets.

    1. An independent initiative by Owen Cornec, who has also made many other beautiful data visualizations. Wikiverse vividly captures the fact that Wikipedia is an awe-inspiring universe to explore.

    1. One example could be putting all files into an Amazon S3 bucket. It’s versatile, cheap and integrates with many technologies. If you are using Redshift for your data warehouse, it has great integration with that too.

      Essentially the raw data needs to be vaguely homogenised and put into a single place

    1. Dr. Miho Ohsaki re-examined work she and her group had previously published and confirmed that the results are indeed meaningless in the sense described in this work (Ohsaki et al., 2002). She has subsequently been able to redefine the clustering subroutine in her work to allow more meaningful pattern discovery (Ohsaki et al., 2003)

      Look into what Dr. Miho Ohsaki changed about the clustering subroutine in her work and how it allowed for "more meaningful pattern discovery"

    2. Eamonn Keogh is an assistant professor of Computer Science at the University of California, Riverside. His research interests are in Data Mining, Machine Learning and Information Retrieval. Several of his papers have won best paper awards, including papers at SIGKDD and SIGMOD. Dr. Keogh is the recipient of a 5-year NSF Career Award for “Efficient Discovery of Previously Unknown Patterns and Relationships in Massive Time Series Databases”.

      Look into Eamonn Keogh's papers that won "best paper awards"

    1. It took me a while to grok where dbt comes in the stack but now that I (think) I have it, it makes a lot of sense. I can also see why, with my background, I had trouble doing so. Just as Apache Kafka isn’t easily explained as simply another database, another message queue, etc, dbt isn’t just another Informatica, another Oracle Data Integrator. It’s not about ETL or ELT - it’s about T alone. With that understood, things slot into place. This isn’t just my take on it either - dbt themselves call it out on their blog:

      Also - just because their "pricing" page caught me off guard and their website isn't that clear (until you click through to the technical docs) - I thought it worth calling out that dbt appears to be an open-core platform. They have a SaaS offering and also an open source Python command-line tool - it seems that these articles are about the latter

    2. Of course, despite what the "data is the new oil" vendors told you back in the day, you can’t just chuck raw data in and assume that magic will happen on it, but that’s a rant for another day ;-)

      Love this analogy - imagine chucking some crude into a black box and hoping for ethanol at the other end. Then, when you end up with diesel you have no idea what happened.

    3. Working with the raw data has lots of benefits, since at the point of ingest you don’t know all of the possible uses for the data. If you rationalise that data down to just the set of fields and/or aggregate it up to fit just a specific use case then you lose the fidelity of the data that could be useful elsewhere. This is one of the premises and benefits of a data lake done well.

      absolutely right - there's also a data provenance angle here - it is useful to be able to point to a data point that is 5 or 6 transformations from the raw input and be able to say "yes I know exactly where this came from, here are all the steps that came before"

    1. binary string (i.e., a string in which each character in the string is treated as a byte of binary data)
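A quick Python illustration of this definition: a `bytes` object treats each element as a raw byte value with no text encoding attached, in contrast to a text string:

```python
# "binary string": each character is just a byte value, no text encoding
data = bytes([72, 101, 108, 108, 111])
print(data)        # b'Hello'
print(data[0])     # 72 -- indexing yields the raw byte value
print(data.hex())  # 48656c6c6f

# contrast with a text string, where one character may span multiple bytes
text = "héllo"
print(len(text))                  # 5 characters
print(len(text.encode("utf-8")))  # 6 bytes
```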
    1. okay so remind you what is a sheath so a sheep is something that allows me to 00:05:37 translate between physical sources or physical realms of data and physical regions so these are various 00:05:49 open sets or translation between them by taking a look at restrictions overlaps 00:06:02 and then inferring

      Fixed typos in transcript:

      Just generally speaking, what can I do with this sheaf-theoretic data structure that I've got? Okay, [I'll] remind you what is a sheaf. A sheaf is something that allows me to translate between physical sources or physical realms of data [in the left diagram] and the data that are associated with those physical regions [in the right diagram]

      So these [on the left] are various open sets [an example being] simplices in a [simplicial complex which is an example of a] topological space.

      And these [on the right] are the data spaces and I'm able to make some translation between [the left and the right diagrams] by taking a look at restrictions of overlaps [a on the left] and inferring back to the union.

      So that's what a sheaf is [regarding data structures]. It's something that allows me to make an inference, an inferential machine.
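The restriction/overlap/gluing pattern the speaker describes can be caricatured in a few lines of Python. This is my own toy sketch, not from the talk: data over two overlapping "open sets" is restricted to their overlap, checked for agreement, and glued into a section over the union:

```python
# Toy sheaf-style consistency check: sections are dicts from points of an
# open set to data values; restriction is just dict restriction.

def restrict(section, subset):
    return {point: section[point] for point in subset}

def glue(section_u, section_v, overlap):
    # sections glue to a section over the union only if their
    # restrictions to the overlap agree
    if restrict(section_u, overlap) != restrict(section_v, overlap):
        raise ValueError("sections disagree on the overlap; no global section")
    return {**section_u, **section_v}

U = {"a": 1, "b": 2}      # data over open set U = {a, b}
V = {"b": 2, "c": 3}      # data over open set V = {b, c}
print(glue(U, V, {"b"}))  # {'a': 1, 'b': 2, 'c': 3}
```

The "inferential machine" reading is that agreement on overlaps licenses the inference back to a consistent global assignment.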

    1. I also think being able to self-host and export parts of your data to share with others would be great.

      This might be achievable through the Holochain application framework. One promising project built on Holochain is Neighbourhoods. Their "Social-Sensemaker Architecture" across "neighbourhoods" is intriguing

    1. with Prisma you never create application models in your programming language by manually defining classes, interfaces, or structs. Instead, the application models are defined in your Prisma schema
    1. high friction and cost of discovering, understanding, trusting, and ultimately using quality data. If not addressed, this problem only exacerbates with data mesh, as the number of places and teams who provide data - domains - increases.

      Another link to https://frictionlessdata.io/

    1. building common infrastructure

      A solution to the duplication of effort and data.

    2. A data product owner makes decisions around the vision and the roadmap for the data products, concerns herself with satisfaction of her consumers and continuously measures and improves the quality and richness of the data her domain owns and produces. She is responsible for the lifecycle of the domain datasets, when to change, revise and retire data and schemas. She strikes a balance between the competing needs of the domain data consumers.

      Resembles the roles and responsibilities of our data stewards.

    1. CEO Mike Tung was on the Data Science podcast. He seems to be solving a problem that Google Search doesn't: how seriously should you take the results that come up? What confidence do you have in their truth or falsity?

  2. Oct 2022
    1. only by examining a constellation of metrics in tension can we understand and influence developer productivity

      I love this framing! In my experience companies don't generally acknowledge that metrics can be in tension, which usually means they're only tracking a subset of the metrics they ought to be if they want to have a more complete/realistic understanding of the state of things.

    1. Software engineers typically stay at one job for an average of two years before moving somewhere different. They spend less than half the amount of time at one company compared to the national average tenure of 4.2 years.
    2. The average performance pay rise for most employees is 3% a year. That is minuscule compared to the 14.8% pay raise the average person gets when they switch jobs.
    1. There are a lot of PostgreSQL servers connected to the Internet: we searched shodan.io and obtained a sample of more than 820,000 PostgreSQL servers connected to the Internet between September 1 and September 29. Only 36% of the servers examined had SSL certificates. More than 523,000 PostgreSQL servers listening on the Internet did not use SSL (64%)
    2. At most 15% of the approximately 820,000 PostgreSQL servers listening on the Internet require encryption. In fact, only 36% even support encryption. This puts PostgreSQL servers well behind the rest of the Internet in terms of security. In comparison, according to Google, over 96% of page loads in Chrome on a Mac are encrypted. The top 100 websites support encryption, and 97 of those default to encryption.
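For context on how such a scan can tell whether a server supports encryption: PostgreSQL's wire protocol opens with an SSLRequest message (Int32 length 8 followed by the magic code 80877103), and the server replies with a single byte, `S` if it will negotiate TLS or `N` if not. A minimal probe (the hostname below is a placeholder, not from the article) might look like:

```python
# Probe whether a PostgreSQL server is willing to negotiate TLS by sending
# the protocol's SSLRequest message and reading the one-byte reply.
import socket
import struct

# Int32 length (8) followed by the SSLRequest code 80877103
SSL_REQUEST = struct.pack("!II", 8, 80877103)

def supports_tls(host, port=5432, timeout=5):
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(SSL_REQUEST)
        return sock.recv(1) == b"S"  # b"N" means TLS is not supported

# supports_tls("db.example.com")  # True if the server will negotiate TLS
```

Note that a server answering `S` merely *supports* encryption; whether it *requires* it depends on its `pg_hba.conf` rules, which is why the two percentages in the quote differ.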
    1. one recognizes in the tactile reality that so many of the cards are on flimsy copy paper, on the verge of disintegration with each use.

      Deutsch used flimsy copy paper, much like Niklas Luhmann, and as a result some are on the verge of disintegration through use over time.

      The wear of the paper here, however, is indicative of active use over time as well as potential care in use, a useful historical fact.

    1. In the event of non-compliance with the Act, the Commission d'accès à l'information will be able to impose significant penalties, which could reach up to $25M or 4% of worldwide turnover. The penalty will be proportionate to, among other things, the severity of the breach and the company's ability to pay.
    1. Noting the dates of available materials within archives or sources can be useful on bibliography notes for either planning or revisiting sources. (p16,18)

      Similarly one ought to note missing dates, data, volumes, or resources at locations to prevent unfruitfully looking for data in these locations or as a note to potentially look for the missing material in other locations. (p16)

  3. Sep 2022
    1. First, to clarify - what is "code", what is "data"? In this article, when I say "code", I mean something a human has written, that will be read by a machine (another program or hardware). When I say "data", I mean something a machine has written, that may be read by a machine, a human, or both. Therefore, a configuration file where you set logging.level = DEBUG is code, while virtual machine instructions emitted by a compiler are data. Of course, code is data, but I think this over-simplified view (humans write code, machines write data) will serve us best for now...
    1. The authors propose, based on these experiences, that the cause of a number of unexpected difficulties in human-computer interaction lies in users’ unwillingness or inability to make structure, content, or procedures explicit

      I'm curious if this is because of unwillingness or difficulty.

  4. Aug 2022
    1. In practice, a system in which different parts of the web have different capabilities cannot insist on bidirectional links. Imagine, for example, the publisher of a large and famous book to which many people refer but who has no interest in maintaining his end of their links or indeed in knowing who has referred to the book.

      Why it's pointless to insist that links should have been bidirectional: it's unenforceable.

    1. If the key, or the device on which it is stored, is compromised, or if a vulnerability can be exploited, then the data asset can be irrevocably stolen

      Another scenario: if the key or the device storing it is compromised, or if a vulnerability is exploited, then the data asset can be stolen.

    2. If a key is lost, this invariably means that the secured data asset is irrevocably lost

      The counterpart - be careful! If a key is lost, the secured data asset is irrevocably lost

    1. Benjy Renton. (2021, November 16). New data update: Drawing from 23 states reporting data, 5.3% of kids ages 5-11 in these states have received their first dose. Vermont leads these states so far in vaccination rates for this age group—17%. The CDC will begin to report data for this group late this week. Https://t.co/LMJXl6lo6Z [Tweet]. @bhrenton. https://twitter.com/bhrenton/status/1460638150322180098

    1. Yaniv Erlich. (2021, December 8). Updated table of Omicron neuts studies with @Pfizer results (which did the worst job in terms of reporting raw data). Strong discrepancy between studies with live vs pseudo. Https://t.co/InQuWMAm4l [Tweet]. @erlichya. https://twitter.com/erlichya/status/1468580675007795204

    1. John Burn-Murdoch. (2021, November 25). Five quick tweets on the new variant B.1.1.529 Caveat first: Data here is very preliminary, so everything could change. Nonetheless, better safe than sorry. 1) Based on the data we have, this variant is out-competing others far faster than Beta and even Delta did 🚩🚩 https://t.co/R2Ac4e4N6s [Tweet]. @jburnmurdoch. https://twitter.com/jburnmurdoch/status/1463956686075580421

    1. The bibliography should be placed next after the table of contents, because the instructor always wishes to examine it before reading the text of the essay.

      Surprising! Particularly since bibliographies traditionally come at the end.

      Though for teaching purposes, I can definitely see a professor wanting it up front. I also frequently skim through bibliographies before starting to read works now, though I didn't do this in the past. Reading a bibliography first is an excellent way to establish common context with an author, however.

    1. NETGEAR is committed to providing you with a great product and choices regarding our data processing practices. You can opt out of the use of the data described above by contacting us at analyticspolicy@netgear.com

      You may opt out of these data use situations by emailing analyticspolicy@netgear.com.

    2. Marketing. For example, information about your device type and usage data may allow us to understand other products or services that may be of interest to you.

      All of the information above that has been consented to, can be used by NetGear to make money off consenting individuals and their families.

    3. USB device

      This gives Netgear permission to know what you plug into your computer, be it a FitBit, a printer, scanner, microphone, headphones, webcam — anything not attached to your computer.

    1. I like to think of thoughts as streaming information, so I don’t need to tag and categorize them as we do with batched data. Instead, using time as an index and sticky notes to mark slices of info solves most of my use cases. Graph notebooks like Obsidian think of information as batched data. So you have a set of notes (samples) that you try to aggregate, categorize, and connect. Sure there’s a use case for that: I can’t imagine a company wiki presented as streaming info! But I don’t think it aids me in how I usually think. When thinking with pen and paper, I prefer managing streamed information first, then converting it into batched information later— a blog post, documentation, etc.

      There's an interesting dichotomy between streaming information and batched data here, but it isn't well delineated and doesn't add much to the discussion as a result. Perhaps distilling it down may help? There's a kernel of something useful here, but it isn't immediately apparent.

      Relation to stock and flow or the idea of the garden and the stream?

    1. https://app.idx.us/en-US/services/credit-management

      Seems a bit ironic just how much data a credit monitoring service wants in order to help monitor your data on the dark web. So many companies have had data breaches; I can only wonder how long it may be before a company like IDX has a breach of its own databases?

      The credit reporting agencies should opt everyone into these sorts of protections automatically given the number of breaches in the past.

  5. Jul 2022
    1. AI text generator, a boon for bloggers? A test report

      While I wanted to investigate AI text generators further, I ended up writing a test report. I was quite stunned, because the AI text generator turns out to be able to create a fully cohesive and to-the-point article in minutes. Here is the test report.

    1. List management TweetDeck allows you to manage your Lists easily in one centralized place for all your accounts. You can create Lists in TweetDeck filtered by your interests or by particular accounts. Any List that you have set up or followed previously can also be added as separate columns in TweetDeck. To create a List on TweetDeck: From the navigation bar, click on the plus icon to select Add column, then click on Lists. Click the Create List button. Select the Twitter account you would like to create the List for. Name the List and give it a description, then select if you would like the List to be publicly visible or not (other people can follow your public Lists). Click Save. Add suggested accounts or search for users to add members to your List, then click Done. To edit a List on TweetDeck: Click on Lists from the plus icon in the navigation bar. Select the List you would like to edit. Click Edit. Add or remove List members or click Edit Details to change the List name, description, or account. You can also click Delete List. When you're finished making changes, click Done. To designate a List to a column: Click on the plus icon to select Add column. Click on the Lists option from the menu. Select which List you would like to make into a column. Click Add Column. To use a particular List in search: Add a search column, then click the filter icon to open the column filter options. Click the icon to open the User filter. Select By members of List and type the account name followed by the List name. You can only search across your own Lists, or others’ public Lists.

      While you still can, I'd highly encourage you to use TweetDeck's "Export" List function to save plain text lists of the @ names in your... Lists.

    1. The documents highlight the massive scale of location data that government agencies including CBP and ICE received, and how the agencies sought to take advantage of the mobile advertising industry’s treasure trove of data.
    1. Location tracking is just one part of a panoply of data-collection practices that are now center stage in the abortion debate, along with people’s online search histories and information from period-tracking apps.
    1. Documentation

      The problem with this section is that it downgrades data models to mere documentation. The ontologies of OntoPiA (I am talking about models, not so much data such as controlled vocabularies) are machine-readable. So it is not just a matter of documenting the syntax or content of the data. It is about making the model actionable, i.e. readable and interpretable by the machines themselves. I could perfectly well document datasets with a nice little table in GitHub or with many tables in a beautiful PDF (documentation), but that is not the same as making an ontology available for the data. Making models an active part of data management (as with ontologies) means enabling the inference you invoked above (improperly, in my view), but also using them for explainable AI and many other purposes. This is a fundamental concept that cannot be treated this way in national guidelines. It should instead have its own dedicated chapter, given its importance also for the "compliance" data-quality characteristic of the ISO/IEC 25012 standard.

    2. In case a), the entity has all the elements needed to represent its own data model; conversely, in cases b) and c), the administration itself, in agreement with AgID, assesses whether to extend the data model at the national level.

      The whole data-modelling part, including through the national catalogue of ontologies and controlled vocabularies, now seems to be in the hands of ISTAT, which, together with the Department for Digital Transformation, is responsible for schema.gov.it. Here, however, AgID seems to have the role of defining the various models. In my view this creates confusion. There should be coordination with the other administrations to understand clearly who does what. At present, for OntoPiA, AgID only manages a physical infrastructure.

    3. Using the RDF framework, one can build a semantic graph, also known as a knowledge graph, which machines can traverse by resolving, i.e. dereferencing, HTTP URIs. This means it is possible to automatically extract information and thus derive additional informational content (inference).

      You do not perform inference simply because you dereference URIs. I suggest reading carefully the guidelines for semantic interoperability through linked open data, which explain what inference is (and that, yes, is part of an enrichment process in the linked open data world). Inference is something more complex that can be done with automated reasoners and SPARQL queries. New information can be deduced from existing data and, above all, from ontologies, which are machine-readable objects!
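To make the point about inference concrete: in the linked-data world, inference derives new triples from existing data plus ontology axioms. Below is a toy, plain-Python forward-chaining of one RDFS rule (if ?x has type ?c and ?c is a subclass of ?d, then ?x has type ?d) — illustrative only; real systems use reasoners or SPARQL engines:

```python
# Toy forward-chaining over a set of (subject, predicate, object) triples,
# applying the RDFS subclass rule until no new triples are derived.

def infer(triples):
    derived = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in derived:
            if p == "rdf:type":
                for s2, p2, o2 in derived:
                    if p2 == "rdfs:subClassOf" and s2 == o:
                        new.add((s, "rdf:type", o2))
        if not new <= derived:
            derived |= new
            changed = True
    return derived

data = {
    ("ex:Rome", "rdf:type", "ex:City"),
    ("ex:City", "rdfs:subClassOf", "ex:Place"),
}
# the ontology axiom lets us deduce a triple nobody wrote down
print(("ex:Rome", "rdf:type", "ex:Place") in infer(data))  # True
```

The derived triple comes from the ontology, not from dereferencing anything — which is the distinction the annotation is drawing.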

  6. Jun 2022
    1. Lastly, said datasheet should outline some ethical considerations of the data.

      I think this question speaks to one of the essential aspects of the data. In my interaction with the datasheet, I mostly focused on the absence of the data, but I think I missed this key puzzle piece in the big picture of why the data is not there. I assumed what was responsible for the non-existence of the information without pondering possible answers to this one key question. It is indeed crucial to look into the current condition of the item and/or the collections that include it. If the artwork is not as well preserved as others, that can mean more effort is needed to save it from lacking even more data in the future.

    1. Another important distinction is between data and metadata. Here, the term “data” refers to the part of a file or dataset which contains the actual representation of an object of inquiry, while the term “metadata” refers to data about that data: metadata explicitly describes selected aspects of a dataset, such as the time of its creation, or the way it was collected, or what entity external to the dataset it is supposed to represent.

      This part is notably helpful for understanding the differences that separate "metadata" from "data". I was writing a blog post for my weekly assignment. Knowing that data is the representation of the object and that metadata describes information about the data helps build the definition of the terms in my schema of knowledge. In many cases, metadata even provides resources that give insights into how the data was collected and/or introduces possible perspectives on how the data can be seen/utilized in the future. Data can survive without metadata, but metadata won't exist without the data. However, data that lacks metadata may stay uncracked and ciphered, potentially becoming useless to the fundamental and economic growth of human beings.

    1. Companies need to actually have an ethics panel, and discuss what the issues are and what the needs of the public really are. Any ethics board must include a diverse mix of people and experiences. Where possible, companies should look to publish the results of these ethics boards to help encourage public debate and to shape future policy on data use.

    1. Most of us are familiar with data visualization: charts, graphs, maps and animations that represent complex series of numbers. But visualization is not the only way to explain and present data. Some scientists are trying to sonify storms with global weather data. It could be easier to get a sense of interrelated storm dynamics by hearing them.

    1. It is important to note that in practice it is sometimes considered necessary to go through traditional representation models, such as the relational one, for data modelling, applying suitable transformations in order to then make the data available according to Linked Open Data principles. However, this practice is not necessarily the most appropriate: there are situations in which it may be more convenient to start from an ontology of the domain to be modelled and from the use of semantic web standards in order to govern data management processes.

      Honestly, I find no value in what is written here. Many more systems are now natively linked open data, so besides the fact that speaking of linked open data as enrichment is wrong, I would drop this passage altogether.

    2. they use various standards and techniques, including the RDF framework

      I would rephrase as: "they are based on various standards, including RDF, and often use RDF controlled vocabularies to represent controlled terminology of the relevant application domain"

    3. to four-star data formats such as RDF serializations or JSON-LD

      JSON-LD is an RDF serialization in the JSON world. Beware that here the Italian translation of the Publications Office document did not come out well (they say "data format such as RDF or JSON-LD", which would itself be imprecise: RDF is a model for representing data on the Web, and RDF serializations are things like N-Triples, RDF/Turtle, RDF/XML, JSON-LD). Incidentally, in the technical annex on open data formats, text taken from the previous guidelines, JSON-LD is listed as an RDF serialization.

    4. linked data

      Are they open or not?

    5. linking is a very important capability and can in fact be considered a particular form of enrichment. Its particularity is that the enrichment happens through interlinking between datasets of different origin, typically between different administrations or institutions, but also, in the limit, within a single administration

      There is a fundamental conceptual problem here. The Linked Open Data paradigm has been downgraded to enrichment, which in the guidelines cited here was only one phase of a general process for managing linked open data. Doing linked open data does not just mean enriching data: data can be managed natively as linked open data from its very creation. That was the spirit of the guidelines cited here. By extracting only one part, you have rather distorted the whole thing. I recommend treating the topic as it was treated in the previous guidelines. It is also a pity that the "metro map" figure, which helped a lot, has disappeared.

    6. As noted, linking data can increase its value by creating new relationships and thus enabling new kinds of analysis.

      In any case, given everything Italy has written on linked open data, I would make an extra effort to write sentences that are not literally a word-for-word Italian translation of the English document.

    1. The reason these apps are great for such a broad range of use cases is they give users really strong data structures to work within.

      Inside the very specific realm of personal knowledge bases, TiddlyWiki is the killer app when it comes to using blocks and having structured, translatable data behind them.

    1. 80% of data analysis is spent on the process of cleaning and preparing the data

      Imagine having unnecessary and wrong data in your document: you would most likely have to go through every single row and column to eliminate this "garbage data". Clearly, owning all kinds of data without organizing it feels like stuffing your closet with clothes that you should have donated 5 years ago. It is a time-consuming and soul-destroying process. Luckily, in R we have the "tidyverse" package, which I believe the author talks about in the next paragraph, to make life easier for everyone. I personally use dplyr and ggplot2 when I deal with data cleaning, and they are extremely helpful. Without these packages, I have no idea when I would be able to reach the final step of data visualization.
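For readers outside the R ecosystem, the cleaning step the annotation describes can be sketched in plain Python (the column names are invented for illustration): drop rows with missing fields and normalize types before any analysis:

```python
# Minimal filter-and-normalize pass over messy tabular records:
# drop rows with missing values, coerce strings to proper types.

rows = [
    {"name": "Ada", "age": "36"},
    {"name": "Bob", "age": ""},    # missing value -> drop
    {"name": None,  "age": "54"},  # missing value -> drop
    {"name": "Eve", "age": "29"},
]

clean = [
    {"name": r["name"], "age": int(r["age"])}
    for r in rows
    if r["name"] and r["age"]
]
print(clean)  # [{'name': 'Ada', 'age': 36}, {'name': 'Eve', 'age': 29}]
```

dplyr's `filter`/`mutate` pipelines express the same idea far more compactly over whole data frames, which is much of tidyverse's appeal.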

    1. On a new clone of the Canva monorepo, git status takes 10 seconds on average while git fetch can take anywhere from 15 seconds to minutes due to the number of changes merged by engineers.
    2. Over the last 10 years, the code base has grown from a few thousand lines to just under 60 million lines of code in 2022. Every week, hundreds of engineers work across half a million files generating close to a million lines of change (including generated files), tens of thousands of commits, and merging thousands of pull requests.
    1. The goal is to gain “digital sovereignty.”

      the age of borderless data is ending. What we're seeing is a move to digital sovereignty

    1. nothing is permanent in the digital world

      Either ironic or maybe not the best advice when suggesting people might choose something like Notion or Evernote which could disappear with your data...

    1. 23.0G com.txt # 23 gigs uncompressed

      23 GB txt file <--- list of all the existing .com domains
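A file that size can still be processed in constant memory by streaming it line by line; a small sketch, where the path and search string are illustrative rather than taken from the source:

```python
# Stream a multi-gigabyte text file (e.g. a dump of .com domains)
# one line at a time so memory use stays flat regardless of file size.
def count_matches(path, needle):
    count = 0
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:          # the file object iterates lazily
            if needle in line:
                count += 1
    return count

# e.g. count_matches("com.txt", "example")
```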

    1. https://www.youtube.com/watch?v=bWkwOefBPZY

      Some of the basic outline of this looks like OER (Open Educational Resources) and its "five Rs": Retain, Reuse, Revise, Remix and/or Redistribute content. (To which I've already suggested the sixth: Request update (or revision control).)

      Some of this is similar to:

      The Read Write Web is no longer sufficient. I want the Read Fork Write Merge Web. #osb11 lunch table. #diso #indieweb [Tantek Çelik](http://tantek.com/2011/174/t1/read-fork-write-merge-web-osb110)

      Idea of collections of learning as collections or "playlists" or "readlists". Similar to the old tool Readlist which bundled articles into books relatively easily. See also: https://boffosocko.com/2022/03/26/indieweb-readlists-tools-and-brainstorming/

      Use of Wiki version histories

      Some of this has the form of a Wiki but with smaller nuggets of information (sort of like Tiddlywiki perhaps, which also allows for creating custom orderings of things which had specific URLs for displaying and sharing them.) The Zettelkasten idea has some of this embedded into it. Shared zettelkasten could be an interesting thing.

      Data is the new soil. A way to reframe "data is the new oil" but as a part of the commons. This fits well into the gardens and streams metaphor.

      Jerry, have you seen Matt Ridley's work on Ideas Have Sex? https://www.ted.com/talks/matt_ridley_when_ideas_have_sex Of course you have: https://app.thebrain.com/brains/3d80058c-14d8-5361-0b61-a061f89baf87/thoughts/3e2c5c75-fc49-0688-f455-6de58e4487f1/attachments/8aab91d4-5fc8-93fe-7850-d6fa828c10a9

      I've heard Jerry mention the idea of "crystallization of knowledge" before. How can we concretely link this version with Cesar Hidalgo's work, esp. Why Information Grows?

      Cross reference Jerry's Brain: https://app.thebrain.com/brains/3d80058c-14d8-5361-0b61-a061f89baf87/thoughts/4bfe6526-9884-4b6d-9548-23659da7811e/notes

    1. Expected to come into force on June 27, India's new data retention law will force VPN companies to keep users' data - like IP addresses, real names and usage patterns - for up to five years. They will also be required to hand this information over to authorities upon request. 

      Some draconian Indian data-retention laws are coming.

  7. May 2022
    1. Recognizing that the CEC hyperthreat operates at micro and macro scales across most forms of human activity and that a whole-of-society approach is required to combat it, the approach to the CEC hyperthreat partly relies on a philosophical pivot. The idea here is that a powerful understanding of the CEC hyperthreat (how it feels, moves, and operates), as well as the larger philosophical and survival-based reasons for hyper-reconfiguration, enables all actors and groups to design their own bespoke solutions. Consequently, the narrative and threat description act as a type of orchestration tool across many agencies. This is like the “shared consciousness” idea in retired U.S. Army general Stanley A. McChrystal’s “team of teams” approach to complexity.7       Such an approach is heavily dependent on exceptional communication of both the CEC hyperthreat and hyper-response pathways, as well as providing an enabling environment in terms of capacity to make decisions, access information and resources. This idea informs Operation Visibility and Knowability (OP VAK), which will be described later.  

      Such an effort will require a supporting worldwide digital ecosystem. In the recent past, major evolutionary transitions (MET) (Robin et al., 2021) of our species have been triggered by radical new information systems such as spoken language, and then inscribed language. Something akin to a Major Competitive Transition (MCT) may be required to accompany a radical transition to a good anthropocene. (See annotation: https://hyp.is/go?url=https%3A%2F%2Fwww.frontiersin.org%2Farticles%2F10.3389%2Ffevo.2021.711556%2Ffull&group=world)

      If large data is ingested into a public Indyweb then, because Indyweb is naturally a graph database, a salience landscape of the hyperthreat can be constructed and the data visualized across its multiple dimensions and scales.

      Metaphorically, it can manifest as a hydra with multiple tentacles reaching out to multiple scales and dimensions. VR and AR technology can be used to expose the hyperobject and its progression.

      The proper hyperthreat is not climate change alone, although that is its most time-sensitive dimension, but rather the totality of all the blowbacks of human progress: the aggregate of all the progress traps that, through a myopic prioritization of profit over global wellbeing and the invisibility of the hyperobject, have been allowed to grow from molehills into mountains.

    1. I explore how moves towards ‘objective’ data as the basis for decision-making orientated teachers’ judgements towards data in ways that worked to standardise judgement and exclude more multifaceted, situated and values-driven modes of professional knowledge that were characterised as ‘human’ and therefore inevitably biased.

      But, aren't these multifaceted, situated, and values-driven modes also constituted of data? Isn't everything represented by data? Even 'subjective' understanding of the world is articulated as data.

      Is there some 'standard' definition of data that I'm not aware of in the context of this domain?

    2. Recommended by Ben Williamson. Purpose: It may have some relevance for the project with Ben around chat bots and interviews, as well as implications for the introduction of portfolios for assessment.

    1. Each developer on average wastes 30 minutes before and after the meeting to context switch and the time is otherwise non-value adding. (See this study for the cost of context switching).
    1. For example, the idea of “data ownership” is often championed as a solution. But what is the point of owning data that should not exist in the first place? All that does is further institutionalise and legitimate data capture. It’s like negotiating how many hours a day a seven-year-old should be allowed to work, rather than contesting the fundamental legitimacy of child labour. Data ownership also fails to reckon with the realities of behavioural surplus. Surveillance capitalists extract predictive value from the exclamation points in your post, not merely the content of what you write, or from how you walk and not merely where you walk. Users might get “ownership” of the data that they give to surveillance capitalists in the first place, but they will not get ownership of the surplus or the predictions gleaned from it – not without new legal concepts built on an understanding of these operations.
    1. And it’s easy to leave. Unlike on Facebook or Twitter, Substack writers can simply take their email lists and direct connections to their readers with them.

      Owning your audience is key here.

    1. We believe that Facebook is also actively encouraging people to use tools like Buffer Publish for their business or organization, rather than personal use. They are continuing to support the use of Facebook Pages, rather than personal Profiles, for things like scheduling and analytics.

      Of course they're encouraging people to do this. Pushing them to the business side is where they're making all the money.

    1. Manton says owning your domain so you can move your content without breaking URLs is owning your content, whereas I believe if your content still lives on someone else's server, and requires them to run the server and run their code so you can access your content, it's not really yours at all, as they could remove your access at any time.

      This is a slippery slope problem, but people are certainly capable of taking positions along a broad spectrum here.

      The one thing I might worry about, particularly given micro.blog's size, is the relative bus factor of one represented by Manton himself. If something were to happen to him, what recourse has he built in to make sure that people could export their data easily and leave the service if the worst were to happen? Is that documented somewhere?

      Aside from this the service has one of the most reasonable turn-key solutions for domain and data ownership I've seen out there without running all of your own infrastructure.

    2. First, Manton's business model is for users to not own their content. You might be able to own your domain name, but if you have a hosted Micro.blog blog, the content itself is hosted on Micro.blog servers, not yours. You can export your data, or use an RSS feed to auto-post it to somewhere you control directly, but if you're not hosting the content yourself, how does having a custom domain equal self-hosting your content and truly owning it? Compared to hosting your own blog and auto-posting it to Micro.blog, which won't cost you and won't make Micro.blog any revenue, posting for a hosted blog seems to decrease your ownership.

      I'm not sure that this is the problem that micro.blog is trying to solve. It's trying to solve the problem of how to be online as simply and easily as possible without maintaining the overhead of hosting and managing your own website.

      As long as one can easily export their data at will and redirect their domain to another host, one should be fine. In some sense micro.blog makes it easier than changing phone carriers, which in most cases will abandon one's text messages unless one jumps through lots of hoops.

      One step that micro.blog could set up is providing a download dump of all content every six months to a year so that people have it backed up in an accessible fashion. Presently, to my knowledge, one could request this at any time and move when they wished.

    1. The ad lists various data that WhatsApp doesn’t collect or share. Allaying data collection concerns by listing data not collected is misleading. WhatsApp doesn’t collect hair samples or retinal scans either; not collecting that information doesn’t mean it respects privacy because it doesn’t change the information WhatsApp does collect.

      An important logical point. Listing what they don't keep isn't as good as saying what they actually do with one's data.

    1. The main thing Smith has learned over the past seven years is “the importance of ownership.” He admitted that Tumblr initially helped him “build a community around the idea of digital news.” However, it soon became clear that Tumblr was the only one reaping the rewards of its growing community. As he aptly put it, “Tumblr wasn’t seriously thinking about the importance of revenue or business opportunities for their creators.”
    1. Third, the post-LMS world should protect the pedagogical prerogatives and intellectual property rights of faculty members at all levels of employment. This means, for example, that contingent faculty should be free to take the online courses they develop wherever they happen to be teaching. Similarly, professors who choose to tape their own lectures should retain exclusive rights to those tapes. After all, it’s not as if you have to turn over your lecture notes to your old university whenever you change jobs.

      Own your pedagogy, just like anything else out there...

    1. And yes, some add-ons exist, but I just wish the feature was native to the browser. And I do not want to rely on a third party service. My quotes are mine only and should not necessary be shared with a server on someone's else machine.

      Ownership of the data is important. One could certainly set up their own Hypothes.is server if they liked.

      I personally take the data from my own Hypothes.is account and dump it into my local Obsidian.md vault for saving, crosslinking, and further thought.

    1. With Alphabet Inc.’s Google, and Facebook Inc. and its WhatsApp messaging service used by hundreds of millions of Indians, India is examining methods China has used to protect domestic startups and take control of citizens’ data.

      Governments owning citizens' data directly?? Why not have the government empower citizens to own their own data?

    1. The highlights you made in FreeTime are preserved in My Clippings.txt, but you can’t see them on the Kindle unless you are in FreeTime mode. Progress between FreeTime and regular mode are tracked separately, too. I now pretty much only use my Kindle in FreeTime mode so that my reading statistics are tracked. If you are a data nerd and want to crunch the data on your own, it is stored in a SQLite file on your device under system > freetime > freetime.db.

      FreeTime mode on the Amazon Kindle will provide you with reading statistics. You can find the raw data as an SQLite file under system > freetime > freetime.db.
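Since the file is plain SQLite, it can be inspected with Python's standard library. The internal schema of freetime.db isn't documented here, so this sketch only lists whatever tables the file happens to contain; the example path follows the annotation's description:

```python
import sqlite3

def list_tables(db_path):
    """Return the names of all tables in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalogue of schema objects.
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]

# e.g. list_tables("system/freetime/freetime.db")
```

From there, a `SELECT * FROM <table> LIMIT 5` on each table is usually enough to work out which one holds the reading statistics.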

    1. I tried very hard in that book, when it came to social media, to be platform agnostic, to emphasize that social media sites come and go, and to always invest first and foremost in your own media. (Website, blog, mailing list, etc.)
    1. Facebook provides some data portability, but makes an odd plea for regulation to make more functionality possible.

      Why do this when they could choose to do the right thing? They don't need to be forced and could certainly try to enforce security. It wouldn't be any worse than unveiling the tons of personal data they've managed not to protect in the past.

    1. Goodreads lost my entire account last week. Nine years as a user, some 600 books and 250 carefully written reviews all deleted and unrecoverable. Their support has not been helpful. In 35 years of being online I've never encountered a company with such callous disregard for their users' data.

      A clarion call for owning your own data.

    1. I like how Dr. Pacheco-Vega outlines some of his research process here.

      Sharing it on Twitter is great, and so is storing a copy on his website. I do worry that it looks like the tweets are embedded via a simple URL method and not done individually, which means that if Twitter goes down or disappears, so does all of his work. Better would be to do a full blockquote embed method, so that if Twitter disappears he's got the text at least. Images would also need to be saved separately.

    1. Common Pitfalls to Avoid When Choosing Your App

      What are the common pitfalls when choosing a note taking application or platform?

      Own your data

      Prefer note taking systems that don't rely on a company's long term existence. While Evernote or OneNote have been around for a while, there's nothing to say they'll be around forever or even your entire lifetime. That shiny new startup note taking company may not gain traction in the market and exist in two years. If your notes are trapped inside a company's infrastructure and aren't exportable to another location, you're simply dead in the water. Make sure you have a method to be able to export and own the raw data of your notes.

      Test driving many

      and not choosing or sticking with one (or even a few). Don't get stunned into inaction by the number of choices.

      Shiny object syndrome

      is the situation where people focus all their attention on something new, current, or trendy, yet drop it as soon as something newer takes its place. There will always be new and perhaps interesting note taking applications. Some may look fun and you'll be tempted to try them out and fragment your notes. Don't waste your time unless the benefits are manifestly clear and the pathway to exporting your notes is simple and easy. Otherwise you'll spend all your time importing/exporting and managing your notes and not taking and using them. Paper and pencil have been around for centuries and they work, so at a minimum do this. True innovation in this space is exceedingly rare; even small affordances like [[wikilinks]] and/or bi-directional links may save a few seconds here and there, but these can still be done manually, and having a system far exceeds the value of having the best system.

      (Relate this to the same effect in the blogosphere of people switching CMSes and software and never actually writing content on their website. The purpose of the tool is using it and not collecting all the tools as a distraction for not using them. Remember which problem you're attempting to solve.)

      Future needs and whataboutisms

      Surely there will be future innovations in the note taking space or you may find some niche need that your current system doesn't solve. Given the maturity of the space even in a pen and paper world, this will be rare. Don't worry inordinately about the future, imitate what has worked for large numbers of people in the past and move forward from there.

      Others? Probably...

    1. Even with data that’s less fraught than our genome, our decisions about what we expose to the world have externalities for the people around us.

      We need to think more about the externalities of our data decisions.

    1. It's the feedback that's motivating A-list bloggers like Digg founder Kevin Rose to shut down their blogs and redirect traffic to their Google+ profiles. I have found the same to be true.

      This didn't work out too well for them did it?

    1. The European Commission has prepared to legislate to require interoperability, and it calls being able to use your data wherever and whenever you like “multi-homing”. (Not many other people like this term, but it describes something important – the ability for people to move easily between platforms

      an interesting neologism to describe something that many want

    1. the decentralised and open source nature of these systems, where anyone can host an instance, may protect their communities from the kinds of losses experienced by users of the many commercial platforms that have gone out of business over the last decades (e.g. Geocities, Wikispaces or Google + to name just a few).

      https://indieweb.org/site-deaths names a large number of others

    1. Subsidiarity, which uses “data cooperatives, collaboratives, and trusts with privacy-preserving and -enhancing techniques for data processing, such as federated learning and secure multiparty computation.”

      Another value of the data cooperative model might be that each individual might not have time to research and administer possible new data-sharing requests/opportunities, and it would be helpful to entrust that work to a cooperative entity that already has one's trust.
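As a toy illustration of the secure multiparty computation techniques mentioned above, additive secret sharing lets several parties compute a sum without any one of them seeing another's raw input. This is a sketch of the idea only, not a hardened protocol:

```python
import random

# Work in a prime field so shares wrap around uniformly.
P = 2**61 - 1

def share(value, n_parties):
    """Split a value into n additive shares that sum to it mod P."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

def secure_sum(values, n_parties=3):
    """Each party locally sums its shares of every input; combining
    those partial sums reveals only the total, never any single input."""
    all_shares = [share(v, n_parties) for v in values]
    partials = [sum(s[i] for s in all_shares) % P for i in range(n_parties)]
    return reconstruct(partials)
```

Any single share (or partial sum held by one party) is a uniformly random field element, which is what makes the individual inputs unrecoverable without collusion.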

    1. A 20-year age difference (for example, from 20 to 40, or from 30 to 50 years old) will, on average, correspond to reading 30 WPM slower, meaning that a 50-year old user will need about 11% more time than a 30-year old user to read the same text.
    2. Users’ age had a strong impact on their reading speed, which dropped by 1.5 WPM for each year of age.
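The two findings above are mutually consistent. Assuming a baseline of roughly 300 WPM for the 30-year-old reader (an assumption for illustration, not a figure from the article), the arithmetic works out to about 11% more reading time:

```python
wpm_loss_per_year = 1.5
age_gap = 20
wpm_gap = wpm_loss_per_year * age_gap      # 20 years * 1.5 = 30 WPM slower

baseline_wpm = 300.0                       # assumed speed of the 30-year-old
older_wpm = baseline_wpm - wpm_gap         # 270 WPM for the 50-year-old
extra_time = baseline_wpm / older_wpm - 1  # 300/270 - 1 ~= 0.11, i.e. ~11% more time
print(wpm_gap, round(extra_time * 100, 1))
```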
    1. Overall, having spent a significant amount of time building this project, scaling it up to the size it’s at now, as well as analysing the data, the main conclusion is that it is not worth building your own solution, and investing this much time. When I first started building this project 3 years ago, I expected to learn way more surprising and interesting facts. There were some, and it’s super interesting to look through those graphs, however retrospectively, it did not justify the hundreds of hours I invested in this project.I’ll likely continue tracking my mood, as well as a few other key metrics, however will significantly reduce the amount of time I invest in it.

      Words of the author of https://krausefx.com//blog/how-i-put-my-whole-life-into-a-single-database

      It seems as if excessive personal data tracking is not worth it

  8. Apr 2022
    1. ReconfigBehSci [@SciBeh]. (2021, October 1). @alexdefig against this survey data you might set actual uptake figures in France, various Canadian provinces, and Germany after the introduction of passports [Tweet]. Twitter. https://twitter.com/SciBeh/status/1443955929985159174

    1. ReconfigBehSci [@SciBeh]. (2021, October 1). @alexdefig and I didn’t say we should mandate them. I simply pointed out that when considering the impact of passports on uptake we should probably look at actual uptake in response to actual mandates in addition to survey data, which may or may not translate into action, no? [Tweet]. Twitter. https://twitter.com/SciBeh/status/1443958577173917699

    1. ReconfigBehSci [@SciBeh]. (2021, October 1). @alexdefig so, observational data has weaknesses- so does survey data, but it’s there and we should look at it. On your second point, yes, that is important, we should study that, if we have no data we can’t factor it into decision. Third is separate issue/factor to weigh. [Tweet]. Twitter. https://twitter.com/SciBeh/status/1443960096497627141

    1. The combined stuff is available to components using the page store as $page.stuff, providing a mechanism for pages to pass data 'upward' to layouts.

      Bidirectional data flow?! That's a game changer.

      analogue in Rails: content_for

      https://github.com/sveltejs/kit/pull/3252/files

    1. ReconfigBehSci. (2022, January 24). @STWorg @FraserNelson @GrahamMedley no worse- he took Medley’s comment that Sage model the scenarios the government asks them to consider to mean that they basically set out to find the justification for what the government already wanted to do. Complete failure to distinguish between inputs and outputs of a model [Tweet]. @SciBeh. https://twitter.com/SciBeh/status/1485625862645075970

    1. Jackie Parchem, MD [@jackie_parchem]. (2021, July 29). @MeadowGood @ACOGPregnancy Some of the docs who stepped up and got vaccinated early when we didn’t have the data we do now. What we all knew: Protecting moms protects babies! All have had their babies by now! @IlanaKrumm @anushkachelliah @gumbo_amando @emergjenncy @JuliaNEM33 https://t.co/h9UJo6h3fQ [Tweet]. Twitter. https://twitter.com/jackie_parchem/status/1420785474499645442

    1. For this reason, the Secretary of State set out a vision1 for health and care to have national open standards for data and interoperability that are mandated throughout the NHS and social care.
    1. Nick Sawyer, MD, MBA, FACEP [@NickSawyerMD]. (2022, January 3). The anti-vaccine community created a manipulated version of VARES that misrepresents the VAERS data. #disinformationdoctors use this data to falsely claim that vaccines CAUSE bad outcomes, when the relationship is only CORRELATED. Watch this explainer: Https://youtu.be/VMUQSMFGBDo https://t.co/ruRY6E6blB [Tweet]. Twitter. https://twitter.com/NickSawyerMD/status/1477806470192197633

    1. Carl T. Bergstrom. (2021, August 18). 1. There has been lots of talk about recent data from Israel that seem to suggest a decline in vaccine efficacy against severe disease due to Delta, waning protection, or both. This may have even been a motivation for Biden’s announcement that the US would be adopting boosters. [Tweet]. @CT_Bergstrom. https://twitter.com/CT_Bergstrom/status/1427767356600688646

    1. ReconfigBehSci. (2021, February 1). @islaut1 @richarddmorey I think diff. Is that your first response seemed to indicate the evidence was the search itself (contra Richard) so turning an inference from absence of something into a kind of positive evidence ('the search’). Let’s call absence of evidence “not E”. 1/2 [Tweet]. @SciBeh. https://twitter.com/SciBeh/status/1356215051238191104

    1. The Lancet. (2021, April 16). Quantity > quality? The magnitude of #COVID19 research of questionable methodological quality reveals an urgent need to optimise clinical trial research—But how? A new @LancetGH Series discusses challenges and solutions. Read https://t.co/z4SluR3yuh 1/5 https://t.co/94RRVT0qhF [Tweet]. @TheLancet. https://twitter.com/TheLancet/status/1383027527233515520

    1. Dr Nisreen Alwan 🌻. (2020, March 14). Our letter in the Times. ‘We request that the government urgently and openly share the scientific evidence, data and modelling it is using to inform its decision on the #Covid_19 public health interventions’ @richardhorton1 @miriamorcutt @devisridhar @drannewilson @PWGTennant https://t.co/YZamKCheXH [Tweet]. @Dr2NisreenAlwan. https://twitter.com/Dr2NisreenAlwan/status/1238726765469749248