3,183 Matching Annotations
  1. Last 7 days
    1. https://whatever.scalzi.com/2022/11/25/how-to-weave-the-artisan-web/

      “But Scalzi,” I hear you say, “How do we bring back that artisan, hand-crafted Web?” Well, it’s simple, really, and if you’re a writer/artist/musician/other sort of creator, it’s actually kind of essential:

    1. Our annotators achieve the highest precision with OntoNotes, suggesting that most of the entities identified by crowdworkers are correct for this dataset.

      Interesting that the mention detection algorithm gives poor precision on OntoNotes while the annotators get high precision. Does this imply that there are a lot of invalid mentions in this data, and that the OntoNotes guidelines are correct to ignore generic pronouns without pronominals?

    2. an algorithm with high precision on LitBank or OntoNotes would miss a huge percentage of relevant mentions and entities on other datasets (constraining our analysis)

      these datasets have the most limited/constrained definitions of coreference and of what should be marked up, so it makes sense that systems precise on them miss relevant mentions in the other datasets

    3. Procedure: We first launch an annotation tutorial (paid $4.50) and recruit the annotators on the AMT platform. At the end of the tutorial, each annotator is asked to annotate a short passage (around 150 words). Only annotators with a B3 score (Bagga

      Annotators are asked to complete a quality control exercise and only annotators who achieve a B3 score of 0.9 or higher are invited to do more annotation
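
      As a side note, the B3 (B-cubed) precision component used for this quality gate can be sketched in a few lines. This is my own illustrative implementation of Bagga & Baldwin's metric, assuming both clusterings are given as lists of mention-id sets over the same mentions (not the paper's actual code):

      ```python
      def b_cubed_precision(predicted, gold):
          """B^3 precision: for each mention, the fraction of mentions in its
          predicted cluster that also share its gold cluster, averaged over
          all mentions. Assumes both clusterings cover the same mentions."""
          pred_of = {m: c for c in predicted for m in c}  # mention -> predicted cluster
          gold_of = {m: c for c in gold for m in c}       # mention -> gold cluster
          scores = [len(pred_of[m] & gold_of[m]) / len(pred_of[m]) for m in pred_of]
          return sum(scores) / len(scores)

      # A perfect clustering scores 1.0; merging two gold clusters lowers it.
      assert b_cubed_precision([{1, 2}, {3}], [{1, 2}, {3}]) == 1.0
      ```

      Swapping the roles of `predicted` and `gold` gives B3 recall, and the reported B3 score is the F1 of the two.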

    4. Annotation structure: Two annotation approaches are prominent in the literature: (1) a local pairwise approach, annotators are shown a pair of mentions and asked whether they refer to the same entity (Hladká et al., 2009; Chamberlain et al., 2016a; Li et al., 2020; Ravenscroft et al., 2021), which is time-consuming; or (2) a cluster-based approach (Reiter, 2018; Oberle, 2018; Bornstein et al., 2020), in which annotators group all mentions of the same entity into a single cluster. In ezCoref we use the latter approach, which can be faster but requires the UI to support more complex actions for creating and editing cluster structures.

      ezCoref presents clusters of coreferences all at the same time - this is a nice efficient way to do annotation versus pairwise annotation (like we did for CD^2CR)

    5. However, these datasets vary widely in their definitions of coreference (expressed via annotation guidelines), resulting in inconsistent annotations both within and across domains and languages. For instance, as shown in Figure 1, while ARRAU (Uryupina et al., 2019) treats generic pronouns as non-referring, OntoNotes chooses not to mark them at all

      One of the big issues is that different coreference datasets have significant differences in annotation guidelines, even within the coreference family of tasks. I found this quite shocking, as one might expect coreference to be a fairly well-defined task.

    6. Specifically, our work investigates the quality of crowdsourced coreference annotations when annotators are taught only simple coreference cases that are treated uniformly across existing datasets (e.g., pronouns). By providing only these simple cases, we are able to teach the annotators the concept of coreference, while allowing them to freely interpret cases treated differently across the existing datasets. This setup allows us to identify cases where our annotators disagree among each other, but more importantly cases where they unanimously agree with each other but disagree with the expert, thus suggesting cases that should be revisited by the research community when curating future unified annotation guidelines

      The aim of the work is to examine a simplified subset of co-reference phenomena which are generally treated the same across different existing datasets.

      This makes spotting inter-annotator disagreement easier - presumably because for simpler cases there are fewer modes of failure?

    7. In this work, we develop a crowdsourcing-friendly coreference annotation methodology, ezCoref, consisting of an annotation tool and an interactive tutorial. We use ezCoref to re-annotate 240 passages from seven existing English coreference datasets (spanning fiction, news, and multiple other domains) while teaching annotators only cases that are treated similarly across these datasets

      This paper describes a new, efficient coreference annotation tool that simplifies coreference annotation. The authors use their tool to re-annotate passages from widely used coreference datasets.

    1. An independent initiative by Owen Cornec, who has also made many other beautiful data visualizations. Wikiverse vividly captures the fact that Wikipedia is an awe-inspiring universe to explore.

    1. One example could be putting all files into an Amazon S3 bucket. It’s versatile, cheap and integrates with many technologies. If you are using Redshift for your data warehouse, it has great integration with that too.

      Essentially the raw data needs to be vaguely homogenised and put into a single place

    1. Dr. Miho Ohsaki re-examined work she and her group had previously published and confirmed that the results are indeed meaningless in the sense described in this work (Ohsaki et al., 2002). She has subsequently been able to redefine the clustering subroutine in her work to allow more meaningful pattern discovery (Ohsaki et al., 2003)

      Look into what Dr. Miho Ohsaki changed about the clustering subroutine in her work and how it allowed for "more meaningful pattern discovery"

    2. Eamonn Keogh is an assistant professor of Computer Science at the University of California, Riverside. His research interests are in Data Mining, Machine Learning and Information Retrieval. Several of his papers have won best paper awards, including papers at SIGKDD and SIGMOD. Dr. Keogh is the recipient of a 5-year NSF Career Award for “Efficient Discovery of Previously Unknown Patterns and Relationships in Massive Time Series Databases”.

      Look into Eamonn Keogh's papers that won "best paper awards"

    1. It took me a while to grok where dbt comes in the stack but now that I (think) I have it, it makes a lot of sense. I can also see why, with my background, I had trouble doing so. Just as Apache Kafka isn’t easily explained as simply another database, another message queue, etc, dbt isn’t just another Informatica, another Oracle Data Integrator. It’s not about ETL or ELT - it’s about T alone. With that understood, things slot into place. This isn’t just my take on it either - dbt themselves call it out on their blog:

      Also - just because their "pricing" page caught me off guard and their website isn't that clear (until you click through to the technical docs) - I thought it worth calling out that dbt appears to be an open-core platform. They have a SaaS offering and an open-source Python command-line tool; these articles appear to be about the latter.

    2. Of course, despite what the "data is the new oil" vendors told you back in the day, you can’t just chuck raw data in and assume that magic will happen on it, but that’s a rant for another day ;-)

      Love this analogy - imagine chucking some crude into a black box and hoping for ethanol at the other end. Then, when you end up with diesel you have no idea what happened.

    3. Working with the raw data has lots of benefits, since at the point of ingest you don’t know all of the possible uses for the data. If you rationalise that data down to just the set of fields and/or aggregate it up to fit just a specific use case then you lose the fidelity of the data that could be useful elsewhere. This is one of the premises and benefits of a data lake done well.

      absolutely right - there's also a data provenance angle here - it is useful to be able to point to a data point that is 5 or 6 transformations from the raw input and be able to say "yes I know exactly where this came from, here are all the steps that came before"

  2. Nov 2022
    1. binary string (i.e., a string in which each character in the string is treated as a byte of binary data)
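
      A quick Python illustration of the distinction (my own example, not from the source): a `bytes` object is a binary string whose elements are raw byte values, unlike `str`, whose elements are Unicode characters.

      ```python
      # Five bytes: the UTF-8 encoding of "café" ("é" takes two bytes).
      raw = b"caf\xc3\xa9"
      text = raw.decode("utf-8")  # four characters once decoded

      assert len(raw) == 5 and len(text) == 4
      assert list(raw[:3]) == [99, 97, 102]  # indexing bytes yields integers 0-255
      ```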
    1. okay so remind you what is a sheath so a sheep is something that allows me to 00:05:37 translate between physical sources or physical realms of data and physical regions so these are various 00:05:49 open sets or translation between them by taking a look at restrictions overlaps 00:06:02 and then inferring

      Fixed typos in transcript:

      Just generally speaking, what can I do with this sheaf-theoretic data structure that I've got? Okay, [I'll] remind you what is a sheaf. A sheaf is something that allows me to translate between physical sources or physical realms of data [in the left diagram] and the data that are associated with those physical regions [in the right diagram]

      So these [on the left] are various open sets [an example being] simplices in a [simplicial complex which is an example of a] topological space.

      And these [on the right] are the data spaces and I'm able to make some translation between [the left and the right diagrams] by taking a look at restrictions of overlaps [a on the left] and inferring back to the union.

      So that's what a sheaf is [regarding data structures]. It's something that allows me to make an inference, an inferential machine.

    1. I also think being able to self-host and export parts of your data to share with others would be great.

      This might be achievable through Holochain application framework. One promising project built on Holochain is Neighbourhoods. Their "Social-Sensemaker Architecture" across "neighbourhoods" is intriguing

    1. with Prisma you never create application models in your programming language by manually defining classes, interfaces, or structs. Instead, the application models are defined in your Prisma schema
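
      For reference, a minimal sketch of such a Prisma schema (the model and field names here are invented for illustration); running `prisma generate` then emits a typed client from it, so the application models are never hand-written:

      ```prisma
      model User {
        id    Int    @id @default(autoincrement())
        email String @unique
        posts Post[]
      }

      model Post {
        id       Int    @id @default(autoincrement())
        title    String
        author   User   @relation(fields: [authorId], references: [id])
        authorId Int
      }
      ```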
    1. high friction and cost of discovering, understanding, trusting, and ultimately using quality data. If not addressed, this problem only exacerbates with data mesh, as the number of places and teams who provide data - domains - increases.

      Another link to https://frictionlessdata.io/

    1. building common infrastructure

      A solution to the duplication of effort and of data.

    2. A data product owner makes decisions around the vision and the roadmap for the data products, concerns herself with satisfaction of her consumers and continuously measures and improves the quality and richness of the data her domain owns and produces. She is responsible for the lifecycle of the domain datasets, when to change, revise and retire data and schemas. She strikes a balance between the competing needs of the domain data consumers.

      Resembles the roles and responsibilities of our data stewards.

    1. CEO Mike Tung was on the Data Science podcast. He seems to be solving a problem that Google search doesn't: how seriously should you take the results that come up? What confidence do you have in their truth or falsity?

  3. Oct 2022
    1. only by examining a constellation of metrics in tension can we understand and influence developer productivity

      I love this framing! In my experience companies don't generally acknowledge that metrics can be in tension, which usually means they're only tracking a subset of the metrics they ought to be if they want to have a more complete/realistic understanding of the state of things.

    1. Software engineers typically stay at one job for an average of two years before moving somewhere different. They spend less than half the amount of time at one company compared to the national average tenure of 4.2 years.
    2. The average performance pay rise for most employees is 3% a year. That is minuscule compared to the 14.8% pay raise the average person gets when they switch jobs.
    1. There are a lot of PostgreSQL servers connected to the Internet: we searched shodan.io and obtained a sample of more than 820,000 PostgreSQL servers connected to the Internet between September 1 and September 29. Only 36% of the servers examined had SSL certificates. More than 523,000 PostgreSQL servers listening on the Internet did not use SSL (64%)
    2. At most 15% of the approximately 820,000 PostgreSQL servers listening on the Internet require encryption. In fact, only 36% even support encryption. This puts PostgreSQL servers well behind the rest of the Internet in terms of security. In comparison, according to Google, over 96% of page loads in Chrome on a Mac are encrypted. The top 100 websites support encryption, and 97 of those default to encryption.
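
    A hedged sketch of what requiring encryption involves (hostnames, users, and file paths below are placeholders): the server can reject non-TLS connections in `pg_hba.conf`, and clients can insist on verified TLS via `sslmode`:

    ```shell
    # Server: in pg_hba.conf, "hostssl" accepts remote connections only over TLS
    # (a plain "host" entry would also allow unencrypted connections):
    #   hostssl  all  all  0.0.0.0/0  scram-sha-256

    # Client: require TLS and verify the server certificate chain:
    psql "host=db.example.com dbname=app user=app sslmode=verify-full sslrootcert=root.crt"
    ```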
    1. one recognizes in the tactile reality that so many of the cards are on flimsy copy paper, on the verge of disintegration with each use.

      Deutsch used flimsy copy paper, much like Niklas Luhmann, and as a result some are on the verge of disintegration through use over time.

      The wear of the paper here, however, is indicative of active use over time as well as potential care in use, a useful historical fact.

    1. In the event of non-compliance with the Law, the Commission d’accès à l’information may impose significant sanctions, which could reach up to $25M or 4% of worldwide turnover. The sanction will be proportional to, among other things, the severity of the breach and the company’s ability to pay.
    1. Noting the dates of available materials within archives or sources can be useful on bibliography notes for either planning or revisiting sources. (p16,18)

      Similarly one ought to note missing dates, data, volumes, or resources at locations to prevent unfruitfully looking for data in these locations or as a note to potentially look for the missing material in other locations. (p16)

  4. Sep 2022
    1. First, to clarify - what is "code", what is "data"? In this article, when I say "code", I mean something a human has written, that will be read by a machine (another program or hardware). When I say "data", I mean something a machine has written, that may be read by a machine, a human, or both. Therefore, a configuration file where you set logging.level = DEBUG is code, while virtual machine instructions emitted by a compiler are data. Of course, code is data, but I think this over-simplified view (humans write code, machines write data) will serve us best for now...
    1. The authors propose, based on these experiences, that the cause of a number of unexpected difficulties in human-computer interaction lies in users’ unwillingness or inability to make structure, content, or procedures explicit

      I'm curious if this is because of unwillingness or difficulty.

  5. Aug 2022
    1. In practice, a system in which different parts of the web have different capabilities cannot insist on bidirectional links. Imagine, for example, the publisher of a large and famous book to which many people refer but who has no interest in maintaining his end of their links or indeed in knowing who has referred to the book.

      Why it's pointless to insist that links should have been bidirectional: it's unenforceable.

    1. If the key, or the device on which it is stored, is compromised, or if a vulnerability can be exploited, then the data asset can be irrevocably stolen

      Another scenario: if the key or the device storing it is compromised, or if a vulnerability is exploited, then the data asset can be stolen.

    2. If a key is lost, this invariably means that the secured data asset is irrevocably lost

      The counterpart risk, so be careful: if a key is lost, the secured data asset is lost.


    1. Benjy Renton. (2021, November 16). New data update: Drawing from 23 states reporting data, 5.3% of kids ages 5-11 in these states have received their first dose. Vermont leads these states so far in vaccination rates for this age group—17%. The CDC will begin to report data for this group late this week. Https://t.co/LMJXl6lo6Z [Tweet]. @bhrenton. https://twitter.com/bhrenton/status/1460638150322180098

    1. Yaniv Erlich. (2021, December 8). Updated table of Omicron neuts studies with @Pfizer results (which did the worst job in terms of reporting raw data). Strong discrepancy between studies with live vs pseudo. Https://t.co/InQuWMAm4l [Tweet]. @erlichya. https://twitter.com/erlichya/status/1468580675007795204

    1. John Burn-Murdoch. (2021, November 25). Five quick tweets on the new variant B.1.1.529 Caveat first: Data here is very preliminary, so everything could change. Nonetheless, better safe than sorry. 1) Based on the data we have, this variant is out-competing others far faster than Beta and even Delta did 🚩🚩 https://t.co/R2Ac4e4N6s [Tweet]. @jburnmurdoch. https://twitter.com/jburnmurdoch/status/1463956686075580421

    1. The bibliography should be placed next after the table of contents, because the instructor always wishes to examine it before reading the text of the essay.

      Surprising! Particularly since bibliographies traditionally come at the end.

      Though for teaching purposes, I can definitely see a professor wanting it up front. I also frequently skim through bibliographies before starting reading works now, though I didn't do this in the past. Reading a bibliography first is an excellent way to establish common context with an author however.

    1. NETGEAR is committed to providing you with a great product and choices regarding our data processing practices. You can opt out of the use of the data described above by contacting us at analyticspolicy@netgear.com

      You may opt out of these data use situations by emailing analyticspolicy@netgear.com.

    2. Marketing. For example, information about your device type and usage data may allow us to understand other products or services that may be of interest to you.

      All of the information above that has been consented to, can be used by NetGear to make money off consenting individuals and their families.

    3. USB device

      This gives Netgear permission to know what you plug into your computer, be it a FitBit, a printer, scanner, microphone, headphones, or webcam: anything you attach to your computer.

    1. I like to think of thoughts as streaming information, so I don’t need to tag and categorize them as we do with batched data. Instead, using time as an index and sticky notes to mark slices of info solves most of my use cases. Graph notebooks like Obsidian think of information as batched data. So you have a set of notes (samples) that you try to aggregate, categorize, and connect. Sure there’s a use case for that: I can’t imagine a company wiki presented as streaming info! But I don’t think it aids me in how I usually think. When thinking with pen and paper, I prefer managing streamed information first, then converting it into batched information later— a blog post, documentation, etc.

      There's an interesting dichotomy between streaming information and batched data here, but it isn't well delineated and doesn't add much to the discussion as a result. Perhaps distilling it down may help? There's a kernel of something useful here, but it isn't immediately apparent.

      Relation to stock and flow or the idea of the garden and the stream?

    1. https://app.idx.us/en-US/services/credit-management

      Seems a bit ironic just how much data a credit monitoring service wants in order to help monitor your data on the dark web. So many companies have had data breaches; I can only wonder how long it may be before a company like IDX has a breach of its own databases.

      The credit reporting agencies should opt everyone into these sorts of protections automatically given the number of breaches in the past.

    1. those provisions cannot be interpreted as meaning that the processing of personal data that are liable indirectly to reveal sensitive information concerning a natural person is excluded from the strengthened protection regime prescribed by those provisions, if the effectiveness of that regime and the protection of the fundamental rights and freedoms of natural persons that it is intended to ensure are not to be compromised.

      And here's the key element for indirect/inferred data. In order for Article 9 to matter, it must also include data that infers SCD.

    2. collecting and checking the content of declarations of private interests, of personal data that are liable to disclose indirectly the political opinions, trade union membership or sexual orientation of a natural person constitutes processing of special categories of personal data, for the purpose of those provisions.

      Second question: If you collect it, can you infer from it?

  6. Jul 2022
    1. AI text generator, a boon for bloggers? A test report

      While I only wanted to investigate AI text generators further, I ended up writing a test report. I was quite stunned, because the AI text generator turns out to be able to create a fully cohesive and to-the-point article in minutes. Here is the test report.

    1. List management

       TweetDeck allows you to manage your Lists easily in one centralized place for all your accounts. You can create Lists in TweetDeck filtered by your interests or by particular accounts. Any List that you have set up or followed previously can also be added as a separate column in TweetDeck.

       To create a List on TweetDeck:

       1. From the navigation bar, click on the plus icon to select Add column, then click on Lists.
       2. Click the Create List button.
       3. Select the Twitter account you would like to create the List for.
       4. Name the List and give it a description, then select whether you would like the List to be publicly visible or not (other people can follow your public Lists).
       5. Click Save.
       6. Add suggested accounts or search for users to add members to your List, then click Done.

       To edit a List on TweetDeck:

       1. Click on Lists from the plus icon in the navigation bar.
       2. Select the List you would like to edit.
       3. Click Edit.
       4. Add or remove List members, or click Edit Details to change the List name, description, or account. You can also click Delete List.
       5. When you're finished making changes, click Done.

       To designate a List to a column:

       1. Click on the plus icon to select Add column.
       2. Click on the Lists option from the menu.
       3. Select which List you would like to make into a column.
       4. Click Add Column.

       To use a particular List in search:

       1. Add a search column, then click the filter icon to open the column filter options.
       2. Click the icon to open the User filter.
       3. Select By members of List and type the account name followed by the List name. You can only search across your own Lists, or others’ public Lists.

      While you still can, I'd highly encourage you to use TweetDeck's "Export" List function to save plain text lists of the @ names in your... Lists.

    1. The documents highlight the massive scale of location data that government agencies including CBP and ICE received, and how the agencies sought to take advantage of the mobile advertising industry’s treasure trove of data.
    1. Location tracking is just one part of a panoply of data-collection practices that are now center stage in the abortion debate, along with people’s online search histories and information from period-tracking apps.
    1. Documentation

      The problem with this section is that it downgrades data models to documentation. The OntoPiA ontologies (I am talking about models, not so much about data such as controlled vocabularies) are machine-readable. So it is not only a matter of documenting the syntax or content of the data; it is about making the model actionable, i.e. readable and interpretable by the machines themselves. I could very well document datasets with a nice table in GitHub or with many tables in a beautiful PDF (documentation), but that is not the same thing as making an ontology available for the data. Making models an active part of data management (as with ontologies) means enabling the inference you invoked above (improperly, in my view), but also using them for explainable AI and many other purposes. This is a fundamental concept that cannot be treated this way in national guidelines. It should instead have its own dedicated chapter, given its importance also from the perspective of the "compliance" data quality characteristic of the ISO/IEC 25012 standard.

    2. In case a), the entity has all the elements needed to represent its own data model; conversely, in cases b) and c), the administration itself, in agreement with AgID, evaluates whether to extend the data model at the national level.

      The whole data modelling part, including the national catalogue of ontologies and controlled vocabularies, now seems to be in the hands of ISTAT, which together with the Department for Digital Transformation owns schema.gov.it. Here, however, AgID appears to have the role of defining the various models. In my opinion this creates confusion; there should be coordination with the other administrations to clarify who does what. At present, for OntoPiA, AgID only manages the physical infrastructure.

    3. Using the RDF framework, one can build a semantic graph, also known as a knowledge graph, which machines can traverse by resolving, i.e. dereferencing, HTTP URIs. This means it is possible to automatically extract information and thus derive additional informational content (inference).

      You do not get inference just because you dereference URIs. I suggest you read carefully the guidelines for semantic interoperability through linked open data, which explain what inference is (and this is indeed part of an enrichment process in the linked open data world). Inference is something more complex, done with automated reasoners and SPARQL queries. New information can be deduced from existing data and above all from ontologies, which are machine-readable objects!
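
      A tiny illustration of this point (example URIs are invented): the commented triple below is produced by applying RDFS semantics with a reasoner, not by dereferencing anything.

      ```turtle
      @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
      @prefix ex:   <http://example.org/> .

      ex:Dog  rdfs:subClassOf ex:Animal .
      ex:rex  a ex:Dog .

      # An RDFS reasoner infers from the ontology:
      #   ex:rex  a ex:Animal .
      ```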

  7. Jun 2022
    1. Lastly, said datasheet should outline some ethical considerations of the data.

      I think this question speaks to one of the essential aspects of the data. In my interaction with the datasheet, I mostly focused on the absence of data, but I missed this key puzzle piece in the bigger picture of why the data is not there. I assumed a cause for the non-existence of the information without pondering possible answers to this key question. It is indeed crucial to look into the current condition of the item and/or the collections that include it. If an artwork is not as well preserved as others, more effort may be needed to keep it from losing even more data in the future.

    1. Another important distinction is between data and metadata. Here, the term “data” refers to the part of a file or dataset which contains the actual representation of an object of inquiry, while the term “metadata” refers to data about that data: metadata explicitly describes selected aspects of a dataset, such as the time of its creation, or the way it was collected, or what entity external to the dataset it is supposed to represent.

      This part is notably helpful for understanding what separates "metadata" from "data". I was writing a blog post for my weekly assignment, and knowing that data is the representation of the object while metadata describes information about that data helps build definitions of the terms in my schema of knowledge. In many cases, metadata even provides resources that give insight into how the data was collected and/or introduce possible perspectives on how the data can be used in the future. Data can survive without metadata, but metadata won't exist without the data. However, data that lacks metadata may remain opaque and undeciphered, leaving it potentially useless to fundamental and economic human progress.