106 Matching Annotations
  1. Last 7 days
    1. Figure 3. The average drop in log probability (perturbation discrepancy) after rephrasing a passage is consistently higher for model-generated passages than for human-written passages. Each plot shows the distribution of the perturbation discrepancy d(x, pθ, q) for human-written news articles and machine-generated articles of equal word length from the models GPT-2 (1.5B), GPT-Neo-2.7B (Black et al., 2021), GPT-J (6B; Wang & Komatsuzaki, 2021) and GPT-NeoX (20B; Black et al., 2022). Human-written articles are a sample of 500 XSum articles; machine-generated text is generated by prompting each model with the first 30 tokens of each XSum article, sampling from the raw conditional distribution. Discrepancies are estimated with 100 T5-3B samples.

      Quite striking here is that more powerful/larger models are more capable of generating unusual, "human-like" responses, judging by the overlap in the log-likelihood distributions.

    2. if we apply small perturbations to a passage x ∼ pθ, producing x̃, the quantity log pθ(x) − log pθ(x̃) should be relatively large on average for machine-generated samples compared to human-written text.

      By applying small changes to a text sample x, we can compare the log probs of x and the perturbed example; there should be a fairly big delta for machine-generated examples.
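      That quantity is easy to sketch. The scoring and perturbation functions below are toy stand-ins of my own invention (the paper uses the source model's log-likelihood and T5 mask-filling respectively):

```python
import random

def perturbation_discrepancy(log_p, passage, perturb, n_samples=100):
    """Estimate d(x) = log p(x) - mean(log p(x_tilde)) over perturbed copies."""
    perturbed = [perturb(passage) for _ in range(n_samples)]
    return log_p(passage) - sum(log_p(p) for p in perturbed) / n_samples

# Toy stand-ins, NOT the paper's models: the "model" likes common words,
# and a "perturbation" swaps one word for a rarer synonym.
rng = random.Random(0)
COMMON = {"the", "cat", "sat", "on", "mat"}

def toy_log_p(words):
    return sum(-0.1 if w in COMMON else -3.0 for w in words)

def toy_perturb(words):
    out = list(words)
    out[rng.randrange(len(out))] = "feline"  # rarer rephrasing
    return out

machine_like = ["the", "cat", "sat", "on", "the", "mat"]
d = perturbation_discrepancy(toy_log_p, machine_like, toy_perturb)
print(d > 0)  # True: the sample sits near a local peak of log p
```

      Human-written text would sit less squarely on a probability peak, so its discrepancy should hover nearer zero.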

    3. As in prior work, we study a ‘white box’ setting (Gehrmann et al., 2019) in which the detector may evaluate the log probability of a sample log pθ(x). The white box setting does not assume access to the model architecture or parameters. While most public APIs for LLMs (such as GPT-3) enable scoring text, some exceptions exist

      The authors assume white-box access to the log probability of a sample \(\log p_\theta(x)\) but do not require access to the model's actual architecture or weights.

    4. Empirically, we find predictive entropy to be positively correlated with passage fake-ness more often than not; therefore, this baseline uses high average entropy in the model’s predictive distribution as a signal that a passage is machine-generated.

      This makes sense and aligns with GLTR: humans add more entropy to sentences by making unusual vocabulary choices that a model would not.
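      A minimal sketch of that baseline (in practice the per-position distributions would come from the scoring model; these are made-up toy vectors):

```python
import math

def avg_predictive_entropy(token_distributions):
    """Mean entropy (in nats) of per-position next-token distributions;
    a high average is taken as weak evidence the passage is machine text."""
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)
    return sum(entropy(p) for p in token_distributions) / len(token_distributions)

peaked = [[0.9, 0.05, 0.05]] * 4    # confident predictions -> low entropy
flat = [[1 / 3, 1 / 3, 1 / 3]] * 4  # uncertain predictions -> high entropy
print(avg_predictive_entropy(peaked) < avg_predictive_entropy(flat))  # True
```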

    5. We find that supervised detectors can provide similar detection performance to DetectGPT on in-distribution data like English news, but perform significantly worse than zero-shot methods in the case of English scientific writing and fail altogether for German writing.

      Supervised detection methods fail on out-of-domain examples, whereas DetectGPT seems to be robust to changes in domain.

    6. extending DetectGPT to use ensembles of models for scoring, rather than a single model, may improve detection in the black box setting

      DetectGPT could be extended to use ensembles of models, allowing it to work in black-box settings where the log probs are unknown.

    7. While in this work, we use off-the-shelf mask-filling models such as T5 and mT5 (for non-English languages), some domains may see reduced performance if existing mask-filling models do not well represent the space of meaningful rephrases, reducing the quality of the curvature estimate.

      The approach requires access to language models that can meaningfully and accurately rephrase (perturb) the outputs of the model under evaluation. If the mask-filling model cannot produce plausible rephrases for the domain, the method may not work well.

    8. For models behind APIs that do provide probabilities (such as GPT-3), evaluating probabilities nonetheless costs money.

      This does cost money to do for paid APIs and requires that log probs are made available.

    9. We simulate human revision by replacing 5 word spans of the text with samples from T5-3B until r% of the text has been replaced, and report performance as r varies.

      I question the trustworthiness of this simulation - human edits are probably going to be more sporadic and random.

    10. Figure 5. We simulate human edits to machine-generated text by replacing varying fractions of model samples with T5-3B generated text (masking out random five word spans until r% of text is masked to simulate human edits to machine-generated text). The four top-performing methods all generally degrade in performance with heavier revision, but DetectGPT is consistently most accurate. Experiment is conducted on the XSum dataset.

      DetectGPT shows 95% AUROC for texts that have been modified by about 10% and this drops off to about 85% when text is changed up to 24%.
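      The replacement schedule itself is simple to sketch; here a placeholder token stands in for the T5-3B filler, since the point is how much text gets swapped, not what it becomes:

```python
import random

def simulate_revision(words, r, span=5, seed=0):
    """Replace random `span`-word windows until about r% of `words` has
    been swapped out; returns the edited text and the actual fraction."""
    rng = random.Random(seed)
    out = list(words)
    target = len(out) * r / 100
    replaced = set()
    while len(replaced) < target:
        start = rng.randrange(0, len(out) - span + 1)
        for i in range(start, start + span):
            out[i] = "<filled>"   # stand-in for a T5-3B sample
            replaced.add(i)
    return out, len(replaced) / len(out)

text = ["tok%d" % i for i in range(100)]
edited, frac = simulate_revision(text, r=20)
print(round(frac, 2))  # roughly 0.20 (windows overlap, so it can overshoot)
```

      Real human edits would be burstier and less uniform than random windows, which is exactly my reservation above.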

    11. DetectGPT’s performance in particular is mostly unaffected by the change in language from English to German

      Performance of this method is robust against changes between languages (e.g. English to German)

    12. Because the GPT-3 API does not provide access to the complete conditional distribution for each token, we cannot compare to the rank, log rank, and entropy-based prior methods

      The GPT-3 API does not expose the conditional probabilities for each token, so we can't compare to some of the prior methods. That seems to suggest that DetectGPT can be used with only limited knowledge of the probabilities.

    13. improving detection of fake news articles generated by 20B parameter GPT-NeoX

      The authors test their approach on GPT-NeoX. The question would be whether we can get hold of the log probs from ChatGPT to do the same

    14. This approach, which we call DetectGPT, does not require training a separate classifier, collecting a dataset of real or generated passages, or explicitly watermarking generated text. It uses only log probabilities computed by the model of interest and random perturbations of the passage from another generic pre-trained language model (e.g., T5)

      The novelty of this approach is that it is cheap to set up as long as you have the log probabilities generated by the model of interest.

    15. See ericmitchell.ai/detectgpt for code, data, and other project information.

      Code and data available at https://ericmitchell.ai/detectgpt

  2. Jan 2023
    1. The potential size of this market is hard to grasp — somewhere between all software and all human endeavors

      I don't think "all" software needs generative AI, nor do all human endeavors benefit from it, especially when you consider the prerequisite internet access and huge processing requirements.

    2. Other hardware options do exist, including Google Tensor Processing Units (TPUs); AMD Instinct GPUs; AWS Inferentia and Trainium chips; and AI accelerators from startups like Cerebras, Sambanova, and Graphcore. Intel, late to the game, is also entering the market with their high-end Habana chips and Ponte Vecchio GPUs. But so far, few of these new chips have taken significant market share. The two exceptions to watch are Google, whose TPUs have gained traction in the Stable Diffusion community and in some large GCP deals, and TSMC, who is believed to manufacture all of the chips listed here, including Nvidia GPUs (Intel uses a mix of its own fabs and TSMC to make its chips).

      The market share of TensorFlow and PyTorch, which both offer first-class Nvidia support, likely spells out the story. If you are getting into AI you go learn one of those frameworks, and they tell you to install CUDA.

    3. Commoditization. There’s a common belief that AI models will converge in performance over time. Talking to app developers, it’s clear that hasn’t happened yet, with strong leaders in both text and image models. Their advantages are based not on unique model architectures, but on high capital requirements, proprietary product interaction data, and scarce AI talent. Will this serve as a durable advantage?

      All current-generation models have more-or-less the same architecture and training regimes. Differentiation is in the training data and the number of parameters that the company can afford to scale to.

    4. In natural language models, OpenAI dominates with GPT-3/3.5 and ChatGPT. But relatively few killer apps built on OpenAI exist so far, and prices have already dropped once.

      OpenAI have already dropped prices on their GPT-3/3.5 models and relatively few apps have emerged. This could be because companies are reluctant to build their core offering around a third party API

    5. Vertical integration (“model + app”). Consuming AI models as a service allows app developers to iterate quickly with a small team and swap model providers as technology advances. On the flip side, some devs argue that the product is the model, and that training from scratch is the only way to create defensibility — i.e. by continually re-training on proprietary product data. But it comes at the cost of much higher capital requirements and a less nimble product team.

      There's definitely a middle ground of taking an open-source model that is suitably mature and fine-tuning it for a specific use case. You could start without a moat and build one over time by collecting usage data (similar to a network effect).

    6. Many apps are also relatively undifferentiated, since they rely on similar underlying AI models and haven’t discovered obvious network effects, or data/workflows, that are hard for competitors to duplicate.

      Companies that rely on underlying AI models without adding value via model improvements are going to find that they have no moat.

    7. We’re also not going deep here on MLops or LLMops tooling, which is not yet highly standardized and will be addressed in a future post.

      first mention of LLMops I've seen in the wild

    8. Over the last year, we’ve met with dozens of startup founders and operators in large companies who deal directly with generative AI. We’ve observed that infrastructure vendors are likely the biggest winners in this market so far, capturing the majority of dollars flowing through the stack. Application companies are growing topline revenues very quickly but often struggle with retention, product differentiation, and gross margins. And most model providers, though responsible for the very existence of this market, haven’t yet achieved large commercial scale.

      Infrastructure vendors are laughing all the way to the bank because companies are dumping millions on GPUs. Meanwhile, the people building apps on top of these models are struggling. We've seen this sort of gold-rush before and infrastructure providers are selling the shovels.

    1. Here I’ve summarized Christian Tietze’s process, which I’m presently adopting / adapting:

      Andy is adapting the approach of zettelkasten writer Christian Tietze.

    2. You need to take a step back and form a picture of the overall structure of the ideas. Concretely, you might do that by clustering your scraps into piles and observing the structure that emerges. Or you might sketch a mind map or a visual outline.

      Andy suggests taking a step back and clustering annotations into piles or using a mind map or visualisations to identify common themes.

      I wonder if this is a bit overkill for the number of notes I tend to take or a sign that I'm not taking enough notes?

      What tools are out there that could integrate with my stack and help me do this?

  3. Dec 2022
    1. Happiness is pushed to some later date in the future while your present self battles with the misery of the current moment.

      Journey before destination, don't get caught up in the future, you'll miss the now. Instead, rest in motion

    2. Positive fantasies allow you to indulge in the desired future mentally…You can taste the sensations of what it’s like to achieve your goal in the present — this depletes your energy to pursue your desired future.

      It's easy to get caught up fantasising about what you could achieve rather than actually taking action to achieve it.

    1. My goal was simply to scale this ladder over time. I worked the list 5 people at a time, starting at the bottom. I engaged relentlessly with those accounts until they noticed me and began engaging back.

      Interesting approach, and these people are going to be great sources for picking up new knowledge and self-learning too!

    2. Don’t try to convince everyone that what you say, feel, think, or have done is better than everyone else.

      This is pretty normal for those of us who are academically inclined, so it shouldn't be too much of a stretch - after all, a lot of the time what we're doing is thinking critically about other people's work.

    3. My goal with my content is to make it so recognizable that you would know it was me even if it didn't have my name on it. The same style. The same thought process. The same character.

      building a recognisable tone of voice can help with repeat visitors

    1. Now, this can all be defeated with enough effort. For example, if you used another AI to paraphrase GPT’s output—well okay, we’re not going to be able to detect that. On the other hand, if you just insert or delete a few words here and there, or rearrange the order of some sentences, the watermarking signal will still be there. Because it depends only on a sum over n-grams, it’s robust against those sorts of interventions.

      this mechanism can be defeated by paraphrasing the output with another model

    2. Anyway, we actually have a working prototype of the watermarking scheme, built by OpenAI engineer Hendrik Kirchner. It seems to work pretty well—empirically, a few hundred tokens seem to be enough to get a reasonable signal that yes, this text came from GPT. In principle, you could even take a long text and isolate which parts probably came from GPT and which parts probably didn’t.

      Scott's team has already developed a prototype watermarking scheme at OpenAI and it works pretty well.

    3. So then to watermark, instead of selecting the next token randomly, the idea will be to select it pseudorandomly, using a cryptographic pseudorandom function, whose key is known only to OpenAI.

      Watermarking by selecting the next token with a keyed cryptographic pseudorandom function instead of sampling it truly at random.
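      A much-simplified sketch of the idea. The real scheme biases sampling using the model's probabilities (via an exponential-minimum trick) rather than ignoring them, and the key and candidate tokens here are made up:

```python
import hashlib
import hmac

KEY = b"known-only-to-the-provider"  # hypothetical secret key

def prf_score(key, ngram, token):
    """Keyed pseudorandom score in [0, 1) for `token` following `ngram`."""
    msg = " ".join(ngram + (token,)).encode()
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def pick_token(key, ngram, candidates):
    # Deterministically favour the candidate the PRF scores highest; the
    # key holder can later re-compute these scores and flag text whose
    # average score over n-grams is implausibly high.
    return max(candidates, key=lambda t: prf_score(key, ngram, t))

context = ("the", "cat")
candidates = ["sat", "slept", "ran"]
chosen = pick_token(KEY, context, candidates)
print(chosen in candidates)  # True; without KEY the text looks ordinary
```

      Because detection sums scores over n-grams, inserting or deleting a few words leaves most n-grams (and hence the signal) intact, which is the robustness Scott describes.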

    4. Eventually GPT will say, “oh, I know what game we’re playing! it’s the ‘give false answers’ game!” And it will then continue playing that game and give you more false answers. What the new paper shows is that, in such cases, one can actually look at the inner layers of the neural net and find where it has an internal representation of what was the true answer, which then gets overridden once you get to the output layer.

      This is fascinating - GPT internally represents the true answer to a question but overrides it at the output layer when the conversational context (here, the 'give false answers' game) calls for it.

    5. (3) A third direction, and I would say maybe the most popular one in AI alignment research right now, is called interpretability. This is also a major direction in mainstream machine learning research, so there’s a big point of intersection there. The idea of interpretability is, why don’t we exploit the fact that we actually have complete access to the code of the AI—or if it’s a neural net, complete access to its parameters? So we can look inside of it. We can do the AI analogue of neuroscience. Except, unlike an fMRI machine, which gives you only an extremely crude snapshot of what a brain is doing, we can see exactly what every neuron in a neural net is doing at every point in time. If we don’t exploit that, then aren’t we trying to make AI safe with our hands tied behind our backs?

      Interesting metaphor - it is a bit like fMRI for neural networks, but far more precise, since every neuron is visible at every point in time.

    6. “AI alignment”

      AI alignment is the Terminator situation. This versus AI ethics, which is more the concern that current models are racist etc.

    7. And famously, self-driving cars have taken a lot longer than many people expected a decade ago. This is partly because of regulatory barriers and public relations: even if a self-driving car actually crashes less than a human does, that’s still not good enough, because when it does crash the circumstances are too weird. So, the AI is actually held to a higher standard. But it’s also partly just that there was a long tail of really weird events. A deer crosses the road, or you have some crazy lighting conditions—such things are really hard to get right, and of course 99% isn’t good enough here.

      I think the emphasis is wrong here. The regulation is secondary. The long tail of weird events is the more important thing.

    8. Okay, but one thing that’s been found empirically is that you take commonsense questions that are flubbed by GPT-2, let’s say, and you try them on GPT-3, and very often now it gets them right. You take the things that the original GPT-3 flubbed, and you try them on the latest public model, which is sometimes called GPT-3.5 (incorporating an advance called InstructGPT), and again it often gets them right. So it’s extremely risky right now to pin your case against AI on these sorts of examples! Very plausibly, just one more order of magnitude of scale is all it’ll take to kick the ball in, and then you’ll have to move the goal again.

      the stochastic parrots argument could be defeated as models get bigger and more complex

    1. If my interpretation of the Retrieval quadrant is correct, it will become much more difficult to be an average, or even above average, writer. Only the best will flourish. Perhaps we will see a rise in neo-generalists.

      This is probably true of average or poor software engineers given that GPT-3 can produce pretty reasonable code snippets

    1. For many intellectual tasks, the people with the least skill overestimate themselves the most, a pattern popularly known as the Dunning–Kruger effect (DKE). The dominant account of this effect depends on the idea that assessing the quality of one's performance (metacognition) requires the same mental resources as task performance itself (cognition). Unskilled people are said to suffer a dual burden: they lack the cognitive resources to perform well, and this deprives them of metacognitive insight into their failings. In this Registered Report, we applied recently developed methods for the measurement of metacognition to a matrix reasoning task, to test the dual-burden account. Metacognitive sensitivity (information exploited by metacognition) tracked performance closely, so less information was exploited by the metacognitive judgements of poor performers; but metacognitive efficiency (quality of metacognitive processing itself) was unrelated to performance. Metacognitive bias (overall tendency towards high or low confidence) was positively associated with performance, so poor performers were appropriately less confident—not more confident—than good performers. Crucially, these metacognitive factors did not cause the DKE pattern, which was driven overwhelmingly by performance scores. These results refute the dual-burden account and suggest that the classic DKE is a statistical regression artefact that tells us nothing much about metacognition.

      The Dunning-Kruger effect (DKE) seems to be a statistical regression artefact that doesn't actually explain whether people who are good at a task are able to estimate their own abilities at the task

    1. AI training data is filled with racist stereotypes, pornography, and explicit images of rape, researchers Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe found after analyzing a data set similar to the one used to build Stable Diffusion.

      That is horrifying. You'd think that authors would attempt to remove or filter this kind of material. There are, after all models out there that are trained to find it. It makes me wonder what awful stuff is in the GPT-3 dataset too.

    1. Throughout the 80s and 90s, private equity firms and hedge funds gobbled up local news enterprises to extract their real estate. They didn’t give a shit about journalism; they just wanted prime real estate that they could develop. And news organizations had it in the form of buildings in the middle of town. So financiers squeezed the news orgs until there was no money to be squeezed and then they hung them out to dry.

      Wild that driving functional organisations into the ground could just be the cost of doing business

    2. Perceptions of failure don’t always lead to shared ideas of how to learn from these lessons.

      Really good insight that I hadn't really considered before. If normally opposing parties reach the same end goal, then nobody wants to think about why; we'd rather just take the win.

    1. every country is going to need to reconsider its policies on misinformation. It’s one thing for the occasional lie to slip through; it’s another for us all to swim in a veritable ocean of lies. In time, though it would not be a popular decision, we may have to begin to treat misinformation as we do libel, making it actionable if it is created with sufficient malice and sufficient volume.

      What to do then when our government reps are already happy to perpetuate "culture wars" and empty talking points?

    2. anyone skilled in the art can now replicate their recipe.

      Well, anyone skilled enough who has $500k for the GPU bill, plus access to, and the means to store, the corpus... So corporations, I guess... Yay!

    1. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher

      By using more data on a smaller language model the authors were able to achieve better performance than with the larger models - this reduces the cost of using the model for inference.

    1. Amazon is hated on the right as a bulwark of progressivism. For instance, to pick a random example, GOP icon Tucker Carlson recently characterized the firm’s behavior as ‘modern-day book burning.’ And you can find an endless number of right-wing critiques. Conservatives distrust Amazon.

      That is really interesting. Amazon is not exactly renowned as an upholder of progressive values by the left either.

    1. Whether you want to call them mottos, memes, or manifestos, words can be the building blocks of how we think and transmit ideas. You can also gauge how well someone is grasping your concepts—or at least making an effort to—by the language they’re responding to you with as well.

      You can use the way that a person responds to your concepts as a metric for how well they understand you. If they don't understand, chances are they will retreat to jargon to hide the fact that they're struggling. If they're getting on well, they might have an insightful way to extend your metaphor.

    1. Of course, the closest you can get is having the activity available in your own living space, but as unused home treadmills and exercise bikes demonstrate, this has its pitfalls. There could be something about a thing always being available that means there’s never any urgency.

      There seems to be a minimum at which hyperbolic discounting stops working because things are too easy to access

    2. You may have heard of hyperbolic discounting from behavioral economics: people will generally disproportionally, i.e. hyperbolically, discount the value of something the farther off it is. The average person judges $15 now as equivalent to $30 in 3-months (an annual rate of return of 277%!).

      this is fascinating and must relate to delayed gratification
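      The standard hyperbolic form is V = A / (1 + kD). Fitting the article's example ($30 in three months felt as $15 now) gives k = 4; that value is an illustrative back-of-envelope fit of mine, not a quoted parameter:

```python
def hyperbolic_value(amount, delay_years, k):
    """Perceived present value under hyperbolic discounting: V = A / (1 + k*D)."""
    return amount / (1 + k * delay_years)

# 15 = 30 / (1 + k * 0.25)  =>  k = 4
k = 4
print(hyperbolic_value(30, 0.25, k))   # 15.0: $30 in 3 months feels like $15
print(hyperbolic_value(30, 1.0, k))    # 6.0: the discount deepens with delay
```

      The delayed-gratification link is exactly this curve: the further away the reward, the more steeply its felt value collapses.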

    1. It’s always worth gathering information, nurturing other projects, and putting together some backup plans. You’ll need to define what success means to you for each of them, because you won’t make overnight progress; instead, you’re best served picking projects that you can learn critical lessons from, even if you fail

      It's interesting because this way of thinking is eminently compatible with the zettelkasten way of thinking e.g. don't necessarily set out with a hypothesis in mind that you're trying to prove but rather explore until something interesting emerges.

    1. “… there are about 25 billion car trips per year, and with some 27 million cars, this suggests an average of just under 18 trips per car every week. Since the duration of the average car trip is about 20 minutes, the typical car is only on the move for 6 hours in the week: for the remaining 162 hours it is stationary – parked.”

      This may be napkin maths but this is pretty shocking to think about. There must be a better way!

  4. Nov 2022
    1. Extractive summarization may be regarded as a contextual bandit as follows. Each document is a context, and each ordered subset of a document’s sentences is a different action

      We can represent extractive summarization as a bandit problem by treating the document as the context and each ordered subset of the document's sentences as an action the agent could take.

    2. Bandit is a decision-making formalization in which an agent repeatedly chooses one of several actions, and receives a reward based on this choice.

      Definition of a (contextual) bandit: an agent that repeatedly chooses one of several actions and receives a reward based on this choice.
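      A toy epsilon-greedy version of that loop, with a fixed reward standing in for a summary-quality score such as ROUGE (the class, names and setup are mine, not the paper's):

```python
import random

class EpsilonGreedyBandit:
    """Minimal contextual bandit: per-(context, action) mean-reward
    estimates, epsilon-greedy action choice."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.q = {}   # (context, action) -> running mean reward
        self.n = {}   # (context, action) -> pull count

    def choose(self, context):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)        # explore
        return max(self.actions, key=lambda a: self.q.get((context, a), 0.0))

    def update(self, context, action, reward):
        key = (context, action)
        self.n[key] = self.n.get(key, 0) + 1
        self.q[key] = self.q.get(key, 0.0) + (reward - self.q.get(key, 0.0)) / self.n[key]

# Context = the document; actions = candidate ordered sentence subsets.
bandit = EpsilonGreedyBandit(actions=[(0, 1), (1, 2)])
for _ in range(200):
    action = bandit.choose("doc1")
    reward = 1.0 if action == (0, 1) else 0.0  # pretend ROUGE prefers (0, 1)
    bandit.update("doc1", action, reward)
print(bandit.q[("doc1", (0, 1))])  # 1.0: the preferred subset is learned
```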

    1. increasing body of research analytically explores the consequences of the research impact agenda on academic work, including the risks posed to research quality (Chubb and Reed 2018), prioritising of short-term impacts rather than more conceptual impacts (Greenhalgh and Fahy 2015; Meagher and Martin 2017), ethical risks (Smith and Stewart 2017), and a focus on individual academics rather than on the broader context of research-based policy change (Dunlop 2018)

      Lots of papers discuss how the UK's research impact agenda affects the quality of research and individual researchers.

    1. Unsurprisingly, therefore, existing research documents various ways in which REF impact has become embedded within university governance, including via the broadening of career progression criteria (Bandola-Gill 2019)

      REF has become embedded within university governance - including career progression criteria (for researchers presumably)

    2. RAND report that had been commissioned by HEFCE (Grant et al. 2010)

      interesting ties here between REF and ResearchFish - both came out of RAND

    1. while there are groups potentially benefiting from the case studies relating to their field of research (eg writers benefiting from studies in Panel D, engineers benefiting from studies in Panel B), there are mentions of these potential beneficiaries across all the panels

      The beneficiaries of research named by REF impact case studies are heterogeneous across all UOAs

    2. With the benefit of hindsight, our analysis would have been much easier if the case studies had greater structure and used standardized definitions. Given that the case studies spanned a 20-year period, organization names have changed in that time and keyword searches were not sophisticated enough to capture some key information.

      I found similar in my 2017 work. I'd guess that modern vector-based analyses and entity linking approaches could help a lot with reconciling these issues now.

    3. Topic modelling was used to determine common topics across the whole corpus. Sixty-five topics were found (of which 60 were used) using the Apache Mallet Toolkit Latent Dirichlet Allocation (LDA) algorithm.

      The authors ran LDA over the full-text case studies using the Apache Mallet implementation, finding 65 topics of which 60 were used.

    4. ‘any effect on, change or benefit to the economy, society, culture, public policy or services, health, the environment or quality of life, beyond academia’ (REF, 2011).

      the REF definition of impact as it pertains to comprehensive impact (and as opposed to academic impact)

    1. A blog post is a very long and complex search query to find fascinating people and make them route interesting stuff to your inbox.

      This is a really cool take on blogging. By writing about interesting people and stuff you are increasing your chances of meeting someone cool and indeed increasing your luck

    1. Research funders and providers are having to compete with other public services, and, as such, must be able to advocate the need for funding of research. Leaders within the sector must have compelling arguments to ‘make the case’ for research. For example, the Research Councils each publish an annual impact report which describe the ways in which they are maximising the impacts of their investments. These reports include illustrations of how their research and training has made a contribution to the economy and society. The analysis of Researchfish and other similar data can support the development of these cases

      For research councils, being able to illustrate how their research impacts the economy and society helps them to compete for and justify their continued funding.

    2. Research outputs (and outcomes and impact) are gathered through a ‘question set’ developed by funding institutions through a consultative process. This set of 16 questions contains 175 sub-questions as illustrated in Figure 3 (the full set of questions are available in Annex A). A researcher, or one of their delegates, can add, edit and delete entries, and crucially, attribute entries to research grants and awards

      RF allows researchers to input fine-grained information about the research that they have done and this information is passed back to the funding bodies.

    3. The term ‘impact’ is currently used widely in research, especially with the inclusion of non-academic impact as part of the latest Research Excellence Framework (REF)

      RF uses a similar definition of impact to that of REF.

    1. look at the economic impact of research – taking an area of research (often cardiovascular disease), calculating the total investment in research and comparing it to the total payback in terms of monetarised health benefit and other economic effects.

      Interesting to see that the authors consider these macro-level economic indicators "broad and shallow", but it does make sense. Ideally we want to understand the individual contributions of works to economic impact.

    2. However, knowledge production is normally only an intermediate aim: the ultimate objective of most medical research is to improve health and prosperity.

      Exactly! Measuring citation counts doesn't help us understand whether research actually helped people

    3. Much broad and shallow evaluation is based on bibliometrics (examining the quality of research publications) to assess the amount and quality of knowledge produced

      here the authors are discussing the fact that a lot of analysis/evaluation of research is done via bibliometrics (citation-based impact metrics) and they consider this kind of evaluation to be "broad and shallow"

    1. Matthew Hindman, in his book "The Internet Trap" <http://assets.press.princeton.edu/chapters/s13236.pdf>, notes that most research on the internet has focused on its supposedly decentralized nature, leaving us with little language to really grapple with the concentrated, oligopolistic state of today's online economy, where the vast majority of attention and revenue accrue to a tiny number of companies

      This is a really nice summary - "the internet" is still talked about as if it is still 1999 whereas in reality today's internet can be equated to "where I consume services from FAANG" for most people

    1. Nim in Action book

      todo: procure this

    2. This isn't a highly scientific post full of esoteric details and language feature matrices. It's about making the best choice for what I can be the most productive in for my target market and product requirements.

      this post is more about the author's needs and requirements. It does not attempt to be objective

    1. Annotations are the first step of getting useful insights into my notes. This makes it a prerequisite to be able to capture annotations in my note making tool Obsidian, otherwise Hypothes.is is just another silo you’re wasting time on. Luckily h. isn’t meant as a silo and has an API. Using the API and the Hypothes.is-to-Obsidian plugin all my annotations are available to me locally.

      This is key - exporting annotations via the API to either public commonplace books (Chris A Style) or to a private knowledge store seems to be pretty common.
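
      A minimal sketch of the convert step such a plugin performs: mapping one row of a Hypothes.is `/api/search` response to a Markdown note. The field names (`uri`, `text`, `TextQuoteSelector`) follow the public API's response shape; the Markdown layout itself is just an illustrative choice.

```python
def annotation_to_markdown(row: dict) -> str:
    """Render one Hypothes.is /api/search row as a Markdown note block.

    Field names (uri, text, target/selector) follow the Hypothes.is
    API response format; the output layout is an arbitrary choice.
    """
    quote = ""
    for target in row.get("target", []):
        for selector in target.get("selector", []):
            if selector.get("type") == "TextQuoteSelector":
                quote = selector.get("exact", "")
    parts = [
        f"> {quote}" if quote else "",     # the highlighted passage
        row.get("text", ""),               # my own annotation note
        f"Source: {row.get('uri', '')}",   # where it was annotated
    ]
    return "\n\n".join(p for p in parts if p)

# Example row, shaped like a Hypothes.is /api/search result
row = {
    "uri": "https://example.com/post",
    "text": "My own comment on the passage.",
    "target": [{"selector": [
        {"type": "TextQuoteSelector", "exact": "the highlighted passage"}
    ]}],
}
print(annotation_to_markdown(row))
```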

    2. In the same category of integrating h. into my pkm workflows, falls the interaction between h. and Zotero, especially now that Zotero has its own storage of annotations of PDFs in my library. It might be of interest to be able to share those annotations, for a more complete overview of what I’m annotating. Either directly from Zotero, or by way of my notes in Obsidian (Zotero annotations end up there in the end)

      I've been thinking about this exact same flow. Given that I'm mostly annotating scientific papers I got from open access journals I was wondering whether there might be some way to syndicate my zotero annotations back to h via a script.

    1. Whatever your thing is, make the thing you wish you had found when you were learning. Don’t judge your results by “claps” or retweets or stars or upvotes - just talk to yourself from 3 months ago

      Completely agree, this is a great intrinsic metric to measure the success of your work by.

    2. a habit of creating learning exhaust:

      not sure I love the metaphor but I can definitely see the advantages of leaving your learnings "out there" for others to see and benefit from

    1. I love the IndieWeb and its tools, but it has always bothered me that at some point they basically require you to have a webdevelopment background.

      Yeah this is definitely a concern and a major barrier for adoption at the moment.

    1. First, to experiment personally with AP itself, and if possible with the less known Activities that AP could support, e.g. travel and check-ins. This as an extension of my personal site in areas that WordPress, OPML and RSS currently can’t provide to me. This increases my own agency, by adding affordances to my site. This in time may mean I won’t be hosting or self-hosting my personal Mastodon instance. (See my current fediverse activities)

      Interesting for me to explore and understand too. How does AP compare to micropub which can be used for similar purposes? As far as I can tell it is much more heavyweight

    1. For example, the design pattern A Place to Wait asks that we create comfortable accommodation and ambient activity whenever someone needs to wait; benches, cafes, reading rooms, miniature playgrounds, three-reel slot machines (if we happen to be in the Las Vegas airport). This solves the problem of huddles of people awkwardly hovering in liminal space; near doorways, taking up sidewalks, anxiously waiting for delayed flights or dental operations or immigration investigations without anything to distract them from uncertain fates.

      Amazing to think how ubiquitous waiting rooms are and how we take them for granted

    1. Misleading Templates There is no consistent relation between the performance of models trained with templates that are moderately misleading (e.g. {premise} Can that be paraphrased as "{hypothesis}"?) vs. templates that are extremely misleading (e.g., {premise} Is this a sports news? {hypothesis}). T0 (both 3B and 11B) perform better given misleading-moderate (Figure 3), ALBERT and T5 3B perform better given misleading-extreme (Appendices E and G.4), whereas T5 11B and GPT-3 perform comparably on both sets (Figure 2; also see Table 2 for a summary of statistical significances.) Despite a lack of pattern between

      Their misleading templates really are misleading

      {premise} Can that be paraphrased as "{hypothesis}"

      {premise} Is this a sports news? {hypothesis}

    2. In sum, notwithstanding prompt-based models’ impressive improvement, we find evidence of serious limitations that question the degree to which such improvement is derived from models understanding task instructions in ways analogous to humans’ use of task instructions.

      although prompts seem to help NLP models improve their performance, the authors find that this improvement persists even when prompts are deliberately misleading, which is a bit weird

    3. Suppose a human is given two sentences: “No weapons of mass destruction found in Iraq yet.” and “Weapons of mass destruction found in Iraq.” They are then asked to respond 0 or 1 and receive a reward if they are correct. In this setup, they would likely need a large number of trials and errors before figuring out what they are really being rewarded to do. This setup is akin to the pretrain-and-fine-tune setup which has dominated NLP in recent years, in which models are asked to classify a sentence representation (e.g., a CLS token) into some

      This is a really excellent illustration of the difference in paradigm between "normal" text model fine tuning and prompt-based modelling
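
      The contrast can be made concrete with the paper's own example pair; the template below is one of their actual prompts, while the tuple-and-label encoding is just a schematic stand-in for the fine-tuning setup:

```python
# The NLI example pair from the paper's illustration
premise = "No weapons of mass destruction found in Iraq yet."
hypothesis = "Weapons of mass destruction found in Iraq."

# Fine-tuning paradigm: the model sees the raw pair and must learn to
# map an opaque sentence representation to an arbitrary class id.
fine_tune_input = (premise, hypothesis)  # target label: 0

# Prompting paradigm: the task is restated in natural language, so a
# plausible continuation ("No") directly expresses the label.
template = '{premise} Can that be paraphrased as "{hypothesis}"?'
prompt = template.format(premise=premise, hypothesis=hypothesis)
print(prompt)
```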

    1. Antibiotic resistance has become a growingworldwide concern as new resistance mech-anisms are emerging and spreading globally,and thus detecting and collecting the cause– Antibiotic Resistance Genes (ARGs), havebeen more critical than ever. In this work,we aim to automate the curation of ARGs byextracting ARG-related assertive statementsfrom scientific papers. To support the researchtowards this direction, we build SCIARG, anew benchmark dataset containing 2,000 man-ually annotated statements as the evaluationset and 12,516 silver-standard training state-ments that are automatically created from sci-entific papers by a set of rules. To set upthe baseline performance on SCIARG, weexploit three state-of-the-art neural architec-tures based on pre-trained language modelsand prompt tuning, and further ensemble themto attain the highest 77.0% F-score. To the bestof our knowledge, we are the first to leveragenatural language processing techniques to cu-rate all validated ARGs from scientific papers.Both the code and data are publicly availableat https://github.com/VT-NLP/SciARG.

      The authors use prompt training on LLMs to build a classifier that can identify statements that describe whether or not micro-organisms have antibiotic resistant genes in scientific papers.

    1. Our annotators achieve the highest precision with OntoNotes, suggesting that most of the entities identified by crowdworkers are correct for this dataset.

      interesting that the mention detection algorithm gives poor precision on OntoNotes and the annotators get high precision. Does this imply that there are a lot of invalid mentions in this data and the guidelines for ontonotes are correct to ignore generic pronouns without pronominals?

    2. an algorithm with high precision on LitBank or OntoNotes would miss a huge percentage of relevant mentions and entities on other datasets (constraining our analysis)

      these datasets have the most limited/constrained definitions for co-reference and what should be marked up so it makes sense that precision is poor in these datasets

    3. Procedure: We first launch an annotation tutorial (paid $4.50) and recruit the annotators on the AMT platform. At the end of the tutorial, each annotator is asked to annotate a short passage (around 150 words). Only annotators with a B3 score (Bagga

      Annotators are asked to complete a quality control exercise and only annotators who achieve a B3 score of 0.9 or higher are invited to do more annotation
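
      For reference, the B3 metric being thresholded here can be sketched in a few lines (a simplified version of Bagga & Baldwin's definition, not the authors' actual evaluation script):

```python
def b_cubed(response: list, key: list) -> tuple:
    """B3 precision/recall between a response and a key (gold) clustering.

    Clusters are sets of mention ids. A minimal sketch of the metric,
    assuming every mention appears in both clusterings.
    """
    def avg_overlap(a_clusters, b_clusters):
        # For each mention, score the overlap between its cluster in
        # a_clusters and its cluster in b_clusters, then average.
        scores = []
        for a in a_clusters:
            for m in a:
                b = next((c for c in b_clusters if m in c), set())
                scores.append(len(a & b) / len(a))
        return sum(scores) / len(scores)

    precision = avg_overlap(response, key)
    recall = avg_overlap(key, response)
    return precision, recall

# Perfect agreement scores 1.0 on both axes
p, r = b_cubed([{1, 2}, {3}], [{1, 2}, {3}])
print(p, r)  # 1.0 1.0
```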

    4. Annotation structure: Two annotation approaches are prominent in the literature: (1) a local pairwise approach, annotators are shown a pair of mentions and asked whether they refer to the same entity (Hladká et al., 2009; Chamberlain et al., 2016a; Li et al., 2020; Ravenscroft et al., 2021), which is time-consuming; or (2) a cluster-based approach (Reiter, 2018; Oberle, 2018; Bornstein et al., 2020), in which annotators group all mentions of the same entity into a single cluster. In ezCoref we use the latter approach, which can be faster but requires the UI to support more complex actions for creating and editing cluster structures.

      ezCoref presents clusters of coreferences all at the same time - this is a nice efficient way to do annotation versus pairwise annotation (like we did for CD^2CR)

    5. However, these datasets vary widely in their definitions of coreference (expressed via annotation guidelines), resulting in inconsistent annotations both within and across domains and languages. For instance, as shown in Figure 1, while ARRAU (Uryupina et al., 2019) treats generic pronouns as non-referring, OntoNotes chooses not to mark them at all

      One of the big issues is that different co-reference datasets have significant differences in annotation guidelines even within the coreference family of tasks - I found this quite shocking as one might expect coreference to be fairly well defined as a task.

    6. Specifically, our work investigates the quality of crowdsourced coreference annotations when annotators are taught only simple coreference cases that are treated uniformly across existing datasets (e.g., pronouns). By providing only these simple cases, we are able to teach the annotators the concept of coreference, while allowing them to freely interpret cases treated differently across the existing datasets. This setup allows us to identify cases where our annotators disagree among each other, but more importantly cases where they unanimously agree with each other but disagree with the expert, thus suggesting cases that should be revisited by the research community when curating future unified annotation guidelines

      The aim of the work is to examine a simplified subset of co-reference phenomena which are generally treated the same across different existing datasets.

      This makes spotting inter-annotator disagreement easier - presumably because for simpler cases there are fewer modes of failure?

    7. In this work, we develop a crowdsourcing-friendly coreference annotation methodology, ezCoref, consisting of an annotation tool and an interactive tutorial. We use ezCoref to re-annotate 240 passages from seven existing English coreference datasets (spanning fiction, news, and multiple other domains) while teaching annotators only cases that are treated similarly across these datasets

      this paper describes a new efficient coreference annotation tool which simplifies co-reference annotation. They use their tool to re-annotate passages from widely used coreference datasets.

    1. One example could be putting all files into an Amazon S3 bucket. It’s versatile, cheap and integrates with many technologies. If you are using Redshift for your data warehouse, it has great integration with that too.

      Essentially the raw data needs to be vaguely homogenised and put into a single place
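
      A sketch of what "a single place" can look like in practice: a date-partitioned key convention for a raw S3 landing zone. The `raw/<source>/<dataset>/dt=...` layout is purely illustrative, not a standard:

```python
from datetime import date


def raw_landing_key(source: str, dataset: str, day: date, filename: str) -> str:
    """Build a partitioned S3 object key for raw, untransformed files.

    Keeping each day's drop of each upstream dataset under one
    predictable prefix makes later transformation jobs (and tools like
    Redshift Spectrum) easy to point at the data. The naming scheme
    here is an illustrative convention, not a standard.
    """
    return f"raw/{source}/{dataset}/dt={day.isoformat()}/{filename}"


key = raw_landing_key("crm", "contacts", date(2021, 6, 1), "contacts.csv")
print(key)  # raw/crm/contacts/dt=2021-06-01/contacts.csv
```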

    1. In recent years, the neural network based topic models have been proposed for many NLP tasks, such as information retrieval [11], aspect extraction [12] and sentiment classification [13]. The basic idea is to construct a neural network which aims to approximate the topic-word distribution in probabilistic topic models. Additional constraints, such as incorporating prior distribution [14], enforcing diversity among topics [15] or encouraging topic sparsity [16], have been explored for neural topic model learning and proved effective.

      Neural topic models are often trained to mimic the behaviours of probabilistic topic models - I should come back and look at some of the works:

      • R. Das, M. Zaheer, and C. Dyer, “Gaussian LDA for topic models with word embeddings,”
      • P. Xie, J. Zhu, and E. P. Xing, “Diversity-promoting bayesian learning of latent variable models,”
      • M. Peng, Q. Xie, H. Wang, Y. Zhang, X. Zhang, J. Huang, and G. Tian, “Neural sparse topical coding,”
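
      The core idea, a decoder whose softmax over topic/word embedding similarities stands in for the topic-word distribution, can be sketched with toy numbers (illustrative only, not any of the cited models):

```python
import math


def topic_word_distribution(topic_vecs, word_vecs):
    """Approximate a topic-word distribution with a softmax over
    topic/word embedding dot products, the basic decoder trick used by
    neural topic models. Toy vectors, not trained weights.
    """
    dist = []
    for t in topic_vecs:
        logits = [sum(ti * wi for ti, wi in zip(t, w)) for w in word_vecs]
        z = sum(math.exp(x) for x in logits)
        dist.append([math.exp(x) / z for x in logits])
    return dist


topics = [[1.0, 0.0], [0.0, 1.0]]             # 2 topic embeddings
words = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # 3 word embeddings
dist = topic_word_distribution(topics, words)
# each row is a probability distribution over the vocabulary
print([round(sum(row), 6) for row in dist])
```
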
    2. We argue that mutual learning would benefit sentiment classification since it enriches the information required for the training of the sentiment classifier (e.g., when the word “incredible” is used to describe “acting” or “movie”, the polarity should be positive)

      By training a topic model that has "similar" weights to the word vector model, the sentiment task can also be improved (as per the example, "incredible" should be positive when used to describe "acting" or "movie" in this context)

    3. However, such a framework is not applicable here since the learned latent topic representations in topic models can not be shared directly with word or sentence representations learned in classifiers, due to their different inherent meanings

      Latent word vectors and topic models learn different and entirely unrelated representations

    1. “The metaphor is that the machine understands what I’m saying and so I’m going to interpret the machine’s responses in that context.”

      Interesting metaphor for why humans are happy to trust outputs from generative models

    2. Elicit is really impressive. It searches academic papers, providing summary abstracts as well as structured analyses of papers. For example, it tries to identify the outcomes analysed in the paper or the conflicts of interest of the authors, as well as easily tracks citations. (See a similar search on “technology transitions”. Log in required.)

      https://elicit.org/ - another academic search engine

    1. I only know a handful of people directly making money from blogging (via ads, subscriptions etc) but I know many more who:

       • Got a better career because of blogging (new job, better pay etc)
       • Negotiated better contracts (e.g. with a publisher or platform) because they had “an audience”
       • Sold their own courses / ebooks / books / merchandise / music

       Blogging is this kind of engine that opens up economic opportunity and advantage. Being visible in the networked economy has real value.

      Making money from blogging isn't just a direct thing like selling ads or subscriptions. It can be indirect too, e.g. selling courses or books.

    1. I’ve been using this phrase “the next most useful thing” as a guiding light for my consulting work - I’m obsessed with being useful not just right. I’ve always rejected the fancy presentation in favor of the next most useful thing, and I simply took my eye off the ball with this one. I’m not even sure the client views this project as a real disappointment, there was still some value in it, but I’m mad at myself personally for this one. A good reminder not to take your eye off the ball. And to push your clients beyond what they tell you the right answer is.

      The customer is not always right (just in matters of taste). Part of consultancy is providing stewardship and pushing back, just like any role I guess

    2. Being self-employed feels a bit like being on an extended road trip. Untethered and free, but lonely and unsupported too. Ultimate freedoms combined with shallow roots.

      That's a super insightful take on the self employment thing that people probably don't consider that much when deciding whether to take the leap

    1. The actual reward state is not one where you're lazing around doing nothing. It's one where you're keeping busy, where you're doing things that stimulate you, and where you're resting only a fraction of the time. The preferred ground state is not one where you have no activity to partake in, it's one where you're managing the streams of activity precisely, and moving through them at the right pace: not too fast, but also not too slow. For that would be boring

      Doing nothing at all is boring. When we "rest" we are actually just doing activities that we find interesting rather than those we find dull or stressful.

    2. the work that needs to be done is not a finite list of tasks, it is a neverending stream. Clothes are always getting worn down, food is always getting eaten, code is always in motion. The goal is not to finish all the work before you; for that is impossible. The goal is simply to move through the work. Instead of struggling to reach the end of the stream, simply focus on moving along it.

      This is true and worth remembering. It is very easy to fall into the mindset of "I'll rest when I'm finished"

    1. It took me a while to grok where dbt comes in the stack but now that I (think) I have it, it makes a lot of sense. I can also see why, with my background, I had trouble doing so. Just as Apache Kafka isn’t easily explained as simply another database, another message queue, etc, dbt isn’t just another Informatica, another Oracle Data Integrator. It’s not about ETL or ELT - it’s about T alone. With that understood, things slot into place. This isn’t just my take on it either - dbt themselves call it out on their blog:

      Also - just because their "pricing" page caught me off guard and their website isn't that clear (until you click through to the technical docs) - I thought it's worth calling out that DBT appears to be an open-core platform. They have a SaaS offering and also an open source python command-line tool - it seems that these articles are about the latter

    2. Of course, despite what the "data is the new oil" vendors told you back in the day, you can’t just chuck raw data in and assume that magic will happen on it, but that’s a rant for another day ;-)

      Love this analogy - imagine chucking some crude into a black box and hoping for ethanol at the other end. Then, when you end up with diesel you have no idea what happened.

    3. Working with the raw data has lots of benefits, since at the point of ingest you don’t know all of the possible uses for the data. If you rationalise that data down to just the set of fields and/or aggregate it up to fit just a specific use case then you lose the fidelity of the data that could be useful elsewhere. This is one of the premises and benefits of a data lake done well.

      absolutely right - there's also a data provenance angle here - it is useful to be able to point to a data point that is 5 or 6 transformations from the raw input and be able to say "yes I know exactly where this came from, here are all the steps that came before"

    1. But - as the overall network has grown exponentially the network topology has changed. Digg, Reddit, Hacker News etc all still exist but the audience you can reach with a “homepage” hit there has become much smaller relative to the overall size of the network. And getting a homepage hit there is harder than ever because the volume of content has increased exponentially

      A similar dynamic can now be observed in the mass migration from twitter to mastodon. People who were successful at using the big "homepage" of twitter are likely to be a bit thrown by the fediverse but it represents an opportunity to connect with a smaller but more specialised audience.

    1. Many people report writers block with blogs, particularly after a big successful post, because it’s almost impossible to consistently pump out bangers.

      Certainly true, people go through peaks and troughs of productivity like seasons