69 Matching Annotations
  1. Apr 2024
    1. The same LM can be a much more or less capable agent depending on the enhancements added. The researchers created and tested four different agents built on top of GPT-4 and Anthropic’s Claude:

      While today’s LM agents don't pose a serious risk, we should be on the lookout for improved autonomous capabilities as LMs become more capable and reliable.

    2. The latest GPT-4 model from OpenAI, which is trained on human preferences using a technique called RLHF. Estimated final training run compute cost: ~$50m. Model version: gpt-4-0613

      ~$50m = estimated training cost of GPT-4

  2. Mar 2024
    1. Optimal chapter division of the novel "Barndommens Gade" for classroom use at the intermediate level

      "optimal"

  3. Jan 2024
  4. Nov 2023
    1. Method:

      Background:

      Noting that existing work has not explored which kinds of instruction datasets are more effective, nor which factors make for good instruction data, the authors set out to answer the question of what makes a good visual instruction. To this end, they first evaluate existing visual instruction sets, with the goal of identifying the key factors.

      They evaluate along two dimensions: task type and instruction characteristics. They select six representative instruction datasets and evaluate them with two typical models, BLIP-2 and MiniGPT-4. From the experiments they find: 1. For task type, visual reasoning tasks are important for improving the model's image captioning and question answering. 2. For instruction characteristics, increasing instruction complexity helps performance more than increasing task diversity or incorporating fine-grained annotations.

      Based on these findings, the authors construct a complex visual-reasoning instruction set to improve the model.

      The most direct approach is to use ChatGPT and GPT-4 to refine the instructions from the image annotations. But because instructions are cross-modal, LLMs may produce instructions that are too simple or that mention objects not actually present in the image. To address this, the authors propose a systematic, multi-stage pipeline for automatically generating a visual instruction dataset.

      Given an image and its available annotations (captions or objects), they adopt a generate-then-complicate-then-reformulate pipeline. Specifically, they first use a purpose-built prompt to generate an initial instruction. They then iterate a complicate-then-verify loop to gradually raise the instruction's complexity while preserving quality. Finally, they reformulate the instruction into multiple formats for better adaptability to downstream tasks.
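
      As a minimal sketch of that pipeline (with a hypothetical `llm` callable standing in for the ChatGPT/GPT-4 calls, and invented prompt strings — the real prompts are in the paper):

```python
# Hypothetical sketch of the generate -> complicate -> verify -> reformulate
# pipeline; `llm` is an assumed stand-in for a ChatGPT/GPT-4 call, and the
# prompt strings are invented for illustration.
def build_instruction(annotation, llm, rounds=3):
    # 1. Generate an initial instruction from the image annotation.
    instruction = llm(f"Write a visual instruction for: {annotation}")
    for _ in range(rounds):
        # 2. Complicate, then verify; keep the harder version only if it
        #    still checks out against the image annotation.
        harder = llm(f"Make this instruction more complex: {instruction}")
        if llm(f"Answerable from the image? {harder}") == "yes":
            instruction = harder
    # 3. Reformulate into multiple downstream formats.
    return [llm(f"Rewrite as {fmt}: {instruction}")
            for fmt in ("multiple choice", "short answer")]
```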

      Preliminaries:

      Visual instruction collection:

      Task types: previous instruction-tuning datasets all build on annotated images and mainly cover three task types: 1. Image captioning: generate a textual description of the image. 2. VQA: the model must answer a question about the image. 3. Visual reasoning: the model must reason over the image's content.

      To study the effect of task type, the authors take the most widely used instruction-tuning dataset, LLaVA-Instruct, and split it into three subsets: LLaVA-Caption, LLaVA-VQA, and LLaVA-Reasoning.

      Instruction characteristics: these include: * Task diversity: prior work has found that increasing task diversity helps zero-shot ability; it can be obtained by mixing in different tasks. * Instruction complexity: raising the complexity of the instruction set is a widely used strategy for LLMs; the authors likewise use complex multimodal tasks, such as multi-hop reasoning, to improve the instruction-following ability of MLLMs. * Fine-grained spatial perception: for MLLMs, perceiving fine-grained spatial information about specific objects in an image is necessary, so spatial-location annotations can be included in the text instructions.

    1. Salesforce promotes Einstein GPT as the world’s first generative AI tool for CRM. Built on the GPT-3 (Generative Pre-trained Transformer) architecture and integrated in all of Salesforce Clouds as well as Tableau, MuleSoft, and Slack, Einstein GPT is capable of generating natural language responses to customer queries, creating personalized content, and even drafting entire email messages on behalf of sales representatives.

      Curious to see how AI automation solutions might complement the Experience Cloud products.

  5. Oct 2023
  6. Aug 2023
    1. As summarized by ChatGPT:

      This text explores the concept of "Text Fucking," a form of digital text manipulation primarily focused on Apple platforms. The author discusses their interest in accessibility and their personal authority on the subject. They define "Text Fucking" as the manipulation and destruction of digital text, and they emphasize the potential positive outcomes of this practice. The article covers various applications, including text editing apps, automation tools, Siri Shortcuts, and a text formatting app called "Text Case." The author shares their experience with automation, including automating tasks through tools like IFTTT, and they showcase various Siri Shortcuts they've created for text manipulation purposes. The article also highlights the use of Drafts, a versatile app that supports the author's experimentation with Text Fucking.

  7. Jun 2023
    1. We use the same model and architecture as GPT-2

      What do they mean by "model" here? If they have retrained on more data, with a slightly different architecture, then the model weights after training must be different.

    1. Examples include press releases, short reports, and analysis plans — documents that were reported as realistic for the type of writing these professionals engaged in as part of their work.

      Have in mind the genres tested.

      Looking from a perspective of "how might we use such tools in UX" we're better served by looking at documents that UX generates through the lens of identifying parallels to the study's findings for business documents.

      To use AI to generate drafts, we'll want to look at the AI tools being built into the design tools UXers use to create drafts. Those tools exist but are still maturing.

    2. the estimates of how users divided their times between different stages of document generation were based on self-reported numbers

      The numbers for how users divided their time may not be reliable as they're self-reported.

      Still leaves me curious about the accuracy of reported brainstorming time.

    3. the productivity and quality improvements are likely due to a switch in the business professionals’ time allocation: less time spent on cranking out initial draft text and more time spent polishing the final result.

      This points to AI providing the best time savings in draft generation, which fits with the idea of having the AI generate the drafts based on the professional's queries.

      For UX designers, this points to AI in a design tool being most useful when it generates drafts (sketches) that the designer then revises. Where UX deliverables don't compare easily to written deliverables is in the contextual factors that influence a design, like style guides and design systems. Design-tool AI assistants don't yet factor those in, though it seems likely they will once they can be provided style guides and design systems in a format they can read.

      Given a draft of sufficient quality that it doesn't require longer to revise than a draft the designer would create on their own, getting additional time to refine sounds great.

      I'm not sure what to make of the reduced time to brainstorm when using AI. Without additional information, it's hard not to assume that the AI tool may be influencing the direction of brainstorming as professionals think through the queries they'll use to get the AI to generate the most useful draft possible.

  8. Apr 2023
  9. Mar 2023
    1. Still, we can look for telltale signs. Another symptom of memorization is that GPT is highly sensitive to the phrasing of the question. Melanie Mitchell gives an example of an MBA test question where changing some details in a way that wouldn’t fool a person is enough to fool ChatGPT (running GPT-3.5). A more elaborate experiment along these lines would be valuable.

      GPT has memorised MBA test questions: when these are rephrased or certain details are changed, the system fails to answer.

    2. In fact, we can definitively show that it has memorized problems in its training set: when prompted with the title of a Codeforces problem, GPT-4 includes a link to the exact contest where the problem appears (and the round number is almost correct: it is off by one). Note that GPT-4 cannot access the Internet, so memorization is the only explanation.

      GPT-4 knows the link to the coding contests it was evaluated against but doesn't have internet access, so it appears to have memorised these as well.

    3. To benchmark GPT-4’s coding ability, OpenAI evaluated it on problems from Codeforces, a website that hosts coding competitions. Surprisingly, Horace He pointed out that GPT-4 solved 10/10 pre-2021 problems and 0/10 recent problems in the easy category. The training data cutoff for GPT-4 is September 2021. This strongly suggests that the model is able to memorize solutions from its training set — or at least partly memorize them, enough that it can fill in what it can’t recall.

      GPT-4 was only able to pass questions available before September 2021 and failed on newer ones, strongly suggesting that it simply memorised the answers as part of its training.

    1. OpenChatKit provides a powerful, open-source base for creating both specialized and general-purpose chatbots for a wide range of applications. We collaborated with LAION and Ontocord to create the training dataset. Beyond a model release, this is the start of an open-source project: we are releasing tools and processes for ongoing improvement through community contributions.

      Together believes open-source foundation models can be more inclusive, transparent, robust, and capable. We are releasing OpenChatKit 0.15 under the Apache-2.0 license, with full access to the source code, model weights, and training datasets. This is a community-driven project, and we are excited to see how it develops and grows!

      A useful chatbot needs to follow instructions in natural language, maintain context across a dialogue, and moderate its responses. OpenChatKit provides a base bot, and building blocks for deriving purpose-built chatbots from this base.

      The kit has four key components:

      An instruction-tuned large language model, fine-tuned for chat from EleutherAI's GPT-NeoX-20B on over 43 million instructions using 100% carbon-negative compute;

      Customizable recipes to fine-tune the model for accurately performing specific tasks;

      An extensible retrieval system that lets bot responses be augmented at inference time with information from a document repository, API, or other live-updating information source;

      A moderation model, fine-tuned from GPT-JT-6B, designed to filter which questions the bot responds to.

      OpenChatKit also includes tools that let users provide feedback and let community members add new datasets, contributing to a growing open corpus of training data that can improve LLMs over time.

  10. Feb 2023
  11. Jan 2023
    1. Figure 3. The average drop in log probability (perturbation discrepancy) after rephrasing a passage is consistently higher for model-generated passages than for human-written passages. Each plot shows the distribution of the perturbation discrepancy d(x, p_θ, q) for human-written news articles and machine-generated articles of equal word length from models GPT-2 (1.5B), GPT-Neo-2.7B (Black et al., 2021), GPT-J (6B; Wang & Komatsuzaki (2021)) and GPT-NeoX (20B; Black et al. (2022)). Human-written articles are a sample of 500 XSum articles; machine-generated text is generated by prompting each model with the first 30 tokens of each XSum article, sampling from the raw conditional distribution. Discrepancies are estimated with 100 T5-3B samples.

      Quite striking here is the fact that more powerful/larger models are more capable of generating unusual or "human-like" responses, judging by the overlap in log likelihoods.

    2. if we apply small perturbations to a passage x ∼ p_θ, producing x̃, the quantity log p_θ(x) − log p_θ(x̃) should be relatively large on average for machine-generated samples compared to human-written text.

      By applying small changes to a text sample x, we can compare the log probability of x with that of the perturbed version; the delta should be noticeably larger for machine-generated samples.
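
      A toy sketch of that quantity. Both the word-list "scoring model" and the random-word perturbation here are stand-ins (the paper uses a real LM's log-probabilities and T5 mask-filling), but the shape of the computation is the same:

```python
import math
import random

def toy_log_prob(text, model_vocab):
    # Stand-in for log p_theta(x): words the "model" knows are likely,
    # everything else is unlikely.
    return sum(math.log(0.9) if w in model_vocab else math.log(0.01)
               for w in text.split())

def perturb(text, rng, n_swaps=2):
    # Stand-in for T5 mask-filling: overwrite random words with a filler.
    words = text.split()
    for _ in range(n_swaps):
        words[rng.randrange(len(words))] = "filler"
    return " ".join(words)

def perturbation_discrepancy(text, model_vocab, n_perturbations=20, seed=0):
    # d = log p(x) - mean over perturbations of log p(x_tilde)
    rng = random.Random(seed)
    base = toy_log_prob(text, model_vocab)
    perturbed = [toy_log_prob(perturb(text, rng), model_vocab)
                 for _ in range(n_perturbations)]
    return base - sum(perturbed) / len(perturbed)

vocab = {"the", "cat", "sat", "on", "mat"}
machine_text = "the cat sat on the mat"       # high probability under the model
human_text = "the cat sat on quirky velvet"   # words the model finds unlikely
d_machine = perturbation_discrepancy(machine_text, vocab)
d_human = perturbation_discrepancy(human_text, vocab)
# DetectGPT's hypothesis: d is larger for machine-generated text.
```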

    3. As in prior work, we study a 'white box' setting (Gehrmann et al., 2019) in which the detector may evaluate the log probability of a sample log p_θ(x). The white box setting does not assume access to the model architecture or parameters. While most public APIs for LLMs (such as GPT-3) enable scoring text, some exceptions exist

      The authors assume white-box access to the log probability of a sample \(log p_{\Theta}(x)\) but do not require access to the model's actual architecture or weights.

    4. Empirically, we find predictive entropy to be positively correlated with passage fake-ness more often than not; therefore, this baseline uses high average entropy in the model's predictive distribution as a signal that a passage is machine-generated.

      This makes sense and aligns with GLTR: humans add more entropy to sentences by making unusual vocabulary choices that a model would not.
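
      The baseline is easy to state in code. A minimal sketch, with hand-made toy distributions standing in for the model's per-token predictive distributions (the threshold value and the distributions are invented for illustration):

```python
import math

def entropy(dist):
    # Shannon entropy (nats) of one predictive distribution.
    return -sum(p * math.log(p) for p in dist if p > 0)

def avg_predictive_entropy(dists):
    # One predictive distribution per token position, averaged over the passage.
    return sum(entropy(d) for d in dists) / len(dists)

def flag_machine_generated(dists, threshold=0.8):
    # Per the quoted passage: high average entropy -> machine-generated.
    return avg_predictive_entropy(dists) > threshold

peaked = [[0.98, 0.01, 0.01], [0.9, 0.05, 0.05]]  # confident predictions
flat = [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]]      # uncertain predictions
```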

    5. We find that supervised detectors can provide similar detection performance to DetectGPT on in-distribution data like English news, but perform significantly worse than zero-shot methods in the case of English scientific writing and fail altogether for German writing.

      Supervised detection methods fail on out-of-domain examples, whereas DetectGPT seems robust to changes in domain.

    6. extending DetectGPT to use ensembles of models for scoring, rather than a single model, may improve detection in the black box setting

      DetectGPT could be extended to use ensembles of models, allowing it to work in black-box settings where the log probs are unknown.

    7. While in this work, we use off-the-shelf mask-filling models such as T5 and mT5 (for non-English languages), some domains may see reduced performance if existing mask-filling models do not well represent the space of meaningful rephrases, reducing the quality of the curvature estimate.

      The approach requires access to language models that can meaningfully and accurately rephrase (perturb) the outputs of the model under evaluation. If the two models do not align, it may not work well.

    8. For models behind APIs that do provide probabilities (such as GPT-3), evaluating probabilities nonetheless costs money.

      This does cost money to do for paid APIs and requires that log probs are made available.

    9. We simulate human revision by replacing 5 word spans of the text with samples from T5-3B until r% of the text has been replaced, and report performance as r varies.

      I question the trustworthiness of this simulation - human edits are probably going to be more sporadic and random.
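
      For intuition, here is a sketch of that simulation, with a dummy `<edit>` token standing in for the T5-3B samples (the span size follows the quoted description; the replacement token and function name are assumptions):

```python
import random

def simulate_revision(words, r, span=5, seed=0):
    # Replace random `span`-word windows until roughly r% of words are swapped,
    # mimicking the paper's setup; T5 would generate real text where we
    # insert the dummy "<edit>" token.
    words = list(words)
    rng = random.Random(seed)
    target = int(len(words) * r / 100)
    replaced = set()
    while len(replaced) < target:
        start = rng.randrange(max(1, len(words) - span + 1))
        for i in range(start, min(start + span, len(words))):
            words[i] = "<edit>"
            replaced.add(i)
    return words

original = ["w%d" % i for i in range(100)]
revised = simulate_revision(original, r=20)
frac = sum(w == "<edit>" for w in revised) / len(revised)
```

      Because whole spans are replaced, the final fraction can slightly overshoot r%, which matches the "until r% has been replaced" framing.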

    10. Figure 5. We simulate human edits to machine-generated text by replacing varying fractions of model samples with T5-3B generated text (masking out random five word spans until r% of text is masked to simulate human edits to machine-generated text). The four top-performing methods all generally degrade in performance with heavier revision, but DetectGPT is consistently most accurate. Experiment is conducted on the XSum dataset.

      DetectGPT shows 95% AUROC for texts that have been modified by about 10% and this drops off to about 85% when text is changed up to 24%.

    11. DetectGPT’s performance in particular is mostly unaffected by the change in language from English to German

      Performance of this method is robust against changes between languages (e.g. English to German)

    12. Because the GPT-3 API does not provide access to the complete conditional distribution for each token, we cannot compare to the rank, log rank, and entropy-based prior methods

      The GPT-3 API does not expose the conditional probabilities for each token, so we can't compare to some of the prior methods. That suggests this method can still be used with limited knowledge of the probabilities.

    13. improving detection offake news articles generated by 20B parameterGPT-NeoX

      The authors test their approach on GPT-NeoX. The question would be whether we can get hold of the log probs from ChatGPT to do the same

    14. This approach, which we call DetectGPT, does not require training a separate classifier, collecting a dataset of real or generated passages, or explicitly watermarking generated text. It uses only log probabilities computed by the model of interest and random perturbations of the passage from another generic pre-trained language model (e.g., T5)

      The novelty of this approach is that it is cheap to set up as long as you have the log probabilities generated by the model of interest.

    15. See ericmitchell.ai/detectgpt for code, data, and other project information.

      Code and data available at https://ericmitchell.ai/detectgpt

    1. Educators are now administering the Turing test in reverse: What are questions that only humans can answer well? What kinds of thinking does writing make possible for us? 
    2. GPT-3 threatens to “[undermine] the kind of writing intensive course that had served as the backbone of [his] teaching for two decades.” “I was less worried about whether GPT-3 is genuinely intelligent,” Symons writes, “and more worried about whether the development of these tools would make us less intelligent.” 
  12. Dec 2022
    1. Our method is based on the hypothesis that the weights of a generator act as Optimal Linear Associative Memory (OLAM). OLAM is a classic single-layer neural data structure for memorizing associations that was described by Teuvo Kohonen and James A. Anderson (independently) in the 1970s. In our case, we hypothesize that within a large modern multilayer convolutional network, each individual layer plays the role of an OLAM that stores a set of rules associating keys, which denote meaningful context, with values, which determine output.
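
      A toy numeric sketch of that data structure, assuming orthonormal keys (for which the simple outer-product sum coincides with the optimal linear map; OLAM proper uses pseudoinverse-based weights to handle general keys):

```python
# Single-layer linear associative memory: store key -> value pairs in a
# weight matrix W = sum(v k^T), then retrieve a value as W @ k.
def outer(v, k):
    return [[vi * kj for kj in k] for vi in v]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_vec(W, k):
    return [sum(w * x for w, x in zip(row, k)) for row in W]

keys = [[1.0, 0.0], [0.0, 1.0]]      # orthonormal keys (contexts)
values = [[2.0, 3.0], [-1.0, 4.0]]   # associated values (outputs)

W = [[0.0, 0.0], [0.0, 0.0]]
for k, v in zip(keys, values):
    W = mat_add(W, outer(v, k))      # memorize each association

recalled = mat_vec(W, keys[0])       # querying with a key recalls its value
```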
    1. natural-language processing is going to force engineers and humanists together. They are going to need each other despite everything. Computer scientists will require basic, systematic education in general humanism: The philosophy of language, sociology, history, and ethics are not amusing questions of theoretical speculation anymore. They will be essential in determining the ethical and creative use of chatbots, to take only an obvious example.
    2. The extraordinary ignorance on questions of society and history displayed by the men and women reshaping society and history has been the defining feature of the social-media era.
    1. Emergent abilities are not present in small models but can be observed in large models.

      Here’s a lovely blog by Jason Wei that pulls together 137 examples of ‘emergent abilities of large language models’. Emergence is a phenomenon seen in contemporary AI research, where a model will be really bad at a task at smaller scales, then go through some discontinuous change which leads to significantly improved performance.

    1. Houston, we have a Capability Overhang problem: Because language models have a large capability surface, these cases of emergent capabilities are an indicator that we have a ‘capabilities overhang’ – today’s models are far more capable than we think, and our techniques available for exploring the models are very juvenile. We only know about these cases of emergence because people built benchmark datasets and tested models on them. What about all the capabilities we don’t know about because we haven’t thought to test for them? There are rich questions here about the science of evaluating the capabilities (and safety issues) of contemporary models. 
    1. As the metaphor suggests, though, the prospect of a capability overhang isn’t necessarily good news. As well as hidden and emerging capabilities, there are hidden and emerging threats. And these dangers, like our new skills, are almost too numerous to name.
    2. There’s a concept in AI that I’m particularly fond of that I think helps explain what’s happening. It’s called “capability overhang” and refers to the hidden capacities of AI: skills and aptitudes latent within systems that researchers haven’t even begun to investigate yet. You might have heard before that AI models are “black boxes” — that they’re so huge and complex that we don’t fully understand how they operate or come to specific conclusions. This is broadly true and is what creates this overhang.
    1. Which is why I wonder if this may be the end of using writing as a benchmark for aptitude and intelligence.
    2. Perhaps there are reasons for optimism, if you push all this aside. Maybe every student is now immediately launched into that third category: The rudiments of writing will be considered a given, and every student will have direct access to the finer aspects of the enterprise. Whatever is inimitable within them can be made conspicuous, freed from the troublesome mechanics of comma splices, subject-verb disagreement, and dangling modifiers.
    3. I’ve also long held, for those who are interested in writing, that you need to learn the basic rules of good writing before you can start breaking them—that, like Picasso, you have to learn how to reliably fulfill an audience’s expectations before you get to start putting eyeballs in people’s ears and things.
  13. Nov 2022
    1. “In literacy education, particularly for developing writers, instructors are looking for the level of desirable difficulty, or the point at which you are working yourself just as hard so that you don’t break but you also improve,” Laffin told Motherboard. “Finding the right, appropriate level of desirable difficulty level of instruction makes their capacity to write grow. So if you are doing compensation techniques that go beyond finding that level of desirable difficulty and instructing at that place, then you’re not helping them grow as a writer.”
  14. Aug 2022
  15. Jun 2022
    1. The dominant idea is one of attention, by which a representation at a position is computed as a weighted combination of representations from other positions. A common self-supervision objective in a transformer model is to mask out occasional words in a text. The model works out what word used to be there. It does this by calculating from each word position (including mask positions) vectors that represent a query, key, and value at that position. The query at a position is compared with the key at every position to calculate how much attention to pay to each position; based on this, a weighted average of the values at all positions is calculated. This operation is repeated many times at each level of the transformer neural net, and the resulting value is further manipulated through a fully connected neural net layer and through use of normalization layers and residual connections to produce a new vector for each word. This whole process is repeated many times, giving extra layers of depth to the transformer neural net. At the end, the representation above a mask position should capture the word that was there in the original text: for instance, committee as illustrated in Figure 1.
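
      The core attention step described above can be sketched in a few lines. Plain Python lists stand in for tensors here; real transformers add learned query/key/value projections, multiple heads, and the feed-forward, normalization, and residual layers the passage mentions:

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    # Scaled dot-product attention: each query is scored against every key;
    # the scores weight an average of the values.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        out.append([sum(w * v[j] for w, v in zip(scores, values))
                    for j in range(len(values[0]))])
    return out
```

      With a query that strongly matches the first key, the output is close to the first value, which is the "weighted average" behaviour the passage describes.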
  16. Apr 2022
    1. # Input
       Input: 123, Output:
       Input: 121, Output:
       Input: 111, Output:
       Input: 123454321, Output:
       Input: 123123, Output:
       # Instruction
       Output true if input is a palindrome
       # Output
       Input: 123, Output: false
       Input: 121, Output: true
       Input: 111, Output: true
       Input: 123454321, Output: true
       Input: 123123, Output: false

      Example of using GPT-3 for programming
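
      For comparison, the task GPT-3 is being prompted to infer is a one-liner in Python:

```python
# Direct implementation of the palindrome check GPT-3 is asked to learn
# from the few-shot prompt above.
def is_palindrome(n):
    s = str(n)
    return s == s[::-1]
```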

  17. Nov 2021
    1. Other work on interpreting transformer internals has focused mostly on what the attention is looking at. The logit lens focuses on what GPT "believes" after each step of processing, rather than how it updates that belief inside the step.
    1. These findings provide strong evidence for a classic hypothesis about the computations underlying human language understanding, that the brain’s language system is optimized for predictive processing in the service of meaning extraction
  18. Jun 2021
    1. When creating a BIOS Boot Partition on a GPT system, you should make sure that it is at least 31 KiB in size.

      This is important: if it is not set, the OS won't be detected when GRUB is used on a GPT system.

  19. Apr 2021
  20. Feb 2021
  21. Jul 2020