  1. Last 7 days
    1. We also ran evaluations of model latency and classification performance under varying false positive rates for the following LLMs by OpenAI: GPT-4o, GPT-4o-mini, and o3-mini.

      sentences describing methods the authors used; one sentence at a time

    2. We ensured each list was 30 items long as our pilot studies suggested this was long enough that manual detection starts to become unwieldy (users need to scroll up and down the document), but short enough that participants could become familiar in a short period.

      sentences describing methods the authors used; one sentence at a time

    3. We adapted two intent specifications from our evals: Mars Game Design Document and Financial Advice AI Agent Memory, as these tasks mapped to the two paradigmatic types covered in Sections 2 and 2.1 (design documents, and AI memory of the user).

      sentences describing methods the authors used; one sentence at a time

    4. We chose OpenAI's ChatGPT Canvas as a baseline for five reasons: (i) it is a popular, commercially available tool, hence it is likely familiar to users; (ii) it provides a document editing view, where users can select text and ask GPT to rewrite it, or chat with an AI to make global edits; (iii) it employs a similar class of model (GPT-4o); (iv) it supports similar editing features as SemanticCommit like inline text selection, conflict highlighting, and a diff view, while adding free-form editing; and (v) similar interfaces like Anthropic Artifacts tended to rewrite the specification entirely, and did not offer Canvas's "diff" view to allow for a fair comparison.

      sentences describing methods the authors used; one sentence at a time

    5. Our explorations went through substantial iterations and prompt prototyping over a period of eight months, evolving in response to two pilot studies and progressing from a card-based interface to a list of texts.

      sentences describing methods the authors used; one sentence at a time

    6. We iterated on prompts using ChainForge [5] by setting up an evaluation pipeline against our datasets, which allowed us to observe the effects of prompt changes and model choices.

      sentences describing methods the authors used; one sentence at a time

    7. For qualitative analysis, the first author performed open coding on participant responses and audio transcripts to identify themes, which were used to interpret the qualitative results.

      sentences describing methods the authors used; one sentence at a time

    8. In the post-task surveys, we collected self-reported NASA Task Load Index (TLX) scores, Likert-scale ratings for ease of use, and responses on how well the AI helped participants identify, understand, and resolve semantic conflicts.

      sentences describing methods the authors used; one sentence at a time

    9. We run end-to-end evaluations on our four eval datasets using GPT-4o and GPT-4o-mini and report the mean ± stddev of accuracy, precision, recall, and F1 scores for the three approaches in Figure 5.

      sentences describing methods the authors used; one sentence at a time
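The metrics this annotation names can be reproduced with a few lines of Python; a minimal sketch assuming binary conflict labels (the paper's own evaluation code is not shown here, and `binary_metrics`/`mean_pm_std` are illustrative names):

```python
from statistics import mean, stdev

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary (0/1) labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

def mean_pm_std(per_dataset_scores):
    """Aggregate one metric across datasets as (mean, stddev)."""
    return mean(per_dataset_scores), stdev(per_dataset_scores)
```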

    10. We compare our end-to-end system against two simpler methods: (i) DropAllDocs, which adds all documents to the context for conflict classification; and (ii) InkSync [56] which generates a JSON list of string-replace operations.

      sentences describing methods the authors used; one sentence at a time
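The InkSync-style baseline above emits a JSON list of string-replace operations; applying such a list might look like the following sketch. The `{"find", "replace"}` schema is an assumption for illustration, not InkSync's actual format:

```python
import json

def apply_string_replacements(document: str, ops_json: str) -> str:
    """Apply a JSON list of string-replace operations to a document.
    Each op is assumed (hypothetically) to be {"find": ..., "replace": ...};
    ops whose "find" text is absent are skipped rather than raising."""
    for op in json.loads(ops_json):
        if op["find"] in document:
            document = document.replace(op["find"], op["replace"], 1)
    return document
```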

    11. Through a within-subjects study with 12 participants comparing SemanticCommit to a chat-with-document baseline (OpenAI Canvas), we find differences in workflow: half of our participants adopted a workflow of impact analysis when using SemanticCommit, where they would first flag conflicts without AI revisions then resolve conflicts locally, despite having access to a global revision feature.

      sentences describing methods the authors used; one sentence at a time

    12. These semantic conflicts require dedicated support to detect, visualize, and resolve. Semantic conflict resolution interfaces must go beyond visualizing what changes were made, to what changes could be made, where they should be made, and what the effects might be. This resembles feedforward: affordances that help the user foresee the impact of an action [67, 93].

      sentences describing connections to theory; one sentence at a time

    13. This reflects the principle of feedforward [67, 93] in communication theory—"a needed prescription or plan for a feedback, to which the actual feedback may or may not confirm" [79]—where a communicator provides "the context of what one was planning to talk about" [64, p. 179-80] in order to "pre-test the impact of [its output]" on the listener [34, p. 65].

      sentences describing connections to theory; one sentence at a time

    1. We consider common sequences of chunk roles to be alignable structures that could be used to support users in identifying structural similarities and differences across sentences in different abstracts, in line with Structure-Mapping Theory [17].

      sentences that mention theory, explicitly or implicitly; one sentence at a time

    2. Like prior Structural Mapping Theory (SMT)-informed work in text corpora representation, AbstractExplorer's features have enabled some users to see more of both the overview and the details at the same time, facilitating abstraction without losing context.

      sentences that mention theory, explicitly or implicitly; one sentence at a time

    3. This ordering prioritizes dominant structural patterns (largest groups first) while exposing fine-grained variations (via length-sorted triplets), mirroring how humans compare sentences, if SMT is an accurate description in this domain of comparative close reading.

      sentences that mention theory, explicitly or implicitly; one sentence at a time

    4. Structural mappings between objects are part of the cognitive process of comparison according to the Structure-Mapping Theory [17], and juxtaposition can facilitate humans in recognizing particular possible structural mappings between objects [75].

      sentences that mention theory, explicitly or implicitly; one sentence at a time

    5. In SMT terminology, rendering and arranging according to corresponding chunks reify "commonalities in structure," while variations within corresponding chunks are "alignable differences" that users are predicted to notice.

      sentences that mention theory, explicitly or implicitly; one sentence at a time

    6. The prior SMT-informed tools in Section 2.3 for both code and natural language corpora suggest that the cognitive process of comparing texts may be no exception to the cognitive processes SMT predicts.

      sentences that mention theory, explicitly or implicitly; one sentence at a time

    7. SMT posits that visual alignment helps people perceive relational similarities and differences more clearly, thereby improving their ability to make meaningful comparisons and understand underlying patterns [28, 38, 47].

      sentences that mention theory, explicitly or implicitly; one sentence at a time

    8. Structural Mapping Theory (SMT) is a long-standing, well-vetted theory from Cognitive Science that describes how humans attend to and try to compare objects by finding mental representations of them that can be structurally mapped to each other (analogies).

      sentences that mention theory, explicitly or implicitly; one sentence at a time

    9. This SMT-informed approach, which AbstractExplorer shares, tries to give this mental machinery "a leg up," letting users perhaps skip some steps by accepting reified cross-document relationships identified by the computer.

      sentences that mention theory, explicitly or implicitly; one sentence at a time

    10. The human perceptual, comparative mental machinery that SMT describes is part of what enables humans to form more abstract structured mental models from concrete examples, among other critical knowledge tasks.

      sentences that mention theory, explicitly or implicitly; one sentence at a time

    11. These examples of text-centric lossless techniques do not abstract away or summarize; they strategically re-organize and re-render the existing text to help enhance readers' own perceptual cognition, informed by Structural Mapping Theory (SMT) [17].

      sentences that mention theory, explicitly or implicitly; one sentence at a time

    12. We process this data in a three-stage pipeline (Figure 6). In the first stage, Sentence Segmentation and Categorization, abstracts are split into individual sentences using the NLTK package, and each sentence is classified into one of the five pre-defined aspects as listed in Section 4.1.1. Classification is performed by prompting an LLM (see prompt used in Appendix D.1) with the sentence and its full abstract.

      sentence relating to methodology
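The first pipeline stage splits abstracts into sentences with NLTK before LLM classification. As a self-contained stand-in for NLTK's `sent_tokenize` (which requires downloading the Punkt model), a naive stdlib splitter illustrates the step; the paper's pipeline uses NLTK itself:

```python
import re

def split_sentences(abstract: str) -> list[str]:
    """Naive stand-in for NLTK's sent_tokenize: split on
    sentence-ending punctuation followed by whitespace and an
    uppercase letter. Real pipelines should prefer NLTK/Punkt,
    which handles abbreviations and edge cases this regex misses."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", abstract.strip())
    return [p for p in parts if p]
```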

    13. Then, we segment sentences within each aspect into grammar-preserving chunks (see prompt used in Appendix D.2). This results in grammatically coherent chunks that are the basis of structure patterns. After identifying chunk boundaries, we again prompt an LLM to generate labels for chunks in a human-in-the-loop approach: starting from an initial set of labels for chunk roles, when a new label is generated, a researcher from the research team examines the new label and merges it with existing labels if appropriate, controlling for the total number of labels.

      sentence relating to methodology

    14. In this study, we allowed participants to experience views of same-aspect sentences (Section 4.1.1) with different combinations of highlighting, ordering, and alignment (as described in Section 4.1.2 and Section 4.1.4) enabled or not, in order to understand which and/or what combinations most effectively supported users' ability to skim and read laterally across documents.

      sentence relating to methodology

    15. Inspired by GP-TSM [24], AbstractExplorer first segments sentences into grammar-preserving chunks—segments that respect grammatical boundaries, i.e., an LLM judges that the sentence can be truncated at that chunk boundary without breaking the grammatical integrity of the preceding text. Each chunk is then classified by an LLM as having one of nine pre-defined roles, each of which has its own assigned color.

      sentence relating to methodology

    16. We conducted a qualitative analysis of user study transcripts and survey responses using a Grounded Theory approach [8]. First, the lead researcher collected a list of participants' behaviors, approaches, reflections on their experience, and feedback about the interface. The researcher then systematically coded this data, revisiting the data multiple times and refining the codes to ensure consistency and coherence. Through this process, high-level themes were identified and organized using affinity diagramming. Once the thematic structure was finalized, the researcher gathered supporting evidence for each theme and synthesized the findings, which were reviewed by the research team to ensure agreement on the results.

      sentence describing how analysis was performed on data collected by the authors of this paper

    17. Interviews were video and audio recorded. We transcribed the audio using OpenAI's Whisper automatic speech recognition system and anonymized the transcript before analysis. We analyzed the interview data using thematic analysis [1]. First, two members of the research team independently coded four (25% of collected data) randomly chosen participant data to generate low-level codes. The inter-coder reliability between the coders was 0.88 using Krippendorff's alpha [37]. The two coders then met together to cross-check, resolve coding conflicts, and consolidate the codes into a codebook across two sessions. Using the codebook, the two coders analyzed six randomly selected participant data each. The research team then met, discussed the analysis outcomes, and finalized themes over three sessions.

      sentence describing how analysis was performed on data collected by the authors of this paper
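The inter-coder reliability of 0.88 reported above is Krippendorff's alpha. For the simple case that applies here (two coders, nominal codes, no missing values), the standard coincidence-matrix formula can be sketched in stdlib Python; the function name is illustrative and the paper presumably used an existing implementation:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha for nominal data, two coders, no missing
    values: alpha = 1 - D_observed / D_expected over the coincidence
    matrix of paired codes."""
    assert len(coder_a) == len(coder_b)
    coincidences = Counter()
    for a, b in zip(coder_a, coder_b):   # each unit contributes both orderings
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1
    n = 2 * len(coder_a)                 # total pairable values
    marginals = Counter()
    for (c, _), count in coincidences.items():
        marginals[c] += count
    d_observed = sum(cnt for (c, k), cnt in coincidences.items() if c != k) / n
    d_expected = sum(marginals[c] * marginals[k]
                     for c, k in permutations(marginals, 2)) / (n * (n - 1))
    return 1.0 - d_observed / d_expected
```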

    18. Activity log data, which revealed how participants actually used the interface, echoed the above findings. According to the log data, participants spent most of their reading time (66.31%) with vertical alignment on the second element in structure pairs, followed by alignment on the first element (29.19%), and left-justified alignment (5.13%). Highlighting usage showed a similar preference: 91.13% of time with all chunks highlighted, 8.25% with partial highlighting, and minimal time (0.63%) without highlights.

      sentence describing how analysis was performed on data collected by the authors of this paper

    19. In this section, we present findings on how AbstractExplorer supports comparative close reading at scale by integrating quantitative survey responses and log data with qualitative analysis of transcripts and open-ended responses. The qualitative analysis process is described in detail in Appendix H.

      sentence describing how analysis was performed on data collected by the authors of this paper

    20. Throughout the two tasks, we also collected detailed interaction logs including counts of user-defined aspects created, duration of highlighting usage, and time allocation across the three possible alignment options.

      sentence describing how analysis was performed on data collected by the authors of this paper

    21. Both gaze data and the semi-structured interviews revealed that lower NFC participants were more willing to be guided by the three features and took advantage of them consciously.

      sentence describing how analysis was performed on data collected by the authors of this paper

    22. Using a two-tailed Mann-Whitney U Test, we found that participants who reported their lowest perceived cognitive load when all three features were enabled had significantly lower NFC than participants who reported their lowest cognitive load level when skimming with no features enabled—in the baseline interface (p=0.03).

      sentence describing how analysis was performed on data collected by the authors of this paper
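The two-tailed Mann-Whitney U test used above is typically run via `scipy.stats.mannwhitneyu`; a stdlib sketch of the normal-approximation version (average ranks for ties, no tie or continuity correction, so small-sample p-values are approximate) shows what is being computed:

```python
import math

def mann_whitney_u_two_tailed(xs, ys):
    """Mann-Whitney U with a two-tailed p-value from the normal
    approximation. Simplified sketch: no tie variance correction,
    no continuity correction, unlike production implementations."""
    n1, n2 = len(xs), len(ys)
    pooled = sorted([(v, "x") for v in xs] + [(v, "y") for v in ys])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):               # assign average ranks to tied runs
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2       # mean of 1-based ranks i+1 .. j
        rank_sum_x += sum(avg_rank for k in range(i, j) if pooled[k][1] == "x")
        i = j
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2)) # two-tailed tail probability
    return u, p
```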

    23. For simplicity of analysis, we denote participants with NFC scores above the overall participants' median NFC of 5.42 (IQR = 0.583) as higher NFC, and lower NFC otherwise.

      sentence describing how analysis was performed on data collected by the authors of this paper

    24. To contrast participants' gaze patterns in each condition, we used a Tobii Pro Spark eye-tracker placed below the desktop monitor used by all subjects; Tobii Pro Lab software recorded each participant's gaze over time in each condition.

      sentence describing how analysis was performed on data collected by the authors of this paper

    25. We collected 80 sentences from our abstracts dataset labeled by our system as "Methodology/Contribution." Participants viewed the same 80 sentences in each condition—often with a different subset of sentences initially visible due to ordering changes—but only had two minutes to look at them in each condition.

      sentence describing how analysis was performed on data collected by the authors of this paper

    26. After obtaining an expanded set of high-level chunk labels, we assign them to each of the sentence chunks by using LLMs in a multiclass classification few-shot learning task, with the initial labels and assignment as examples (see prompt used in Appendix D.3).

      sentence describing how analysis was performed on data collected by the authors of this paper
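The few-shot multiclass labeling step above amounts to assembling a prompt from the allowed labels plus labeled examples, then asking an LLM for one label. A hypothetical prompt builder sketches the shape; the label names, wording, and layout are illustrative, not the paper's actual prompt (which is in its Appendix D.3):

```python
def build_fewshot_prompt(labels, examples, chunk):
    """Assemble a hypothetical few-shot multiclass classification
    prompt for chunk-role labeling. `examples` is a list of
    (chunk_text, label) pairs used as in-context demonstrations."""
    lines = ["Assign exactly one role label to the text chunk.",
             "Allowed labels: " + ", ".join(labels), ""]
    for text, label in examples:
        lines.append(f"Chunk: {text}\nLabel: {label}")
    lines.append(f"Chunk: {chunk}\nLabel:")
    return "\n".join(lines)
```

The trailing `Label:` leaves the completion slot for the model; constraining its output to the listed label set would be handled by the caller.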

    27. Then, we segment sentences within each aspect into grammar-preserving chunks (see prompt used in Appendix D.2). This results in grammatically coherent chunks that are the basis of structure patterns. After identifying chunk boundaries, we again prompt an LLM to generate labels for chunks in a human-in-the-loop approach: starting from an initial set of labels for chunk roles, when a new label is generated, a researcher from the research team examines the new label and merges it with existing labels if appropriate, controlling for the total number of labels.

      sentence describing how analysis was performed on data collected by the authors of this paper

    28. We process this data in a three-stage pipeline (Figure 6). In the first stage, Sentence Segmentation and Categorization, abstracts are split into individual sentences using the NLTK package, and each sentence is classified into one of the five pre-defined aspects as listed in Section 4.1.1. Classification is performed by prompting an LLM (see prompt used in Appendix D.1) with the sentence and its full abstract.

      sentence describing how analysis was performed on data collected by the authors of this paper

  2. ontheline.trincoll.edu