3,441 Matching Annotations
  1. Jul 2019
    1. Every time your child opens the email, that person knows generally where they are (or specifically, if they have other info to triangulate against).
    1. In contrast to such pseudonymous social networking, Facebook is notable for its longstanding emphasis on real identities and social connections.

      Lack of anonymity also increases Facebook's ability to properly link shadow profiles purchased from other data brokers.

    1. our sum of squares is 41.1879

      Just considering the Y, and not the X. Calculating the residuals from the average/mean Y.
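
The note above describes a total sum of squares: each Y value is compared with the mean of Y, and X plays no part. A minimal sketch of that calculation follows. The sample values are hypothetical and are not the data behind the 41.1879 figure, and the function name `total_sum_of_squares` is an assumption made for illustration.

```python
from statistics import mean


def total_sum_of_squares(y):
    """Sum of squared residuals of each y value from the mean of y (X is ignored)."""
    y_bar = mean(y)
    return sum((value - y_bar) ** 2 for value in y)


# Hypothetical sample values, chosen only to make the arithmetic easy to follow.
y = [1.0, 2.0, 4.0, 5.0]
sst = total_sum_of_squares(y)  # mean is 3.0; squared residuals are 4, 1, 1, 4
```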

    1. in clustering analyses, standardization may be especially crucial in order to compare similarities between features based on certain distance measures. Another prominent example is the Principal Component Analysis, where we usually prefer standardization over Min-Max scaling, since we are interested in the components that maximize the variance

      Use standardization, not min-max scaling, for clustering and PCA.

    2. As a rule of thumb I’d say: When in doubt, just standardize the data, it shouldn’t hurt.
    1. driven by data—where schools use data to identify a problem, select a strategy to address the problem, set a target for improvement, and iterate to make the approach more effective and improve student achievement.

      Gates data model.

    1. many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
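
The point about one feature dominating can be seen with a plain Euclidean distance, the kind of measure used in clustering. The sketch below is only an illustration under stated assumptions: the two features (income and age) and their values are hypothetical, and the helper names `standardize`, `euclidean`, and `share_of_second_feature` are invented for this example rather than taken from any library.

```python
import math
from statistics import mean, pstdev


def standardize(column):
    """Rescale a column to mean 0 and (population) standard deviation 1."""
    mu = mean(column)
    sigma = pstdev(column)
    return [(value - mu) / sigma for value in column]


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def share_of_second_feature(a, b):
    """Fraction of the squared distance contributed by the second feature."""
    return (a[1] - b[1]) ** 2 / euclidean(a, b) ** 2


# Hypothetical data: income in dollars (large variance), age in years (small variance).
income = [30_000.0, 32_000.0, 90_000.0]
age = [25.0, 60.0, 26.0]

raw = list(zip(income, age))
scaled = list(zip(standardize(income), standardize(age)))

# The first two people differ a lot in age but little in income.
raw_share = share_of_second_feature(raw[0], raw[1])           # age barely registers
scaled_share = share_of_second_feature(scaled[0], scaled[1])  # age now drives the distance
```

On the raw values the 35-year age gap is swamped by a 2,000-dollar income gap; after standardization both features are on a comparable scale, so the age gap accounts for most of the distance.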
  2. Jun 2019
  3. varsellcm.r-forge.r-project.org
    1. missing values are managed, without any pre-processing, by the model used to cluster with the assumption that values are missing completely at random.

      VarSelLCM package

    1. Success in a data science project comes not from access to any one exotic tool, but from having quantifiable goals, good methodology, cross-discipline interactions, and a repeatable workflow.

    1. Academics are also at fault here: a recent analysis of 29 million papers in over 15,000 peer-reviewed titles published around the time of the Zika and Ebola epidemics found that less than 1% explored the gendered impact of the outbreaks

      How do we prevent this pattern here at Georgia Tech? There is a very obvious gender gap, especially in STEM, where flawed data are collected in medicine and engineering. What are some small steps we can take to encourage pursuing data for people of different backgrounds? Education always comes first, starting with classes like this one that inform people about the role gender plays. Perhaps then we can create projects exploring the issue in data related to each person's major.

    1. However, this doesn’t mean that Min-Max scaling is not useful at all! A popular application is image processing, where pixel intensities have to be normalized to fit within a certain range (i.e., 0 to 255 for the RGB color range). Also, typical neural network algorithms require data that is on a 0-1 scale.

      Use min-max scaling for image processing & neural networks.

    2. The result of standardization (or Z-score normalization) is that the features will be rescaled so that they’ll have the properties of a standard normal distribution with $\mu = 0$ and $\sigma = 1$, where $\mu$ is the mean (average) and $\sigma$ is the standard deviation from the mean
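
The two rescalings discussed in these highlights can be written side by side. The formulas below are the standard textbook definitions rather than a quotation from the annotated page; the symbols $x_{\min}$ and $x_{\max}$ (the smallest and largest observed value of the feature) are notation introduced here.

```latex
% Standardization (Z-score normalization): result has mean 0 and standard deviation 1
z = \frac{x - \mu}{\sigma}

% Min-Max scaling: result lies in the interval [0, 1]
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
```

Min-Max scaling fixes the range of the result, whereas standardization fixes its mean and spread, which is why the latter suits variance-based methods such as PCA.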
  4. May 2019
    1. Not all movies have to be documentaries, and not all visualization has to be traditional charts and graphs

      This is an interesting fact. Usually when I think of visualization and data, I go to the classic default charts and data. I'll have to keep this in mind.

    2. The base of the graphic is simply a line chart. However, design elements help tell the story better. Labeling and pointers provide context and help you see why the data is interesting, and line width and color direct your eyes to what’s important

      I really like this because I don't see it often and it actually does draw my eye to the data and capture my interest.

    1. Virtually all BPMs have utilities for creating simple, data-gathering forms. And in many types of workflows, these simple forms may be adequate. However, in any workflow that includes complex document assembly (such as loan origination workflows), BPM forms are not likely to get the job done. Automating the assembly of complex documents requires ultra-sophisticated data-gathering forms, which can only be designed and created after the documents themselves have been automated. Put another way, you won't know which questions need to be asked to generate the document(s) until you've merged variables and business logic into the documents themselves. The variables you merge into the document serve as question fields in the data gathering forms. And here's the key point - since you have to use the document assembly platform to create interviews that are sophisticated enough to gather data for your complex documents, you might as well use the document assembly platform to generate all data-gathering forms in all of your workflows.
    1. The pace of community network design and installation activities in rural districts (veredas) of the municipality of Fusagasugá is boosted by the internal research calls of the Universidad de Cundinamarca, which throughout the lifetime of Red FusaLibre have been a financial muscle that allows them to accelerate the proc

      An interesting link between community and university. In our case, we have not achieved a permanent link, and although some money from university research calls and international calls paid for part of the Data Weeks, together with a smaller contribution from some attendees, in general it has been a project financed with our own resources and family loans.

    1. Developing economies’ copper demand has steadily grown over the last decades, fueling economic and social improvement. By 2011, China already represented 40% of the demand.

      Why does China need so much?

    2. Codelco is a state-owned Chilean mining company and the world’s largest copper producer. Based on their annual report and USGS statistics, they produced ~10% of the world’s copper in 2015 and own 8% of global reserves. They are also a large producer of greenhouse gas emissions. Last year, Codelco produced 3,2 t CO2e/millions tmf from both indirect and direct effects, and in 2011 it consumed 12% of the total national electricity supply.

      Goddamn, they should start recycling

    1. Methodology: The classic OSINT methodology you will find everywhere is straightforward: (1) Define requirements: what are you looking for? (2) Retrieve data. (3) Analyze the information gathered. (4) Pivoting & reporting: either define new requirements by pivoting on data just gathered, or end the investigation and write the report.

      Etienne's blog! Amazing resource for OSINT; particularly focused on technical attacks.

  5. Apr 2019
    1. Powered by Data wrote 4 of the resources on this page. "Measuring Outcomes" is about admin data. "Understanding the Philanthropic Landscape" is about open data - sp. open grants data. "Effective Giving" is an intro. And "Emerging Data Practices" is a tech backgrounder from June 2015.

    1. Instead of encouraging more “data-sharing”, the focus should be the cultivation of “data infrastructure”,¹⁴ maintained for the public good by institutions with clear responsibilities and lines of accountability.

  6. Mar 2019
  7. www.archivogeneral.gov.co
    1. Standardization of descriptive entries: People, Places, Institutions (using Linked Open Data (LOD) where possible).

      Which knowledge organization system makes this possible for them? What are they using to link data, and in what format?

    1. The government needs to place tough restrictions on data collection and storage by businesses to limit the amount of damage in the event of a cyber breach.

      I find it hard to imagine how this could be usefully implemented. How is monitoring of data collection going to be done?

      Even simpler ideas, like the Do Not Call registry, have difficulty clamping down on businesses that breach regulations.

    1. Mithering about the unmodellable. "Sometime late last year I went to the Euro IA conference with Anya and Silver to give a talk on the domain modelling work we've been doing in UK Parliament."

    1. DXtera Institute is a non-profit, collaborative member-based consortium dedicated to transforming student and institutional outcomes in higher education.

      We specialize in helping higher education professionals drive more efficient access to information and insights for effective decision-making and realize long-term cost savings, by simplifying and removing barriers to systems integration and improving data aggregation and control.

      With partners across the U.S. and Europe, our consortium includes some of the brightest minds in education and technology, all working together to solve critical higher education issues on a global scale.

    1. Data journalism produced by two of the nation’s most prestigious news organizations — The New York Times and The Washington Post — has lacked transparency, often failing to explain the methods journalists or others used to collect or analyze the data on which the articles were based, a new study finds. In addition, the news outlets usually did not provide the public with access to that data

      While this is a worthwhile topic, I would like to see more exploration of data journalism in the 99.99999 percent of news organizations that are NOT the New York Times or the Washington Post and don't have the resources to publish so many data stories despite the desperate need for them across the nation. Also, why were no digital news outlets included?

    2. Worse yet, it wouldn’t surprise me if we saw more unethical people publish data as a strategic communication tool, because they know people tend to believe numbers more than personal stories. That’s why it’s so important to have that training on information literacy and methodology.”

      Like the way unethical people use statistics in general? This should be a concern, especially as government data, long considered the gold standard of data, undergoes attacks that would skew the data toward political ends. (see the census 2020)

    3. fall short of the ideal of data journalism

      Is this the ideal of data journalism? Where is this ideal spelled out, and is there any sign that the NYT and WaPo have agreed to abide by this ideal?

  8. Feb 2019
    1. set; if this is higher, the tree can be considered to fit the data less well

      To test the fit between data and more than one alternative tree, you can just do a bootstrap analysis, and map the results on a neighbour-net splits graph based on the same data.

      Note that the phangorn library includes functions to transfer information between trees/tree samples and trees and networks: Schliep K, Potts AJ, Morrison DA, Grimm GW. 2017. Intertwining phylogenetic trees and networks. Methods in Ecology and Evolution. DOI:10.1111/2041-210X.12760 (http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12760/full). The basic functions and script templates are provided in the associated vignette.

    1. These models are emerging, which is why it's exciting to be involved on the ground floor of this sector; however, some models clearly make sense already, largely because they closely follow the models free software itself has shaped. If you want status, then you can make a name for yourself by leading a team to write the docs, à la free software itself; if you want money, then build the reputation of the documentation team and contract out your knowledge (e.g. extend the docs on contract, à la free software).

      I think this needs to be connected with microfunding models and independent stores like Itch.io, and that the experiment should be progressive while leaving a possible map of its own future. We will try something like that in the 13th edition of the Data Week.

    1. Dissecting Flavivirus Biology in Salivary Gland Cultures from Fed and Unfed Ixodes scapularis (Black-Legged Tick)

      Data worth viewing: a tick trachea with viral infection in its salivary glands.

    1. [Highlighted passage unreadable in the source]

      Could empirical data made up of experiences present in the form of an ethnography? Or autoethnography? I'm not sure if this is what you were getting at here, but it is a thought that came to mind!

  9. Jan 2019
    1. Nyhan and Reifler also found that presenting challenging information in a chart or graph tends to reduce disconfirmation bias. The researchers concluded that the decreased ambiguity of graphical information (as opposed to text) makes it harder for test subjects to question or argue against the content of the chart.

      Amazingly important double-edged finding for discussions of data visualization!

  10. demandlab.weebly.com
    1. y bosses want to see quick wins, but I know we can achieve big w

      add "My data (database) quality sucks"

    1. You may not access or use the Site in any manner that could damage or overburden any MIT server, or any network connected to any MIT server. You may not use the Site in any manner that would interfere with any other party’s use of the Site.

      We are going to do a small amount of scraping that will not overload the server, so we are complying with this part. In fact, once our work is done, it will help spread the server's load, since a copy will be hosted on our servers.

    1. Adoption of good practice to generate high quality data will depend on sharing the burden of capacity building in some way. That, in turn, cannot happen until there is a framework that provides sufficient trust to allow the sharing and comparison of data and its management.

      harkening to the 'data trust' concept being discussed from U.S. Mellon-funded projects, also co-authored by the authors of this paper.

  11. Dec 2018
    1. Outliers: All data sets have an expected range of values, and any actual data set also has outliers that fall below or above the expected range. (Space precludes a detailed discussion of how to handle outliers for statistical analysis purposes; see Barnett & Lewis, 1994 for details.) How to clean outliers strongly depends on the goals of the analysis and the nature of the data.

      Outliers can be signals of unanticipated range of behavior or of errors.

    2. Understanding the structure of the data: In order to clean log data properly, the researcher must understand the meaning of each record, its associated fields, and the interpretation of values. Contextual information about the system that produced the log should be associated with the file directly (e.g., “Logging system 3.2.33.2 recorded this file on 12-3-2012”) so that if necessary the specific code that generated the log can be examined to answer questions about the meaning of the record before executing cleaning operations. The potential misinterpretations take many forms, which we illustrate with encoding of missing data and capped data values.

      Context of the data collection and how it is structured is also a critical need.

      Example: coding missing info as "0" risks misinterpretation; code it as NIL, NDN, or something else distinguishable from other data.

    3. Data transformations: The goal of data-cleaning is to preserve the meaning with respect to an intended analysis. A concomitant lesson is that the data-cleaner must track all transformations performed on the data.

      Changes to data during clean up should be annotated.

      Incorporate metadata about the "chain of change" to accompany the written memo.

    4. Data Cleaning: A basic axiom of log analysis is that the raw data cannot be assumed to correctly and completely represent the data being recorded. Validation is really the point of data cleaning: to understand any errors that might have entered into the data and to transform the data in a way that preserves the meaning while removing noise. Although we discuss web log cleaning in this section, it is important to note that these principles apply more broadly to all kinds of log analysis; small datasets often have similar cleaning issues as massive collections. In this section, we discuss the issues and how they can be addressed. How can logs possibly go wrong? Logs suffer from a variety of data errors and distortions. The common sources of errors we have seen in practice include:

      Common sources of errors:

      • Missing events

      • Dropped data

      • Misplaced semantics (encoding log events differently)

    5. In addition, real world events, such as the death of a major sports figure or a political event, can often cause people to interact with a site differently. Again, be vigilant in sanity checking (e.g., look for an unusual number of visitors) and exclude data until things are back to normal.

      Important consideration for temporal event RQs in refugee study -- whether external events influence use of natural disaster metaphors.

    6. Recording accurate and consistent time is often a challenge. Web log files record many different timestamps during a search interaction: the time the query was sent from the client, the time it was received by the server, the time results were returned from the server, and the time results were received on the client. Server data is more robust but includes unknown network latencies. In both cases the researcher needs to normalize times and synchronize times across multiple machines. It is common to divide the log data up into “days,” but what counts as a day? Is it all the data from midnight to midnight at some common time reference point or is it all the data from midnight to midnight in the user’s local time zone? Is it important to know if people behave differently in the morning than in the evening? Then local time is important. Is it important to know everything that is happening at a given time? Then all the records should be converted to a common time zone.

      Challenges of using time-based log data are similar to difficulties in the SBTF time study using Slack transcripts, social media, and Google Sheets

    7. Log Studies collect the most natural observations of people as they use systems in whatever ways they typically do, uninfluenced by experimenters or observers. As the amount of log data that can be collected increases, log studies include many different kinds of people, from all over the world, doing many different kinds of tasks. However, because of the way log data is gathered, much less is known about the people being observed, their intentions or goals, or the contexts in which the observed behaviors occur. Observational log studies allow researchers to form an abstract picture of behavior with an existing system, whereas experimental log studies enable comparisons of two or more systems.

      Benefits of log studies:

      • Complement other types of lab/field studies

      • Provide a portrait of uncensored behavior

      • Easy to capture at scale

      Disadvantages of log studies:

      • Lack of demographic data

      • Non-random sampling bias

      • Provide info on what people are doing but not their "motivations, success or satisfaction"

      • Can lack needed context (software version, what is displayed on screen, etc.)

      Ways to mitigate: Collecting, Cleaning and Using Log Data section

    8. Two common ways to partition log data are by time and by user. Partitioning by time is interesting because log data often contains significant temporal features, such as periodicities (including consistent daily, weekly, and yearly patterns) and spikes in behavior during important events. It is often possible to get an up-to-the-minute picture of how people are behaving with a system from log data by comparing past and current behavior.

      Bookmarked for time reference.

      Mentions challenges of accounting for time zones in log data.

    9. An important characteristic of log data is that it captures actual user behavior and not recalled behaviors or subjective impressions of interactions.

      Logs can be captured on client-side (operating systems, applications, or special purpose logging software/hardware) or on server-side (web search engines or e-commerce)

    10. Table 1 Different types of user data in HCI research

    11. Large-scale log data has enabled HCI researchers to observe how information diffuses through social networks in near real-time during crisis situations (Starbird & Palen, 2010), characterize how people revisit web pages over time (Adar, Teevan, & Dumais, 2008), and compare how different interfaces for supporting email organization influence initial uptake and sustained use (Dumais, Cutrell, Cadiz, Jancke, Sarin, & Robbins, 2003; Rodden & Leggett, 2010).

      Wide variety of uses of log data

    12. Behavioral logs are traces of human behavior seen through the lenses of sensors that capture and record user activity.

      Definition of log data
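
Two of the cleaning questions raised in the highlights above (what counts as a "day", and how missing values are coded) can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the three records, the dwell-time field, and the function names `day_counts_common`, `day_counts_local`, and `mean_dwell` are hypothetical and do not come from the annotated chapter.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical log records: (ISO timestamp with the user's UTC offset, dwell time in seconds).
# None marks a missing dwell time, so it cannot be confused with a genuine 0.0.
RECORDS = [
    ("2012-12-03T23:30:00-08:00", 12.0),
    ("2012-12-04T00:15:00-08:00", None),
    ("2012-12-04T01:00:00+01:00", 0.0),
]


def day_counts_common(records):
    """Count records per day after converting every timestamp to one common zone (UTC)."""
    return Counter(
        datetime.fromisoformat(ts).astimezone(timezone.utc).date().isoformat()
        for ts, _ in records
    )


def day_counts_local(records):
    """Count records per day as seen on each user's own clock (offset kept as recorded)."""
    return Counter(datetime.fromisoformat(ts).date().isoformat() for ts, _ in records)


def mean_dwell(records):
    """Average dwell time over records that actually have a value."""
    values = [dwell for _, dwell in records if dwell is not None]
    return sum(values) / len(values)


common = day_counts_common(RECORDS)  # all three records fall on the same UTC day
local = day_counts_local(RECORDS)    # the same records split across two local days
average = mean_dwell(RECORDS)        # the missing value is skipped, not counted as 0
```

The same three records land on one day under a common reference zone and on two days under local time, which is the choice the highlighted passage asks the researcher to make deliberately. Had the missing dwell time been stored as 0, the average would have been 4.0 rather than 6.0.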

    1. Ethnographic findings are not privileged, just particular: another country heard from. To regard them as anything more (or anything less) than that distorts both them and their implications, which are far profounder than mere primitivity, for social theory.

      This tension exists in HCI as well.

      Interpreted data vs empirical data and how each is systematically analyzed.

  12. Nov 2018
    1. One way to think about "core" biodiversity data is as a network of connected entities, such as taxa, taxonomic names, publications, people, species, sequences, images, collections, etc. (Fig. 1)
    1. “It’s about embracing the inscrutable nature of human interactions,” says Chang. Evidence-based medicine was a massive improvement over intuition-based medicine, he says, but it only covers traditionally quantifiable data, or those things that are easy to measure. But we’re now quantifying information that was considered qualitative a generation ago.

      Biggest challenges to redesigning the health care system in a way that would work better for patients and improve health

    2. “Our biggest opportunity is leaning into that. It’s either embracing the qualitative nature of that and designing systems that can act just on the qualitative nature of their experience, or figuring how to quantitate some of those qualitative measures,” says Chang. “That’ll get us much further, because the real value in health care systems is in the human interactions. My relationship with you as a doctor and a patient is far more valuable than the evidence that some trial suggests.”

      Biggest challenges to redesigning the health care system in a way that would work better for patients and improve health

    1. The Chinese place a higher value on community good versus individual rights, so most feel that, if social credit will bring a safer, more secure, more stable society, then bring it on
    1. Unless you need to push the boundaries of what these technologies are capable of, you probably don’t need a highly specialized team of dedicated engineers to build solutions on top of them. If you manage to hire them, they will be bored. If they are bored, they will leave you for Google, Facebook, LinkedIn, Twitter, … – places where their expertise is actually needed. If they are not bored, chances are they are pretty mediocre. Mediocre engineers really excel at building enormously over complicated, awful-to-work-with messes they call “solutions”. Messes tend to necessitate specialization.
    1. For the second, we could try to detect inconsistencies, either by inspecting samples of the class hierarchy

      Yes, that's what I do when doing quality work on the taxonomy (with the tool wdtaxonomy)

    2. Possible relations between Items

      This only includes properties of data-type item?! It should be made more clear because the majority of Wikidata classes has other data types.

    3. A KG typically spans across several domains and is built on top of a conceptual schema, or ontology, which defines what types of entities (classes) are allowed in the graph, alongside the types of properties they can have

      Wikidata differs from a typical KG in that it is not built on top of classes (entity types). Any item (entity) can be connected by any property. Wikidata's only strict "classes" in the sense of KG classes are its data types (item, lemma, monolingual string...).

    1. What is crucial is that they remain masters of the process - and develop a vision for the new machine age.

      It does not really look to me as though we were ever the "masters of the process." And that, too, is what Marx is about. I think.

    1. Does the widespread and routine collection of student data in ever new and potentially more-invasive forms risk normalizing and numbing students to the potential privacy and security risks?

      What happens if we turn this around - given a widespread and routine data collection culture which normalizes and numbs students to risk as early as K-8, what are our responsibilities (and strategies) to educate around this culture? And how do our institutional practices relate to that educational mission?

  13. Oct 2018
    1. As a recap, on September 19th Chegg discovered a data breach dating back to April, in which "an unauthorized party" accessed a database containing "a Chegg user’s name, email address, shipping address, Chegg username, and hashed Chegg password" but no financial information or Social Security numbers. The company has not disclosed, or is unsure of, how many of its 40 million users had their personal information stolen.

    1. tl;dr: data engineer = software, coding, cleaning data sets; data architects = structure the technology to manage data models and database admin; data scientist = stats + math models; business analysts = communication and domain expertise

    1. research publications are not research data

      they could be, if used as part of a text mining corpus, for example

  14. Sep 2018
    1. I love the voice of their help page. Someone very opinionated (in a good way) is building this product. I particularly like this quote: Your data is a liability to us, not an asset.
    1. End-Users

      Because Grafoscopio was used in critical digital literacy workshops dealing with data activism and journalism, the intended users are people who do not necessarily know how to program but are not afraid of learning to code to express their concerns (as activists, journalists, and citizens in general) and in fact are willing to do so.

      Tool adaptation arose "naturally" from the workshops, because the idea was to extend the tool so it could deal with the authentic problems at hand (as reported extensively in the PhD thesis), and a digital citizenship curriculum was built during the events as a memory of how we dealt with those problems. But critical digital literacy is a long process, so making coding a non-programmer's skill, in service of wider populations able to express citizen concerns in code, data, and visualizations, takes a long time.

      The visibility, scalability, and sustainability of such critical digital literacy endeavors, where communities and digital tools change each other mutually, is still an open problem, even more so considering their location in the Global South (despite addressing contextualized global problems).

    1. In October 2014 the Open Knowledge Foundation recommends the Creative Commons CC0 license to dedicate content to the public domain,[51][52] and the Open Data Commons Public Domain Dedication and License (PDDL) for data.[53]
    1. predictive analysis

      Predictive analytics encompasses a variety of statistical techniques from data mining, predictive modelling, and machine learning, that analyze current and historical facts to make predictions about future or otherwise unknown events.

  15. Aug 2018
    1. this possibility of increased ownership and agency over technology and a somewhat romantic idea I have that this can transfer to inspire ownership and agency over learning
    1. A file containing personal information of 14.8 million Texas residents was discovered on an unsecured server. It is not clear who owns the server, but the data was likely compiled by Data Trust, a firm created by the GOP.