3,501 Matching Annotations
  1. Apr 2020
    1. Data Erasure and Storage Time The personal data of the data subject will be erased or blocked as soon as the purpose of storage ceases to apply. The data may be stored beyond that if the European or national legislator has provided for this in EU regulations, laws or other provisions to which the controller is subject. The data will also be erased or blocked if a storage period prescribed by the aforementioned standards expires, unless there is a need for further storage of the data for the conclusion or performance of a contract.
    2. The data is stored in log files to ensure the functionality of the website. In addition, the data serves us to optimize the website and to ensure the security of our information technology systems. An evaluation of the data for marketing purposes does not take place in this context. The legal basis for the temporary storage of the data and the log files is Art. 6 para. 1 lit. f GDPR. Our legitimate interests lie in the above-mentioned purposes.
    1. Someone, somewhere has screwed up to the extent that data got hacked and is now in the hands of people it was never intended to be. No way, no how does this give me license to then treat that data with any less respect than if it had remained securely stored and I reject outright any assertion to the contrary. That's a fundamental value I operate under
    1. One mistake that we made when creating the import/export experience for Blogger was relying on one HTTP transaction for an import or an export. HTTP connections become fragile when the size of the data that you're transferring becomes large. Any interruption in that connection voids the action and can lead to incomplete exports or missing data upon import. These are extremely frustrating scenarios for users and, unfortunately, much more prevalent for power users with lots of blog data.
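
      One common remedy is to make a large transfer resumable instead of all-or-nothing, for example with HTTP Range requests. Here is a minimal sketch in Python; the URL and filename are placeholders, and it assumes the server honors Range headers (an illustration of the general technique, not of what Blogger actually did):

      import os

      import requests  # third-party HTTP client: pip install requests

      url = "https://example.com/export/blog-archive.zip"  # placeholder URL
      path = "blog-archive.zip"

      # Resume from however many bytes we already have on disk.
      start = os.path.getsize(path) if os.path.exists(path) else 0
      headers = {"Range": f"bytes={start}-"} if start else {}

      with requests.get(url, headers=headers, stream=True, timeout=30) as resp:
          resp.raise_for_status()
          # 206 means the server honored the Range header; otherwise start over.
          mode = "ab" if resp.status_code == 206 else "wb"
          with open(path, mode) as f:
              for chunk in resp.iter_content(chunk_size=1 << 20):
                  f.write(chunk)
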
    2. At Google, our attitude has always been that users should be able to control the data they store in any of our products, and that means that they should be able to get their data out of any product. Period. There should be no additional monetary cost to do so, and perhaps most importantly, the amount of effort required to get the data out should be constant, regardless of the amount of data. Individually downloading a dozen photos is no big inconvenience, but what if a user had to download 5,000 photos, one at a time, to get them out of an application? That could take weeks of their time.
    1. In addition to strings, Redis supports lists, sets, sorted sets, hashes, bit arrays, and hyperloglogs. Applications can use these more advanced data structures to support a variety of use cases. For example, you can use Redis Sorted Sets to easily implement a game leaderboard that keeps a list of players sorted by their rank.

      Redis supports more data structures; Memcached is key-value only.

      Memcached is not highly available because it lacks the replication support that Redis has.
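
      As a concrete illustration of the sorted-set leaderboard mentioned above, a minimal sketch using the redis-py client (the key name, player names, and scores are invented for the example):

      import redis  # assumes the redis-py client: pip install redis

      r = redis.Redis(host="localhost", port=6379, decode_responses=True)

      # ZADD keeps members ordered by score, so ranking is maintained on every write.
      r.zadd("leaderboard", {"alice": 3200, "bob": 4100, "carol": 2750})

      # Bump a player's score after a game; the set re-ranks automatically.
      r.zincrby("leaderboard", 150, "alice")

      # Top three players, highest score first, with their scores.
      print(r.zrevrange("leaderboard", 0, 2, withscores=True))

      # A single player's rank (0-based, highest score first).
      print(r.zrevrank("leaderboard", "alice"))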

    1. Adapting Tricia Wang’s notion of “thick data,”

      https://dscout.com/people-nerds/people-are-your-data-tricia-wang So what is thick data? It’s a term you coined, right? "Thick data is simply my re-branding of qualitative data, or design thinking, or user research, gut intuition, tribal knowledge, sensible thinking—I’ve heard it called so many things. Being open to the unknown means you will embrace data that is thick. Ostensibly it’s data that has yet to be quantified, is data that you may not even know that you need to collect and you don’t know it until you’re in the moment being open to it, and open to the invitation of the unknown in receiving that."

    1. Google's move to release location data highlights concerns around privacy. According to Mark Skilton, director of the Artificial Intelligence Innovation Network at Warwick Business School in the UK, Google's decision to use public data "raises a key conflict between the need for mass surveillance to effectively combat the spread of coronavirus and the issues of confidentiality, privacy, and consent concerning any data obtained."
  2. Mar 2020
    1. of the six lawful, GDPR-compliant ways companies can get the green light to process individual personal data, consent is the “least preferable.” According to guidance from the European Commission's Article 29 Working Party, "a controller must always take time to consider whether consent is the appropriate lawful ground for the envisaged processing or whether another ground should be chosen instead."
    2. “It is unfortunate that a lot of companies are blindly asking for consent when they don’t need it because they have either historically obtained the consent to contact a user,” said digital policy consultant Kristina Podnar. “Or better yet, the company has a lawful basis for contact. Lawful basis is always preferable to consent, so I am uncertain why companies are blindly dismissing that path in favor of consent.”
    1. For example, personal information. Each governmental ministry might have any amount of information on a given voter. Finance knows about your tax, health knows about your medicare claims, justice knows about your criminal record, the RTA the status of your driver's licence. But they don’t know what the OTHER ministries know about you. The data is segregated. Just because one group of people knows about a given part of your life or existence doesn’t mean that they can get access to everything else. You, by design, do not have one single file that has everything on you. The data is segregated.
    1. Users have the right to access their personal data and information about how their personal data is being processed. If the user requests it, data controllers must provide an overview of the categories of data being processed, a copy of the actual data and details about the processing. The details should include the purpose, how the data was acquired and with whom it was shared.
    1. Data privacy now a global movement: We’re pleased to say we’re not the only ones who share this philosophy. Web browsing companies like Brave have made it possible for you to browse the internet more privately, and you can use a search engine like DuckDuckGo to search with the freedom you deserve.
  3. www.graphitedocs.com
    1. When you think about data law and privacy legislation, cookies easily come to mind as they’re directly related to both. This often leads to the common misconception that the Cookie Law (ePrivacy Directive) has been repealed by the General Data Protection Regulation (GDPR), which in fact it has not. Instead, you can think of the ePrivacy Directive and GDPR as working together and complementing each other; in the case of cookies, the ePrivacy Directive generally takes precedence.
    1. Decision point #2 – Do you send any data to third parties, directly or inadvertently? [Flowchart: GDPR cookie consent flowchart] Remember, inadvertently transmitting data to third parties can occur through the plugins you use on your website. You don't necessarily have to be doing this proactively. If the answer is “Yes,” then to comply with GDPR, you should use a cookie consent popup.
    1. GDPR introduces a list of data subjects’ rights that must be respected by both data controllers and data processors. The list includes: right of access by the data subject (Section 2, Art 15); right to rectification (Section 3, Art 16); right to object to processing (Section 4, Art 21); right to erasure, also known as the ‘right to be forgotten’ (Section 3, Art 17); right to restrict processing (Section 3, Art 18); and right to data portability (Section 3, Art 20).
    1. 15. Data sharing: practices and needs of the transport research community. In view of the growing importance of data sharing in transport research, and based on recommendations of a dedicated expert group on the Transport Research Cloud, a new study will establish: 1. the needs and objections of transport researchers in relation to data sharing; 2. the training requirements needed for the transport research community to facilitate data sharing; and 3. what potential user communities would expect from a Transport Research Cloud. Type of Action: Provision of technical/scientific services by the Joint Research Centre. Indicative timetable: second quarter of 2020. Indicative budget: EUR 0.20 million from the 2020 budget.
    1. Earlier this year it began asking Europeans for consent to processing their selfies for facial recognition purposes — a highly controversial technology that regulatory intervention in the region had previously blocked. Yet now, as a consequence of Facebook’s confidence in crafting manipulative consent flows, it’s essentially figured out a way to circumvent EU citizens’ fundamental rights — by socially engineering Europeans to override their own best interests.
    2. The deceitful obfuscation of commercial intention certainly runs all the way through the data brokering and ad tech industries that sit behind much of the ‘free’ consumer Internet. Here consumers have plainly been kept in the dark so they cannot see and object to how their personal information is being handed around, sliced and diced, and used to try to manipulate them.
    1. Pulse: Pulse Rates and Exercise

      Students in a Stat2 class recorded resting pulse rates (in class), did three "laps" walking up/down a nearby set of stairs, and then measured their pulse rate after the exercise. They provided additional information about height, weight, exercise, and smoking habits via a survey.

    1. Not only are public transport datasets useful for benchmarking route planning systems, they are also highly useful for benchmarking geospatial [13, 14] and temporal [15, 16] RDF systems due to the intrinsic geospatial and temporal properties of public transport datasets. While synthetic dataset generators already exist in the geospatial and temporal domain [17, 18], no systems exist yet that focus on realism and specifically look into the generation of public transport datasets. As such, the main topic that we address in this work is solving the need for realistic public transport datasets with geospatial and temporal characteristics, so that they can be used to benchmark RDF data management and route planning systems. More specifically, we introduce a mimicking algorithm for generating realistic public transport data, which is the main contribution of this work.
    1. Right now, if you want to know what data Facebook has about you, you don’t have the right to ask them to give you all of the data they have on you, and the right to know what they’ve done with it. You should have that right. You should have the right to know and have access to your data.
  4. Feb 2020
    1. Research ethics concerns issues, such as privacy, anonymity, informed consent and the sensitivity of data. Given that social media is part of society’s tendency to liquefy and blur the boundaries between the private and the public, labour/leisure, production/consumption (Fuchs, 2015a: Chapter 8), research ethics in social media research is particularly complex.
    2. One important aspect of critical social media research is the study of not just ideologies of the Internet but also ideologies on the Internet. Critical discourse analysis and ideology critique as research method have only been applied in a limited manner to social media data. Majid KhosraviNik (2013) argues in this context that ‘critical discourse analysis appears to have shied away from new media research in the bulk of its research’ (p. 292). Critical social media discourse analysis is a critical digital method for the study of how ideologies are expressed on social media in light of society’s power structures and contradictions that form the texts’ contexts.
    3. It has, for example, been common to study contemporary revolutions and protests (such as the 2011 Arab Spring) by collecting large amounts of tweets and analysing them. Such analyses can, however, tell us nothing about the degree to which activists use social and other media in protest communication, what their motivations are to use or not use social media, what their experiences have been, what problems they encounter in such uses and so on. If we only analyse big data, then the one-sided conclusion that contemporary rebellions are Facebook and Twitter revolutions is often the logical consequence (see Aouragh, 2016; Gerbaudo, 2012). Digital methods do not outdate but require traditional methods in order to avoid the pitfall of digital positivism. Traditional sociological methods, such as semi-structured interviews, participant observation, surveys, content and critical discourse analysis, focus groups, experiments, creative methods, participatory action research, statistical analysis of secondary data and so on, have not lost importance. We do not just have to understand what people do on the Internet but also why they do it, what the broader implications are, and how power structures frame and shape online activities.
    4. Challenging big data analytics as the mainstream of digital media studies requires us to think about theoretical (ontological), methodological (epistemological) and ethical dimensions of an alternative paradigm

      Making the case for the need for digitally native research methodologies.

    5. Who communicates what to whom on social media with what effects? It forgets users’ subjectivity, experiences, norms, values and interpretations, as well as the embeddedness of the media into society’s power structures and social struggles. We need a paradigm shift from administrative digital positivist big data analytics towards critical social media research. Critical social media research combines critical social media theory, critical digital methods and critical-realist social media research ethics.
    6. Big data analytics’ trouble is that it often does not connect statistical and computational research results to a broader analysis of human meanings, interpretations, experiences, attitudes, moral values, ethical dilemmas, uses, contradictions and macro-sociological implications of social media.
  5. Jan 2020
    1. Sociedad Quimica y Minera de Chile (NYSE:SQM): Known as SQM, this company also extracts potash from the Atacama Desert and is Chile’s largest fertilizer producer.

      Potassium mines in 3rd world countries.

    1. Annotation extends that power to a web made not only of linked resources, but also of linked segments within them. If the web is a loom on which applications are woven, then annotation increases the thread count of the fabric. Annotation-powered applications exploit the denser weave by defining segments and attaching data or behavior to them.

      I remember the first time I truly understood what Jon meant when he said this. One web page can have an unlimited number of specific addresses pointing into its parts--and through annotation these parts can be connected to an unlimited number of parts of other things. Jon called it: Exploding the web! How far we've come from Vannevar Bush's musings...

    1. The Web Annotation Data Model specification describes a structured model and format to enable annotations to be shared and reused across different hardware and software platforms.

      The publication of this web standard changed everything. I look forward to true testing of interoperable open annotation. The publication of the standard nearly three years ago was a game changer, but the game is still in progress. The future potential is unlimited!
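
      To make the model concrete: a minimal annotation in the spec's JSON serialization, built here as a Python dict. The body text, target URL, and selected quote are invented for illustration; the structure follows the W3C Web Annotation Data Model.

      import json

      annotation = {
          "@context": "http://www.w3.org/ns/anno.jsonld",
          "type": "Annotation",
          # The body carries the comment being attached.
          "body": {
              "type": "TextualBody",
              "value": "A comment about the selected passage.",
              "format": "text/plain",
          },
          # The target pins the annotation to a segment of a resource.
          "target": {
              "source": "https://example.com/article",  # placeholder page
              "selector": {
                  "type": "TextQuoteSelector",
                  "exact": "the selected passage",
              },
          },
      }
      print(json.dumps(annotation, indent=2))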

    1. A final word: when we do not understand something, it does not look like there is anything to be understood at all - it just looks like random noise. Just because it looks like noise does not mean there is no hidden structure.

      Excellent statement! Could this be the guiding principle of the current big data boom in biology?

  6. Dec 2019
  7. plaintext-productivity.net
    1. Avoiding complicated outlining or mind-mapping software saves a bunch of mouse clicks or dreaming up complicated visualizations (it helps if you are a linear thinker).

      Hmm. I'm not sure I agree with this thought/sentiment (though it's hard to tell since it's an incomplete sentence). I think visualizations and mind-mapping software might be an even better way to go, in terms of efficiency of editing (since they are specialized for the task), enjoyment of use, etc.

      The main thing text files have going for them is flexibility, portability, client-neutrality, the ability to get started right now without researching and evaluating a zillion competing GUI app alternatives.

  8. burnsoftware.wordpress.com
  9. burnsoftware.wordpress.com
    1. Plain text is software and operating system agnostic. It's searchable, portable, lightweight, and easily manipulated. It's unstructured. It works when someone else's web server is down or your Outlook .PST file is corrupt. There's no exporting and importing, no databases or tags or flags or stars or prioritizing or insert company name here-induced rules on what you can and can't do with it.
    1. greater integration of data, data security, and data sharing through the establishment of a searchable database.

      Would be great to connect these efforts with others who work on this from the data end, e.g. RDA as mentioned above.

      Also, the presentation at http://www.gfbr.global/wp-content/uploads/2018/12/PG4-Alpha-Ahmadou-Diallo.pptx states

      This data will be made available to the public and to scientific and humanitarian health communities to disseminate knowledge about the disease, support the expansion of research in West Africa, and improve patient care and future response to an outbreak.

      but the notion of public access is not clearly articulated in the present article.

    1. Practical highlights in my opinion:

      • It's important to know about data padding in PG.
      • Be conscious of column ordering when modelling data tables, but don't be dogmatic about it; do it on a best-effort basis.
      • Gains of up to 25% in wasted storage are impressive, but always keep in mind the scope of the system. For me, the gains are not worth it in the short term. Whenever a system grows, it is possible to migrate data to more storage-efficient tables, but mind the operational burden.

      Here follow my own commands from trying out the article's points. I added - pg_column_size(row()) on each projection to have clear absolute sizes.

      -- How does row function work?
      
      SELECT pg_column_size(row()) AS empty,
             pg_column_size(row(0::SMALLINT)) AS byte2,
             pg_column_size(row(0::BIGINT)) AS byte8,
             pg_column_size(row(0::SMALLINT, 0::BIGINT)) AS byte16,
             pg_column_size(row(''::TEXT)) AS text0,
             pg_column_size(row('hola'::TEXT)) AS text4,
             0 AS term
      ;
      
      -- My own take on that
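      -- Note: uuid_generate_v4() requires the uuid-ossp extension
      -- (CREATE EXTENSION IF NOT EXISTS "uuid-ossp";); on PostgreSQL 13+
      -- the built-in gen_random_uuid() is an alternative.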
      
      SELECT pg_column_size(row()) AS empty,
             pg_column_size(row(uuid_generate_v4())) AS uuid_type,
             pg_column_size(row('hola mundo'::TEXT)) AS text_type,
             pg_column_size(row(uuid_generate_v4(), 'hola mundo'::TEXT)) AS uuid_text_type,
             pg_column_size(row('hola mundo'::TEXT, uuid_generate_v4())) AS text_uuid_type,
             0 AS term
      ;
      
      CREATE TABLE user_order (
        is_shipped    BOOLEAN NOT NULL DEFAULT false,
        user_id       BIGINT NOT NULL,
        order_total   NUMERIC NOT NULL,
        order_dt      TIMESTAMPTZ NOT NULL,
        order_type    SMALLINT NOT NULL,
        ship_dt       TIMESTAMPTZ,
        item_ct       INT NOT NULL,
        ship_cost     NUMERIC,
        receive_dt    TIMESTAMPTZ,
        tracking_cd   TEXT,
        id            BIGSERIAL PRIMARY KEY NOT NULL
      );
      
      SELECT a.attname, t.typname, t.typalign, t.typlen
        FROM pg_class c
        JOIN pg_attribute a ON (a.attrelid = c.oid)
        JOIN pg_type t ON (t.oid = a.atttypid)
       WHERE c.relname = 'user_order'
         AND a.attnum >= 0
       ORDER BY a.attnum;
      
      -- What is it about pg_class, pg_attribute and pg_type tables? For future investigation.
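      -- (pg_class lists relations, pg_attribute their columns, and pg_type the
      -- data types; joining them yields each column's alignment and length.)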
      
      -- SELECT sum(t.typlen)
      -- SELECT t.typlen
      SELECT a.attname, t.typname, t.typalign, t.typlen
        FROM pg_class c
        JOIN pg_attribute a ON (a.attrelid = c.oid)
        JOIN pg_type t ON (t.oid = a.atttypid)
       WHERE c.relname = 'user_order'
         AND a.attnum >= 0
       ORDER BY a.attnum
      ;
      
      -- Whoa! I need to master mocking data directly into db.
      
      INSERT INTO user_order (
          is_shipped, user_id, order_total, order_dt, order_type,
          ship_dt, item_ct, ship_cost, receive_dt, tracking_cd
      )
      SELECT true, 1000, 500.00, now() - INTERVAL '7 days',
             3, now() - INTERVAL '5 days', 10, 4.99,
             now() - INTERVAL '3 days', 'X5901324123479RROIENSTBKCV4'
        FROM generate_series(1, 1000000);
      
      -- New item to learn, pg_relation_size. 
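      -- (pg_relation_size returns the on-disk size of a relation's main fork
      -- in bytes; pg_size_pretty formats a byte count for humans.)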
      
      SELECT pg_relation_size('user_order') AS size_bytes,
             pg_size_pretty(pg_relation_size('user_order')) AS size_pretty;
      
      SELECT * FROM user_order LIMIT 1;
      
      SELECT pg_column_size(row(0::NUMERIC)) - pg_column_size(row()) AS zero_num,
             pg_column_size(row(1::NUMERIC)) - pg_column_size(row()) AS one_num,
             pg_column_size(row(9.9::NUMERIC)) - pg_column_size(row()) AS nine_point_nine_num,
             pg_column_size(row(1::INT2)) - pg_column_size(row()) AS int2,
             pg_column_size(row(1::INT4)) - pg_column_size(row()) AS int4,
             pg_column_size(row(1::INT2, 1::NUMERIC)) - pg_column_size(row()) AS int2_one_num,
             pg_column_size(row(1::INT4, 1::NUMERIC)) - pg_column_size(row()) AS int4_one_num,
             pg_column_size(row(1::NUMERIC, 1::INT4)) - pg_column_size(row()) AS one_num_int4,
             0 AS term
      ;
      
      SELECT pg_column_size(row(''::TEXT)) - pg_column_size(row()) AS empty_text,
             pg_column_size(row('a'::TEXT)) - pg_column_size(row()) AS len1_text,
             pg_column_size(row('abcd'::TEXT)) - pg_column_size(row()) AS len4_text,
             pg_column_size(row('abcde'::TEXT)) - pg_column_size(row()) AS len5_text,
             pg_column_size(row('abcdefgh'::TEXT)) - pg_column_size(row()) AS len8_text,
             pg_column_size(row('abcdefghi'::TEXT)) - pg_column_size(row()) AS len9_text,
             0 AS term
      ;
      
      SELECT pg_column_size(row(''::TEXT, 1::INT4)) - pg_column_size(row()) AS empty_text_int4,
             pg_column_size(row('a'::TEXT, 1::INT4)) - pg_column_size(row()) AS len1_text_int4,
             pg_column_size(row('abcd'::TEXT, 1::INT4)) - pg_column_size(row()) AS len4_text_int4,
             pg_column_size(row('abcde'::TEXT, 1::INT4)) - pg_column_size(row()) AS len5_text_int4,
             pg_column_size(row('abcdefgh'::TEXT, 1::INT4)) - pg_column_size(row()) AS len8_text_int4,
             pg_column_size(row('abcdefghi'::TEXT, 1::INT4)) - pg_column_size(row()) AS len9_text_int4,
             0 AS term
      ;
      
      SELECT pg_column_size(row(1::INT4, ''::TEXT)) - pg_column_size(row()) AS int4_empty_text,
             pg_column_size(row(1::INT4, 'a'::TEXT)) - pg_column_size(row()) AS int4_len1_text,
             pg_column_size(row(1::INT4, 'abcd'::TEXT)) - pg_column_size(row()) AS int4_len4_text,
             pg_column_size(row(1::INT4, 'abcde'::TEXT)) - pg_column_size(row()) AS int4_len5_text,
             pg_column_size(row(1::INT4, 'abcdefgh'::TEXT)) - pg_column_size(row()) AS int4_len8_text,
             pg_column_size(row(1::INT4, 'abcdefghi'::TEXT)) - pg_column_size(row()) AS int4_len9_text,
             0 AS term
      ;
      
      SELECT pg_column_size(row()) - pg_column_size(row()) AS empty_row,
             pg_column_size(row(''::TEXT)) - pg_column_size(row()) AS no_text,
             pg_column_size(row('a'::TEXT)) - pg_column_size(row()) AS min_text,
             pg_column_size(row(1::INT4, 'a'::TEXT)) - pg_column_size(row()) AS two_col,
             pg_column_size(row('a'::TEXT, 1::INT4)) - pg_column_size(row()) AS round4;
      
      SELECT pg_column_size(row()) - pg_column_size(row()) AS empty_row,
             pg_column_size(row(1::SMALLINT)) - pg_column_size(row()) AS int2,
             pg_column_size(row(1::INT)) - pg_column_size(row()) AS int4,
             pg_column_size(row(1::BIGINT)) - pg_column_size(row()) AS int8,
             pg_column_size(row(1::SMALLINT, 1::BIGINT)) - pg_column_size(row()) AS padded,
             pg_column_size(row(1::INT, 1::INT, 1::BIGINT)) - pg_column_size(row()) AS not_padded;
      
      SELECT a.attname, t.typname, t.typalign, t.typlen
        FROM pg_class c
        JOIN pg_attribute a ON (a.attrelid = c.oid)
        JOIN pg_type t ON (t.oid = a.atttypid)
       WHERE c.relname = 'user_order'
         AND a.attnum >= 0
       ORDER BY t.typlen DESC;
      
      DROP TABLE user_order;
      
      CREATE TABLE user_order (
        id            BIGSERIAL PRIMARY KEY NOT NULL,
        user_id       BIGINT NOT NULL,
        order_dt      TIMESTAMPTZ NOT NULL,
        ship_dt       TIMESTAMPTZ,
        receive_dt    TIMESTAMPTZ,
        item_ct       INT NOT NULL,
        order_type    SMALLINT NOT NULL,
        is_shipped    BOOLEAN NOT NULL DEFAULT false,
        order_total   NUMERIC NOT NULL,
        ship_cost     NUMERIC,
        tracking_cd   TEXT
      );
      
      -- And, what about other varying size types as JSONB?
      
      SELECT pg_column_size(row('{}'::JSONB)) - pg_column_size(row()) AS empty_jsonb,
             pg_column_size(row('{}'::JSONB, 0::INT4)) - pg_column_size(row()) AS empty_jsonb_int4,
             pg_column_size(row(0::INT4, '{}'::JSONB)) - pg_column_size(row()) AS int4_empty_jsonb,
             pg_column_size(row('{"a": 1}'::JSONB)) - pg_column_size(row()) AS basic_jsonb,
             pg_column_size(row('{"a": 1}'::JSONB, 0::INT4)) - pg_column_size(row()) AS basic_jsonb_int4,
             pg_column_size(row(0::INT4, '{"a": 1}'::JSONB)) - pg_column_size(row()) AS int4_basic_jsonb,
             0 AS term;
      
    1. Remarkably, studies receiving mainly public funding can, a quarter of a century on, still survive without making their data available in a useful way. In the UK a series of studies—the Avon Longitudinal Study of Parents and Children (ALSPAC) (100), UK Biobank (101), and Born in Bradford (102), among others—have surely been exemplary in promoting data accessibility.

      Critical points!

    1. I'll give a little bit of the history to provide context. My own involvement in this started around 2008 after we had shipped our key-value store. My next project was to try to get a working Hadoop setup going, and move some of our recommendation processes there. Having little experience in this area, we naturally budgeted a few weeks for getting data in and out, and the rest of our time for implementing fancy prediction algorithms. So began a long slog. We originally planned to just scrape the data out of our existing Oracle data warehouse. The first discovery was that getting data out of Oracle quickly is something of a dark art. Worse, the data warehouse processing was not appropriate for the production batch processing we planned for Hadoop—much of the processing was non-reversible and specific to the reporting being done. We ended up avoiding the data warehouse and going directly to source databases and log files. Finally, we implemented another pipeline to load data into our key-value store for serving results. This mundane data copying ended up being one of the dominant items for the original development. Worse, any time there was a problem in any of the pipelines, the Hadoop system was largely useless—running fancy algorithms on bad data just produces more bad data. Although we had built things in a fairly generic way, each new data source required custom configuration to set up. It also proved to be the source of a huge number of errors and failures. The site features we had implemented on Hadoop became popular and we found ourselves with a long list of interested engineers. Each user had a list of systems they wanted integration with and a long list of new data feeds they wanted. [Image: ETL in Ancient Greece. Not much has changed.]

      A great anecdote/story on the pains of data integration

    2. Effective use of data follows a kind of Maslow's hierarchy of needs. The base of the pyramid involves capturing all the relevant data, being able to put it together in an applicable processing environment (be that a fancy real-time query system or just text files and python scripts). This data needs to be modeled in a uniform way to make it easy to read and process. Once these basic needs of capturing data in a uniform way are taken care of it is reasonable to work on infrastructure to process this data in various ways—MapReduce, real-time query systems, etc. It's worth noting the obvious: without a reliable and complete data flow, a Hadoop cluster is little more than a very expensive and difficult to assemble space heater. Once data and processing are available, one can move concern on to more refined problems of good data models and consistent well understood semantics. Finally, concentration can shift to more sophisticated processing—better visualization, reporting, and algorithmic processing and prediction. In my experience, most organizations have huge holes in the base of this pyramid—they lack reliable complete data flow—but want to jump directly to advanced data modeling techniques. This is completely backwards. So the question is, how can we build reliable data flow throughout all the data systems in an organization?
  10. Nov 2019
    1. TrackMeNot is user-installed and user-managed, resides wholly on the user's system, and functions without the need for 3rd-party servers or services. Placing users in full control is an essential feature of TrackMeNot, whose purpose is to protect against the unilateral policies set by search companies in their handling of our personal information.