Mayo Clinic Proceedings Home

E-mail Address Harvesting on PubMed—A Call for Responsible Handling of E-mail Addresses

      To the Editor: PubMed, a database that can be queried using the Entrez search engine, “comprises over 20 million citations for biomedical literature from MEDLINE, life science journals, and online books.”
      Since January 1996, e-mail addresses for first authors, when available, have been added to the MEDLINE record as they appear in the journals.
      Search and retrieval of PubMed results may occur through the PubMed Web page and also with use of several Entrez Programming Utilities that “provide access to Entrez data outside of the regular web query interface.”
      With regard to these latter tools, corresponding documentation and an educational course with examples are available to the public online.
      Electronic spam can be defined as unsolicited e-mail sent to a large number of addresses. Individuals sending spam harvest e-mail addresses from the Internet using a variety of techniques, including automated use of software to search Web pages for strings of text recognized as e-mail addresses, as well as manual efforts to gain access to large collections of addresses (eg, by subscribing to mailing lists to collect the addresses of other users). Techniques for avoiding spam are many and include avoiding the online publication of e-mail addresses in text form (as opposed to providing an image of the address) and preventing those with malicious intent from accessing large sources of addresses.
      PubMed is extremely vulnerable to e-mail address harvesting. When available, e-mail addresses for first authors are included within citations in text form, so software can retrieve them automatically. More concerning is the ability to quickly generate listings containing thousands of e-mail addresses using the Entrez Programming Utilities. With only basic computer programming knowledge, and within 30 minutes of discovering these utilities, I was able to generate a listing of more than 7000 addresses. More responsible handling of e-mail addresses in PubMed is clearly needed; this may be accomplished by eliminating publication of e-mail addresses in text form and restricting the return of e-mail addresses when results are fetched outside of the regular Web query interface.
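      As a rough illustration of the second suggestion, a service could mask address-like strings in a record before returning it through a programmatic interface. The sketch below is a minimal example under stated assumptions: the function name `redact_emails`, the placeholder text, and the sample record are all hypothetical, and the pattern is deliberately simple rather than a complete e-mail grammar. It does not describe how PubMed or the Entrez Programming Utilities are actually implemented.

```python
import re

# Deliberately simple pattern for illustration only; it does not cover
# every valid e-mail address form (assumption, not a full grammar).
EMAIL_PATTERN = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
)


def redact_emails(record_text: str, placeholder: str = "[e-mail withheld]") -> str:
    """Return record_text with address-like strings replaced by a placeholder.

    Hypothetical helper: a server might apply this to the affiliation
    field of a citation before returning it outside the Web interface.
    """
    return EMAIL_PATTERN.sub(placeholder, record_text)


if __name__ == "__main__":
    # Made-up affiliation string, not taken from any real record.
    sample = "Department of Medicine, Example University. jane.doe@example.edu"
    print(redact_emails(sample))
```

      Masking on the server side, rather than relying on clients to ignore the field, means the address never leaves the database in bulk-retrievable text form.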

      REFERENCES

        • US National Library of Medicine
        • National Institutes of Health
        PubMed Help: FAQs.
        (Accessed March 3, 2011.)
        • US Department of Health and Human Services (DHS)
        The NLM Technical Bulletin. No 287.
        (Accessed February 8, 2011.)
        • National Center for Biotechnology Information (NCBI)
        Entrez programming utilities. NCBI Website.
        (Updated February 16, 2009. Accessed February 8, 2011.)
        • Sayers E
        • Wheeler D
        • US National Library of Medicine
        • National Institutes of Health (NIH)
        Building customized data pipelines using the Entrez Programming Utilities (eUtils).
        (Accessed February 8, 2011.)
        • Spam
        Merriam-Webster Online Web site.
        (Accessed February 8, 2011.)