Clean Up Those Rotten File Shares With the Wonder Tool You Already Own

We hear a common challenge every day: I have too much stuff on my file shares! These information sources are crucial but start to resemble a hoarder’s garage after decades of use. Get control of file shares through some understanding of information governance, an information audit, and a thorough cleaning.

The Dangers of Too Many Files

Enterprises typically experience a core set of problems when their file shares have grown out of control:

  • Weak document control leading to compliance risks.
  • Productivity loss due to the inability to find important documents and the mistaken use of old or dated information.
  • Litigation risks due to unaccounted-for records or toxic documents.

Addressing these risks requires a general approach and the use of common tools.

Evaluate the Risk and Value Inherent in the Document Collection

The first step in a clean-up initiative is to recognize that the content of a file share isn’t homogeneous. Every document varies with respect to:

  • Its value to the organization.
  • The risk it poses to the organization.

High-risk documents consist mostly of records: documents that an organization must maintain for regulatory, historical, or continuity purposes (according to ISO 15489).

High-value documents are generally those that have many inputs or are involved with a variety of different processes. Examples include standard process documentation and working versions of budgets.

Treat documents that are neither high-risk nor high-value as kipple: low-value digital stuff that seems magically to reproduce itself.

This model provides a theoretical set of four quadrants: high-risk/high-value, high-risk/low-value, low-risk/high-value, and low-risk/low-value (kipple).

Determine How to Identify Records

High-risk information primarily consists of records. There is a fundamental difference between a record and a non-record. Records are those things that an organization has to maintain for a specific period to meet particular compliance or regulatory requirements. Ideally, a retention schedule created by legal counsel dictates the requirements of records management. Organizations that fail to distinguish between records and documents face considerable litigation and compliance risk and struggle to organize information. If there is no distinction between records and documents, IT has to treat everything as a record with infinite retention requirements and can’t delete anything!

Create a General Plan Based on the Risk-Value Quadrants

The general approach is to:

  • Maintain high-risk information in a records management system. Ideally, this system is a formal records management system or other system of record, but it could be an instance of SharePoint configured as a records center, the use of in-place records management in SharePoint, or it could even be a set of specific file shares with appropriate protections.
  • Provide controls and automation for low-risk/high-value information. Many organizations use tools like SharePoint to improve workflow and findability for these documents. There is some recognition in the organization that these documents are important, so manage them appropriately.
  • If possible, shred the kipple. Low-risk/low-value information needs to be removed or isolated to minimize its impact. Ideally, kipple is digitally shredded with a three-pass wipe in general accordance with NIST 800-88r1, CSE ITSP.40.006, or equivalent. Unfortunately, some organizations and users really don’t want to delete anything. In these situations, take the least expensive approach to managing this information. Options include storing it on off-site tape, on personal storage with no IT SLAs for backup, on user OneDrive accounts, or on a low-cost filer.
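The plan above amounts to a simple decision rule. The following is a minimal sketch of that rule in Python, not a feature of any Microsoft tool; the `Document` class, the `plan_for` function, and the three label strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    name: str
    high_risk: bool   # assumption: True if the retention schedule treats it as a record
    high_value: bool  # assumption: True if the business owner flags it as important


def plan_for(doc: Document) -> str:
    """Return the handling suggested by the risk-value quadrants."""
    if doc.high_risk:
        # High-risk content goes to a system of record, whatever its value.
        return "records-management"
    if doc.high_value:
        # Low-risk/high-value content gets controls and automation.
        return "controls-and-automation"
    # Everything else is kipple.
    return "shred-or-isolate"
```

Risk is tested first on purpose: a record stays a record even when nobody finds it valuable.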

Use FCI to Identify High-Risk Information

The most important task in a file share cleanup is identifying records and other high-risk information. Fortunately, most Info-Tech clients possess a little-known tool that greatly simplifies this process. The File Classification Infrastructure (FCI) is a component of File Server Resource Manager (FSRM) on Windows Server. It gives administrators the ability to apply metadata to files either manually or programmatically. Furthermore, Microsoft offers a Solution Accelerator called the Data Classification Toolkit as well as specific accelerators for PCI-DSS and NIST SP 800-53. The Data Classification Toolkit creates classification properties that are commonly used in records management schemes, for example:

  • Information Privacy (Personally Identifiable Information, Protected Health Information)
  • Information Security (Confidentiality, Required Clearance)
  • Legal (Compliance, Discoverability, Immutable, Intellectual Property)
  • Records management (Retention, Retention start date)
  • Organizational (Impact, Department, Project, Personal Use)

In practice, an IT administrator could use FSRM and FCI in the following way:

  • Use FSRM to create storage and usage reports for specific folders in the file share. In particular, the administrator wants to identify folders that are infrequently accessed, contain excessively large documents, or contain document types that aren't consistent with acceptable use policies (e.g. MP3, MPEG, RAR, etc.).
  • Use FCI to identify potentially high-risk information. FCI will scan files and folders for strings of characters that are consistent with records. For example, the administrator can create regular expressions to identify potential credit card numbers, social security/insurance numbers, health policy information, contract numbers, etc. FCI could then tag these documents for review.
  • Work with business owners to identify high-value and high-risk information. Ultimately, the information belongs to the business unit. An administrator can sit down with the information steward or owner – ideally with the FSRM and FCI reports and the retention schedule – and manually tag folders or files with the appropriate metadata to identify high-risk information.
  • Protect and preserve high-risk information. Identified information can be protected by IT via in-place mechanisms such as Microsoft's Dynamic Access Control (DAC) or by moving it to an appropriate records repository such as SharePoint. Once tagged in Windows Server, SharePoint will inherit these descriptors when administrators migrate the content.
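To make the pattern-matching step concrete, here is a minimal sketch in Python of the kind of logic a content rule expresses. It is not FCI itself and does not use any FSRM API. The two patterns, the tag names, and the `review_tags` function are assumptions for illustration, and real rules would need tuning against the organization's own content. The Luhn checksum is added to cut down on false positives from arbitrary digit strings.

```python
import re

# Hypothetical patterns: 13 to 19 digits with optional space or hyphen
# separators, and a US-style social security number.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def review_tags(text: str) -> set:
    """Return the review tags suggested by the text of one document."""
    tags = set()
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_ok(digits):
            tags.add("possible-card-number")
    if SSN_RE.search(text):
        tags.add("possible-ssn")
    return tags
```

A match only flags a document for review with the business owner. It does not prove that the document is a record.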

It is incredibly important to be rigorous in identifying high-risk information. Once it is isolated and protected, the enterprise is free to address the low-risk information in any way it wants. There is no regulation, for example, preventing administrators from shredding everything low risk. This course of action, however, will be incredibly unpopular with business users!

Build a Taxonomy for High-Value Information

Value is a considerably more subjective consideration than risk. Ideally, high-value non-records have appropriate descriptors or tags to facilitate findability. Enterprise search is different from web-based search: filtering mechanisms are much more important. A user looking for an “operations budget”, for example, will not be satisfied with a 10-year-old budget for a different part of the business on the far side of the world.

Managing high-value information requires the application of appropriate descriptors controlled by a standard enterprise taxonomy. Many IT professionals severely underestimate the rigor of taxonomy design. It’s not complicated, but the process should follow the basic rules of ANSI/NISO Z39.19-2005 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies and focus on creating business-unit-specific terms for a few core facets:

  • Time or period (e.g. FY2019, Q3, End-of-Semester)
  • Location (e.g. San Francisco, Meeting Room 43b, Workstation 21-4)
  • Business process (e.g. vendor management, accounts payable, facilities management)
  • Document business genre (e.g. proposals, receipts, budgets)
  • Terms of general topical relevance (e.g. Cana2 Foundations, sharks, tanto blades)

With appropriate terms, a user looking for something like the recent proposal for a foundations project can perform a general search and then apply appropriate filters (proposals, FY2019, San Francisco, Cana2 Foundations) to find all relevant proposals for 2019 related to Cana2 Foundations in San Francisco. Info-Tech’s approach for developing this kind of taxonomy is described in the following blueprint: Move Away From File Shares and Organize Enterprise Information.
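The filtering step in that search can be sketched in a few lines of Python. This is only an illustration of faceted filtering, not how SharePoint or any search engine implements it. The facet keys (`genre`, `period`, `location`, `topic`), the file names, and the `filter_by_facets` function are hypothetical.

```python
def filter_by_facets(documents, **wanted):
    """Keep the documents whose metadata matches every requested facet."""
    return [
        doc for doc in documents
        if all(doc.get(facet) == term for facet, term in wanted.items())
    ]


# Hypothetical search results, each tagged with terms from the taxonomy.
RESULTS = [
    {"name": "cana2-proposal.docx", "genre": "proposals", "period": "FY2019",
     "location": "San Francisco", "topic": "Cana2 Foundations"},
    {"name": "cana2-receipt.pdf", "genre": "receipts", "period": "FY2019",
     "location": "San Francisco", "topic": "Cana2 Foundations"},
    {"name": "old-proposal.docx", "genre": "proposals", "period": "FY2009",
     "location": "San Francisco", "topic": "Cana2 Foundations"},
]

matches = filter_by_facets(
    RESULTS,
    genre="proposals",
    period="FY2019",
    location="San Francisco",
    topic="Cana2 Foundations",
)
```

The filters only work if every document carries a term for each facet, which is why the controlled vocabulary matters more than the search box.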

Use FCI to Apply the Taxonomy to High-Value Information

It is difficult to apply taxonomy descriptors programmatically via FCI. For example, proposals generally don’t contain recognizable character strings that are amenable to regular expressions. Instead, the IT administrator will have to work with the business steward to ensure that metadata is applied effectively. This task, however, is made easier by the tendency to group similar documents together. Extending our example, administrators might just find a folder exclusively containing proposals from FY2019 or all Cana2 Foundations documentation. FCI enables the administrators to apply metadata at the folder level, which is then inherited by the underlying documents.
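The inheritance idea can be illustrated with a minimal Python sketch. It models the concept only and makes no claim about how FCI stores or resolves properties. The folder paths, tag names, and the `effective_tags` function are hypothetical, and the rule that a deeper folder overrides a shallower one is an assumption made for the example.

```python
from pathlib import PurePosixPath


def effective_tags(file_path, folder_tags):
    """Merge the tags of every ancestor folder of a file.

    folder_tags maps a folder path to a dict of metadata. Folders closer
    to the file are applied last, so their terms win (an assumption).
    """
    path = PurePosixPath(file_path)
    merged = {}
    # path.parents lists the nearest folder first; reverse it so the
    # walk runs from the top of the share down to the file's own folder.
    for folder in reversed(path.parents):
        merged.update(folder_tags.get(str(folder), {}))
    return merged


# Hypothetical folder-level metadata agreed with the business steward.
FOLDER_TAGS = {
    "shares/ops/proposals": {"genre": "proposals"},
    "shares/ops/proposals/FY2019": {"period": "FY2019"},
}
```

Tagging two folders here is enough to describe every file beneath them, which is what makes the manual work with the business steward tractable.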

Execute Controls and/or Migration

The final step is to use the metadata to manage the information through the appropriate system of record, through SharePoint, or by identifying kipple for destruction.

Prepare for the Exceptions

There will inevitably be documents that are difficult to assess and classify. For example, there will be documents that don’t have any of the characteristics of records or high-value information but have some inherent trait that makes them potentially valuable. According to the tenets of Prospect Theory, humans are very bad at making these kinds of judgement calls. One strategy is to place these Documents of Ambiguous Value (DoAV) into a kind of escrow account. For example, move the documents to a protected file share. To restore these files, users must file a request via the helpdesk. Documents in this escrow account could have an appropriate retention period from the date of movement (e.g. two or five years). As the retention period expires, IT shreds the documents.
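The escrow arithmetic is straightforward, as this minimal Python sketch shows. It assumes the move date of each document is recorded somewhere. The `shred_date` and `due_for_shredding` functions and the two-year default are hypothetical choices for the example.

```python
from datetime import date


def shred_date(moved_on: date, retention_years: int) -> date:
    """Return the date on which an escrowed document may be shredded."""
    target_year = moved_on.year + retention_years
    try:
        return moved_on.replace(year=target_year)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use 28 February.
        return moved_on.replace(year=target_year, day=28)


def due_for_shredding(escrow, today, retention_years=2):
    """List the escrowed documents whose retention period has expired.

    escrow maps a document name to the date it was moved into escrow.
    """
    return sorted(
        name for name, moved_on in escrow.items()
        if shred_date(moved_on, retention_years) <= today
    )
```

Counting from the date of movement rather than the file's last-modified date keeps the rule simple and gives every user the same window to ask for a document back.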

Apply the Same Approach to SharePoint

Cleaning up SharePoint, particularly old SharePoint, is a different kind of challenge. The same methodology is generally applicable to SharePoint, but there are a few challenges when it comes to identifying high-risk information. FCI capability isn't available for SharePoint. Modern deployments can leverage the DLP capability of Office 365 to create rules and identify various types of potentially high-risk information. Older versions have more limited capabilities:

  • In SharePoint 2010 and 2013 it's possible to use SharePoint Search or Windows Search Server to identify potentially high-risk information. Microsoft's search syntax – KQL – offers only limited support for regular expressions. Instead, administrators must rely on content enrichment to extract relevant data.
  • SharePoint Foundation presents another set of problems. Standard SharePoint audit logs are only available via PowerShell. These logs will provide some information on the historical usage of SharePoint content but provide little guidance on what is actually contained within the documents. Adopt the search strategy articulated above or evaluate a CodePlex tool designed for older deployments of SharePoint services and MOSS sites.

Bottom Line

Clean up those file shares. Use a structured process by rating documents based on their inherent level of risk and value. Use Microsoft’s File Classification Infrastructure to facilitate the process.