Commit 99de1bbf authored by Jason Rhinelander

series/quantiles: source-based quantile support

Old behaviour:
==============

creativity-series previously removed non-finite values, and sorted the
rest, producing something like:

t,1st,2nd,3rd,...
2,10,11,12
3,3

(where t=0 and t=1 have no finite values, t=3 has only one finite value,
etc.).  This discards any association between the values and the files they
come from, however, which makes any sort of source-based confidence exclusion
impossible.

creativity-series-quantiles then used these pre-sorted values to produce
another .csv of quantile values.

creativity-series-graphs accepted either one: if given a series file, it
calculated quantiles on the fly; if given a quantiles file, it used the
pre-calculated quantiles.

New behaviour:
==============

creativity-series now just generates something like:

t,source1,source2,source3
0,nan,nan,nan
1,nan,nan,nan
2,12,10,11
3,nan,3,nan

which creativity-series-quantiles and creativity-series-graphs now
understand: they do the nan trimming and sorting themselves, then calculate
quantiles.
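
As an illustration, a minimal sketch of that shared per-period step (the real
programs use data::quantile; the helper name and the linear interpolation here
are assumptions, not the actual implementation):

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <iterator>
    #include <limits>
    #include <vector>

    // One row of per-source values -> sample quantile: drop the nan
    // placeholders, sort what remains, and interpolate.
    double row_quantile(const std::vector<double> &row, double q) {
        std::vector<double> finite;
        std::copy_if(row.begin(), row.end(), std::back_inserter(finite),
                     [](double v) { return std::isfinite(v); });
        if (finite.empty()) return std::numeric_limits<double>::quiet_NaN();
        std::sort(finite.begin(), finite.end());
        double pos = q * (finite.size() - 1);
        std::size_t lo = static_cast<std::size_t>(pos);
        std::size_t hi = std::min(lo + 1, finite.size() - 1);
        return finite[lo] + (pos - lo) * (finite[hi] - finite[lo]);
    }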

creativity-series-graphs also gains an entirely new ability, via the
--source-confidence flag (but only when given a creativity-series file): it
selects a confidence region by excluding the most non-median (1-x)% of source
files, then plots the minimum and maximum of the values that remain.  This is
done by calculating (p-0.5)^2 for the inverse quantile p in each time period,
then summing these across time periods: the source files with the largest
scores are excluded.  (See --help for the gory details.)
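
A rough sketch of that scoring, assuming values[t][s] holds source s's value
at period t (NaN where missing); the helper name and the simplified handling
of non-finite values are assumptions rather than the actual
creativity-series-graphs code:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // score[s] = sum over t of (p_t - 0.5)^2; the largest scores get excluded.
    std::vector<double> source_scores(const std::vector<std::vector<double>> &values) {
        const std::size_t sources = values.empty() ? 0 : values.front().size();
        std::vector<double> score(sources, 0.0);
        for (const auto &row : values) {
            for (std::size_t s = 0; s < sources; s++) {
                double v = row[s], p = 0.0; // non-finite observations count as most extreme
                if (std::isfinite(v)) {
                    std::size_t below = 0, ties = 0, finite = 0;
                    for (double o : row) {
                        if (!std::isfinite(o)) continue;
                        finite++;
                        if (o < v) below++;
                        else if (o == v) ties++;
                    }
                    if (finite > 1) {
                        // among duplicates, take the rank closest to the median
                        // (e.g. for 1 1 2 3 4 both 1s get p = 0.25, not 0)
                        double plo = double(below) / (finite - 1);
                        double phi = double(below + ties - 1) / (finite - 1);
                        p = std::abs(plo - 0.5) < std::abs(phi - 0.5) ? plo : phi;
                    } else {
                        p = 0.5;
                    }
                }
                score[s] += (p - 0.5) * (p - 0.5);
            }
        }
        return score;
    }
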
parent 58cbdbe5
......@@ -36,6 +36,8 @@ void SeriesGraphs::addOptions() {
("legend-position,L", value(legend_position_input), "The general position of the graph legend, one of 'none', 'inside', 'right', 'left', 'top', 'bottom' where 'none' suppresses the legend, 'inside' puts it inside the graph, and the rest put it outside on the indicated side of the graph.")
("legend-x", range<0,1>(legend_rel_x), "Relative x position (for --legend-position={inside,top,bottom}): 0 is left-most, 1 is right-most.")
("legend-y", range<0,1>(legend_rel_y), "Relative y position (for --legend-position={inside,left,right}): 0 is top-most, 1 is bottom-most.")
("time-confidence", "If given, calculate confidence intervals per-time period.")
("source-confidence", "If given, calculate confidence intervals by excluding the most extreme source files. See --help for details.")
;
po::options_description input_desc("Input files");
......@@ -61,6 +63,13 @@ void SeriesGraphs::postParse(boost::program_options::variables_map &vars) {
throw po::invalid_option_value("--t-max value must be larger than --t-min value");
if (std::isfinite(graph_min_value) and std::isfinite(graph_max_value) and graph_max_value <= graph_min_value)
throw po::invalid_option_value("--y-max value must be larger than --y-min value");
if (vars.count("time-confidence")) {
if (vars.count("source-confidence")) throw po::invalid_option_value("--time-confidence and --source-confidence are mutually exclusive");
per_t_confidence = true;
}
if (vars.count("source-confidence")) {
per_t_confidence = false;
}
if (vars.count("same-t-scale") or (graph_min_t >= 0 and graph_max_t >= 0))
same_horizontal_scale = true;
if (vars.count("different-y-scale")) {
......@@ -82,12 +91,53 @@ std::string SeriesGraphs::usage() const {
}
std::string SeriesGraphs::help() const {
return CmdArgs::help() + "Input files:\n" +
" FILE [FILE ...] One or more series input files from which to\n" +
" calculate quantiles. At least one must\n" +
" be given.\n\n" +
"This program takes files as produced by the creativity-series program as inputs\n" +
"and retains the existing sample quantiles for each period in the series files.\n\n";
return CmdArgs::help() + u8R"HELP(Input files:
FILE [FILE ...] One or more series or quantile input
files from which to graph series. At
least one must be given.
This program takes files as produced by the creativity-series or
creativity-series-quantiles programs as inputs and retains the existing sample
quantiles for each period in the series files.
The plot behaviour has two modes. The default mode, --time-confidence, and the
only mode available when using quantiles input files, plots the requested
sample quantile as recorded in the quantiles file (which must already contain
it) or as calculated from the finite values in the series file.  For example,
for period t=20, a 95% confidence interval results in the confidence band at
t=20 throwing away the top and bottom 2.5% of the simulation values observed
in period t=20.
The other mode, --source-confidence, available only when using series files,
applies the requested confidence interval(s) to the source files. In this mode,
source file data is included only for the 95% (or whatever level) of source
files that have the "smallest" quantiles.  Specifically, each file is assigned a
score of:
∑ (pₜ - 0.5)²
t
where pₜ is the source file's inverse sample quantile at time t (that is, 0 for
the smallest observation, 1 for the largest, and 0.5 for the median
observation). Note that when a period t has NaN or infinite values, these
values are calculated with pₜ=0, and the smallest and largest finite values for
period t receive values of pₜ = #/2 and 1-#/2, where # is the number of
non-finite values in period t. In the case of duplicate values, the pₜ value is
the most favourable value (for example, for the data 1 1 2 3 4, both
observations with value 1 have a pₜ value of 0.25, not 0).
Once the score has been calculated across all observations for all data files,
the x% of data files with the highest score are excluded; the minimum and
maximum of the remaining (finite) values then form the confidence interval end
points.
Note that this procedure is guaranteed to produce confidence intervals at least
as wide as the --time-confidence intervals, and almost always wider. Also note
that, when drawing a median, the actual line is the path of the series which
received the lowest aggregate score, *not* the median of each time period's
values.
)HELP";
}
std::string SeriesGraphs::versionSuffix() const {
......
......@@ -29,6 +29,9 @@ class SeriesGraphs : public CmdArgs {
*/
std::string levels = "median,0.5,0.9,0.95";
/// If true, we're in per-t confidence mode. If false, we're in per-source file confidence mode.
bool per_t_confidence = true;
/** The input files (series or quantiles) to plot. Plot axes scales will be identical
* across files, so typically this should be called with series files for the same (or at
* least directly comparable) variables.
......
......@@ -4,8 +4,7 @@
namespace creativity { namespace data {
const std::regex
quantile_field_regex("m(?:in|ax|edian)|q\\d*[1-9]"),
ordinal_regex("(?:\\d*[02-9])?(?:1st|2nd|3rd)|\\d*1[123]th|\\d*[04-9]th");
quantile_field_regex("m(?:in|ax|edian)|q\\d*[1-9]");
std::string quantile_field(double quantile) {
if (quantile == 0) return "min";
......
......@@ -24,8 +24,5 @@ std::string quantile_field(double quantile);
*/
extern const std::regex quantile_field_regex;
/** A std::regex object that matches ordinal values as produced by creativity-series. */
extern const std::regex ordinal_regex;
}}
......@@ -24,19 +24,21 @@ using namespace eris;
// When outputting CSV values, we want exact precision, but don't necessarily need the full
// max_digits10 digits to get there. For example, 0.1 with max_digits10 becomes 0.10000000000000001
// but 0.1 also converts to the numerically identical value. 0.10000000000000002, on the other
// hand, needs every decimal digit of precision.
// hand, needs every decimal digit of precision, because it isn't equal to the double value 0.1
//
// This function tries the requested precision less two, then the requested precision less
// one; if either produces a value numerically identical to the given double value, the
// shortest such string is returned; otherwise the full requested precision is used below.
std::string double_str(double d, unsigned precision) {
double round_trip;
for (unsigned prec : {precision-2, precision-1}) {
std::stringstream ss;
ss.precision(prec);
ss << d;
ss >> round_trip;
if (round_trip == d) { ss << d; return ss.str(); }
if (std::isfinite(d)) {
double round_trip;
for (unsigned prec : {precision-2, precision-1}) {
std::stringstream ss;
ss.precision(prec);
ss << d;
ss >> round_trip;
if (round_trip == d) { ss << d; return ss.str(); }
}
}
std::stringstream ss;
ss.precision(precision);
......
......@@ -105,30 +105,22 @@ int main(int argc, char *argv[]) {
std::ostringstream output_data;
output_data << output_header;
try {
if (parser.fields().size() == 0) throw std::invalid_argument("no data fields found (not even t)");
if (parser.fields()[0] != "t") throw std::invalid_argument("first field != t");
for (size_t i = 1; i < parser.fields().size(); i++) {
if (not std::regex_match(parser.fields()[i], ordinal_regex))
throw std::invalid_argument("field " + std::to_string(i) + " (" + parser.fields()[i] + ") is not an ordinal value");
}
// Require the t value, but nothing else is strictly required:
parser.allow_missing_values = parser.header().size() - 1;
for (auto &row : parser) {
if (row.size() == 1) continue; // If we have just the time index but no data at all for that index skip it.
if (parser.fields().size() < 2) throw std::invalid_argument("file contains no data (only a t column was found)");
// Make sure the data is sorted:
for (int i = 2; i < row.size(); i++) {
if (row[i] < row[i-1]) throw std::invalid_argument("file contains unsorted data (line "
+ std::to_string(parser.lineNumber()) + ", columns " + std::to_string(i-1) + "--" + std::to_string(i) + ")");
}
for (const auto &row : parser) {
// Extract finite values (ignore any NaNs or infinities)
std::vector<double> finite_values;
std::copy_if(row.begin()+1, row.end(), std::back_inserter(finite_values), [](const double &v) { return std::isfinite(v); });
if (finite_values.empty()) continue; // Completely skip time periods with zero finite values (often initial rows, possibly others)
std::sort(finite_values.begin(), finite_values.end());
std::ostringstream output_row;
output_row << row[0];
output_data << row[0];
for (auto q : quantiles) {
output_row << "," << double_str(data::quantile(row.tail(row.size()-1), q), args.double_precision);
output_data << "," << double_str(data::quantile(finite_values.begin(), finite_values.end(), q), args.double_precision);
}
output_data << output_row.str() << "\n";
output_data << "\n";
}
}
catch (const std::invalid_argument &e) {
......
......@@ -26,15 +26,20 @@ using namespace creativity;
using namespace creativity::state;
using namespace eris;
using Locker = std::lock_guard<std::mutex>;
// The series we want to calculate, as given to --series
std::unordered_set<std::string> series_wanted;
// Results: "books_written" -> array of periods -> array of values
std::unordered_map<std::string, std::vector<std::multiset<double>>> values;
unsigned int values_count, error_count;
std::mutex values_mutex; // Guards values, output_count, error_count
decltype(cmdargs::Series::input.cbegin()) input_it, input_it_end;
std::mutex input_it_mutex;
// Values container. values["field_name"][t][fileindex] = value
std::unordered_map<std::string, std::vector<std::vector<double>>> values;
// File list, in same order as [fileindex] above:
std::vector<std::string> files;
size_t error_count = 0;
std::mutex values_mutex; // Guards values, files, error_count
std::mutex input_mutex; // Guards all of the below:
size_t input_index;
// These have to agree across files:
eris_time_t periods = 0, piracy_begins = 0, public_begins = 0;
bool allow_unused_periods = false; // Will only be true if --periods is explicitly given
......@@ -47,39 +52,28 @@ void thr_parse_file(
const cmdargs::Series &args,
const std::vector<data::datum> &data) {
input_it_mutex.lock();
while ((error_count == 0 or args.ignore_errors) and input_it != input_it_end) {
std::string source(*input_it++);
std::unique_lock<std::mutex> input_lock(input_mutex);
size_t input_i;
while ((error_count == 0 or args.ignore_errors) and (input_i = input_index++) < args.input.size()) {
// Hold the lock if this is the first file and we weren't given all of the needed simulation
// period values: we have to set periods, piracy_begins, public_begins (the lock also
// protects assigning to these variables).
if (not need_parameters) input_it_mutex.unlock();
// protects assigning to these variables) before unlocking; this essentially means that only
// one thread runs until the first thread has determined that initial info.
if (not need_parameters) input_lock.unlock();
values_mutex.lock();
std::cout << "Processing " << source << "\n";
values_mutex.unlock();
const auto &source = args.input[input_i];
#define FAIL(WHY) \
values_mutex.lock(); \
std::cerr << "Error reading `" << source << "': " << WHY << "\n"; \
error_count++; \
values_mutex.unlock(); \
continue;
{ Locker l(values_mutex); std::cout << "Processing " << source << "\n"; }
#define FAIL(WHY) { Locker l(values_mutex); std::cerr << "Error reading `" << source << "': " << WHY << "\n"; error_count++; continue; }
std::ostringstream output;
output.precision(args.double_precision);
auto creativity = Creativity::create();
// Filename input
try {
creativity->fileRead(source);
}
catch (std::ios_base::failure&) {
FAIL("I/O error: " << std::strerror(errno));
}
catch (std::exception &e) {
FAIL("An exception occured: " << e.what());
}
try { creativity->fileRead(source); }
catch (std::ios_base::failure&) FAIL("I/O error: " << std::strerror(errno))
catch (std::exception &e) FAIL("An exception occured: " << e.what())
auto locked_storage = creativity->storage();
auto &storage = *locked_storage.first;
......@@ -105,7 +99,7 @@ void thr_parse_file(
std::cout << "Inferred --public-sharing-begins " << public_begins << "\n";
}
need_parameters = false;
input_it_mutex.unlock();
input_lock.unlock();
}
// If we need more than is available, we can't use this file.
......@@ -138,23 +132,26 @@ void thr_parse_file(
}
// Copy values we read into the overall values container
values_mutex.lock();
for (const auto &v : local_values) {
auto &vstore = values[v.first];
if (vstore.size() < v.second.size()) vstore.resize(v.second.size());
for (unsigned t = 0; t < v.second.size(); t++) {
// Only copy finite values (this mainly excluded NaNs as nothing should be capable
// of generating infinity).
double value = v.second[t];
if (std::isfinite(value)) vstore[t].insert(value);
{
Locker locker(values_mutex);
auto fileindex = files.size();
files.push_back(source);
// values["field_name"][t][fileindex] = value
for (const auto &v : local_values) {
auto &vstore = values[v.first];
if (vstore.size() < v.second.size()) vstore.resize(v.second.size());
for (unsigned t = 0; t < v.second.size(); t++) {
vstore[t].reserve(args.input.size());
if (vstore[t].size() <= fileindex) vstore[t].resize(fileindex+1, std::numeric_limits<double>::quiet_NaN());
vstore[t][fileindex] = v.second[t];
}
}
}
values_count++;
values_mutex.unlock();
input_it_mutex.lock();
input_lock.lock();
}
input_it_mutex.unlock();
input_lock.unlock();
}
int main(int argc, char *argv[]) {
......@@ -223,14 +220,25 @@ int main(int argc, char *argv[]) {
exit(1);
}
input_it = args.input.cbegin();
input_it_end = args.input.cend();
if (input_it == input_it_end) {
if (args.input.empty()) {
std::cerr << "\nError: no simulation input files specified\n\n";
exit(1);
exit(50);
}
// Check for files specified more than once, and abort if found
{
std::unordered_set<std::string> seen;
for (const auto &source : args.input) {
auto ins = seen.insert(source);
if (not ins.second) {
std::cerr << "\nError: simulation input files contains duplicate entry `" << source << "'; aborting\n\n";
exit(51);
}
}
}
files.reserve(args.input.size());
if (args.threads == 0) {
thr_parse_file(args, data);
}
......@@ -257,32 +265,40 @@ int main(int argc, char *argv[]) {
}
}
if (values_count == 0) {
if (files.empty()) {
std::cerr << "Error: no usable data sources!\n";
exit(1);
}
std::cout << "Successfully processed " << values_count << "/" << (values_count+error_count) << " simulation files.\n\n";
std::cout << "Successfully processed " << files.size() << "/" << (files.size()+error_count) << " simulation files.\n\n";
for (auto &v : values) {
// Figure out how many "nth" headers we need for this series:
unsigned n = 0;
for (auto &time : v.second) {
if (time.size() > n) n = time.size();
std::string header;
{
// Maps actual source names into CSV-compatible source names by removing potentially problematic
// characters, and appending a unique number to resolve any duplicates.
std::unordered_set<std::string> source_used;
std::ostringstream headeross;
headeross << "t";
for (const auto &source : files) {
std::string fixed = data::csv_fix(source);
std::string try_s = fixed;
int append = 2;
while (not source_used.insert(try_s).second) {
try_s = fixed + "-" + std::to_string(append++);
}
headeross << "," << try_s;
}
headeross << "\n";
header = headeross.str();
}
for (auto &v : values) {
// Write an output file
std::string output_file = args.output_dir + "/series-" + v.first + ".csv";
std::ofstream out(output_file, std::ios::out | std::ios::trunc);
out << "t";
for (unsigned i = 1; i <= n; i++) {
out << "," << i << (
(i % 10 == 1 and not i % 100 == 11) ? "st" :
(i % 10 == 2 and not i % 100 == 12) ? "nd" :
(i % 10 == 3 and not i % 100 == 13) ? "rd" :
"th");
}
out << "\n";
out.exceptions(out.failbit | out.badbit);
out << header;
for (eris_time_t t = 0; t < v.second.size(); t++) {
out << t;
for (const auto &val : v.second[t]) {
......